Conversion of W3C documents: Support of section as href targets #2438

Closed
koppor opened this Issue · 5 comments

3 participants

@koppor

I'm trying to convert W3C documents such as http://www.w3.org/TR/DOM-Parsing/ to LaTeX.

Calling

pandoc http://www.w3.org/TR/DOM-Parsing/ --standalone -o test.tex

produces a LaTeX file that does not work as expected: pdflatex test fails.

I know that this report is not an MCVE, but maybe someone can take over and track down the issue more precisely. Otherwise, just close the issue.

In the following, I list the issues I identified.

section as target

The document uses <section> tags to structure its content.
The hrefs in the TOC jump to these sections.
Pandoc does not convert this properly:

\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{h2ux5fconformance}

spurious includegraphics

Line 78 shows an \includegraphics in \href:

\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}

LaTeX Error: Too deeply nested.

The quick fix suggested on Stack Overflow doesn't help.

@rgaiacs

Thanks for the report.

section as target

Consider

$ cat section.html                                                
<section id="toc">
  <h2 class="introductory" id="h2_toc" role="heading" aria-level="1">Table of Contents</h2>
  <ul class="toc" id="respecContents" role="directory">
    <li class="tocline"><a class="tocxref" href="#conformance"><span class="secno">1. </span>Conformance</a></li>
  </ul>
</section>
<section id="conformance" typeof="bibo:Chapter" rel="bibo:Chapter" resource="#conformance"><!--OddPage-->
  <h2 id="h2_conformance" role="heading" aria-level="1"><span class="secno">1. </span>Conformance</h2>
  <p>
    As well as sections marked as non-normative, all authoring guidelines, diagrams, examples,
    and notes in this specification are non-normative. Everything else in this specification is
    normative.
  </p>
</section>
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{h2ux5ftoc}

\begin{itemize}
\tightlist
\item
  \hyperref[conformance]{{1. }Conformance}
\end{itemize}

\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{h2ux5fconformance}

As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.
$ pandoc -f html -t latex --standalone -o section.tex section.html
$ latexmk -pdf section.tex

The file compiles without error, but the hyperref link does not work because the output has \label{h2ux5fconformance} instead of \label{conformance}. You can work around this with

$ sed -i 's/id="h2_/id="/g' section.html 
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{toc}

\begin{itemize}
\tightlist
\item
  \hyperref[conformance]{{1. }Conformance}
\end{itemize}

\hyperdef{}{conformance}{\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{conformance}}

As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.

There is no bug in Pandoc. The issue is related to conventions used in the document.

spurious includegraphics

Consider

$ cat w3c.html 
<a href="http://www.w3.org/"><img width="72" height="48" alt="W3C" src="https://www.w3.org/Icons/w3c_home"></a>

Pandoc converts it to LaTeX, and the output is what you reported:

$ pandoc -f html -t latex w3c.html        
\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}

We can try to compile it:

$ pandoc -f html -t latex --standalone -o w3c.tex  w3c.html
$ latexmk -pdf w3c.tex

and we will get

! LaTeX Error: File `https://www.w3.org/Icons/w3c_home' not found.

LaTeX can't fetch images over HTTP. Let's solve this problem with

$ wget https://www.w3.org/Icons/w3c_home -O w3c_home.png
$ sed -i 's/https:\/\/www.w3.org\/Icons\/w3c_home/w3c_home.png/' w3c.tex
$ latexmk -pdf w3c.tex

And we have the PDF. =)

You can avoid this problem from the beginning by mirroring the page locally with

$ wget --convert-links --mirror YOUR-URL

In this case there is no bug in Pandoc.

@jgm
Owner

One legitimate worry is that the id on the enclosing section is just dropped and doesn't produce a corresponding label or hyperdef in the latex. So links to that target don't work.

Note that if you had <div id="foo"> instead of <section>, pandoc would parse this as a native pandoc Div, and this would then produce a label and hyperdef in the latex.

So, a workaround would be to use sed to convert section tags to divs in your html before passing to pandoc.
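A minimal sketch of the sed workaround described above, assuming a sed that supports -E (GNU or BSD) and that each <section> tag sits on a single line. The file names section.html and section-div.html are hypothetical, and the sample input is a shortened version of the snippet earlier in this thread:

```shell
# Hypothetical sample input; section.html stands in for the downloaded page.
cat > section.html <<'EOF'
<section id="conformance" typeof="bibo:Chapter">
  <h2 id="h2_conformance">1. Conformance</h2>
</section>
EOF

# Rewrite opening and closing <section> tags as <div>, keeping all attributes.
# The ([ >]) group ensures only the tag name "section" itself is matched.
sed -E -e 's/<section([ >])/<div\1/g' \
       -e 's#</section>#</div>#g' section.html > section-div.html

cat section-div.html
```

The result can then be passed to pandoc as in the earlier commands, e.g. pandoc -f html -t latex section-div.html. This is a plain text substitution, not an HTML parse, so it would also rewrite the string <section> inside comments or code samples.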

And a suggested improvement to pandoc would be to parse section elements as pandoc Divs.

The problem is that when the output format is HTML-related (e.g. epub), this change would result in <section> elements turning to <div>. We could fix that by parsing <section> as a Div with a special "section" class, and then changing the HTML renderer to render such Divs using <section>.

@koppor

Regarding "spurious includegraphics": are you implying that I can't simply use pandoc to convert from HTTP URLs, but have to download the page myself? On the one hand, is this really OK from a user's perspective? On the other hand, I see that there are other target formats where images can be included from an HTTP source. I guess one doesn't want to add switches to pandoc that check whether the target format can handle remote files or not. Thinking further, shouldn't it be possible to specify: "Please make all content available offline"?

Regarding "section as target": It feels strange that I, as a user, have to read and understand my (working and valid) HTML document and modify it to get pandoc running.

@jgm
Owner

@koppor what version of pandoc are you using?

When you produce a tex file using pandoc, pandoc's sole output is the tex file. It does not, for example, download images and put them in the working directory. (This is a general principle: we don't create files other than the ones you explicitly specify.) However, when you produce pdf, docx, epub, or odt, pandoc will attempt to fetch remote images.

Section as target: yes, I agree. Hence the suggestion I made above for a change to pandoc.

@koppor

pandoc 1.15.0.6 Compiled with texmath 0.8.2.2, highlighting-kate 0.6. on Windows 10.

@koppor koppor referenced this issue in wkhtmltopdf/wkhtmltopdf
Open

Hyperlinks do not work at W3C documents #2618

@jgm jgm added a commit that closed this issue
@jgm HTML reader/writer: better handling of "section" elements.
Previously `<section>` tags were just parsed as raw HTML
blocks.  With this change, section elements are parsed as
Div elements with the class "section".  The HTML writer will
use `<section>` tags to render these Divs in HTML5; otherwise
they will be rendered as `<div class="section">`.

Closes #2438.
0e78eba
@jgm jgm closed this in 0e78eba