Conversion of W3C documents: Support of section as href targets #2438

koppor · Oct 7, 2015

I'm trying to convert W3C documents such as http://www.w3.org/TR/DOM-Parsing/ to LaTeX.

The LaTeX file produced by calling

pandoc http://www.w3.org/TR/DOM-Parsing/ --standalone -o test.tex

does not work as expected: pdflatex test fails.

I know that this report is not an MCVE, but maybe someone can take over and track down the issue more precisely. Otherwise, just close the issue.

In the following, I list the issues I identified.

section as target

The document uses <section> tags to structure the document.
The hrefs in the TOC jump to these sections.
This is not properly converted in pandoc:

\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{h2ux5fconformance}

spurious includegraphics

Line 78 shows an \includegraphics in \href:

\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}

LaTeX Error: Too deeply nested.

The quick fix suggested on SO doesn't help.

rgaiacs · Oct 7, 2015

Thanks for the report.

section as target

Consider

$ cat section.html                                                
<section id="toc">
  <h2 class="introductory" id="h2_toc" role="heading" aria-level="1">Table of Contents</h2>
  <ul class="toc" id="respecContents" role="directory">
    <li class="tocline"><a class="tocxref" href="#conformance"><span class="secno">1. </span>Conformance</a></li>
  </ul>
</section>
<section id="conformance" typeof="bibo:Chapter" rel="bibo:Chapter" resource="#conformance"><!--OddPage-->
  <h2 id="h2_conformance" role="heading" aria-level="1"><span class="secno">1. </span>Conformance</h2>
  <p>
    As well as sections marked as non-normative, all authoring guidelines, diagrams, examples,
    and notes in this specification are non-normative. Everything else in this specification is
    normative.
  </p>
</section>
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{h2ux5ftoc}

\begin{itemize}
\tightlist
\item
  \hyperref[conformance]{{1. }Conformance}
\end{itemize}

\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{h2ux5fconformance}

As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.
$ pandoc -f html -t latex --standalone -o section.tex section.html
$ latexmk -pdf section.tex

No error when compiling. The hyperref link does not work because we have \label{h2ux5fconformance} instead of \label{conformance}. You can solve this problem with

$ sed -i 's/id="h2_/id="/g' section.html 
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{toc}

\begin{itemize}
\tightlist
\item
  \hyperref[conformance]{{1. }Conformance}
\end{itemize}

\hyperdef{}{conformance}{\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{conformance}}

As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.

There is no bug in Pandoc. The issue is related to conventions used in the document.

spurious includegraphics

Consider

$ cat w3c.html 
<a href="http://www.w3.org/"><img width="72" height="48" alt="W3C" src="https://www.w3.org/Icons/w3c_home"></a>

Pandoc converts it to LaTeX, and the output is what you reported:

$ pandoc -f html -t latex w3c.html        
\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}

We can try to compile it:

$ pandoc -f html -t latex --standalone -o w3c.tex  w3c.html
$ latexmk -pdf w3c.tex

and we will get

! LaTeX Error: File `https://www.w3.org/Icons/w3c_home' not found.

LaTeX can't get images over HTTP. Let's solve this problem with

$ wget https://www.w3.org/Icons/w3c_home -O w3c_home.png
$ sed -i 's/https:\/\/www.w3.org\/Icons\/w3c_home/w3c_home.png/' w3c.tex
$ latexmk -pdf w3c.tex

And we have the PDF. =)
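For a document with more than one remote image, the same download-and-replace step has to be repeated per image. As a rough sketch only, the helper below lists every http(s) URL passed to \includegraphics in a .tex file, so each one can be fetched with wget and rewritten with sed as above. The function name list_remote_images is hypothetical, and the pattern assumes each \includegraphics call sits on a single line.

```shell
# Hypothetical helper (not part of pandoc): print every remote image URL
# that appears as the target of \includegraphics in the given .tex file.
list_remote_images() {
  grep -oE '\\includegraphics(\[[^]]*\])?\{https?://[^}]+\}' "$1" |
    sed -E 's#^.*\{(https?://[^}]+)\}$#\1#'
}

# Demo on the one-line pandoc output shown above.
demo=$(mktemp)
printf '%s\n' '\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}' > "$demo"
list_remote_images "$demo"   # prints https://www.w3.org/Icons/w3c_home
rm -f "$demo"
```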

You can solve this problem from the beginning by using

$ wget --convert-links --mirror YOUR-URL

In this case there is no bug in Pandoc.

jgm · Oct 7, 2015

One legitimate worry is that the id on the enclosing section is just dropped and doesn't produce a corresponding label or hyperdef in the latex. So links to that target don't work.

Note that if you had <div id="foo"> instead of <section>, pandoc would parse this as a native pandoc Div, and this would then produce a label and hyperdef in the latex.

So, a workaround would be to use sed to convert section tags to divs in your html before passing to pandoc.
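A minimal sketch of that sed workaround follows. It assumes the section tags are written in lowercase; the function name section_to_div is hypothetical and only used for illustration.

```shell
# Hypothetical helper: rewrite <section ...> and </section> tags as <div> tags,
# leaving attributes such as id="conformance" untouched.
# The trailing group ensures only the element name "section" is matched.
section_to_div() {
  sed -E 's#<(/?)section([[:space:]>]|$)#<\1div\2#g'
}

# Demo on a reduced version of the HTML discussed above.
printf '%s\n' '<section id="conformance"><h2>Conformance</h2></section>' | section_to_div
# prints <div id="conformance"><h2>Conformance</h2></div>
```

The filtered HTML can then be piped into pandoc, for example section_to_div < section.html | pandoc -f html -t latex.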

And a suggested improvement to pandoc would be to parse section elements as pandoc Divs.

The problem is that when the output format is HTML-related (e.g. epub), this change would result in <section> elements turning to <div>. We could fix that by parsing <section> as a Div with a special "section" class, and then changing the HTML renderer to render such Divs using <section>.

koppor · Oct 7, 2015

Regarding "spurious includegraphics": are you implying that I can't simply use pandoc to convert from HTTP URLs, but have to download the page myself? On the one hand, is this really OK from a user's perspective? On the other hand, I see that there are other target formats where images can be included from an HTTP source, and I suppose one doesn't want pandoc to need switches that check whether the target format can handle remote files. Thinking further, shouldn't it be possible to specify: "Please make all content available offline"?

Regarding "section as target": It feels strange that I, as a user, have to read and understand my (working and valid) HTML document and modify it to get pandoc running.

jgm · Oct 7, 2015

@koppor what version of pandoc are you using?

When you produce a tex file using pandoc, pandoc's sole output is the tex file. It does not, for example, download images and put them in the working directory. (This is a general principle: we don't create files other than the ones you explicitly specify.) However, when you produce pdf, docx, epub, or odt, pandoc will attempt to fetch remote images.

Section as target: yes, I agree. Hence the suggestion I made above for a change to pandoc.

koppor · Oct 7, 2015

pandoc 1.15.0.6 Compiled with texmath 0.8.2.2, highlighting-kate 0.6. on Windows 10.

koppor referenced this issue in wkhtmltopdf/wkhtmltopdf Oct 9, 2015: Hyperlinks do not work at W3C documents #2618 (Open)

jgm closed this in 0e78eba Oct 11, 2015
