Conversion of W3C documents: Support of section as href targets #2438
Thanks for the report.
section as target
Consider
$ cat section.html
<section id="toc">
<h2 class="introductory" id="h2_toc" role="heading" aria-level="1">Table of Contents</h2>
<ul class="toc" id="respecContents" role="directory">
<li class="tocline"><a class="tocxref" href="#conformance"><span class="secno">1. </span>Conformance</a></li>
</ul>
</section>
<section id="conformance" typeof="bibo:Chapter" rel="bibo:Chapter" resource="#conformance"><!--OddPage-->
<h2 id="h2_conformance" role="heading" aria-level="1"><span class="secno">1. </span>Conformance</h2>
<p>
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples,
and notes in this specification are non-normative. Everything else in this specification is
normative.
</p>
</section>
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{h2ux5ftoc}
\begin{itemize}
\tightlist
\item
\hyperref[conformance]{{1. }Conformance}
\end{itemize}
\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{h2ux5fconformance}
As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.
$ pandoc -f html -t latex --standalone -o section.tex section.html
$ latexmk -pdf section.tex
No error when compiling. The hyperref link does not work because we have \label{h2ux5fconformance} instead of \label{conformance}. You can solve this problem with
$ sed -i 's/id="h2_/id="/g' section.html
$ pandoc -f html -t latex section.html
\subsection{Table of Contents}\label{toc}
\begin{itemize}
\tightlist
\item
\hyperref[conformance]{{1. }Conformance}
\end{itemize}
\hyperdef{}{conformance}{\subsection{\texorpdfstring{{1.
}Conformance}{1. Conformance}}\label{conformance}}
As well as sections marked as non-normative, all authoring guidelines,
diagrams, examples, and notes in this specification are non-normative.
Everything else in this specification is normative.
There is no bug in Pandoc. The issue is related to conventions used in the document.
spurious includegraphics
Consider
$ cat w3c.html
<a href="http://www.w3.org/"><img width="72" height="48" alt="W3C" src="https://www.w3.org/Icons/w3c_home"></a>
Pandoc converts it into LaTeX, and the output is what you reported:
$ pandoc -f html -t latex w3c.html
\href{http://www.w3.org/}{\includegraphics{https://www.w3.org/Icons/w3c_home}}
We can try to compile it:
$ pandoc -f html -t latex --standalone -o w3c.tex w3c.html
$ latexmk -pdf w3c.tex
and we will get
! LaTeX Error: File `https://www.w3.org/Icons/w3c_home' not found.
LaTeX can't fetch images over HTTP. Let's solve this problem with
$ wget https://www.w3.org/Icons/w3c_home -O w3c_home.png
$ sed -i 's/https:\/\/www.w3.org\/Icons\/w3c_home/w3c_home.png/' w3c.tex
$ latexmk -pdf w3c.tex
And we have the PDF. =)
You can solve this problem from the beginning using
$ wget --convert-links --mirror YOUR-URL
In this case there is no bug in Pandoc.
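For a document with more than one remote image, the manual steps above can be preceded by a listing step. The following is only a sketch: `list_remote_images` is a hypothetical helper name, and it assumes each `\includegraphics{...}` call sits on a single line of the generated .tex file.

```shell
# Hypothetical helper: print every http(s) URL that appears as the
# argument of \includegraphics in the given .tex file, one per line.
list_remote_images() {
  grep -oE '\\includegraphics(\[[^]]*\])?\{https?://[^}]+\}' "$1" |
    sed -E 's/^.*\{([^}]+)\}$/\1/'
}

# Intended use:
#   list_remote_images w3c.tex
# Each URL printed can then be fetched with wget and replaced with sed
# as shown above, choosing a local file name with a proper extension.
```

The helper only reports URLs. Downloading is left as a separate step because URLs such as the W3C icon carry no file extension, so the local name has to be chosen by hand.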
One legitimate worry is that the id on the enclosing section is just dropped and doesn't produce a corresponding label or hyperdef in the LaTeX, so links to that target don't work.
Note that if you had <div id="foo"> instead of <section>, pandoc would parse this as a native pandoc Div, and this would then produce a label and hyperdef in the LaTeX.
So, a workaround would be to use sed to convert section tags to divs in your html before passing to pandoc.
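A minimal sketch of that sed workaround follows. `section_to_div` is a hypothetical helper name, and the expression assumes that each `<section` tag name is immediately followed by a space or `>` on the same line, so a tag broken across lines right after the name would be missed.

```shell
# Hypothetical helper: rewrite <section ...> and </section> tags
# as <div ...> and </div>, leaving attributes such as id untouched.
section_to_div() {
  sed -E 's#<(/?)section([ >])#<\1div\2#g'
}

# Quick check on a one-line sample.
printf '<section id="conformance"><h2>Conformance</h2></section>\n' | section_to_div

# Intended use (requires pandoc):
#   section_to_div < section.html | pandoc -f html -t latex
```

Reading from standard input keeps the original HTML file unmodified, unlike `sed -i`.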
And a suggested improvement to pandoc would be to parse section elements as pandoc Divs.
The problem is that when the output format is HTML-related (e.g. epub), this change would result in <section> elements turning to <div>. We could fix that by parsing <section> as a Div with a special "section" class, and then changing the HTML renderer to render such Divs using <section>.
Reading "spurious includegraphics", you imply that I can't simply use pandoc to convert from HTTP URLs, but have to download the page myself? On the one hand, is this really OK from a user's perspective? On the other hand, I see that there are other target formats where images can be included from an HTTP source. I think one doesn't want to add switches to pandoc that check whether the target format can handle remote files or not. Thinking about it further, shouldn't it be possible to specify: "Please make all content available offline"?
Regarding "section as target": It feels strange that I, as a user, have to read and understand my (working and valid) HTML document and modify it to get pandoc running.
@koppor what version of pandoc are you using?
When you produce a tex file using pandoc, pandoc's sole output is the tex file. It does not, for example, download images and put them in the working directory. (This is a general principle: we don't create files other than the ones you explicitly specify.) However, when you produce pdf, docx, epub, or odt, pandoc will attempt to fetch remote images.
Section as target: yes, I agree. Hence the suggestion I made above for a change to pandoc.
pandoc 1.15.0.6 Compiled with texmath 0.8.2.2, highlighting-kate 0.6.
on Windows 10.
I'm trying to convert W3C documents such as http://www.w3.org/TR/DOM-Parsing/ to LaTeX.
The produced latex file when calling
is not working as expected:
pdflatex test
fails. I know that this report is not an MCVE, but maybe someone can take over and track down the issue more precisely. Otherwise, just close the issue.
In the following, I list the issues I identified.
section as target
The document uses <section> tags to structure the document. The hrefs of the TOC jump to these sections. This is not properly converted in pandoc:
spurious includegraphics
Line 78 shows an \includegraphics in \href:
LaTeX Error: Too deeply nested.
The quick fix suggested by so doesn't help.