Pandoc 1.14+ changes to rst/ReST headings break Jupyter/nbconvert #2394
The gist is that nbconvert passes snippets of markdown one at a time to pandoc, so pandoc may get
# header 1
some text
in one call, then
### header 3
some time later. What nbconvert needs from pandoc is consistent heading markers for a given input heading level. Is there a way to get this from pandoc ≥ 1.14?
@minrk I think we need to normalize in some way, for valid reST.
However, we might be able to do something like this: before normalizing, detect the top-most header level. Then, after normalizing, increment all the header levels in the normalized document so that the top-most header level matches what was there before.
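A minimal sketch of this proposal, in Python rather than pandoc's Haskell, working on a bare list of header levels. The `normalize` function here is an assumed stand-in for the writer's normalization (it collapses levels into a gap-free hierarchy starting at 1); it is not pandoc's actual code, and both function names are hypothetical.

```python
def normalize(levels):
    """Assumed model of normalization: gap-free hierarchy starting at 1."""
    stack = []   # original levels of the currently open sections
    result = []
    for level in levels:
        # Close every open section at the same or a deeper level.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        result.append(len(stack))  # depth in the hierarchy is the new level
    return result


def normalize_keep_base(levels):
    """Normalize, then shift so the top-most level matches the input's."""
    if not levels:
        return []
    base = min(levels)  # top-most header level before normalizing
    return [n + base - 1 for n in normalize(levels)]


print(normalize_keep_base([4, 3, 5, 3]))
```

Under this model a fragment whose only header is level 3 stays at level 3 instead of being promoted to level 1.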
@lierdakil - would that still address the issue that motivated this change?
So, okay, it won't be possible to get "consistent" heading levels with rst output anyway, because there is nothing "consistent" in how docutils deals with headings. That said, incrementing the "base" heading level won't really change much in terms of standalone rst export, so this will work at least somewhat satisfactorily, I think.
[RST Writer] Don't normalize heading levels below input minimum #2405
So, I quickly slapped together some code, #2405, implementing @jgm's proposal. It might help with this issue, depending on how nbconvert handles this, but I have some, let's say, concerns. If header levels in the input markdown are consistent, it should be all right. If not, however, there might be some unexpected formatting issues instead of the 'Title level inconsistent' error message every docutils user knows and loves.
@minrk I've merged @lierdakil's patch. If you're able to build the latest pandoc master from source and test with nbconvert, that would be helpful.
@jgm tested, and this does fix it in most real cases. The one case where it still does the wrong thing (for us) is if we pass an h3 before an h2 in a single snippet. This is perfectly reasonable for us, since we are passing pandoc snippets of a document, not a whole document. So h3 before h2 makes perfect sense, since there may be an h2 earlier in the document.
We would prefer a 'snippet-mode', where pandoc would not assume that it is being given a complete document that must be fixed, and just consistently transform h3 to the same markup no matter what. An explicit flag to disable normalization altogether would also work. We just want to tell pandoc to let us worry about rst heading consistency.
The fragments are coming from IPython notebooks, which are snippets of markdown interleaved with code and output. When transforming the whole document to other formats (e.g. rst), we use pandoc to transform the markdown cells one at a time, and take care of transforming code and output ourselves (which may include more snippets passed to pandoc).
<markdown cell>
# header 1
some prose
<code cell>
some code
some output
<markdown cell>
## header 2
some prose
# header level 1 again
<code cell>
So each markdown cell will be passed to pandoc one at a time. Everything works out fine, as long as pandoc does not assume that the second markdown cell is a complete document and rewrite the heading formatting.
Okay. That's a much larger undertaking, so it will probably not happen any time soon. I think we'll pin to old pandoc for the foreseeable future, if a --no-normalize-rst-headings flag is not in the cards.
@lierdakil - I wonder if the normalization process could be tweaked a bit?
Instead of normalizing
Header 4
Header 3
Header 5
Header 3
to
Header 3
Header 3
Header 4
Header 3
could we normalize it to
Header 4
Header 3
Header 4
Header 3
That is: allow the hierarchy to begin with a header level that is less than the lowest header level?
To be more explicit, one way to do this would be to simply start normalizing at the first occurrence of a base-level header (in this case level 3). A slightly more aggressive version would do some normalization on the initial part, e.g. ensuring that there are not gaps of >1 level between headers.
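The simpler variant (start normalizing at the first base-level header) could look roughly like this in Python. It operates on a plain list of header levels; the function name and the stack-based normalization are assumptions for illustration, not pandoc's implementation.

```python
def normalize_from_base(levels):
    """Leave headers before the first base-level header untouched,
    then normalize the rest relative to the base level."""
    if not levels:
        return []
    base = min(levels)           # lowest header level in the fragment
    first = levels.index(base)   # first occurrence of a base-level header
    head = levels[:first]        # kept exactly as written
    stack = []
    tail = []
    for level in levels[first:]:
        # Close every open section at the same or a deeper level.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        tail.append(len(stack) + base - 1)
    return head + tail


print(normalize_from_base([4, 3, 5, 3]))
```

The more aggressive version, which would also close gaps of more than one level in the untouched initial part, is not attempted here.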
Note also that there is no fixed mapping of header underline styles to header levels in reST! (The first one encountered is treated as h1, the second as h2, etc.) So I don't see how we could really have a mode that deals with your fragments in a way that is consistent with each other, unless we arbitrarily established some particular mapping from underline styles to header levels. This mapping would not correspond to anything in the reST spec, and it would produce incorrect results for anyone who was using a different mapping.
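To make "arbitrarily established mapping" concrete, here is a small Python sketch of what a fixed level-to-underline table would look like. The particular characters are an arbitrary choice for illustration; nothing here is taken from the reST spec or claimed to be what pandoc emits.

```python
# Hypothetical fixed mapping from header level to underline character.
UNDERLINES = {1: "=", 2: "-", 3: "~", 4: "^", 5: "'", 6: '"'}


def rst_heading(title, level):
    """Render a heading with the same underline style for a given level,
    regardless of which other headings appear in the fragment."""
    char = UNDERLINES[level]
    return f"{title}\n{char * len(title)}"


print(rst_heading("Header2", 2))
```

Fragments rendered this way agree with each other only as long as every fragment in the final document goes through the same table.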
I'm not concerned about the lack of reST standardization on headings, because pandoc is writing all of the headings. If it's internally consistent, it's no problem. I think the main point for our use case is to tell pandoc that the first header doesn't need any normalization. Pandoc proceeding as it likes for the rest of the snippet should be fine.
I think only the Markdown-to-rst direction is a problem in this case (the notebook has only markdown and code cells).
This is IMO a good example of what nbconvert does (txt1 and txt2 are cells in a notebook):
λ cat txt1
# Header1
## Header2
λ cat txt2
## Header2
# Header1
λ cat txt1 | pandoc -f markdown -t rst
Header1
=======
Header2
-------
λ cat txt2 | pandoc -f markdown -t rst
Header2
=======
Header1
=======
On the nbconvert side, a way to prevent it would be to refactor the process to convert each cell to markdown and then convert the whole document to rst. This is similar to how RMarkdown works.
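A rough Python sketch of that refactoring, assuming cells are simple `(kind, source)` pairs and that code cells are emitted as indented markdown code blocks. Both are simplifications for illustration and not nbconvert's real data model; cell output is ignored entirely.

```python
import subprocess


def cells_to_markdown(cells):
    """Join all cells into one markdown document."""
    parts = []
    for kind, source in cells:
        if kind == "markdown":
            parts.append(source.strip("\n"))
        else:
            # Assumption: code becomes a 4-space indented code block.
            indented = "\n".join("    " + line for line in source.splitlines())
            parts.append(indented)
    return "\n\n".join(parts) + "\n"


def markdown_to_rst(text):
    """Single pandoc call for the whole document (requires pandoc on PATH)."""
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "rst"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout


cells = [
    ("markdown", "# header 1\nsome prose"),
    ("code", "some code"),
    ("markdown", "## header 2"),
]
document = cells_to_markdown(cells)
```

Because pandoc then sees every heading in one call, it can pick heading markers for the document as a whole rather than per cell.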
I've pushed a change that avoids doing any normalization until you hit an N-level header, where N is the minimum header level in the fragment. This gives better results for fragments that look like this:
### level 3
## level 2
More testing by Jupyter/nbconvert people will be helpful.
The new approach is just to skip normalization entirely for fragments, doing it only if --standalone/-s is used.
I think this is simpler and more robust. Please let me know if it doesn't work for your application.
(Tag: more discussion needed)
Jupyter's nbconvert (formerly IPython) uses Pandoc to convert to .rst files from their syntax including markdown headers. Before Pandoc 1.14, markdown headers in either the "===="/"----"/"###" format or the "#"/"##"/"###" format were translated properly to various levels in .rst. From 1.14 onwards all notebook headers become top-level heading 1 headers in nbconvert. I believe that this change in the 1.14 release notes changed the behavior:
Can Nikolay or someone please take the time to explain the change so that the Jupyter folks can make a fix or use an alternate template of some sort (or could pandoc add an option to use the older behavior?). I'm new to contributing to either project (both of which I love and admire) so I'm not sure how to help with the situation except to bring it to the attention of both projects.
The relevant nbconvert issue is at jupyter/nbconvert#97
Thank you!