
Pandoc 1.14+ changes to rst/ReST headings break Jupyter/nbconvert #2394

Closed
mscuthbert opened this Issue · 27 comments

6 participants

@mscuthbert

(Tag: more discussion needed)

Jupyter's nbconvert (formerly part of IPython) uses Pandoc to convert notebook content, including markdown headers, to .rst files. Before Pandoc 1.14, markdown headers in either the "===="/"----"/"###" format or the "#"/"##"/"###" format were translated properly to the corresponding levels in .rst. From 1.14 onwards, all notebook headers become top-level (heading 1) headers in nbconvert output. I believe this entry in the 1.14 release notes describes the change responsible:

Normalize headings to sequential levels (Nikolay Yakimov). This is pretty much required by docutils.

Can Nikolay or someone please take the time to explain the change so that the Jupyter folks can make a fix or use an alternate template of some sort (or could pandoc add an option to use the older behavior?). I'm new to contributing to either project (both of which I love and admire) so I'm not sure how to help with the situation except to bring it to the attention of both projects.

The relevant nbconvert issue is at jupyter/nbconvert#97

Thank you!

@mpickering
Collaborator
@minrk

The gist is that nbconvert passes snippets of markdown one at a time to pandoc, so pandoc may get

# header 1
some text

in one call, then

### header 3

some time later. What nbconvert needs from pandoc is consistent heading markers for a given input heading level. Is there a way to get this from pandoc ≥ 1.14?

@jgm
Owner
@minrk

@jgm would it be possible to add an option to tell pandoc to consistently render h1 -> ====, etc. rather than normalizing?

@jgm
Owner

@minrk I think we need to normalize in some way, for valid reST.
However, we might be able to do something like this: before normalizing, detect the top-most header level. Then, after normalizing, increment all the header levels in the normalized document so that the top-most header level matches what was there before.
@lierdakil - would that still address the issue that motivated this change?
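
As a rough illustration of the proposal above, here is a Python sketch that models header levels as plain integers. It is not pandoc's actual (Haskell) code: the stack-based `normalize` function is an assumption about what "sequential levels" means, chosen because it reproduces the 4, 3, 5, 3 to 3, 3, 4, 3 example given later in this thread. Both function names are hypothetical.

```python
def normalize(levels, base=1):
    """Renumber header levels so they are sequential, starting at `base`.

    Assumed model of the 1.14 behaviour, not pandoc's real implementation.
    """
    stack = []  # original levels of the sections that are currently open
    out = []
    for level in levels:
        # Close every open section that is at the same depth or deeper.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        out.append(base + len(stack) - 1)
    return out


def normalize_keep_top(levels):
    """Normalize, then shift so the top-most level matches the input's."""
    if not levels:
        return []
    return normalize(levels, base=min(levels))


print(normalize([3]))                    # [1]  a lone h3 snippet becomes h1
print(normalize_keep_top([3]))           # [3]  the proposal keeps it at h3
print(normalize_keep_top([4, 3, 5, 3]))  # [3, 3, 4, 3]
```

Under this model a snippet such as `[3, 2]` still comes out as `[2, 2]`, because the leading h3 is pulled up to the top-most level.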

@lierdakil

Sorry, missed notifications. Let me check.

@lierdakil

So, okay, it won't be possible to get "consistent" heading levels with rst output anyway, because there is nothing "consistent" in how docutils deals with headings. That said, incrementing the "base" heading level won't really change much in terms of standalone rst export, so this will work at least somewhat satisfactorily, I think.

@lierdakil

So, I quickly slapped together some code, #2405, implementing @jgm's proposal. It might help with this issue, depending on how nbconvert handles this, but I have some, let's say, concerns. If header levels in the input markdown are consistent, it should be all right. If not, however, there might be some unexpected formatting issues instead of the 'Title level inconsistent' error message that every docutils user knows and loves.

@jgm
Owner

@minrk I've merged @lierdakil's patch. If you're able to build the latest pandoc master from source and test with nbconvert, that would be helpful.

@minrk

@jgm on it, thanks!

@minrk

@jgm tested, and this does fix it in most real cases. The one case where it still does the wrong thing (for us) is if we pass an h3 before an h2 in a single snippet. This is perfectly reasonable for us, since we are passing pandoc snippets of a document, not a whole document. So h3 before h2 makes perfect sense, since there may be an h2 earlier in the document.

We would prefer a 'snippet-mode', where pandoc would not assume that it is being given a complete document that must be fixed, and just consistently transform h3 to the same markup no matter what. An explicit flag to disable normalization altogether would also work. We just want to tell pandoc to let us worry about rst heading consistency.

@jgm
Owner
@minrk

The fragments are coming from IPython notebooks, which are snippets of markdown interleaved with code and output. When transforming the whole document to other formats (e.g. rst), we use pandoc to transform the markdown cells one at a time, and take care of transforming code and output ourselves (which may include more snippets passed to pandoc).

<markdown cell>
# header 1
some prose

<code cell>
some code
some output

<markdown cell>
## header 2
some prose
# header level 1 again

<code cell>

So each markdown cell will be passed to pandoc one at a time. Everything works out fine, as long as pandoc does not assume that the second markdown cell is a complete document and rewrite the heading formatting.

@jgm
Owner
@minrk

Okay. That's a much larger undertaking, so it will probably not happen any time soon. I think we'll pin to old pandoc for the foreseeable future, if a --no-normalize-rst-headings flag is not in the cards.

@jgm
Owner

@lierdakil - I wonder if the normalization process could be tweaked a bit?
Instead of normalizing

Header 4
Header 3
Header 5
Header 3

to

Header 3
Header 3
Header 4
Header 3

could we normalize it to

Header 4
Header 3
Header 4
Header 3

That is: allow the hierarchy to begin with a header that is deeper than the base (top-most) header level?

@jgm
Owner

To be more explicit, one way to do this would be to simply start normalizing at the first occurrence of a base-level header (in this case level 3). A slightly more aggressive version would do some normalization on the initial part, e.g. ensuring that there are not gaps of >1 level between headers.
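
A minimal Python sketch of the simpler variant described above, again treating headers as integer levels. The function name and the stack-based renumbering are assumptions made for illustration; the actual RST writer is written in Haskell and may differ in detail.

```python
def normalize_from_base(levels):
    """Leave headers untouched until the first base-level header appears.

    "Base level" is the minimum (top-most) level in the fragment.
    Hypothetical helper, not pandoc's real code.
    """
    if not levels:
        return []
    base = min(levels)
    start = levels.index(base)   # first occurrence of the base level
    out = list(levels[:start])   # leading headers pass through unchanged
    stack = []                   # original levels of currently open sections
    for level in levels[start:]:
        # Close every open section that is at the same depth or deeper.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        out.append(base + len(stack) - 1)
    return out


print(normalize_from_base([4, 3, 5, 3]))  # [4, 3, 4, 3]
print(normalize_from_base([3, 2]))        # [3, 2]
```

The more aggressive variant, which would also close gaps of more than one level in the leading part, is not modelled here.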

@jgm
Owner

Note also that there is no fixed mapping of header underline styles to header levels in reST! (The first one encountered is treated as h1, the second as h2, etc.) So I don't see how we could really have a mode that deals with your fragments in a way that is consistent with each other, unless we arbitrarily established some particular mapping from underline styles to header levels. This mapping would not correspond to anything in the reST spec, and it would produce incorrect results for anyone who was using a different mapping.
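
To make the point about underline styles concrete, the sketch below assigns levels purely by order of first appearance, which is the rule described above. It is a simplification: it looks only at the underline character, ignores overlined titles, and does not reproduce docutils' "Title level inconsistent" error. The function name is hypothetical.

```python
def levels_from_adornments(adornments):
    """Assign a header level to each underline style by first appearance."""
    seen = []    # underline characters in the order they were first used
    levels = []
    for style in adornments:
        if style not in seen:
            seen.append(style)
        levels.append(seen.index(style) + 1)
    return levels


print(levels_from_adornments(["=", "-", "="]))  # [1, 2, 1]
print(levels_from_adornments(["-", "="]))       # [1, 2]  here "-" is the h1 style
```

The same character can therefore mean different levels in different documents, which is why fragments only fit together if whoever writes them sticks to a single mapping.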

@minrk

@jgm would you accept something that directly specified the format for each heading level?

@jgm
Owner
@jbarnoud jbarnoud referenced this issue in pierrepo/PBxplore
Merged

[WIP] Add a sphinx API documentation #82

@minrk

I'm not concerned about the lack of reST standardization on headings, because pandoc is writing all of the headings. If it's internally consistent, it's no problem. I think the main point for our use case is to tell pandoc that the first header doesn't need any normalization. Pandoc proceeding as it likes for the rest of the snippet should be fine.

@jgm
Owner
@janschulz

I think only the markdown-to-rst direction is a problem in this case (the notebook has only markdown and code cells).

This is IMO a good example of what nbconvert does (txt1 and txt2 are cells in a notebook):

λ cat txt1
# Header1
## Header2
λ cat txt2
## Header2
# Header1
λ cat txt1 | pandoc -f markdown -t rst
Header1
=======

Header2
-------

λ cat txt2 | pandoc -f markdown -t rst
Header2
=======

Header1
=======

On the nbconvert side, a way to prevent it would be to refactor the process to convert each cell to markdown and then convert the whole document to rst. This is similar to how RMarkdown works.
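
A sketch of what that refactoring could look like, under stated assumptions: cells are plain dicts with `cell_type` and `source` keys, `source` is a single string, and code is emitted as an indented markdown code block. The function name is hypothetical, outputs are ignored, and this is not nbconvert's actual API.

```python
def notebook_to_markdown(cells):
    """Join all cells into one markdown document (illustrative only)."""
    parts = []
    for cell in cells:
        if cell["cell_type"] == "markdown":
            parts.append(cell["source"])
        elif cell["cell_type"] == "code":
            # Four-space indentation marks a code block in markdown.
            lines = cell["source"].splitlines()
            parts.append("\n".join("    " + line for line in lines))
    return "\n\n".join(parts) + "\n"


cells = [
    {"cell_type": "markdown", "source": "# Header1\n## Header2"},
    {"cell_type": "code", "source": "print(1)"},
    {"cell_type": "markdown", "source": "## Header2\n# Header1"},
]
print(notebook_to_markdown(cells))
```

The resulting string would then go through pandoc in a single call, so pandoc sees every header in context rather than one cell at a time.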

@jgm
Owner
@jgm jgm added a commit that closed this issue
@jgm RST writer: tweaks to header normalization.
These changes are intended to make the writer more
useful to people who are processing small fragments,
which may for example look like this:

    ### third level header from previous section

    ## second level header

Previously such fragments got turned into two
headers of the same level.  The new algorithm
avoids doing any normalization until we hit the
minimal-level header in the fragment (here, the
second level header).

Closes #2394.
476b383
@jgm jgm closed this in 476b383
@jgm
Owner

I've pushed a change that avoids doing any normalization until you hit an N-level header, where N is the minimum header level in the fragment. This gives better results for fragments that look like this:

### level 3

## level 2

More testing by Jupyter/nbconvert people will be helpful.

@jgm jgm added a commit that referenced this issue
@jgm RST writer: do header normalization only in "standalone" mode.
If we're producing a fragment, just skip normalization.
After all, the fragment might be somewhere in the middle
of the document.  It's more important for fragments to
have consistency in rendering (so they can be pieced
together) than to normalize.

This closes #2394.  It's simpler and more robust than
my earlier fix.
24f6865
@jgm
Owner

New approach is just to skip normalization entirely for fragments, doing it only if --standalone/-s is used.
I think this is simpler and more robust. Please let me know if it doesn't work for your application.
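
For callers such as nbconvert, the practical consequence is whether `--standalone` is passed. The Python sketch below builds the pandoc command line either way; the helper names are hypothetical, and it assumes a `pandoc` executable that includes this change is on the `PATH`.

```python
import subprocess


def pandoc_rst_command(standalone=False):
    """Build the argument list for a markdown-to-rst conversion."""
    cmd = ["pandoc", "-f", "markdown", "-t", "rst"]
    if standalone:
        # Whole-document mode: header normalization applies.
        cmd.append("--standalone")
    return cmd


def markdown_to_rst(markdown, standalone=False):
    """Run pandoc on a markdown string and return the rst output."""
    result = subprocess.run(
        pandoc_rst_command(standalone),
        input=markdown,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout


# Fragment mode (the default) is what a per-cell converter would use:
# rst = markdown_to_rst("### header 3\n\nsome text\n")
```

The call is left commented out because its output depends on the installed pandoc version.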

@minrk

@jgm that's perfect, thanks. I'll do more testing when I get a chance.
