Pandoc 1.14+ changes to rst/ReST headings break Jupyter/nbconvert #2394

mscuthbert · Sep 12, 2015

(Tag: more discussion needed)

Jupyter's nbconvert (formerly part of IPython) uses Pandoc to convert notebooks, including their markdown headers, to .rst files. Before Pandoc 1.14, markdown headers in either the "===="/"----"/"###" format or the "#"/"##"/"###" format were translated properly to the corresponding levels in .rst. From 1.14 onwards, all notebook headers become top-level (heading 1) headers in nbconvert output. I believe the following entry in the 1.14 release notes is responsible:

Normalize headings to sequential levels (Nikolay Yakimov). This is pretty much required by docutils.

Can Nikolay or someone please take the time to explain the change so that the Jupyter folks can make a fix or use an alternate template of some sort (or could pandoc add an option to use the older behavior?). I'm new to contributing to either project (both of which I love and admire) so I'm not sure how to help with the situation except to bring it to the attention of both projects.

The relevant nbconvert issue is at jupyter/nbconvert#97

Thank you!

mpickering · Sep 12, 2015

@lierdakil

minrk · Sep 12, 2015

The gist is that nbconvert passes snippets of markdown one at a time to pandoc, so pandoc may get

# header 1
some text

in one call, then

### header 3

some time later. What nbconvert needs from pandoc is consistent heading markers for a given input heading level. Is there a way to get this from pandoc ≥ 1.14?

jgm · Sep 13, 2015

I see the problem, and I don't see a workaround. For background on the change, see the PR #2079 which links to a pandoc-discuss discussion. I'm open to discussion as to how to improve the situation.

minrk · Sep 13, 2015

@jgm would it be possible to add an option to tell pandoc to consistently render h1 -> ====, etc. rather than normalizing?

jgm · Sep 15, 2015

@minrk I think we need to normalize in some way, for valid reST.
However, we might be able to do something like this: before normalizing, detect the top-most header level. Then, after normalizing, increment all the header levels in the normalized document so that the top-most header level matches what was there before.
@lierdakil - would that still address the issue that motivated this change?
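A minimal sketch of the "normalize, then shift back" idea, working on a plain list of heading levels. The function names and the stack-based renumbering are assumptions made for illustration; this is a simplified model, not pandoc's actual Haskell implementation.

```python
def normalize_levels(levels):
    """Renumber heading levels so that no level is skipped.

    Simplified model: a heading's new level is its nesting depth
    among the headings that are still "open" above it.
    """
    stack = []   # original levels of the currently open ancestors
    result = []
    for level in levels:
        # Close every open section that is not a strict ancestor.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        result.append(len(stack))
    return result


def normalize_keeping_base(levels):
    """Normalize, then shift so the top-most level matches the input."""
    if not levels:
        return []
    offset = min(levels) - 1
    return [n + offset for n in normalize_levels(levels)]


print(normalize_levels([2, 4, 2, 3]))        # [1, 2, 1, 2]
print(normalize_keeping_base([2, 4, 2, 3]))  # [2, 3, 2, 3]
```

Under this model the shifted output still has no gaps between levels, while a fragment whose top-most heading is h2 keeps h2 as its top-most heading.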

lierdakil · Sep 19, 2015

Sorry, missed notifications. Let me check.

lierdakil · Sep 19, 2015

So, okay, it won't be possible to get "consistent" heading levels with rst output anyway, because there is nothing "consistent" in how docutils deals with headings. That said, incrementing the "base" heading level won't really change much in terms of standalone rst export, so this should work at least somewhat satisfactorily, I think.

lierdakil · Sep 19, 2015

So, I quickly slapped together some code, #2405, implementing @jgm's proposal. It might help with this issue, depending on how nbconvert handles this, but I have some, let's say, concerns. If header levels in input markdown are consistent, it should be all right. If not, however, there might be some unexpected formatting issues instead of 'Title level inconsistent' error message every docutils user knows and loves.

jgm · Sep 20, 2015

@minrk I've merged @lierdakil's patch. If you're able to build the latest pandoc master from source and test with nbconvert, that would be helpful.

minrk · Sep 21, 2015

@jgm on it, thanks!

minrk · Sep 21, 2015

@jgm tested, and this does fix it in most real cases. The one case where it still does the wrong thing (for us) is if we pass an h3 before an h2 in a single snippet. This is perfectly reasonable for us, since we are passing pandoc snippets of a document, not a whole document. So h3 before h2 makes perfect sense, since there may be an h2 earlier in the document.

We would prefer a 'snippet-mode', where pandoc would not assume that it is being given a complete document that must be fixed, and just consistently transform h3 to the same markup no matter what. An explicit flag to disable normalization altogether would also work. We just want to tell pandoc to let us worry about rst heading consistency.

jgm · Sep 21, 2015

It's not that we're assuming we have complete documents. We need to do normalization even for document fragments, to ensure that the output is valid reST. For example, the fragment

```
## A
#### B
## C
### D
```

would need to be normalized or we'll get the "Title level inconsistent" error from docutils. How do you chunk things into fragments? If you could ensure that a fragment contained no more than one header, that would fix things. But I don't know where the fragments are coming from.

minrk · Sep 21, 2015

The fragments are coming from IPython notebooks, which are snippets of markdown interleaved with code and output. When transforming the whole document to other formats (e.g. rst), we use pandoc to transform the markdown cells one at a time, and take care of transforming code and output ourselves (which may include more snippets passed to pandoc).

<markdown cell>
# header 1
some prose

<code cell>
some code
some output

<markdown cell>
## header 2
some prose
# header level 1 again

<code cell>

So each markdown cell will be passed to pandoc one at a time. Everything works out fine, as long as pandoc does not assume that the second markdown cell is a complete document and rewrite the heading formatting.
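The per-cell flow described above can be sketched roughly as follows. The `(kind, source)` cell tuples, the function names, and the plain `::` literal block used for code cells are hypothetical stand-ins, not nbconvert's real data model or templates; only the `pandoc -f markdown -t rst` invocation is taken from this thread.

```python
import subprocess


def pandoc_md_to_rst(markdown_text):
    """Convert one markdown snippet to reST by calling the pandoc CLI."""
    completed = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "rst"],
        input=markdown_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def convert_cells(cells, convert=pandoc_md_to_rst):
    """Convert markdown cells one at a time; render code cells ourselves."""
    parts = []
    for kind, source in cells:
        if kind == "markdown":
            # Each call sees only this cell, never the whole notebook.
            parts.append(convert(source).rstrip("\n"))
        else:
            indented = "\n".join("    " + line for line in source.splitlines())
            parts.append("::\n\n" + indented)
    return "\n\n".join(parts) + "\n"
```

Because every markdown cell goes through its own `convert` call, any per-call rewriting of heading levels is applied independently to each cell, which is where the inconsistency comes from.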

jgm · Sep 21, 2015

Thanks for the clarification. Since normalization is needed for valid reST output even for document fragments (as noted in my previous comment), I'd be reluctant to simply turn it off for fragments. One possible solution on your end would be to split the Markdown cells into pieces, each containing no more than one header. This could be done reliably by converting to pandoc native (or json), doing the splitting at that level, and then converting the chunks to reST (and, if needed, recombining them).
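A rough sketch of the splitting step suggested above, assuming the block list has already been extracted from `pandoc -t json` output and that each block is a dict whose `"t"` key names its type (for example `"Header"` or `"Para"`). The function name is hypothetical, and the document envelope around the block list, which differs between pandoc versions, is deliberately left out.

```python
def split_at_headers(blocks):
    """Split pandoc JSON blocks into chunks with at most one Header each."""
    chunks = []
    current = []
    for block in blocks:
        # A header that follows existing content starts a new chunk.
        if block.get("t") == "Header" and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then need to be wrapped back into a valid pandoc JSON document before being passed to `pandoc -f json -t rst`.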

minrk · Sep 22, 2015

Okay. That's a much larger undertaking, so it will probably not happen any time soon. I think we'll pin to old pandoc for the foreseeable future, if a --no-normalize-rst-headings flag is not in the cards.

jgm · Sep 22, 2015

@lierdakil - I wonder if the normalization process could be tweaked a bit?
Instead of normalizing

Header 4
Header 3
Header 5
Header 3

to

Header 3
Header 3
Header 4
Header 3

could we normalize it to

Header 4
Header 3
Header 4
Header 3

That is: allow the hierarchy to begin with a header level that is less than the lowest header level?

jgm · Sep 22, 2015

To be more explicit, one way to do this would be to simply start normalizing at the first occurrence of a base-level header (in this case level 3). A slightly more aggressive version would do some normalization on the initial part, e.g. ensuring that there are not gaps of >1 level between headers.
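As a sketch of the simpler variant (start normalizing only at the first base-level header), again on a plain list of heading levels. The function name and the stack-based renumbering are illustrative assumptions, not the code that ended up in pandoc.

```python
def normalize_from_first_base(levels):
    """Leave headings before the first base-level heading untouched,
    then renumber the rest relative to the base level."""
    if not levels:
        return []
    base = min(levels)
    start = levels.index(base)
    result = list(levels[:start])   # leading headings pass through as-is
    stack = []                      # original levels of open ancestors
    for level in levels[start:]:
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        result.append(base + len(stack) - 1)
    return result


print(normalize_from_first_base([4, 3, 5, 3]))  # [4, 3, 4, 3]
```

With this rule an h3 that precedes the first h2 in a fragment stays an h3, so a fragment that opens below its own minimum level is not flattened.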

jgm · Oct 10, 2015

Note also that there is no fixed mapping of header underline styles to header levels in reST! (The first one encountered is treated as h1, the second as h2, etc.) So I don't see how we could really have a mode that deals with your fragments in a way that is consistent with each other, unless we arbitrarily established some particular mapping from underline styles to header levels. This mapping would not correspond to anything in the reST spec, and it would produce incorrect results for anyone who was using a different mapping.

minrk · Oct 11, 2015

@jgm would you accept something that directly specified the format for each heading level?

jgm · Oct 11, 2015

I'm not really sure what you mean. As I said, establishing an arbitrary correlation between underline styles and heading levels would not correspond to anything in reST itself. It would be a fairly ad hoc kind of change motivated by a single application, and that's the sort of thing I'm inclined to resist.

minrk · Oct 12, 2015

I'm not concerned about the lack of reST standardization on headings, because pandoc is writing all of the headings. If it's internally consistent, it's no problem. I think the main point for our use case is to tell pandoc that the first header doesn't need any normalization. Pandoc proceeding as it likes for the rest of the snippet should be fine.

jgm · Oct 12, 2015

Yes, I understand how this would work for your use case. My concern is that the proposed option is hard to motivate outside of your use case. It imposes an arbitrary assignment of underline styles to headers that doesn't correspond to anything in the reST spec. And, incidentally, it's not about normalization. Even before the changes that introduced normalization, pandoc (following the reST spec) always treated the first-occurring underline style as the top-level header. So,

hi
--

there
=====

and

hi
==

there
-----

would *both* have been parsed as a level 1 header, then a level 2 header. This is going to be problematic for you if you're chopping up fragments in arbitrary ways.
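The "first underline style seen is level 1" rule can be illustrated with a small sketch. This is a deliberately simplified reader written for illustration: it handles underline-only titles and ignores overlines and the rest of the reST grammar, and the function name is hypothetical.

```python
import string


def rst_heading_levels(text):
    """Return (title, level) pairs, assigning levels to underline
    characters in order of first appearance."""
    lines = text.splitlines()
    seen = []        # underline characters, in order of first use
    headings = []
    for title, underline in zip(lines, lines[1:]):
        if not title.strip() or not underline:
            continue
        char = underline[0]
        is_rule = char in string.punctuation and set(underline) == {char}
        # An underline must be at least as long as the title text.
        if is_rule and len(underline) >= len(title.rstrip()):
            if char not in seen:
                seen.append(char)
            headings.append((title.strip(), seen.index(char) + 1))
    return headings


print(rst_heading_levels("hi\n--\n\nthere\n=====\n"))
print(rst_heading_levels("hi\n==\n\nthere\n-----\n"))
```

Both calls print `[('hi', 1), ('there', 2)]`: the underline characters are swapped between the two inputs, but the inferred levels are the same.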

janschulz · Oct 12, 2015

I think only the markdown-to-rst direction is a problem in this case (the notebook has only markdown and code cells).

This is IMO a good example of what nbconvert does (txt1 and txt2 are cells in a notebook):

λ cat txt1
# Header1
## Header2
λ cat txt2
## Header2
# Header1
λ cat txt1 | pandoc -f markdown -t rst
Header1
=======

Header2
-------

λ cat txt2 | pandoc -f markdown -t rst
Header2
=======

Header1
=======

On the nbconvert side, a way to prevent this would be to refactor the process to convert each cell to markdown and then convert the whole document to rst. This is similar to how RMarkdown works.
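A minimal sketch of that whole-document approach, assuming cells are simple `(kind, source)` tuples and that code cells are rendered as `~~~` fenced blocks. Both choices, and the function names, are hypothetical and not taken from nbconvert; `convert` stands for any callable that turns one markdown string into reST, for instance a wrapper around `pandoc -f markdown -t rst`.

```python
def notebook_to_markdown(cells):
    """Join all cells into one markdown document."""
    parts = []
    for kind, source in cells:
        if kind == "markdown":
            parts.append(source.strip("\n"))
        else:
            # Code cells become fenced code blocks in the combined document.
            parts.append("~~~\n" + source.strip("\n") + "\n~~~")
    return "\n\n".join(parts) + "\n"


def notebook_to_rst(cells, convert):
    """Run a single conversion over the combined document."""
    return convert(notebook_to_markdown(cells))
```

With a single conversion call, the converter sees every heading in the notebook at once and can assign underline styles consistently across cells.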

jgm · Oct 12, 2015

Thanks for this, and apologies: I was confused about what the issue was in my comments above. Now I see it (again).

jgm · Oct 13, 2015

I've pushed a change that avoids doing any normalization until you hit an N-level header, where N is the minimum header level in the fragment. This gives better results for fragments that look like this:

### level 3

## level 2

More testing by Jupyter/nbconvert people will be helpful.

jgm · Oct 13, 2015

New approach is just to skip normalization entirely for fragments, doing it only if --standalone/-s is used.
I think this is simpler and more robust. Please let me know if it doesn't work for your application.
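For callers, the practical difference is whether `--standalone` is on the command line. A small sketch of a wrapper that makes the choice explicit; the function names are hypothetical, and the behavior noted in the comments is as described in this comment, not verified here.

```python
import subprocess


def pandoc_rst_command(standalone=False):
    """Build the pandoc argv for markdown -> reST conversion."""
    command = ["pandoc", "-f", "markdown", "-t", "rst"]
    if standalone:
        # Per the comment above, heading normalization applies only here.
        command.append("--standalone")
    return command


def markdown_to_rst(text, standalone=False):
    """Run pandoc on one markdown string (requires pandoc on PATH)."""
    completed = subprocess.run(
        pandoc_rst_command(standalone),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

A tool that converts fragments one at a time would call `markdown_to_rst(cell)` without `standalone`, and reserve `standalone=True` for complete documents.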

minrk · Oct 13, 2015

@jgm that's perfect, thanks. I'll do more testing when I get a chance.

lierdakil referenced this issue Sep 19, 2015
Merged
[RST Writer] Don't normalize heading levels below input minimum #2405

jbarnoud referenced this issue in pierrepo/PBxplore Oct 11, 2015
Merged
[WIP] Add a sphinx API documentation #82

jgm closed this in 476b383 Oct 13, 2015
