Very slow performance with some markdown options #2730

wch · Feb 19, 2016

When converting some files from Markdown to HTML, performance can be very slow, depending on the markdown variant and options selected. The time grows exponentially, as shown in the graph below.

For this example, I have a very basic input -- it's just raw HTML with some JSON content embedded in a <script> tag. (We're using markdown as an input format because sometimes the HTML is intermingled with markdown. But in this example, the actual content is just HTML.)

index.html:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<script>{"x": "blah blah blah"}</script>
</body>
</html>

This is paired with a minimal template file:

$body$

And it's run through pandoc with:

pandoc index.html --from markdown_strict --output output.html --template template.html

The problem is that, with the content I have (the "blah blah blah" is replaced with a bunch of R code in a string), pandoc is extremely slow. Here's a graph of conversion time, with 50KB, 100KB, and 150KB of text in the <script></script> tags, with various flavors of markdown. Note the log y scale:

[Figure: conversion time in seconds (log scale) versus input size, for each markdown variant]

For markdown_strict, the time for 50KB is 0.37 seconds; for 100KB, it's 3.1 seconds, and for 150KB, it's 25.5 seconds. If the input is a megabyte in size, the conversion time with this exponential growth rate would be about 30,000,000,000,000,000 seconds. My actual data is over two megabytes, so there would be many more zeros on there. :)
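The extrapolation above can be sketched roughly as follows: fit a straight line to the logarithm of the three reported timings and extend it to a larger input. This is only an illustrative sketch. The helper names `fit_exponential` and `predict` are made up for this example, 1 MB is assumed to mean 1000 KB, and with only three data points the result is an order-of-magnitude guess at best and depends on how the growth rate is estimated.

```python
import math

# Timings for markdown_strict reported above: input size in KB and seconds.
sizes_kb = [50, 100, 150]
seconds = [0.37, 3.1, 25.5]


def fit_exponential(xs, ys):
    """Least-squares fit of log(y) = intercept + slope * x."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    covariance = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_log - slope * mean_x
    return intercept, slope


def predict(size_kb, intercept, slope):
    """Predicted conversion time in seconds under the fitted exponential model."""
    return math.exp(intercept + slope * size_kb)


intercept, slope = fit_exponential(sizes_kb, seconds)

# Factor by which the time grows for every additional 50 KB of input.
growth_per_50kb = math.exp(slope * 50)

# Assumption: 1 MB is taken as 1000 KB.
estimate_1mb = predict(1000, intercept, slope)

print(f"growth per 50 KB: {growth_per_50kb:.2f}x")
print(f"estimated time for 1 MB: {estimate_1mb:.3g} seconds")
```

Whatever the exact figure, a growth factor of roughly eight per 50 KB makes inputs in the megabyte range impractical.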

In the graph, I've also compared it to markdown and commonmark, which are much faster, as well as markdown-markdown_in_html_blocks and markdown+markdown_attribute, which are just as slow as markdown_strict. I would have expected the markdown-markdown_in_html_blocks and markdown+markdown_attribute options to be faster than markdown, but the opposite appears to be true.

The example input files are in https://github.com/wch/pandoc-hang, with a subdirectory for each input file size. For example, the 100KB input file is in:
https://github.com/wch/pandoc-hang/tree/master/simplified-100kb

I also tried changing the specific content in the <script> tags, and that makes a big difference in speed. In my use case, it's R code in a string, but when I replace it with just blank spaces, the conversion is fast for all of those settings. So there's something about that particular content that slows it down.
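For anyone who wants to try this without cloning the repository, a rough Python harness along these lines could generate an input of a given size and time pandoc on it. This is a sketch under assumptions: the filler line and the function names `make_input` and `time_pandoc` are hypothetical, and the filler is not the actual R code from the test files, so it is not guaranteed to trigger the same slowdown. The pandoc command line is the one quoted above, and pandoc is assumed to be on the PATH.

```python
import json
import subprocess
import time
from pathlib import Path

# Hypothetical filler: one line of R-like code that contains "<" characters.
FILLER = 'x <- c(1, 2); if (x[1] < x[2]) print(x)\n'


def make_input(size_bytes, filler=FILLER):
    """Return an HTML document whose <script> holds about size_bytes of text."""
    repeats = size_bytes // len(filler) + 1
    text = (filler * repeats)[:size_bytes]
    # json.dumps escapes quotes and newlines so the payload is valid JSON.
    payload = json.dumps({"x": text})
    return (
        '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8"/>\n</head>\n'
        "<body>\n<script>" + payload + "</script>\n</body>\n</html>\n"
    )


def time_pandoc(workdir, size_bytes, fmt="markdown_strict"):
    """Write the input and template into workdir, run pandoc, return seconds."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "index.html").write_text(make_input(size_bytes), encoding="utf-8")
    (workdir / "template.html").write_text("$body$\n", encoding="utf-8")
    start = time.perf_counter()
    subprocess.run(
        ["pandoc", "index.html", "--from", fmt,
         "--output", "output.html", "--template", "template.html"],
        cwd=workdir,
        check=True,
    )
    return time.perf_counter() - start
```

Calling `time_pandoc("pandoc-test", 50_000)` and then repeating with larger sizes or a different `fmt` would give numbers comparable in spirit to the ones in the graph. Passing a `filler` made of blank spaces to `make_input` is one way to check how much the content itself matters.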

jgm · Feb 19, 2016

+++ Winston Chang [Feb 19 16 09:37 ]:

> fast for all of those settings. So there's something about that particular content that slows it down.

Does it contain `<` characters?

wch · Feb 19, 2016

Yes, many of them.

jgm · Feb 20, 2016

Thanks for the excellent, detailed bug report. I think this commit fixes the problem (I tested on your files). But let me know if it doesn't.

wch · Feb 22, 2016

Great, thanks for the quick fix!

jgm added a commit that closed this issue Feb 20, 2016

jgm HTML reader: rewrote htmlInBalanced.
This version avoids an exponential performance problem with `<script>` tags, and it should be faster in general. Closes #2730.
1534052

jgm closed this in 1534052 Feb 20, 2016

jgm added a commit that referenced this issue Feb 21, 2016

jgm Markdown reader: use htmlInBalanced for rawVerbatimBlock.
This should give better performance. See #2730.
04d1e40

c-forster pushed a commit to c-forster/pandoc that referenced this issue Mar 4, 2016

jgm HTML reader: rewrote htmlInBalanced.
This version avoids an exponential performance problem with `<script>` tags, and it should be faster in general. Closes #2730.
a20895f

c-forster pushed a commit to c-forster/pandoc that referenced this issue Mar 4, 2016

jgm Markdown reader: use htmlInBalanced for rawVerbatimBlock.
This should give better performance. See #2730.
b3d55c3
