Very slow performance with some markdown options #2730

Closed
wch opened this Issue · 4 comments

2 participants

@wch

When converting some files from Markdown to HTML, performance can be very slow, depending on the markdown variant and options selected. The time grows exponentially, as shown in the graph below.

For this example, I have a very basic input -- it's just raw HTML with some JSON content embedded in a <script> tag. (We're using markdown as an input format because sometimes the HTML is intermingled with markdown. But in this example, the actual content is just HTML.)

index.html:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<script>{"x": "blah blah blah"}</script>
</body>
</html>

This is paired with a minimal template file:

$body$

And it's run through pandoc with:

pandoc index.html --from markdown_strict --output output.html --template template.html

The problem is that, with the content I have (the "blah blah blah" is replaced with a bunch of R code in a string), pandoc is extremely slow. Here's a graph of time, with 50KB, 100KB, and 150KB of text in the <script></script> tags, with various flavors of markdown. Note the log y scale:

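[Graph: conversion time in seconds (log scale) against input size (50KB, 100KB, 150KB) for each markdown variant]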

For markdown_strict, the time for 50KB is 0.37 seconds; for 100KB, it's 3.1 seconds, and for 150KB, it's 25.5 seconds. If the input is a megabyte in size, the conversion time with this exponential growth rate would be about 30,000,000,000,000,000 seconds. My actual data is over two megabytes, so there would be many more zeros on there. :)
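As a rough sanity check on the extrapolation, here is a minimal Python sketch that fits an exponential to the three markdown_strict timings quoted above and projects it to a larger input. The function names are made up for illustration, and this is not necessarily how the estimate above was computed. The projected figure is sensitive to the fit and to whether a megabyte is taken as 1000 or 1024 KB, so it can differ from the quoted number by an order of magnitude; the conclusion is the same either way.

```python
import math

# Timings reported above for markdown_strict: (input size in KB, seconds).
MEASURED = [(50, 0.37), (100, 3.1), (150, 25.5)]


def fit_exponential(points):
    """Least-squares fit of log(seconds) = a + b * size_kb; returns (a, b)."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


def predict_seconds(size_kb, points=MEASURED):
    """Project the fitted exponential to an input of size_kb kilobytes."""
    a, b = fit_exponential(points)
    return math.exp(a + b * size_kb)


# Multiplicative slowdown for each additional 50 KB of input.
_, slope = fit_exponential(MEASURED)
growth_per_50kb = math.exp(50 * slope)
```

The roughly constant factor per added 50KB (`growth_per_50kb`) is what distinguishes exponential growth from polynomial growth here: a cubic, for example, would slow down by a smaller factor going from 100KB to 150KB than going from 50KB to 100KB.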

In the graph, I've also compared it to markdown and commonmark, which are much faster, as well as markdown-markdown_in_html_blocks and markdown+markdown_attribute, which are just as slow as markdown_strict. I would have expected the markdown-markdown_in_html_blocks and markdown+markdown_attribute options to be faster than markdown, but the opposite appears to be true.
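For anyone reproducing the comparison, a timing loop over these variants could be sketched in Python as follows. It assumes pandoc is on the PATH and that index.html and template.html are in the current directory. The helper names (`build_command`, `time_conversion`, `main`) are hypothetical, and this is not the script that produced the graph.

```python
import subprocess
import time

# The --from values compared in this report.
VARIANTS = [
    "markdown_strict",
    "markdown",
    "commonmark",
    "markdown-markdown_in_html_blocks",
    "markdown+markdown_attribute",
]


def build_command(variant, src="index.html", out="output.html",
                  template="template.html"):
    """Return the pandoc invocation from the report for one --from value."""
    return ["pandoc", src, "--from", variant,
            "--output", out, "--template", template]


def time_conversion(variant, **kwargs):
    """Run pandoc once and return wall-clock seconds (requires pandoc)."""
    start = time.perf_counter()
    subprocess.run(build_command(variant, **kwargs), check=True)
    return time.perf_counter() - start


def main():
    """Time every variant once and print the results."""
    for variant in VARIANTS:
        print(f"{variant}: {time_conversion(variant):.2f} s")
```

Calling `main()` in each size subdirectory gives one row of timings per input size. A single run per variant is noisy, so repeating each measurement a few times would be sensible for the fast variants.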

The example input files are in https://github.com/wch/pandoc-hang, with a subdirectory for each input file size. For example, the 100KB input file is in:
https://github.com/wch/pandoc-hang/tree/master/simplified-100kb

I also tried changing the specific content in the <script> tags, and that makes a big difference in speed. In my use case, it's R code in a string, but when I replace it with just blank spaces, the conversion is fast for all of those settings. So there's something about that particular content that slows it down.
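To vary the content inside the <script> tags while keeping everything else fixed, the input file can be generated rather than edited by hand. The following Python sketch wraps arbitrary text in the same HTML skeleton shown earlier; `make_input` is a hypothetical helper, not part of the linked repository.

```python
import json

# Same skeleton as index.html above, with a slot for the script content.
TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<script>{payload}</script>
</body>
</html>
"""


def make_input(text):
    """Build index.html content with `text` as the JSON string value of "x"."""
    # json.dumps escapes quotes and newlines but not "</script>",
    # so that sequence should not appear in `text`.
    return TEMPLATE.format(payload=json.dumps({"x": text}))
```

For example, `make_input(" " * 50_000)` produces the blank-space case, which was fast under every setting. The R code that triggers the slowdown would have to be taken from the files in the repository linked above.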

@jgm
Owner
@wch

Yes, many of them.

@jgm jgm added a commit that closed this issue
@jgm HTML reader: rewrote htmlInBalanced.
This version avoids an exponential performance problem with `<script>` tags,
and it should be faster in general.

Closes #2730.
1534052
@jgm jgm closed this in 1534052
@jgm
Owner

Thanks for the excellent, detailed bug report. I think this commit fixes the problem (I tested on your files). But let me know if it doesn't.

@jgm jgm added a commit that referenced this issue
@jgm Markdown reader: use htmlInBalanced for rawVerbatimBlock.
This should give better performance.

See #2730.
04d1e40
@wch

Great, thanks for the quick fix!
