markdown to docx unusually slow #2356

Closed
garrettgman opened this Issue · 8 comments

@garrettgman

This file prints the numbers from 1 to 10000. It takes seconds to render to HTML or PDF:

pandoc foo.md -f markdown -t latex -o testdoc.pdf
pandoc foo.md -f markdown -t html -o testdoc.html

But it takes eight minutes to render to Word:

pandoc foo.md -f markdown -t docx -o testdoc.docx

I notice this difference often when I use pandoc through R Markdown to report on data. If I try to do something more modest (like print the numbers from 1 to 1000) I do not notice much of a difference.

@jgm
Owner

Some experiments: I changed the file from a fenced code block to an indented one, to allow testing arbitrary numbers of lines:

| Lines | Seconds |
|------:|--------:|
|    10 |    0.05 |
|    20 |    0.09 |
|    40 |    0.25 |
|    80 |    0.99 |
|   160 |    9.22 |
|   320 |   76.94 |
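To make the growth rate in these timings explicit, here is a small sketch that estimates the local exponent k in time ~ n^k between consecutive rows. The numbers are copied from the table. The helper name `local_exponents` is made up for illustration and is not part of pandoc.

```python
import math

# (lines, seconds) pairs copied from the timing table above.
timings = [(10, 0.05), (20, 0.09), (40, 0.25), (80, 0.99), (160, 9.22), (320, 76.94)]

def local_exponents(samples):
    """Estimate k in time ~ n**k for each pair of consecutive samples."""
    result = []
    for (n0, t0), (n1, t1) in zip(samples, samples[1:]):
        # log(t1/t0) / log(n1/n0) is the slope on a log-log plot.
        result.append(math.log(t1 / t0) / math.log(n1 / n0))
    return result

exponents = local_exponents(timings)
for (n0, _), (n1, _), k in zip(timings, timings[1:], exponents):
    print(f"{n0:>3} -> {n1:>3} lines: exponent ~ {k:.2f}")
```

The estimated exponent climbs from under 1 for the smallest inputs (presumably dominated by startup cost) to roughly 3 for the last two doublings, so the slowdown is far worse than linear in the input size.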

I also tried a version where the code block has just one enormously long line (converting newlines into spaces), and that also takes forever.

@garrettgman

Thanks for looking into this, John. I should've mentioned that it began as an issue over at the rstudio/rmarkdown repository: rstudio/rmarkdown#490.

@jgm
Owner

Further experiment, breaking it down to its core (now using code spans and just a string of xs):

$ python -c 'print ("`" + 60000 * "x" + "`")' | pandoc -o 2356.docx

@jgm
Owner

Also, --no-highlight has no real effect. This suggests that the problem is not specific to code spans. For a single line of unhighlighted text, not much more is going on than a single application of formattedString to the code. (And this is a simple function that just puts the code in some tags.)

Confirmation:

$ python -c 'print (60000 * "x")' | pandoc -o 2356.docx --no-highlight

also takes forever. This should just be a single long paragraph with regular text.
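For anyone who wants to repeat this with several input sizes, the one-liners above could be wrapped in a small timing script along the following lines. This is only a sketch: it assumes `pandoc` is on the PATH, and the function names (`make_input`, `time_pandoc`, `run_benchmark`) are invented here.

```python
import os
import subprocess
import tempfile
import time

def make_input(n_chars, code_span=False):
    """Build the test document: n_chars x's, optionally wrapped in a code span."""
    body = "x" * n_chars
    return f"`{body}`\n" if code_span else body + "\n"

def time_pandoc(markdown):
    """Pipe markdown to pandoc, write a docx to a temp dir, return elapsed seconds."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "2356.docx")
        start = time.perf_counter()
        # Same invocation as the shell examples: input on stdin, docx output file.
        subprocess.run(["pandoc", "-o", out], input=markdown.encode("utf-8"), check=True)
        return time.perf_counter() - start

def run_benchmark(sizes=(1000, 2000, 4000, 8000), code_span=False):
    """Print one timing per input size (requires pandoc to be installed)."""
    for n in sizes:
        print(n, f"{time_pandoc(make_input(n, code_span)):.2f}s")
```

Calling `run_benchmark()` and then `run_benchmark(code_span=True)` compares plain text against a code span. Start with small sizes, since affected versions slow down very quickly.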

@jgm
Owner

I found the cause: commit f3aa03e which strips out invalid characters. I think this can easily be fixed by doing the stripping in the XML file rather than the Pandoc structure (bottomUp from Text.Pandoc.Generic is inefficient.)

@mpickering

@jgm added a commit that closed this issue
Docx writer: Moved invalid character stripping to `formattedString`.
This avoids an inefficient generic traversal.

Updates f3aa03e.

Closes #2356.
0ad576e
@jgm closed this in 0ad576e
@jgm
Owner

@mpickering, I solved this by doing the stripping in formattedString, avoiding the use of bottomUp.
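The general idea, sketched in Python rather than the writer's actual Haskell: filter out characters that XML 1.0 does not allow at the point where each string is turned into XML, so no separate pass over the whole document tree is needed. The `formatted_string` function below is a simplified, hypothetical stand-in for `formattedString`, not a translation of the real code. The character ranges follow the `Char` production of the XML 1.0 specification.

```python
from xml.sax.saxutils import escape

def is_valid_xml_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def strip_invalid_xml_chars(text):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return "".join(ch for ch in text if is_valid_xml_char(ch))

def formatted_string(text):
    """Hypothetical stand-in: clean one string and wrap it in a text run."""
    cleaned = strip_invalid_xml_chars(text)
    # escape() handles &, < and >; the run markup is heavily simplified.
    return f'<w:r><w:t xml:space="preserve">{escape(cleaned)}</w:t></w:r>'
```

Each string is cleaned exactly once, when it is written out, so the cost of stripping is linear in the amount of text.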

@mpickering
Collaborator

Sorry! Didn't realise the files got so large.
