markdown to docx unusually slow #2356
Very slow (impossible?) to render word_document with large data display. #490
Some experiments: I changed the file from a fenced code block to an indented one, to allow testing arbitrary numbers of lines:
Lines | Seconds |
---|---|
10 | 0.05 |
20 | 0.09 |
40 | 0.25 |
80 | 0.99 |
160 | 9.22 |
320 | 76.94 |
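
To make the growth rate in the table above explicit, here is a small sketch that estimates the exponent k in "time grows like lines^k" between each pair of consecutive rows. The timings are copied from the table; the function name `doubling_exponents` is only illustrative.

```python
import math

# Timings copied from the table above: (lines, seconds).
timings = [(10, 0.05), (20, 0.09), (40, 0.25), (80, 0.99), (160, 9.22), (320, 76.94)]


def doubling_exponents(points):
    """Estimate k in time ~ n**k for each pair of consecutive measurements."""
    exponents = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        exponents.append(math.log(t1 / t0) / math.log(n1 / n0))
    return exponents


exponents = doubling_exponents(timings)
for (n0, _), (n1, _), k in zip(timings, timings[1:], exponents):
    print(f"{n0:>3} -> {n1:>3} lines: exponent ~ {k:.2f}")
```

The estimated exponent starts below 1 for the smallest inputs and reaches roughly 3 for the last two doublings, so the slowdown is far worse than linear in the input size.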
I also tried a version where the code block has just one enormously long line (converting newlines into spaces), and that also takes forever.
Thanks for looking into this, John. I should've mentioned that it began as an issue over at the rstudio/rmarkdown repository: rstudio/rmarkdown#490.
Further experiment, breaking it down to its core (now using code spans and just a string of `x`s):
% python -c 'print ("`" + 60000 * "x" + "`")' | pandoc -o 2356.docx
Also, `--no-highlight` has no real effect. This suggests that the problem is not specific to code spans. For a single line of unhighlighted text, not much more is going on than a single application of `formattedString` to the code. (And this is a simple function that just puts the code in some tags.)
Confirmation:
$ python -c 'print (60000 * "x")' | pandoc -o 2356.docx --no-highlight
also takes forever. This should just be a single long paragraph with regular text.
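
For anyone who wants to repeat the measurement at several sizes, the following is a rough timing harness, assuming `pandoc` is on the PATH. The helper names `make_input` and `time_pandoc` are made up for this sketch; the pandoc invocation mirrors the command above (markdown on stdin, `-o` with a `.docx` name, `--no-highlight`).

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def make_input(n_chars):
    """Same input as the one-liner above: n_chars copies of 'x' plus a newline."""
    return "x" * n_chars + "\n"


def time_pandoc(markdown, extra_args=("--no-highlight",)):
    """Time one pandoc run to docx; return seconds, or None if pandoc is missing."""
    pandoc = shutil.which("pandoc")
    if pandoc is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.docx"
        start = time.perf_counter()
        # pandoc reads markdown from stdin when no input file is given.
        subprocess.run(
            [pandoc, "-o", str(out), *extra_args],
            input=markdown.encode("utf-8"),
            check=True,
        )
        return time.perf_counter() - start
```

Calling `time_pandoc(make_input(n))` for a few doubling values of `n` (start small, since an affected version may take minutes at 60000) gives numbers that can be compared across pandoc versions.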
I found the cause: commit f3aa03e, which strips out invalid characters. I think this can easily be fixed by doing the stripping in the XML file rather than the Pandoc structure (`bottomUp` from `Text.Pandoc.Generic` is inefficient).
@mpickering, I solved this by doing the stripping in `formattedString`, avoiding the use of `bottomUp`.
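
To illustrate the idea of stripping at the point where a string is emitted, instead of in a generic pass over the whole document tree, here is a minimal Python sketch. It is not pandoc's code (the real writer is Haskell): `formatted_string` is a hypothetical stand-in for `formattedString`, and the `w:r`/`w:t` wrapper is only an assumed example of "putting the text in some tags". The character ranges are the ones XML 1.0 allows.

```python
from xml.sax.saxutils import escape


def is_valid_xml_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text):
    """Drop characters that cannot appear in an XML 1.0 document (one linear pass)."""
    return "".join(ch for ch in text if is_valid_xml_char(ch))


def formatted_string(text):
    """Hypothetical emitter: strip invalid characters, escape, then wrap in tags."""
    clean = strip_invalid_xml_chars(text)
    return '<w:r><w:t xml:space="preserve">' + escape(clean) + "</w:t></w:r>"
```

Each string is filtered exactly once, just before it is written, so the cost stays proportional to the length of the text.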
This file prints the numbers from 1 to 10000. It takes seconds to render to HTML or PDF:
But it takes eight minutes to render to Word:
I notice this difference often when I use pandoc through R Markdown to report on data. If I try to do something more modest (like print the numbers from 1 to 1000) I do not notice much of a difference.