markdown to docx unusually slow #2356

garrettgman · Aug 10, 2015

This file prints the numbers from 1 to 10000. It takes seconds to render to HTML or PDF:

pandoc foo.md -f markdown -t latex -o testdoc.pdf
pandoc foo.md -f markdown -t html -o testdoc.html

But it takes eight minutes to render to Word:

pandoc foo.md -f markdown -t docx -o testdoc.docx

I notice this difference often when I use pandoc through R Markdown to report on data. If I try to do something more modest (like print the numbers from 1 to 1000) I do not notice much of a difference.
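
For anyone who wants to reproduce this without the linked file, here is a minimal sketch of a generator for a similar input. It assumes the file is a single code block with one number per line, which may not match the original exactly; `make_numbers_markdown` and its `fenced` parameter are names made up for this illustration.

```python
def make_numbers_markdown(n: int, fenced: bool = True) -> str:
    """Return Markdown holding one code block that lists 1..n, one per line.

    Assumption: the real test file has roughly this shape.
    """
    numbers = [str(i) for i in range(1, n + 1)]
    if fenced:
        fence = "`" * 3  # built at runtime to avoid a literal fence in this listing
        body = [fence] + numbers + [fence]
    else:
        # An indented code block uses four leading spaces on every line.
        body = ["    " + line for line in numbers]
    return "\n".join(body) + "\n"


# Usage (writes the input for the pandoc commands above):
#     with open("foo.md", "w", encoding="utf-8") as fh:
#         fh.write(make_numbers_markdown(10000))
```

Passing `fenced=False` produces an indented code block instead, which makes it easy to truncate the file to any number of lines.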

jgm · Aug 10, 2015

Very strange! It's just a giant code block. Looking at the code for Text.Pandoc.Writers.Docx, I can't see any obvious reason why there'd be a performance problem here, so this is a puzzle that needs looking into.

jgm · Aug 10, 2015

Some experiments: I changed the file from a fenced code block to an indented one, to allow testing arbitrary numbers of lines:

| Lines | Seconds |
|------:|--------:|
| 10    | 0.05    |
| 20    | 0.09    |
| 40    | 0.25    |
| 80    | 0.99    |
| 160   | 9.22    |
| 320   | 76.94   |
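
To make the growth rate in these timings explicit, the following sketch computes how much slower each run is than the previous one, and the implied local exponent (1 would be linear, 2 quadratic). The numbers are copied from the measurements above; `doubling_ratios` is a helper name invented for this illustration.

```python
import math

# (lines, seconds) pairs copied from the measurements above.
timings = [(10, 0.05), (20, 0.09), (40, 0.25), (80, 0.99), (160, 9.22), (320, 76.94)]


def doubling_ratios(data):
    """Return (lines, slowdown, local_exponent) for each consecutive pair."""
    results = []
    for (n0, t0), (n1, t1) in zip(data, data[1:]):
        slowdown = t1 / t0
        # Exponent k such that t1/t0 == (n1/n0) ** k.
        exponent = math.log(slowdown) / math.log(n1 / n0)
        results.append((n1, slowdown, exponent))
    return results


for lines, slowdown, exponent in doubling_ratios(timings):
    print(f"{lines:>4} lines: {slowdown:5.2f}x the previous time, local exponent ~{exponent:.2f}")
```

With linear scaling each doubling would cost about 2x. Here the last two doublings each cost more than 8x, so the running time grows faster than quadratically over this range.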

I also tried a version where the code block has just one enormously long line (converting newlines into spaces), and that also takes forever.

garrettgman · Aug 10, 2015

Thanks for looking into this, John. I should've mentioned that it began as an issue over at the rstudio/rmarkdown repository: rstudio/rmarkdown#490.

jgm · Aug 10, 2015

Further experiment, breaking it down to its core (now using code spans and just a string of xs):

% python -c 'print ("`" + 60000 * "x" + "`")' | pandoc -o 2356.docx

jgm · Aug 10, 2015

Also, --no-highlight has no real effect. This suggests that the problem is not specific to code spans. For a single line of unhighlighted text, not much more is going on than a single application of formattedString to the code. (And this is a simple function that just puts the code in some tags.)

Confirmation:

$ python -c 'print (60000 * "x")' | pandoc -o 2356.docx --no-highlight

also takes forever. This should just be a single long paragraph with regular text.

jgm · Aug 10, 2015

I found the cause: commit f3aa03e, which strips out invalid characters. I think this can easily be fixed by doing the stripping in the XML file rather than in the Pandoc structure (bottomUp from Text.Pandoc.Generic is inefficient).

@mpickering

jgm · Aug 10, 2015

@mpickering, I solved this by doing the stripping in formattedString, avoiding the use of bottomUp.
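
As a rough illustration of the idea (not pandoc's actual Haskell code), stripping invalid characters can be done in a single linear pass over each string at the point where it is written out. The sketch below is in Python and assumes "invalid" means characters outside the XML 1.0 `Char` production; `is_valid_xml_char` and `strip_invalid_xml_chars` are hypothetical names.

```python
def is_valid_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text: str) -> str:
    """Drop disallowed characters in one pass; cost is linear in len(text)."""
    return "".join(ch for ch in text if is_valid_xml_char(ch))


# Usage: clean a string just before wrapping it in XML tags.
print(repr(strip_invalid_xml_chars("ok\x00 text\x0b here\n")))
```

Filtering at the leaf, where the string is formatted, touches each character once and avoids a separate traversal of the whole document tree.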

mpickering · Aug 10, 2015

Sorry! Didn't realise the files got so large.

garrettgman referenced this issue in rstudio/rmarkdown Aug 10, 2015
Closed
Very slow (impossible?) to render word_document with large data display. #490

jgm added a commit that closed this issue Aug 10, 2015

jgm Docx writer: Moved invalid character stripping to `formattedString`.
This avoids an inefficient generic traversal. Updates f3aa03e. Closes #2356.
0ad576e

jgm closed this in 0ad576e Aug 10, 2015
