Text Layout Requirements When Encountered East Asian Languages #2586
For example, take a demo file demo.md with the following content:
## Case 1: only East Asian Characters
我能吞下玻璃,
而不伤身体。我能吞下玻璃
而不伤身体。
## Case 2: Only Western Characters
The quick brown fox,
jumps over the lazy dog. The quick brown fox
jumps over the lazy dog.
## Case 3: Blended
我能吞下玻璃而不伤身体,
the quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog,
我能吞下玻璃而不伤身体。
中文和
English 混合排版。
English blended with
中文.
Using pandoc to convert it to HTML:
pandoc -f markdown -s -S demo.md -o demo-ext-off.html
pandoc -f markdown+ignore_line_breaks -s -S demo.md -o demo-ext-on.html
Without the extension (red marks point out the pitfalls; I highlighted the spaces in the browser simply with Control+F):
With the extension:
I think pandoc should be more intelligent and insert a space only
- between two Western characters, e.g. apple\n + pie → apple pie,
- between an Asian character and a Western character, e.g. 豆瓣\n + FM → 豆瓣 FM,

and add no extra space in other cases.
Or, to put it more simply: always add a space when joining lines, except when the previous line ends with an East Asian character and the next line starts with one.
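The simplified rule can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not pandoc's implementation: `is_east_asian` and `join_lines` are hypothetical names, and treating Unicode "Wide" and "Fullwidth" characters as East Asian is an assumption that may not match every script or punctuation mark.

```python
import unicodedata


def is_east_asian(ch: str) -> bool:
    # Assumption: characters whose East Asian Width is Wide (W) or
    # Fullwidth (F) count as "East Asian" for the purpose of this rule.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(lines):
    """Join the lines of one paragraph, adding a space at each line
    break unless both neighbouring characters are East Asian."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if result and not (is_east_asian(result[-1]) and is_east_asian(line[0])):
            result += " "
        result += line
    return result


print(join_lines(["apple", "pie"]))
print(join_lines(["豆瓣", "FM"]))
print(join_lines(["而不伤身体。我能吞下玻璃", "而不伤身体。"]))
```

With this rule, Western words keep their separating space, mixed lines get one space at the boundary, and purely Chinese lines are joined without any space.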
One approach would be to implement this option using an AST filter (internal to pandoc), instead of in the Markdown parser. The AST contains Space elements for spaces and soft line breaks (though it doesn't currently distinguish between the two---that may change soon). The filter could look for and remove Space elements when they occur between two Chinese characters. Note that (unlike the current approach) this would also affect line-internal spaces -- they would be collapsed too. Let me know if that's not desirable.
Are spaces ever used between two Chinese characters, or would it be safe for pandoc to avoid this by default?
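As a rough sketch of the filter idea, the following Python function walks a list of inline elements and drops a Space when it sits between two wide characters. It assumes the inlines are given in the shape of pandoc's JSON AST (dictionaries such as `{"t": "Str", "c": "text"}` and `{"t": "Space"}`); the names `is_wide` and `drop_cjk_spaces` are hypothetical, and the check based on East Asian Width is an assumption, not how pandoc itself classifies characters.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    # Assumption: Wide (W) and Fullwidth (F) characters are treated as CJK.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def drop_cjk_spaces(inlines):
    """Return a copy of `inlines` without Space elements that sit
    between a Str ending in a wide character and a Str starting with one."""
    out = []
    for i, el in enumerate(inlines):
        if el.get("t") == "Space" and 0 < i < len(inlines) - 1:
            prev, nxt = inlines[i - 1], inlines[i + 1]
            if (prev.get("t") == "Str" and nxt.get("t") == "Str"
                    and prev["c"] and nxt["c"]
                    and is_wide(prev["c"][-1]) and is_wide(nxt["c"][0])):
                continue  # skip the space between two wide characters
        out.append(el)
    return out


chinese = [{"t": "Str", "c": "我能吞下玻璃"}, {"t": "Space"},
           {"t": "Str", "c": "而不伤身体。"}]
english = [{"t": "Str", "c": "apple"}, {"t": "Space"},
           {"t": "Str", "c": "pie"}]
print(len(drop_cjk_spaces(chinese)), len(drop_cjk_spaces(english)))
```

Because Space does not distinguish a typed space from a soft line break, this sketch collapses both, which is exactly the side effect the question above is about.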
It would be better not to affect line-internal spaces.
Spaces are never used between two Chinese characters.
Of course, some people occasionally write "注 意 ! !" (A T T E N T I O N ! ! !), but that is not normal usage. I would recommend they use a fullwidth space (U+3000) instead of an ordinary space (U+0020): 注 意 ! ! → 注　意　!　!.
So it would be safe for pandoc to avoid this by default.
For your information, adding a space between a Chinese character and a Western character is not a rule everyone follows; it is more of a common convention among those who care about typesetting (see https://github.com/sparanoid/chinese-copywriting-guidelines/blob/master/README.en.md#place-one-space-before--after-english-words).
But everyone should agree that this is bad: fox\n + jumps → foxjumps.
Pandoc does have a relevant extension:
ignore_line_breaks
: Causes newlines within a paragraph to be ignored, rather than being treated
as spaces or as hard line breaks. This option is intended for use with East
Asian languages where spaces are not used between words, but text is
divided into lines for readability.
But this extension does not work as expected, because we always end up using some English when writing in East Asian languages. If we do not turn the extension on, lines of East Asian characters (Chinese, for example) are joined with many extra spaces, which looks ugly.
If we do turn it on, lines of Western characters (English, for example) are run together into a mess, with several pairs of words fused into one. That not only looks bad but also changes the content.