Text Layout Requirements When Encountered East Asian Languages #2586

district10 · Dec 11, 2015

Better mixed typesetting of East Asian and Western text | Text Layout Requirements When Encountered East Asian Languages

Pandoc does have a relevant extension:

ignore_line_breaks

: Causes newlines within a paragraph to be ignored, rather than being treated
as spaces or as hard line breaks. This option is intended for use with East
Asian languages where spaces are not used between words, but text is
divided into lines for readability.

But this extension does not work as expected in practice, because I almost
always use some English when writing in an East Asian language. If the
extension is off, lines of East Asian text (e.g. Chinese) are joined with many
extra spaces, which looks ugly. If the extension is on, lines of Western text
(e.g. English) run together into a mess: it is not just ugly, the content
itself changes (e.g. several pairs of words are fused into one).
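
The two failure modes can be reproduced with a small sketch. The helper `join_lines` is hypothetical and only mimics the two behaviors described here (joining wrapped lines with a space versus with nothing); it is not pandoc's actual implementation.

```python
def join_lines(paragraph: str, ignore_line_breaks: bool) -> str:
    """Join the wrapped lines of one paragraph.

    Hypothetical model: with the extension off, a newline becomes a space;
    with it on, the newline is simply dropped.
    """
    separator = "" if ignore_line_breaks else " "
    return separator.join(line.strip() for line in paragraph.splitlines())


chinese = "我能吞下玻璃，\n而不伤身体。"
english = "The quick brown fox\njumps over the lazy dog."

# Extension off: a stray space appears after the fullwidth comma.
print(join_lines(chinese, ignore_line_breaks=False))
# Extension on: "fox" and "jumps" are fused into one word.
print(join_lines(english, ignore_line_breaks=True))
```

Neither setting handles a paragraph that mixes both scripts, which is the case raised in this issue.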

district10 · Dec 11, 2015

For example, there is a demo file demo.md with content:

## Case 1: only East Asian Characters

我能吞下玻璃，
而不伤身体。我能吞下玻璃
而不伤身体。

## Case 2: Only Western Characters

The quick brown fox, 
jumps over the lazy dog. The quick brown fox
jumps over the lazy dog.

## Case 3: Blended

我能吞下玻璃而不伤身体，
the quick brown fox jumps over the lazy dog.

The quick brown fox jumps over the lazy dog,
我能吞下玻璃而不伤身体。

中文和
English 混合排版。

English blended with
中文.

Convert it to HTML with pandoc, once without and once with the extension:

pandoc -f markdown -s -S demo.md -o demo-ext-off.html
pandoc -f markdown+ignore_line_breaks -s -S demo.md -o demo-ext-on.html

Without the extension (red marks point out the pitfalls; I highlighted the spaces in the browser simply with Control+F):

With the extension:

district10 · Dec 11, 2015

I think Pandoc should be more intelligent and insert a space only

between two Western characters, e.g. apple\n + pie → apple pie,
between an East Asian character and a Western character, e.g. 豆瓣\n + FM → 豆瓣 FM,

and add no extra space in other cases.

Or, more simply:

Always add a space when joining lines, except when the previous line ends with an East Asian character and the current line starts with one.
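
A minimal sketch of that simpler rule, for illustration only. The names `is_east_asian` and `join_soft_breaks` are hypothetical, and treating Unicode "wide" (W) and "fullwidth" (F) characters as East Asian is an assumption of this sketch, not something pandoc is known to do.

```python
import unicodedata


def is_east_asian(ch: str) -> bool:
    # Assumption: characters whose East Asian Width is W or F count as
    # East Asian (CJK ideographs, kana, fullwidth punctuation, ...).
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_soft_breaks(paragraph: str) -> str:
    """Join wrapped lines, omitting the space only between two East Asian characters."""
    lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        if is_east_asian(result[-1]) and is_east_asian(line[0]):
            result += line          # East Asian + East Asian: no space
        else:
            result += " " + line    # every other combination: one space
    return result


print(join_soft_breaks("apple\npie"))
print(join_soft_breaks("豆瓣\nFM"))
print(join_soft_breaks("我能吞下玻璃，\n而不伤身体。"))
```

Only the characters on either side of the line break are inspected, so spaces inside a line are left untouched.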

jgm · Dec 11, 2015

One approach would be to implement this option using an AST filter (internal to pandoc), instead of in the Markdown parser. The AST contains Space elements for spaces and soft line breaks (though it doesn't currently distinguish between the two; that may change soon). The filter could look for and remove Space elements when they occur between two Chinese characters. Note that (unlike the current approach) this would also affect line-internal spaces: they would be collapsed too. Let me know if that's not desirable.
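
The filter idea could be sketched roughly as follows. This is not pandoc's internal code: it works on a plain Python list of dicts loosely modeled on pandoc's JSON inline representation (`{"t": "Str", "c": ...}` and `{"t": "Space"}`), and both `is_wide` and `remove_cjk_spaces` are hypothetical names. Using the Unicode East Asian Width property to detect Chinese characters is likewise an assumption.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    # Assumption: wide (W) or fullwidth (F) characters stand in for
    # "Chinese characters" in this sketch.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def remove_cjk_spaces(inlines: list) -> list:
    """Drop Space elements that sit between two Str elements whose
    adjacent characters are both wide."""
    out = []
    for i, el in enumerate(inlines):
        if el["t"] == "Space" and 0 < i < len(inlines) - 1:
            prev, nxt = inlines[i - 1], inlines[i + 1]
            if (prev["t"] == "Str" and nxt["t"] == "Str"
                    and prev["c"] and nxt["c"]
                    and is_wide(prev["c"][-1]) and is_wide(nxt["c"][0])):
                continue  # skip the Space between two wide characters
        out.append(el)
    return out


inlines = [
    {"t": "Str", "c": "我能吞下玻璃，"},
    {"t": "Space"},
    {"t": "Str", "c": "而不伤身体。"},
    {"t": "Space"},
    {"t": "Str", "c": "the"},
    {"t": "Space"},
    {"t": "Str", "c": "fox"},
]
filtered = remove_cjk_spaces(inlines)
print("".join(el.get("c", " ") for el in filtered))
```

Because the input does not distinguish a typed space from a soft line break, this sketch also collapses line-internal spaces between wide characters, which is the side effect noted in the comment.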

jgm · Dec 11, 2015

Are spaces ever used between two Chinese characters, or would it be safe for pandoc to avoid this by default?

district10 · Dec 12, 2015

It would be better not to "affect line-internal spaces".

Spaces are never used between two Chinese characters.

Of course, some people occasionally write "注 意 ！ ！" (like A T T E N T I O N ! ! !), but that is not normal usage. I would recommend that they use the fullwidth space (i.e. "　") instead of the ordinary space (i.e. " "): 注 意 ！ ！ → 注　意　！　！.

So it would be safe for pandoc to avoid this by default.

district10 · Dec 12, 2015

For your information, adding a space between a Chinese character and a Western character is not adopted by everyone; it's more of a common rule among those who care about typesetting (see https://github.com/sparanoid/chinese-copywriting-guidelines/blob/master/README.en.md#place-one-space-before--after-english-words).

But everyone should agree that this is bad: fox\n + jumps → foxjumps.

district10 changed the title from Text Layout Requirements When Encountered East Asian languages to Text Layout Requirements When Encountered East Asian Languages Dec 11, 2015

jgm closed this in 44120ea Dec 13, 2015

jgm/pandoc
