
Text Layout Requirements When Encountered East Asian Languages #2586

Closed
district10 opened this Issue · 6 comments


@district10

Better mixed typesetting of East Asian and Western text | Text Layout Requirements When Encountered East Asian Languages

Pandoc does have a relevant extension:

ignore_line_breaks

: Causes newlines within a paragraph to be ignored, rather than being treated
as spaces or as hard line breaks. This option is intended for use with East
Asian languages where spaces are not used between words, but text is
divided into lines for readability.

But this extension is not usable in practice, because I almost always use some
English when writing in an East Asian language. If the extension is turned off,
lines of East Asian text (e.g. Chinese) are joined with extra spaces, which looks
ugly. If the extension is turned on, lines of Western text (e.g. English) are
joined with no space at all, so the last word of one line and the first word of
the next are fused into one. That is not just ugly, it changes the content.
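The two failure modes can be reproduced with a toy model. This is only a sketch of the joining behaviour described above, not pandoc's parser: `join_lines` is a hypothetical helper, and the separator stands in for the extension being off (`" "`) or on (`""`).

```python
def join_lines(lines, separator):
    """Join the lines of one paragraph with a fixed separator."""
    return separator.join(line.strip() for line in lines)


chinese = ["我能吞下玻璃", "而不伤身体"]
english = ["The quick brown fox", "jumps over the lazy dog"]

# Extension off: every line break becomes a space.
off_chinese = join_lines(chinese, " ")  # unwanted space inside the Chinese text
off_english = join_lines(english, " ")  # fine

# Extension on: every line break is dropped.
on_chinese = join_lines(chinese, "")    # fine
on_english = join_lines(english, "")    # "fox" and "jumps" are fused

for label, text in [("off", off_chinese), ("off", off_english),
                    ("on", on_chinese), ("on", on_english)]:
    print(f"{label}: {text}")
```

Neither fixed separator is right for a paragraph that mixes both scripts, which is the point of this issue.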

@district10

For example, there is a demo file demo.md with content:

## Case 1: only East Asian Characters

我能吞下玻璃,
而不伤身体。我能吞下玻璃
而不伤身体。

## Case 2: Only Western Characters

The quick brown fox, 
jumps over the lazy dog. The quick brown fox
jumps over the lazy dog.

## Case 3: Blended

我能吞下玻璃而不伤身体,
the quick brown fox jumps over the lazy dog.

The quick brown fox jumps over the lazy dog,
我能吞下玻璃而不伤身体。

中文和
English 混合排版。

English blended with
中文.

Using pandoc to convert it to HTML:

pandoc -f markdown -s -S demo.md -o demo-ext-off.html
pandoc -f markdown+ignore_line_breaks -s -S demo.md -o demo-ext-on.html

Without extension (red marks point out the pitfalls; I highlighted the spaces in the browser simply with Control+F):

With extension:

@district10

I think Pandoc should be more intelligent and insert a space only

  1. between two Western characters, e.g. apple\n + pie → apple pie,
  2. between an Asian character and a Western character, e.g. 豆瓣\n + FM → 豆瓣 FM

and no extra space in all other cases.


Or, to put it more simply:

Always add a space when joining lines, except when the previous line ends with an East Asian character and the current line starts with another.
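A minimal sketch of that rule, assuming "East Asian character" can be approximated by the Unicode East Asian Width classes Wide (`W`) and Fullwidth (`F`). The helper names `is_east_asian` and `join_paragraph` are made up for illustration and are not part of pandoc.

```python
import unicodedata


def is_east_asian(ch):
    """Approximate 'East Asian character' as Unicode width class W or F."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_paragraph(lines):
    """Join lines with a space, except between two East Asian characters."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Insert a space unless both neighbours of the break are East Asian.
        if out and not (is_east_asian(out[-1]) and is_east_asian(line[0])):
            out += " "
        out += line
    return out


print(join_paragraph(["apple", "pie"]))
print(join_paragraph(["豆瓣", "FM"]))
print(join_paragraph(["我能吞下玻璃", "而不伤身体"]))
print(join_paragraph(["中文和", "English 混合排版。"]))
```

Under this approximation, punctuation at the end of a line suppresses the space only if it is itself a wide or fullwidth character (for example "，", U+FF0C); an ASCII comma is treated as a Western character.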

@district10 district10 changed the title from Text Layout Requirements When Encountered East Asian languages to Text Layout Requirements When Encountered East Asian Languages
@jgm
Owner

One approach would be to implement this option using an AST filter (internal to pandoc), instead of in the Markdown parser. The AST contains Space elements for spaces and soft line breaks (though it doesn't currently distinguish between the two---that may change soon). The filter could look for and remove Space elements when they occur between two Chinese characters. Note that (unlike the current approach) this would also affect line-internal spaces -- they would be collapsed too. Let me know if that's not desirable.
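The filter idea can be sketched on a simplified inline list. This is a Python stand-in, not pandoc's internal Haskell code: it assumes inlines shaped like pandoc's JSON output (`{"t": "Str", "c": "..."}`, `{"t": "Space"}`), it assumes that "Chinese character" can be approximated by Unicode width class `W` or `F`, and the names `is_wide` and `drop_spaces_between_wide` are hypothetical. `SoftBreak` is included on the assumption that the AST distinguishes soft line breaks from spaces.

```python
import unicodedata

# Node types treated as removable whitespace (SoftBreak is an assumption,
# for AST versions that distinguish soft line breaks from spaces).
BREAK_TYPES = ("Space", "SoftBreak")


def is_wide(ch):
    """Approximate 'Chinese character' as Unicode width class W or F."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def drop_spaces_between_wide(inlines):
    """Remove whitespace nodes that sit between two wide characters."""
    result = []
    for i, node in enumerate(inlines):
        if node["t"] in BREAK_TYPES and 0 < i < len(inlines) - 1:
            prev, nxt = inlines[i - 1], inlines[i + 1]
            if (prev["t"] == "Str" and nxt["t"] == "Str"
                    and prev["c"] and nxt["c"]
                    and is_wide(prev["c"][-1]) and is_wide(nxt["c"][0])):
                continue  # drop the whitespace between two wide characters
        result.append(node)
    return result


inlines = [
    {"t": "Str", "c": "我能"},
    {"t": "Space"},
    {"t": "Str", "c": "吞下"},
    {"t": "Space"},
    {"t": "Str", "c": "glass"},
    {"t": "Space"},
    {"t": "Str", "c": "now"},
]
filtered = drop_spaces_between_wide(inlines)
print([n["t"] for n in filtered])
```

As noted in the comment above, a filter of this kind cannot tell a line-internal space from a line break when both are `Space` nodes, so it collapses both.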

@jgm
Owner

Are spaces ever used between two Chinese characters, or would it be safe for pandoc to avoid this by default?

@district10

It would be better not to "affect line-internal spaces".

Spaces are never used between two Chinese characters.

Of course, some people occasionally write "注 意 ! !" (A T T E N T I O N ! ! !), but that's not normal. I recommend that they use the fullwidth space (U+3000, i.e. "　") instead of the ordinary space (i.e. " "): 注 意 ! ! → 注　意　!　!.

So it would be safe for pandoc to avoid this by default.

@district10

For your information, adding a space between a Chinese character and a Western character is not adopted by everyone; it's more of a common rule among those who care about typesetting (see https://github.com/sparanoid/chinese-copywriting-guidelines/blob/master/README.en.md#place-one-space-before--after-english-words).

But everyone should agree that this is bad: fox\n + jumps → foxjumps.

@jgm jgm added a commit that closed this issue
@jgm Implemented `east_asian_line_breaks` extension.
Text.Pandoc.Options: Added `Ext_east_asian_line_breaks` constructor to
`Extension` (API change).

This extension is like `ignore_line_breaks`, but smarter -- it
only ignores line breaks between two East Asian wide characters.
This makes it better suited for writing with a mix of East Asian
and non-East Asian scripts.

Closes #2586.
44120ea
@jgm jgm closed this in 44120ea