Write filter to support right-to-left direction in Persian text. #2191
If you don't want to write a filter as jgm recommended, you can always mark it up manually:
# عنوان اول {dir=rtl}
<div dir=rtl>این متن فارسی باید راست به چپ نشان داده شود.</div>
This is the English paragraph, so its direction in html should be left-to-right.
You might also be interested in the RTL discussion on talk.commonmark.org.
I think dealing with languages and directionality should become a functionality of pandoc itself rather than being delegated to filters.
My suggestion would be to primarily rely on language tags in pandoc markdown:
- the existing lang: fa-IR in the document’s metadata for declaring the main language of the document.
- <div lang="fa-IR">…</div> for longer and
- <span lang="fa-IR">…</span> for shorter sections in a language different from the main language.
Most of this is already available in pandoc:
pandoc -s -t html << EOT
# عنوان اول
.این متن فارسی باید راست به چپ نشان داده شود
<span lang=en-US>This is the English paragraph, so its direction in html should be left-to-right.</span>
.این متن فارسی باید راست به چپ نشان داده شود
---
lang: en-US, fa-IR
...
EOT
generates
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US, fa-IR" xml:lang="fa-IR">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<title></title>
<style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="عنوان-اول">عنوان اول</h1>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
<p><span lang="en-US">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
</body>
</html>
… which doesn’t look too bad as is, except that lang="en-US, fa-IR" should be replaced by lang="fa-IR" (just one main language per document), and that the full stop appears to the right of the Farsi sentences rather than to their left, in both Firefox and Safari. Ideas on this, anyone?
Unless declared explicitly, pandoc could then infer directionality from these language tags, and write, e.g.,
…
<html xmlns="http://www.w3.org/1999/xhtml" lang="fa-IR" xml:lang="fa-IR" dir="rtl">
…
<p><span lang="en-US" dir="ltr">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
…
If xml:lang
tags are needed, they could be added during this step, too.
For latex output, pandoc would just have to map lang: en-US, fa-IR
to
\setmainlanguage{farsi}
\setotherlanguages{english}
and <div lang="fa-IR">…</div> to \begin{farsi}…\end{farsi}, and <span lang="fa-IR">…</span> to \textfarsi{…} (no directionality tags needed for latex).
@nickbart1980, wasn’t otherlang
supposed to be included for LaTeX?
As far as I can understand, language direction may be specified in CSS:
:lang(fa-IR) {
direction: rtl;
}
Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.
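For illustration, the missing mapping could be sketched roughly as follows. This is a Python sketch, not pandoc's code: the table POLYGLOSSIA_NAMES and the functions polyglossia_name and preamble are made-up names, and the table only holds the three languages mentioned in this thread.

```python
# Hypothetical lookup table: primary language subtag -> polyglossia name.
# Only the languages mentioned in this thread are listed.
POLYGLOSSIA_NAMES = {"en": "english", "fa": "farsi", "fr": "french"}


def polyglossia_name(tag):
    """Map a tag such as 'fa-IR' to a polyglossia language name."""
    primary = tag.strip().split("-")[0].lower()
    if primary not in POLYGLOSSIA_NAMES:
        raise ValueError("no polyglossia name known for %r" % tag)
    return POLYGLOSSIA_NAMES[primary]


def preamble(mainlang, otherlang):
    """Build the polyglossia preamble lines for one main language
    and a (possibly empty) list of other languages."""
    lines = ["\\setmainlanguage{%s}" % polyglossia_name(mainlang)]
    others = [polyglossia_name(tag) for tag in otherlang]
    if others:
        lines.append("\\setotherlanguages{%s}" % ",".join(others))
    return "\n".join(lines)
```

With mainlang fa-IR and otherlang [en-US], this yields the two preamble lines shown earlier in the thread. A real table would need many more entries and probably variants as well.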
However, mainlang
and otherlang
are not available in any other formats than LaTeX (or else we could simply use mainlang
in the html template). A fix for this would be great, too.
As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).
On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir
] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”
Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.
There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).
However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.
Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174).
How about using lang
only for the main language (it works everywhere) and otherlang
only for LaTeX (well, it is only required there)?
As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).
I wonder whether this would work also with full stops:
:lang(fa-IR) {
direction: rtl;
unicode-bidi: bidi-override;
}
On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [
dir
] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”
The reasoning behind this recommendation would lead one to avoid as many CSS properties as possible: “[t]hat way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated”.
I don’t see why the direction should also be included in HTML (besides the language markup), if a given language can only have one direction.
How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?
That’s not so great since you would always have to tweak the source file depending on the target format. Parsing lang
into mainlang
and otherlang
(or, alternatively, discarding all items in lang
except the last for target formats that cannot ever use otherlang
for any purpose) makes more sense.
I wonder whether this would work also with full stops:
:lang(fa-IR) { direction: rtl; unicode-bidi: bidi-override; }
Unfortunately, no.
How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?
That’s not so great since you would always have to tweak the source file depending on the target format.
@nickbart1980, I don’t think so. Let’s consider the following sample:
---
lang: en
otherlang: grc, la
...
<span lang="grc">χαλεπὰ τὰ καλά</span> was the ancient Greek saying to
state that beauty is difficult to attain.
Occam’s razor reads: <span lang="la">«entia non sunt multiplicanda sine
necessitate»</span>
If you have to tweak the source depending on your target, this isn’t due to the language information in the metadata. It has to do with the lack of translation among different language identification values (#1614), the lack of special syntax for language attributes (#895) and the missing syntax for raw division and raw inline elements (#168).
Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.
To the best of my knowledge, pandoc has four variables (metadata fields) to include language information in the metadata:
- lang contains the document (main) language information.
- language includes the document main language in the ePub metadata (it may be safely replaced with lang).
- mainlang includes the document main language.
- otherlang includes other languages present in the document.
Applying Occam’s razor to these variables, I think it would read: “do not create any language variable unless strictly required”.
I agree that lang
is required to specify the primary language in the document. And otherlang
is required by polyglossia
and babel
in LaTeX.
But I think that adapting lang to the way the exception (LaTeX) works is the wrong path, because it is easier to add all secondary languages in a variable created especially for LaTeX (otherlang).
My final question is: what is wrong (or what needs to be fixed) in using lang for the main language (as it is [or would be] required [once fixed] for HTML, ePub, ConTeXt, OpenDocument and .docx) and reserving otherlang for LaTeX?
Writing the lang tag explicitly in technical documents is cumbersome. In technical documents whose main language is Persian, we often need to write in English, so it would be better if pandoc checked each paragraph and created the proper tag for it, depending on whether the paragraph is written in Persian or in English.
@khajavi, I don’t think I understand your proposal.
But first of all, why do you need language markup? If you only require it for text direction, I wonder whether this could be achieved without language or direction tagging. It is only a guess, but isn’t the Unicode bidirectional algorithm supposed to deal with this?
If you need markup for hyphenation or other language-dependent features, then you need to mark up languages.
How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?
This makes sense to me.
What it boils down to is, do we want
---
lang: fr-FR, en-US, fa-IR
...
where lang
is parsed into mainlang
(or just lang
; containing fa-IR
) and otherlang
(containing fr-FR, en-US
); or do we want
---
lang: fa-IR
otherlang: fr-FR, en-US
...
Both will work nicely with all formats (as soon as the latex writer maps fr-FR
to french
etc.). Since it's shorter, I have a slight preference for the first option.
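To make the comparison concrete, here is a small Python sketch (not pandoc code) of how a writer could reduce both metadata shapes to the same pair of values. The names split_tags and normalise_languages are made up for illustration, and the rule "last item of lang is the main language" is the one described earlier in this thread for LaTeX.

```python
def split_tags(value):
    """Split a comma-separated metadata value into clean language tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def normalise_languages(meta):
    """Return (main language, other languages) for either metadata shape.

    Option 1: everything in lang, the last item being the main language.
    Option 2: a single tag in lang plus a separate otherlang list.
    """
    langs = split_tags(meta.get("lang", ""))
    others = split_tags(meta.get("otherlang", ""))
    if not langs:
        return None, others
    return langs[-1], langs[:-1] + others
```

Both examples above then give fa-IR as the main language and fr-FR, en-US as the others, so an HTML writer could emit only the first value and a LaTeX writer could use both.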
@nickbart1980, many thanks for your reply.
I’m afraid that the first proposal doesn’t behave as you expect in pandoc-1.14.0.1.
---
lang: grc, it, fr, en, de, es
...
multiple languages
This gives the following HTML element:
<html xmlns="http://www.w3.org/1999/xhtml"
lang="grc, it, fr, en, de, es"
xml:lang="grc, it, fr, en, de, es">
In XML lang
or xml:lang
should have only one value.
Of all formats that support language markup, only LaTeX needs the list of languages used in the document. This shouldn’t determine the default way pandoc metadata deals with languages. This is the reason the otherlang variable makes sense.
And this is the reason why there is nothing to fix here. lang
should only be used with a single language value.
BTW, the proposal doesn’t work even with LaTeX (the final comma after the last language is wrong):
\documentclass[grc, it, fr, en, de, es,]{article}
If the LaTeX writer needs to be adapted to the way pandoc works, this should be done. But it is crazy to adapt pandoc to the way LaTeX works. (At least, one writer is easier to do than many writers.)
Note that language and directionality are two independent properties and shouldn't be conflated:
there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts, and the language code az can be relevant for either.
In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right.
The pandoc document metadata should have lang, otherlang and dir properties (the global dir sets the base direction). Additionally, we need the writers to properly convert the dir attribute on at least spans and divs to locally change the directionality of some ranges of text.
As I said over at commonmark discuss, I think we should be fine with supporting spans and divs with dir attributes.
In ConTeXt, we can use \righttoleft{my span content}, \startalignment[righttoleft] my div content \stopalignment and \setupalign[righttoleft] for the base direction of the document.
When using the bidi package (which only works for XeLaTeX as far as I know), they are \RL, \setRL and \usepackage[RTLdocument]{bidi} respectively.
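As a rough illustration of what a writer would have to do with a dir attribute, here is a table-driven Python sketch. It is not pandoc code: RTL_TEMPLATES and render are made-up names, the templates are simply the commands quoted above, and only the rtl case for spans and ConTeXt divs is covered (other combinations pass the content through unchanged).

```python
# Templates taken from the commands quoted in this thread;
# "%s" marks where the already rendered content goes.
RTL_TEMPLATES = {
    ("context", "span"): "\\righttoleft{%s}",
    ("context", "div"): "\\startalignment[righttoleft] %s \\stopalignment",
    ("xelatex", "span"): "\\RL{%s}",
}


def render(target, kind, direction, content):
    """Wrap content in the target's right-to-left command, if one is known."""
    if direction != "rtl":
        return content
    template = RTL_TEMPLATES.get((target, kind))
    return template % content if template else content
```

The left-to-right counterparts inside a right-to-left document would need their own entries.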
So what about pdfLaTeX and LuaLaTeX? I guess we can forget about the former, but it would be good if we could output the same commands for both Lua- and XeLaTeX. Maybe we can redefine it somehow in our LaTeX template, that is, if there is a general-purpose rtl/bidi package for LuaLaTeX (not only Arabic or only Farsi). Is there one? Otherwise, we'll just have to tell people to use either XeLaTeX or ConTeXt. Maybe @khaledhosny can shed some light on these questions, please? :)
@mb21, as commented in #1614, do you really think that dir
has to be included in the document?
If each language has one and only one direction (and the number of languages is finite), I guess pandoc should assign direction to the language internally.
Consider a dissertation on Arabic literature written in English (or any Western language). It may easily contain over a thousand passages in Arabic.
What do you think is easier to type: [Arabic text]{:ar} or [Arabic text]{dir="rtl" lang="ar"}? Which method do you think may lead to more typing mistakes?
With ConTeXt, I typeset a book in Spanish that had about a thousand passages in ancient Greek. And I was really relieved by the fact that I didn’t have to tag any of these texts. (Just in case you wonder, \setuplanguage[es][patterns={es, agr}].)
As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.
But I think it's a good idea to introduce [Arabic text]{:ar}
(or a similar simplistic syntax) as a shorthand for (and converted already by the Markdown reader to) [Arabic text]{dir="rtl" lang="ar"}
. But I'd say that's a separate issue—indeed it's #895.
lang attribute fits for latex but not for html lang-attribute-value #1614
As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.
@mb21, I think there are different issues involved here:
- The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.
- ISO 639 languages don’t contain any information about directionality. But BCP-47 codes include a script subtag. In fact, Azerbaijani can have the following three values: az-Latn, az-Cyrl and az-Arab.
- Even in languages that only use a single script written right to left, numbers and some other common characters (even characters from other scripts) should be written left to right. But I think that pandoc should add direction markup automatically.
There is a question about languages that may use different scripts that I don’t understand.
Language markup is relevant to apply resources to the tagged text, such as hyphenation dictionaries. How would you apply the right hyphenation dictionary for a language that may use more than one script if the language tag itself doesn’t say which one is in use? Directionality doesn’t help much here.
This is why I think that dir shouldn’t be included in the document.
But I think it's a good idea to introduce [Arabic text]{:ar} [...]
But I'd say that's a separate issue—indeed it's #895.
I know they are different issues, but also related.
I wanted to discuss a simplified or special language attribute, so that it could be implemented at the same time as this issue (the original issue has been open for almost 26 months).
The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.
True, but I think the (X)HTML folks have put a lot of thought into their docs and HTML remains one of the primary output targets of pandoc. Compared to LaTeX and ConTeXt their approach is much less of a mess and based on ISO standards. That's why I propose to model pandoc's model after the HTML model.
But yeah, I guess pandoc could extract a script tag from the BCP 47 string, yet this would require us to come up with (and maintain) a long list of language-to-script and script-to-direction mappings. I'm sure it's doable, and if @jgm is in favour and someone gets around to implementing it, why not? Meanwhile, mirroring the HTML model provides a working model relatively simply.
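To give an idea of what such mappings might look like, here is a small Python sketch (not a proposal for pandoc's actual code). RTL_SCRIPTS uses ISO 15924 script codes; DEFAULT_SCRIPT is a made-up and deliberately tiny language-to-script table, and script_of and direction_of are hypothetical names. Azerbaijani is left out of the default table on purpose, since its script cannot be guessed from the language alone.

```python
# ISO 15924 codes of some scripts written right to left.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}

# Hypothetical default script per language, used when the tag has
# no script subtag. A real table would be much longer.
DEFAULT_SCRIPT = {"fa": "Arab", "ar": "Arab", "he": "Hebr", "en": "Latn"}


def script_of(tag):
    """Return the script of a BCP 47 tag, or None if it cannot be told."""
    subtags = tag.replace("_", "-").split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:  # extension or private-use section starts here
            break
        if len(sub) == 4 and sub.isalpha():  # four letters: script subtag
            return sub.title()
    return DEFAULT_SCRIPT.get(subtags[0].lower())


def direction_of(tag):
    """Return 'rtl' or 'ltr' for a tag, or None if the script is unknown."""
    script = script_of(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

Here az-Arab gives rtl and az-Latn gives ltr, while a bare az gives None, which is the case where an explicit dir attribute would still be needed.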
Support bidirectional text output with XeLaTeX, ConTeXt and HTML #2419
To clarify, now you can write:
---
dir: rtl
---
# عنوان اول
این متن فارسی باید راست به چپ نشان داده شود.
<div dir="ltr">
This is an English paragraph, so its direction in html should be left-to-right.
</div>
As soon as native syntax for div
(#168) and span
(e.g. [my text]{dir=ltr}
) become available, you'll be able to use those instead.
I need to convert the Persian text like this:
# عنوان اول
این متن فارسی باید راست به چپ نشان داده شود.
This is the English paragraph, so its direction in html should be left-to-right.
To HTML like this:
Could anyone help me write a proper pandoc filter in Haskell to solve this problem?
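For what it is worth, the logic of such a filter can be sketched in a few lines. The sketch below is in Python rather than Haskell and works on pandoc's JSON representation of the document; it assumes the JSON layout of current pandoc releases (an object with a blocks list, elements of the form {"t": ..., "c": ...}), which differs from older versions. All function names are made up. The direction of a paragraph or header is guessed from its first strongly directional character, which is only a heuristic.

```python
import unicodedata


def iter_strings(node):
    """Yield the text of every Str element nested anywhere inside node."""
    if isinstance(node, dict):
        if node.get("t") == "Str":
            yield node["c"]
        else:
            yield from iter_strings(node.get("c"))
    elif isinstance(node, list):
        for child in node:
            yield from iter_strings(child)


def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None


def wrap_block(block):
    """Wrap a paragraph or header in a Div carrying a dir attribute."""
    if block.get("t") not in ("Para", "Header"):
        return block
    direction = base_direction("".join(iter_strings(block)))
    if direction is None:
        return block
    attr = ["", [], [["dir", direction]]]  # identifier, classes, key-values
    return {"t": "Div", "c": [attr, [block]]}


def transform(doc):
    """Apply wrap_block to every top-level block of a decoded JSON document."""
    doc["blocks"] = [wrap_block(block) for block in doc["blocks"]]
    return doc
```

To use it as a filter, the document would be read with json.load from standard input, passed through transform, written back with json.dump, and the script given to pandoc with --filter. Nested blocks (lists, block quotes) are not visited in this sketch.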