Write filter to support right-to-left direction in Persian text. #2191

khajavi · May 29, 2015

I need to convert the Persian text like this:

# عنوان اول
این متن فارسی باید راست به چپ نشان داده شود.

This is the English paragraph, so it's direction in html should be left-to-right.

To HTML like this:

<h1 dir="rtl">عنوان اول</h1>
<p dir="rtl">این متن فارسی باید راست به چپ نشان داده شود.</p>
<p>This is the English paragraph, so it's direction in html should be left-to-right.</p>

Could anyone help me write a proper pandoc filter in Haskell to solve this problem?

jgm · May 29, 2015

One option would be to check each block element for Persian letters (Text.Pandoc.Walk.query could be used). If they are present, the block could be converted to Div ("",[],[("dir","rtl")]) [b], where b is the original block. This would give you HTML output like

<div dir="rtl">
<h1>عنوان اول</h1>
</div>
<div dir="rtl">
<p>این متن فارسی باید راست به چپ نشان داده شود.</p>
</div>
<p>This is the English paragraph, so it's direction in html should be left-to-right.</p>

I don't know if this would work in browsers. If not, you could probably add some CSS so that an h1 or p contained in a div with dir="rtl" also gets the rtl direction.
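The approach above can be sketched without Haskell, since pandoc filters may also operate on the JSON form of the AST. The following is a minimal, unofficial sketch in Python (standard library only). It assumes the JSON layout used by recent pandoc versions (a top-level object with a "blocks" list; older releases used a different top-level shape), and the function names has_rtl_text, wrap_rtl_blocks and run_filter are made up for illustration. Instead of matching Persian letters specifically, it treats any character with Unicode bidi class R or AL as right-to-left, which also covers Arabic and Hebrew.

```python
import json
import unicodedata


def has_rtl_text(node):
    """Return True if any Str inline below this AST node has an R or AL character."""
    if isinstance(node, dict):
        if node.get("t") == "Str":
            return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in node["c"])
        # Elements without content (e.g. Space) have no "c" key.
        return has_rtl_text(node.get("c"))
    if isinstance(node, list):
        return any(has_rtl_text(child) for child in node)
    # Plain strings and numbers (attributes, header levels) are not text content.
    return False


def wrap_rtl_blocks(blocks):
    """Wrap each top-level block that contains RTL text in a Div with dir="rtl"."""
    wrapped = []
    for block in blocks:
        if has_rtl_text(block):
            attr = ["", [], [["dir", "rtl"]]]  # JSON form of ("",[],[("dir","rtl")])
            wrapped.append({"t": "Div", "c": [attr, [block]]})
        else:
            wrapped.append(block)
    return wrapped


def run_filter(stream_in, stream_out):
    """Read a pandoc JSON document, wrap its RTL blocks, and write it back out."""
    doc = json.load(stream_in)
    doc["blocks"] = wrap_rtl_blocks(doc["blocks"])
    json.dump(doc, stream_out, ensure_ascii=False)
```

To use it as a filter, one would call run_filter(sys.stdin, sys.stdout) from a script and pass that script to pandoc with --filter. Note that a block is marked rtl as soon as it contains a single right-to-left letter, so mostly-English paragraphs that quote one Persian word would also be wrapped.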

mb21 · May 31, 2015

If you don't want to write a filter as jgm recommended, you can always mark it up manually:

# عنوان اول {dir=rtl}

<div dir=rtl>این متن فارسی باید راست به چپ نشان داده شود.</div>

This is the English paragraph, so it's direction in html should be left-to-right.

You might also be interested in the RTL discussion on talk.commonmark.org.

nickbart1980 · Jun 1, 2015

I think dealing with languages and directionality should become a functionality of pandoc itself rather than being delegated to filters.

My suggestion would be to primarily rely on language tags in pandoc markdown:

the existing lang: fa-IR in the document’s metadata for declaring the main language of the document.
<div lang="fa-IR">…</div> for longer and
<span lang="fa-IR">…</span> for shorter sections in a language different from the main language.

Most of this is already available in pandoc:

pandoc -s -t html << EOT

# عنوان اول

.این متن فارسی باید راست به چپ نشان داده شود

<span lang=en-US>This is the English paragraph, so its direction in html should be left-to-right.</span>

.این متن فارسی باید راست به چپ نشان داده شود
---
lang: en-US, fa-IR
...

EOT

generates

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US, fa-IR" xml:lang="fa-IR">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="عنوان-اول">عنوان اول</h1>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
<p><span lang="en-US">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
</body>
</html>

… which doesn’t look too bad as is, except that lang="en-US, fa-IR" should be replaced by lang="fa-IR" (just one main language per document), and that in both Firefox and Safari the full stop appears to the right of the Farsi sentences rather than to their left. Ideas on this, anyone?

Unless declared explicitly, pandoc could then infer directionality from these language tags, and write, e.g.,

…
<html xmlns="http://www.w3.org/1999/xhtml" lang="fa-IR" xml:lang="fa-IR" dir="rtl">
…
<p><span lang="en-US" dir="ltr">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
…

If xml:lang tags are needed, they could be added during this step, too.

For latex output, pandoc would just have to map lang: en-US, fa-IR to

  \setmainlanguage{farsi}
  \setotherlanguages{english}

and <div lang="fa-IR">…</div> to \begin{farsi}…\end{farsi}, and <span lang="fa-IR">…</span> to \textfarsi{…} (no directionality tags needed for latex).

ousia · Jun 1, 2015

@nickbart1980, wasn’t otherlang supposed to be included for LaTeX?

As far as I can understand, language direction may be specified in CSS:

:lang(fa-IR) {
   direction: rtl;
}

nickbart1980 · Jun 1, 2015

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.
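As a rough illustration of the parsing and the missing mapping described above, here is a small Python sketch. It is not pandoc's implementation: the helper names and the POLYGLOSSIA_NAMES table are assumptions made for this example, and the table holds only the two pairs mentioned in this thread.

```python
# Hypothetical mapping from language tags to polyglossia names.
# Only the pairs mentioned in this discussion are included.
POLYGLOSSIA_NAMES = {
    "en-US": "english",
    "fa-IR": "farsi",
}


def split_lang(lang_field):
    """Split a comma-separated lang value into (mainlang, otherlang list).

    The last item is the main language; all earlier items are other languages.
    Empty items (e.g. from a trailing comma) are dropped.
    """
    tags = [tag.strip() for tag in lang_field.split(",") if tag.strip()]
    if not tags:
        raise ValueError("lang field is empty")
    return tags[-1], tags[:-1]


def polyglossia_preamble(lang_field):
    """Build the polyglossia language setup lines for a lang value."""
    main, others = split_lang(lang_field)
    lines = [r"\setmainlanguage{%s}" % POLYGLOSSIA_NAMES[main]]
    if others:
        names = ",".join(POLYGLOSSIA_NAMES[tag] for tag in others)
        lines.append(r"\setotherlanguages{%s}" % names)
    return "\n".join(lines)
```

Calling polyglossia_preamble("en-US, fa-IR") yields the \setmainlanguage{farsi} and \setotherlanguages{english} lines shown earlier; a tag missing from the table raises KeyError, which is where a complete mapping would be needed.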

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

ousia · Jun 1, 2015

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.

There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174).

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

I wonder whether this would work also with full stops:

:lang(fa-IR) {
   direction: rtl;
   unicode-bidi: bidi-override;
}

On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

The reasoning behind this recommendation would lead one to avoid as many CSS properties as possible: “[t]hat way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated”.

I don’t see why the direction should also be included in the HTML (besides the language markup) if a given language can only have one direction.

nickbart1980 · Jun 1, 2015

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

That’s not so great since you would always have to tweak the source file depending on the target format. Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

I wonder whether this would work also with full stops:
:lang(fa-IR) {
   direction: rtl;
   unicode-bidi: bidi-override;
}

Unfortunately, no.

khajavi · Jun 1, 2015

John, this solution is good for converting Markdown to HTML, but what is the general solution? It would be better for pandoc to have a built-in feature for handling right-to-left text, so I think my question (about writing a filter) was not well posed.

khajavi · Jun 1, 2015

Writing the `lang` tag explicitly in technical documents is cumbersome: when the main language is Persian, we still need to write in English very often. It would be better if pandoc checked whether each paragraph is written in Persian or in English and created the proper tag accordingly.


ousia · Jun 1, 2015

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

That’s not so great since you would always have to tweak the source file depending on the target format.

@nickbart1980, I don’t think so. Let’s consider the following sample:

---
lang: en
otherlang: grc, la
...

<span lang="grc">χαλεπὰ τὰ καλά</span> was the ancient Greek saying to
state that beauty is difficult to attain.

Occam’s razor reads: <span lang="la">«entia non sunt multiplicanda sine
necessitate»</span>

If you have to tweak the source depending on your target, this isn’t due to the language information in the metadata. It has to do with the lack of translation among different language identification values (#1614), the nonexistent special syntax for language attributes (#895) and the missing syntax for raw division and raw inline elements (#168).

Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

To the best of my knowledge, pandoc has four variables (metadata fields) to include language information in the metadata:

lang contains the document (main) language information.
language includes the document main language in the ePub metadata (it may be safely replaced with lang).
mainlang includes the document main language.
otherlang includes other languages present in the document.

Applying Occam’s razor to these variables, I think it would read: “do not create any language variable unless strictly required”.

I agree that lang is required to specify the primary language in the document. And otherlang is required by polyglossia and babel in LaTeX.

But I think that adapting lang to the way the exception (LaTeX) works is the wrong path. Because it is easier to add all secondary languages in a variable especially created for LaTeX (otherlang).

My final question is: what is wrong (or what needs to be fixed) in using lang for the main language (as it is [or would be] required [once fixed] for HTML, ePub, ConTeXt, OpenDocument and .docx) and reserving otherlang for LaTeX?

ousia · Jun 1, 2015

Writing the lang tag explicitly in technical documents is cumbersome: when the main language is Persian, we still need to write in English very often. It would be better if pandoc checked whether each paragraph is written in Persian or in English and created the proper tag accordingly.

@khajavi, I don’t think I understand your proposal.

But first of all, why do you need language markup? If you only require it for text direction, I wonder whether this could be achieved without language or direction tagging. It is only a guess, but isn’t the Unicode bidirectional algorithm supposed to deal with this?

If you need markup for hyphenation or other language-dependent features, then you need to mark up languages.

khajavi · Jun 1, 2015

@ousia I need language markup for text direction in outputs like HTML and LaTeX (mainly HTML). Without language markup, how can I do that? With the Unicode bidi algorithm? Could you explain more? My proposal is that pandoc should be able to detect the language of the text, here English or Persian, and then mark the paragraph direction as ltr or rtl.
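For context on what automatic detection could look like: the Unicode bidirectional algorithm (UAX #9) can derive a paragraph's base direction from its first strong character, and HTML's dir="auto" uses the same heuristic. The sketch below is a simplified version of that first-strong rule in Python; it ignores directional isolates and embeddings, and the function name first_strong_direction is made up for this example.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess a paragraph's base direction from its first strong character.

    Characters of bidi class L are strongly left-to-right; R and AL
    (Hebrew, Arabic, Persian letters) are strongly right-to-left.
    Digits, punctuation and spaces are skipped because they are not strong.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default
```

A paragraph such as "این متن فارسی" comes out as rtl and "This is English" as ltr, without any language tag. The result only describes direction; it says nothing reliable about which language the text is in.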

jgm · Jun 1, 2015

+++ Pablo Rodríguez [Jun 01 15 09:12 ]:

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed. There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).

I think it's a good idea.

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too. Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174). How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

This makes sense to me.

nickbart1980 · Jun 3, 2015

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

This makes sense to me.

What it boils down to is, do we want

---
lang: fr-FR, en-US, fa-IR
...

where lang is parsed into mainlang (or just lang; containing fa-IR) and otherlang (containing fr-FR, en-US); or do we want

---
lang: fa-IR
otherlang: fr-FR, en-US
...

Both will work nicely with all formats (as soon as the latex writer maps fr-FR to french, etc.). Since it's shorter, I have a slight preference for the first option.

ousia · Jun 3, 2015

@nickbart1980, many thanks for your reply.

I’m afraid that the first proposal doesn’t behave as you expect in pandoc-1.14.0.1.

---
lang: grc, it, fr, en, de, es
...

multiple languages

This gives the following HTML element:

<html xmlns="http://www.w3.org/1999/xhtml"
lang="grc, it, fr, en, de, es"
xml:lang="grc, it, fr, en, de, es">

In XML lang or xml:lang should have only one value.

Of all the formats that support language markup, only LaTeX needs the list of languages used in the document. This shouldn’t be the default in the way pandoc metadata deals with languages. This is the reason the otherlang variable makes sense.

And this is the reason why there is nothing to fix here. lang should only be used with a single language value.

BTW, the proposal doesn’t work even with LaTeX (the final comma after the last language is wrong):

\documentclass[grc, it, fr, en, de, es,]{article}

If the LaTeX writer needs to be adapted to the way pandoc works, this should be done. But it is crazy to adapt pandoc to the way LaTeX works. (At least, one writer is easier to do than many writers.)

mb21 · Aug 11, 2015

Note that language and directionality are two independent properties and shouldn't be conflated:

there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts, and the language code az can be relevant for either.

In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right.

The pandoc document metadata should have lang, otherlang and dir properties (the global dir sets the base direction). Additionally, we need the writers to properly convert the dir attribute on at least spans and divs to locally change the directionality of some ranges of text.

mb21 · Aug 12, 2015

@ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

ousia · Aug 12, 2015

@ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

Totally right (although the list belongs to #1614).

BTW, will the dir metadata field be created?

mb21 · Aug 22, 2015

As I said over at commonmark discuss, I think we should be fine with supporting spans and divs with dir attributes.

In ConTeXt, we can use \righttoleft{my span content}, \startalignment[righttoleft] my div content \stopalignment and \setupalign[righttoleft] for the base direction of the document.

When using the bidi package (which only works for XeLaTeX as far as I know), they are \RL, \setRL and \usepackage[RTLdocument]{bidi} respectively.

So what about pdfLaTeX and LuaLaTeX? I guess we can forget about the former, but it would be good if we could output the same commands for both LuaLaTeX and XeLaTeX. Maybe we can redefine it somehow in our LaTeX template, that is, if there is a general-purpose rtl/bidi package for LuaLaTeX (not only for Arabic or only for Farsi). Is there one? Otherwise, we'll just have to tell people to use either XeLaTeX or ConTeXt. Maybe @khaledhosny can shed some light on these questions, please? :)

ousia · Aug 23, 2015

@mb21, as commented in #1614, do you really think that dir has to be included in the document?

If each language has one and only one direction (and the number of languages is finite), I guess pandoc should assign direction to the language internally.

Consider a dissertation on Arabic literature written in English (or any Western language). It may easily contain over a thousand passages in Arabic.

Which do you think is easier to type: [Arabic text]{:ar} or [Arabic text]{dir="rtl" lang="ar"}? Which method do you think would lead to more typing mistakes?

With ConTeXt, I had typeset a book in Spanish that had about a thousand passages in ancient Greek. And I really was relieved by the fact that I didn’t have to tag any of these texts. (Just in case you wonder, \setuplanguage[es][patterns={es, agr}].)

mb21 · Aug 23, 2015

As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

But I think it's a good idea to introduce [Arabic text]{:ar} (or a similar simplistic syntax) as a shorthand for (and converted already by the Markdown reader to) [Arabic text]{dir="rtl" lang="ar"}. But I'd say that's a separate issue—indeed it's #895.

ousia · Aug 24, 2015

As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

@mb21, I think there are different issues involved here:

The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.
ISO 639 languages don’t contain any information about directionality. But BCP-47 codes include a script subtag. In fact, Azerbaijani can have the following three values: az-Latn, az-Cyrl and az-Arab.
Even in languages that only use a single script written right to left, numbers and some other common characters (even characters from other scripts) should be written left to right. But I think that pandoc should add direction markup automatically.
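To make the script-subtag idea concrete, here is a hedged Python sketch of deriving a direction from a BCP 47 tag such as az-Arab. It is a simplification rather than a full BCP 47 parser: the RTL_SCRIPTS set is a deliberately tiny, assumed list, the function names are hypothetical, and tags without an explicit script subtag (like az or fa-IR) give no answer, which is exactly the case where a language-to-script table would be required.

```python
# Assumed, deliberately incomplete set of right-to-left script codes.
RTL_SCRIPTS = {"Arab", "Hebr"}


def script_subtag(tag):
    """Return the four-letter script subtag of a language tag, or None."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use section; stop here.
            break
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()
    return None


def direction_from_tag(tag):
    """Return "rtl" or "ltr" when the tag names a script, otherwise None."""
    script = script_subtag(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With this, az-Arab maps to rtl and az-Latn or az-Cyrl to ltr, while plain az stays undecided.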

There is a question about languages that may use different scripts that I don’t understand.

Language markup is relevant for applying resources to the tagged text, such as hyphenation dictionaries. How would you apply the right hyphenation dictionary for a language that may use more than one script if the language tag itself doesn’t say which script is used? Directionality doesn’t help much here.

This is why I think that dir shouldn’t be included in the document.

But I think it's a good idea to introduce [Arabic text]{:ar} [...]
But I'd say that's a separate issue—indeed it's #895.

I know they are different issues, but also related.

I wanted to discuss the issue on a simplified or special language attribute, so that it could be implemented at the same time this issue is implemented (the original issue has been opened for almost 26 months).

mb21 · Aug 24, 2015

The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.

True, but I think the (X)HTML folks have put a lot of thought into their docs, and HTML remains one of the primary output targets of pandoc. Compared to LaTeX and ConTeXt, their approach is much less of a mess and is based on ISO standards. That's why I propose to model pandoc's approach on the HTML model.

But yeah, I guess pandoc could extract a script tag from the BCP 47 string, yet this would require us to come up with (and maintain) a long list of language-to-script and script-to-direction mappings. I'm sure it's doable, and if @jgm is in favour and someone gets around to implementing it, why not? Meanwhile, mirroring the HTML model provides a working model, relatively simply.

mb21 · Sep 27, 2015

To clarify, now you can write:

---
dir: rtl
---

# عنوان اول

این متن فارسی باید راست به چپ نشان داده شود.

<div dir="ltr">
This is an English paragraph, so its direction in html should be left-to-right.
</div>

As soon as native syntax for div (#168) and span (e.g. [my text]{dir=ltr}) become available, you'll be able to use those instead.

khajavi changed the title from Add filter to support right-to-left direction in Persian text. to Write filter to support right-to-left direction in Persian text. May 29, 2015

mb21 referenced this issue Aug 23, 2015
Closed
lang attribute fits for latex but not for html lang-attribute-value #1614

mb21 added a commit to mb21/pandoc that referenced this issue Sep 26, 2015

mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
2221732

mb21 added a commit to mb21/pandoc that referenced this issue Sep 26, 2015

mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
9d010c0

mb21 referenced this issue Sep 26, 2015
Merged
Support bidirectional text output with XeLaTeX, ConTeXt and HTML #2419

mb21 added a commit to mb21/pandoc that referenced this issue Sep 26, 2015

mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
7b0c1e0

jgm closed this in #2419 Sep 27, 2015
