Write filter to support right-to-left direction in Persian text. #2191

Closed
khajavi opened this Issue · 24 comments

@khajavi

I need to convert the Persian text like this:

# عنوان اول
این متن فارسی باید راست به چپ نشان داده شود.

This is the English paragraph, so its direction in html should be left-to-right.

To HTML like this:

<h1 dir="rtl">عنوان اول</h1>
<p dir="rtl">این متن فارسی باید راست به چپ نشان داده شود.</p>
<p>This is the English paragraph, so its direction in html should be left-to-right.</p>

Could anyone help me write a proper Pandoc filter in Haskell to solve this problem?
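
One possible approach, sketched here in Python rather than Haskell (pandoc filters can be written in any language that reads and writes its JSON AST): look at the first strongly directional character of each block and mark right-to-left blocks. This is only a rough sketch under stated assumptions. The function names are made up for illustration, only Str and Space inlines are inspected, and the document is assumed to use the newer JSON layout with a top-level "blocks" key. Since Para carries no attributes in pandoc's AST, the sketch wraps right-to-left paragraphs in a Div, so the HTML would be <div dir="rtl"><p>…</p></div> rather than <p dir="rtl">.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes (Hebrew, Arabic letters)
LTR_CLASSES = {"L"}        # strong left-to-right bidi class


def first_strong_direction(text):
    """Return 'rtl' or 'ltr' from the first strongly directional character, else None."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi in LTR_CLASSES:
            return "ltr"
    return None


def inline_text(inlines):
    """Join the text of pandoc JSON inlines (only Str and Space, for brevity)."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
    return "".join(parts)


def mark_block(block):
    """Hypothetical helper: add dir=rtl to right-to-left paragraphs and headers."""
    if block["t"] == "Para":
        if first_strong_direction(inline_text(block["c"])) == "rtl":
            # Para has no attributes, so wrap it in a Div that carries dir=rtl.
            return {"t": "Div", "c": [["", [], [["dir", "rtl"]]], [block]]}
    elif block["t"] == "Header":
        level, (ident, classes, attrs), inlines = block["c"]
        already_set = any(key == "dir" for key, _ in attrs)
        if not already_set and first_strong_direction(inline_text(inlines)) == "rtl":
            new_attr = [ident, classes, attrs + [["dir", "rtl"]]]
            return {"t": "Header", "c": [level, new_attr, inlines]}
    return block


def mark_document(doc):
    """Apply mark_block to every top-level block (assumes a 'blocks' key)."""
    return dict(doc, blocks=[mark_block(b) for b in doc["blocks"]])
```

To use something like this as a filter, one would load the JSON from stdin with json.load, pass it through mark_document, write the result to stdout with json.dump, and hand the script to pandoc via --filter.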

@khajavi khajavi changed the title from Add filter to support right-to-left direction in Persian text. to Write filter to support right-to-left direction in Persian text.
@jgm
Owner
@mb21

If you don't want to write a filter as jgm recommended, you can always mark it up manually:

# عنوان اول {dir=rtl}

<div dir=rtl>این متن فارسی باید راست به چپ نشان داده شود.</div>

This is the English paragraph, so its direction in html should be left-to-right.

You might also be interested in the RTL discussion on talk.commonmark.org.

@nickbart1980

I think dealing with languages and directionality should become a functionality of pandoc itself rather than being delegated to filters.

My suggestion would be to primarily rely on language tags in pandoc markdown:

  • the existing lang: fa-IR in the document’s metadata for declaring the main language of the document.
  • <div lang="fa-IR">…</div> for longer and
  • <span lang="fa-IR">…</span> for shorter sections in a language different from the main language.

Most of this is already available in pandoc:

pandoc -s -t html << EOT

# عنوان اول

.این متن فارسی باید راست به چپ نشان داده شود

<span lang=en-US>This is the English paragraph, so its direction in html should be left-to-right.</span>

.این متن فارسی باید راست به چپ نشان داده شود
---
lang: en-US, fa-IR
...

EOT

generates

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US, fa-IR" xml:lang="fa-IR">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="عنوان-اول">عنوان اول</h1>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
<p><span lang="en-US">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
<p>.این متن فارسی باید راست به چپ نشان داده شود</p>
</body>
</html>

… which doesn’t look too bad as is, except that lang="en-US, fa-IR" should be replaced by lang="fa-IR" (just one main language per document), and that the full stop appears to the right of the Farsi sentences rather than to their left, in both Firefox and Safari. Ideas on this, anyone?

Unless declared explicitly, pandoc could then infer directionality from these language tags, and write, e.g.,

…
<html xmlns="http://www.w3.org/1999/xhtml" lang="fa-IR" xml:lang="fa-IR" dir="rtl">
…
<p><span lang="en-US" dir="ltr">This is the English paragraph, so its direction in html should be left-to-right.</span></p>
…

If xml:lang tags are needed, they could be added during this step, too.

For latex output, pandoc would just have to map lang: en-US, fa-IR to

  \setmainlanguage{farsi}
  \setotherlanguages{english}

and <div lang="fa-IR">…</div> to \begin{farsi}…\end{farsi}, and <span lang="fa-IR">…</span> to \textfarsi{…} (no directionality tags needed for latex).

@ousia

@nickbart1980, wasn’t otherlang supposed to be included for LaTeX?

As far as I can understand, language direction may be specified in CSS:

:lang(fa-IR) {
   direction: rtl;
}
@nickbart1980

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.
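
As a rough illustration of the parsing and mapping described above (not pandoc's actual implementation), the following Python sketch splits a comma-separated lang value into a main language (last item) and other languages, then maps each tag to a polyglossia name. The function names and the three-entry mapping table are assumptions made for this example; a real table would have to cover far more languages.

```python
# Hypothetical and deliberately tiny mapping from primary language subtags
# to polyglossia language names; a real table would be much longer.
POLYGLOSSIA_NAMES = {"en": "english", "fa": "farsi", "fr": "french"}


def split_lang(lang_field):
    """Split 'en-US, fa-IR' into (main, others); the last item is the main language."""
    tags = [tag.strip() for tag in lang_field.split(",") if tag.strip()]
    if not tags:
        return None, []
    return tags[-1], tags[:-1]


def polyglossia_name(tag):
    """Look up the polyglossia name for a tag by its primary subtag, or None."""
    return POLYGLOSSIA_NAMES.get(tag.split("-")[0].lower())


def latex_preamble(lang_field):
    """Build polyglossia setup lines; tags missing from the table are skipped."""
    main, others = split_lang(lang_field)
    lines = []
    main_name = polyglossia_name(main) if main else None
    if main_name:
        lines.append("\\setmainlanguage{%s}" % main_name)
    other_names = [n for n in map(polyglossia_name, others) if n]
    if other_names:
        lines.append("\\setotherlanguages{%s}" % ",".join(other_names))
    return lines
```

With the value en-US, fa-IR this yields the \setmainlanguage{farsi} and \setotherlanguages{english} lines suggested earlier in the thread.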

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

@ousia

Yes, for LaTeX a comma-separated list in the metadata variable lang is parsed into mainlang (last item) and otherlang (all others), but the values, e.g., en-US, fa-IR are not mapped yet to what polyglossia (and babel) expect, e.g. english, farsi. That's one thing that would be great to have fixed.

There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).

However, mainlang and otherlang are not available in any other formats than LaTeX (or else we could simply use mainlang in the html template). A fix for this would be great, too.

Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174).

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops).

I wonder whether this would work also with full stops:

:lang(fa-IR) {
   direction: rtl;
   unicode-bidi: bidi-override;
}

On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [dir] instead of the related CSS properties when possible. That way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated.”

The reasoning behind this recommendation would lead one to avoid as many CSS properties as possible: “[t]hat way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated”.

I don’t see why the direction should also be included in HTML (besides the language markup), if a given language can only have one direction.

@nickbart1980

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

That’s not so great since you would always have to tweak the source file depending on the target format. Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

I wonder whether this would work also with full stops:

:lang(fa-IR) {
   direction: rtl;
   unicode-bidi: bidi-override;
}

Unfortunately, no.

@khajavi
@khajavi
@ousia

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

That’s not so great since you would always have to tweak the source file depending on the target format.

@nickbart1980, I don’t think so. Let’s consider the following sample:

---
lang: en
otherlang: grc, la
...

<span lang="grc">χαλεπὰ τὰ καλά</span> was the ancient Greek saying to
state that beauty is difficult to attain.

Occam’s razor reads: <span lang="la">«entia non sunt multiplicanda sine
necessitate»</span>

If you have to tweak the source depending on your target, this isn’t due to the language information in the metadata. It has to do with the lack of translation among different language identification values (#1614), nonexistent special syntax for language attributes (#895) and missing syntax for raw division and raw inline elements (#168).

Parsing lang into mainlang and otherlang (or, alternatively, discarding all items in lang except the last for target formats that cannot ever use otherlang for any purpose) makes more sense.

To the best of my knowledge, pandoc has four variables (metadata fields) to include language information in the metadata:

  • lang contains the document (main) language information.

  • language includes the document main language in the ePub metadata (it may be safely replaced with lang).

  • mainlang includes the document main language.

  • otherlang includes other languages present in the document.

Applying Occam’s razor to these variables, I think it would read: “do not create any language variable unless strictly required”.

I agree that lang is required to specify the primary language in the document. And otherlang is required by polyglossia and babel in LaTeX.

But I think that adapting lang to the way the exception (LaTeX) works is the wrong path, because it is easier to add all secondary languages in a variable created especially for LaTeX (otherlang).

My final question is: what is wrong (or what needs to be fixed) with using lang for the main language (as it is [or would be] required [once fixed] for HTML, ePub, ConTeXt, OpenDocument and .docx) and reserving otherlang for LaTeX?

@ousia

Writing lang tags explicitly in technical documents is cumbersome: when the main language is Persian, we still need to write in English many times. It would be better if Pandoc checked whether each paragraph is written in Persian or in English and created the proper tag accordingly.

@khajavi, I don’t think I understand your proposal.

But first of all, why do you need language markup? If you only require it for text direction, I wonder whether this could be achieved without language or direction tagging. It is only a guess, but isn’t the Unicode bidirectional algorithm supposed to deal with this?

If you need markup for hyphenation or other language-dependent features, then you need to mark up languages.

@khajavi
@jgm
Owner
@nickbart1980

How about using lang only for the main language (it works everywhere) and otherlang only for LaTeX (well, it is only required there)?

This makes sense to me.

What it boils down to is, do we want

---
lang: fr-FR, en-US, fa-IR
...

where lang is parsed into mainlang (or just lang; containing fa-IR) and otherlang (containing fr-FR, en-US); or do we want

---
lang: fa-IR
otherlang: fr-FR, en-US
...

Both will work nicely with all formats (as soon as the latex writer maps fr-FR to french, etc.). Since it's shorter, I have a slight preference for the first option.

@ousia

@nickbart1980, many thanks for your reply.

I’m afraid that the first proposal doesn’t behave as you expect in pandoc-1.14.0.1.

---
lang: grc, it, fr, en, de, es
...

multiple languages

This gives the following HTML element:

<html xmlns="http://www.w3.org/1999/xhtml"
lang="grc, it, fr, en, de, es"
xml:lang="grc, it, fr, en, de, es">

In XML lang or xml:lang should have only one value.

From all formats that support language markup, only LaTeX needs the list of languages used in the document. This shouldn’t be the default in the way pandoc metadata deal with languages. This is the reason the otherlang variable makes sense.

And this is the reason why there is nothing to fix here. lang should only be used with a single language value.

BTW, the proposal doesn’t work even with LaTeX (the final comma after the last language is wrong):

\documentclass[grc, it, fr, en, de, es,]{article}

If the LaTeX writer needs to be adapted to the way pandoc works, this should be done. But it is crazy to adapt pandoc to the way LaTeX works. (At least, one writer is easier to do than many writers.)

@mb21

Note that language and directionality are two independent properties and shouldn't be conflated:

there is not always a one-to-one mapping between language and script, and therefore directionality. For example, Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts, and the language code az can be relevant for either.

In some scripts, such as Arabic and Hebrew, displayed text is read predominantly from right to left, although within that flow, numbers and text from other scripts are displayed from left to right.

The pandoc document metadata should have lang, otherlang and dir properties (the global dir sets the base direction). Additionally, we need the writers to properly convert the dir attribute on at least spans and divs to locally change the directionality of some ranges of text.

@mb21

@ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

@ousia

@ousia btw, no-NO and nb-NO should be “norsk”, not “nynorsk” AFAIK

Totally right (although the list belongs to #1614).

BTW, will the dir metadata field be created?

@mb21

As I said over at commonmark discuss, I think we should be fine with supporting spans and divs with dir attributes.

In ConTeXt, we can use \righttoleft{my span content}, \startalignment[righttoleft] my div content \stopalignment and \setupalign[righttoleft] for the base direction of the document.

When using the bidi package (which only works for XeLaTeX as far as I know), they are \RL, \setRL and \usepackage[RTLdocument]{bidi} respectively.

So what about pdfLaTeX and LuaLaTeX? I guess we can forget about the former, but it would be good if we could output the same commands for both Lua- and XeLaTeX. Maybe we can redefine it somehow in our LaTeX template, that is, if there is a general-purpose rtl/bidi package for LuaLaTeX (not only arabic or only farsi). Is there one? Otherwise, we'll just have to tell people to use either XeLaTeX or ConTeXt. Maybe @khaledhosny can shed some light on these questions, please? :)

@ousia

@mb21, as commented in #1614, do you really think that dir has to be included in the document?

If each language has one and only one direction (and the number of languages is finite), I guess pandoc should assign direction to the language internally.

Consider a dissertation on Arabic literature written in English (or any Western language). It could easily contain over a thousand passages in Arabic.

Which do you think is easier to type: [Arabic text]{:ar} or [Arabic text]{dir="rtl" lang="ar"}? Which method do you think would lead to more typing mistakes?

With ConTeXt, I had typeset a book in Spanish that had about a thousand passages in ancient Greek. And I really was relieved by the fact that I didn’t have to tag any of these texts. (Just in case you wonder, \setuplanguage[es][patterns={es, agr}].)

@mb21

As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

But I think it's a good idea to introduce [Arabic text]{:ar} (or a similar simplistic syntax) as a shorthand for (and converted already by the Markdown reader to) [Arabic text]{dir="rtl" lang="ar"}. But I'd say that's a separate issue—indeed it's #895.

@ousia

As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts.

@mb21, I think there are different issues involved here:

  • The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.

  • ISO 639 languages don’t contain any information about directionality. But BCP-47 codes include a script subtag. In fact, Azerbaijani can have the following three values: az-Latn, az-Cyrl and az-Arab.

  • Even in languages that only use a single script written right to left, numbers and some other common characters (even characters from other scripts) should be written left to right. But I think that pandoc should add direction markup automatically.

There is a question about languages that may use different scripts that I don’t understand.

Language markup is relevant for applying resources to the tagged text, such as hyphenation dictionaries. How would you apply the right hyphenation dictionary for a language that may use more than one script if the language tag itself doesn’t indicate which one is meant? Directionality doesn’t help much here.

This is why I think that dir shouldn’t be included in the document.

But I think it's a good idea to introduce [Arabic text]{:ar} [...]
But I'd say that's a separate issue—indeed it's #895.

I know they are different issues, but also related.

I wanted to discuss the issue on a simplified or special language attribute, so that it could be implemented at the same time this issue is implemented (the original issue has been opened for almost 26 months).

@mb21

The link you provided is relevant for (X)HTML markup, but I don’t think it is mandatory for any text markup dealing with languages.

True, but I think the (X)HTML folks have put a lot of thought into their docs and HTML remains one of the primary output targets of pandoc. Compared to LaTeX and ConTeXt, their approach is much less of a mess and is based on ISO standards. That's why I propose basing pandoc's model on the HTML model.

But yeah, I guess pandoc could extract a script tag from the BCP 47 string, yet this would require us to come up with (and maintain) a long list of language-to-script and script-to-direction mappings. I'm sure it's doable, and if @jgm is in favour and someone gets around to implementing it, why not? Meanwhile, mirroring the HTML model provides a working model, relatively simply.
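
To make the shape of those mappings concrete, here is a small Python sketch that pulls the script subtag out of a BCP 47 tag and derives a direction from it. Everything in it is illustrative: the function names are invented, both tables are partial, and the default-script table in particular is an assumption that a real implementation would have to maintain carefully.

```python
# Partial set of right-to-left scripts, by ISO 15924 code.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}

# Hypothetical, partial language-to-default-script table, used only when
# the tag carries no explicit script subtag.
DEFAULT_SCRIPT = {"fa": "Arab", "ar": "Arab", "he": "Hebr", "en": "Latn"}


def script_of(tag):
    """Return the script for a BCP 47 tag: explicit subtag first, else the default."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            break  # extension or private-use section starts here
        if len(sub) == 4 and sub.isalpha():
            return sub.title()  # script subtags are four letters, e.g. Arab
    return DEFAULT_SCRIPT.get(subtags[0].lower())


def direction_of(tag):
    """Return 'rtl', 'ltr', or None when the script cannot be determined."""
    script = script_of(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With these tables, az-Arab resolves to rtl while az-Latn and az-Cyrl resolve to ltr, and a bare az stays undetermined because no default script is listed for it, which is exactly the ambiguity raised above.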

@mb21 mb21 added a commit to mb21/pandoc that referenced this issue
@mb21 mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
2221732
@mb21 mb21 added a commit to mb21/pandoc that referenced this issue
@mb21 mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
9d010c0
@mb21 mb21 added a commit to mb21/pandoc that referenced this issue
@mb21 mb21 Support bidirectional text output with XeLaTeX, ConTeXt and HTML
closes #2191
7b0c1e0
@jgm jgm closed this in #2419
@mb21

To clarify, now you can write:

---
dir: rtl
---

# عنوان اول

این متن فارسی باید راست به چپ نشان داده شود.

<div dir="ltr">
This is an English paragraph, so its direction in html should be left-to-right.
</div>

As soon as native syntax for div (#168) and span (e.g. [my text]{dir=ltr}) become available, you'll be able to use those instead.
