lang attribute fits for latex but not for html lang-attribute-value #1614

maybegeek · Sep 8, 2014

Hi there,

For documents in German I use

lang: ngerman
mainlang: german

The first is for babel, the second for polyglossia (I mainly use XeLaTeX). If I convert to HTML with pandoc, the lang value (ngerman) gets used, but ngerman is not a valid value for the HTML lang attribute (nor would german be).

HTML needs something like de or de-DE.

The documentation describes lang as the "language code for HTML or LaTeX documents."

So the documentation is technically right: you can use it for one or the other, but not for both at the same time.

Perhaps this needs clarification in the docs and/or a different approach. Meanwhile, you can always override the YAML attributes in your central file with command-line switches.

all the best,
christoph

ousia · Sep 10, 2014

> The documentation says lang language code for HTML or LaTeX documents.
>
> And therefore the documentation is completely right, you could use it for the one or the other, not both at the same time.

@maybegeek: In my opinion, if this isn’t a bug in pandoc, it should be improved.

The document language attribute is also relevant (at least) for:

ConTeXt
Microsoft Word
LibreOffice/Apache OpenOffice Writer
ePub

Not abstracting the lang value to fit all possible output formats that make use of it, in my opinion, is a missing implementation.

ousia · Sep 15, 2014

@maybegeek, I’m afraid I found a bug.

This source markdown document:

---
title: Titel
language: de-DE
...

# Kapitel

Mein Text

is converted by pandoc into the following standalone html document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title>Mein Titel</title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<div id="header">
<h1 class="title">Mein Titel</h1>
</div>
<h1 id="titel">Titel</h1>
</body>
</html>

Where is the language attribute in the HTML document? I miss the xml:lang="de-DE" attribute on either the <html> or the <body> element.

I use pandoc 1.12.3.3. I wonder whether this has been fixed in a later version.

maybegeek · Sep 15, 2014

Hi Pablo, not so fast :) language != lang

…

jgm · Sep 15, 2014

See the documentation (README) on variables used in templates. (Or just look at the HTML template itself.) `lang` (not `language`) is what you use for the language code in HTML or LaTeX documents. `language` is used for EPUBs specifically (we used the same names as used in Dublin Core metadata). You can set both of them, of course. That's not to say that there isn't an issue here (the original poster's): you need different settings for `lang` in LaTeX and HTML, so you can't set this in the metadata if you need both formats. Some kind of automatic conversion would be convenient.

…

ousia · Oct 19, 2014

@jgm, I think that a way to solve this would be to have a file that converts language from its HTML value to the LaTeX value.

It would be something similar to:

en -> english
en-US -> USenglish
en-GB -> UKenglish

Of course, I don’t know what the best format for pandoc is.

Could you provide the right format for the minimal sample above?

So, I could provide the file with the full list of languages supported by LaTeX.

Many thanks for your help.

ousia · Oct 25, 2014

@jgm, I attach a list that contains the equivalences between the ISO 639 language codes used in HTML and LaTeX language names.

I avoided three-letter language codes as much as I could. But there are languages that have no two-letter code (such as ancient Greek).

af -> afrikaans
af-ZA -> afrikaans
ar -> arabic
bg -> bulgarian
bg-BG -> bulgarian
br -> breton
ca -> catalan
ca-ES -> catalan
cs -> czech
cs-CZ -> czech
cy -> welsh
cy-GB -> welsh
da -> danish
da-DK -> danish
de -> ngerman
de-1901 -> german
de-AT -> naustrian
de-AT-1901 -> austrian
de-DE -> ngerman
dsb -> lowersorbian
el -> greek
el-poly -> greek.polutoniko
en -> english
en-AU -> australian
en-CA -> canadian
en-NZ -> newzealand
en-GB -> british
en-US -> american
eo -> esperanto
es -> spanish
es-ES -> spanish
et -> estonian
et-EE -> estonian
eu -> basque
eu-ES -> basque
fa -> farsi
fa-IR -> farsi
fi -> finnish
fi-FI -> finnish
fr -> french
fr-CA -> canadien
fr-FR -> french
fra-aca -> acadian
fur -> friulan
ga -> irish
ga-IE -> irish
gd -> scottish
gd-GB -> scottish
gl -> galician
gl-ES -> galician
grc -> greek.ancient
he -> hebrew
he-IL -> hebrew
hi -> hindi
hi-IN -> hindi
hr -> croatian
hr-HR -> croatian
hsb -> uppersorbian
hu -> magyar
hu-HU -> magyar
ia -> interlingua
id -> indonesian
id-ID -> indonesian
is -> icelandic
is-IS -> icelandic
it -> italian
it-IT -> italian
ja -> japanese
ja-JP -> japanese
la -> latin
lt -> lithuanian
lt-LT -> lithuanian
lv -> latvian
lv-LV -> latvian
mn -> mongolian
mn-MN -> mongolian
nb -> norsk
nb-NO -> norsk
nl -> dutch
nl-NL -> dutch
nn -> nynorsk
nn-NO -> nynorsk
no -> norsk
no-NO -> norsk
pl -> polish
pl-PL -> polish
pt -> portuguese
pt-BR -> brazilian
pt-PT -> portuguese
rm -> romansh
rm-CH -> romansh
ro -> romanian
ro-RO -> romanian
ru -> russian
ru-RU -> russian
se -> samin
se-FI -> samin
sk -> slovak
sk-SK -> slovak
sl -> slovene
sl-SI -> slovene
sr -> serbian
sv -> swedish
sv-SE -> swedish
th -> thai
th-TH -> thai
tk -> turkmen
tr -> turkish
tr-TR -> turkish
uk -> ukrainian
uk-UA -> ukrainian
vi -> vietnamese
vi-VN -> vietnamese
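
A minimal sketch of how a table like the one above could be consulted, assuming a plain dictionary and a naive fallback that drops trailing subtags. The names `BABEL_NAMES` and `to_babel` are hypothetical and only a few rows of the list are reproduced; this is not how pandoc implements the conversion.

```python
# Hypothetical excerpt of the list above (language tag -> LaTeX name).
BABEL_NAMES = {
    "de": "ngerman",
    "de-1901": "german",
    "de-AT": "naustrian",
    "de-AT-1901": "austrian",
    "en": "english",
    "en-US": "american",
    "fr": "french",
    "fr-CA": "canadien",
    "pt": "portuguese",
    "pt-BR": "brazilian",
}


def to_babel(tag, table=BABEL_NAMES):
    """Return the LaTeX name for a language tag, or None if unknown.

    The full tag is tried first; if it is not listed, trailing subtags
    are dropped one at a time, so "de-DE" falls back to "de".
    """
    parts = tag.split("-")
    while parts:
        name = table.get("-".join(parts))
        if name is not None:
            return name
        parts.pop()  # drop the last subtag and try again
    return None


print(to_babel("de-DE"))
print(to_babel("de-AT-1901"))
```

The fallback is deliberately simple: a tag such as de-CH-1901 would end up at the entry for de, which ignores the orthography variant, so a real table would need explicit entries for such cases.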

mpickering · Dec 8, 2014

Thank you for compiling this list, Pablo.

ousia · Dec 12, 2014

@mpickering, would you be interested in the corresponding list for ConTeXt?

This would be required for issue (#1667).

HughP · May 16, 2015

@ousia In thinking about inclusivity of languages, do your language lists include ISO 639-3 additions or are you limiting your list to ISO 639-1 listings? BCP47 suggests to use the shortest ISO 639 code for a language (I take that to mean that ISO 639-3 is used when there is no corresponding ISO 639-1 code for said language), while the Dublin Core standard points to ISO 639-3.

ousia · May 17, 2015

@HughP, I took the shortest code from the ISO 639-1 listing. But I had to use ISO 639-3 for languages not defined in ISO 639-1 (such as grc and similar ones).

The language list is limited to the languages LaTeX can handle. If anyone wants to add ISO 639-3 codes for the already defined ISO 639-1 codes, that would be fine. But I think this addition would make sense after the first list is implemented in pandoc.

mb21 · Aug 12, 2015

There's agreement that lang should contain ISO 639 format that is then translated to LaTeX.

But should lang be a list of languages or only one? I think I agree with @ousia that it would be better for lang to contain only one language and serve as a synonym for mainlang. @jgm?

Authors could then use otherlang explicitly to specify a list of other languages. Finally, should otherlang also be in ISO 639 format, even though it's currently only supported by LaTeX which is exactly the format that doesn't use ISO codes?

jgm · Aug 12, 2015

One possible approach would allow lang to be either a single value or a list. If a single value, it fills 'lang' and 'otherlang' is empty. If a list, the first item becomes 'lang' and the rest 'otherlang'.
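
The rule described above could be sketched as follows. This is only an illustration under the assumption that the metadata value arrives as either a string or a list of strings; the function name `split_lang` is hypothetical.

```python
def split_lang(lang):
    """Split a `lang` metadata value into (lang, otherlang).

    A single string fills `lang` and leaves `otherlang` empty.
    For a list, the first item becomes `lang` and the rest `otherlang`.
    """
    if isinstance(lang, str):
        return lang, []
    items = list(lang)
    if not items:
        raise ValueError("lang list must not be empty")
    return items[0], items[1:]


print(split_lang("de"))
print(split_lang(["en", "ar", "de"]))
```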

…

mb21 · Aug 12, 2015

Yeah, I just never understood why the last and not the first language in the list is the mainlang, so why not make it explicit? Also, I suspect it wouldn't be backwards-compatible with existing documents anyway, since those would use LaTeX format for multiple languages, not ISO 639.

ousia · Aug 12, 2015

@jgm and @mb21,

there is a pending pull request, jgm/pandoc-templates#101, that implements some points already discussed on the mailing list:

- lang should be replaced with language (only language-related names are shortened in metadata field names).
- language should only have one value.
- otherlang should be replaced with other-languages.

  This variable should include all other languages used in the document. This variable is only required by LaTeX (not even ConTeXt needs it).
- The ConTeXt template defines language synonyms so that language values can be used directly in ConTeXt documents.

There are two issues pending. I have contacted both babel and polyglossia developers. They are interested in loading languages by ISO 639 codes. In fact, these codes with language-region structure (such as en-GB) seem to be BCP-47 codes, since ISO 639 only refers to languages themselves (I realized that yesterday [@HughP, I owe you an apology]).

But it may take a while before it is implemented. I hope I can include the language synonyms in the LaTeX templates as soon as possible (so that there is no need to wait for the implementation in the packages themselves).

The pull request jgm/pandoc-templates#101 is waiting for review and I hope it may be merged.

jgm · Aug 20, 2015

I would argue that it's best to stick with lang rather than language. If we stick with lang, then we don't need to change existing xml/html templates, which already use lang, and people don't need to change their custom templates or workflows.

Although it's true that we use complete words for other fields, there might even be a reason for using lang instead of language: it is a kind of signal that this field takes technical values like en-US rather than English.

ousia · Aug 20, 2015

@jgm and @mb21,

- language is more intuitive for non-technical users.
- I think it is better to have a single language metadata field, not two.
- language is required for ePub document language, so:
  - language: de-DE would be required for ePub documents.
  - But if we want the same source to generate other formats, lang: de-DE would also be required. Sorry, but this doesn’t seem reasonable to me.
- The point here is to reject values other than ISO 639 or BCP-47 formats for languages.

  The issue is not the metadata field name, but the values that it accepts.

jgm · Aug 20, 2015

> @jgm and @mb21,
>
> - `language` is more intuitive for non-technical users.
> - I think it is better to have a single language metadata field, not two.
> - `language` is required for ePub document language, so:
>   - `language: de-DE` would be required for ePub documents.

We could easily switch to using `lang` here. (We could also set `lang` to the value of `language` behind the scenes, if only `language` is used, to avoid breaking existing documents.) Note also that `lang` *does* appear in the EPUB templates.

> - But if we want the same source to generate other formats, `lang: de-DE` would also be required. Sorry, but this doesn’t seem reasonable to me.
> - The point here is to reject values other than ISO 639 or BCP-47 formats for languages. The issue is not the metadata field name, but the values that it accepts.

Yes, but we need to settle on a single field name. I think your argument boils down to liking the less technical sounding `language`. Mine boils down to not wanting to break existing documents.
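
The backwards-compatible fallback suggested in this comment (use `language` only when `lang` is absent) might look like the sketch below. It assumes the metadata is available as a plain mapping; `resolve_lang` is a hypothetical name, not a pandoc function.

```python
def resolve_lang(metadata):
    """Return the effective language code from a metadata mapping.

    `lang` wins when present; `language` is used only as a fallback so
    that existing documents that set `language` alone keep working.
    """
    if "lang" in metadata:
        return metadata["lang"]
    return metadata.get("language")


print(resolve_lang({"language": "de-DE"}))
print(resolve_lang({"lang": "en", "language": "de-DE"}))
```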

mb21 · Aug 20, 2015

Right, we should merge the ePUB language and the lang variable since both are BCP47 now.

I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:

---
lang: en
otherlangs: [ar]
---

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

ousia · Aug 23, 2015

> > language is required for ePub document language, so:
> >
> > language: de-DE would be required for ePub documents.
>
> We could easily switch to using lang here. (We could also set lang to the value of language behind the scenes, if only language is used, to avoid breaking existing documents.)

Fine for me when either lang or language can be used with the same results.

> I think your argument boils down to liking the less technical sounding language. Mine boils down to not wanting to break existing documents.

Of course, I don’t want to break existing compatibility.

My argument isn’t about a more or less technical sounding name. It is about not mixing markups.

lang is (X)HTML markup. For most users, learning basic text markup may be extremely hard. Mixing markups would make it harder. And for users that aren’t fluent in English, it is even harder.

I agree that a good compromise is to be able to use either lang or language. So there is no broken compatibility.

ousia · Aug 23, 2015

> I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:
>
> ---
> lang: en
> otherlangs: [ar]
> ---
>
> The title in Arabic is [عنوان اول](dir=rtl lang=ar).

@mb21, could you please consider a different language handling?

Your proposal is:

The title in Arabic is [عنوان اول](dir=rtl lang=ar).

But this is already accepted syntax in Markdown:

`a = b`{.variable-assignment} may be Python code

Sorry, but braces should be preferred to parentheses for attribute assignment.

And specific to languages, I see two main issues:

1. Both unique identifiers and classes have special syntax, such as in:
   ```
   this is `code`{#snippet-1 .generic-code}
   ```
   Why is SGML/XML markup required for languages? In that case, it should also be required for unique identifiers and classes.

   Sorry, but that makes things much harder for newcomers. And even for computer experts, mixing markups may lead to mistakes.
2. Since directionality belongs to scripts and scripts are deployed by languages, why isn’t directionality assigned to the language internally, so that the user doesn’t have to specify it?

   I mean, if Arabic or Hebrew (sorry, I don’t know any other right-to-left languages [but Wikipedia has a list of scripts written right to left]) always have to be written right to left, why is the user required to include this information?

With both issues addressed, your sample would read:

The title in Arabic is [عنوان اول](:ar).

I think that with this proposal (explained in #895), the user has less to type (and there are fewer possibilities to make mistakes).
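
The second point above, deriving directionality from the language so that the user does not have to state it, could be sketched like this. The set of right-to-left languages is illustrative and not exhaustive, and `text_direction` is a hypothetical helper.

```python
# Illustrative, non-exhaustive set of primary language subtags whose
# usual script is written right to left (Arabic, Persian, Hebrew, Urdu).
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def text_direction(tag):
    """Guess "rtl" or "ltr" from the primary subtag of a language tag."""
    primary = tag.split("-")[0].lower()
    return "rtl" if primary in RTL_LANGUAGES else "ltr"


print(text_direction("ar"))
print(text_direction("de-DE"))
```

This only looks at the language subtag. A language written in more than one script would need its script subtag inspected as well, which this sketch does not attempt.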

mb21 · Aug 23, 2015

@ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

While `a = b`{.myClass} translates to <code class="myClass">a=b</code>, it has long been proposed to use [foo]{.myClass} to mean <span class="myClass">foo</span>, but this hasn't been implemented in the Markdown Reader yet.

As for your other points, I've answered at #2191.

mb21 · Aug 24, 2015

But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

ousia · Aug 24, 2015

> @ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):
>
> The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

@mb21, sorry for my misunderstanding. I thought there was a new syntax copied from CommonMark.

ousia · Aug 24, 2015

> But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

@mb21, I think the right option would be to make both names full synonyms, explaining that lang is only kept to avoid breaking backwards compatibility and advising users that the new field language should be used for new documents and templates.

ousia referenced this issue Oct 5, 2014
Open
YAML lang attribute not working in docx and ODF #1667

ousia referenced this issue Dec 7, 2014
Open
(feature) :lang special attribute syntax #895

mpickering added enhancement Minor labels Dec 8, 2014

This was referenced May 24, 2015

Closed

lang does not set mainlang when using xelatex #2174

Closed

Write filter to support right-to-left direction in Persian text. #2191

This was referenced Aug 13, 2015

Closed

Set default non-English language for citations and hyphenation #2359

Closed

Standardize lang metadata #2366

mb21 added a commit to mb21/pandoc that referenced this issue Aug 20, 2015

mb21 replace the `lang` variable with a new `language` variable in BCP47 format closes #1614
28c705c

mb21 referenced this issue Aug 20, 2015
Merged
`lang` variable is now in BCP47 format #2369

mb21 added a commit to mb21/pandoc that referenced this issue Aug 20, 2015

mb21 `lang` variable is now in BCP47 format
strings are converted for LaTeX and ConTeXt output, closes #1614
622df70

jgm closed this in #2369 Sep 23, 2015

ousia referenced this issue Dec 20, 2015
Closed
language specification for LaTeX is broken for Czech language #2597
