lang attribute fits for latex but not for html lang-attribute-value #1614

Closed
maybegeek opened this Issue · 24 comments

6 participants

@maybegeek

Hi there,

for documents in german language I use

  • lang: ngerman
  • mainlang: german

the first being for babel, the second for polyglossia (I mainly use xelatex). If I convert to HTML with pandoc, the lang attribute (ngerman) gets used, but ngerman is not a valid value for the HTML attribute (nor would german be).

HTML needs something like de or de-DE.

The documentation says lang language code for HTML or LaTeX documents.

And therefore the documentation is completely right: you can use it for one or the other, but not both at the same time.

Perhaps this needs clarification in the docs and/or a different approach. Meanwhile, you can always override the YAML attributes in your central file with command-line switches.

all the best,
christoph

@ousia

The documentation says lang language code for HTML or LaTeX documents.

And therefore the documentation is completely right: you can use it for one or the other, but not both at the same time.

@maybegeek: In my opinion, if this isn’t a bug in pandoc, it should be improved.

The document language attribute is also relevant (at least) for:

  • ConTeXt
  • Microsoft Word
  • LibreOffice/Apache OpenOffice Writer
  • ePub

Not abstracting the lang value to fit all possible output formats that make use of it, in my opinion, is a missing implementation.

@ousia

@maybegeek, I’m afraid I found a bug.

This source markdown document:

---
title: Titel
language: de-DE
...

# Kapitel

Mein Text

is converted by pandoc into the following standalone html document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title>Mein Titel</title>
  <style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<div id="header">
<h1 class="title">Mein Titel</h1>
</div>
<h1 id="titel">Titel</h1>
</body>
</html>

Where is the language attribute in the HTML document? I miss the xml:lang="de-DE" attribute on either the <html> or the <body> element.

I use pandoc 1.12.3.3. I wonder whether this has been fixed in a later version.

@maybegeek
@jgm
Owner
@ousia

@jgm, I think that a way to solve this would be to have a file that converts language from its HTML value to the LaTeX value.

It would be something similar to:

en -> english
en-US -> USenglish
en-UK -> UKenglish

Of course, I don’t know what the best format for pandoc is.

Could you provide the right format for the minimal sample above?

So, I could provide the file with the full list of languages supported by LaTeX.

Many thanks for your help.

@ousia

@jgm, I attach a list that contains the equivalences between ISO-639 language codes used in HTML and LaTeX language codes.

I avoided three-letter language codes as much as I could. But some languages have no two-letter code (such as ancient Greek).

af -> afrikaans
af-ZA -> afrikaans
ar -> arabic
bg -> bulgarian
bg-BG -> bulgarian
br -> breton
ca -> catalan
ca-ES -> catalan
cy -> welsh
cy-UK -> welsh
cz -> czech
cz-CZ -> czech
da -> danish
da-DK -> danish
de -> ngerman
de-1901 -> german
de-AT -> naustrian
de-AT-1901 -> austrian
de-DE -> ngerman
dsb -> lowersorbian
el -> greek
el-poly -> greek.polutoniko
en -> english
en-AU -> australian
en-CA -> canadian
en-NZ -> newzealand
en-UK -> british
en-US -> american
eo -> esperanto
es -> spanish
es-ES -> spanish
et -> estonian
et-EE -> estonian
eu -> basque
eu-ES -> basque
fa -> farsi
fa-IR -> farsi
fi -> finnish
fi-FI -> finnish
fr -> french
fr-CA -> canadien
fr-FR -> french
fra-aca -> acadian
fur -> friulan
ga -> irish
ga-IE -> irish
gd -> scottish
gd-UK -> scottish
gl -> galician
gl-ES -> galician
grc -> greek.ancient
he -> hebrew
he-IL -> hebrew
hi -> hindi
hi-IN -> hindi
hr -> croatian
hr-HR -> croatian
hsb -> uppersorbian
hu -> magyar
hu-HU -> magyar
id -> indonesian
id-IN -> indonesian
ie -> interlingua
is -> icelandic
is-IS -> icelandic
it -> italian
it-IT -> italian
jp -> japanese
jp-JP -> japanese
la -> latin
lt -> lithuanian
lt-LT -> lithuanian
lv -> latvian
lv-LV -> latvian
mn -> mongolian
mn-MN -> mongolian
nb -> norsk
nb-NO -> norsk
nl -> dutch
nl-NL -> dutch
nn -> nynorsk
nn-NO -> nynorsk
no -> norsk
no-NO -> norsk
pl -> polish
pl-PL -> polish
pt -> portuguese
pt-BR -> brazilian
pt-PT -> portuguese
rm -> romansh
rm-CH -> romansh
ro -> romanian
ro-RO -> romanian
ru -> russian
ru-RU -> russian
se -> samin
se-FI -> samin
sk -> slovak
sk-SK -> slovak
sl -> slovene
sl-SL -> slovene
sr -> serbian
sv -> swedish
sv-SE -> swedish
th -> thai
th-TH -> thai
tk -> turkmen
tr -> turkish
tr-TR -> turkish
uk -> ukrainian
uk-UA -> ukrainian
vi -> vietnamese
vi-VN -> vietnamese
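
A minimal sketch of how such a table could be consulted, written in Python purely for illustration. The names `BCP47_TO_BABEL` and `to_babel` are hypothetical and not part of pandoc, and the fallback rule (drop trailing subtags until something matches) is an assumption, not something decided in this thread. A few tags in the list above are not valid BCP 47 (the region subtag for the United Kingdom is GB rather than UK, and the ISO 639-1 codes for Czech and Japanese are cs and ja), so the subset below uses only entries that are.

```python
# Illustrative subset of the list above; not pandoc's actual implementation.
BCP47_TO_BABEL = {
    "de": "ngerman",
    "de-1901": "german",
    "de-AT": "naustrian",
    "de-AT-1901": "austrian",
    "de-DE": "ngerman",
    "en": "english",
    "en-US": "american",
    "fr": "french",
    "fr-CA": "canadien",
    "pt": "portuguese",
    "pt-BR": "brazilian",
}


def to_babel(tag):
    """Return the babel name for a language tag, or None if unknown.

    Tries the full tag first, then drops trailing subtags one at a time,
    so "de-CH" falls back to the entry for "de".
    """
    parts = tag.split("-")
    while parts:
        name = BCP47_TO_BABEL.get("-".join(parts))
        if name is not None:
            return name
        parts.pop()
    return None
```

With this subset, `to_babel("de-CH")` falls back to the entry for `de`, and an unknown tag such as `xx` yields `None`, so the caller can decide whether to warn or to omit the language option.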
@mpickering
Collaborator

Thank you for compiling this list Pedro.

@ousia

@mpickering, would you be interested in the corresponding list for ConTeXt?

This would be required for issue (#1667).

@HughP

@ousia In thinking about inclusivity of languages, do your language lists include ISO 639-3 additions or are you limiting your list to ISO 639-1 listings? BCP47 suggests to use the shortest ISO 639 code for a language (I take that to mean that ISO 639-3 is used when there is no corresponding ISO 639-1 code for said language), while the Dublin Core standard points to ISO 639-3.

@ousia

@HughP, I took the shortest code from the ISO 639-1 listing. But I had to use ISO 639-3 for languages not defined in ISO 639-1 (such as grc and similar ones).

The language list is limited to the languages LaTeX can handle. If anyone wants to add ISO 639-3 codes for the already defined ISO 639-1 codes, that would be fine. But I think this addition would make sense after the first list is implemented in pandoc.

@mb21

There's agreement that lang should contain ISO 639 format that is then translated to LaTeX.

But should lang be a list of languages or only one? I think I agree with @ousia that it would be better for lang to contain only one language and serve as a synonym for mainlang. @jgm?

Authors could then use otherlang explicitly to specify a list of other languages. Finally, should otherlang also be in ISO 639 format, even though it's currently only supported by LaTeX which is exactly the format that doesn't use ISO codes?

@jgm
Owner
@mb21

Yeah, I just never understood why the last and not the first language in the list is the mainlang, so why not make it explicit? Also, I suspect it wouldn't be backwards-compatible with existing documents anyway, since those would use LaTeX format for multiple languages, not ISO 639.
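
To make the "last entry is the main language" convention concrete, here is a small Python sketch. It assumes a babel-style comma-separated option string such as `english,ngerman`; the function name `split_babel_languages` is hypothetical and only serves the illustration.

```python
def split_babel_languages(option):
    """Split a babel-style option string such as "english,ngerman".

    Returns (main, others), where main is the LAST entry in the list,
    following the convention questioned in the comment above.
    """
    names = [name.strip() for name in option.split(",") if name.strip()]
    if not names:
        raise ValueError("no language given")
    return names[-1], names[:-1]
```

An explicit pair of fields (one main language, one list of other languages) would make this positional rule unnecessary.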

@ousia

@jgm and @mb21,

there is a pending pull request in jgm/pandoc-templates#101, that implements some issues already discussed on the mailing list:

  • lang should be replaced with language (only language related names are shortened in metadata field names).

  • language should hold only one value.

  • otherlang should be replaced with other-languages.

    This variable should include all other languages used in the document. This variable is only required by LaTeX (not even ConTeXt needs it).

  • The ConTeXt template defines language synonyms so that language values can be used directly in ConTeXt documents.

There are two issues pending. I have contacted both babel and polyglossia developers. They are interested in loading languages by ISO 639 codes. In fact, these codes with language-region structure (such as en-GB) seem to be BCP-47 codes, since ISO 639 only refers to languages themselves (I realized that yesterday [@HughP, I owe you an apology]).

But it may take a while before it is implemented. I hope I can include the language synonyms in the LaTeX templates as soon as possible (so that there is no need to wait for the implementation in the packages themselves).

The pull request jgm/pandoc-templates#101 is waiting for review and I hope it may be merged.

@mb21 mb21 added a commit to mb21/pandoc that referenced this issue
@mb21 mb21 replace the `lang` variable with a new `language` variable in BCP47 format

closes #1614
28c705c
@jgm
Owner

I would argue that it's best to stick with lang rather than language. If we stick with lang, then we don't need to change existing xml/html templates, which already use lang, and people don't need to change their custom templates or workflows.

Although it's true that we use complete words for other fields, there might even be a reason for using lang instead of language: it is a kind of signal that this field takes technical values like en-US rather than English.

@ousia

@jgm and @mb21,

  • language is more intuitive for non-technical users.

  • I think it is better to have a single language metadata field, not two.

  • language is required for ePub document language, so:

    • language: de-DE would be required for ePub documents.

    • But if we want the same source to generate other formats, lang: de-DE would also be required. Sorry, but this doesn’t seem reasonable to me.

    • The point here is to reject other values than ISO 639 or BCP-47 formats for languages.

      The issue is not the metadata field name, but the values that it accepts.

@jgm
Owner
@mb21

Right, we should merge the ePUB language and the lang variable since both are BCP47 now.

I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:

---
lang: en
otherlangs: [ar]
---

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.
@mb21 mb21 added a commit to mb21/pandoc that referenced this issue
@mb21 mb21 `lang` variable is now in BCP47 format
strings are converted for LaTeX and ConTeXt output, closes #1614
622df70
@ousia
  • language is required for ePub document language, so:

    • language: de-DE would be required for ePub documents.

We could easily switch to using lang here. (We could also
set lang to the value of language behind the scenes, if
only language is used, to avoid breaking existing
documents.)

Fine for me when either lang or language can be used with the same results.

I think your argument boils down to liking the less technical sounding language. Mine boils down to not wanting to break existing documents.

Of course, I don’t want to break existing compatibility.

My argument isn’t about sounding more or less technical. It is about not mixing markups.

lang is (X)HTML markup. For most users, learning basic text markup may be extremely hard. Mixing markups would make it harder. And for users that aren’t fluent in English, it is even harder.

I agree that a good compromise is to be able to use either lang or language. So there is no broken compatibility.
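
The compromise could look roughly like the following Python sketch. It assumes the document metadata is available as a plain dictionary; `resolve_lang` is a hypothetical helper, not pandoc's code. `lang` wins when both fields are present, and `language` is used only as a fallback.

```python
def resolve_lang(metadata):
    """Return the effective language tag for a document.

    Prefers `lang`; falls back to `language` so that documents that
    only set the ePub-style field keep working. Returns None if
    neither field is set.
    """
    lang = metadata.get("lang")
    if lang:
        return lang
    return metadata.get("language")
```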

@ousia

I don't have a strong opinion on lang vs language, but tend to agree with @jgm: lang is backwards compatible, there's the xml:lang and HTML lang attributes (both BCP47), and eventually we'll have documents like the following:

---
lang: en
otherlangs: [ar]
---

The title in Arabic is [عنوان اول](dir=rtl lang=ar).

@mb21, could you please consider a different language handling?

Your proposal is:

The title in Arabic is [عنوان اول](dir=rtl lang=ar).

But attributes in braces are already accepted syntax in Markdown:

`a = b`{.variable-assignment} may be Python code

Sorry, but braces should be preferred to parentheses for attribute assignment.

And specific to languages, I see two main issues:

  • Both unique identifiers and classes have special syntax, such as in:

    this is `code`{#snippet-1 .generic-code}
    

    Why is SGML/XML markup required for languages? In that case, it should be also required for unique identifiers and classes.

    Sorry, but that makes things much harder for newcomers. And even for computer experts, mixing markups may lead to mistakes.

  • Since directionality belongs to scripts and scripts are deployed by languages, why isn’t directionality assigned to the language internally, so the user doesn’t have to specify it?

    I mean, if Arabic or Hebrew (sorry, I don’t know any other right-to-left languages [but Wikipedia has a list of scripts written right to left]) always have to be written right to left, why should the user have to include this information?

With both issues, your sample would read:

The title in Arabic is [عنوان اول](:ar).

I think that with this proposal (explained in #895), the user has less to type (and there are fewer possibilities to make mistakes).
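
The idea of deriving directionality from the language can be sketched in Python as follows. The set `RTL_LANGUAGES` and the function `direction_for` are hypothetical names for this illustration, and the set is deliberately incomplete. The sketch looks only at the primary language subtag, so it ignores the case where a script subtag changes the direction.

```python
# Primary language subtags usually written right to left (incomplete).
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def direction_for(tag):
    """Guess the text direction ("rtl" or "ltr") from a language tag.

    Only the primary language subtag is considered; script subtags
    are ignored in this sketch.
    """
    primary = tag.split("-")[0].lower()
    return "rtl" if primary in RTL_LANGUAGES else "ltr"
```

With such a lookup, a span marked only with the language ar could be given dir=rtl automatically, while an explicit dir attribute could still override the guess.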

@mb21

@ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

While `a = b`{.myClass} translates to <code class="myClass">a = b</code>, it has long been proposed to use [foo]{.myClass} to mean <span class="myClass">foo</span>, but this hasn't been implemented in the Markdown Reader yet.

As for your other points, I've answered at #2191.
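
For readers unfamiliar with the brace syntax, here is a rough Python sketch of how an attribute block such as {#snippet-1 .generic-code} or {dir=rtl lang=ar} breaks down into an identifier, classes, and key-value pairs. `parse_attributes` is a hypothetical helper and a simplification: it does not handle quoted values containing spaces, and it is not how pandoc's reader is written.

```python
def parse_attributes(text):
    """Parse a brace attribute block into (identifier, classes, pairs).

    "#name" sets the identifier, ".name" adds a class, and "key=value"
    adds a key-value pair. Tokens of any other shape are ignored.
    """
    identifier = ""
    classes = []
    pairs = {}
    for token in text.strip("{} ").split():
        if token.startswith("#"):
            identifier = token[1:]
        elif token.startswith("."):
            classes.append(token[1:])
        elif "=" in token:
            key, _, value = token.partition("=")
            pairs[key] = value
    return identifier, classes, pairs
```

Under this reading, lang=ar is just another key-value pair, which is why no dedicated language syntax is needed.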

@mb21

But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

@ousia

@ousia I'm sorry, I meant to write the following (must have been tired, it's corrected above now):

The title in Arabic is [عنوان اول]{dir=rtl lang=ar}.

@mb21, sorry for my misunderstanding. I thought there was a new syntax copied from CommonMark.

@ousia

But back to the current hold-up: I think we should have language only as a legacy fallback for the ePUB metadata. Having lang and language as complete synonyms only confuses people. More opinions?

@mb21, I think the right option would be to make both names full synonyms explaining that lang is only kept to avoid breaking backwards compatibility. And advising users that the new field language should be used for new documents and templates.

@jgm jgm closed this in #2369