Proposed features/Language information

From OpenStreetMap Wiki
Language information text tags
Status: Proposed (under way)
Proposed by: sommerluk
Tagging: language:*=code
Applies to:
Definition: Describes the language of the text
Rendered as: Not rendered itself, but might improve rendering of name=*
Drafted on: 2017-07-30

Proposal text

The prefix language:*=<code> can be used to describe the language of the value of another tag. This only makes sense if the other tag has a free-text value. Allowed values for language:*=<code> are standard BCP 47 language tags.

It is not necessary to add language information to most objects and keys in OSM. But in regions where the language/script combination makes a difference in rendering, it can be useful to add language:name=* to state the language of the name key. Using language:name=* is never mandatory.

Example: For the Bulgarian city of Montana use:

name=Монтана

language:name=bg

Rendering engines usually default to Russian Cyrillic letterforms, but this city is in Bulgaria. Note the significant difference between the Russian rendering (above) and the Bulgarian rendering (below):

Montana.svg
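
How a data consumer could read the proposed tag can be sketched in a few lines of Python. This is only an illustration: the function name language_of and the plain-dictionary representation of tags are assumptions made for this example, not part of the proposal.

```python
def language_of(tags, key):
    """Return the declared language of tags[key], or None if none is declared.

    `tags` is a plain dict of OSM tags (an assumption for this sketch).
    """
    if key not in tags:
        # No value present, so there is nothing to describe.
        return None
    # The proposal stores the language of <key> under language:<key>.
    return tags.get(f"language:{key}")


# The Montana example from above.
montana = {"name": "Монтана", "language:name": "bg"}
print(language_of(montana, "name"))
```

Because the tag is optional, a consumer has to handle the None case, for instance by falling back to its current default behaviour.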

Rationale

There are many applications that use the name=* tag in OSM. You will usually use name=* when you intentionally want the name in the default local language. For example, OsmAnd lets you choose between “local names” and a specific language for map rendering. The default style at openstreetmap.org uses name=* exclusively because it always wants to show the default local names, so that names appear on the map as they are written locally at each place in the world. It intentionally does not use tags like name:en, name:ja, name:de…

The content of name=* is plain Unicode. The problem is that this is not enough to render the text correctly. Some glyphs (character shapes) differ between the four variants of the CJK script (Japanese, Traditional Chinese, Simplified Chinese, Korean), but Unicode encodes them at the same code point. This is called “Han unification”. Likewise, some Cyrillic glyphs have four variants (Russian, Bulgarian, Serbian, Macedonian) that are encoded at the same Unicode code point. There are also rare cases in the Latin alphabet: Ŋ has different glyph forms in Sami languages and in African languages. On the web, this problem is easily solved: the HTML code contains a language attribute (lang) that gives the necessary information about the language, so the browser can display everything correctly. In OSM this information is missing.
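
To illustrate the comparison with the web, the following sketch shows how an application could turn OSM tags into HTML that carries the language information. The function name html_name and the choice of a span element are assumptions for this example; only the lang attribute itself is standard HTML.

```python
from html import escape


def html_name(tags):
    """Wrap the name in a <span>, adding lang="..." when language:name is set."""
    name = escape(tags["name"])
    lang = tags.get("language:name")
    if lang is None:
        # Without the proposed tag, the browser has to guess the language.
        return f"<span>{name}</span>"
    return f'<span lang="{escape(lang)}">{name}</span>'


print(html_name({"name": "Монтана", "language:name": "bg"}))
```

With the lang attribute present, a browser can select the Bulgarian glyph variants if the font provides them.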

Deducing this information from the country in which an OSM element is located is not very reliable. Within the same country there may be (many) more languages than one. And within the same region, there may be objects whose name is in a different language than the majority language of the region, for example shops that sell Bulgarian food in Ireland. Deducing the language from the geographic coordinates is therefore error-prone and not an option.

Deducing this information by comparing name=* with the name:en, name:ja, name:de … tags does not help either. Example: The node http://www.openstreetmap.org/node/25248662 (English: Beijing) has name=北京市, name:ja=北京市 and name:zh=北京市. They are identical, so we cannot reliably determine the language of the name value.
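
The ambiguity can be made concrete with a small sketch. The function candidate_languages is hypothetical and simply lists every name:<code> key whose value equals name=*; the name:en entry in the sample data is added for illustration.

```python
def candidate_languages(tags):
    """Return the codes of all name:<code> tags whose value equals name=*."""
    name = tags.get("name")
    return sorted(
        key.split(":", 1)[1]
        for key, value in tags.items()
        if key.startswith("name:") and value == name
    )


beijing = {
    "name": "北京市",
    "name:ja": "北京市",
    "name:zh": "北京市",
    "name:en": "Beijing",  # illustrative value
}
print(candidate_languages(beijing))  # ['ja', 'zh']
```

The comparison yields two equally plausible candidates, so it cannot decide which glyph variant to use.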

This tag helps to respect the cultural heritage of the local writing.

Whether or not multiple values can be used for cases like “Bruxelles - Brussel” might (or might not) be the subject of another proposal…

A possible use case could look like this: A cartographic style uses this information for rendering names. It requests language:name directly in the SQL query. The value is simply passed as-is to Mapnik (Mapnik will likely support language tags starting with Mapnik 3.1). Mapnik uses HarfBuzz internally for text rendering, and HarfBuzz accepts BCP 47 values (an invalid value is silently ignored). BCP 47 is in widespread use, and it makes it possible to distinguish not only between Chinese and Japanese, but also between Traditional Chinese and Simplified Chinese.
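
A consumer that prefers not to pass arbitrary tag values along could apply a rough plausibility check first. The sketch below is an assumption-laden illustration: the function usable_language is hypothetical, and the pattern covers only a simplified subset of BCP 47 (language, optional script, optional region), not the full grammar. It is not part of Mapnik or HarfBuzz.

```python
import re

# Simplified subset of BCP 47: language[-Script][-REGION].
# The full grammar (variants, extensions, private use) is NOT covered.
_SIMPLE_TAG = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?")


def usable_language(value):
    """Return the value if it looks like a simple language tag, else None."""
    if value is None:
        return None
    value = value.strip()
    return value if _SIMPLE_TAG.fullmatch(value) else None


for raw in ("bg", "zh-Hant", "zh-Hans", "not a language"):
    print(raw, "->", usable_language(raw))
```

Returning None for implausible values mirrors the "silently ignored" behaviour described above: the name is still rendered, only without language-specific glyph selection.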

Representation

Not rendered itself, but can be used to make correct language-specific rendering of name=* possible.

This information can also be used by text-to-speech engines to correctly pronounce the default local name of a place.