Talk:Map internationalization


Transliterations of toponyms

As discussed here and on other pages, many places have names in different languages, like "London", "Londres", "Llundain" and "Лондон" for the British capital. In countries that do not use the Latin alphabet, places may or may not have an English version of their name (Moscow, for instance, has name="Москва", name:en="Moscow" and int_name="Moscow"). What is missing in my view is a transliteration (romanization) of the local name(s), possibly in a dedicated tag. For Moscow we would get translit="Moskva".

By the way, do not confuse transliteration with translation and transcription.

Many non-Latin alphabets have an official transliteration into Latin, sometimes depending on the language (e.g. the transliteration of the Cyrillic alphabet for Russian is not the same as for Ukrainian or for Caucasian languages).

Would it be a good idea to add these official transliterations to the OSM data? I have some Java classes which can transliterate Russian place names (as well as Ukrainian, Greek, Hebrew, most Indic, Thai, etc.) into Latin. Whether tile generators use them or not is another issue, but it could be useful for GPS devices, where you could use the transliterated name if you have no local keyboard or cannot read/write the local alphabet.

IoanAp (talk) 19:53, 3 April 2016 (UTC)
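The letter-by-letter recoding described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the Java classes mentioned); the mapping table is a simplified subset of a BGN/PCGN-style romanization covering only the letters of the example:

```python
# Simplified, illustrative subset of a Russian Cyrillic-to-Latin
# romanization table (a real transliterator needs the full alphabet
# and context-dependent rules).
CYR_TO_LAT = {
    "А": "A", "а": "a", "В": "V", "в": "v", "К": "K", "к": "k",
    "Л": "L", "л": "l", "М": "M", "м": "m", "О": "O", "о": "o",
    "С": "S", "с": "s", "Т": "T", "т": "t", "ь": "'",
}

def transliterate(name: str) -> str:
    # Recode character by character; pass through anything unmapped.
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in name)

print(transliterate("Москва"))  # Moskva
```

Note that this is purely mechanical: no pronunciation knowledge is involved, which is exactly why it can run in a renderer or on a GPS device.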

If the transliteration can be created by software, then we should not add it to OSM data. Instead, renderers wanting to make use of them could create these transliterations themselves. --Lyx (talk) 20:39, 3 April 2016 (UTC)
More seriously, most of those transliterations have been imported "as is" from Wikipedia (because Wikipedia needs titles for its articles, and those titles are not in the original script).
However, many of them are in fact completely wrong, invented locally on Wikipedia, notably on Russian Wikipedia and then often borrowed "as is" into other Cyrillic Wikipedias such as Ukrainian, Bulgarian or Belarusian, with only minor modifications such as the Cyrillic letter "i". But those transliterations were made using false readings: ignoring the effective pronunciation of the original name, ignoring diacritics that could have helped create a more accurate transliteration, or ignoring the long-attested translations that exist for many terms such as "Saint", first names, or river names found in composite toponyms. They do not even follow the same conventions everywhere for these name parts, and they ignore the content of the article itself, which still explains the origin of these compound names.
But those Russian Wikipedia authors, self-proclaimed "experts" in "their" language who completely ignore the source language they are transliterating, or read it only superficially, refuse to change an iota (and they revert any attempt to correct these bad names, only because this could temporarily create redirects on Wikipedia, sometimes double redirects from other bad names, and would require editing a few navboxes or templates to avoid those redirects).
So we must be careful about automated imports of "translated" toponyms from Wikipedia (and now from "translations" found in Wikidata, that have been borrowed automatically from article titles in Wikipedia).
For geographic use, we must change those bad names and use the official international standards for the transliteration of toponyms (e.g. BGN/PCGN, used by the United Nations and international organisations), or accurate dictionary sources where those toponyms have long had actual translations (e.g. "Londres" or "Moscou" in French). Some names previously used officially have also changed over time: "Pékin" is still used in French even if officially it should be "Beijing"; the same goes for "Calcutta" in India, still much more used than the new official name, which was in fact defined for English as used in India, not really for French. We can track them with "official_name:lang=" if their usage is still not common (e.g. "official_name:en=Beijing", used also in French, but still "name:fr=Pékin").
In many cases these transliterated toponyms are not in an official language of the local country, or in a working language used by that country (e.g. at the UN, which uses English, French, Russian, Simplified Chinese and Spanish most of the time, sometimes Arabic, German and Portuguese, rarely Japanese). They are then highly questionable (except for the best-known metropolitan cities, or country names in major languages): they are often wrong even for the primary country subdivisions, even in major languages such as Russian!
Frequently, Wikipedia also contains articles for toponyms without first knowing how they are effectively called in the target language, so editors just borrow the original local name "as is", or the name used on English Wikipedia, even if this usage is not attested locally. Those articles are renamed later, but their titles have already made their way into other Wikipedias or into Wikidata, and come to OSM directly without having passed any quality test. Then people on OSM fix them; later the Wikipedia article is renamed (still not to the correct name) and the new name overwrites the correction that was made in OSM (a correct translation, or a correct transliteration conforming to international transliteration standards or attested by UN documents or well-known NGOs acting locally).
For this reason, we should forbid all bots that currently massively and automatically import transliterated names from Wikipedia or Wikidata from overwriting any existing translated name in OSM, unless these are names in an official or national language of the local country, or a recognized regional language of the local region (in that case only, these allowed overwrites by bots should be made in separate changesets, specific to that country or subregion and to that specific language).
Those bots should also never overwrite in OSM any existing country name, or names of large metropolitan cities (over 1 million inhabitants?): if corrections are needed, they will be made manually, in small changesets. — Verdy_p (talk) 23:46, 3 April 2016 (UTC)
@Lyx, OK, I agree: since transliterations can be created automatically, renderers could do them, but how does one contribute to the renderers? Personally I only use mkgmap (and I will propose adding the transliterations as a new feature there). But I do not create tiles myself, I use sites like, but every time there is a place name without an int_name or name:en tag, it is shown in the local script (I can read Cyrillic, Arabic and Georgian, but not Armenian or South Asian scripts; others will have trouble with Cyrillic already).
Actually DOES use automatic transliteration if no name:de is available. --Lyx (talk) 23:48, 4 April 2016 (UTC)
As long as it does not do that from the Latin script, but only to Latin, i.e. romanizations of non-Latin scripts using German standards where they exist or international BGN/PCGN standards otherwise, it should remain globally safe.
But transliteration from Latin to any other script, or even to other Latin alphabets is completely unreliable (see below). — Verdy_p (talk) 04:10, 5 April 2016 (UTC)
Great! I didn't see that (since I mainly use the default server). Does not look bad at all --IoanAp (talk) 18:54, 5 April 2016 (UTC)
@Verdy_p, you are totally right about the translations and the "noise" coming from Wikipedia. Translations can probably not be automated, only assisted. But I was talking about transliterations. They do not reflect the sounds of a language but only recode letter by letter from the local (non-Latin) alphabet to the Latin alphabet. So the currently well-known Greek place Idomeni (Ειδομένη) would give "Eidomeni" as transliteration, even though it is pronounced more like "ithomeni". "Calcutta" in the local (Bengali) alphabet is "কলকাতা", which is transliterated to "Kalkātā", i.e. totally independently of name:en or int_name.
This can easily be automated for some alphabets (Greek, Cyrillic, Georgian, Armenian, most Indic alphabets, and South-East Asian ones including the Korean Hangul, probably also the Ethiopic alphabets), but unfortunately not for Hebrew and Arabic, since those alphabets do not write vowels (if you do not know the place, you cannot read it either), nor for Chinese and Japanese. --IoanAp (talk) 20:55, 4 April 2016 (UTC)
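The pronunciation-independent, letter-by-letter recoding in the Greek example above can be sketched as follows (a hypothetical illustration; the table covers only the letters of this one example, not the full ELOT-style alphabet):

```python
# Each Greek letter maps to a fixed Latin equivalent, regardless of
# how the word is actually spoken. Accented vowels lose their tonos.
GREEK_TO_LAT = {
    "Ε": "E", "ι": "i", "δ": "d", "ο": "o",
    "μ": "m", "έ": "e", "ν": "n", "η": "i",
}

def transliterate_greek(name: str) -> str:
    # Letter-by-letter recoding; unmapped characters pass through.
    return "".join(GREEK_TO_LAT.get(ch, ch) for ch in name)

print(transliterate_greek("Ειδομένη"))  # Eidomeni, not the spoken "ithomeni"
```

The output "Eidomeni" illustrates the point being made: transliteration preserves the spelling, not the sound.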
For Chinese and Japanese names, the situation is simpler, because Japanese has official transcriptions at least in the Kana scripts (which are extremely easy to transliterate to the Latin/Cyrillic/Greek scripts) and very often Latin transcriptions. For Chinese, besides the rare Bopomofo transcriptions, almost everywhere you'll find accurate sources for the official Chinese romanization, which is very widely used (including as an input method for entering sinograms, by adding extra digits for syllabic tones, even if there's a need to select among several proposed variants with distinct semantics, generally sharing the same semi-phonetic radical).
Hebrew and Arabic can be almost fully transliterated when they write vowel diacritics (writing them is possible, even if optional, and there are also a few plain letters that in fact represent vowels as matres lectionis). This last use is systematic, for example, in Yiddish written in the Hebrew script, where there is a clear distinction of consonants and vowels and no vowel diacritics are even needed for a correct reading, so transliteration of Yiddish from the Hebrew script to the Latin script is quite easy and reliable. The same is true for a few other languages written with the Arabic script, e.g. Urdu, because the diacritics needed for the correct reading are required, or matres lectionis are used, or a few additional Arabic letters are added, or because some letter forms considered "variants" of the same letter in the Arabic language are considered distinct letters in Urdu, including for consonants. Persian/Farsi also has its own capabilities when written in the Arabic script, as do other languages written with the Eastern Arabic/Persian variant of that script. So the difficulty for Arabic arises only from the modern use of the script for writing modern/vernacular Arabic, using the Western variant of the Arabic script. For the classic/religious use, dropping those Arabic vowels is very often unacceptable, as it creates confusion about the actual meaning when reading from the various vernacular Arabic languages.
In the end, only Hebrew is really difficult, because even religious texts in Classical Hebrew written with vowel diacritics do not have a very formal way to write them (there are several competing "standards"). When there are doubts in the religious Hebrew texts, the experts compare them to other scripts (notably the Old Greek script, the Classical Aramaic script, the Syriac script, or sometimes even the Arabic script!)
But the really difficult transliterations are those made from the Latin script, due to its very wide spectrum and high variability of usage: diacritics do not always transcribe the same phonetics, and there are conflicting uses of digrams/trigrams across Latin-written languages, or even within those languages themselves, with many exceptions (caused by the huge number of terms borrowed from one of these languages to another). For this reason it is always best to leave the original Latin transcriptions unchanged. All transcriptions from the Latin script are full of errors (e.g. transcription of French to Russian), as they mix transliterations, phonetic transcriptions, simplifications, and partial translations, and there are competing standards between all language pairs (and then borrowing a transcription in one non-Latin script into another language using the same non-Latin script but with different rules causes many more errors). It is one of the main reasons why transliterators from Latin simply don't work (or you have to choose an "ideal" theoretical target language). Instead, transcriptions should be based not on the Latin transcription of the source language, but on its IPA transcription (for that you need to look first in native dictionaries for these Latin-written languages, something that was largely forgotten when many pseudo-transcriptions were made on Russian Wikipedia; they did not even look at *translation* dictionaries to solve some well-known problems, even for common terms like "Saint").
In summary, you cannot simply and reliably transliterate any language from the Latin script; you absolutely need to use transcriptions instead, from the IPA transcription (which may be viewed as a normalized form of the Latin script, with some additional symbols borrowed from Greek). For this reason, instead of inserting "int_name" in OSM (roughly based on English but still not transcriptable), we would do better to insert the native IPA transcriptions (from which other automated transcriptions to other scripts can be inferred reliably). — Verdy_p (talk) 21:54, 4 April 2016 (UTC)
Well, I only wrote about transliteration to the Latin alphabet. But as I said above, this is totally language- (and pronunciation-) independent. For Arabic and Hebrew: of course you can transliterate if the vowels are written. But in OSM, place names in countries using the Arabic alphabet are rarely written with vowels: "اصفهان", so here one could transliterate "'ṡfhān", but this would not help anybody.
By the way, of course you can transliterate from Latin to other alphabets (even though I would never propose this for OSM). If you transliterate "Bordeaux" or "Marseille" to Cyrillic you'd get "Бордеаукс" and "Марсеилле" (probably not useful for many).
Certainly not! Transliteration of Latin to Cyrillic is not useful simply because it is wrong most of the time (including in your two examples): you cannot do it safely without knowing exactly what the letters mean and whether they are mute or not, so you need a good knowledge of the original language; and because there are frequent exceptions in the Latin spelling of these languages (here French), there is little you can do except keep them as is in the Latin script. — Verdy_p (talk) 19:58, 5 April 2016 (UTC)
If you need the pronunciation, you'd better transcribe: "Бордо", "Марсей".
This is no longer a transliteration, but a transcription to the Cyrillic script, actually made from a first transcription to IPA.
  • For that you need to first look the terms up in a French dictionary. Then you adapt the French phonetic transcription in IPA to a phonetic transcription in IPA suitable for Russian, and then transcribe that modified IPA transcription to Cyrillic. So you need a French dictionary lookup.
  • The alternative (which is also the best one and recommended, where possible) is to look in a French-Russian translation dictionary to locate common terms, and then use the Russian grammatical rules for composite names (considering properties such as plural, gender, or grammatical cases such as the genitive when finalizing the grammatical endings). But here also you need a dictionary lookup (though not the same dictionary: in the first case you just look into a native French dictionary; in the second case you look into a French-Russian translation dictionary, or try to locate terms in a native Russian dictionary with the appropriate lemma definition).
In both cases, you cannot use transcription alone. — Verdy_p (talk) 19:58, 5 April 2016 (UTC)
Concerning IPA: this could be added in addition, since pronunciation is totally different from transliteration and much more detailed than a simple transcription, but difficult to automate ...
Anyway, I finally come to the conclusion that automatic transliteration can be done by renderers or mkgmap, in order to avoid cramming the OSM server with redundant data. --IoanAp (talk) 18:54, 5 April 2016 (UTC)
Yes, but only if we store IPA transcriptions for the original Latin-written language. If there is no IPA transcription, don't infer it from Latin (notably in languages like English and French, whose orthographic usage of letters is extremely irregular!). But a single native IPA transcription could replace many pseudo-translations. — Verdy_p (talk) 19:58, 5 April 2016 (UTC)
Verdy_p, as I wrote in my OP, do not confuse transcription and transliteration (the latter is my original point; I did not want to discuss transcription). Transliteration is a recoding from one alphabet to another following some rules, but not the pronunciation. So if I want to transliterate "Marseille" into Cyrillic, I get "Марсеилле"; if I want to transliterate "Тирасполь" into Latin, I get "Tiraspol'". This can be done automatically, and as Lyx wrote earlier today, does it to render non-Latin scripts. could transliterate into Cyrillic (currently they don't). I agree that transliteration of French place names into Cyrillic is useless, since it does not show the pronunciation, and probably most Russian speakers can read the Latin alphabet anyway. But if I want to transcribe, I need to know the pronunciation of the original word, so in this case "Marseille" is transcribed into Cyrillic as "Марсей", whereas "Тирасполь" might become "Tiraspol". Again, I only wanted to discuss transliteration and not transcription, which, as we both agree, is difficult.--IoanAp (talk) 20:50, 5 April 2016 (UTC)
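The "Тирасполь" example above makes the distinction concrete, and can be sketched as follows. This is a hypothetical illustration with a table covering only the letters of this one name: transliteration recodes every letter (the soft sign becomes an apostrophe), while a naive transcription for this example simply drops the unpronounced soft sign.

```python
# Illustrative subset of a Russian Cyrillic-to-Latin table for "Тирасполь".
RU_TO_LAT = {
    "Т": "T", "и": "i", "р": "r", "а": "a",
    "с": "s", "п": "p", "о": "o", "л": "l", "ь": "'",
}

def transliterate_ru(name: str) -> str:
    # Strict letter-by-letter recoding, soft sign included.
    return "".join(RU_TO_LAT.get(ch, ch) for ch in name)

def transcribe_ru(name: str) -> str:
    # Naive transcription for this example only: drop the soft sign.
    return "".join(RU_TO_LAT.get(ch, ch) for ch in name if ch != "ь")

print(transliterate_ru("Тирасполь"))  # Tiraspol'
print(transcribe_ru("Тирасполь"))     # Tiraspol
```

As the thread notes, the first operation is mechanical and automatable; the second already embeds a pronunciation judgement (that "ь" is silent here), which is where dictionaries and language knowledge become necessary.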