User talk:SafwatHalaby/scripts/nameCopy

From OpenStreetMap Wiki

Source code availability

Is the source code of the bot available for review?

I'll publish the code on GitHub soon. Here's the relevant snippet for now. -- SafwatHalaby (talk) 15:30, 11 October 2017 (UTC)
function sameValue(str1, str2)
{
	// treat nbsp and sp as the same in comparisons
	// (the g flag replaces every nbsp, not only the first one)
	return (str1.replace(/\u00A0/g, ' ') === str2.replace(/\u00A0/g, ' '));
}

function heEnOnly(str)
{
	return (str.search(/[^\u00A0\u0020-\u007E\u0590-\u05FF]/) === -1);
}

function arEnOnly(str)
{
	return (str.search(/[^\u00A0\u0020-\u007E\u0600-\u06FF]/) === -1);
}

function enOnly(str)
{
	return (str.search(/[^\u00A0\u0020-\u007E]/) === -1);
}

function hasInvalidChars(str)
{
	return (str.search(/[\u0000-\u001F\u007F]/) !== -1);
}

var languages = [
{name: "English", tag: "name:en", check: enOnly, checkStrict: enOnly, stats: {toName: 0, fromName: 0}},
{name: "Hebrew", tag: "name:he", check: heEnOnly, checkStrict: heOnly, stats: {toName: 0, fromName: 0}},
{name: "Arabic", tag: "name:ar", check: arEnOnly, checkStrict: arOnly, stats: {toName: 0, fromName: 0}}
];

// checkStrict is unused; heOnly and arOnly are not defined in this snippet.

...snipped code...

else // name exists, consider name_to_*
{
	if (p.tags["noname"] !== undefined) remove(p, "noname");

	if (hasInvalidChars(name))
	{
		printErr(p, 'name has non printable characters.');
		return;
	}
	for (var i = 0; i < languages.length; ++i)
	{
		var lang = languages[i];
		var nameLang = p.tags[lang.tag];
		if (lang.check(name)) // if "name" is determined to be in this language
		{
			// if "name:lang" does not exist, add it
			if (nameLang === undefined)
			{
				p.tags[lang.tag] = name;
				modifiedCnt++;
				lang.stats.fromName++;
			}
			// If both "name" and "name:lang" exist but they're not the same
			else if (!sameValue(nameLang, name))
			{
				printErr(p, "name, " + lang.tag + " mismatch.");
				autoFixAttempts++;
				if (traceAndFixMismatch(p, name, nameLang, lang)) //disabled, always returns false. No autofixes.
					autoFixSuccess++;
				else
					autoFixFail++;
			}
			return;
		}
	}
	printErr(p, "name is not ar,he,en");
}
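
The deduction order implied by the snippet above can be sketched in isolation. This is only a minimal illustration, not the bot's actual code: `checks` and `deduceLanguageTag` are hypothetical names, and the regular expressions are copied from the snippet. The first language in the list whose check passes wins, so a name made only of printable ASCII is always classified as English.

```javascript
// Hypothetical table mirroring the order of the "languages" array above.
// Each regex matches a character that is NOT allowed for that language.
var checks = [
	{tag: "name:en", re: /[^\u00A0\u0020-\u007E]/},
	{tag: "name:he", re: /[^\u00A0\u0020-\u007E\u0590-\u05FF]/},
	{tag: "name:ar", re: /[^\u00A0\u0020-\u007E\u0600-\u06FF]/}
];

// Hypothetical helper: returns the tag of the first language whose
// character check accepts the name, or null if none does.
function deduceLanguageTag(name)
{
	for (var i = 0; i < checks.length; ++i)
	{
		// search() returns -1 when no disallowed character is found
		if (name.search(checks[i].re) === -1) return checks[i].tag;
	}
	return null;
}

var hebrew = "\u05D9\u05E8\u05D5\u05E9\u05DC\u05D9\u05DD"; // Jerusalem in Hebrew
var arabic = "\u0627\u0644\u0642\u062F\u0633";             // Jerusalem in Arabic

console.log(deduceLanguageTag("Jerusalem"));           // name:en
console.log(deduceLanguageTag(hebrew));                // name:he
console.log(deduceLanguageTag(hebrew + " Jerusalem")); // name:he (ASCII is allowed alongside Hebrew)
console.log(deduceLanguageTag(hebrew + " " + arabic)); // null (mixed scripts match no language)
```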

Language deduction criteria

The algorithm description says "If name exists, deduce language", but the language deduction criteria are not clear.

Based on the code above, I think the bot should avoid setting the name:en tag, because names in many other Latin-script languages can easily be misidentified as English. To illustrate the point, the spellings of Jerusalem in many languages would be identified as English. Zstadler (talk) 20:05, 11 October 2017 (UTC)
I agree. "One more bug" indeed :P.
Although this could happen in principle, it would require someone to enter a "name" tag that uses English characters but is not English, which is very rare in practice for Israel. So I think the damage so far is negligible, but I will update the script nevertheless. I'll make it log all name to name:en copies, so that I can review them manually before uploading and catch those cases. -- SafwatHalaby (talk) 20:22, 11 October 2017 (UTC)
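
The false positive discussed in this thread, and one possible shape for the manual-review log, can be sketched as follows. Everything beyond `enOnly` (which is copied from the snippet) is an assumption for illustration: `pendingEnCopies` and `queueEnCopy` are hypothetical names, and the actual script may log these copies differently.

```javascript
function enOnly(str)
{
	return (str.search(/[^\u00A0\u0020-\u007E]/) === -1);
}

// Names in other Latin-script languages pass the English check
// as long as they contain only printable ASCII characters.
console.log(enOnly("Gerusalemme")); // true (Italian)
console.log(enOnly("Jeruzalem"));   // true (Dutch)
console.log(enOnly("J\u00E9rusalem")); // false (e with acute accent is outside the ASCII range)

// Hypothetical review queue: instead of writing name:en immediately,
// record the proposed copy so a human can approve it before upload.
var pendingEnCopies = [];

function queueEnCopy(id, name)
{
	if (!enOnly(name)) return false; // not a candidate for name:en
	pendingEnCopies.push({id: id, name: name});
	return true;
}

queueEnCopy(1, "Gerusalemme");     // queued, although it is not English
queueEnCopy(2, "J\u00E9rusalem");  // rejected by the character check
console.log(pendingEnCopies.length); // 1
```

A queue like this does not make the detection any smarter; it only moves the English decision to a reviewer, which matches the manual check proposed above.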