User:Krauss/Wikidata-std01-proposal

From OpenStreetMap Wiki

Towards a standard hash for wikidata preservation and quality control.

There is a rare but problematic case in which an OSM Map Feature changes its wikidata value. The only way to ensure traceability (tracking back to the original Wikidata item) is to check and preserve some spatial information.

Wikidata context conventions

The Wikidata information does not stand alone: there is also an obligation to supply a type tag and, according to that type tag, the Wikidata relationships.

Another obligation: every Wikidata ID used as an OSM tag of an OSM element must be linked, directly or indirectly (by instance or subclass relationships), to a country (or, in rare cases, to "international").

Basic spatial information

This is the information found in the XML OSM planet file, or other "easy to get" information (e.g. obtained by a fast Overpass operation) such as a bounding box or centroid. Suppose a "canonical" XML representation:

  <bounds minlat="47.04774" minlon="9.471078" maxlat="47.27128" maxlon="9.636217"/>
  <centroid lat="47.8480348" lon="16.191717"/>

The "basic information" about any kind of element (node/way/relation) is a centroid that lies inside the Map Feature, i.e. the result of the PostGIS function ST_PointOnSurface(), which is part of the OGC standard SFA (Simple Feature Access).

Minimal spatial information

Minimal information also means "minimal spatial precision", so we can use round(number,cut) to cut extra decimal places. "Minimal" also means "minimal representation", so we can map the XML representation to a simple JSON representation using arrays:

  {"bounds":[47.048,9.471,47.271,9.636],"centroid":[47.8480,16.1917]}

where we used the following conventions:

  • BBOX bounds as an array of 4 coordinates with three decimal places, in the "canonical sequence" (minlat,minlon,maxlat,maxlon).
  • centroid coordinate as an array in the "canonical lat-long sequence", with four decimal places.
For a node
There is only one coordinate, the lat and lon XML attributes, so use it as the centroid.
For a way
It can be characterized by its BBOX or by its length, as a complement to its centroid. Both length and BBOX can be sensitive to small edits when the spatial precision is not adequate, so there is no "optimal choice".
For a relation
It can be characterized by its BBOX or by its area (or length, if it is a set of ways), as a complement to its centroid. Both area and BBOX can be sensitive to small edits when the spatial precision is not adequate, so there is no "optimal choice".
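
The mapping from the "canonical" XML to the minimal JSON can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name minimal_json and the wrapping <root> element are hypothetical, and fixed-point string formatting is used instead of round() so that trailing zeros (as in 47.8480) are kept.

```python
import xml.etree.ElementTree as ET

def minimal_json(xml_fragment: str) -> str:
    """Map a <bounds/> + <centroid/> fragment to the compact JSON form.

    Assumption: the fragment holds exactly one <bounds> and one <centroid>
    element; it is wrapped in a dummy <root> so it parses as one document.
    """
    root = ET.fromstring(f"<root>{xml_fragment}</root>")
    b = root.find("bounds")
    c = root.find("centroid")
    # BBOX: canonical sequence (minlat, minlon, maxlat, maxlon), 3 decimals.
    bounds = [f"{float(b.get(k)):.3f}"
              for k in ("minlat", "minlon", "maxlat", "maxlon")]
    # Centroid: canonical lat-long sequence, 4 decimals.
    centroid = [f"{float(c.get(k)):.4f}" for k in ("lat", "lon")]
    return ('{"bounds":[' + ",".join(bounds) +
            '],"centroid":[' + ",".join(centroid) + "]}")

sample = (
    '<bounds minlat="47.04774" minlon="9.471078" '
    'maxlat="47.27128" maxlon="9.636217"/>'
    '<centroid lat="47.8480348" lon="16.191717"/>'
)
print(minimal_json(sample))
```

Building the string by hand, rather than serializing floats, keeps the output byte-for-byte stable, which matters if the string is later fed to a hash function.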

Conclusion and convention

The simplest choice is to use the centroid, taking (for ways and relations) the ST_PointOnSurface() result. As all nodes are in the XML, PostGIS or Overpass is needed only for ways and relations, which greatly reduces processing time. And, being so simple, the value to hash can be just 2 space-separated numbers,
  47.8480 16.1917
or a bigint representing these 2 numbers, any truncated value from lat*lon,
e.g. 77474046160 = 478480::bigint*161917::bigint.
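
Both forms can be sketched in Python as below. The function names centroid_string and centroid_bigint are hypothetical, and scaling by 10^4 with round() is an assumption chosen to match the four-decimal convention; the text itself only says "any truncated value". Note also that a plain product is not reversible, and different coordinate pairs can give the same value.

```python
def centroid_string(lat: float, lon: float, places: int = 4) -> str:
    """Two space-separated numbers with a fixed number of decimal places."""
    return f"{lat:.{places}f} {lon:.{places}f}"

def centroid_bigint(lat: float, lon: float, places: int = 4) -> int:
    """Integer key: (lat scaled to an integer) * (lon scaled to an integer).

    Assumption: each coordinate is scaled by 10**places and rounded first.
    """
    scale = 10 ** places
    return round(lat * scale) * round(lon * scale)

print(centroid_string(47.8480348, 16.191717))   # 47.8480 16.1917
print(centroid_bigint(47.8480348, 16.191717))   # 77474046160
```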

The longitude precision can also be reduced, e.g. for elements in countries whose centroid lies at high latitude (>60°).
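
One way to see why high latitudes allow this: the ground distance covered by one step of longitude shrinks with the cosine of the latitude. The sketch below assumes a spherical Earth with about 111.32 km per degree of longitude at the equator; the function name lon_step_metres and that constant are assumptions for illustration, not part of the proposal.

```python
import math

# Assumption: spherical Earth, ~111.32 km per degree of longitude at the equator.
METRES_PER_DEGREE_AT_EQUATOR = 111_320.0

def lon_step_metres(lat_deg: float, places: int) -> float:
    """Approximate ground length of one unit in the last kept decimal
    place of longitude, at the given latitude."""
    step_deg = 10.0 ** (-places)
    return METRES_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(lat_deg)) * step_deg

# At the equator, 4 decimal places of longitude resolve about 11 m;
# at 60 degrees latitude the same 4 places resolve about half of that.
print(lon_step_metres(0.0, 4))
print(lon_step_metres(60.0, 4))
```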

Optimal hash function

The most important requirement is that it be a standard, popular hash function, so there is only a small set of options. The selection criteria are "uniqueness" and CPU speed in PostgreSQL processing. There are some benchmarks:

So CRC32 is perhaps good enough. Example: the crc32 of {"bounds":[47.048,9.471,47.271,9.636],"centroid":[47.8480,16.1917]} is 20e9b05d, and it can be represented as a bigint datatype in PostgreSQL. Other hashes:

  • MD5 truncated to 8 hexadecimal digits: fd18cf67.
  • SHA1 truncated to 8 hexadecimal digits: ec668989.
  • Murmur...
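
The candidate hashes can be compared with a short Python sketch using only the standard library (zlib and hashlib). The function name short_hashes is hypothetical. Digests depend on the exact bytes hashed (whitespace, encoding, trailing newline), so this sketch does not claim to reproduce the values quoted above unless the same serialization is used.

```python
import hashlib
import zlib

def short_hashes(text: str) -> dict:
    """Return CRC32 and 8-hex-digit truncations of MD5 and SHA1.

    Assumption: the text is hashed as UTF-8 bytes with no trailing newline.
    """
    data = text.encode("utf-8")
    crc = zlib.crc32(data)  # unsigned 32-bit integer in Python 3
    return {
        "crc32": f"{crc:08x}",   # 8 hexadecimal digits
        "crc32_int": crc,        # always fits in a 64-bit bigint
        "md5_8": hashlib.md5(data).hexdigest()[:8],
        "sha1_8": hashlib.sha1(data).hexdigest()[:8],
    }

minimal = '{"bounds":[47.048,9.471,47.271,9.636],"centroid":[47.8480,16.1917]}'
for name, value in short_hashes(minimal).items():
    print(name, value)
```

All three truncated digests carry 32 bits, so their collision behaviour is comparable; the practical difference between them is mostly computation speed.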