User:Danysan/Sandbox/Opinionated Planet.osm

This article is a stub. You can help OpenStreetMap by expanding it.

This page aims to be a brainstorming space for an opinionated distribution of Planet.osm.

Why

The openness, freedom and community focus of OSM are its strengths, but they can make life harder for data consumers:

  • Inexperienced (and sometimes experienced) mappers often introduce errors into the map unintentionally
  • It's easy to sneak in vandalism, and while other mappers usually find and fix such edits quickly, the exceptions can be very problematic
  • Deprecated features are usually not immediately mass-migrated to their suggested replacement, forcing consumers to check multiple tags to find the same data
  • OSM's loose schema can make life harder for consumers, forcing them to check multiple undocumented and often inconsistent tags
  • Good practice rules are suggested but not always enforced, which can lead to inconsistent data

It would be useful to simplify data consumers' lives by making available a distribution of the data that has been checked, cleaned, schema-normalised (and possibly enhanced) through opinionated filters and transformations.

Who

This proposal was born from this OSM community thread and takes inspiration from the Planet file of Meta's Daylight Map Distribution. Given that having a clean and safe dataset derived from OSM is in the best interest not only of Meta but of the whole OSM community and all OSM data consumers, this proposal aims to explore the feasibility, opportunities and obstacles of an in-house opinionated distribution of OSM data, where all stakeholders of such a project could join forces.

What

Brainstorming of possible operations to perform on the data:

Wrong element removal

Element editing to fix up tagging errors

  • remove tags with values that are clearly impossible (e.g. if the maximum legal speed in a country is 120 km/h, maxspeed=300 on a highway=residential is most likely a typo with an extra 0); see the sketch after this list
  • remove broken links in website=*, wikipedia=*, wikidata=* and wikimedia_commons=*
  • remove wikidata=* links that are clearly wrong because they point to a person (the mapper likely used wikidata=* instead of subject:wikidata=* or something similar), a tree species, …
  • fix coastlines to prevent the “flooding” effect when they get broken
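
As an illustration of the first item, here is a minimal sketch of such a rule using pyosmium (the Python bindings of the Osmium library mentioned under How below). The 150 km/h threshold, file names and class name are placeholders; a real implementation would need per-country thresholds and unit handling (e.g. mph values).

    import osmium

    MAX_PLAUSIBLE_KMH = 150  # placeholder; real rules would vary per country and road class

    class ImplausibleMaxspeedFixer(osmium.SimpleHandler):
        """Copies ways to the output, dropping maxspeed values above a plausibility threshold."""
        def __init__(self, writer):
            super().__init__()
            self.writer = writer

        def way(self, w):
            tags = {t.k: t.v for t in w.tags}
            ms = tags.get('maxspeed', '')
            if ms.isdigit() and int(ms) > MAX_PLAUSIBLE_KMH:
                del tags['maxspeed']  # e.g. "300" on a highway=residential: likely an extra 0
                self.writer.add_way(w.replace(tags=tags))
            else:
                self.writer.add_way(w)

    writer = osmium.SimpleWriter('cleaned.osm.pbf')
    ImplausibleMaxspeedFixer(writer).apply_file('planet.osm.pbf')
    # nodes and relations would need analogous pass-through handlers to produce a complete file
    writer.close()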

Schema normalization

Data enhancement

  • restore elements removed by changesets that are highly likely to be vandalism
    • an algorithm would be needed to decide how long a changeset should be quarantined before it is applied
    • this is a very complex task: it would mean selectively reverting changesets; tools like osmium apply-changes would help, but it would still be complex and computationally expensive
  • integration of data from Wikidata into OSM's schema for elements where wikidata=* is available (Wikidata entities are CC0-licensed, compatible with the ODbL); see the sketch after this list
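
For the last item, a minimal sketch of pulling Wikidata data for a single element, assuming the per-entity JSON export (Special:EntityData) is used; the bulk access options are discussed under How below. The User-Agent string and the choice of which statements to merge into OSM tags are placeholders.

    import requests

    def fetch_wikidata_entity(qid: str) -> dict:
        # Special:EntityData returns the full entity (labels, claims, sitelinks) as CC0 JSON
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        resp = requests.get(url, headers={"User-Agent": "opinionated-planet-sketch/0.1"})
        resp.raise_for_status()
        return resp.json()["entities"][qid]

    # Example: fetch the entity referenced by an element's wikidata=* tag
    entity = fetch_wikidata_entity("Q64")  # Q64 = Berlin
    print(entity["labels"]["en"]["value"])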

How

Most basic rule-based checks could be executed with libraries like Osmium.
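
For example, a read-only pyosmium check that flags malformed wikipedia=* tags might look like this; the regular expression is a deliberate simplification of the documented "language:Article title" format, and the file name is a placeholder:

    import re
    import osmium

    # simplified version of the documented "language:Article title" format
    WIKIPEDIA_RE = re.compile(r"^[a-z]{2,3}(-[a-z]+)?:.+")

    class WikipediaTagCheck(osmium.SimpleHandler):
        """Reports nodes whose wikipedia=* tag does not follow the expected format."""
        def node(self, n):
            wp = n.tags.get('wikipedia')
            if wp and not WIKIPEDIA_RE.match(wp):
                print(f"node/{n.id}: malformed wikipedia tag: {wp!r}")

    WikipediaTagCheck().apply_file('extract.osm.pbf')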

For more efficient handling of the computing load, a parallel MapReduce approach could be more appropriate, for example with libraries for Apache Spark such as Atlas (https://github.com/osmlab/atlas).
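
As a sketch of what this could look like with Spark's Python API, assuming the planet has first been converted to Parquet with one row per way and a map<string,string> tags column (e.g. with a tool like osm-parquetizer), the implausible-maxspeed rule above could be expressed as a parallel job; the schema, threshold and file names are all hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("opinionated-planet-sketch").getOrCreate()

    # hypothetical input: one row per way, with a map<string,string> "tags" column
    ways = spark.read.parquet("planet-ways.parquet")

    cleaned = ways.withColumn(
        "tags",
        F.when(
            F.col("tags").getItem("maxspeed").cast("int") > 150,  # placeholder threshold
            F.map_filter("tags", lambda k, v: k != "maxspeed"),   # drop the implausible value
        ).otherwise(F.col("tags")),
    )
    cleaned.write.mode("overwrite").parquet("planet-ways-cleaned.parquet")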

For some of the above tasks rule-based processing will not be enough and AI-powered tools will be needed (machine-learning classification, NLP models, ...). The Daylight Map Distribution has publicly described some details of its ML-powered vandalism prevention pipeline (see its wiki page for details); given Meta's involvement in the OSMF, it would be great to see its participation in this project.

For tasks that require intersecting OSM data with Wikidata or other resources, other libraries will be needed (hypothesis: wikibrain). In general, Wikidata data can be accessed in one of three ways:

  • Download a dump of the DB and do anything you want with it [1]
    • high client cost (requires a lot of disk space, more than an OSM planet dump), high availability, high bandwidth (once downloaded, access is extremely fast)
  • Wikidata Query Service (WDQS), Wikidata's own SPARQL endpoint [2]
    • very powerful query language, low client cost (no need to download the full DB), high server cost, low availability, low bandwidth (unfeasible for very large quantities of data; pagination would be needed); see the sketch after this list
  • Linked Data Fragments (LDF) endpoint [3] [4]
    • somewhere in between the other two options: low client cost (no need to download the full DB dump), extremely basic query language, high bandwidth
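
As a sketch of the second option, a WDQS query can be issued with a plain HTTP request; the query below (fetching the official website, P856, of Berlin, Q64) is only an example of the kind of lookup that data enhancement would need, and the User-Agent string is a placeholder:

    import requests

    WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

    # example query: official website (P856) of Berlin (Q64)
    query = """
    SELECT ?website WHERE {
      wd:Q64 wdt:P856 ?website .
    }
    """
    resp = requests.get(
        WDQS_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "opinionated-planet-sketch/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["website"]["value"])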

A proof-of-concept implementation can be found at https://github.com/Danysan1/opinionated-planet.

When

TBD

Where

OSM infrastructure, details TBD

Notes