Rarely verified and third-party data staleness in OpenStreetMap

From OpenStreetMap Wiki
Proposal status: Draft (under way)
Proposed by: European Water
Draft started: 2020-04-06

Introduction:

Static data describing objects in an evolving world becomes stale over time. By categorizing the types of data and the tag keys contained in the OpenStreetMap database, we may be able to infer which data will be refreshed or maintained through “natural” user cycles, and which data will likely become outdated and over what timeframe.

Goal:

Efficiently extract, via Overpass, the last verified date or the creation date for particular keys, e.g. amenity=drinking_water, drinking_water=yes, drinking_water:refill=yes.
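
As a rough illustration of this goal, the Python sketch below builds an Overpass QL query for one of the example tags. It relies on `out meta`, which returns each object's version and last-edit timestamp. That timestamp applies to the whole object rather than to an individual key, which is the gap this proposal tries to address. The function name `build_query` and the bounding box are hypothetical and chosen only for illustration.

```python
def build_query(key, value, bbox):
    """Return an Overpass QL query for objects tagged key=value inside bbox.

    bbox is (south, west, north, east) in decimal degrees, the order
    Overpass QL expects for a bounding-box filter.
    """
    south, west, north, east = bbox
    return (
        "[out:json][timeout:25];"
        # nwr matches nodes, ways and relations carrying the tag.
        f'nwr["{key}"="{value}"]({south},{west},{north},{east});'
        # "out meta" adds version, changeset and last-edit timestamp per object.
        "out meta;"
    )


# Hypothetical bounding box, used only as an example.
query = build_query("amenity", "drinking_water", (48.80, 2.20, 48.90, 2.45))
print(query)
```

The resulting string can be sent to any Overpass API endpoint. The per-object timestamp it returns is only an upper bound on how recently a given key was touched.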

Proposal:

For keys whose data will likely need to be verified outside a natural user cycle, the proposal is to create per-key metadata containing the more recent of the key's creation date and the key's verified date.
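
The rule stated above can be sketched in a few lines of Python. This is only an illustration of the "most recent of the two dates" logic. The function name `freshness_date` is hypothetical, and how the two dates would actually be stored in OpenStreetMap is not defined by this draft.

```python
from datetime import date
from typing import Optional


def freshness_date(created: Optional[date], verified: Optional[date]) -> Optional[date]:
    """Return the more recent of a key's creation date and verified date.

    Either date may be unknown (None). If both are unknown, None is returned.
    """
    known = [d for d in (created, verified) if d is not None]
    return max(known) if known else None


# A key created in 2015 and re-verified in 2020 counts as fresh as of 2020.
example = freshness_date(date(2015, 3, 1), date(2020, 4, 6))
# A key that was never verified falls back to its creation date.
never_verified = freshness_date(date(2015, 3, 1), None)
```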

EWP Instructions Wikimedia Signup


Obvious conclusion

Data which the community determines can neither be observed nor maintained should be systematically excluded from the OpenStreetMap database.

Tags and tag proposals which already address this issue:

Key “survey:date” (https://taginfo.openstreetmap.org/keys/?key=survey%3Adate): with over 650,000 occurrences, this tag looks promising at first glance. In reality, most occurrences date from over 10 years ago. I also found an abandoned proposal with a similar purpose, key “last_checked=date” (https://wiki.openstreetmap.org/wiki/Proposed_features/last_checked).

Data classification:

While most data in OpenStreetMap is theoretically verifiable (following one of the OSM guiding principles: https://wiki.openstreetmap.org/wiki/Verifiability), it seems useful to classify verifiable data by how likely it is to be verified through a “natural” user cycle. I also include a third-party data source category, for data which can’t be directly verified but has sufficient value to be included nevertheless. Slicing the same data another way, I created three categories of data mutability.

Data examples for verifiability categories

Frequently verified - Directly verifiable
A building, a city name, a house, a mountain, a road name, a tree, etc.
Intermittently verified - Directly verifiable
A restaurant, shop or business occupying a building; a building’s color; a fountain.
Rarely verified - Directly verifiable
Water quality at a beach, a restaurant’s menu items, an artist’s name, year built.
Third-party data source – Not directly verifiable
City population, descriptive data about a population, the operator of a water fountain.



Data examples for mutability categories

Not mutable – Will not change (immutable)
An artist, a building material, a mountain.
A little mutable – Liable to change, but very infrequently
The name of a road, the existence of a town church.
Highly mutable – Liable to change somewhat frequently (at varying frequencies)
A restaurant, a restaurant’s opening hours, whether a water fountain is operational.


OpenStreetMap Key/Tag distribution analysis


The keys present on 57 or fewer objects in the OpenStreetMap database represent 80% of the 76,177 keys (60,952/76,177).

74,938 of the 76,177 keys together account for just 1% of all key/value tag pairs in the OpenStreetMap database.

42 of the 76,177 keys account for 80% of all key/value tag pairs (1.7Bn/2.1Bn) in the OpenStreetMap database.
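
Figures of this kind can be derived from a table of per-key usage counts, such as the one taginfo publishes. The Python sketch below shows one way to compute both kinds of statistic. The function names and the toy counts are hypothetical and do not reproduce the real OpenStreetMap numbers. Both functions assume a non-empty table.

```python
def share_of_rare_keys(key_counts, max_objects):
    """Fraction of keys that appear on at most max_objects objects."""
    rare = sum(1 for n in key_counts.values() if n <= max_objects)
    return rare / len(key_counts)


def keys_needed_for_share(key_counts, target_share):
    """Smallest number of keys, most used first, covering target_share of all tags."""
    total = sum(key_counts.values())
    running = 0
    for rank, n in enumerate(sorted(key_counts.values(), reverse=True), start=1):
        running += n
        if running >= target_share * total:
            return rank
    return len(key_counts)


# Toy data for illustration only: key -> number of tagged objects.
toy_counts = {
    "building": 900,
    "highway": 60,
    "name": 30,
    "rare_a": 6,
    "rare_b": 3,
    "rare_c": 1,
}

rare_share = share_of_rare_keys(toy_counts, max_objects=10)
top_keys_80 = keys_needed_for_share(toy_counts, 0.80)
top_keys_95 = keys_needed_for_share(toy_counts, 0.95)
```

In the toy table, half of the keys are rare while a single key covers more than 80% of all tags, which mirrors the concentration described above on a much smaller scale.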


Chart 1: Analysis of how many keys are used X times on objects.

This first illustration is a modified histogram with a bucket size of 1, a line instead of bars, and a logarithmic x-axis.



Chart 2: Analysis of key tag distribution


Do these concentrated distributions of data imply anything about data quality?