Mechanical Edits/wikidata

From OpenStreetMap Wiki

This page was set up to comply with the Mechanical Edit Policy. The user account that these edits will be made under has not yet been determined. For now I have simply used the name "wikidata" as the most logical match. When a user account has been determined this wikipage should be moved to Mechanical Edits/Username.

Proposal

Match Wikidata items and OSM objects automatically and add the Wikidata IDs (Q values) to the matching OSM objects according to Key:wikidata.

Rationale

Wed Aug 27: Approached by Edward Betts about an idea [1] to add Wikidata IDs to matching OpenStreetMap objects.

There have been several discussions (User:Pigsonthewing/Wikipedia, [2] [3]) prior to this proposal about the benefits of having a link to a Wikidata item on an OSM object. When this proposal was made, one of the arguments raised was that the matching list could be served through some sort of API, so that no edit would need to be made in OpenStreetMap. It should however be noted that Wikidata tags are already being added to OSM (this will not change) and that the matching algorithm takes 3 days to run. If the data were added to OSM, this time would be significantly reduced the next time the matching algorithm is run, because OSM objects that already have a wikidata tag could be filtered out. Furthermore, having this data in OSM makes it slightly easier for our data consumers and increases the visibility of this rich data source.

Benefits

1. Benefiting from non-geographic data held in Wikidata
Wikidata has data on each of these entities which either isn't in OSM (who is the mayor of this town, or the vicar of this church?) or which acts as a sanity check for what is in OSM (we can generate lists where the two disagree, for humans to check and fix). Wikidata also has multilingual labels for many objects, which OSM renderers can fetch via the Wikidata link.

2. Grow the community
Wikidata has a different community to OSM. By working together we stand to benefit from the expertise of non-traditional OSM contributors.

3. Saves time
We are already adding Wikidata tags to OpenStreetMap. In the 3 months since this was proposed, 4000 wikidata tags have been added manually. This will continue regardless, so the mechanical edit should be seen as a way to save time and speed up the process. It may even have a lower error rate than manual tagging (unconfirmed).

Concerns

1. OpenStreetMap data is fluid
The matches should be held in an API rather than added to OSM, because OSM data is fluid. For example, ways get split as more detail is added to the map. If this happens, the wikidata tag would end up on two objects.

RESPONSE: This is true of any tag, not just wikidata. The user will need to be aware of what the wikidata tag means in order to maintain it. How big an issue is this and how does it compare to existing practices? --RobJN (talk) 22:29, 25 November 2014 (UTC)
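One rough way to gauge how often a split leaves the same Q value on several objects is to group tagged objects by their wikidata value. The sketch below assumes the OSM data has already been loaded into a plain dictionary; the function name and the data shape are illustrative and not part of the proposal. A duplicated Q value is not necessarily an error, so the output is only a list for humans to review.

```python
from collections import defaultdict


def duplicated_qids(osm_objects):
    """Return Q values that appear on more than one OSM object.

    osm_objects: dict mapping an OSM object ID (e.g. "way/123") to its tags.
    This input shape is an assumption made for the sketch.
    """
    by_qid = defaultdict(list)
    for osm_id, tags in osm_objects.items():
        qid = tags.get("wikidata")
        if qid:
            by_qid[qid].append(osm_id)
    # Keep only Q values carried by two or more objects.
    return {qid: sorted(ids) for qid, ids in by_qid.items() if len(ids) > 1}
```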

2. Updates
How do we do updates and check that what has already been uploaded is still correct?

RESPONSE: I assume you simply re-run the process. The matching algorithm highlights mis-matches so also forms a QA tool.--RobJN (talk) 22:29, 25 November 2014 (UTC)
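A minimal sketch of how a re-run could double as a QA check: compare the wikidata tags already present in OSM with the matches proposed by the latest run and list the disagreements. The function name and input shapes are assumptions for illustration; the actual matcher may report mismatches differently.

```python
def find_mismatches(osm_objects, fresh_matches):
    """List objects whose existing wikidata tag disagrees with a fresh run.

    osm_objects: dict mapping an OSM object ID to its current tags.
    fresh_matches: dict mapping an OSM object ID to the Q value proposed
        by the latest run of the matcher.
    Both shapes are assumptions made for this sketch.
    """
    mismatches = []
    for osm_id, proposed in fresh_matches.items():
        tagged = osm_objects.get(osm_id, {}).get("wikidata")
        # Untagged objects are new matches, not mismatches.
        if tagged is not None and tagged != proposed:
            mismatches.append((osm_id, tagged, proposed))
    return mismatches
```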

3. Multiple matches
Sometimes in OSM you may get, for example, multiple buildings tagged as amenity=hospital with the same name tag. How to process those?

RESPONSE: One-to-one matching only. If code spots two or more nearby items with the correct tags and matching names it skips them. Also skip nearby objects with same name but different tags (e.g. bridge with monument mapped as two objects, same name). --Edward Betts
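The skip rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: candidates are assumed to be nearby objects already selected by a proximity test, names are compared by exact equality for brevity, and the function name is hypothetical.

```python
def pick_unique_match(candidates, required_tag, wanted_name):
    """Return the single acceptable candidate, or None if ambiguous.

    candidates: nearby OSM objects, each a dict with a "tags" dict.
    required_tag: (key, value) pair the object must carry.
    wanted_name: name expected from the Wikidata side.
    """
    key, value = required_tag
    same_name = [c for c in candidates if c["tags"].get("name") == wanted_name]
    correct = [c for c in same_name if c["tags"].get(key) == value]
    # Skip when two or more nearby objects fit, and also when another
    # nearby object shares the name but carries different tags.
    if len(correct) != 1 or len(same_name) != 1:
        return None
    return correct[0]
```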

4. Contributors move to Wikidata
This encourages people to contribute to wikidata rather than OSM.

RESPONSE: It is likely that both projects will benefit. Traffic will not be one way. See also [4] --RobJN (talk) 22:29, 25 November 2014 (UTC)

5. Bridges and tunnels
These can often be represented via multiple ways, for example one for the railway part of the bridge, another for the road.

RESPONSE: May need to remove bridges and tunnels, or check these manually.

6. Links to external
Do we have a rule for which external datasets we are willing to add links to?

RESPONSE: Not that I am aware of but Wikidata is a large project and would no doubt pass any criteria. If criteria are to be established it would be a side project and we should not hold up this work waiting.

7. Use of Maproulette
Instead of an automated import we could use Maproulette so that everything gets checked.

RESPONSE: This option would be very slow and would be a dull task (prone to human error). We would also need clear rules as to whether the human should add the tag to the node or the relation (etc.). There is a risk that the Maproulette approach would result in bad data. We could however use Maproulette as a QA tool when the matching algorithm detects mismatches.

8. Use of wikipedia to start the match process
The matching algorithm starts from Wikipedia categories. Why not use Wikidata's 'instance of' property?

Using English Wikipedia introduces an English-language bias: there are items in Wikidata without an associated article in English Wikipedia. Wikipedia categories are used because use of the 'instance of' property is very patchy; the majority of the items in my result list don't include it. -- Edward Betts
The code could be adapted to produce further runs for other-language Wikipedias. --Andy Mabbett (User:Pigsonthewing) 19:26, 26 November 2014 (UTC)

Other discussion points

  • DONE: Matching algorithm to include a geographic proximity test.
  • DONE: Split the upload by region.
  • DONE: Check licence issues - no problem adding Wikidata info to OSM as Wikidata is CC0.
  • DONE: Include a mismatch list.
  • DONE: Please show geographic distance between OSM object and Lat/Lon in wikidata in the results of the matching algorithm.
  • DONE: Should not match on power=generator.
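For the proximity test and the distance shown in the results, a great-circle distance is one straightforward option. The sketch below uses the haversine formula; the 1 km default threshold and the function names are assumptions for illustration, since the page does not state which distance measure or cut-off the matcher uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_nearby(osm_latlon, wikidata_latlon, max_km=1.0):
    """True if the two points are within max_km (an assumed threshold)."""
    return haversine_km(*osm_latlon, *wikidata_latlon) <= max_km
```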


Technical Implementation

A matching algorithm is used to find one-to-one matches between OpenStreetMap objects and Wikidata items. The search starts with articles in English Wikipedia categories, looks up the corresponding Wikidata items, then searches OSM for objects that are geographically close and have matching tags. For example, articles in 'Category:Castles by country' are compared against objects tagged 'historic=castle'. The names are then compared with some fuzzy matching; if the names match, the two items are treated as the same. More details: https://lists.openstreetmap.org/pipermail/talk/2014-November/071510.html
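The tag and name checks can be illustrated with a short sketch. It is not the code from the repository: the category-to-tag mapping is a hypothetical stand-in built from the castle example, the fuzzy comparison uses Python's standard difflib, and the 0.85 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from a Wikipedia category to the OSM tag it implies.
CATEGORY_TAGS = {"Category:Castles by country": ("historic", "castle")}


def normalise(name):
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def names_match(a, b, threshold=0.85):
    """Fuzzy name comparison; the threshold is an assumed value."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold


def is_match(category, article_title, osm_tags):
    """True if the OSM tags fit the category and the names agree."""
    required = CATEGORY_TAGS.get(category)
    if required is None:
        return False
    key, value = required
    if osm_tags.get(key) != value:
        return False
    return names_match(article_title, osm_tags.get("name", ""))
```

A geographic proximity test would be applied before this step, so that only nearby objects are compared at all.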

The code to find the matches is here: https://github.com/edwardbetts/osm-wikidata

Matching criteria: https://github.com/EdwardBetts/osm-wikidata/blob/master/entity_types.json

The results of the matching algorithm are here: http://edwardbetts.com/osm-wikidata/

Standard OpenStreetMap tools are used to add the Wikidata ID to OSM objects using the key:wikidata tag. The upload will be done in chunks, probably 1 km square regions. Existing tags will not be changed and object geometry will not be altered.
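As a rough sketch of the chunking idea, matches can be bucketed into grid cells of about 1 km before upload, and the tag added only where no wikidata tag exists yet. The grid scheme, the 111.32 km per degree approximation, and all names below are assumptions for illustration; the upload itself would go through standard OSM tools and is not shown.

```python
import math
from collections import defaultdict

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def grid_cell(lat, lon, cell_km=1.0):
    """Approximate (row, column) index of the cell containing a point."""
    km_per_deg_lon = KM_PER_DEGREE * math.cos(math.radians(lat))
    return (math.floor(lat * KM_PER_DEGREE / cell_km),
            math.floor(lon * km_per_deg_lon / cell_km))


def chunk_matches(matches, cell_km=1.0):
    """Group match dicts (with "lat" and "lon" keys) by grid cell."""
    chunks = defaultdict(list)
    for match in matches:
        chunks[grid_cell(match["lat"], match["lon"], cell_km)].append(match)
    return dict(chunks)


def add_wikidata_tag(tags, qid):
    """Return a copy of tags with wikidata added; never overwrite a tag."""
    new_tags = dict(tags)
    new_tags.setdefault("wikidata", qid)
    return new_tags
```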

PROPOSAL (needs confirming): A small sample to be uploaded first followed by a 1 or 2 day pause to give people a chance to check.

Exclusions

The following may be excluded:

  • Settlements in Germany.
  • Wikidata items for chain stores.

Further work

The following ideas are out of scope of this mechanical edit, but are interesting ideas that the community may want to consider as a future piece of work.

Maproulette for the multiple-match case: A recommendation to use Maproulette for cases that do not fit the script criteria, such as multiple matches.

See also