Automated edits/JaTrainWikipedia

From OpenStreetMap Wiki

Goal

This is a proposal for a mechanical edit to fill in the tag "wikipedia" for Japanese train stations, under the following account:

There are about 9,500 train station nodes for Japan in OpenStreetMap, but only about 1,000 of them have a "wikipedia" tag linking to their corresponding page. The goals of this script are to:

  • fill the tag "wikipedia" if not present
  • convert the outdated tag "wikipedia:ja" (pointing to a URL) to the correct format
  • automate most of the work
  • be safe: point to the right page (never to a disambiguation page) and give up if we cannot be sure
  • provide some samples for the local mappers, so they can discuss the validity of the project
  • allow the user to review the modifications before any commit to the server
  • keep the commits from being a burden on the server

Implementation

The process will perform the following steps.

Creating a station extract from Wikipedia

The first step is to retrieve the Japanese Wikipedia dump and extract from it the list of stations and their locations. This extract will only be used to check that we picked the right Wikipedia page for each OSM station node. No Wikipedia data will be put into the OpenStreetMap database (the licenses are not compatible).
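The extraction step above could be sketched as follows. This is a minimal sketch, not the script's actual code: the MediaWiki export namespace URI is an assumption (it varies by dump version), and real dumps are several gigabytes, hence the streaming parse.

```python
# Sketch: stream a Japanese Wikipedia XML dump and collect station pages
# (titles ending with 駅) together with their raw wikitext, from which
# coordinates can later be extracted.
import xml.etree.ElementTree as ET

# Assumed export namespace; check the dump's root element for the real one.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def station_pages(dump_file):
    """Yield (title, wikitext) for every page whose title ends with 駅."""
    for _, elem in ET.iterparse(dump_file):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title", "")
            text = elem.findtext(f"{NS}revision/{NS}text", "")
            if title.endswith("駅"):
                yield title, text
            elem.clear()  # free memory: the full dump does not fit in RAM
```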

Listing the train station nodes

Then we will retrieve all the OSM nodes to update, by:

  • retrieving a recent XML dump of the Japan data
  • filtering it with Osmosis, keeping only the nodes tagged «railway=station» (I don't have the command line; I used OSMembrane)

Processing the stations in batch

Starting from here, the job will be done by a Python script.

The tool will not process all the stations at once, but in batches. Each batch will contain a number of stations low enough:

  • to allow a human review (in JOSM)
  • not to be a burden for the servers

The stations of a batch will be roughly in the same region. The result of each batch will be saved as an OSM XML file.
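One simple way to get small, regional batches is to bucket the stations into a coarse latitude/longitude grid and then cap the batch size. The grid size and `MAX_BATCH` below are illustrative choices, not the script's actual parameters:

```python
# Sketch: split station nodes into small regional batches.
from collections import defaultdict

MAX_BATCH = 50  # small enough to review in JOSM and be gentle on the server

def make_batches(stations):
    """stations: iterable of dicts with 'lat' and 'lon' keys."""
    by_cell = defaultdict(list)
    for s in stations:
        # Stations in the same ~1-degree cell end up in the same bucket.
        cell = (int(s["lat"]), int(s["lon"]))
        by_cell[cell].append(s)
    batches = []
    for cell_stations in by_cell.values():
        for i in range(0, len(cell_stations), MAX_BATCH):
            batches.append(cell_stations[i:i + MAX_BATCH])
    return batches
```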

Processing a station

If the "wikipedia" tag of a station is already filled, we can skip that station.

Retrieving the latest version of the node

The script will get it using API 0.6 (for example, using this download URL). We will retrieve many nodes at once.
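The multi-fetch call could look like the sketch below. The `/api/0.6/nodes?nodes=id,id,...` endpoint is part of the standard OSM API; the chunk size of 50 is an arbitrary choice for this sketch:

```python
# Sketch: build multi-fetch URLs for API 0.6 and download each chunk.
from urllib.request import urlopen

API = "https://api.openstreetmap.org/api/0.6/nodes?nodes="

def node_urls(node_ids, chunk=50):
    """Group node ids into chunks and return one multi-fetch URL per chunk."""
    ids = list(node_ids)
    return [API + ",".join(str(n) for n in ids[i:i + chunk])
            for i in range(0, len(ids), chunk)]

def fetch_nodes(node_ids, chunk=50):
    """Yield the raw XML body returned for each chunk of node ids."""
    for url in node_urls(node_ids, chunk):
        with urlopen(url) as resp:
            yield resp.read()
```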

Converting the tag

If a tag «wikipedia:ja -> URL» is present, we convert it to the format «wikipedia -> ja:name_of_page».
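The conversion can be sketched as below, assuming the old tag value is a ja.wikipedia.org article URL (possibly percent-encoded, with underscores for spaces):

```python
# Sketch: convert an old  wikipedia:ja=<URL>  value to  wikipedia=ja:Page name.
from urllib.parse import unquote, urlparse

def convert_tag(url):
    """'https://ja.wikipedia.org/wiki/%E6%9D%B1%E4%BA%AC%E9%A7%85' -> 'ja:東京駅'"""
    path = urlparse(url).path                      # '/wiki/%E6%9D%B1...'
    title = unquote(path.split("/wiki/", 1)[1])    # decode percent-encoding
    return "ja:" + title.replace("_", " ")         # wiki titles use _ for spaces
```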

Finding the right Wikipedia page (current implementation)

Most of the time, the page name is the «kanji name» + «駅». But we have to make sure that it does not point to a Wikipedia disambiguation page, and we can compare the coordinates from OpenStreetMap and from Wikipedia.

  • From the Wikipedia extract, get a list of potentially matching stations. Each one has coordinates and a page name.
  • If there are several stations within a short distance (arbitrary distance: 500 m), skip this station: it is not safe, as there may be, for instance, a JR station next to another operator's station.
  • If there is exactly one station within range, pick it.
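The matching rule above can be sketched with a standard haversine distance. The 500 m threshold comes from the text; the data shapes (`node` as a dict, candidates as tuples) are assumptions of this sketch:

```python
# Sketch: accept a Wikipedia candidate only if it is the single station
# within 500 m of the OSM node; otherwise give up (not safe).
from math import radians, sin, cos, asin, sqrt

MAX_DIST = 500.0  # metres, the arbitrary threshold from the proposal

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def pick_candidate(node, candidates):
    """candidates: list of (page_name, lat, lon). Return the page name,
    or None if zero or several candidates are within range."""
    close = [name for name, lat, lon in candidates
             if haversine(node["lat"], node["lon"], lat, lon) <= MAX_DIST]
    return close[0] if len(close) == 1 else None
```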

Finding the right Wikipedia page (previous implementation)

The same naming rule applies: most of the time, the page name is the «kanji name» + «駅», and we must avoid disambiguation pages while checking the coordinates. This earlier implementation worked directly on the Wikipedia pages:

  • retrieve the Japanese edit page (wikitext) for this station
  • if this page contains the text «{{aimai}}», it is a disambiguation page, and we can skip this node
  • we can easily extract the coordinates from the page (if present) and check them against the OSM coordinates, because they appear on a line starting with «|座標»
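These page-based checks could be sketched as follows. The exact wikitext conventions ({{aimai}} capitalization, the |座標 infobox line) are assumptions about ja.wikipedia markup:

```python
# Sketch: detect a disambiguation page via the {{aimai}} template and
# pull out the raw coordinate line from the infobox wikitext.
import re

def is_disambiguation(wikitext):
    """True if the page transcludes the {{aimai}} disambiguation template."""
    return "{{aimai}}" in wikitext.lower()

def coord_line(wikitext):
    """Return the raw |座標 line if present, else None."""
    m = re.search(r"^\|座標.*$", wikitext, re.MULTILINE)
    return m.group(0) if m else None
```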

Submitting

After the whole input file has been processed, the user is left with a number of output XML files. They can:

  • open one of them in JOSM
  • see the modifications (search for «railway station» and see the history)
  • if ok, submit

Source code

The source code is available here:

https://github.com/Fabiensk/osm-enrich

Sample output file

You can get the current OSM output files. To check the modifications, open them in JOSM and look at the history of the nodes.

A log file is also generated to explain why each station was or was not updated (distance between Wikipedia and OpenStreetMap too large, disambiguation page, etc.).

Improvements and other tasks

  • clean the code of the script, make it more modular, and put it on GitHub

Status

Done

  • principle is accepted
  • preliminary output files and modification log are available
  • out of 9,241 station nodes, 8,197 could be modified
  • a first edit has been made public

To do

  • get feedback from the local mappers (Japanese mailing list)
  • decide how many nodes per commit we will have
  • commit the rest
  • clean the code and put it on GitHub