User:SafwatHalaby/scripts/gtfs


The script is a work in progress and details are subject to change. Documentation may at times lag behind the source code due to the rapid changes and experimentation I'm doing. You may want to check out the Forum thread for news and updates.

Annotated Aerial view of bus stops
Result of a local incremental update test (JOSM)

GTFS imports are hard to keep up to date. The first import is easy, but both OSM mappers and the GTFS provider keep changing their datasets, so the two gradually drift apart.

This script is a custom made conflation tool, allowing us to continuously update the import with periodic merges. Only the bus stops that have changed since the last import are edited. This keeps the GTFS import up to date.

The script resolves conflicts by assuming that the most recent edit is the correct one, be it an OSM mapper edit or a GTFS provider edit. The script creates and deletes bus stops when needed.

The Israeli government publishes new GTFS updates nightly, but the dataset was last imported to OSM in 2012, which leaves OSM bus stops badly out of date. This script aims to fix this, and with some minor changes it may also be suitable for use elsewhere in the world.

Currently, only stops.txt and translations.txt are processed. No routes are added or removed.

Script pages:

This script is part of SafwatHalaby_bot. (Click for contact details, opt-out, other scripts, bot overview, etc.).


Data Source

The GTFS files are provided by the Israeli Ministry of Transportation (MOT). The data is high quality and very accurate, and is used nationally throughout Israel for all public transportation. Links:

Usage permission

We are allowed to freely use the data. See this and this forum posts.

Main Algorithm

The main algorithm is in gtfs.js.

The "ref" tag is used as the conflation key to correlate between GTFS stops and OSM stops.
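The indexing step can be sketched as follows. This is a minimal sketch and not the code in gtfs.js: the function name parseStops is hypothetical, the parser assumes no quoted commas in stops.txt, and using the GTFS stop_code column as the value that matches the OSM ref tag is an assumption.

```javascript
// Minimal sketch: parse a stops.txt body and index its rows by the conflation key.
// Assumptions: the file has no quoted commas, and stop_code matches the OSM "ref" tag.
function parseStops(csvText, keyColumn = "stop_code") {
  const lines = csvText.trim().split(/\r?\n/);
  const header = lines[0].split(",");
  const byRef = new Map();
  for (const line of lines.slice(1)) {
    const cells = line.split(",");
    const row = {};
    header.forEach((name, i) => { row[name] = cells[i]; });
    const ref = row[keyColumn];
    // Keep a list per ref so that duplicated refs can be detected later.
    if (!byRef.has(ref)) byRef.set(ref, []);
    byRef.get(ref).push(row);
  }
  return byRef;
}
```

Keeping a list per key, rather than a single row, makes it possible to notice when several stops share the same reference.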

Updating stop data

The script requires two GTFS files as input: the "old file" and the "new file". On the next run, the previous "new file" becomes the "old file", and a freshly downloaded file becomes the "new file". Files are fetched from http://he.mot.gov.il/. If a piece of data (a bus stop tag value or coordinate) differs between the old GTFS file and the new one, the new value is applied to OSM; otherwise OSM is left alone. This means user edits are not overridden as long as they are the most recent data. If the provider later updates a stop, that update overrides any user edit, because it is more recent.

Data update cases (X,Y,Z are different versions of some piece of data e.g. tag value / coordinates):

Data in old file | Data in new file | Data in OSM | Action | Notes
X | Y | X | Change OSM data to Y | GTFS data (Y) is more up to date.
X | X | Y | Nothing | User data (Y) is more up to date.
X | X | X | Nothing | Nothing has changed.
N/A | N/A | X | Nothing | This tag is not present in the GTFS files (e.g. shelter=*).
X | Y | Z | Change OSM data to Y | It's impossible to tell which is newer, but we choose to trust the provider. If the user insists, they can re-apply Z; the next update would then be Y, Y, Z, so the bot won't override again (see row 2).
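The rows above reduce to one rule per piece of data: apply the new GTFS value only if it differs from the old GTFS value. A minimal sketch follows, assuming tags are plain objects; the names resolveField and mergeTags are hypothetical and not taken from gtfs.js.

```javascript
// Decide the value of one piece of data from its three versions.
function resolveField(oldGtfs, newGtfs, osm) {
  // Provider did not change it (or the tag is absent from GTFS): keep OSM as is.
  if (oldGtfs === newGtfs) return { value: osm, changed: false };
  // Provider changed it: the GTFS value wins, even over a user edit.
  return { value: newGtfs, changed: newGtfs !== osm };
}

// Apply the rule to every key of the new GTFS record.
// Assumption: keys that exist only in OSM (e.g. shelter) are never touched.
function mergeTags(oldGtfs, newGtfs, osmTags) {
  const result = { ...osmTags };
  let modified = false;
  for (const key of Object.keys(newGtfs)) {
    const { value, changed } = resolveField(oldGtfs[key], newGtfs[key], osmTags[key]);
    if (changed) {
      result[key] = value;
      modified = true;
    }
  }
  if (modified) result["gtfs:verified"] = "no";
  return result;
}
```

For example, mergeTags({ name: "X" }, { name: "Y" }, { name: "Z", shelter: "yes" }) yields name "Y", keeps shelter, and marks the stop with gtfs:verified=no.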

gtfs:verified=no is added to any modified stop.

Adding and removing stops

Quite often, a bus stop is missing from one of the 3 columns. In that case the following algorithm decides what to do:

Column 1: Old Database
Column 2: New Database
Column 3: Openstreetmap

For each bus stop that has a reference (ref tag), find out in which columns it exists and in which it doesn't.
If multiple bus stops have the same reference in any column, we halt. The user must fix this manually.
Exception: platforms (ratzefeem) sometimes have duplicated refs in the database; these are merged into one.

X       : A single bus stop with that reference exists in that column
-       : No bus stop with that reference exists in that column
=>      : action to be taken

1 2 3
- - -  => N/A
- - X  => Delete if it has source=israel_gtfs. Otherwise it's a stop not introduced by the gtfs; do nothing.
- X -  => Create.
X - -  => Nothing.
X X -  => Nothing if the stops in col1 and col2 are identical, create otherwise.
- X X  => Update.
X - X  => Delete.
X X X  => Update.

An update only changes the keys where col1[key] != col2[key] (see first table).
Note that space.js may still delete it. 
source=israel_gtfs is added to any created or updated stop (even when the update did not really touch anything because the stop was already up to date).
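The decision table can be sketched as a single function. This is an illustration under stated assumptions, not the code in gtfs.js: the name decideAction is hypothetical, a missing stop is passed as null, OSM stops are assumed to look like { tags: {...} }, and comparing GTFS records with JSON.stringify is a simplification that depends on key order. The source value checked is source=israel_gtfs, the spelling used for the tag the script adds.

```javascript
// Return the action for one ref, given the stop (or null) in each column.
function decideAction(oldStop, newStop, osmStop,
                      sameStop = (a, b) => JSON.stringify(a) === JSON.stringify(b)) {
  const inOld = oldStop != null;
  const inNew = newStop != null;
  const inOsm = osmStop != null;

  if (inNew && inOsm) return "update";                       // - X X and X X X
  if (inNew && !inOsm) {
    if (!inOld) return "create";                             // - X -
    return sameStop(oldStop, newStop) ? "nothing" : "create"; // X X -
  }
  if (inOsm) {
    if (inOld) return "delete";                              // X - X
    const source = (osmStop.tags || {}).source;
    return source === "israel_gtfs" ? "delete" : "nothing";  // - - X
  }
  return "nothing";                                          // X - - and - - -
}
```

The "X X -" row is what lets a user deletion stick: the stop is only re-created once the provider changes it again.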

Assumptions/drawbacks

The script makes 2 assumptions. If these assumptions are broken for your use case, it could be a bad idea to deploy it.

Note: Third assumption eliminated (was: GTFS file has no stale stops)

Most recent change is the correct change

No one is perfect, and this assumption will inevitably be violated from time to time by mappers or by the GTFS provider. But if your GTFS provider is so unreliable that this assumption breaks often, then deploying this script is a bad idea.

The script is run frequently

Suppose the script runs once in 2010, and then never again until 2020 (either because the provider doesn't publish frequent GTFS updates, or because the bot maintainer stops running it). Meanwhile:

  • The provider changes a stop and updates its internal bus stop DB in 2012
  • In 2015, the provider changes the same stop again but forgets to update the internal DB, while an editor updates the OSM stop
  • When the script runs, the 2020 file will still contain the out-of-date 2012 value. Since the GTFS file has no internal "last-updated" field, that old 2012 value is assumed to be a 2020 value and overrides the 2015 value.

Note that similar scenarios could occur over a much shorter span, e.g.:

  • Sunday: script runs
  • Monday: bus stop changes, internal provider DB edit
  • Tuesday: bus stop changes again, OSM user edit
  • Wednesday: New gtfs file published, script runs again. The bot would think Monday edit is newer and override the Tuesday edit.

As long as GTFS files are published and the script is run much more often than an individual bus stop changes, the scenarios above should not occur.
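The Sunday to Wednesday scenario can be replayed in a few lines. This is only a sketch: resolve is a hypothetical one-line version of the update rule from the data update table, and "A", "B", "C" are made-up values standing for three versions of the same piece of data.

```javascript
// Rule from the data update table: a provider change between two files wins,
// otherwise the OSM value is kept.
function resolve(oldGtfs, newGtfs, osm) {
  return oldGtfs === newGtfs ? osm : newGtfs;
}

// One run on Sunday, the next one on Wednesday.
function sparseRuns() {
  const sundayFile = "A";
  const wednesdayFile = "B"; // provider edit made on Monday
  const osm = "C";           // mapper edit made on Tuesday
  return resolve(sundayFile, wednesdayFile, osm);
}

// A run every night, each with a fresh file.
function nightlyRuns() {
  let osm = "A";
  osm = resolve("A", "B", osm); // Monday night: provider edit is applied
  osm = "C";                    // Tuesday: mapper edit
  osm = resolve("B", "B", osm); // Tuesday night: file unchanged, mapper edit kept
  osm = resolve("B", "B", osm); // Wednesday night: still unchanged
  return osm;
}

console.log(sparseRuns());  // B (the Tuesday edit is lost)
console.log(nightlyRuns()); // C (the Tuesday edit survives)
```

With nightly runs the Monday provider edit is consumed before the mapper touches the stop, so later runs see no provider change and leave the mapper's value alone.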

In Israel, new gtfs files are published nightly by the government.

Secondary scripts

Duplicate removal algorithm

Note: algorithm in latest source is different from the description below. Will update this description when I settle on the algorithm desired.

This algorithm is in space.js. It should run after gtfs.js.

Bus stops without a ref tag are never modified by the main algorithm. This helper algorithm works as follows:

If there's a bus stop without a ref tag within a 50 meter radius of a bus stop which does have a ref tag:

  • If the ref-less stop has no other tags, remove it
  • Otherwise, add a "fixme: suspected duplicate" note.

If a bus stop without a ref tag is not within a 50 meter radius of a bus stop which does have a ref tag, nothing is done.
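The rules as described here can be sketched like this. It is not the code in space.js (which, per the note above, has since diverged from this description). Assumptions: stops look like { lat, lon, tags }, distance is computed with the haversine formula, "no other tags" is read as "no tags besides highway", and the names distanceMeters and classifyRefless are hypothetical.

```javascript
const EARTH_RADIUS_M = 6371000;

// Great-circle distance in meters between two { lat, lon } points (haversine).
function distanceMeters(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Decide what to do with one ref-less stop, given all stops that have a ref.
// This naive version scans every ref stop; it is quadratic overall.
function classifyRefless(stop, refStops, radiusMeters = 50) {
  const near = refStops.some((r) => distanceMeters(stop, r) < radiusMeters);
  if (!near) return "nothing";
  // Assumption: "no other tags" means nothing besides the highway tag.
  const otherTags = Object.keys(stop.tags).filter((k) => k !== "highway");
  return otherTags.length === 0 ? "remove" : "fixme";
}
```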

This algorithm may be tweaked and improved later. I'm just testing it out.

Side note: The source code, space.js, is generic, and you can modify it and use it for other things that are unrelated to bus stops. It's suitable for operations where elements need to be compared with nearby elements in linear complexity.
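One common way to get near-linear "compare with nearby elements" behaviour is a uniform grid. The sketch below illustrates that general technique and is not a description of how space.js is implemented. The names buildGrid and nearbyCandidates are hypothetical, and the cell size of 0.001 degrees is an assumption chosen to be larger than 50 meters at Israel's latitude.

```javascript
// Bucket points into square cells of cellDeg degrees.
function buildGrid(points, cellDeg) {
  const grid = new Map();
  for (const p of points) {
    const key = `${Math.floor(p.lat / cellDeg)}:${Math.floor(p.lon / cellDeg)}`;
    if (!grid.has(key)) grid.set(key, []);
    grid.get(key).push(p);
  }
  return grid;
}

// Collect points from the cell containing p and its 8 neighbours.
// Callers still need an exact distance check on the returned candidates.
function nearbyCandidates(grid, p, cellDeg) {
  const cy = Math.floor(p.lat / cellDeg);
  const cx = Math.floor(p.lon / cellDeg);
  const out = [];
  for (let dy = -1; dy <= 1; dy++) {
    for (let dx = -1; dx <= 1; dx++) {
      out.push(...(grid.get(`${cy + dy}:${cx + dx}`) || []));
    }
  }
  return out;
}
```

Because each lookup only visits 9 cells, total work grows roughly linearly with the number of elements as long as no cell is heavily crowded.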

Visual desync output

Can be found here: http://www.safwat.xyz/stops/

Todo documentation

Summary of changes the scripts make

When the script finishes, every stop in the new GTFS file is guaranteed to be on the map, except stops that users explicitly deleted and that haven't been updated in the GTFS since then (the "X X -" case). The stops won't necessarily carry the data present in the GTFS files, because user changes are honored. All of these stops will have source=israel_gtfs and a ref. Other stops may also be present on the map:

  • stops with a ref but with no source=israel_gtfs, added independently by mappers, and their ref is not present in the gtfs file. Ignored by the main script. (If the ref is present in the gtfs file, source=israel_gtfs is added, and stop is updated).
  • stops without a ref. Ignored by main script.
  • Abnormal: stops with source=israel_gtfs but no ref. Ignored by the main script and mappers should manually inspect them.

For cases 1 and 2, space.js removes some highly likely duplicate stops and adds fixmes to some likely but uncertain duplicates. It also adds fixmes to refs not present in the government GTFS file, and always adds a fixme for case 3.

All stops will have a unique ref after the run. If prior to the run the refs are not unique, the script will refuse to run.
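A pre-run uniqueness check could look like the sketch below. The names findDuplicateRefs and assertUniqueRefs are hypothetical, stops are assumed to be objects with a tags property, and the platform exception mentioned earlier is deliberately not handled here.

```javascript
// Return every ref value that is carried by more than one stop.
function findDuplicateRefs(stops) {
  const counts = new Map();
  for (const stop of stops) {
    const ref = stop.tags.ref;
    if (ref === undefined) continue; // ref-less stops are not part of this check
    counts.set(ref, (counts.get(ref) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([ref]) => ref);
}

// Refuse to continue if any ref is duplicated.
function assertUniqueRefs(stops) {
  const dupes = findDuplicateRefs(stops);
  if (dupes.length > 0) {
    throw new Error("Duplicate refs, fix manually: " + dupes.join(", "));
  }
}
```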

source=israel_gtfs exists? | ref exists? | Touched by which algorithm? | Possible actions
yes | yes | gtfs.js | Delete / modify / do nothing (bot-created stops always start in this category)
no | yes | gtfs.js | Same as above. Also, add source=israel_gtfs if modified.
no | no | space.js | Delete (<50 m from a ref) / add fixme tag (<50 m from a ref, has extra tags) / do nothing (>=50 m from a ref)
yes | no | none | Nothing. Manual intervention required. This is not a normal stop.
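The table can be read as a small dispatch on two tags. A minimal sketch, assuming tags are a plain object; the function name responsibleScript is hypothetical.

```javascript
// Return which script handles a stop, based only on its source and ref tags.
function responsibleScript(tags) {
  const hasSource = tags.source === "israel_gtfs";
  const hasRef = tags.ref !== undefined;
  if (hasRef) return "gtfs.js";        // rows 1 and 2
  return hasSource ? "none" : "space.js"; // row 4 (abnormal) or row 3
}
```

For instance, responsibleScript({ highway: "bus_stop" }) returns "space.js", while a stop tagged source=israel_gtfs without a ref returns "none" and needs manual inspection.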

Changesets

Initial Run

Performed on 4 November 2017.

Manual cleanup followups

Incremental updates

Log

Click here for a log from an experimental run.

Among other things, the log contains desync information that can be helpful for finding mapper errors or gtfs provider errors. Perhaps we could send provider errors back upstream.

Known Bugs

Road dragging bug

If a bus stop is part of a road, the script may accidentally drag the road if the stop is moved. I am currently resolving this manually by finding these stops in advance and separating them from ways, prior to running the script.

An Overpass query is performed, and then the following JOSM filters are used to mark the problem nodes. I then manually detach stops from them. This is not a common occurrence in Israel, so manual fixing currently does not take much time.

-child type:way
-highway=bus_stop
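The same check can also be done programmatically on downloaded data. The sketch below is not part of the current workflow: it assumes the data is available as an array of elements shaped like Overpass JSON output (objects with type, id, tags, and a nodes list for ways), and the function name busStopsOnWays is hypothetical.

```javascript
// Return the ids of bus stop nodes that are also members of at least one way.
function busStopsOnWays(elements) {
  const nodesOnWays = new Set();
  for (const el of elements) {
    if (el.type !== "way") continue;
    for (const nodeId of el.nodes || []) nodesOnWays.add(nodeId);
  }
  return elements
    .filter((el) => el.type === "node" &&
      el.tags && el.tags.highway === "bus_stop" &&
      nodesOnWays.has(el.id))
    .map((el) => el.id);
}
```

Any id returned would be a stop to detach from its way by hand before the script moves it.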

documentation todo

Documentation lagging behind in:

  • fixme codes
  • translations
  • railway
  • space.js

Possible future changes

  • special treatment for name - ar or he
  • Grab more data from the gtfs files
  • Provider feedback loop (conflict log files?)
  • routes?