Note: As of March 2018, the documentation is mostly accurate, but some minor details are out of date. I will update it when I have the time. You may want to check the Forum thread for news and updates.
GTFS imports are hard to keep up to date. The first import is easy, but both OSM mappers and the GTFS provider keep changing their datasets, so the two gradually drift apart.
This script is a custom-made conflation tool that lets us keep the import current through periodic merges. Only the bus stops that have changed since the last import are edited.
The script resolves conflicts by assuming that the most recent edit is the correct one, be it an OSM mapper edit or a GTFS provider edit. The script creates and deletes bus stops when needed.
The Israeli government publishes new GTFS updates nightly, but the dataset was last imported to OSM in 2012, leaving OSM bus stops badly out of date. This script aims to fix that, and with some minor changes it may also be suitable for other parts of the world.
Currently, only stops.txt and translations.txt are processed. No routes are added or removed.
- Script wiki page: You are here.
- Script discussion: Forum thread | Script Talk Page
- Changeset history: Click here
- Initial discussion: https://forum.openstreetmap.org/viewtopic.php?id=16738
- Source code: https://github.com/SafwatHalaby/osm-bot/blob/master/gtfs (several files)
This script is part of my scripting project. (Click for contact details, opt-out, other scripts, overview, etc.).
- 1 Data Source
- 2 Main Algorithm
- 3 Secondary scripts
- 4 Summary of changes the scripts make
- 5 Changesets
- 6 Known Bugs
- 7 documentation todo
- 8 Possible future changes
The GTFS files are provided by the Israeli Ministry of Transportation (MOT). The data is high quality and very accurate, and is used nationally throughout Israel for all public transportation. Links:
- MOT website: http://he.mot.gov.il
- Terms and conditions
- Direct FTP download link: ftp://gtfs.mot.gov.il/israel-public-transportation.zip
- The file is updated nightly
The main algorithm is in gtfs.js.
The "ref" tag is used as the conflation key to correlate between GTFS stops and OSM stops.
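To make the conflation key concrete, here is a minimal sketch of indexing a stops.txt payload by that key. It is not taken from gtfs.js. The function name `indexStopsByRef` is hypothetical, the choice of the GTFS `stop_code` column as the counterpart of the OSM ref tag is an assumption, and the parser ignores quoted fields for brevity.

```javascript
// Parse GTFS stops.txt text into a Map: conflation key -> array of rows.
// Assumption: the OSM "ref" tag corresponds to the GTFS stop_code column.
// Assumption: no quoted fields containing commas (a real parser must handle them).
function indexStopsByRef(stopsTxt, keyColumn = "stop_code") {
  const lines = stopsTxt.trim().split(/\r?\n/);
  const header = lines[0].split(",");
  const byRef = new Map();
  for (const line of lines.slice(1)) {
    const cells = line.split(",");
    const row = {};
    header.forEach((name, i) => { row[name] = cells[i]; });
    const ref = row[keyColumn];
    // Keep an array per key so duplicated refs stay visible to the caller.
    if (!byRef.has(ref)) byRef.set(ref, []);
    byRef.get(ref).push(row);
  }
  return byRef;
}

// Made-up sample data, for illustration only.
const sample = [
  "stop_id,stop_code,stop_name,stop_lat,stop_lon",
  "1,38831,Example A,32.1,34.8",
  "2,38832,Example B,32.2,34.9",
].join("\n");
const stopsByRef = indexStopsByRef(sample);
```

Storing an array per key, rather than a single row, keeps duplicated references detectable instead of silently overwriting one stop with another.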
Updating stop data
The script requires two GTFS files as input: the "old file" and the "new file". On the next run, the previous "new file" becomes the "old file", and a freshly downloaded file becomes the "new file". Files are fetched from http://he.mot.gov.il/. If a piece of data (a bus stop tag value or coordinate) differs between the old GTFS file and the new one, the new value is applied to OSM; otherwise OSM is left alone. This means that user edits are not overridden as long as they are the most recent data. But if the provider later updates a stop, that update overrides any user edit, because it is more recent.
Data update cases (X,Y,Z are different versions of some piece of data e.g. tag value / coordinates):
|data in old file||data in new file||data in OSM||action||notes|
|X||Y||X||change OSM data to Y||gtfs data (Y) is more up to date|
|X||X||Y||Nothing||user data (Y) is more up to date|
|X||X||X||Nothing||Nothing has changed.|
|N/A||N/A||X||Nothing||This data tag is not present in the GTFS files (e.g. shelter=*).|
|X||Y||Z||Change OSM data to Y||It's impossible to tell which is newer, but we choose to trust the provider. If the user insists, they can re-apply Z, and the next update would be YYZ, meaning the script won't override again (see row 2: X, X, Y).|
gtfs:verified=no is added to any modified stop.
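The update cases in the table can be sketched as a per-key three-way merge. This is an illustration of the rules as described on this page, not code from gtfs.js. The name `mergeStopTags` is hypothetical, coordinates are assumed to be handled like any other key, and keys that disappear from the new file are left alone because the page does not say how they are treated.

```javascript
// oldTags / newTags: one stop's values in the old and new GTFS files.
// osmTags: the same stop's current values in OSM.
// Returns the tag set OSM should end up with.
function mergeStopTags(oldTags, newTags, osmTags) {
  const merged = { ...osmTags };
  let modified = false;
  for (const [key, newValue] of Object.entries(newTags)) {
    // Only a difference between the two GTFS files counts as a provider edit.
    const providerChanged = oldTags[key] !== newValue;
    if (providerChanged && merged[key] !== newValue) {
      merged[key] = newValue;
      modified = true;
    }
  }
  // Per the page, modified stops are flagged for review.
  if (modified) merged["gtfs:verified"] = "no";
  return merged;
}

// Row "X, Y, Z": the provider value wins; unrelated tags such as shelter survive.
const merged = mergeStopTags(
  { name: "X" },
  { name: "Y" },
  { name: "Z", shelter: "yes" }
);
```

Keys that exist only in OSM (the N/A row) are never visited by the loop, so they pass through untouched.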
Adding and removing stops
Quite often, a bus stop is missing from one of the three columns (old file, new file, OSM). In that case the following rules apply:
Column 1: Old Database
Column 2: New Database
Column 3: OpenStreetMap
For each bus stop that has a reference (ref tag), find out in which columns it exists and in which it doesn't. If multiple bus stops have the same reference in any column, we halt. The user must fix this manually. Exception: platforms (ratzefeem) sometimes have db ref duplication that we merge into one.
X : A single bus stop with that reference exists in that column
- : No bus stop with that reference exists in that column
=> : Action to be taken
1 2 3
- - - => N/A
- - X => Delete if it has source=israel_gtfs; otherwise it's a stop not introduced by the GTFS, so do nothing.
- X - => Create.
X - - => Nothing.
X X - => Nothing if the stops in col1 and col2 are identical, create otherwise.
- X X => Update.
X - X => Delete.
X X X => Update.
Updating only updates the keys where col1[key] != col2[key] (see first table).
Note that space.js may still delete it.
source=israel_gtfs is added to any created or updated stop (even in cases where the update did not really touch anything because it's already up to date).
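The presence rules above can be sketched as a small decision function. This is a hedged illustration, not the code in gtfs.js: the names `decideAction` and `sameStop` are hypothetical, stops are assumed to be flat key/value objects (or null when absent from a column), and the tag value follows the source=israel_gtfs spelling used in most of this page.

```javascript
// Flat key/value comparison of two stop records.
function sameStop(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].every((key) => a[key] === b[key]);
}

// oldStop / newStop / osmStop: plain objects, or null when the stop is
// absent from that column. Returns "create", "update", "delete" or "nothing".
function decideAction(oldStop, newStop, osmStop) {
  if (osmStop === null) {
    if (newStop === null) return "nothing";  // - - -  and  X - -
    if (oldStop === null) return "create";   // - X -
    // X X - : a user deleted the stop; recreate only if the provider changed it.
    return sameStop(oldStop, newStop) ? "nothing" : "create";
  }
  if (newStop !== null) return "update";     // - X X  and  X X X
  if (oldStop !== null) return "delete";     // X - X
  // - - X : only remove stops that the import itself introduced.
  return osmStop.source === "israel_gtfs" ? "delete" : "nothing";
}

const action = decideAction({ ref: "1", name: "A" }, { ref: "1", name: "B" }, null);
```

The duplicate-reference halt and the platform exception are deliberately left out; they would have to run before this function is called.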
The script makes 2 assumptions. If these assumptions are broken for your use case, it could be a bad idea to deploy it.
Note: Third assumption eliminated (was: GTFS file has no stale stops)
Most recent change is the correct change
No one is perfect, and this assumption will inevitably be violated from time to time by mappers or by the GTFS provider. But if your GTFS provider is so unreliable that this assumption breaks often, then deploying this script is a bad idea.
The script is run frequently
Suppose the script runs once in 2010, and then never again until 2020. (Either because the provider doesn't provide frequent gtfs updates, or because the script maintainer abandons updating further). Meanwhile:
- The provider changes a stop and updates its internal bus stop DB in 2012
- In 2015, the provider changes the same stop again but forgets to update the internal DB, while an editor updates the OSM stop
- When the script runs, the 2020 file will still contain the out of date 2012 value. Since the GTFS file has no internal "last-updated" field, that old 2012 value would be assumed to be a 2020 value, and would override the 2015 value.
Note that similar scenarios could occur over a much shorter span, e.g.:
- Sunday: script runs
- Monday: bus stop changes, internal provider DB edit
- Tuesday: bus stop changes again, OSM user edit
- Wednesday: New gtfs file published, script runs again. The script would think Monday edit is newer and override the Tuesday edit.
As long as new GTFS files are published and the script is run much more often than any individual bus stop changes, the scenarios above should not occur.
In Israel, new gtfs files are published nightly by the government.
Duplicate removal algorithm
Note: algorithm in latest source is different from the description below. Will update this description when I settle on the algorithm desired.
This algorithm is in space.js. It should run after gtfs.js.
Bus stops without a ref tag are never modified by the main algorithm. This helper algorithm works as follows:
If there's a bus stop without a ref tag within a 50 meter radius of a bus stop which does have a ref tag:
- If the ref-less stop has no other tags, remove it
- Otherwise, add a "fixme: suspected duplicate" note.
If a bus stop without a ref tag is not within a 50 meter radius of a bus stop which does have a ref tag, nothing is done.
This algorithm may be tweaked and improved later. I'm just testing it out.
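The rule as described on this page (which, per the note above, differs from the latest source) can be sketched as follows. The names `distanceMeters` and `classifyRefless` are hypothetical, the haversine formula is one reasonable way to measure the 50 meter radius, and "no other tags" is assumed to mean no tags besides highway=bus_stop.

```javascript
const EARTH_RADIUS_M = 6371000;

// Great-circle distance between two {lat, lon} points, in meters (haversine).
function distanceMeters(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// stop: a ref-less stop {lat, lon, tags}; refStops: stops that do have a ref.
// Returns "remove", "fixme" or "nothing".
function classifyRefless(stop, refStops, radius = 50) {
  const near = refStops.some((r) => distanceMeters(stop, r) < radius);
  if (!near) return "nothing";
  // Assumption: highway=bus_stop itself does not count as an "other" tag.
  const otherTags = Object.keys(stop.tags).filter((k) => k !== "highway");
  return otherTags.length === 0 ? "remove" : "fixme";
}

// Made-up coordinates: the two stops are roughly 11 meters apart.
const refStops = [{ lat: 32.0, lon: 34.8, tags: { highway: "bus_stop", ref: "1" } }];
const verdict = classifyRefless(
  { lat: 32.0001, lon: 34.8, tags: { highway: "bus_stop" } },
  refStops
);
```

This version compares every ref-less stop against every stop with a ref, which is simple but quadratic in the worst case.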
Side note: The source code, space.js, is generic, and you can modify and use it for other things that are unrelated to bus stops. It's suitable for operations where elements need to be compared with nearby elements in linear complexity.
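One common way to get near-linear neighbor lookups is a uniform grid. The sketch below shows that general technique only; it is an assumption that it resembles what space.js does, and the page does not confirm it. The names `buildGrid` and `neighboursWithin` are hypothetical, and points are assumed to be already projected to planar x/y coordinates in meters.

```javascript
function cellKey(cx, cy) {
  return cx + "," + cy;
}

// Bucket points {x, y} into square cells of side cellSize (meters).
function buildGrid(points, cellSize) {
  const grid = new Map();
  for (const p of points) {
    const key = cellKey(Math.floor(p.x / cellSize), Math.floor(p.y / cellSize));
    if (!grid.has(key)) grid.set(key, []);
    grid.get(key).push(p);
  }
  return grid;
}

// Points closer than radius to p. Correct only when radius <= cellSize,
// because then every candidate lies in p's cell or one of its 8 neighbors.
function neighboursWithin(grid, cellSize, p, radius) {
  const cx = Math.floor(p.x / cellSize);
  const cy = Math.floor(p.y / cellSize);
  const found = [];
  for (let dx = -1; dx <= 1; dx++) {
    for (let dy = -1; dy <= 1; dy++) {
      for (const q of grid.get(cellKey(cx + dx, cy + dy)) || []) {
        if (q !== p && Math.hypot(q.x - p.x, q.y - p.y) < radius) found.push(q);
      }
    }
  }
  return found;
}

// a and b sit in different cells but are only 2 meters apart.
const a = { x: 49, y: 0 };
const b = { x: 51, y: 0 };
const c = { x: 200, y: 0 };
const grid = buildGrid([a, b, c], 50);
const nearA = neighboursWithin(grid, 50, a, 50);
```

Each lookup inspects at most nine cells, so the total work stays roughly linear as long as the number of points per cell is bounded.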
Visual desync output
Can be found here: http://www.safwat.xyz/stops/
Summary of changes the scripts make
When the script finishes, every stop in the new GTFS file is guaranteed to be on the map, except stops that users explicitly deleted and that haven't been updated in the GTFS since then (the X X - case). The stops won't necessarily have the data which is present in the GTFS files, because user changes are honored. All of these stops will have source=israel_gtfs and a ref. Other stops may also be present on the map:
- stops with a ref but with no source=israel_gtfs, added independently by mappers, and their ref is not present in the gtfs file. Ignored by the main script. (If the ref is present in the gtfs file, source=israel_gtfs is added, and stop is updated).
- stops without a ref. Ignored by main script.
- Abnormal: stops with source=israel_gtfs but no ref. Ignored by the main script and mappers should manually inspect them.
For cases 1 and 2, space.js removes some highly likely duplicate stops, or adds fixmes to some likely but uncertain duplicates. It also adds fixmes to refs not present in the government GTFS file, and always adds a fixme for case 3.
All stops will have a unique ref after the run. If prior to the run the refs are not unique, the script will refuse to run.
|source=israel_gtfs exists?||ref exists?||touched by which algorithm?||Possible actions|
|yes||yes||gtfs.js||delete/modify/do nothing (script-created stops always start in this category)|
|no||yes||gtfs.js||Same as above. Also, add source=israel_gtfs if modified.|
|no||no||space.js||delete (<50m from a ref, no other tags) / add fixme tag (<50m from a ref, has extra tags) / do nothing (>=50m from a ref)|
|yes||no||none||nothing. Manual intervention required. This is not a normal stop.|
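The uniqueness precondition mentioned above (the script refuses to run when refs are not unique) could be checked along these lines. This is a minimal sketch under assumptions: `findDuplicateRefs` is a hypothetical name, and stops are assumed to be objects with a `tags` map.

```javascript
// Returns every ref value carried by more than one stop.
function findDuplicateRefs(stops) {
  const counts = new Map();
  for (const stop of stops) {
    const ref = stop.tags && stop.tags.ref;
    if (ref === undefined) continue; // ref-less stops are not part of this check
    counts.set(ref, (counts.get(ref) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([ref]) => ref);
}

// Made-up data: ref 100 appears twice.
const duplicates = findDuplicateRefs([
  { tags: { highway: "bus_stop", ref: "100" } },
  { tags: { highway: "bus_stop", ref: "100" } },
  { tags: { highway: "bus_stop", ref: "200" } },
  { tags: { highway: "bus_stop" } },
]);
if (duplicates.length > 0) {
  console.log("Refusing to run, duplicated refs: " + duplicates.join(", "));
}
```

A non-empty result is the signal to stop and let a mapper resolve the duplicates manually before any merge is attempted.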
The changeset history/log was moved to a dedicated page.
Road dragging bug
If a bus stop is part of a road, the script may accidentally drag the road if the stop is moved. I am currently resolving this manually by finding these stops in advance and separating them from ways, prior to running the script.
An Overpass query is performed, and then the following JOSM filters are used to mark the problem nodes. I then manually detach stops from them. This is not a common occurrence in Israel, so manual fixing currently does not take much time.
-child type:way -highway=bus_stop
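The same check can also be scripted. The sketch below is not part of the author's workflow; it assumes input shaped like the `elements` array of Overpass JSON output (nodes with `id` and `tags`, ways with a `nodes` array of node ids), and `findStopsOnWays` is a hypothetical name.

```javascript
// Returns bus stop nodes that are also members of at least one way.
function findStopsOnWays(elements) {
  const nodeIdsInWays = new Set();
  for (const el of elements) {
    if (el.type === "way") {
      for (const id of el.nodes) nodeIdsInWays.add(id);
    }
  }
  return elements.filter(
    (el) =>
      el.type === "node" &&
      el.tags !== undefined &&
      el.tags.highway === "bus_stop" &&
      nodeIdsInWays.has(el.id)
  );
}

// Made-up data: node 1 is a bus stop glued to a road, node 2 is standalone.
const problemStops = findStopsOnWays([
  { type: "node", id: 1, tags: { highway: "bus_stop" } },
  { type: "node", id: 2, tags: { highway: "bus_stop" } },
  { type: "node", id: 3 },
  { type: "way", id: 10, nodes: [1, 3], tags: { highway: "residential" } },
]);
```

The returned nodes are the ones that would drag a way along if moved, so they are the candidates for manual detaching.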
Documentation lagging behind in:
- fixme codes
Possible future changes
- special treatment for name - ar or he
- Grab more data from the gtfs files
- Provider feedback loop (conflict log files?)