User:SafwatHalaby/scripts/gtfs

From OpenStreetMap Wiki
Annotated Aerial view of bus stops
Result of a local incremental update test (JOSM). Red: pre-update. Green: post-update.

This script is a custom-made import tool for Israel, but it can easily be adapted to other countries. It lets us continuously update bus stops based on Ministry of Transportation data. Only the bus stops that have changed since the last import are edited. The script removes, adds, and updates stops wherever needed.

If you are a mapper, you should read the "Information for mappers" section.

For some tags, the script resolves conflicts by assuming that the most recent edit is the correct one, be it an OSM mapper edit or a GTFS provider edit. For other tags, the script assumes GTFS is authoritative and always overrides mapper edits. The "recent is better" assumption keeps the code and the workflow simple: the script doesn't need a database, and no manual conflict resolution is required.

Since GTFS has no last-modified timestamps, deducing the most recent edit requires some file comparison tricks which are the core of this script. See the technical information section for details. As far as I know, this is the only OSM script employing this trick.

Script pages:

This script is part of my scripting project. (Click for contact details, other scripts, overview, etc.).

Information for mappers

This script auto-edits bus stops based on Israel's Ministry of Transportation (MOT) data, but mappers are allowed to make some changes. The script operates in Israel and West Bank area C.

As of 2018, I am running the script at irregular intervals. A version with fully automated runs is being investigated.

Forbidden bus stop edits

The following tags should NOT be edited by a mapper. If they are edited, the script simply overrides them and/or warns me about them.

If you find a mistake in any of the above tags for a stop, please contact the Ministry of Transportation (MOT). Many apps rely on MOT data, and this will help fix the error everywhere, and not just in OSM. Once MOT fixes their dataset, the fixes will be automatically copied to OSM.

You can change anything else, including adding and deleting stops. See the sections below for specific details.

Allowed bus stop edits

Anything not mentioned in the forbidden section above can be edited freely. Below are specific details.

Editing softly updated tags

You may edit the following tags. The script will respect the most recent change. In other words, it will not override your edits unless MOT publishes a more recent edit in the future.

  • location (latitude, longitude) - Exception: Changes of less than 3m are assumed to be a mistake and are undone.
  • level=* - What floor is the platform on?
  • addr:street - This should only be edited if the value has a slight discrepancy with an existing OSM street. E.g. you may modify "Hagana" to "haHagana" if the nearby OSM street name is "haHagana". In case of a semantic difference, you should contact MOT instead!
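
The 3m exception for location edits can be sketched roughly as follows. This is only an illustration: the function names, the haversine distance formula, and the idea of comparing a "before" and "after" position are my assumptions here, and the real script may measure the move differently.

```javascript
// Hypothetical sketch: decide whether a location change is small enough
// to be treated as a mistake. Names and formula are assumptions.
const EARTH_RADIUS_M = 6371000;

// Great-circle distance in meters between two {lat, lon} points (haversine).
function distanceMeters(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// True if the move is below the threshold (3m by default) and would be undone.
function isNegligibleMove(before, after, thresholdMeters = 3) {
  return distanceMeters(before, after) < thresholdMeters;
}
```

For example, moving a stop by 0.00001 degrees of latitude is roughly one meter, so it would count as negligible, while 0.0001 degrees (roughly eleven meters) would be honored.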

Editing ignored tags

You may edit any other tag freely. The script never edits any tag not listed so far. Here are some common bus stop tags:

  • shelter=yes/no - Is the stop sheltered?
  • bench=yes/no - Does it have a sitting bench?
  • wheelchair=yes/no - Is the stop accessible to wheelchairs?
  • bin=yes/no - Does it have a trash bin?
  • Any additional tag not in the lists above can be edited freely.

Adding or removing stops

  • You may delete a bus stop. It will not be re-added by the script unless MOT changes any of its values.
  • You may add a new stop. You are strongly advised to add the "ref" tag if you know it. It is the number found on the yellow bus stop signs. Please do NOT add a "source=israel_gtfs" tag. If the stop ever appears in MOT's data, the script will "adopt" it, apply the MOT data, and add "source=israel_gtfs".

Limited name tag editing

You cannot freely edit name tags, but you may add a missing "name:ar" or "name:en". You may also copy "name:ar" or "name:he" to "name" to change the language of the "name" tag. See the multilingual handling section for details regarding this.

As of June 2018, mapper name tag edits are always overridden by the Ministry of Transportation (MOT) version if present. The reasoning behind this is that in Israel, the official stop name is the name in the MOT GTFS files. It is the name used in other applications and services, bus voice systems, and stops with digital screens. Therefore, the OSM stop name should be identical to the MOT name, even if an OSM mapper thinks it's a bad name. Improper names should be patched upstream at MOT, and they actively maintain the names. The same reasoning applies to all MOT-overridden tags. See the "Requesting upstream changes" section for MOT contact details.

Requesting upstream changes

If you found a mistake in a tag that you cannot edit, you can send an E-mail to MOT (ptsupport@mot.gov.il), and they may change it. Once it changes upstream, the script will use the new value.

Adding alt_name for on the ground names

If the physical name on the bus stop sign differs from the MOT name, you can add the physical name to alt_name=*. You are also advised to add a note=* indicating this difference. If the difference is nontrivial and/or problematic, you can also notify MOT (ptsupport@mot.gov.il).

Multilingual handling

The Ministry of Transportation (MOT) has Hebrew, Arabic, and English versions of bus stop names. The Hebrew name is always available. The Arabic and English translations are available for most stops but missing for some. "name:ar", "name:he", and "name:en" are fetched from MOT data.

If MOT does not have a "name:ar" or a "name:en" for a certain stop, the script does not touch that tag and a mapper can edit it freely. If MOT does have a translation, then the script will always override mapper edits for that language. This approach has a minor drawback: A mapper cannot tell if "name:en"/"name:ar" originate from MOT or from a mapper-added translation without looking at the edit history.

Deciding the language of the "name" tag

By default, the script copies "name:he" to "name". This is OK most of the time, but according to global and local conventions, the "name" tag should be in the language most commonly used in an area, and some areas have an Arabic speaking majority. So the following mechanism was implemented:

The script will usually copy "name:he" to "name", but if "name" already has Arabic characters, and "name:ar" is MOT-provided, then "name:ar" is copied to "name" instead. This means that you can switch the "name" from Hebrew to Arabic or vice versa by copying "name:ar" or "name:he" to "name". The script will honor the language of the "name" tag in its next updates.

If "name:ar" is not MOT-provided and "name" has Arabic characters, the script will not update the "name" tag.

If someone puts an English name in the "name" tag (or any language other than Hebrew or Arabic), the script will switch it to Hebrew.
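
The rules above for choosing the language of the "name" tag can be summarised in a short sketch. The helper name chooseName and the shape of the motNames object are made up for illustration; the actual script may organise this differently.

```javascript
// Hypothetical sketch of the "name" language rules described above.
// Arabic letters live in the Unicode block U+0600 to U+06FF.
const ARABIC_CHARS = /[\u0600-\u06FF]/;

// osmName:  current value of the OSM "name" tag (may be undefined).
// motNames: { he: string, ar?: string }, names provided by MOT.
function chooseName(osmName, motNames) {
  const nameIsArabic = ARABIC_CHARS.test(osmName || "");
  if (nameIsArabic) {
    // Keep Arabic as the language of "name". Refresh it from MOT if MOT
    // provides "name:ar"; otherwise leave the mapper's value untouched.
    return motNames.ar !== undefined ? motNames.ar : osmName;
  }
  // Hebrew, English, empty, or anything else: use the Hebrew name.
  return motNames.he;
}
```

With this sketch, a stop whose "name" is in English would get the Hebrew MOT name on the next run, and a stop whose "name" is in Arabic keeps an Arabic "name".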

Changeset history

The changeset history was moved to a dedicated page.

Technical information

The Ministry of Transportation (MOT) publishes new GTFS updates nightly, but before this script the dataset had last been imported to OSM in 2012, which left OSM bus stops very out of date. The introduction of this script has fixed this.

The script consumes GTFS files and bus stops downloaded via Overpass, and then manipulates the map accordingly. Currently, only stops.txt and translations.txt are processed. No routes are added or removed.

Data Source

The GTFS files are fetched from the MOT FTP server. The data is high quality and very accurate, and is used nationally throughout Israel for all public transportation.

Links:

Usage permission

We are allowed to freely use the data. See this and this forum posts.

Tag update algorithm

The script requires two GTFS files as inputs: the "old file" and the "new file". On the next run, the previous "new file" becomes the "old file", and a freshly downloaded file becomes the "new file". Files are fetched from http://he.mot.gov.il/. A bash script runs prior to the main JS code; it moves "new" to "old" and downloads a replacement for "new".

Tags are divided into 3 lists: gMostRecent, gOverride, gAlwaysAdd

For tags in the "gMostRecent" list: If a piece of data (bus stop tag value or coordinate) has changed between the old gtfs file and the new one, then it is applied to OSM, otherwise it is not. This means that mapper edits are not overridden as long as they are the most recent data. But if a provider ever updates a stop, it overrides mapper edits if present, because it's more recent.

For tags in the "gOverrideList": Data from the new gtfs file always overrides OSM data.

gAlwaysAdd holds constant tags that are always added or overwritten, namely source=israel_gtfs, highway=bus_stop, public_transport=platform, bus=yes.

Tags not in any list are never touched and can be freely edited by mappers.
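
A minimal sketch of how the three lists could drive a tag update is shown below. The contents of gMostRecent and gOverride are my guesses based on the tags discussed on this page, and updateTags is a hypothetical function name; only the gAlwaysAdd values are taken from the text above.

```javascript
// Assumed list contents, for illustration only.
const gMostRecent = ["level", "addr:street"];
const gOverride = ["name:he", "name:ar", "name:en"];
const gAlwaysAdd = {
  source: "israel_gtfs",
  highway: "bus_stop",
  public_transport: "platform",
  bus: "yes",
};

// oldGtfs / newGtfs: tag values derived from the old and new GTFS files.
// osmTags: the tags currently on the OSM stop. Returns the updated tags.
function updateTags(oldGtfs, newGtfs, osmTags) {
  const result = { ...osmTags };
  for (const key of gMostRecent) {
    // Apply GTFS data only if the provider changed it since the last run.
    if (key in newGtfs && oldGtfs[key] !== newGtfs[key]) {
      result[key] = newGtfs[key];
    }
  }
  for (const key of gOverride) {
    // GTFS always wins, but only for values GTFS actually provides.
    if (key in newGtfs) result[key] = newGtfs[key];
  }
  // Constants are always added or overwritten.
  Object.assign(result, gAlwaysAdd);
  return result;
}
```

Tags such as shelter=* are in none of the lists, so they pass through unchanged.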

Here are the rules governing gMostRecent updates. (X,Y,Z are different versions of some piece of data e.g. tag value / coordinates):

data in old file | data in new file | data in OSM | action               | notes
-----------------|------------------|-------------|----------------------|------
X                | Y                | X           | Change OSM data to Y | GTFS data (Y) is more up to date.
X                | X                | Y           | Nothing              | Mapper data (Y) is more up to date.
X                | X                | X           | Nothing              | Nothing has changed.
N/A              | N/A              | X           | Nothing              | This tag is not present in the GTFS files (e.g. shelter=*).
X                | Y                | Z           | Change OSM data to Y | It's impossible to tell which is newer, but we choose to trust the provider. If the mapper insists, they can re-apply Z, and the next update would see Y, Y, Z, meaning the script won't override again (same pattern as row 2).
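
The rows above boil down to one comparison: the GTFS value is applied only when it differs between the old and the new file. A small sketch of that rule for a single piece of data follows; the function name and the returned object shape are hypothetical.

```javascript
// Hypothetical sketch of the gMostRecent rule for one piece of data.
// undefined stands for "not present in the GTFS files" (N/A).
function resolveMostRecent(oldGtfs, newGtfs, osm) {
  if (newGtfs === undefined) {
    return { value: osm, action: "nothing" }; // tag unknown to GTFS
  }
  if (oldGtfs === newGtfs) {
    return { value: osm, action: "nothing" }; // provider did not change it
  }
  if (osm === newGtfs) {
    return { value: osm, action: "nothing" }; // OSM already has the new value
  }
  return { value: newGtfs, action: "change" }; // provider edit is newer
}
```

Note that no timestamps are involved: "newer" is inferred purely from whether the value differs between the two GTFS files.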

Adding and removing stops

Quite often, a bus stop is missing from one or more of the three columns. In that case the following algorithm is used:

Column 1: Old GTFS
Column 2: New GTFS
Column 3: OpenStreetMap

For each bus stop that has a reference (ref tag), find out in which columns it exists and in which it doesn't.
If multiple bus stops have the same reference in any column, the script halts and I must fix this manually.
Exception: platforms (ratzefeem) sometimes have duplicated refs in the database; these are merged into one.

X       : A single bus stop with that reference exists in that column
-       : No bus stop with that reference exists in that column
=>      : action to be taken

1 2 3
- - -  => N/A
- - X  => Delete if it has source=israel_gtfs. Otherwise it's a stop not introduced by the GTFS import; do nothing.
- X -  => Create.
X - -  => Nothing.
X X -  => Nothing if the stops in col1 and col2 are identical, create otherwise.
- X X  => Update.*
X - X  => Delete.
X X X  => Update.*

*For the updating logic, each tag is handled individually based on which list it belongs to. See the previous section.
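
The table above can be written as a small decision function. This is a sketch under assumptions: stops are plain objects, a missing stop is represented by null, and the names decideStopAction and stopsEqual are made up for illustration.

```javascript
// Shallow comparison of two GTFS stop records (flat objects).
function stopsEqual(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const key of keys) {
    if (a[key] !== b[key]) return false;
  }
  return true;
}

// oldStop / newStop: the stop in the old / new GTFS file, or null.
// osmStop: the OSM stop as { tags: {...} }, or null.
// Returns "create", "update", "delete" or "nothing".
function decideStopAction(oldStop, newStop, osmStop) {
  const inOld = oldStop != null;
  const inNew = newStop != null;
  const inOsm = osmStop != null;

  if (inNew && inOsm) return "update"; // - X X and X X X
  if (inNew) {
    if (!inOld) return "create"; // - X -
    // X X -: a mapper deleted it; re-create only if GTFS changed it since.
    return stopsEqual(oldStop, newStop) ? "nothing" : "create";
  }
  if (inOsm) {
    if (inOld) return "delete"; // X - X
    // - - X: only delete stops that the import itself introduced.
    return osmStop.tags.source === "israel_gtfs" ? "delete" : "nothing";
  }
  return "nothing"; // X - - and - - -
}
```

The "update" result only says that the stop should be inspected; what actually changes is decided tag by tag, as noted in the footnote above.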

Assumptions/drawbacks

The script makes 2 assumptions. If these assumptions are broken for your use case, it could be a bad idea to deploy it. The assumptions hold in Israel.

Note: The third assumption was eliminated (was: GTFS file has no stale stops)

Most recent change is the correct change

No one is perfect and this assumption will inevitably be violated from time to time by mappers or by the GTFS provider. But if your GTFS provider is unreliable such that this assumption breaks often, then deploying this script is a bad idea. As for mappers, we assume that most mappers make good edits, and that most bad edits get caught.

The script is run frequently

Suppose the script runs once in 2010, and then never again until 2020 (either because the provider doesn't publish frequent GTFS updates, or because the script maintainer stops running it). Meanwhile:

  • The provider physically changes a stop and updates its internal bus stop data in 2012
  • In 2015, the provider physically changes the same stop again but forgets to update the internal data, but a mapper notices and updates the osm stop.
  • In 2020, when the script runs, the file will still contain an out of date 2012 value, and since the gtfs file does not have an internal "last-updated" field, that old 2012 value would be assumed to be a 2020 value, and would override the correct 2015 value.

Similar scenarios could occur on much shorter time spans. e.g.

  • Sunday: script runs
  • Monday: bus stop changes, provider GTFS edit
  • Tuesday: bus stop changes again, OSM mapper edit
  • Wednesday: New gtfs file published, script runs again. The script would think Monday edit is newer and override the Tuesday edit.

As long as GTFS files are published and the script is run much more frequently than an individual bus stop changes, the scenarios above are very unlikely to occur.

In Israel, new gtfs files are published nightly by the government.

Types of stops and post-run guarantees

When the script is finished, it is guaranteed that all stops in the new GTFS file will be on the map, except stops that were explicitly deleted by mappers and that haven't been updated in the GTFS since then (the "X X -" case). The stops won't necessarily have the location data which is present in the GTFS files, because mapper location changes are honored. The stops will have the most recent GTFS name; mapper name changes are not honored and are overridden. Tags unrelated to GTFS are never touched (e.g. wheelchair and shelter). All of these stops will have source=israel_gtfs and a ref. Other stops that are present are also tolerated:

  • stops with a ref but with no source=israel_gtfs, added independently by mappers, and their ref is not present in the gtfs file. Ignored by the main script. (If the ref is present in the gtfs file, source=israel_gtfs is added, and stop is updated).
  • stops without a ref are ignored by the main script.

All stops will have a unique ref after the run. If prior to the run the refs are not unique, the script will refuse to run and ask me to intervene.
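
The uniqueness check before a run could look roughly like this. The function names are hypothetical, and stops are assumed to be objects with a ref property.

```javascript
// Returns the list of refs that occur more than once in the given stops.
function findDuplicateRefs(stops) {
  const counts = new Map();
  for (const stop of stops) {
    counts.set(stop.ref, (counts.get(stop.ref) || 0) + 1);
  }
  return [...counts].filter(([, n]) => n > 1).map(([ref]) => ref);
}

// Refuses to continue (by throwing) if any ref is duplicated.
// label names the data set being checked, e.g. "OSM" or "new GTFS".
function assertUniqueRefs(stops, label) {
  const duplicates = findDuplicateRefs(stops);
  if (duplicates.length > 0) {
    throw new Error(
      `${label}: duplicate refs, manual fix needed: ${duplicates.join(", ")}`
    );
  }
}
```

Running such a check on each of the three inputs before touching anything keeps a bad data set from producing a partial update.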

source=israel_gtfs exists? | ref exists? | touched by which algorithm? | Possible actions
---------------------------|-------------|-----------------------------|-----------------
yes                        | yes         | gtfs.js                     | delete/modify/do nothing (script-created stops always start in this category)
no                         | yes         | gtfs.js                     | Same as above. Also, add source=israel_gtfs if modified.
no                         | no          | space.js                    | delete (<50m from a ref) / add fixme tag (<50m from a ref, has extra tags) / do nothing (>=50m from a ref)
yes                        | no          | none                        | Nothing. Manual intervention required. This is not a normal stop.
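
As a sketch, the classification in the table above depends only on two tags. The function name handlerFor is hypothetical; the returned strings simply mirror the third column of the table.

```javascript
// Given the tags of an OSM bus stop, return which algorithm handles it,
// following the table above. "none" means manual intervention is needed.
function handlerFor(tags) {
  const hasSource = tags.source === "israel_gtfs";
  const hasRef = tags.ref !== undefined && tags.ref !== "";
  if (hasRef) return "gtfs.js"; // with or without source=israel_gtfs
  return hasSource ? "none" : "space.js";
}
```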

Secondary scripts

getAndParse.sh

A bash script which handles switching "new" to "old" and downloading a new "new" dataset. It also cuts uninteresting fields from stops.txt. I run it prior to running the main script.

gtfs_bootstrap.js

Reminder: The script works by comparing the "old" and the "new" GTFS datasets. In the following run, the previously "new" dataset becomes "old" and a fresh dataset becomes the "new" dataset.

But what do we do if we have no old dataset?

This script reconstructs the "old" gtfs file based on Overpass Attic data or historic OSM data. It is used for recovering a lost "old" file. For instance, I had no way to find the original stops.txt file used in the old 2012 Israel import. This script recreated it.

I should better document the usage steps some day.

Duplicate removal algorithm

spaces.js can clean up duplicated stops. The documentation has been removed because it is outdated. The script is currently not used, but it was used extensively (in various forms and versions) during the initial import to clean up stops already added manually by mappers.

Visual desync output

No longer used.

Known Bugs

Road dragging bug

If a bus stop is part of a road, the script may accidentally drag the road if the stop is moved. I am currently resolving this manually by finding these stops in advance and separating them from ways, prior to running the script.

An Overpass query is performed, and then the following JOSM filters are used to mark the problem nodes. I then manually detach stops from them. This is not a common occurrence in Israel, so manual fixing currently does not take much time.

-child type:way
-highway=bus_stop

Multiple stations with identical "ref"

The script assumes each stop has a unique "ref". In Israel, this is true with one exception: Some stops in central bus stations have two GTFS entries with completely identical data except for the "platform/רציף" field. Currently, only the first of these entries is handled. This will be fixed soon (It should be trivial to create an aggregated "רציף" value for one single stop. e.g. platformNumber=1-3, 4).

The script cannot handle multiple stations with an identical "ref" tag. It stops and asks for my manual intervention.
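
The aggregation idea mentioned above (one stop per ref, with a combined platform value) could be sketched like this. It is not what the script currently does; the function name, the entry shape and the ", " separator are assumptions for illustration.

```javascript
// Hypothetical sketch: merge GTFS entries that share a ref into one entry
// whose "platform" value lists all distinct platform values.
// Entries are assumed to look like { ref, platform, ...otherFields }.
function mergeByRef(entries) {
  const merged = new Map();
  for (const entry of entries) {
    const existing = merged.get(entry.ref);
    if (!existing) {
      // First entry for this ref: keep its fields and start a platform list.
      merged.set(entry.ref, { ...entry, platforms: [entry.platform] });
    } else if (!existing.platforms.includes(entry.platform)) {
      existing.platforms.push(entry.platform);
    }
  }
  // Replace the temporary list with a single joined "platform" string.
  return [...merged.values()].map(({ platforms, ...rest }) => ({
    ...rest,
    platform: platforms.join(", "),
  }));
}
```

With this sketch, two entries for the same ref with platform values "1-3" and "4" would become a single entry with platform "1-3, 4". All other fields are taken from the first entry, which is only safe if the entries really are identical apart from the platform.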

Other things

  • Currently, railway stops are ignored by inspecting the description and finding the sentence: "רחוב: מסילת ברזל".
  • Street address, street number, and level are parsed from the GTFS "description" field. MOT puts them inside the field in the form "x:y z:m...".
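
Parsing the "x:y z:m..." form of the description field could be sketched as below. The function names are hypothetical, and the sketch assumes that keys contain no spaces and that values contain no colons. Only the railway sentence "רחוב: מסילת ברזל" is taken from this page; any other key names in real data are not covered here.

```javascript
// Hypothetical sketch: split a "key: value key: value ..." description
// into an object. A value runs from its key's colon to the next key.
function parseDescription(desc) {
  const fields = {};
  const matches = [...desc.matchAll(/([^\s:]+):/g)];
  matches.forEach((match, i) => {
    const start = match.index + match[0].length;
    const end = i + 1 < matches.length ? matches[i + 1].index : desc.length;
    fields[match[1]] = desc.slice(start, end).trim();
  });
  return fields;
}

// Railway stops are recognised by the street field "רחוב: מסילת ברזל".
function isRailwayStop(desc) {
  return parseDescription(desc)["רחוב"] === "מסילת ברזל";
}
```

Values may contain spaces (as street names usually do), which is why the sketch splits on key positions rather than on whitespace.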

Version history

Current version is v2.

V1 did not have an override list and couldn't parse information out of the description tag.