Automated edits/tjhorner-import


King County Metro Bus Stop Import

This import is intended to improve the quality of bus stop data in the region that King County Metro serves. Because of service changes, the quality of the current data varies widely. For example:

  • Some stops on OSM are no longer in service or have moved
  • New stops that have recently come into service are not on OSM
  • The tagging used on each stop is inconsistent throughout the region

There are additional issues, such as discrepancies between current KCM routes and the route relations on OSM, but these are outside the scope of this import, which focuses solely on the stops themselves.

KCM publishes a frequently-updated GTFS feed with high-quality data about their bus stops. This import will merge the KCM GTFS data with the existing bus stop data on OSM, taking care not to duplicate or disrupt human edits.

This import is a first step toward importing full KCM route data (see Washington/Public_Transport/King_County_Metro) and, hopefully, toward more automated maintenance of the data in the future.

Data License

King County has provided explicit permission for OpenStreetMap contributors to use "any King County-derived data"[1] for edits.

Planned Import Date

This import proposal was posted to the OSM Community Forum on August 3, 2024. Unless there are any outstanding issues, the import is planned for August 17, 2024.

Feedback

For feedback on this import plan, you can reach me in the OpenStreetMap Discord. My Discord username is tjhorner.

You can also report problems or submit suggestions for the import tool as a GitHub issue; this will help keep discussion organized.

Preparation of OSM Data

In preparation, I have engaged in some manual cleanup of the existing bus stop data to ensure the import works as intended. Here are some potential problems that were identified and resolved:

Import Tool

To facilitate this import, and hopefully future maintenance of GTFS data, I am creating a web-based tool called GTFS Janitor to help automate the process of matching the GTFS data to OSM data. The tool accepts a GTFS zip, runs the matching algorithm described below, assists the user in disambiguating matches, and outputs an osmChange file that can be reviewed in an external editor.
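As a rough illustration of the first step (reading stops out of a GTFS zip), here is a minimal Python sketch. It is not taken from GTFS Janitor's source; the function name read_gtfs_stops is hypothetical. The file name stops.txt and its columns (stop_id, stop_code, stop_name, stop_lat, stop_lon) come from the GTFS Schedule specification.

```python
import csv
import io
import zipfile


def read_gtfs_stops(zip_bytes: bytes) -> list[dict]:
    """Return the rows of stops.txt from a GTFS zip as dictionaries.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open("stops.txt") as raw:
            # GTFS files are UTF-8 and may begin with a byte-order mark,
            # which the "utf-8-sig" codec strips.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))
```

Every value comes back as a string, so coordinates need an explicit float() conversion before any distance calculation.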

Eventually it will have its own wiki page, but for now you can find it on GitHub: https://github.com/tjhorner/gtfs-janitor

Matching Strategy


The specifics of the matching algorithm are subject to change. I will try to keep this section up-to-date, but you can visit the linked source files for the authoritative implementation of each part.

The importer will first query the Overpass API for any existing bus stops in the bounds of the GTFS stops. The query[6] looks something like this:

(
  node[~"^(.+:)?highway$"~"^bus_stop$"];
  node[~"^(.+:)?railway$"~"^tram_stop$"];
  node[~"^(.+:)?amenity$"~"^ferry_terminal$"];
  node[~"bus|tram|ferry|trolleybus"~"^yes$"];
)->.ptStops;

way(bn.ptStops)["highway"~"motorway|trunk|primary|secondary|tertiary|unclassified|residential|service"]->.roadWaysWithStops;
node(w.roadWaysWithStops)->.stopsOnRoadWays;

(.ptStops; - .stopsOnRoadWays;);

out meta;
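
The query above does not show how it is limited to "the bounds of the GTFS stops". One way to do that, sketched here in Python under my own assumptions, is to compute a bounding box from the stop coordinates and prepend it as an Overpass QL global bbox setting (south, west, north, east). The function names and the padding value are hypothetical, not taken from the tool.

```python
def stops_bbox(stops, padding_deg=0.001):
    """Bounding box (south, west, north, east) around GTFS stop rows.

    padding_deg is an assumed margin so that OSM nodes slightly outside
    the outermost stops are still returned.
    """
    lats = [float(s["stop_lat"]) for s in stops]
    lons = [float(s["stop_lon"]) for s in stops]
    return (
        min(lats) - padding_deg,
        min(lons) - padding_deg,
        max(lats) + padding_deg,
        max(lons) + padding_deg,
    )


def bbox_setting(bbox):
    """Format a bounding box as an Overpass QL global bbox setting."""
    south, west, north, east = bbox
    return f"[bbox:{south:.6f},{west:.6f},{north:.6f},{east:.6f}];"
```

The resulting string would go on the first line of the query, before the first statement.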

It will filter the query results for candidate nodes using these conditions:

Then, for each stop in the GTFS data, it will conflate it with candidate nodes using the following strategies in this order:

  1. gtfs:stop_id or ref matches the GTFS stop ID or stop code[8]
    • If there is only one match but it's very far away (> 500m away), submit for human review
    • If there are multiple matches, narrow it down to the matches that are within 100 meters of the GTFS stop (to account for stops from other transit agencies that have the same ID, for example) and submit for human review
  2. name or ref matches the normalized[9] GTFS stop name[10]
    • If there is only one match but it's not super close (> 30m away), submit for human review
    • If there are multiple matches, narrow it down to the matches that are within 100 meters of the GTFS stop
      • If there is a node within 10 meters of the GTFS stop, match with that one automatically
      • If there is not, submit for human disambiguation
  3. Any candidate node that is within 100 meters of the GTFS stop location[11]
    • If the closest match is 10 meters or closer to the GTFS stop, automatically match that one regardless of the number of matches
    • If there is only a single node and it is 30 meters or closer to the GTFS stop, automatically match that one
    • Otherwise, present the node(s) for human disambiguation
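
To make the thresholds in strategy 3 concrete, here is a minimal Python sketch of the proximity rule. It is an illustration, not the tool's implementation: the function names, the "match" / "review" / "none" labels, and the use of the haversine formula for distance are all my assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_by_proximity(stop_lat, stop_lon, candidates):
    """Apply the 100 m / 10 m / 30 m rules to a list of candidate nodes.

    Each candidate is a dict with "lat" and "lon". Returns a
    (decision, nodes) pair; the decision labels are hypothetical.
    """
    by_distance = sorted(
        ((haversine_m(stop_lat, stop_lon, c["lat"], c["lon"]), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    nearby = [(d, c) for d, c in by_distance if d <= 100]
    if not nearby:
        return "none", []
    closest_dist, closest = nearby[0]
    if closest_dist <= 10:
        # Very close node: match automatically, however many others are nearby.
        return "match", [closest]
    if len(nearby) == 1 and closest_dist <= 30:
        # A single, reasonably close node: match automatically.
        return "match", [closest]
    # Anything else goes to a human for disambiguation.
    return "review", [c for _, c in nearby]
```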
Screenshot of disambiguation UI

Once a match is made, the matched node is removed from the candidate pool so it's not matched against multiple stops. It will run this process in a loop until only ambiguous matches remain.
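
The loop described above might look something like the following Python sketch. This is my reading of the paragraph, not the tool's code: match_one is a hypothetical callable that applies the matching strategies to one stop and returns a ("match" | "review" | "none", nodes) pair, and the idea that removing a matched node can turn a previously ambiguous stop into an automatic match is an assumption about why the loop exists.

```python
def run_matching(stops, candidates, match_one):
    """Repeat matching passes until no pass produces a new automatic match.

    Returns (matched, ambiguous, unmatched, pool), where matched maps a
    GTFS stop_id to an OSM node id and pool holds the unused candidates.
    """
    pool = {node["id"]: node for node in candidates}
    matched = {}
    unmatched = []
    pending = list(stops)

    progress = True
    while pending and progress:
        progress = False
        ambiguous = []
        for stop in pending:
            decision, nodes = match_one(stop, list(pool.values()))
            if decision == "match":
                node = nodes[0]
                matched[stop["stop_id"]] = node["id"]
                # Remove the node so it cannot be matched to another stop.
                del pool[node["id"]]
                progress = True
            elif decision == "review":
                ambiguous.append(stop)
            else:
                unmatched.append(stop)
        # Only the ambiguous stops are retried on the next pass.
        pending = ambiguous
    return matched, pending, unmatched, pool
```

Stops left in the second return value are the ones that would be shown to the user for disambiguation.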

The remaining ambiguous matches are presented to the user, where they can decide what to do with each candidate node.

Automated Changes

Tags

In addition to the default tags that GTFS Janitor applies, these King County Metro-specific tags will be added:

Key                 Value
operator            King County Metro
operator:short      KCM
operator:wikidata   Q6411393
operator:wikipedia  en:King County Metro
gtfs:feed           US-WA-KCM

Deprecated tags gtfs:dataset_id and source will be removed. They are leftovers from previous GTFS imports and are superseded by gtfs:feed.
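
The tag changes above can be sketched in Python as a small pure function. The function name is hypothetical, and overwriting an existing operator value with the standard one is my assumption; the source only says the tags "will be added".

```python
KCM_TAGS = {
    "operator": "King County Metro",
    "operator:short": "KCM",
    "operator:wikidata": "Q6411393",
    "operator:wikipedia": "en:King County Metro",
    "gtfs:feed": "US-WA-KCM",
}

# Leftovers from previous GTFS imports, superseded by gtfs:feed.
DEPRECATED_KEYS = ("gtfs:dataset_id", "source")


def apply_kcm_tags(tags):
    """Return a copy of a node's tags with the KCM tags applied."""
    updated = {k: v for k, v in tags.items() if k not in DEPRECATED_KEYS}
    # Assumption: agency tags replace any existing values for the same keys.
    updated.update(KCM_TAGS)
    return updated
```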

public_transport=platform handling

Not every highway=bus_stop node should also carry public_transport=platform (for example, when the platform is mapped separately as a way). The import therefore adds public_transport=platform only to newly created nodes, not to existing nodes that it modifies. Deciding whether an existing highway=bus_stop should also have public_transport=platform is better left to human review because mapping styles vary, so a MapRoulette challenge was created for that task instead.

Location

GTFS data is sometimes imprecise, so we want to preserve the tweaks that human mappers make to more closely match the physical sign/platform. But bus stops also sometimes move (for example, down the road) while retaining the same stop ID or name, and we want to move the node accordingly when this happens. To account for both cases, the tool moves a matched node to the GTFS stop location only if the node is more than 100 meters from it; nodes within 100 meters keep their existing position.
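
Expressed as code, the heuristic is a single threshold check. This Python sketch assumes the distance between the node and the GTFS stop has already been computed in meters; the function and constant names are hypothetical.

```python
MOVE_THRESHOLD_M = 100.0


def resolve_position(node_latlon, stop_latlon, distance_m):
    """Return the (lat, lon) the matched OSM node should end up with."""
    if distance_m > MOVE_THRESHOLD_M:
        # Far enough apart that the stop has probably moved: follow GTFS.
        return stop_latlon
    # Otherwise keep the mapper's (possibly hand-tuned) position.
    return node_latlon
```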

Feedback on this strategy is encouraged, as this heuristic may not be the best for determining if a stop has actually moved in the real world. For example, should we instead add a tag requesting a check of the stop location, or make a MapRoulette challenge?

Out-of-Service Stops

After matching and disambiguation, the tool will check for stops that exist on OSM (using the previously-calculated candidate pool) but are out-of-service in the GTFS feed (i.e., they no longer exist in the feed).

Due to the potentially destructive nature of this edit, we must pay special attention to the following factors.

Physical Status

When a stop goes out of service in GTFS, it does not necessarily mean the physical sign post, platform, etc. are removed. Since OpenStreetMap data is meant to represent the state of the physical world, we do not want to actually remove the node in these cases. Instead, it's probably best to apply a lifecycle prefix such as disused:* to the highway and public_transport keys. This way the node still exists for human mappers to verify, and we don't lose the history if the stop ever comes back into service.
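
A minimal Python sketch of applying the lifecycle prefix follows. The function name is hypothetical, and leaving every other tag untouched is my assumption.

```python
LIFECYCLE_KEYS = ("highway", "public_transport")


def mark_disused(tags, prefix="disused"):
    """Return a copy of the tags with highway/public_transport prefixed."""
    updated = {}
    for key, value in tags.items():
        if key in LIFECYCLE_KEYS:
            # e.g. highway=bus_stop becomes disused:highway=bus_stop
            updated[f"{prefix}:{key}"] = value
        else:
            updated[key] = value
    return updated
```

Because the original values are kept under the prefixed keys, the change is easy to reverse if a stop returns to service.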

Matching

We don't want to erroneously modify stops that are actually in service, nor do we want to modify stops outside the jurisdiction of King County Metro. This means we should be pretty strict in how we match out-of-service stops. They should meet the following criteria:

  • Node was not matched to any stop from GTFS data
  • operator or network key exactly matches the transit agency
  • ref or gtfs:stop_id does not appear in the GTFS stop data as a stop ID or stop code

No further matching will be done. For example, a bare highway=bus_stop + public_transport=platform node will not be affected even though it does not match any of the stops in the GTFS data.
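
The criteria above could be combined into one predicate, sketched here in Python. The function and parameter names are hypothetical. One point the criteria leave open is a node that has neither ref nor gtfs:stop_id; this sketch conservatively leaves such nodes alone, which is my assumption rather than documented behavior.

```python
def is_out_of_service(node, matched_node_ids, agency_names, gtfs_ids):
    """Decide whether an OSM node should be treated as out of service.

    node: dict with "id" and "tags"
    matched_node_ids: ids of nodes already matched to a GTFS stop
    agency_names: exact operator/network values for the agency
    gtfs_ids: every stop ID and stop code present in the GTFS feed
    """
    tags = node["tags"]
    if node["id"] in matched_node_ids:
        return False
    if tags.get("operator") not in agency_names and tags.get("network") not in agency_names:
        return False
    refs = [tags[key] for key in ("ref", "gtfs:stop_id") if key in tags]
    if not refs:
        # Assumption: without an identifier there is nothing to check, so skip.
        return False
    return all(ref not in gtfs_ids for ref in refs)
```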

Human Review

To ensure correctness, this part of the import will be performed separately; that is, a separate osmChange file will be generated and every individual node inspected with more scrutiny by a human.
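
For readers unfamiliar with the format, here is a small Python sketch that writes modified nodes into an osmChange document using the standard library. It only illustrates the general shape of such a file (an osmChange root with a modify block containing node elements and their tags); the function name, the generator string, and the input dict layout are hypothetical, and the tool's real output may differ.

```python
import xml.etree.ElementTree as ET


def build_osmchange(modified_nodes, generator="example-sketch"):
    """Serialize modified nodes as an osmChange XML string.

    Each node is a dict with "id", "version", "lat", "lon" and "tags".
    """
    root = ET.Element("osmChange", version="0.6", generator=generator)
    modify = ET.SubElement(root, "modify")
    for node in modified_nodes:
        element = ET.SubElement(
            modify,
            "node",
            id=str(node["id"]),
            version=str(node["version"]),
            lat=f'{node["lat"]:.7f}',
            lon=f'{node["lon"]:.7f}',
        )
        # Sort tags so the output is stable and easy to diff during review.
        for key, value in sorted(node["tags"].items()):
            ET.SubElement(element, "tag", k=key, v=value)
    return ET.tostring(root, encoding="unicode")
```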

Quality Assurance

Here are some resources that can be used to ensure data integrity before and after the import.

Overpass Queries

MapRoulette Project

I created a MapRoulette project to address various errors in existing data: https://maproulette.org/browse/projects/56979

Notes and References