Proposal:GTFS Tagging Standard
|GTFS Tagging Standard|
|Proposal status:||Proposed (under way)|
|Definition:||Group objects of a GTFS feed and link to that feed|
|Vote start:||2023-11-30 (aborted)|
|Vote end:||2023-12-14 (aborted)|
I want to standardise the way of tagging GTFS.
This proposal consists of multiple parts:
- Make the gtfs:*=* namespace the standard, deprecate the gtfs_*=* namespace.
- Deprecate gtfs_id=* and gtfs:id=*, prefer tags with the GTFS column names
- Add a type=gtfs_feed relation that contains the entities of a GTFS feed. Its tags specify feed-related information like gtfs:url=*
- Propose gtfs:stop_id:(feed_code)=* for when a feature belongs to multiple feeds
How this proposal is formulated
Any requirements will be underlined and formulated using MUST, SHOULD, MAY, ... as they are defined in RFC 2119
These requirements are meant for implementation of validators, import tools and data consumers.
These requirements are interspersed with the reasoning for them.
I want apps that use OpenStreetMap, like OSMAnd, to be able to do proper public transit routing.
There have been many discussions on ways to include timetables in OSM.
However, these have all led to the same conclusion: timetables are too much data, and that data is too volatile.
Google developed GTFS (General Transit Feed Specification) to solve this problem for their products.
This is now an open standard, and pretty much all operators offer such a feed - often under a public license.
Therefore I believe that we should standardise a way to link to those feeds.
Furthermore, this proposal also wants to resolve the issues raised on the GTFS wiki page.
Background: How does GTFS work?
GTFS is a standard that describes how public transport agencies should publish timetables and other information so that they can be used by public transit applications. GTFS data is published as a zip file, containing multiple csv files. The transit agencies make this zip file available at a fixed URL - the GTFS feed. When things change in the network, agencies make a new zip and update the version in the feed.
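As a rough illustration of the "zip file containing multiple CSV files" structure, the following sketch reads one table from a feed using only the Python standard library. The helper name `read_gtfs_table` and the sample feed contents are assumptions made for this example; a real feed would be downloaded from the agency's URL instead of being built in memory.

```python
import csv
import io
import zipfile


def read_gtfs_table(feed_zip: bytes, filename: str) -> list[dict[str, str]]:
    """Return the rows of one CSV file inside a GTFS zip as dictionaries."""
    with zipfile.ZipFile(io.BytesIO(feed_zip)) as archive:
        with archive.open(filename) as raw:
            # utf-8-sig drops a byte order mark if the file starts with one
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


# Build a tiny feed in memory (made-up sample data, not from a real feed)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stops.txt", "stop_id,stop_name\nS1,Example Stop\n")

stops = read_gtfs_table(buffer.getvalue(), "stops.txt")
```

Each row becomes a dictionary keyed by the GTFS column names, which are the same names this proposal reuses for the gtfs:*=* keys.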
|stops.txt||Describes physical locations||stop_id||ID||Unique identifier for a location|
|stop_code||Text||Public-facing identifier for a location|
|stop_name||Text||Name of the location|
|stop_lat||Float||Location, for stops on the pole.|
|location_type||Enum||0: stop / platform; 1: station (contains multiple stops/platforms); 2: entrance / exit; 3: generic node; 4: boarding area|
|parent_station||ID||the station it belongs to|
|platform_code||Text||Identifier for the platform|
|routes.txt||Describes a service
OSM equivalent: route_master
|route_id||ID||Unique identifier for a route|
|agency_id||ID||Id of the agency running this route (agency.txt)|
|route_short_name||Text||Short identifier of the route - e.g. bus number|
|route_long_name||Text||Full name of route, often with destinations|
|route_type||Enum||What sort of transportation is used (bus/train/...)|
|trips.txt||Describes a sequence of stops, at a particular time.
In OSM, a route (variant) refers to a sequence of stops. Thus the difference with a trip is the inclusion of time.
|route_id||ID||The route the trip belongs to|
|service_id||ID||The days of operation for the trip (calendar.txt)|
|trip_id||ID||Unique identifier for the trip|
|trip_headsign||Text||Displayed destination of the trip|
|trip_short_name||Text||Public facing text to identify the trip|
|direction_id||Enum||0/1, distinguish direction of trips|
|shape_id||ID||The path the vehicle travels (shapes.txt)|
|shapes.txt||The path that a vehicle travels
This is closer to a route variant as it is in OSM. Does not include the sequence of stops.
|shape_id||ID||Unique identifier for a shape|
|shape_pt_sequence||Int||Place of point in the shape|
|shape_pt_lat||Float||Position of a point|
|stop_times.txt||Describes the times that a trip stops at each stop||trip_id||ID||The trip this stop time is for|
|stop_id||ID||The stop this stop time is for|
|stop_sequence||Int||Place of stop in trip|
|arrival_time||Time||Time of arrival/departure at this stop|
|pickup_type||Enum||Whether pickup or dropoff is available (0: yes, 1: no, 2: call required, 3: ask driver)|
The examples below indicate how we could make use of both OSM and GTFS, and what we need for that.
Example: how to find departure times for a particular route and stop
- Download the feed and unzip it
- Determine the right row in stops.txt and remember its stop_id
- Determine the right row in routes.txt and remember its route_id
- Find all trips with that route_id and remember their trip_id
- Find all stop times with the stop_id and one of those trip_ids to get the departure times
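The steps above can be sketched in a few lines of Python. This is a minimal sketch on made-up sample tables: it assumes the stop_id and route_id are already known, and it ignores service_id and calendar.txt, so it returns the departures of all service days together. Note that stop_times.txt has no route_id column, so the last step goes through the trip_id values of the route.

```python
import csv
import io

# Made-up sample tables; a real feed would be read from the downloaded zip
TRIPS = "route_id,service_id,trip_id\nR10,WK,T1\nR10,WK,T2\nR11,WK,T3\n"
STOP_TIMES = (
    "trip_id,stop_id,stop_sequence,departure_time\n"
    "T1,S1,1,08:00:00\nT1,S2,2,08:10:00\n"
    "T2,S1,1,08:30:00\nT2,S2,2,08:40:00\n"
    "T3,S1,1,09:00:00\n"
)


def rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def departures(stop_id: str, route_id: str) -> list[str]:
    """Departure times at one stop for all trips of one route."""
    # Find all trips with that route_id and remember their trip_id
    trip_ids = {t["trip_id"] for t in rows(TRIPS) if t["route_id"] == route_id}
    # Keep the stop times of those trips at the requested stop
    return sorted(
        st["departure_time"]
        for st in rows(STOP_TIMES)
        if st["stop_id"] == stop_id and st["trip_id"] in trip_ids
    )


times = departures("S1", "R10")
```

Sorting the strings works here because the sample times are zero-padded HH:MM:SS values.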
If we want to initiate this from OSM, we need the following information in OSM:
- Where can we download the feed?
- Enough information about the stop to determine the right row in stops.txt
- Enough information about the route to determine the right row in routes.txt
Example: how to find the route variants for these departure times
- In OSM: look at all route variants of the route we used
- See if they match with one of the trips
- In OSM: look at all route variants that contain the stop
- See if they match with one of the trips
So for this we need enough information to see if a trip matches with a route variant.
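One possible way to decide whether a trip matches a route variant is to compare ordered stop sequences. The sketch below assumes that the stops of the OSM route variant have already been translated to GTFS stop_id values (for example through their gtfs:stop_id=* tags); the function names and the sample data are hypothetical.

```python
def trip_stop_sequence(stop_times: list[dict[str, str]], trip_id: str) -> list[str]:
    """Ordered stop_id values of one trip, sorted by stop_sequence."""
    trip_rows = [r for r in stop_times if r["trip_id"] == trip_id]
    trip_rows.sort(key=lambda r: int(r["stop_sequence"]))
    return [r["stop_id"] for r in trip_rows]


def matching_trips(variant_stop_ids: list[str], stop_times: list[dict[str, str]]) -> list[str]:
    """trip_id values whose stop sequence equals that of the route variant."""
    trip_ids = sorted({r["trip_id"] for r in stop_times})
    return [t for t in trip_ids if trip_stop_sequence(stop_times, t) == variant_stop_ids]


# Made-up stop_times rows: T1 runs S1 -> S2 -> S3, T2 runs the reverse
stop_times = [
    {"trip_id": "T1", "stop_id": "S1", "stop_sequence": "1"},
    {"trip_id": "T1", "stop_id": "S2", "stop_sequence": "2"},
    {"trip_id": "T1", "stop_id": "S3", "stop_sequence": "3"},
    {"trip_id": "T2", "stop_id": "S3", "stop_sequence": "1"},
    {"trip_id": "T2", "stop_id": "S2", "stop_sequence": "2"},
    {"trip_id": "T2", "stop_id": "S1", "stop_sequence": "3"},
]

forward = matching_trips(["S1", "S2", "S3"], stop_times)
```

An exact comparison is the strictest choice; a data consumer may prefer a looser match when OSM and the feed disagree on a single stop.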
We have the following list of things that this proposal should achieve:
- A way to discover where to download the GTFS feed
- A way to determine the right row in a GTFS file for stops, stations, route variants and routes.
- Resolve the namespace that is used for GTFS tagging; currently both gtfs_*=* and gtfs:*=* are in use.
- Create a way to handle features that are part of multiple feeds.
I propose to make the gtfs:*=* namespace the standard.
I prefer the syntax since it is clearer that it is a namespace.
A comparison of usage for both namespaces can be found further down in this proposal.
I suspect that a decision either way will be the same amount of work.
All tags about GTFS (General Transit Feed Specification) MUST be placed in the gtfs:*=* namespace
Tags MAY use underscores after the namespace prefix
Tags SHOULD use GTFS column names where applicable
This last rule also makes it clear which file of the GTFS feed is referred to; most columns start with the object's type.
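For import tools and mechanical edits, moving keys from the old namespace to the new one is a pure prefix rewrite. The following is a sketch under the assumption that a gtfs_X key always maps to gtfs:X; the function names are hypothetical. The deprecated gtfs_id=* cannot be converted automatically, because the right column name depends on the object, so the sketch only flags it for review.

```python
DEPRECATED_KEYS = {"gtfs:id"}  # result of migrating gtfs_id, needs manual review


def migrate_key(key: str) -> str:
    """Rewrite a gtfs_* key to the gtfs:* namespace, leave other keys alone."""
    if key.startswith("gtfs_"):
        return "gtfs:" + key[len("gtfs_"):]
    return key


def migrate_tags(tags: dict[str, str]) -> dict[str, str]:
    """Migrate all keys; refuse to merge two keys with different values."""
    migrated: dict[str, str] = {}
    for key, value in tags.items():
        new_key = migrate_key(key)
        if migrated.get(new_key, value) != value:
            raise ValueError(f"conflicting values for {new_key}")
        migrated[new_key] = value
    return migrated


def needs_review(tags: dict[str, str]) -> bool:
    """True when a deprecated key is left after migration."""
    return any(key in DEPRECATED_KEYS for key in tags)


new_tags = migrate_tags({"gtfs_stop_id": "123", "name": "Example Stop"})
```

Underscores after the prefix are kept as they are, which matches the rule that tags MAY use underscores after the namespace prefix.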
To find the feed from OSM, we need a place to tag the properties of the feed.
Later on I will require that some features must be a member of the feed relation.
The GTFS feed relation can have the following tags
|type=gtfs_feed||type=network||Required||Relations that can act as GTFS feed||gtfs_feed|
|gtfs:url=*||Required||The URL of this feed.
As recommended by the GTFS standard, there should be a fixed, publicly accessible URL to the latest feed. A URL that achieves this by an HTTP redirect is also allowed. If there is no fixed URL, point the operator to the GTFS
The URL SHOULD always point to the latest version
The URL MUST start with the protocol
|gtfs:feed=*||Required||A code for the feed, to distinguish between feeds in tags
lowercase version MUST be unique among all GTFS feed relations
|gtfs:release_date=*||Recommended||MUST be included when the URL is not a fixed URL to the latest version||2023-10-30|
|name=*||gtfs:name=*||Recommended||The name of the feed||OV Api Netherlands|
|ref=*||gtfs:ref=*||Optional||An official code for the feed|
|operator=*||gtfs:operator=*||Optional||The organisation that manages the feed||Stichting OpenGeo|
There exists an established pattern for gtfs:feed=*.
- UPPERCASE code for the main extent of the feed
- Title-case operator abbreviation / descriptor
I suggest following this pattern, but it is not required.
However, to be able to consistently lowercase this value for use as part of a key:
The value of gtfs:feed=* MUST use only printable ASCII characters (0x21 (!) - 0x7E (~))
Therefore accents and foreign scripts should be replaced with latinized forms (so ö becomes oe)
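A validator could check both constraints on gtfs:feed=* roughly as follows. This is a sketch: the function names and the sample codes are made up for illustration, and collecting the values from all feed relations is left out.

```python
def is_valid_feed_code(value: str) -> bool:
    """True when the value is non-empty and only uses 0x21 (!) to 0x7E (~)."""
    return bool(value) and all(0x21 <= ord(ch) <= 0x7E for ch in value)


def duplicate_feed_codes(codes: list[str]) -> set[str]:
    """Lowercased codes that occur more than once among the given values."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for code in codes:
        key = code.lower()
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Made-up feed codes, not taken from existing relations
clashes = duplicate_feed_codes(["XX-Demo", "xx-demo", "YY-Other"])
```

Because the range starts at 0x21, a space (0x20) is rejected as well, which keeps the lowercased value usable as part of a key.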
Note: There also exists network:guid=* to give networks a unique identifier. I consider feeds and networks orthogonal concepts, though they may line up. Therefore network:guid=* has no effect on the interpretation of a feed relation. In particular, the uniqueness constraint for gtfs:feed=* does not extend to network:guid=* and vice versa. As such, there may exist two relations, one with gtfs:feed=foo and the other with network:guid=foo.
|Inferred role||PTv2 concept||GTFS concept||GTFS Description|
|stop||public_transport=platform (preferred)||stop (stops.txt)
|Place where passengers board/disembark|
|station||public_transport=station public_transport=stop_area||station (stops.txt)
|Physical structure or area with one or more platforms|
|A location where passengers enter/exit a station|
|trip||type=route(gtfs:trip_id=*)||trip (trips.txt) +
|A sequence of stops, at a specific time|
|route||type=route_master(gtfs:route_id=*)||route (routes.txt)||Group of trips displayed as a single service|
|network||type=network||-||- (only used for cascading membership)|
It is not required to add the role to these objects; they can be inferred based on the listed tags.
Any members that do not have any of the listed tags MUST use the appropriate role
Consider the GTFS feed description to decide which role is appropriate.
Size of relation - Cascading membership
Relations have a technical limit on their size of 32000 objects. This limit is easily reached if all GTFS objects are included.
To solve this, we use cascading membership:
if a relation is part of the feed relation, then all its members with one of the listed tags are also part of the feed relation.
This is an iterative process. As an extreme example we may have:
Cascading membership ensures that each of them is considered to be an (indirect) member of the feed relation.
Based on the tags, we can infer the role of all direct and indirect members.
The following overpass query implements the cascading membership https://overpass-turbo.eu/s/1E8J
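Outside of Overpass, the same iteration can be sketched on an in-memory copy of the data. The data model (dictionaries keyed by made-up object ids), the function names and the example hierarchy are assumptions for this sketch, and `has_listed_tag` only covers a simplified subset of the tags from the table above.

```python
def has_listed_tag(tags: dict[str, str]) -> bool:
    """Simplified check for the tags from which a role can be inferred."""
    if tags.get("public_transport") in {"platform", "station", "stop_area"}:
        return True
    kind = tags.get("type")
    if kind == "network":
        return True
    if kind == "route" and "gtfs:trip_id" in tags:
        return True
    return kind == "route_master" and "gtfs:route_id" in tags


def feed_members(feed_id: str, members: dict[str, list[str]],
                 tags: dict[str, dict[str, str]]) -> set[str]:
    """All direct and indirect (cascaded) members of a feed relation."""
    direct = set(members.get(feed_id, []))
    found: set[str] = set()
    queue = list(direct)
    while queue:
        obj = queue.pop()
        if obj in found or obj == feed_id:
            continue  # already handled, also protects against cycles
        # Direct members always count; indirect ones need a listed tag
        if obj not in direct and not has_listed_tag(tags.get(obj, {})):
            continue
        found.add(obj)
        queue.extend(members.get(obj, []))  # cascade into member relations
    return found


# Made-up hierarchy: feed -> network -> route_master -> route -> platform
members = {"r1": ["r2"], "r2": ["r3"], "r3": ["r4"], "r4": ["n5", "w6"]}
tags = {
    "r1": {"type": "gtfs_feed"},
    "r2": {"type": "network"},
    "r3": {"type": "route_master", "gtfs:route_id": "R10"},
    "r4": {"type": "route", "gtfs:trip_id": "T1"},
    "n5": {"public_transport": "platform"},
    "w6": {"highway": "residential"},
}
cascaded = feed_members("r1", members, tags)
```

The road segment `w6` is a member of the route but carries none of the listed tags, so it does not become part of the feed.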
All OSM objects that reference a GTFS object MUST be an (indirect) member of the GTFS feed relation.
Tags for referencing a GTFS object
Referencing a GTFS object is done using the following tags:
|stop||gtfs:stop_id=* or gtfs:stop_code=* or (gtfs:stop_name=* + gtfs:platform_code=*)|
|station||gtfs:stop_id=* or gtfs:stop_code=* or gtfs:stop_name=*|
|entrance||gtfs:stop_id=* or gtfs:stop_code=* or gtfs:stop_name=*|
|trip||gtfs:trip_id=* or gtfs:trip_id:sample=* or gtfs:shape_id=*|
|route||gtfs:route_id=* or gtfs:route_long_name=* or gtfs:route_short_name=*|
The combination of tags SHOULD reference a unique object in the GTFS feed
The GTFS standard recommends that IDs should persist between versions. This may not always be the case.
Therefore, analyse historic versions of the feed to see which combination of tags is the most stable.
gtfs:name=* MAY be used, but MUST NOT be required to find the right object in the GTFS feed
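For a stop, a data consumer could look up the referenced row roughly as follows. This sketch assumes that every gtfs:* tag naming one of four stops.txt columns must match the row exactly; it ignores feed-specific tag variants, and the function name and sample rows are made up.

```python
STOP_COLUMNS = ("stop_id", "stop_code", "stop_name", "platform_code")


def find_stop_rows(osm_tags: dict[str, str],
                   stops: list[dict[str, str]]) -> list[dict[str, str]]:
    """Rows of stops.txt that agree with all gtfs:* reference tags present."""
    wanted = {
        column: osm_tags["gtfs:" + column]
        for column in STOP_COLUMNS
        if "gtfs:" + column in osm_tags
    }
    if not wanted:
        return []  # nothing to match on
    return [
        row for row in stops
        if all(row.get(column) == value for column, value in wanted.items())
    ]


# Made-up rows: two platforms that share a name
stops = [
    {"stop_id": "100", "stop_name": "Example Square", "platform_code": "A"},
    {"stop_id": "101", "stop_name": "Example Square", "platform_code": "B"},
]
unique = find_stop_rows(
    {"gtfs:stop_name": "Example Square", "gtfs:platform_code": "B"}, stops
)
ambiguous = find_stop_rows({"gtfs:stop_name": "Example Square"}, stops)
```

A result with more than one row means the combination of tags does not reference a unique object, which a validator could report.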
Handling multiple feeds
It may happen that multiple feeds contain the same object.
- operator owned feeds that operate in the same area
- border regions between municipalities, states or countries
- international train stations and train lines.
I propose to handle this with the tagging scheme: gtfs:stop_id:(feed_code)=*
When these feed-specific tags are present, they get priority over their non-specific counterparts
The (feed_code) part MUST be the lowercase version of gtfs:feed=* for a feed relation that contains it
When a feature is part of multiple feed relations, it MUST use feed-specific tags for different values
Feed-specific tags SHOULD be used, even when part of a single feed to prevent conflicts when adding more feeds
The general tag MAY be used when multiple feeds have a common value,
for example when they use the IFOPT identifier. In that case also adding ref:IFOPT=* is advised.
Adding gtfs:stop_id=* communicates that it is needed for matching and gives the exact value that is present in the feed.
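The priority rule can be sketched as a small lookup: try the feed-specific key first, then fall back to the general key. The function name and the feed codes in the example are hypothetical.

```python
from typing import Optional


def resolve_reference(tags: dict[str, str], column: str, feed_code: str) -> Optional[str]:
    """Value of a GTFS reference tag for one feed, or None when untagged."""
    # The key suffix is the lowercase version of the feed's gtfs:feed value
    specific_key = f"gtfs:{column}:{feed_code.lower()}"
    if specific_key in tags:
        return tags[specific_key]  # feed-specific tags get priority
    return tags.get(f"gtfs:{column}")


# Made-up tags on a stop that is part of two feeds
tags = {"gtfs:stop_id": "A", "gtfs:stop_id:xx-demo": "B"}
for_demo = resolve_reference(tags, "stop_id", "XX-Demo")
for_other = resolve_reference(tags, "stop_id", "YY-Other")
```

Here the feed with code XX-Demo resolves to the feed-specific value, while any other feed falls back to the general tag.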
I will use the OVApi GTFS feed as an example. It covers all public transport in The Netherlands.
For the feed relation we have two options:
- Put it on the master relation of all public transport networks in the Netherlands: https://www.openstreetmap.org/relation/2779009
- Make a type=gtfs_feed and include this network as a member.
In this case the first option could be a good choice, but it prevents us from tagging the international lines in the feed.
|name||OV Api Netherlands|
Cascading membership means that this relation has 63615 (indirect) members.
Tags on a route_master
One of the route masters that is included in the extended relation is Bus 10 (Nijmegen):
The corresponding row in the GTFS feed looks like:
|87966||BRENG||10||Nijmegen CS - Heyendaal||3 (bus)|
|87987||BRENG||10||Nijmegen CS - Heyendaal - CS||3 (bus)|
Tags on a route
One of the routes included is Bus 10: Ringlijn Nijmegen Centraal Station => Universiteit HAN
There are many trips for this route, a couple of them:
In my analysis of different feed versions I found that the trip_ids are not stable; most of them change between versions.
Tags on a station
We included Station Nijmegen directly in the feed relation.
The corresponding row in the feed:
Tags on a bus station
Currently Nijmegen bus station is not included in the feed relation; we could add it as a direct member.
The corresponding row in the feed:
|stoparea:122872||Nijmegen, Centraal Station||1 (station)|
Tags on a bus platform
Even though the bus station is not included, the platforms are.
The corresponding row:
|2547419||60001013||Nijmegen, Centraal Station||0 (platform)||stoparea:122872||M|
Existing tagging and projects
The stops in gtfs_*=* have tags for all columns in stops.txt, likely from an import
In contrast gtfs:*=* seems to be more aimed at linking to a GTFS feed
An overview of tools that do something with both GTFS and OSM:
Currently there are a lot of objects using both namespaces.
There are some (but few) route relations that use gtfs_id=*.
I don't think this edit would be problematic since:
- There is no semantic difference for the namespace change, and minor difference for the gtfs_id=* change.
- GTFS tags are not (or at least should not be) rendered.
- I suspect most of these tags were originally imported anyway
Licensing (not part of the proposal, many open questions)
I think that we should require that feed relations only exist for feeds that are OSM compatible.
The reason for this is that linked feeds are likely going to be used to maintain routes in OSM (likely as a diff tool).
This leaves open if the GTFS feeds should be attributed somewhere if their license is open but requires attribution.
Another option is to present the feed license in a machine-readable form (SPDX license identifier, license URL, attribution string, ...)
Other ways of dealing with multiple values
Existing schemes for dealing with multiple values :
- (deprecated) numbering the multiple values: name_1=*, name_2=*, ...
- Multiple values separated by a semicolon
- More specific tags: old_name=*, official_name=*, alt_name=*, ...
These do not address cases where
- We want to know which feed corresponds to which value.
- The values can be arbitrary strings, making the use of separators troublesome.
- The domain of keys is not predefined (like 1,2,3 or old,official,alt)
Some other ways considered were:
- Multiple values in a single tag. Needs escape sequences and a way to distinguish which ID belongs to which feed.
- gtfs:stop_id=value1;value2, in the lexicographic order of feeds gtfs:feed=*. If a feed does not contain a value, skip it like gtfs:stop_id=;value2;;value4;. Can cause misalignment if it becomes part of another feed (potentially by cascading membership).
- gtfs:stop_id=ref1=value1;ref2=value2. Needs more advanced parsing. Can in some extreme cases hit the 255 character limit on tag size.
- Relations between stop and feed relation: too cumbersome
Normal tags serve a different function than the gtfs:*=* tags; neither is a replacement for the other. The goal of normal tags is to be displayed to the user. The goal of gtfs:*=* tags is to allow easy lookup of the corresponding GTFS object and associated timetables. Because they have different goals, there are different requirements for their values. Normal tags should be optimized for humans (proper capitalisation) and match what is on the ground. gtfs:*=* tags should match exactly with the value in the GTFS feed. This makes it impractical to store both in the same tag. Imagine the following scenario:
The following bus stop (osm) is imported from OVApi (stop_id=1329998), and
stop_name is placed in name=Huis ter Heide, Pr. Alexanderstichting. Suppose that we defaulted gtfs:stop_name=* to the value of name=*, and that this is required to find the right GTFS object (no stable ID or stop code). Now a mapper sees this on the map and fixes it to name=Prins Alexanderstichting; from then on we can no longer look up the GTFS object. You may argue that the mapper should have added gtfs:stop_name=Huis ter Heide, Pr. Alexanderstichting, but how could they have known this? Instead, the original import should have added this tag, even if it has exactly the same value, to allow the mapper to change name=* without affecting the GTFS lookup.
The other direction can cause even more grief. Imagine that a name or ID has changed in a GTFS feed and a mechanical edit changes name=* to fix this. This can undo a proper change by a mapper. Therefore, any such mechanical edit would get a lot of push-back from the community. If instead the mechanical edit changes gtfs:stop_name=*, nothing breaks, and a QA tool can then ask mappers to review whether this change should also be pushed to name=*.
- GTFS to describe the standardised tagging scheme
- Key:gtfs id to deprecate it
- Key:gtfs:stop id to describe the multiple value syntax
- Key:gtfs:trip id to describe the multiple value syntax
- Key:gtfs:trip id:sample to describe the multiple value syntax
- Key:gtfs:route id to describe the multiple value syntax
- Key:gtfs:shape id to describe the multiple value syntax
- Key:gtfs:feed to describe the pattern used and add the new restrictions added by this proposal
- Key:gtfs:name to mention that the GTFS column versions may still be needed to identify the object
- Public transport to add that there is a standardised way of tagging feeds
Please comment on the discussion page.
- I approve this proposal. I hope that this proposal will lead to better integration of public transportation routing. --Ÿnérant (talk) 08:47, 30 November 2023 (UTC)
- I approve this proposal. -- Something B (talk) 09:09, 30 November 2023 (UTC)
- I oppose this proposal. This adds an extremely complex relation where it is unclear to me why this information cannot be automatically derived. I don't see how the information can be kept in sync with the actual GTFS feed. And it is unclear if the information is useful at all if it goes out of sync. Altogether there was far too little discussion for such a complex topic. In particular I would like to hear from the proposed users (OSMAnd etc.) --Lonvia (talk) 09:19, 30 November 2023 (UTC)
- There is currently no good way to figure out which feeds a bus stop belongs to. And if you found the feed you can only really determine the object based on location, which could give you the stop on the other side of the road.
- With this proposal the process is: look at all the parent relations to find the GTFS feed relation. Then follow the URL to download the feed. Then find the line in `stops.txt` with the matching `gtfs:stop_id`, `gtfs:stop_code` or similar.
- For keeping it in sync: before adding the `gtfs:*` tags someone should look at the feed and its historic versions to determine which combination of columns is the most stable. This should ensure that the references are quite stable. A bot could monitor changed objects in the feed/OSM and fix automatically or ask a mapper.
- If it goes out of sync it becomes much less useful.
- Spaanse (talk) 10:29, 30 November 2023 (UTC)
Aborted the vote to allow for more discussion