Proposal:GTFS Tagging Standard

GTFS Tagging Standard
Proposal status: Proposed (under way)
Proposed by: Spaanse
Tagging: type=gtfs_feed
Applies to: relation
Definition: Group objects of a GTFS feed and link to that feed

Draft started: 2023-11-07
RFC start: 2023-11-09
Vote start: 2023-11-30 (aborted)
Vote end: 2023-12-14 (aborted)


Proposal

I want to standardise the way of tagging GTFS.

This proposal consists of multiple parts:

  1. Make the gtfs:*=* namespace the standard, deprecate the gtfs_*=* namespace.
  2. Deprecate gtfs_id=* and gtfs:id=*, prefer tags with the GTFS column names
  3. Add a relation type=gtfs_feed that contains the entities of a GTFS feed. Its tags specify feed-related information like gtfs:url=*
  4. Propose gtfs:stop_id:(feed_code)=* for when a feature belongs to multiple feeds

How this proposal is formulated

Any requirements will be underlined and formulated using MUST, SHOULD, MAY, ... as they are defined in RFC 2119

These requirements are meant for implementation of validators, import tools and data consumers.

These requirements are interspersed with the reasoning for them.

Rationale

I want apps that use OpenStreetMap, like OSMAnd, to be able to do proper public transit routing.

There have been many discussions on ways to include timetables in OSM.

However these have all led to the following conclusion: too much data that is too volatile.

Google developed GTFS (General Transit Feed Specification) to solve this problem for its products.

This is now an open standard, and pretty much all operators offer such a feed - often under a public license.

Therefore I believe that we should standardise a way to link to those feeds.

Furthermore, this proposal also wants to resolve the issues raised on the GTFS wiki page.

Background: How does GTFS work?

GTFS is a standard that describes how public transport agencies should publish timetables and other information so that they can be used by public transit applications. GTFS data is published as a zip file, containing multiple csv files. The transit agencies make this zip file available at a fixed URL - the GTFS feed. When things change in the network, agencies make a new zip and update the version in the feed.

Important GTFS files and columns
Each file is listed with its description, followed by its columns as: column (Type): description

stops.txt: Describes physical locations
  stop_id (ID): Unique identifier for a location
  stop_code (Text): Public-facing identifier for a location
  stop_name (Text): Name of the location
  stop_lat, stop_lon (Float): Location, for stops on the pole.
  location_type (Enum): 0: stop / platform; 1: station, contains multiple stops/platforms; 2: entrance / exit; 3: generic node; 4: boarding area
  parent_station (ID): The station it belongs to
  platform_code (Text): Identifier for the platform

routes.txt: Describes a service. OSM equivalent: route_master
  route_id (ID): Unique identifier for a route
  agency_id (ID): Id of the agency running this route (agency.txt)
  route_short_name (Text): Short identifier of the route - e.g. bus number
  route_long_name (Text): Full name of route, often with destinations
  route_type (Enum): What sort of transportation is used (bus/train/...)

trips.txt: Describes a sequence of stops, at a particular time. In OSM, a route (variant) refers to a sequence of stops. Thus the difference with a trip is the inclusion of time.
  route_id (ID): The route the trip belongs to
  service_id (ID): The days of operation for the trip (calendar.txt)
  trip_id (ID): Unique identifier for the trip
  trip_headsign (Text): Displayed destination of the trip
  trip_short_name (Text): Public facing text to identify the trip
  direction_id (Enum): 0/1, distinguish direction of trips
  shape_id (ID): The path the vehicle travels (shapes.txt)

shapes.txt: The path that a vehicle travels. This is closer to a route variant as it is in OSM. Does not include the sequence of stops.
  shape_id (ID): Unique identifier for a shape
  shape_pt_sequence (Int): Place of point in the shape
  shape_pt_lat, shape_pt_lon (Float): Position of a point

stop_times.txt: Describes the times that a trip stops at each stop
  trip_id (ID): The trip this stop time is for
  stop_id (ID): The stop this stop time is for
  stop_sequence (Int): Place of stop in trip
  arrival_time, departure_time (Time): Time of arrival/departure at this stop
  pickup_type, drop_off_type (Enum): Whether pickup or dropoff is available. 0: yes, 1: no, 2: call required, 3: ask driver

The examples below indicate how we could make use of both OSM and GTFS, and what we need for that.

Example: how to find departure times for a particular route and stop

  1. Download the feed and unzip it
  2. Determine the right row in stops.txt and remember its stop_id
  3. Determine the right row in routes.txt and remember its route_id
  4. Find all trips with that route_id and remember their trip_id
  5. Find all stop times with that stop_id and one of those trip_id values to get the departure times
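
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the feed zip has already been downloaded and that the stop_id and route_id are already known (steps 1 to 3). The helper names read_table and departure_times are illustrative and not part of GTFS or of any library.

```python
import csv
import io
import zipfile


def read_table(feed: zipfile.ZipFile, filename: str) -> list[dict]:
    """Read one CSV file of the feed into a list of row dictionaries."""
    with feed.open(filename) as raw:
        # utf-8-sig strips the byte order mark that some feeds include.
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return list(csv.DictReader(text))


def departure_times(feed: zipfile.ZipFile, stop_id: str, route_id: str) -> list[str]:
    """Return the departure times of a route at a stop, sorted as text."""
    # Step 4: collect the trip_id of every trip on the route.
    trip_ids = {row["trip_id"] for row in read_table(feed, "trips.txt")
                if row["route_id"] == route_id}
    # Step 5: keep the stop times at this stop that belong to those trips.
    return sorted(row["departure_time"]
                  for row in read_table(feed, "stop_times.txt")
                  if row["stop_id"] == stop_id and row["trip_id"] in trip_ids)
```

The sketch ignores service_id, so it returns the times for all days of operation together. The textual sort only gives chronological order when the feed writes times zero-padded as HH:MM:SS.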

If we want to initiate this from OSM, we need the following information in OSM:

  1. Where can we download the feed?
  2. Enough information about the stop to determine the right row in stops.txt
  3. Enough information about the route to determine the right row in routes.txt

Example: how to find the route variants for these departure times

  1. In OSM: look at all route variants of the route we used
  2. See if they match with one of the trips

Or alternatively:

  1. In OSM: look at all route variants that contain the stop
  2. See if they match with one of the trips

So for this we need enough information to see if a trip matches with a route variant.
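
One possible way to decide whether a trip matches a route variant is to compare ordered stop sequences. The sketch below assumes that the stops of the OSM route variant have already been translated to GTFS stop_id values (for example through their gtfs:stop_id=* tags) and that stop_times.txt has been loaded as a list of row dictionaries. Both function names are hypothetical.

```python
def trip_stop_sequence(stop_times: list[dict], trip_id: str) -> list[str]:
    """Return the stop_ids of a trip, ordered by stop_sequence."""
    rows = [row for row in stop_times if row["trip_id"] == trip_id]
    rows.sort(key=lambda row: int(row["stop_sequence"]))
    return [row["stop_id"] for row in rows]


def matching_trips(stop_times: list[dict], osm_stop_ids: list[str]) -> set[str]:
    """Return the trip_ids whose stop sequence equals that of the OSM variant."""
    trip_ids = {row["trip_id"] for row in stop_times}
    return {trip_id for trip_id in trip_ids
            if trip_stop_sequence(stop_times, trip_id) == osm_stop_ids}
```

An exact comparison is the strictest choice; a real tool might tolerate a missing or extra stop, which is not shown here.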

Tagging

We have the following list of things that this proposal should achieve:

  1. A way to discover where to download the GTFS feed
  2. A way to determine the right row in a GTFS file for stops, stations, route variants and routes.
  3. Resolve the namespace that is used for GTFS tagging; currently both gtfs_*=* and gtfs:*=* are in use.
  4. Create a way to handle features that are part of multiple feeds.

GTFS Namespace

I propose to make the gtfs:*=* namespace the standard.

I prefer the syntax since it is clearer that it is a namespace.

Furthermore there are more existing wiki pages for gtfs:*=* (see list at the end) than for gtfs_*=* (only Key:gtfs id)

A comparison of usage for both namespaces can be found further down in this proposal.

I suspect that a decision either way will be the same amount of work.

All tags about GTFS (General Transit Feed Specification) MUST be placed in the gtfs:*=* namespace

Tags MAY use underscores after the namespace prefix

Tags SHOULD use GTFS column names where applicable

This last rule also makes it clear which file of the GTFS feed is referred to; most columns start with the object's type.

In particular, the tags gtfs_id=* and gtfs:id=* should be deprecated.

Instead tags like gtfs:stop_id=*, gtfs:trip_id=*, gtfs:route_id=* and gtfs:shape_id=* should be used.
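
As an illustration of these rules, the following sketch moves keys from the gtfs_*=* namespace to the gtfs:*=* namespace on a single object. It is an assumption about how a validator or edit tool could behave, not a defined procedure: the function name is hypothetical, and the deprecated keys gtfs_id and gtfs:id are only reported, because choosing the right column name requires looking at the feature itself.

```python
DEPRECATED_ID_KEYS = {"gtfs_id", "gtfs:id"}


def normalise_gtfs_keys(tags: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Move gtfs_* keys into the gtfs: namespace.

    Returns the new tags and the deprecated ID keys that need manual review.
    """
    result: dict[str, str] = {}
    needs_review: list[str] = []
    for key, value in tags.items():
        if key in DEPRECATED_ID_KEYS:
            # Which column is meant (stop_id, route_id, ...) depends on the
            # feature, so these keys are reported rather than renamed.
            needs_review.append(key)
            result[key] = value
        elif key.startswith("gtfs_"):
            new_key = "gtfs:" + key[len("gtfs_"):]
            # A value already present in the new namespace is kept as it is.
            result.setdefault(new_key, value)
        else:
            result[key] = value
    return result, needs_review
```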

Feed Relation

To find the feed from OSM, we need a place to tag the properties of the feed.

There are two choices: relation type=network or relation type=gtfs_feed - I will refer to either as the GTFS feed relation.

Later on I will require that some features must be a member of the feed relation.

This can be a problem if the scope of the GTFS feed is not entirely contained within that of a relation type=network.

When this is the case, use relation type=gtfs_feed.

Including the relation type=network as a member allows the feed relation to inherit the network's members (see cascading membership).

Tags

The GTFS feed relation can have the following tags

Tags for a GTFS feed relation
Each tag is listed with its importance, followed by its description and an example value.

type=gtfs_feed or type=network (Required)
  Relations that can act as GTFS feed.
  Example value: gtfs_feed

gtfs:url=* (Required)
  The URL of this feed. As recommended by the GTFS standard, there should be a fixed, publicly accessible URL to the latest feed. A URL that achieves this by an HTTP redirect is also allowed. If there is no fixed URL, point the operator to the GTFS standard's recommendation.
  The URL SHOULD always point to the latest version
  The URL MUST start with the protocol
  Replaces gtfs:feed_url=*
  Example value: https://gtfs.openov.nl/gtfs-rt/gtfs-openov-nl.zip

gtfs:feed=* (Required)
  A code for the feed, to distinguish between feeds in tags.
  The lowercase version MUST be unique among all GTFS feed relations
  Example value: NL-OVapi

gtfs:release_date=* (Recommended)
  MUST be included when the URL is not a fixed URL to the latest version
  Example value: 2023-10-30

name=* or gtfs:name=* (Recommended)
  The name of the feed.
  Example value: OV Api Netherlands

ref=* or gtfs:ref=* (Optional)
  An official code for the feed.

operator=* or gtfs:operator=* (Optional)
  The organisation that manages the feed.
  Example value: Stichting OpenGeo

There exists an established pattern for gtfs:feed=*.

  1. UPPERCASE ISO 3166-2 code for the main extent of the feed
  2. Title-case operator abbreviation / descriptor

I suggest following this pattern, but it is not required.

However, to be able to consistently lowercase this value for use as part of a key:

The value of gtfs:feed=* MUST use only printable ASCII characters (0x21 (!) - 0x7E (~))

Therefore accents and foreign scripts should be replaced with latinized forms (so ö becomes oe)
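
A validator could check these two requirements along the following lines. This is a sketch with hypothetical function names; treating an empty value as invalid is an assumption that the text does not state.

```python
def is_valid_feed_code(code: str) -> bool:
    """Check that a gtfs:feed value uses only printable ASCII (0x21 - 0x7E)."""
    # Assumption: an empty value is rejected as well.
    return bool(code) and all(0x21 <= ord(char) <= 0x7E for char in code)


def duplicate_feed_codes(codes: list[str]) -> set[str]:
    """Return the lowercased feed codes that occur more than once."""
    seen: set[str] = set()
    duplicates: set[str] = set()
    for code in codes:
        lowered = code.lower()
        if lowered in seen:
            duplicates.add(lowered)
        seen.add(lowered)
    return duplicates
```

Note that the space character (0x20) falls outside the allowed range, so a value such as "NL OVapi" is rejected.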


Note: There also exists network:guid=* to give networks a unique identifier. I consider feeds and networks orthogonal concepts, though they may line up. Therefore network:guid=* has no effect on the interpretation of a feed relation. In particular, the uniqueness constraint for gtfs:feed=* does not extend to network:guid=* and vice versa. As such, there may exist two relations, one with gtfs:feed=foo and the other with network:guid=foo.

Members

Different kinds of features in the GTFS feed relation
Each inferred role is listed with its PTv2 concept, GTFS concept and GTFS description.

Role stop
  PTv2 concept: node/way/area public_transport=platform (preferred), node public_transport=stop_position, node highway=bus_stop, node railway=stop, node railway=platform, node railway=tram_stop, node amenity=ferry_terminal
  GTFS concept: stop (stops.txt), location_type=0
  GTFS description: Place where passengers board/disembark

Role station
  PTv2 concept: node/area public_transport=station, relation public_transport=stop_area, relation public_transport=stop_area_group, node railway=station, node railway=halt, node/area aerialway=station
  GTFS concept: station (stops.txt), location_type=1
  GTFS description: Physical structure or area with one or more platforms

Role entrance
  PTv2 concept: node railway=subway_entrance, node railway=train_station_entrance
  GTFS concept: entrance/exit (stops.txt), location_type=2
  GTFS description: A location where passengers enter/exit a station

Role trip
  PTv2 concept: relation type=route (gtfs:trip_id=*)
  GTFS concept: trip (trips.txt) + shape (shapes.txt)
  GTFS description: A sequence of stops, at a specific time

Role route
  PTv2 concept: relation type=route_master (gtfs:route_id=*)
  GTFS concept: route (routes.txt)
  GTFS description: Group of trips displayed as a single service

Role network
  PTv2 concept: relation type=network
  GTFS concept: -
  GTFS description: - (only used for cascading membership)

It is not required to add the role to these objects; they can be inferred based on the listed tags.

Any members that do not have any of the listed tags MUST use the appropriate role

Consider the GTFS feed description to decide which role is appropriate.

Size of relation - Cascading membership

Relations have a technical limit on their size of 32000 objects. This limit is easily reached if all GTFS objects are included.

To solve this, we use cascading membership: if a relation is part of the feed relation, then all of its members with one of the listed tags are also part of the feed relation.

This is an iterative process. As an extreme example we may have:

relation type=gtfs_feed
  relation type=network
    relation type=network
      relation type=route_master
        relation type=route
          relation public_transport=stop_area_group
            relation public_transport=stop_area
              node/way/area public_transport=platform

Cascading membership ensures that each of them is considered to be an (indirect) member of the feed relation.

Based on the tags, we can infer the role of all direct and indirect members.

In this example, the roles are inferred to be: Role network, Role network, Role route, Role trip, Role station, Role station, Role stop.

The following overpass query implements the cascading membership https://overpass-turbo.eu/s/1E8J

All OSM objects that reference a GTFS object MUST be an (indirect) member of the GTFS feed relation.
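
For tools that work on downloaded OSM data rather than through Overpass, cascading membership and role inference could be sketched as below. The in-memory data model (a dictionary from an object ID to its type, tags and member IDs) and all names are assumptions made for this example. The sketch also simplifies the table of roles: it looks only at tags and ignores which element types (node, way, area) are listed for each tag, and it ignores explicitly set roles.

```python
from typing import Optional

STOP_TAGS = {("public_transport", "platform"), ("public_transport", "stop_position"),
             ("highway", "bus_stop"), ("railway", "stop"), ("railway", "platform"),
             ("railway", "tram_stop"), ("amenity", "ferry_terminal")}
STATION_TAGS = {("public_transport", "station"), ("public_transport", "stop_area"),
                ("public_transport", "stop_area_group"), ("railway", "station"),
                ("railway", "halt"), ("aerialway", "station")}
ENTRANCE_TAGS = {("railway", "subway_entrance"), ("railway", "train_station_entrance")}
RELATION_ROLES = {"route": "trip", "route_master": "route", "network": "network"}


def infer_role(obj: dict) -> Optional[str]:
    """Infer the feed role of an object from its tags, or None if no tag is listed."""
    tags = obj["tags"]
    if obj["type"] == "relation" and tags.get("type") in RELATION_ROLES:
        return RELATION_ROLES[tags["type"]]
    pairs = set(tags.items())
    if pairs & STOP_TAGS:
        return "stop"
    if pairs & STATION_TAGS:
        return "station"
    if pairs & ENTRANCE_TAGS:
        return "entrance"
    return None


def feed_members(objects: dict[str, dict], feed_id: str) -> dict[str, str]:
    """Return the inferred role of every direct and indirect feed member."""
    roles: dict[str, str] = {}
    queue = list(objects[feed_id].get("members", []))
    while queue:
        member_id = queue.pop()
        member = objects.get(member_id)
        if member is None or member_id in roles:
            continue  # not downloaded, or already handled (also stops cycles)
        role = infer_role(member)
        if role is None:
            continue  # e.g. the ways of a route: not part of the feed
        roles[member_id] = role
        if member["type"] == "relation":
            # Cascade: members of a member relation are candidates as well.
            queue.extend(member.get("members", []))
    return roles
```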

Tags for referencing a GTFS object

Referencing a GTFS object is done using the following tags:

Role Tags
Role stop gtfs:stop_id=* or gtfs:stop_code=* or (gtfs:stop_name=* + gtfs:platform_code=*)
Role station gtfs:stop_id=* or gtfs:stop_code=* or gtfs:stop_name=*
Role entrance gtfs:stop_id=* or gtfs:stop_code=* or gtfs:stop_name=*
Role trip gtfs:trip_id=* or gtfs:trip_id:sample=* or gtfs:shape_id=*
Role route gtfs:route_id=* or gtfs:route_long_name=* or gtfs:route_short_name=*

The combination of tags SHOULD reference a unique object in the GTFS feed
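
A data consumer could look up the referenced row of stops.txt as sketched below. This is one possible reading of the table, under the assumption that every gtfs:*=* tag that is present must match the corresponding column; the function name is hypothetical, and feed-specific tag variants are not handled here.

```python
STOP_COLUMNS = ("stop_id", "stop_code", "stop_name", "platform_code")


def find_stop_rows(stops: list[dict], tags: dict[str, str]) -> list[dict]:
    """Return the rows of stops.txt that match all gtfs:* tags of an object."""
    criteria = {column: tags["gtfs:" + column]
                for column in STOP_COLUMNS if "gtfs:" + column in tags}
    if not criteria:
        return []  # the object does not reference a GTFS stop
    return [row for row in stops
            if all(row.get(column, "") == value
                   for column, value in criteria.items())]
```

If the result contains more than one row, the tags do not reference a unique object and a validator could flag the object.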

The GTFS standard recommends that IDs should persist between versions. This may not always be the case.

Therefore, analyse historic versions of the feed to see which combination of tags is the most stable.

It may be that gtfs:stop_code=* is more stable than gtfs:stop_id=*
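
Such an analysis could start from a simple measure like the one below: the share of values of a column in an older feed version that still occur in a newer version. This is only a sketch with a hypothetical function name; it assumes both versions of the file have been loaded as lists of row dictionaries.

```python
def column_stability(old_rows: list[dict], new_rows: list[dict], column: str) -> float:
    """Return the fraction of old values of a column that survive in the new version."""
    old_values = {row[column] for row in old_rows if row.get(column)}
    new_values = {row[column] for row in new_rows if row.get(column)}
    if not old_values:
        return 0.0
    return len(old_values & new_values) / len(old_values)
```

Comparing the result for stop_id and stop_code over several versions gives a first indication of which column to prefer. It does not prove that a surviving value still refers to the same physical stop.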

gtfs:name=* MAY be used, but MUST NOT be required to find the right object in the GTFS feed

Handling multiple feeds

It may happen that multiple feeds contain the same object.

Examples are:

  • operator owned feeds that operate in the same area
  • border regions between municipalities, states or countries
  • international train stations and train lines.

I propose to handle this with the tagging scheme: gtfs:stop_id:(feed_code)=*

When these feed-specific tags are present, they get priority over their non-specific counterparts

Example (Arnhem Centraal)
Tag value for:
OVApi VRR Other
gtfs:stop_id:nl-ovapi=stoparea:183059 stoparea:183059 - -
gtfs:stop_id:de-nw-vrr=gen:27004:4381:: - gen:27004:4381:: -
gtfs:stop_id=foo foo foo foo
Combined: stoparea:183059 gen:27004:4381:: foo
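
The priority rule can be expressed in a few lines. The sketch below uses a hypothetical function name and takes the tags of the object, the GTFS column name and the value of gtfs:feed=* of the feed relation as input.

```python
from typing import Optional


def resolve_gtfs_tag(tags: dict[str, str], column: str, feed_code: str) -> Optional[str]:
    """Return the value of a GTFS tag for one feed, preferring the feed-specific tag."""
    # The feed code is lowercased when it is used as part of a key.
    specific = tags.get(f"gtfs:{column}:{feed_code.lower()}")
    if specific is not None:
        return specific
    return tags.get(f"gtfs:{column}")
```

With the tags of the Arnhem Centraal example, resolve_gtfs_tag(tags, "stop_id", "NL-OVapi") yields the OVApi value, while a feed without a feed-specific tag falls back to the general gtfs:stop_id=* value.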

Requirements

The (feed_code) part MUST be the lowercase version of gtfs:feed=* for a feed relation that contains it

When a feature is part of multiple feed relations, it MUST use feed-specific tags for different values

Feed-specific tags SHOULD be used, even when part of a single feed to prevent conflicts when adding more feeds

The general tag MAY be used when multiple feeds have a common value,

for example when they use the IFOPT identifier. In that case also adding ref:IFOPT=* is advised.

ref:IFOPT=* SHOULD NOT be used as a replacement for gtfs:stop_id=* and friends. (contrary to the current GTFS wiki page)

Adding gtfs:stop_id=* communicates that it is needed for matching and gives the exact value that is present in the feed.

ref:IFOPT=* may not match exactly, for example ref:IFOPT=NL:Q:50201120 and gtfs:stop_code=50201120.

Example

I will use the OVApi GTFS feed as an example. It covers all public transport in The Netherlands.

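For the feed relation we have two options: add the feed tags to an existing relation type=network, or create a new relation type=gtfs_feed.

In this case the first option could be a good choice, but it prevents us from tagging the international lines in the feed.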

Therefore I will choose to make a relation type=gtfs_feed

Key Value
type gtfs_feed
gtfs:url https://gtfs.ovapi.nl/nl/gtfs-nl.zip
gtfs:feed NL-OVapi
name OV Api Netherlands
operator Stichting OpenGeo

For this example I will give it two members: OV-concessies Nederland and Station Nijmegen

Cascading membership means that this relation has 63615 (indirect) members.

Role stop: 57661, Role station: 1, Role trip: 4026, Role route: 1883, Role network: 47

The reason that the count for Role station is so low, is that it is never included as a part of a route, route_master or network.

The only instance of Role station is Station Nijmegen that is a direct member.

Tags on a route_master

One of the route masters that is included in the extended relation is Bus 10 (Nijmegen):

OV Api Netherlands → OV-concessies Nederland → Arnhem-Nijmegen → Bus 10 (Nijmegen)

The corresponding row in the GTFS feed looks like:

routes.txt
route_id agency_id route_short_name route_long_name route_desc route_type route_color route_text_color route_url
87966 BRENG 10 Nijmegen CS - Heyendaal 3 (bus)
87987 BRENG 10 Nijmegen CS - Heyendaal - CS 3 (bus)

Note: there should not be multiple rows. I will use the second row since it has 19450 rides instead of 2821
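
Counts like the ones in this note can be obtained by counting rows per route_id. The sketch below assumes that a ride corresponds to one row of trips.txt, which the note does not state explicitly; trips_per_route is a hypothetical helper that takes the rows of trips.txt as dictionaries.

```python
from collections import Counter


def trips_per_route(trips: list[dict]) -> Counter:
    """Count the rows of trips.txt for every route_id."""
    return Counter(row["route_id"] for row in trips)
```

When several rows of routes.txt describe the same line, the route_id with the highest count is a reasonable candidate for tagging.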

I have determined that the route ID is quite stable, so adding gtfs:route_id:nl-ovapi=87987 suffices.

Another option would be to tag gtfs:route_long_name:nl-ovapi=Nijmegen CS - Heyendaal

Tags on a route

One of the routes included is Bus 10: Ringlijn Nijmegen Centraal Station => Universiteit HAN

There are many trips for this route, a couple of them:

trips.txt
route_id service_id trip_id trip_headsign trip_short_name trip_long_name direction_id shape_id ...
87987 457 177374763 Heyendaal 79 0 1127909
87987 457 177374758 Heyendaal 69 0 1127909
87987 457 177374753 Heyendaal 59 0 1127909

In my analysis of different feed versions I found that the trip_ids are not stable; most of them change between versions.

The shape_id is more stable, so I will add gtfs:shape_id:nl-ovapi=1127909

If I had used the trip_id, it would look like gtfs:trip_id:sample:nl-ovapi=177374763

Tags on a station

We included Station Nijmegen directly in the feed relation.

The corresponding row in the feed:

stops.txt
stop_id stop_code stop_name location_type parent_station platform_code ...
stoparea:17857 nm Nijmegen 1 (station)

I have determined the stop_ids to be quite stable, so I can add gtfs:stop_id:nl-ovapi=stoparea:17857

However, in this case the stop_code is fixed for the station and printed on timetables, so better is gtfs:stop_code=nm (and railway:ref=nm)

Tags on a bus station

Currently Nijmegen bus station is not included in the feed relation; we could add it as a direct member.

The corresponding row in the feed:

stops.txt
stop_id stop_code stop_name location_type parent_station platform_code ...
stoparea:122872 Nijmegen, Centraal Station 1 (station)

In this case we have only one option: gtfs:stop_id:nl-ovapi=stoparea:122872

Tags on a bus platform

Even though the bus station is not included, the platforms are.

In particular, Nijmegen Centraal platform M is included because it is part of Bus 10.

The corresponding row:

stops.txt
stop_id stop_code stop_name location_type parent_station platform_code ...
2547419 60001013 Nijmegen, Centraal Station 0 (platform) stoparea:122872 M

Again we can use the stop_id: gtfs:stop_id:nl-ovapi=2547419.

However, we have ref:IFOPT=NL:Q:60001013, so we see that the stop_code is part of the IFOPT and thus stable.

Therefore we will tag with both ref:IFOPT=NL:Q:60001013 and gtfs:stop_code:nl-ovapi=60001013

Another alternative would be gtfs:stop_name:nl-ovapi=Nijmegen, Centraal Station and gtfs:platform_code:nl-ovapi=M

Considerations

Existing tagging and projects

Number of objects with GTFS tags (7 November 2023)
Namespace Nodes Ways Relations Total Overpass
gtfs_*=* 115647 846 4376 120869 https://overpass-turbo.eu/s/1D43
gtfs:*=* 50682 1190 22882 74754 https://overpass-turbo.eu/s/1D44

This shows that gtfs_*=* is more common on stops (nodes) and gtfs:*=* is more common on relations.

Comparison of tags among both namespaces


The stops in gtfs_*=* have tags for all columns in stops.txt, likely from an import

In contrast gtfs:*=* seems to be more aimed at linking to a GTFS feed

An overview of tools that do something with both GTFS and OSM:

gtfs_*=*: GO-Sync

gtfs:*=*: PTNA

Neither: Osm2Gtfs [1], OpenTripPlanner [2]

Transitional period

Currently there are a lot of objects using both namespaces.

If this proposal is approved, we can start the process to approve an automated edit changing gtfs_*=* to gtfs:*=*.

I think it is also a good idea to change gtfs_id=* to gtfs:stop_id=*, but that requires more care and looking at feature type and related tags.

There are some (but few) route relations that use gtfs_id=*.

I don't think this edit would be problematic since:

  1. There is no semantic difference for the namespace change, and minor difference for the gtfs_id=* change.
  2. GTFS tags are not (or at least should not be) rendered.
  3. I suspect most of these tags were originally imported anyway

Licensing (not part of the proposal, many open questions)

I think that we should require that feed relations only exist for feeds whose license is compatible with OSM.

The reason for this is that linked feeds are likely going to be used to maintain routes in OSM (likely as a diff tool).

This leaves open if the GTFS feeds should be attributed somewhere if their license is open but requires attribution.

Another option is to present the feed license in a machine-readable form (SPDX license identifier, license URL, attribution string, ...)

Other ways of dealing with multiple values

Existing schemes for dealing with multiple values:

  1. (deprecated) numbering the multiple values: name_1=*, name_2=*, ...
  2. Multiple values separated by a semicolon (;): name=foo;bar
  3. More specific tags: old_name=*, official_name=*, alt_name=*, ...

These do not address cases where

  1. We want to know which feed corresponds to which value.
  2. The values can be arbitrary strings, making the use of separators troublesome.
  3. The domain of keys is not predefined (like 1,2,3 or old,official,alt)

Some other ways considered were:

  • Multiple values in a single tag. Needs escape sequences and a way to distinguish which ID belongs to which feed.
  • Relations between stop and feed relation: too cumbersome

Keeping GTFS tags separate from normal tags like ref, name, ...

Normal tags serve a different function than the gtfs:*=* tags; neither is a replacement for the other. The goal of normal tags is to be displayed to the user. The goal of gtfs:*=* tags is to allow easy lookup of the corresponding GTFS object and associated timetables. Because they have different goals, there are different requirements for their values. Normal tags should be optimized for humans - proper capitalisation - and match what is on the ground. gtfs:*=* tags should match exactly with the value in the GTFS feed. This makes it impractical to store both in the same tag. Imagine the following scenario:

The following bus stop (osm) is imported from OVApi (stop_id=1329998), and stop_name is placed in name=Huis ter Heide, Pr. Alexanderstichting. Suppose that we defaulted gtfs:stop_name=* to the value of name=*, and that this is required to find the right GTFS object (no stable id/stop code). Now a mapper sees this on the map and fixes it to name=Prins Alexanderstichting; as a result we can no longer look up the GTFS object. You may argue that the mapper should have added gtfs:stop_name=Huis ter Heide, Pr. Alexanderstichting, but how could they have known this? Instead, the original import should have added this tag - even if it has exactly the same value - to allow the mapper to change name=* without affecting the GTFS lookup.

The other direction can cause even more grief. Imagine that a name/id has changed in a GTFS feed and a mechanical edit changes name=* to fix this. This can undo a proper change by a mapper. Therefore, any such mechanical edit would get a lot of push-back from the community. If instead the mechanical edit changes gtfs:stop_name=*, nothing breaks, and a QA tool can then ask mappers to review whether this change should also be pushed to name=*.

Features/Pages affected

External discussions

https://community.openstreetmap.org/t/draft-feature-proposal-gtfs-tagging-standard/105763/2

Comments

Please comment on the discussion page.

Voting

Instructions for voting
  • Log in to the wiki if you are not already logged in.
  • Scroll down to voting and click 'Edit source'. Copy and paste the appropriate code from this table on its own line at the bottom of the text area:
    To get this output: I approve this proposal.
    You type: {{vote|yes}} --~~~~
    Description: Feel free to also explain why you support proposal.

    To get this output: I oppose this proposal. reason
    You type: {{vote|no}} reason --~~~~
    Description: Replace reason with your reason(s) for voting no.

    To get this output: I have comments but abstain from voting on this proposal. comments
    You type: {{vote|abstain}} comments --~~~~
    Description: If you don't want to vote but have comments. Replace comments with your comments.
Note: The ~~~~ automatically inserts your name and the current date.
For full template documentation see Template:Vote. See also how vote outcome is processed.
  • I approve this proposal. I hope that this proposal will lead to better integration of public transportation routing. --Ÿnérant (talk) 08:47, 30 November 2023 (UTC)
  • I approve this proposal. -- Something B (talk) 09:09, 30 November 2023 (UTC)
  • I oppose this proposal. This adds an extremely complex relation where it is unclear to me why this information cannot be automatically derived. I don't see how the information can be kept in sync with the actual GTFS feed. And it is unclear if the information is useful at all if it goes out of sync. Altogether there was far too little discussion for such a complex topic. In particular I would like to hear from the proposed users (OSMAnd etc.) --Lonvia (talk) 09:19, 30 November 2023 (UTC)
There is currently no good way to figure out which feeds a bus stop belongs to. And if you found the feed you can only really determine the object based on location, which could give you the stop on the other side of the road.
With this proposal the process is: look at all the parent relations to find the GTFS feed relation. Then follow the URL to download the feed. Then find the line in `stops.txt` with the matching `gtfs:stop_id`, `gtfs:stop_code` or similar.
For keeping it in sync - before adding the `gtfs:*` tags someone should look at the feed and its historic versions to determine which combination of columns is the most stable. This should ensure that the references are quite stable. A bot could monitor changed objects in the feed/OSM and fix automatically or ask a mapper.
If it goes out of sync it becomes much less useful.
Spaanse (talk) 10:29, 30 November 2023 (UTC)

Aborted the vote to allow for more discussion

Spaanse (talk) 10:32, 30 November 2023 (UTC)