Planet.osm/diffs

From OpenStreetMap Wiki

The diffs provided at https://planet.osm.org are small compressed XML files in OsmChange format that contain the changes in the OpenStreetMap data over some period of time. Diffs can be used by developers to bring an OpenStreetMap database into sync with the latest changes made by the mapping community, or to analyse these changes as they happen.

Minutely, Hourly, or Daily diffs

Diffs are available at different time granularities as follows:

Granularity | Schedule | Contents | Implementation
Minute (replication/minute) | Launched every minute. | Full History | Transaction Id
Hour (replication/hour) | Launched every hour at 2 minutes past the hour. | Full History | Transaction Id (Aggregation of Minute)
Day (replication/day) | Launched at 00:05 UTC every day. | Full History | Transaction Id (Aggregation of Hour)
Daily (not updated during redaction period). Note: This extract misses data from long-running transactions and is being phased out in favour of the Day Transaction Id extract above. | Launched at 00:35 UTC. Completes at approximately 01:00 UTC. | Delta | Date Aligned
Historical Daily. Note: These are the only extracts that allow you to retrieve the complete OpenStreetMap history. | Launched at 01:00 UTC. Job is configured with 24 hour extract delay leading to a total of 25 hours to minimise chance of missed data. | Full History | Date Aligned

Full History extracts contain all changes to entities for the extract period. They may contain multiple versions of some entities if those entities were modified multiple times within that extract period. For example, a given node may be modified twice in a minute, leading to two versions of the node being included in the minute extract.

Delta extracts only contain the changes necessary to patch a dataset to the current version. They only include the latest version of an entity for a given extract period. These extracts are being phased out in favour of full history extracts because delta data can be derived from full history data if required.
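The derivation of delta data from full history data mentioned above can be sketched as follows. This is a minimal illustration, not the method used by any particular tool: the Change record type and the to_delta function are hypothetical names, and the sketch simply keeps the highest version of each entity. A real tool may also need to adjust the action of the surviving version, for instance when an entity is created and then modified within the same period.

```python
from typing import Iterable, NamedTuple


class Change(NamedTuple):
    """One entity version from a full history extract (hypothetical record type)."""
    action: str   # "create", "modify" or "delete"
    kind: str     # "node", "way" or "relation"
    id: int
    version: int


def to_delta(changes: Iterable[Change]) -> list[Change]:
    """Keep only the highest version of each entity seen in the extract."""
    latest: dict[tuple[str, int], Change] = {}
    for change in changes:
        key = (change.kind, change.id)
        if key not in latest or change.version > latest[key].version:
            latest[key] = change
    return list(latest.values())


# Node 1 appears twice in the full history extract; only version 2 survives.
full_history = [
    Change("create", "node", 1, 1),
    Change("modify", "node", 1, 2),
    Change("modify", "way", 7, 4),
]
delta = to_delta(full_history)
```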

Date Aligned extracts use the timestamp fields of database objects to determine which records to include in the extract. This has the advantage of producing extracts whose contents cover an easily identified time period. The major disadvantage is that they may miss data, because long-running transactions can commit data with timestamps lying within time periods that have already been extracted. To minimise the chance of this occurring, date aligned extracts are run with a time delay. Unless a very long delay is used, some data will be missed.

Transaction Id extracts use internal PostgreSQL database transaction identifiers to determine which records to extract. These identifiers allow all changed records to be extracted with zero artificial time delays. The downside to these extracts is that they are not exactly date aligned. The timestamps follow these rules:

  • The timestamp field specified in a replication state file is guaranteed to be greater than or equal to the maximum timestamp contained in the data file.
  • A data file may contain data with timestamps that are equal to or earlier than the timestamp of the previous state file.

These timestamp rules mean that the timestamp specified in a state file can be reliably used to identify the starting point for patching a dataset. The patching tool must cope with receiving duplicate data that already exists in the dataset.
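One way a patching tool might cope with duplicate data is to compare version numbers before applying each change. The sketch below assumes an in-memory dataset keyed by entity type and id. The Version record and apply_changes function are hypothetical names for illustration, and deleted entities are kept as invisible entries so that a repeated older version cannot bring them back.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """One entity version taken from a diff (hypothetical record type)."""
    kind: str       # "node", "way" or "relation"
    id: int
    version: int
    visible: bool   # False when the change is a delete


def apply_changes(dataset: dict, changes: list) -> int:
    """Apply changes to dataset in place; return how many were actually applied."""
    applied = 0
    for change in changes:
        key = (change.kind, change.id)
        current = dataset.get(key)
        # Skip anything the dataset already holds at this version or newer.
        if current is not None and current.version >= change.version:
            continue
        dataset[key] = change
        applied += 1
    return applied


dataset = {}
first_file = [Version("node", 1, 1, True), Version("node", 1, 2, True)]
# The second file overlaps the first: it repeats version 2 and adds a delete.
second_file = [Version("node", 1, 2, True), Version("node", 1, 3, False)]
applied_first = apply_changes(dataset, first_file)
applied_second = apply_changes(dataset, second_file)
```

Applying the same file twice leaves the dataset unchanged, which is the property needed when restarting from a state file timestamp.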

Daily files organisation

The daily files are named by the time period they cover, with the format:

YYYYmmdd-YYYYmmdd.osc.gz

The data in the file is the change in data between midnight on the first and second days, as identified by the timestamps on the current data. Because of the delay in creating the file it is very unlikely, but possible, that some data may be missing.
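As an illustration of the naming format, the file name for a given day could be built as below. The function name daily_diff_name is hypothetical, and the sketch assumes that each file covers exactly two consecutive days, as the description above suggests.

```python
from datetime import date, timedelta


def daily_diff_name(first_day: date) -> str:
    """Build the YYYYmmdd-YYYYmmdd.osc.gz name for the diff starting on first_day."""
    second_day = first_day + timedelta(days=1)
    return f"{first_day:%Y%m%d}-{second_day:%Y%m%d}.osc.gz"


# A diff spanning a year boundary.
name = daily_diff_name(date(2011, 12, 31))
```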

Minute, Hour, and Day Files Organisation

Each file with daily, hourly or minutely granularity is identified by a nine-digit sequence number. The sequence number is split into groups of three digits and can be found in the following location:

https://planet.openstreetmap.org/replication/[day|hour|minute]/AAA/BBB/CCC.osc.gz

Where the sequence number N = AAA*1000000 + BBB*1000 + CCC. For example, the most recent hourly diff at the time of writing (July 2018) has a sequence number of 51,123 and its location is 000/051/123. Each OsmChange file is accompanied by a state.txt file which contains the following information:

Key | Example value | Meaning
sequenceNumber | 3067 | The sequence number of the change/state file.
txnMaxQueried | 0 | The maximum transaction ID which is included in the diff. (NOTE: Doesn't seem to be used for the hourly diffs) (osmosis internal, not supported by osmdbt)
timestamp | 2010-03-27T17\:00\:00Z | The timestamp when the diff was generated.
txnReadyList | 916201159,916203039 | Unknown - seems to be unused. The previously active transaction ids that can now be queried? (osmosis internal, not supported by osmdbt)
txnMax | 916203060 | The maximum transaction ID at the time the diff was generated, usually the same as txnMaxQueried. (osmosis internal, not supported by osmdbt)
txnActiveList | 916201159,916203039 | The list of transaction IDs between this state and the previous state which have not been committed yet. (NOTE: Doesn't seem to be used for the hourly diffs). (osmosis internal, not supported by osmdbt)

The numbers are sequential, but are not necessarily aligned with any clock time. To find out the time associated with a particular diff it is necessary to read the timestamp from the associated state file.
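The path arithmetic and the state file fields described in this section can be sketched in a few lines. The function names are hypothetical. The sketch assumes that the state file consists of key=value lines with optional # comment lines, and that colons in the timestamp are backslash-escaped as in the example value in the table above. The sample text is made up from those example values.

```python
from datetime import datetime, timezone

BASE_URL = "https://planet.openstreetmap.org/replication"


def sequence_path(sequence: int) -> str:
    """Split a sequence number into the AAA/BBB/CCC path used on the server."""
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence number must fit in nine digits")
    digits = f"{sequence:09d}"
    return f"{digits[0:3]}/{digits[3:6]}/{digits[6:9]}"


def diff_url(granularity: str, sequence: int) -> str:
    """Build the URL of one diff; granularity is "day", "hour" or "minute"."""
    return f"{BASE_URL}/{granularity}/{sequence_path(sequence)}.osc.gz"


def parse_state(text: str) -> dict:
    """Parse key=value lines from a state file, skipping comments and blanks."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Assumption: colons are backslash-escaped, as in the example value.
        state[key.strip()] = value.strip().replace("\\:", ":")
    return state


def state_timestamp(state: dict) -> datetime:
    """Return the state's timestamp as a timezone-aware UTC datetime."""
    parsed = datetime.strptime(state["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Illustrative sample built from the example values in the table.
sample = "#example state file\nsequenceNumber=3067\ntimestamp=2010-03-27T17\\:00\\:00Z\n"
state = parse_state(sample)
path = sequence_path(51123)
url = diff_url("hour", 51123)
```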

Fetching diff files

To fetch changes you should first find the current sequence number by fetching the state for the feed which can be found in the following location:

https://planet.openstreetmap.org/replication/[day|hour|minute]/state.txt

The sequence number should then be extracted from the state and all the required diff files up to and including that sequence number can then be fetched using the naming scheme in the previous section.
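The procedure above might look like the following sketch. All function names are hypothetical, last_applied stands for whatever sequence number your own dataset has already reached, and the sample state text uses an illustrative timestamp. Only fetch_state touches the network, so the rest can be tried offline.

```python
import re
import urllib.request

BASE_URL = "https://planet.openstreetmap.org/replication"


def current_sequence(state_text: str) -> int:
    """Extract the sequenceNumber entry from the text of a state file."""
    match = re.search(r"^sequenceNumber=(\d+)\s*$", state_text, re.MULTILINE)
    if match is None:
        raise ValueError("no sequenceNumber entry found")
    return int(match.group(1))


def pending_diff_urls(granularity: str, last_applied: int, current: int) -> list[str]:
    """List the diff URLs after last_applied, up to and including current."""
    urls = []
    for seq in range(last_applied + 1, current + 1):
        digits = f"{seq:09d}"
        urls.append(
            f"{BASE_URL}/{granularity}/{digits[:3]}/{digits[3:6]}/{digits[6:]}.osc.gz"
        )
    return urls


def fetch_state(granularity: str) -> str:
    """Download the feed's current state.txt (performs a network request)."""
    with urllib.request.urlopen(f"{BASE_URL}/{granularity}/state.txt") as response:
        return response.read().decode("utf-8")


# Offline example with a made-up state text.
sample_state = "sequenceNumber=51123\ntimestamp=2018-07-01T00\\:02\\:00Z\n"
urls = pending_diff_urls("hour", 51121, current_sequence(sample_state))
```

In real use, sample_state would be replaced by the result of fetch_state("hour"), and each URL would be downloaded and applied in order.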

Redaction period

OpenStreetMap changed its license in 2012. This involved asking all contributors to agree to the new terms and redacting the data contributed by those who did not. Because this period brought extensive changes, consisting largely of deletions, and because developers needed to be aware of the license change itself, the feed was placed at a different location, https://planet.openstreetmap.org/redaction-period/, and later moved again to https://planet.openstreetmap.org/replication/, where the diffs we use today are published. The recommendation at the time was for developers to stop consuming diffs until they were happy with the remapping progress. This is now many years in the past, so the old redaction period diffs and CC BY-SA licensed data can largely be ignored. The live diffs containing the latest OpenStreetMap changes are all ODbL licensed, and the data has come a long way since then.

Using the replication diffs

The most common way to use the replication diffs is via Osmosis, which will automatically download the relevant diffs and combine them to provide all the changes since it was last run. The diffs can be consumed directly, but this can introduce unnecessary complexity and is not recommended.

Alternatively, osmupdate can be used to create cumulative diff files (.osc, .osc.gz, .o5c, .o5c.gz). The program downloads all necessary diffs between a given timestamp and now. Depending on the length of this period, it downloads and processes minutely, hourly and daily diffs. It is faster than Osmosis and a bit easier to handle, but it lacks a lot of the functionality Osmosis provides. For example, osmupdate cannot update databases or write full history diffs (the output file contains only the newest version of each object).

With Mapnik

Instructions and example scripts to operate minutely-updated Mapnik can be found on the Minutely Mapnik page.

Using Osmosis

More details can be found in the read change interval section of the Osmosis documentation. Briefly, this is the way it works:

osmosis --rrii workingDirectory=.

This will initialise the current directory as a replication workspace, creating a configuration.txt file. By default, this is initialised to minutely diffs, so if you want hourly you should edit the file so that it references the hourly replication diffs URL. For simplicity the rest of this will assume you want minutely diffs, but simply replace minute with hour to use the hourly diffs. Download a state file:

wget https://planet.osm.org/replication/minute/state.txt

This will be the most recent state file, which may not be suitable for your needs. To reset it to an earlier state all you need to change is the sequenceNumber entry. To find the appropriate sequence number by timestamp you can either look through the diff files (the file timestamp is almost always the same as the timestamp in the file), or use Peter Körner's tool.

Unfortunately, it is not guaranteed that minutely replication files are published every minute, so you cannot rely on simple arithmetic to get the desired replication diff.
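Since simple arithmetic is unreliable, one alternative is a binary search over sequence numbers that compares state file timestamps. The sketch below makes two assumptions: timestamps never decrease as the sequence number grows, and a state file can be read for every sequence number in the searched range. The timestamp_of callable is a hypothetical hook that would, in practice, download and parse the state file for a given sequence number.

```python
from datetime import datetime, timezone
from typing import Callable


def find_sequence(target: datetime, newest: int,
                  timestamp_of: Callable[[int], datetime], oldest: int = 0) -> int:
    """Return the highest sequence number whose state timestamp is <= target.

    Returns oldest if even the oldest state is newer than target.
    """
    low, high = oldest, newest
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if timestamp_of(mid) <= target:
            low = mid
        else:
            high = mid - 1
    return low


# Made-up, irregularly spaced state timestamps standing in for real state files.
fake_states = {
    0: datetime(2018, 7, 1, 10, 0, tzinfo=timezone.utc),
    1: datetime(2018, 7, 1, 10, 1, tzinfo=timezone.utc),
    2: datetime(2018, 7, 1, 10, 4, tzinfo=timezone.utc),
    3: datetime(2018, 7, 1, 10, 5, tzinfo=timezone.utc),
    4: datetime(2018, 7, 1, 10, 9, tzinfo=timezone.utc),
}
target = datetime(2018, 7, 1, 10, 4, 30, tzinfo=timezone.utc)
start = find_sequence(target, newest=4, timestamp_of=fake_states.__getitem__)
```

Choosing a state at or before the wanted time means some already-known data may be fetched again, which is harmless for a tool that tolerates duplicates.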

Now that Osmosis is set up, whenever you need diffs you can run:

osmosis --rri workingDirectory=. --wxc foo.osc.gz

This writes all the changes between the previous sequence number and the most recent one to the file foo.osc.gz. A parameter in configuration.txt controls the maximum time range of diffs to download and combine in one run (it defaults to 1 hour).
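To check what a change file such as foo.osc.gz contains, a short script can count its entries. This sketch relies only on the general OsmChange layout, in which create, modify and delete blocks contain node, way and relation elements. The function name summarise_osc is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def summarise_osc(path: str) -> Counter:
    """Count (action, element type) pairs in a gzip-compressed OsmChange file."""
    counts = Counter()
    action = None
    with gzip.open(path, "rb") as stream:
        # Stream the XML so that large diffs are not loaded as one string.
        for event, element in ET.iterparse(stream, events=("start", "end")):
            if event == "start" and element.tag in ("create", "modify", "delete"):
                action = element.tag
            elif event == "end" and element.tag in ("node", "way", "relation"):
                counts[(action, element.tag)] += 1
                element.clear()   # drop the element's children and attributes
    return counts


# Example call, assuming foo.osc.gz exists in the current directory:
# print(summarise_osc("foo.osc.gz"))
```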

Using osmupdate

There is a detailed description on the osmupdate wiki page. You can also consult the program's help page (option --help). Here is a short example of how to create a planet change file for the time range between November 1, 2011, 21:59 (UTC) and today:

./osmupdate 2011-11-01T21:59:00Z cumulated_changefile.osc.gz

To update an OSM data file you can use this command:

./osmupdate old_file.pbf new_file.pbf

The program will first determine the age of the old file. This is usually done by reading the file's header. If the header does not contain a file timestamp, the whole file is scanned to find the latest object timestamp. The new file is written with a file timestamp, so this automatic scan should be needed only the first time you update the file. Because of this, you do not need to download or edit any state files manually.

Retrieving a File's Timestamp

Sooner or later you will come across an OSM file on your local disk and want to know the date of the OSM data it contains. If you are lucky, the file name will contain the date and time. If not, you can try to read the file's timestamp, or analyse the file's contents to find the latest object timestamp. Besides inspecting an XML file header manually, you can use the program osmconvert to get the required information. For example:

./osmconvert file_with_timestamp.osm.pbf --out-timestamp
2011-08-01T23:50:00Z
./osmconvert file_without_timestamp.o5m --out-timestamp
(invalid timestamp)
./osmconvert germany.osm.pbf --out-statistics

timestamp min: 2005-07-05T02:14:17Z
timestamp max: 2011-07-31T19:59:46Z
lon min: -20.0712330
lon max: 21.1441799
lat min: 47.0830289
lat max: 59.9982830
nodes: 78138447
ways: 11342322
relations: 176024
node id min: 1
node id max: 1380816490
way id min: 92
way id max: 123952798
relation id min: 159
relation id max: 1693098

Osmosis is probably able to perform these tasks too, but the corresponding commands are not documented here.

Regionally limited diffs

Processing planet diffs often consumes a lot of server resources, and not everyone needs worldwide coverage.

See Also

OSM file formats#File formats for diffs