User:MaZderMind/Reading OSM History dumps

From OpenStreetMap Wiki
Jump to: navigation, search

There are many reasons why we would like to have the history of OSM at our fingertips: render an animation how my village grew up while I was mapping it, get an insight in what a Changeset really changed or see how a way changed its geometry over time. With the Full-Planet dumps we have all information we need but it’s not easily accessible. What we want is a database that can give us quick answers to the following questions:

  • Where was Node X in Version Y?
  • Of which Nodes did Way X consist in Version Y and which versions of the Nodes were used to make up that way?
  • Which Nodes were in which Versions inside of BBOX X?
  • Which Nodes were in which Versions inside of BBOX X at Date Y?
  • Which Ways did use these Nodes in the Versions?
  • …and some other questions.

What we want is a Database in a modified Osmosis simple schema that contains this version information on Nodes, Ways and Way-Members. What we also want is the ability to build LINESTRING geometries for all versions of all ways. I’m leaving Relations out for simple reasons: Thinking about Relations always gives me a headache and in their behavior they’re close to how we handle ways, so that it’s not too hard to carry the algorithms that are suitable for ways over to relations.

The Schema

When looking at the schema-sql, it seems that only the way_nodes need to be extended in order to store the way-version and the node-version

ALTER TABLE way_nodes ADD COLUMN way_version int NOT NULL;
ALTER TABLE way_nodes ADD COLUMN node_version int NOT NULL;

The Problem

The major issue is not how to store the data but how to generate it out of a history planet dump. Such a dump is very similar to a normal planet dump but it contains all versions of the same item:

<node id="2" lat="20" lon="10" Changeset="1" user="MaZderMind" uid="1" visible="true" timestamp="2009-01-01T01:00:00Z" version="1" />
<node id="2" lat="20" lon="20" Changeset="2" user="MaZderMind" uid="1" visible="true" timestamp="2009-01-02T01:00:00Z" version="2" />

One of the problems reading such data is the OSM versioning scheme. When we take a look at this simple example of an OSM history dump, we can see that the way #100 has one version v1. But how often has the way changed its geometry? Two times, as node #2 was moved after the way has been created. If we understand that not every change to a way produces a new version of it, it becomes clear that we need some kind of intermediate version numbers. Following this concept, the way has two versions and each one consists of a different set of versioned nodes and has a different geometry:

  • v1 consisting of the nodes #1v1, #2v1 and #3v1
  • v1.1 consisting of the nodes #1v1, #2v2 and #3v1

Another Problem is finding which version of the way belongs to which version of the node. When we take at this slightly changes example, we see that we have two major versions of way #100 but of which versioned nodes did v1 consist and how many intermediate versions are between v1 and v2? To answer these questions we need to check whether the node-versions have been created before the way or after it. If Node #1v2 was there before way #100v1, then way #100v1 has version v2 of node #1 as a member. But if that node has been created after the way was submitted, v1 of the way contains v1 of the node as member. So it’s all a matter of the sequence of creation.

It seems obvious to take the Changeset numbers as index of the creation sequence, but unfortunately the API v5/v6 transition created Changesets that are out of order. Just take a look at those two Changesets from my early mapping time: c866438 and c92876. The one with the higher number has been created one month before the one with the lower number. So we can’t take the Changeset number as sequence index. The only information we can use to detect the order of creation is the timestamp.

The other reason that changesets can't be used, is that something can be edited multiple times in the same changeset, or twice in one changeset and in between changed again in another changeset. Smsm1 (talk) 23:25, 9 March 2014 (UTC)

The Algorithm

In order to work with the Ways’ geometries we need random access to the nodes. This can be achieved using three strategies:

  • keep everything in memory
  • store data to disk and keep an index in memory
  • use the destination database and its index

The first option is just not suitable for 18 GB of bz2 compressed XML. Osmosis decided to go the second way because it has greater flexibility than the third way (it allows for example to have a –write-dump action). So first we need to write all versions of all nodes to our Cache. We need Access to all Versions of a Node at once as well as access to a specific version of the Node. The average number of Versions of a node is not too big (usually <10) so that it may be okay to just store the list and iterate over it to find a specific version. Also, a binary search could be used instead. Second we may need an index on the timestamp of each version of a node. We will have to quickly find the version that were the newest at the time, the way was created.

After we got the node-data and our indexes in place, we can iterate over all ways and then over all versions of the way. For each version of the way we need to iterate over the used node IDs. For each Node-ID we need to look up the way’s timestamp in the node-timestamp index to find the closest node version. We then update the Node-Entry in the Node-List of the way with this information. While we’re iterating over all versions of all nodes in the way, we collect their Changesets. Why again Changesets and not the timestamps? We need to know the number of visible changes to the way and, while it’s not true on API level, we’re taking a Changeset as an atomic set of changes that happened as one action here. For each Changeset we’re collecting the id and version of the nodes used in the last major version of the way. I guess we will have to sort this list by one of the nodes timestamps. From that list we can generate the intermediate versions of the way by creating a clone of the original way for each Changeset. After that step there are still the intermediate versions missing that happened after the last major version of the way, but I think that this will be very close to the above algorithm. When all versions of all ways are in place, we can start generating the geometries. This needs to take the now known node-version into account when looking for the Location of the nodes.

Implementation

I have a running concept implementation in PHP but it’s far from being complete.

  • It runs entirely in memory and so only works with very small files.
  • It does not collect the intermediate versions after the last major version of a way.
  • It does not build geometries.
  • It uses the way’s meta-info to build the intermediate ways but instead the node’s meta-info should be used
  • It does not sort the intermediate versions of a way by timestamp but by Changeset number
  • But it shows that this algorithms concept is workable.

I also have a somehow running osmosis plugin that's very memory intensive and not really suited for reals-world imports. The work with osmosis showed its incredible complexity. Because most parts of osmosis aren't history-aware, very much needs to be reimplemented. This, together with the complexity of osmosis itsself, led to the conclusion that it would be better to do a complete rewrite.

I started this rewrite using nodejs, this is where I'm currently working on.

Demo Data

It’s not easy to get demo data in the format of the Full-Planet dump. We can either generate it ourselves like the small pieces of OSM xml above or put it together ourselves with responses from the API as I did with my home street. To test the running implementation we’ll need a bigger set of demo-data but not as big as the full planet dump, a single village maybe. So we need a bbox cutter for this kind of dump. I’m hoping that someone will pick this task and produce some demo data sets to test with.