Talk:Overpass API/Augmented Diffs

From OpenStreetMap Wiki

What happens when something is modified twice during a minute ?

Only the last version of the element is included. I assume that such short-lived changes are negligible for general data quality. Because the primary purpose of the augmented diffs is to patch a slim database, I gave preference to small and promptly generated files. Technically this is not a problem of system load, but including every intermediate version would be rather expensive in development effort.

When an element is deleted, we get it in delete, but with changeset="the changeset in which this element was last modified before", instead of "the changeset in which this node was removed". Would you consider that as a bug or a feature?

For the use cases of patching a map or displaying changes, I consider this a feature. The old value is needed for the undo use case. However, a "who made this change" map would benefit from having this information together with the augmented diff. I tend towards including it in some way.

Therefore, it would be nice to also get the nodes of ways with no geometry changes (tag changes only), to be able to show them on a map. Consumers can still distinguish geometry from tag only changes by comparing the old and new versions.

This is already implemented. On whether the type of change should be indicated more precisely, see the "Sort order" section.
The current implementation is fine by me. This was actually meant as my answer to your question "So a general question to all: should we keep off the nodes of ways that have not changed their geometry?" [1] and to another statement, "It would be great to have a clue in the diff that it is just a tag edit and not a geometry edit, this would save the geometry reconstruction processing" [2] (I should have made that clearer). --Ikonor 11:48, 10 September 2012 (BST)
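
To illustrate the point that consumers can tell geometry changes from tag-only changes by comparing the old and new versions, here is a minimal sketch. The dictionary layout (`nodes`, `tags`) and the function name `way_change_kind` are assumptions made for this example and are not part of the augmented diff format. A way also changes shape when one of its nodes moves, so the sketch takes an optional set of node ids that appear as changed in the same diff.

```python
def way_change_kind(old_way, new_way, changed_node_ids=frozenset()):
    """Classify the difference between two versions of a way.

    old_way / new_way are hypothetical dicts with a "nodes" list of node ids
    and a "tags" dict. changed_node_ids holds ids of nodes that were moved
    in the same diff.
    """
    geometry = (
        old_way["nodes"] != new_way["nodes"]
        or any(ref in changed_node_ids for ref in new_way["nodes"])
    )
    tags = old_way["tags"] != new_way["tags"]
    if geometry and tags:
        return "geometry+tags"
    if geometry:
        return "geometry"
    if tags:
        return "tags"
    return "unchanged"


old = {"nodes": [1, 2, 3], "tags": {"highway": "residential"}}
new = {"nodes": [1, 2, 3], "tags": {"highway": "residential", "name": "Main St"}}
kind = way_change_kind(old, new)
```

A consumer could skip the geometry reconstruction whenever the result is "tags".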

For relations, a compromise might be to include only the objects of changed members (added, removed, role change). Created and deleted relations would include all members. This would probably make it possible to display changes, to update route length calculations and the like without storing geometries, and to update stored geometries, e.g. for multipolygon validation. Not sure what to do about relation tag changes though (the restaurant example).

and

Is it possible to include all the members of relations on changes of a single element of them, at least for these type= values? If it is, then it's technically possible to drop all the slim tables for osm2pgsql or implement diffs support for imposm.

There are relations that ultimately link to more than a million nodes. Think of national borders. Whenever a single node in such a relation changes its position or a tag, or the relation gets a name tag added in some language, we would get several megabytes of data and a delay of minutes or hours. I'm thinking of two possible strategies: we could give it a try, or I could just include an approximate bounding box. The latter would still help map applications, where a precise presentation of millions of nodes is not possible anyway, but it may be too imprecise for other use cases (the osm2pgsql use case), so I deem it only a second-choice solution.
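
As a rough illustration of the bounding box strategy, the following sketch reduces a list of member node coordinates to a single box. The function name and the (lat, lon) tuple layout are assumptions for this example, and it does not handle relations that cross the antimeridian.

```python
def bounding_box(coords):
    """Return (min_lat, min_lon, max_lat, max_lon) for (lat, lon) pairs.

    Hypothetical helper: a relation with a million member nodes would be
    represented by these four numbers instead of all its coordinates.
    """
    lats = [lat for lat, _ in coords]
    lons = [lon for _, lon in coords]
    return (min(lats), min(lons), max(lats), max(lons))


box = bounding_box([(50.0, 7.0), (51.5, 6.2), (49.8, 8.1)])
```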

Sort order

When the change to a node or way is just a tag change and does not involve a geometry change, do the diffs still provide all linked nodes/ways?

Yes, they do. However, to simplify the decision whether this requires a tag reconstruction, one could split up <keep> into something like <keep> and <move>. I think it is not worth the (implementation) effort, but I would do it if enough people opt for it.

Could there be back references for nodes in the keep section, to resolve which ways they belong to?

Same as above. It's redundant information, so I would do this only when a lot of people really want this feature.

A lot of software depends on getting sorted input data, usually ascending by object type, then by ID.

The order by type is maintained. The order by id is given up in favor of an order by type of action. Inside the action, the elements are again ordered by type. I would expect that order by action is more suitable to applications than order by id.
Every tool that applies a diff file to a planet file (or regional .osm file) "hopes" to get both in the same order, the diff file and the planet file. If this order can be provided, any tool will perform this task much faster because it will not need to buffer all the planet data in memory, in a temporary file or in a database.
If this traditional order is broken, some file-based tools will not work with Augmented Diffs, and some other tools will work but very inefficiently. I would appreciate it very much if a way could be found to maintain the traditional order (first by type, then by id). --Marqqs 11:55, 10 September 2012 (BST)
It is technically simple and possible, so I'll change the order to be 1st by type, 2nd by id, 3rd by type of action. --Roland
Thanks a lot! Please let us know when this is done. I will then upgrade all OSM software I maintain to enable reading of augmented diffs. Just one question: you are planning "action" as a third sorting criterion. Wouldn't it be better to sort the actions chronologically (just in case there is more than one action for the same object)? Otherwise no one would know the final state of this object. --Marqqs 15:55, 10 September 2012 (BST)
The sort order by action coincides with the chronological sorting: either there is only one version of an object present, or it is a delete/insert pair, which is always ordered delete/insert.
Excellent! --Marqqs 12:29, 15 September 2012 (BST)
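
The agreed order (1st by type, 2nd by id, 3rd by type of action, with delete before insert) can be sketched as a sort key. The dictionary layout and the rank tables are assumptions for this example. The position of "keep" between the other two actions is arbitrary here, because a kept element never also appears in a delete/insert pair.

```python
# Hypothetical rank tables; only the relative order matters.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}
ACTION_RANK = {"delete": 0, "keep": 1, "insert": 2}


def sort_key(element):
    """Order by element type, then by id, then by action (delete before insert)."""
    return (
        TYPE_RANK[element["type"]],
        element["id"],
        ACTION_RANK[element["action"]],
    )


elements = [
    {"type": "way", "id": 5, "action": "insert"},
    {"type": "node", "id": 12, "action": "insert"},
    {"type": "node", "id": 12, "action": "delete"},
    {"type": "node", "id": 3, "action": "keep"},
    {"type": "way", "id": 5, "action": "delete"},
]
ordered = sorted(elements, key=sort_key)
```

With this key, a tool that streams a planet file sorted by type and id can merge the diff in a single pass, and each delete/insert pair arrives in chronological order.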

General format

The files aren't properly XML encoded.

Fixed. This was a plain bug. Note that the augmented diffs before 18000 haven't been recreated.

The "osm" tag name is already used for a format, so I suggest not using it.

I opt for using <osmAugmentedDiff> as a base tag instead.

Why did you not use solely the existing XML tags <create>, <modify> and <delete>?

Because they don't properly describe what's happening. The core idea of the augmented diffs is to include some unchanged but related elements. This doesn't happen in diffs. Thus, the new category <keep> is used. On the other hand, <modify> doesn't apply to augmented diffs, because it brings the new version only. Thus, the augmented diffs use a pair of <delete>/<insert> for each modified element, with <delete> containing the old version.
Of course you can argue that either <delete> should also be renamed or <insert> should be called <create>. This was accidental.
Well, the format definition is still in an early stage. It would not hurt to undo this little "accident". So why not simply rename "insert" to "create" and "erase" to "delete"? Thus the only newly introduced feature would be "keep" (see OsmChange). This makes it easier to adapt existing software. --Marqqs 12:29, 15 September 2012 (BST)
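
For software that expects OsmChange semantics, the delete/insert pairs described above can be mapped back to create, modify and delete. This is only a sketch under the assumption that each element is identified by a (type, id) tuple; the function name `to_osmchange_actions` is made up for this example.

```python
def to_osmchange_actions(deleted, inserted):
    """Map augmented diff sections to OsmChange-style actions.

    deleted / inserted are iterables of hypothetical (type, id) tuples taken
    from the delete and insert sections. An element present in both sections
    was modified; one present in only a single section was created or deleted.
    """
    deleted, inserted = set(deleted), set(inserted)
    return {
        "create": inserted - deleted,
        "modify": inserted & deleted,
        "delete": deleted - inserted,
    }


actions = to_osmchange_actions(
    deleted=[("node", 1), ("node", 2)],
    inserted=[("node", 2), ("way", 7)],
)
```

Elements from the keep section have no OsmChange counterpart and would simply be dropped in such a conversion.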

Shall we use JSON or PBF instead of XML?

All the other change formats are in XML, and numerous very efficient XML parsers exist for all programming platforms. There's little reason to add yet another format.
I agree that there is no urgent need for a new Change format, and gzipped XML works fine, but there already exists an alternative Change format: .o5c. Since it can be processed much faster than XML it is used by osmconvert and osmupdate. --Marqqs 12:29, 15 September 2012 (BST)

New document format

Hi, can you add the change requests you have already approved to the following example document? --Andi 09:07, 12 September 2012 (BST)

<?xml version="1.0" encoding="UTF-8"?>
<osmAugmentedDiff version="0.6" generator="Overpass API">
<note>... </note>
<meta osm_base="2012-08-26T20\:24\:02Z"/>

  <!-- Elements are ordered as: nodes first, then ways, then relations.
       Within each class of elements they are ordered by id -->

<erase>
  <node ... />
  <!-- contains the nodes that are either explicitly deleted or the old version of nodes that are replaced by a new version. -->
</erase>
<keep>
  <node ... />
  <!-- contains the nodes that belong to changed ways including ways that contain a changed node. -->
</keep>
<insert>
  <node ... />
  <!-- contains the nodes that are updated in their newest version. -->
</insert>

<erase>
  <way ... />
  <!-- contains the ways that are either explicitly deleted or the old version of ways that are replaced by a new version. -->
</erase>
<keep>
  <way ... />
  <!-- contains the ways that contain a changed node. -->
</keep>
<insert>
  <way ... />
  <!-- contains the ways that are updated in their newest version. -->
</insert>

<erase>
  <relation ... />
  <!-- contains the relations that are either explicitly deleted or the old version of relations that are replaced by a new version. -->
</erase>
<keep>
  <relation ... />
  <!-- contains the relations that contain a changed node or way (including ways changed only by changing their underlying nodes). -->
</keep>
<insert>
  <relation ... />
  <!-- contains the relations that are updated in their newest version. -->
</insert>

</osmAugmentedDiff>
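
A consumer of a document shaped like the example above could be sketched as follows, using Python's standard `xml.etree.ElementTree`. The sample data, the attribute values and the function name `read_actions` are assumptions for illustration; the proposed format may still change, and the "..." placeholders in the example are filled in here with made-up elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample following the proposed layout (erase / keep / insert blocks).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<osmAugmentedDiff version="0.6" generator="Overpass API">
<erase>
  <node id="1" version="1" lat="50.0" lon="7.0"/>
</erase>
<keep>
  <node id="2" version="3" lat="50.1" lon="7.1"/>
</keep>
<insert>
  <node id="1" version="2" lat="50.0" lon="7.2"/>
</insert>
<erase>
  <way id="10" version="4"><nd ref="1"/><nd ref="2"/></way>
</erase>
<insert>
  <way id="10" version="5"><nd ref="2"/><nd ref="1"/></way>
</insert>
</osmAugmentedDiff>
"""


def read_actions(xml_bytes):
    """Return (action, element type, id, version) tuples in document order."""
    root = ET.fromstring(xml_bytes)
    result = []
    for block in root:
        # Skip <note>, <meta> and anything else that is not an action block.
        if block.tag not in ("erase", "keep", "insert"):
            continue
        for element in block:
            result.append(
                (block.tag, element.tag,
                 int(element.get("id")), int(element.get("version")))
            )
    return result


actions = read_actions(SAMPLE.encode("utf-8"))
```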

Software to read Augmented Diffs

What software is currently able to read Augmented Diffs? I would guess: Osmosis – but I am not sure. What about a small table which shows what software can read or write Augmented Diffs?
I just patched osmconvert to version 0.7E. The program can now read Augmented Diffs and convert them to .osc. It also can process them and update an existing .osm or .pbf. All this is highly experimental, you should expect one or two bugs... :-)
Am I right to assume that Augmented Diffs are the only diffs which can be used to extract regional diffs? I did not try this, but in theory... it should work. --Marqqs 18:00, 18 September 2012 (BST)

File Timestamp

Currently, Augmented Diffs come with a file timestamp, for example: <meta osm_base="2012-09-18T02\:21\:02Z"/>
I appreciate this very much! However, there are already two different file timestamp formats in use. Might it be possible to use one of these existing definitions? --Marqqs 18:00, 18 September 2012 (BST)
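
Until the format is settled, a reader could normalize the current osm_base value before parsing it. The sketch below assumes that the value is an ISO 8601 UTC timestamp with backslash-escaped colons, as in the example above; the function name `parse_osm_base` is made up for this illustration.

```python
from datetime import datetime, timezone


def parse_osm_base(value):
    """Parse an osm_base value such as 2012-09-18T02\\:21\\:02Z into a datetime.

    Backslash-escaped colons are unescaped first, so the plain ISO 8601
    form is accepted as well.
    """
    cleaned = value.replace("\\:", ":")
    parsed = datetime.strptime(cleaned, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


stamp = parse_osm_base(r"2012-09-18T02\:21\:02Z")
```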