User:Hubne/drafts/Duplicate keys

From OpenStreetMap Wiki

This page is being drafted and will be moved into the main wiki content when ready. Subedits are welcome, as are comments in its discussion page.

The currently endorsed practice for duplicate keys is documented in the OSM FAQ.

Duplicate keys occur when an OSM object carries two or more key/value pairs that share the same key. The values should differ, otherwise the pair would simply repeat information. For example:

 object:
   key1="value1"
   key2="value2"
   key2="value3"

The concept is that an object can have several values for a property: essentially a series of atomic values, akin to an unordered list.
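As a rough illustration of this model, a client could hold an object's tags as a mapping from each key to a set of values. This is only a sketch: the function name collect_tags is hypothetical and is not part of any OSM library.

```python
from collections import defaultdict


def collect_tags(pairs):
    """Group (key, value) pairs into a mapping of key -> set of values.

    A set is used because the values for a key are treated as unordered
    and an exact repeat of a key/value pair adds no information.
    """
    tags = defaultdict(set)
    for key, value in pairs:
        tags[key].add(value)
    return dict(tags)


# The example object from above, written as a list of key/value pairs.
pairs = [("key1", "value1"), ("key2", "value2"), ("key2", "value3")]
tags = collect_tags(pairs)
```

Here tags["key2"] holds both "value2" and "value3", while tags["key1"] holds the single value "value1".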

Duplicate keys in the "natural" format were permitted in the OSM API prior to version 0.6, but were never supported by any editors. A very small number of objects in the OSM database take advantage of this.

A syntactic device commonly employed to serialise this concept is a delimited string. For example:

 object:
   key1="value1"
   key2="value2,value3"
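A minimal sketch of this serialisation, assuming a comma as the delimiter and assuming that no value contains the delimiter itself (escaping is not handled here). The function names to_delimited and from_delimited are hypothetical.

```python
def to_delimited(values, delimiter=","):
    """Serialise an unordered collection of values into one string.

    The values are sorted only to make the output stable; the order
    carries no meaning.
    """
    return delimiter.join(sorted(values))


def from_delimited(text, delimiter=","):
    """Split a delimited string back into a set of values."""
    if not text:
        return set()
    return set(text.split(delimiter))


serialised = to_delimited({"value2", "value3"})
restored = from_delimited(serialised)
```

With the example above, serialised is the string value2,value3 and restored is the original pair of values.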

Use cases

name

Sometimes an object has more than one name, for a variety of reasons (language, history, common use). It is arguably rare that a place has two names of equal weight.

alternate names

This has been used to indicate other forms of names, but the key is still vague about what kind of alternate name is being noted. Thus, it is reasonably probable that an object will have more than one alternate or variant name. [4]

functions

sport

features

source

It is entirely reasonable that information about an object has been sourced from more than one place.

address

Some properties have more than one address, and it can be the case that neither is the preferred address. This is particularly important for routing software.

refs

A way can have multiple refs. [7]

Repeated key syntax

This is perhaps the most natural way to express duplicate keys, as it follows the native structure of the schema and thus requires no additional parsing.

Pros
logical and elegant
consistent with content model
expected
requires less parsing – effective use of XML, in the same way the RESTful API tries to effectively use HTTP
Cons
would require editor and renderer patches
database performance issues without refactoring [2]
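To illustrate the "requires less parsing" point, the sketch below reads repeated tag elements from an OSM XML fragment using only Python's standard XML parser. The fragment is invented for illustration, and the helper name read_tags is hypothetical.

```python
import xml.etree.ElementTree as ET

# An invented OSM XML fragment in which key2 appears twice.
OSM_XML = """
<osm version="0.5">
  <node id="1" lat="0.0" lon="0.0">
    <tag k="key1" v="value1"/>
    <tag k="key2" v="value2"/>
    <tag k="key2" v="value3"/>
  </node>
</osm>
""".strip()


def read_tags(element):
    """Return a mapping of key -> list of values for one OSM element."""
    tags = {}
    for tag in element.findall("tag"):
        tags.setdefault(tag.get("k"), []).append(tag.get("v"))
    return tags


root = ET.fromstring(OSM_XML)
node_tags = read_tags(root.find("node"))
```

No splitting or unescaping of values is needed: each value arrives as its own attribute, and the XML parser has already handled any character escaping.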

Delimited list syntax

Here all the values for a key are stored in a single tag value, separated by a delimiter.

Pros
the database can perform better using just the key/value pair as a unique index, without needing to create an additional dedicated index
Cons
extra validation work and handling for the API
pushes the tokenisation and serialisation burden to clients, which makes the API less attractive to, and usable for, developers
serialisers and tokenisers need to handle escaping/unescaping of delimiters [3]
need to agree on a delimiter dataset-wide and apply it consistently
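The escaping burden mentioned in the cons can be sketched as follows. This assumes a comma delimiter and a backslash escape character, neither of which is an agreed OSM convention; the function names join_values and split_values are hypothetical.

```python
def join_values(values, delimiter=",", escape="\\"):
    """Serialise values, escaping the escape character and the delimiter."""
    escaped = []
    for value in values:
        value = value.replace(escape, escape + escape)
        value = value.replace(delimiter, escape + delimiter)
        escaped.append(value)
    return delimiter.join(escaped)


def split_values(text, delimiter=",", escape="\\"):
    """Tokenise a delimited string, honouring escaped characters."""
    values, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Take the next character literally, whatever it is.
            current.append(text[i + 1])
            i += 2
            continue
        if ch == delimiter:
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current))
    return values


names = ["Smith, Jones & Co", "back\\slash", "plain"]
round_trip = split_values(join_values(names))
```

Every client that reads or writes such a value would need an equivalent pair of routines, and all of them would have to agree on the same delimiter and escape character.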

Case-by-case workarounds

Most use cases can be handled with various workarounds. The problem with relying on workarounds is that each one must be documented and agreed upon to be useful. This undermines the deliberately simple design goal of OSM's uncontrolled/folksonomous tagging keys by introducing tricks and exceptions.

Examples

Converting list values to qualified boolean keys

A list of values can often be expressed as separate tags, with the individual list values becoming keys, and the value a boolean ("yes"|"no"). The original key for the group might be employed as a namespace prefix to provide context.

For example:

 features="toilet,rubbish_bin,lighting"

becomes

 feature:toilet="yes"
 feature:rubbish_bin="yes"
 feature:lighting="yes"

This works well, but greatly increases the number of tags that clients and renderers need to recognise. On the other hand, they would need to recognise the same values under any other serialisation, irrespective of the number of keys.
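The conversion can be sketched in a few lines. The function name explode_list_tag is hypothetical, and the comma delimiter is an assumption taken from the example above.

```python
def explode_list_tag(key, value, prefix=None, delimiter=","):
    """Turn one delimited list tag into several namespaced boolean tags.

    The prefix defaults to the original key; the example above uses the
    singular "feature" as the prefix for the "features" key.
    """
    prefix = key if prefix is None else prefix
    items = (item.strip() for item in value.split(delimiter))
    return {f"{prefix}:{item}": "yes" for item in items if item}


exploded = explode_list_tag(
    "features", "toilet,rubbish_bin,lighting", prefix="feature"
)
```

The result contains the three tags feature:toilet, feature:rubbish_bin and feature:lighting, each with the value "yes".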

Inventing variant keys

This means refining the meaning of each additional key instance so that a different, more specific key can be used to describe it. For example, "name" could be "local_name", "alternate_name", "other_name", "variant_name", "historic_name", etc.

This offers richer descriptions, but relies on them being well understood by data consumers. For example, in the case of the "name" key, a placename search or routing application should know about each variant form of the key.

It cannot always be applied easily, so is only a partial solution. For example, a street may have two alternate names which cannot be disambiguated beyond that they are "alternate_name"s. [6]
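The burden this places on data consumers can be sketched as follows: a placename search has to carry its own list of every variant key it should consult. The key list is taken from the examples above, the tag values are invented, and the function name searchable_names is hypothetical.

```python
# Variant keys a consumer would need to know about (from the examples above).
NAME_KEYS = (
    "name",
    "local_name",
    "alternate_name",
    "other_name",
    "variant_name",
    "historic_name",
)


def searchable_names(tags):
    """Collect every name-like value on an object, in NAME_KEYS order."""
    return [tags[key] for key in NAME_KEYS if key in tags]


# Invented tags for illustration only.
street = {
    "highway": "residential",
    "name": "Main Street",
    "historic_name": "Old Road",
}
found = searchable_names(street)
```

A second alternate name still has nowhere to go, since the mapping can hold only one "alternate_name" value per object.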

Increasing granularity

If an object can be divided into smaller mappable objects, those objects can be tagged separately in such a way that they adopt the single key values. In addition, a relation can be created to tag the original aggregate object. For example, a set of shops could be split so that each shop is mapped with its own "shop" key value. A sports venue could be divided so that each area within it has its own "sport" tag. [7]

The problem with this workaround is that it mandates detail, so that an editor may find this daunting enough that she prefers not to map the object at all.

(Open question: can relations nest?) Another problem is that it only works for one level of granularity. A shop within a shopping centre that is a "multi-shop" (e.g. a combined butcher and delicatessen) suffers the original problem.

It also fails to cover cases where the multiple tags do not apply to spatial aspects of the object, but rather to other kinds of conditions such as the season. A playing field where Australian Football is played in winter and cricket is played in summer cannot be represented this way.

Archives issue

The OSM API needs to make the history of each object accessible, which poses a problem when the schema changes. If the schema changes to formally disallow duplicate tag keys, the API still needs to make available the historical versions of objects that used duplicate keys.

The schema probably needs to allow legacy content models within elements that are clearly labelled as historical (either with a wrapper element or an attribute). [5]

References

  1. http://lists.openstreetmap.org/pipermail/dev/2008-October/012059.html - count of duplicate keys posting
  2. http://lists.openstreetmap.org/pipermail/dev/2008-October/012038.html - database index argument
  3. http://lists.openstreetmap.org/pipermail/talk/2008-October/030611.html - serialising escapes issue
  4. http://lists.openstreetmap.org/pipermail/talk/2008-October/030689.html - alternate names discussion
  5. http://lists.openstreetmap.org/pipermail/dev/2008-October/012069.html - history issue
  6. http://lists.openstreetmap.org/pipermail/talk/2008-October/030771.html
  7. http://lists.openstreetmap.org/pipermail/dev/2008-October/012118.html