OpenMetaMap

From OpenStreetMap Wiki

OpenMetaMap is a proposed tool for linking external datasets with OSM data without actually importing them into OSM. The basic idea is that the OSM database should contain only original, crowd-generated data and all external datasets should live somewhere else, while we still want and need data from those other databases in OSM renderings (collective databases).

Motivation

Data imports have both goods and bads.

Goods:

  • Instant coverage of areas where there are no mappers
  • Use of donations from the established mapping community
  • Includes data that is too hard or impossible to collect by crowd-sourcing: shorelines, administrative boundaries, forests etc.
  • There is no control over the data's later life - anyone can edit it as they wish

Bads:

  • Imported data becomes a fork of the original dataset and stays unmaintained from then on
  • The data donor cannot get community changes back without share-alike license terms
  • Demotivates the community
  • Nobody feels responsible for the dataset, and it is too big for one person
  • The data source becomes unclear
  • Legal issues from license mixing
  • I (--Jaakl) have personally done several nation-wide and smaller-scale imports for my country (admin borders, addresses, Corine Land Cover etc.); about 40% of Estonian OSM data currently comes from those. I have now seen all these bad things happen. I believe that if OpenMetaMap (or something similar) had existed, I would not have done these imports and OSM would be a better place.

There must be a way to fight the bads of imports while keeping as many of the goods as possible. The aim is to wash the non-crowdsourced data out of OSM without throwing the baby out with the bathwater.

The Idea

For example, say I am the national association of museums, with a nice database of all the museums in the country. I want to share it with the OSM community, and I trust the community enough to maintain it, so I also want to get community edits back. I could just import the data and maintain it in two places (my own database system and OSM) in parallel, but that would be far too much extra work for me, so in reality today I would import (donate) the data and then leave it alone there, maintaining only my own database.

Workflow:

  1. Instead of importing the data, I choose to share my database online
  2. I put my dataset online using the OpenMetaMap API
  3. I add my API URL to the OpenMetaMap (OMM) registry; its status there is "new"
  4. I merge data objects semi-manually, using the OMM JOSM plugin: mainly this means marking "duplicates". A duplicate in OMM does not mean redundant data - in some cases it is normal for most of the objects to be duplicated. They just have to be linked properly.
  5. I mark my database status in the OMM registry as "merged"

For this we need to:

  1. Keep external data sources clearly separate from the OSM database. This means: do not import into the OSM database at all. To keep things simple, the external database is expected to use the OSM API (ideally a live API, but even a static OSM file could do).
  2. Build a meta-database (OpenMetaMap) with references to all compatible external sources. It would look very similar to the Import/Catalogue table, just with more live technical data.
  3. Have object-level Links between objects in the OSM database and the external databases; these Links are stored in the meta-database (OpenMetaMap)
  4. Have tools that make linking databases as easy as importing and merging them is today: a JOSM plug-in
  5. Have rendering toolchain support for a mix of databases: perhaps an osmosis plug-in which imports from the different sources and finally produces a combined mapnik database, similar to the one produced now with osm2pgsql
  6. Have a warehouse for "homeless" external datasets

OpenMetaMap API

General properties:

  • RESTful data API to add/change/delete OMM database objects (Links).
  • Separate HTML webpage for the registry of external datasets
  • The OMM Link database is a table of (keyA,keyB)->Link values, where:
  • keyA is the OSM ID and keyB is the unique object ID in the external database.
  • Link describes how the objects are related, with regard to both tags and object coordinates. A Link has a type and additional data

Link types:

  • SIMILAR_TO - a candidate for merging; a temporary status. It is usually set automatically; the idea is the same as for the FIXME tag
  • IDENTICAL - an exact duplicate. Can be found automatically
  • INTERNAL - the object in the OSM database is better (has more attributes, is in a more precise location etc.) than the external object (keyB)
  • EXTERNAL - the object in the external database is better and should be used
  • MERGE - neither A nor B is simply better, so they should be combined. How exactly is defined in the Link merge data. This is the most complicated case, but a lot of data could fall into this category.
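
The Link table and Link types described above could be modelled roughly as follows. This is only a minimal sketch: the class and field names, the "node/123" form of the OSM ID and the "museum-42" external ID are all assumptions made for illustration, not part of any existing OMM specification.

```python
from dataclasses import dataclass, field
from enum import Enum


class LinkType(Enum):
    SIMILAR_TO = "SIMILAR_TO"  # merge candidate, temporary status
    IDENTICAL = "IDENTICAL"    # exact duplicate
    INTERNAL = "INTERNAL"      # OSM object is better
    EXTERNAL = "EXTERNAL"      # external object is better
    MERGE = "MERGE"            # combine both, see Link merge data


@dataclass
class Link:
    link_type: LinkType
    # Additional data; only meaningful for MERGE Links (hypothetical layout).
    data: dict = field(default_factory=dict)


# The Link database as a (keyA, keyB) -> Link table.
# keyA is the OSM ID, keyB the object ID in the external database.
links = {}
links[("node/123", "museum-42")] = Link(LinkType.IDENTICAL)
links[("node/456", "museum-43")] = Link(
    LinkType.MERGE, {"set_tags": {"opening_hours": "Tu-Su 10:00-18:00"}}
)
```

A real service would keep one such table per external dataset, or add the dataset name to the key.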

Link merge data:

  • If the Link type is MERGE, the merge data is a diff between the OSM data and the external data, in a format similar to an OSM diff. It can define a set of add tag, modify tag/coordinate and delete tag actions. A special case is a merge with a deleted object, which marks that the object in OSM should be deleted according to the external database.
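
To make the merge data more concrete, here is a small sketch of applying such a diff to an OSM object. The diff layout (the keys "set_tags", "delete_tags", "move_to" and "delete_object") and the dictionary representation of an object are assumptions for this example only; the proposal says just that the format would be similar to an OSM diff.

```python
def apply_merge(osm_obj, diff):
    """Return a merged copy of osm_obj, or None if the diff deletes it."""
    # Special case: the external database says the object no longer exists.
    if diff.get("delete_object"):
        return None
    tags = dict(osm_obj["tags"])
    # "add tag" and "modify tag" actions.
    tags.update(diff.get("set_tags", {}))
    # "delete tag" actions.
    for key in diff.get("delete_tags", []):
        tags.pop(key, None)
    merged = dict(osm_obj, tags=tags)
    # "modify coordinate" action.
    if "move_to" in diff:
        merged["lat"], merged["lon"] = diff["move_to"]
    return merged


museum = {"lat": 59.437, "lon": 24.745,
          "tags": {"tourism": "museum", "note": "check name"}}
diff = {"set_tags": {"name": "City Museum"},
        "delete_tags": ["note"],
        "move_to": (59.4372, 24.7453)}
merged = apply_merge(museum, diff)
```

The function returns a new object and leaves the original untouched, which matches the idea that Linking should not modify OSM itself.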

A Link also has its own history/metadata, like OSM objects: added, modified, version, added userid, last modified userid. It also has (optional) version numbers for both ObjA and ObjB, so that external changes can be detected more clearly; if a version number is not available, change detection requires deep object diff analysis.
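
The version-based change detection could look like the sketch below. The function name and the three-valued result are assumptions for illustration: None stands for "no version number available, fall back to a deep object diff".

```python
from typing import Optional


def changed_since_link(stored_version: Optional[int],
                       current_version: Optional[int]) -> Optional[bool]:
    """Compare the object version stored in a Link with the current one.

    Returns True if the object changed after the Link was made, False if
    it did not, and None if a version number is missing on either side.
    """
    if stored_version is None or current_version is None:
        return None
    return stored_version != current_version


unchanged = changed_since_link(3, 3)
changed = changed_since_link(3, 4)
unknown = changed_since_link(None, 4)
```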

Maybe Links should be grouped into changesets, like OSM changes, for easier change management?

Who can edit Links? Following the general OSM style, there is no anonymous editing of Links (everyone is authenticated), and there are no special rights for creating, editing or deleting Links. Anyone can merge the datasets. The OMM software would be free and open source, so if someone wants to build more controlled/private Link databases for in-house use, we cannot stop it.

Integration phases

An external dataset can be integrated with OSM at different levels, from the simplest to the most advanced:

  1. No linking - import of data (currently the only option); the dataset is registered on the ListOfImports wiki page
  2. The dataset is registered in the MetaMap directory. In addition to the list of imports, it has more technical info and a machine-readable structure (with specific BBOX, Object ID and other metadata)
  3. The dataset has a minimal static on-line API (.osm or .osm.bzip2 XML files on an HTTP or public FTP server; maybe Shapefiles can be supported). The files are updated occasionally
  4. The dataset has a live read-only API (OSM API; maybe WFS can be supported)
  5. The dataset has a live read-write API (OSM API), so changes can also be saved there.
  6. The dataset has history/change management. Each object has proper version tags, so automatic versioning of Links is enabled. Possible for both read-only and read-write datasets.

Some options here (e.g. read-write APIs) will be supported later, as they require more complex support in editors and other tools.

Data hosting options

External data can be kept in:

  • On the original provider's server only
  • Hosted by the original provider and cached on the OMM server for better performance and reliability
  • Hosted in a special service on the OMM server, but updated there by the original provider (or the community)

Usage of combined data

Data is combined (for rendering, for example) using the following steps (special cases omitted):

  1. Take data from OSM for a certain BBOX

Then for each external dataset used:

  1. Take Links from the OMM database for the same BBOX
  2. Take objects from the external database
  3. Apply the following rules to every object in the external database:
      • If there is no Link - take the external object
      • If there is a Link EXTERNAL - take the external object, delete the OSM object
      • If there is a Link INTERNAL, IDENTICAL or SIMILAR_TO - omit the external object, keep the OSM object
      • If there is a Link MERGE - apply the Link diff from the additional data to the OSM object
  4. Add a short reference tag pointing to the external dataset (just for information).

The result would be another standard OSM file that combines OSM and external data, just as if it had been imported and merged the old way.
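
The combining rules can be sketched in a few lines. Everything concrete here is an assumption made for illustration: objects are plain dictionaries, Links are looked up by external ID as (OSM ID, type, diff) tuples, the diff uses hypothetical "set_tags", "delete_tags" and "delete_object" keys, and "omm:source" is an invented name for the reference tag.

```python
def tag_source(obj, dataset):
    """Return a copy of obj with a reference tag to the external dataset."""
    tags = dict(obj["tags"])
    tags["omm:source"] = dataset  # hypothetical tag key
    return dict(obj, tags=tags)


def combine(osm, external, links, dataset):
    """Combine OSM objects with one external dataset using its Links."""
    result = dict(osm)
    for ext_id, ext_obj in external.items():
        link = links.get(ext_id)
        if link is None:
            # No Link: take the external object.
            result[(dataset, ext_id)] = tag_source(ext_obj, dataset)
            continue
        osm_id, link_type, diff = link
        if link_type == "EXTERNAL":
            # External is better: drop the OSM object, take the external one.
            result.pop(osm_id, None)
            result[(dataset, ext_id)] = tag_source(ext_obj, dataset)
        elif link_type == "MERGE" and osm_id in result:
            if diff.get("delete_object"):
                del result[osm_id]
                continue
            tags = dict(result[osm_id]["tags"])
            tags.update(diff.get("set_tags", {}))
            for key in diff.get("delete_tags", []):
                tags.pop(key, None)
            result[osm_id] = tag_source(dict(result[osm_id], tags=tags), dataset)
        # INTERNAL, IDENTICAL, SIMILAR_TO: keep OSM object, omit external.
    return result


osm = {"n1": {"tags": {"tourism": "museum"}},
       "n2": {"tags": {"tourism": "museum"}},
       "n3": {"tags": {"tourism": "museum", "name": "Art Museum"}}}
external = {"m1": {"tags": {"tourism": "museum", "name": "City Museum"}},
            "m2": {"tags": {"tourism": "museum", "name": "Toy Museum"}},
            "m3": {"tags": {"tourism": "museum", "name": "Art Mus."}},
            "m4": {"tags": {"tourism": "museum", "name": "Sea Museum"}}}
links = {"m1": ("n1", "EXTERNAL", None),
         "m2": ("n2", "MERGE", {"set_tags": {"name": "Toy Museum"}}),
         "m3": ("n3", "INTERNAL", None)}
combined = combine(osm, external, links, "museums")
```

In this toy run, n1 is replaced by the external m1, n2 gets the name from the MERGE diff, n3 is kept as it is in OSM, and the unlinked m4 is added. With several external datasets the function would simply be called once per dataset on the previous result.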

How not to use combined data

Some may be tempted to use OpenMetaMap Links as FIXME hints (from the OSM point of view): if you see that the EXTERNAL data is better, you take the data from the external source and fix the invalid data in OSM. Or to use MERGE Links to see the differences and update OSM data automatically with a bot. But this way you actually import data into OSM, with all the bad (and good) consequences, so this should be avoided. In fact, data Linking, if done properly, should leave no traces in OSM or in the external database. Of course there are cases where updating data in OSM makes perfect sense; then all import guidelines (check license compatibility, attribute properly etc.) apply to the imported/updated object(s).

Advantages and limitations

The OpenMetaMap solution would enable live links to external databases, keep the OSM database clean of non-crowdsourced data etc. - basically solving the import problems listed above. In addition, it would enable the following interesting options:

  • Selective combination of databases: the data user can also choose which combination of OSM and external sources to use, based on e.g. license terms.
  • The data donor will have better technical control over their data: it is very easy to see community changes to their objects. The donor then has three options: accept the community edit, ignore it, or revert it. The danger is that an overprotective data donor reverts too many community edits. With the current import approach this would be technically too complicated for the donor, even though it is doable in theory.
  • The solution would be technically completely transparent to the OSM API and core database, so for data maintainers it would be purely optional to use. No changes are needed in the OSM database for it. However, changes would be needed in the Mapnik rendering toolchain and server to make any use of OMM data sources.
  • It would not create significant extra complexity for the data donor doing the merge: they would still delete duplicates (which technically adds a Link and does not really delete anything). The OMM complexity can be hidden.
  • Legal aspect: OSM would not become a derivative of any external database; only the rendering would be. But for rendering you can always pick a suitable set of databases.
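
Selective combination by license could be as simple as filtering the dataset registry before combining. The registry entries, their field names and the dataset names below are invented for this sketch; only the "new" and "merged" statuses come from the workflow described earlier.

```python
# Hypothetical registry entries; field names are assumptions.
registry = [
    {"name": "museums", "license": "CC0-1.0", "status": "merged"},
    {"name": "road-register", "license": "CC-BY-NC-4.0", "status": "merged"},
    {"name": "bus-stops", "license": "CC0-1.0", "status": "new"},
]


def select_sources(registry, allowed_licenses):
    """Names of fully merged datasets whose license the data user accepts."""
    return [
        entry["name"]
        for entry in registry
        if entry["status"] == "merged" and entry["license"] in allowed_licenses
    ]


usable = select_sources(registry, {"CC0-1.0", "ODbL-1.0"})
```

Here only "museums" is selected: the road register is excluded by its license, and the bus stops are not yet marked as merged.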

Open questions

There are many general open questions: would OMM be feasible, would anyone really use it (as a donor and as a maintainer of data), how, when and by whom could it be implemented, would it remain totally separate from OSM (technically also possible) or be well integrated, etc.

Principal challenges:

  • When an object, say a museum, is in several databases, the contributor gets (too?) many choices, which can become confusing:
  1. update data in OSM only
  2. update in the external database (assuming that it has also read-write API)
  3. in some cases Link in OMM should be added/removed/changed
  4. different combinations of those

Possible solution: for any ambiguous edit of this kind, the user is prompted where to save the contribution. The current JOSM save-on-exit is somewhat similar: it offers to save edits to a file and/or to OSM, and asks to resolve conflicts if multi-user editing has happened. The goal is that linking should be as transparent as possible in all OSM editors, so the user just makes a simple fix (change/add a tag, move a point) and usually only OSM data is affected. This means that OMM should be integrated into the editors.

The following are some more specific technical challenges:

  • License of the Links database itself. To avoid any extra complexity I would suggest making it license-free.
  • Multi-database linking - what if an object is present in 3 or more different external databases? How can we handle this case nicely? Solution idea: reduce all these to two-way relations; with 3 databases A, B and C you would have 3 links: AB, BC and AC.
  • One-to-many Links. In OSM you may have 3 ways (segments) where the road register has 1 object. In the road case this may be avoidable: the best option would be to also have one object in OSM, something like a street-type Relation, but there is no established use of such Relations (only discussions so far). In some cases one-to-many links may be required.
  • Performance and reliability - if we have hundreds or thousands of external sources (in the long run every city should have a few of them), then global map making would require all the sources to be up and to respond quickly. Most probably local caching would be inevitable.
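
The pairwise reduction suggested for multi-database linking can be illustrated with the standard library; the function name is an assumption. It also shows how the number of two-way links grows with the number of databases holding the same object, namely n(n-1)/2 for n databases.

```python
from itertools import combinations


def pairwise_links(databases):
    """All two-way relations needed to link one object across databases."""
    return list(combinations(databases, 2))


three = pairwise_links(["A", "B", "C"])
five = pairwise_links(["A", "B", "C", "D", "E"])
```

For three databases this gives the pairs AB, AC and BC; for five databases it already gives ten.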

Current status

Done:

  • Discussed in a talk at WhereCampEU Berlin 2011; got very useful feedback
  • Registered openmetamap.org

Now:

  • Make a pretty 'coming soon' page for openmetamap.org (linking to this wiki page)
  • Drafting technical design (summarized in this page)
  • Collecting feedback, comments etc
  • Looking for acceptance by OSM key tool developers (JOSM and Potlatch/P2 developers mostly, also server and other editors)
  • Find volunteers to do development

Next steps and longer vision:

  • Technical implementation of needed tools, setting up server and service. Maybe can use an OSM dev server?
  • Undo ("externalize") old imports where possible and feasible
  • Ban new imports. Just kidding - of course everyone will always have the choice between the old dumb import and the much more advanced OMM solution. In many cases (like small static one-way imports) a plain old import could be just fine.