Essen Developers Workshop/Data Replication

From OpenStreetMap Wiki

Data replication and distribution

Idea

The idea is that write requests go to a central database, while read requests are served from a separate, hierarchical set of local cache servers.


                  central server
                        |
                   +----+---+
                   |    |   |
          +------+-+    |   +----+-------+
          |      |      |        |       |
         ics    ics  germany    uk       |
          |      |    cache    cache     |
      +---+--+          |              other
      |      |      karlsruhe       customized
    local   local     cache           clients
    server  server

  • central server: the OSM server we are using now
  • ics: intermediary cache server; there can be any number of these. I expect them to be complete mirrors
  • local server: a server that is run by any interested person who needs fast read access and is willing to set up a caching server (e.g., Osmarender developers, Editor developers, ...)
  • possible other clients: Dirty Tile marker, RSS feed, mapnik converter, ...

Write requests

  • The central server increases a counter on each change of the data.
  • This counter is referred to as the "sync point" from now on. The sync point can be a timestamp, a steadily increasing counter, or whatever.
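
A minimal sketch of the sync point idea, assuming the sync point is a plain integer counter (the text leaves this open; a timestamp would work as well). The class name CentralServer and its methods are hypothetical and only illustrate the bookkeeping, not the actual OSM server:

```python
from dataclasses import dataclass, field


@dataclass
class CentralServer:
    """Hypothetical in-memory stand-in for the central database."""

    sync_point: int = 0
    # Each entry is (sync point, change payload), in increasing order.
    changes: list = field(default_factory=list)

    def write(self, change: str) -> int:
        """Record one change and return the sync point assigned to it."""
        self.sync_point += 1
        self.changes.append((self.sync_point, change))
        return self.sync_point

    def changes_since(self, sync_point: int) -> list:
        """Return every change recorded after the given sync point."""
        return [(sp, c) for sp, c in self.changes if sp > sync_point]


server = CentralServer()
server.write("create node 1")
server.write("move node 1")
print(server.changes_since(1))  # [(2, 'move node 1')]
```

Because the counter only ever increases, "everything after sync point N" is an unambiguous description of what a cache is still missing.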


Read requests

  • A client (JOSM, tiles@home, mapnik, whatever) makes a read request to its caching server.
  • The read request should be something like the If-Modified-Since header of HTTP.
  • The caching server sends its last known sync point to its upstream server.
  • The upstream server sends a block of data (quite possibly in XML) containing the changes made since that sync point.
  • The local server persists those changes to its own local database and stores the sync point of the successful synchronization there as well.
  • The local server notifies the upstream of the successful synchronization so that the upstream can purge the data stored up to that point for that client
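
The synchronization steps above can be sketched as follows. This is an illustration under assumptions, not a proposed implementation: the sync point is taken to be an integer counter, data is kept in memory instead of a database, and the names Upstream, CacheServer, fetch and acknowledge are hypothetical.

```python
class Upstream:
    """Hypothetical upstream server that queues changes per client."""

    def __init__(self):
        self.sync_point = 0
        self.pending = {}  # client id -> list of (sync point, change)

    def register(self, client_id):
        self.pending.setdefault(client_id, [])

    def write(self, change):
        # Every change gets the next sync point and is queued for each client.
        self.sync_point += 1
        for queue in self.pending.values():
            queue.append((self.sync_point, change))

    def fetch(self, client_id, last_sync_point):
        """Return the block of changes newer than the client's sync point."""
        return [(sp, c) for sp, c in self.pending[client_id] if sp > last_sync_point]

    def acknowledge(self, client_id, sync_point):
        """Purge everything the client has confirmed as persisted."""
        self.pending[client_id] = [
            (sp, c) for sp, c in self.pending[client_id] if sp > sync_point
        ]


class CacheServer:
    """Hypothetical local cache; a list stands in for its local database."""

    def __init__(self, client_id, upstream):
        self.client_id = client_id
        self.upstream = upstream
        self.sync_point = 0
        self.data = []
        upstream.register(client_id)

    def synchronize(self):
        """Run one sync cycle and return the number of changes applied."""
        block = self.upstream.fetch(self.client_id, self.sync_point)
        if not block:
            return 0
        self.data.extend(change for _, change in block)  # persist first
        self.sync_point = block[-1][0]                   # then store the sync point
        self.upstream.acknowledge(self.client_id, self.sync_point)
        return len(block)


upstream = Upstream()
cache = CacheServer("karlsruhe", upstream)
upstream.write("create node 1")
upstream.write("move node 1")
print(cache.synchronize())  # 2
```

The acknowledgement is sent only after the changes and the new sync point have been stored, so a cache that crashes mid-cycle simply asks for the same block again.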

Possible optimizations

  • The local caches can be initially seeded with the latest planet file. The seeding would not be needed later on.
  • The client can notify its upstream of the kind of data it is interested in (bounding boxes, tag combinations like "complete railroad map", "complete water borders").
  • The server can stop preparing update data if a local cache server has not asked for its updates for a configurable amount of time (one day? one week?) and resume only when the local cache reappears.
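
Two of these optimizations, interest filtering and pausing updates for idle caches, could look roughly like the sketch below. Everything here is an assumption made for illustration: a single bounding box per cache, nodes given as (id, lon, lat) tuples, times as plain seconds, and the names Subscription and prepare_update.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    """Hypothetical record of what one cache is interested in."""

    bbox: tuple          # (min_lon, min_lat, max_lon, max_lat)
    last_request: float  # time of the cache's last request, in seconds

    def wants(self, lon, lat):
        min_lon, min_lat, max_lon, max_lat = self.bbox
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    def is_active(self, now, max_idle_seconds):
        return now - self.last_request <= max_idle_seconds


def prepare_update(subscription, nodes, now, max_idle_seconds=7 * 24 * 3600):
    """Return the nodes to queue for this cache, or nothing if it has gone idle."""
    if not subscription.is_active(now, max_idle_seconds):
        return []
    return [n for n in nodes if subscription.wants(n[1], n[2])]


# Approximate coordinates, for illustration only.
munich = Subscription(bbox=(11.3, 48.0, 11.8, 48.3), last_request=0.0)
nodes = [(1, 11.58, 48.14), (2, 13.40, 52.52)]  # (id, lon, lat)
print(prepare_update(munich, nodes, now=3600.0))  # [(1, 11.58, 48.14)]
```

Once the cache has been silent for longer than max_idle_seconds, the server prepares nothing for it until it makes a new request.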

Open points

  • Multiple bounding boxes (my local cache is interested in the data of Munich, Baghdad and Turkey)
  • Chunking of update data? What if my local server wants to re-sync just after the complete TIGER data was imported?
  • The features / bounding boxes a client / cache is interested in should not be re-sent with each request. Create a separate call for that.
  • What if a client changes the features it is interested in later? A cache that collected railroad data might ask for airports later. How do we force a re-sync? Re-seed with the current planet data?
  • How does a client discover the cache it is supposed to read from? GeoIP?
  • Push or pull: Steve prefers pull
  • If the socket is already open, send additional data without being asked. New request type for this?
  • The client can request a maximum amount of data it wants to receive
  • The server / client has to be set up by an administrator so Joe Random User can't swamp the main server
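
One possible way to combine chunking with a client-side maximum is sketched below. It is not a decision on the open points above, only an illustration: the sync point is assumed to be an integer, the change list is assumed to be sorted by sync point, the limit is counted in changes, not bytes, and fetch_chunk and resync are hypothetical names.

```python
def fetch_chunk(changes, last_sync_point, max_changes):
    """Return at most max_changes newer changes and whether more remain."""
    newer = [(sp, c) for sp, c in changes if sp > last_sync_point]
    return newer[:max_changes], len(newer) > max_changes


def resync(changes, last_sync_point, max_changes):
    """Pull chunks until the client has caught up with the server."""
    applied = []
    more = True
    while more:
        chunk, more = fetch_chunk(changes, last_sync_point, max_changes)
        if not chunk:
            break
        applied.extend(c for _, c in chunk)
        # Advance to the last sync point of this chunk before asking again.
        last_sync_point = chunk[-1][0]
    return applied, last_sync_point


changes = [(i, f"change {i}") for i in range(1, 8)]
applied, sync_point = resync(changes, last_sync_point=2, max_changes=3)
print(len(applied), sync_point)  # 5 7
```

A large import then arrives as a series of bounded blocks, and an interrupted re-sync can continue from the last stored sync point.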