Essen Developers Workshop/Data Replication

Data replication and distribution

Idea

The idea is that write requests go to a central database, while read requests are served from a separate, hierarchical set of local cache servers.


                  central server
                        |
                   +----+---+
                   |    |   |
          +------+-+    |   +----+-------+
          |      |      |        |       |
         ics    ics  germany    uk       |
          |      |    cache    cache     |
      +---+--+          |              other
      |      |      karlsruhe       customized
    local   local     cache           clients
    server  server

central server: the OSM server we are using now

ics: intermediary cache server; there can be any number of these. I expect them to be complete mirrors.

local server: a server that is run by any interested person who needs fast read access and is willing to set up a caching server (e.g., Osmarender developers, Editor developers, ...)

possible other clients: Dirty Tile marker, RSS feed, mapnik converter, ...

write requests

The central server increases a counter on each change of the data.

This counter is referred to as the "sync point" from now on. The sync point can be a timestamp, a steadily increasing counter, or anything else that orders the changes.
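A minimal sketch of the write side, assuming the sync point is a plain integer counter (the notes also allow a timestamp). The names `CentralServer`, `Change`, `write` and `changes_since` are made up for illustration and are not part of any existing OSM server code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    sync_point: int  # value of the counter after this change was applied
    payload: str     # opaque description of the change, e.g. an XML fragment


@dataclass
class CentralServer:
    sync_point: int = 0
    log: List[Change] = field(default_factory=list)

    def write(self, payload: str) -> int:
        """Apply one change and return the sync point assigned to it."""
        self.sync_point += 1
        self.log.append(Change(self.sync_point, payload))
        return self.sync_point

    def changes_since(self, sync_point: int) -> List[Change]:
        """Return every change made after the given sync point."""
        return [c for c in self.log if c.sync_point > sync_point]


server = CentralServer()
server.write("<node id='1' lat='51.45' lon='7.01'/>")
server.write("<node id='2' lat='51.46' lon='7.02'/>")
print([c.sync_point for c in server.changes_since(1)])  # [2]
```

Because every change gets exactly one sync point, a downstream server only has to remember a single number to describe how far it has caught up.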

read requests

The client (JOSM, tiles@home, mapnik, whatever) makes a read request to its caching server.

The read request should work much like the If-Modified-Since header of HTTP: the requester states what it already has and receives only newer data.

The caching server sends its last known sync point to its upstream server.

The upstream server sends a block of data (quite possibly in XML) containing the changes made since that sync point.

The local server persists those changes to its own local database and stores the successful "sync point" there as well.

The local server notifies the upstream of the successful synchronization, so that the upstream can purge the data stored up to that point for that client.
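The steps above can be sketched as one pull cycle. This is only an illustration under assumptions: the sync point is an integer, changes are opaque strings, and the two sides call each other directly instead of going over the network. The classes `Upstream` and `LocalCache` and their methods are hypothetical names.

```python
from typing import Dict, List, Tuple

ChangeRecord = Tuple[int, str]  # (sync point, payload)


class Upstream:
    """Keeps pending changes per client until the client acknowledges them."""

    def __init__(self) -> None:
        self.sync_point = 0
        self.pending: Dict[str, List[ChangeRecord]] = {}

    def register(self, client_id: str) -> None:
        # A client registered late misses earlier changes; it would have to
        # be seeded some other way (e.g. from a planet file).
        self.pending.setdefault(client_id, [])

    def write(self, payload: str) -> None:
        self.sync_point += 1
        for queue in self.pending.values():
            queue.append((self.sync_point, payload))

    def fetch(self, client_id: str, last_sync_point: int) -> List[ChangeRecord]:
        """Return the block of changes newer than the client's sync point."""
        return [c for c in self.pending[client_id] if c[0] > last_sync_point]

    def acknowledge(self, client_id: str, sync_point: int) -> None:
        """Purge everything the client has confirmed it stored."""
        self.pending[client_id] = [
            c for c in self.pending[client_id] if c[0] > sync_point
        ]


class LocalCache:
    def __init__(self, client_id: str, upstream: Upstream) -> None:
        self.client_id = client_id
        self.upstream = upstream
        self.sync_point = 0        # last successfully stored sync point
        self.data: List[str] = []  # stand-in for the local database
        upstream.register(client_id)

    def synchronize(self) -> int:
        """Run one pull cycle and return the number of changes applied."""
        changes = self.upstream.fetch(self.client_id, self.sync_point)
        for sync_point, payload in changes:
            self.data.append(payload)
            self.sync_point = sync_point
        self.upstream.acknowledge(self.client_id, self.sync_point)
        return len(changes)


upstream = Upstream()
cache = LocalCache("karlsruhe", upstream)
upstream.write("change A")
upstream.write("change B")
print(cache.synchronize(), cache.sync_point)  # 2 2
```

The acknowledgement is sent only after the changes are stored locally, so a cache that crashes mid-sync simply asks again from its old sync point.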

Possible optimizations

The local caches can be initially seeded with the latest planet file. The seeding would not be needed later on.

The client can notify its upstream of the kind of data it is interested in (bounding boxes, tag combinations like "complete railroad map", "complete water borders").

The server can stop preparing update data if a local cache server has not asked for its updates for a configurable amount of time (one day? one week?) and resume only when the local cache reappears.
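The last two optimizations could be combined in a per-cache subscription record on the upstream server. The sketch below assumes a single bounding box per cache and a one-week idle limit. `Subscription`, `should_queue` and `ONE_WEEK` are illustrative names, and tag-based interests are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

ONE_WEEK = 7 * 24 * 3600.0  # assumed idle limit, in seconds


@dataclass
class Subscription:
    bbox: Optional[BBox]  # None means "interested in everything"
    last_seen: float      # time of the cache's last request, in seconds

    def wants(self, lon: float, lat: float) -> bool:
        """True if the point lies inside the area the cache asked for."""
        if self.bbox is None:
            return True
        min_lon, min_lat, max_lon, max_lat = self.bbox
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    def is_active(self, now: float, max_idle: float) -> bool:
        """True if the cache has asked for updates recently enough."""
        return now - self.last_seen <= max_idle


def should_queue(sub: Subscription, lon: float, lat: float,
                 now: float, max_idle: float = ONE_WEEK) -> bool:
    """Decide whether a change at (lon, lat) is prepared for this cache."""
    return sub.is_active(now, max_idle) and sub.wants(lon, lat)


munich = Subscription(bbox=(11.36, 48.06, 11.72, 48.25), last_seen=0.0)
print(should_queue(munich, 11.57, 48.14, now=3600.0))  # True
print(should_queue(munich, 7.01, 51.45, now=3600.0))   # False
```

A cache that was skipped for being idle would have missed changes in the meantime, so it would need a fresh seed or a full re-sync when it comes back.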

Open points

Multiple bounding boxes (my local cache is interested in the data of Munich, Baghdad and Turkey)

Chunking of update data? What if my local server wants to re-sync just after the complete TIGER data was imported?

The features / bounding boxes a client / cache is interested in should not be re-sent with each request. Create a separate call for that.

What if a client changes the features it is interested in later? A cache that collected railroad data might ask for airports later. How do we force a re-sync? Re-seed with the current planet data?

How does a client discover the cache it is supposed to read from? GeoIP?

Push or pull: Steve prefers pull.

If the socket is already open, send additional data without being asked. New request type for this?

The client can request a maximum amount of data it wants to receive.

The server / client has to be set up by an administrator so Joe Random User can't swamp the main server.
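One possible answer to the chunking question and the "maximum amount of data" point is to let the client cap each response and keep asking until it has caught up. The sketch below assumes integer sync points and a cap counted in changes rather than bytes. `next_chunk` is a hypothetical helper, not an agreed request type.

```python
from typing import List, Tuple

ChangeRecord = Tuple[int, str]  # (sync point, payload)


def next_chunk(log: List[ChangeRecord], last_sync_point: int,
               max_changes: int) -> Tuple[List[ChangeRecord], bool]:
    """Return at most max_changes newer changes and whether more remain."""
    if max_changes < 1:
        raise ValueError("max_changes must be at least 1")
    newer = [c for c in log if c[0] > last_sync_point]
    chunk = newer[:max_changes]
    return chunk, len(newer) > len(chunk)


# A client catching up after a large import, two changes per request.
log = [(i, f"change {i}") for i in range(1, 6)]
sync_point, rounds, more = 0, 0, True
while more:
    chunk, more = next_chunk(log, sync_point, max_changes=2)
    if chunk:
        sync_point = chunk[-1][0]  # store progress after each chunk
    rounds += 1
print(rounds, sync_point)  # 3 5
```

Since the sync point is stored after every chunk, an interrupted catch-up resumes where it stopped instead of starting over.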
