Essen Developers Workshop/Data Replication
Data replication and distribution
Idea
The idea is that write requests go to a central database while read requests are satisfied from a separate, hierarchical set of local cache servers.
                 central server
                       |
      +-------+--------+--------+
      |       |        |        |
     ics     ics    germany     uk
      |              cache    cache
 +----+----+----+      |
 |         |    |  karlsruhe
other    local local cache
customized server server
clients
- central server: the OSM server we are using now
- ics: intermediary cache server; there can be any number of these. I expect them to be complete mirrors
- local server: a server run by any interested person who needs fast read access and is willing to set up a caching server (e.g., Osmarender developers, editor developers, ...)
- possible other clients: Dirty Tile marker, RSS feed, mapnik converter, ...
Write requests
- The central server increments a counter on every change to the data.
- This counter is referred to as the "sync point" from now on. The sync point can be a timestamp, a steadily increasing counter, or whatever (a sketch follows this list).
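A minimal sketch of the write side, assuming an in-memory change log; the names (CentralServer, apply_write) are illustrative, not a decided API:

 import threading

 class CentralServer:
     def __init__(self):
         self._lock = threading.Lock()
         self._sync_point = 0    # could just as well be a timestamp
         self._changes = []      # (sync point, change) pairs

     def apply_write(self, change):
         # every change to the data bumps the sync point
         with self._lock:
             self._sync_point += 1
             self._changes.append((self._sync_point, change))
             return self._sync_point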
Read requests
- A client (JOSM, tiles@home, mapnik, whatever) makes a read request to its caching server.
- The read request should be something like the If-Modified-Since header of HTTP.
- The caching server sends its last known sync point to its upstream server.
- The upstream server sends a block of data (quite possibly in XML)
- The local server persists those changes to its own local database and stores the successful "sync point" there.
- The local server notifies the upstream of the successful synchronization so that the upstream can purge the data stored up to that point for that client (the whole exchange is sketched below).
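A minimal sketch of one synchronization round as described above, with in-memory stand-ins for the upstream and the cache. The names (Upstream, Cache, changes_since, ack) are assumptions; the real exchange would presumably run over HTTP with XML payloads:

 class Upstream:
     def __init__(self):
         self.sync_point = 0
         self.queue = {}    # cache id -> [(sync point, change)] awaiting delivery

     def changes_since(self, cache_id, sync_point):
         # like HTTP's If-Modified-Since: return everything newer
         pending = [(sp, c) for sp, c in self.queue.get(cache_id, [])
                    if sp > sync_point]
         return self.sync_point, pending

     def ack(self, cache_id, sync_point):
         # the cache confirmed this sync point, so the data stored up
         # to that point for this client can be purged
         self.queue[cache_id] = [(sp, c) for sp, c in self.queue.get(cache_id, [])
                                 if sp > sync_point]

 class Cache:
     def __init__(self, cache_id, upstream):
         self.id = cache_id
         self.upstream = upstream
         self.sync_point = 0    # last known sync point
         self.db = []           # stands in for the local database

     def sync(self):
         new_point, changes = self.upstream.changes_since(self.id, self.sync_point)
         for sp, change in changes:
             self.db.append(change)      # persist locally
         self.sync_point = new_point     # store the successful sync point
         self.upstream.ack(self.id, self.sync_point)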
Possible optimizations
- The local caches can initially be seeded with the latest planet file; after that, seeding is no longer needed.
- The client can notify its upstream of the kind of data it is interested in (bounding boxes, tag combinations like "complete railroad map", "complete water borders"); see the filter sketch after this list.
- The server can stop preparing update data if a local cache server has not asked for its updates for a configurable amount of time (one day? one week?) and restart only when the local cache reappears.
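A sketch of what such an interest declaration could look like. The filter structure (bounding boxes plus tag patterns) and the tag values used here are assumptions, not a decided format:

 # illustrative interest declaration for one local cache
 interest = {
     "bboxes": [
         (8.3, 48.9, 8.5, 49.1),    # min lon, min lat, max lon, max lat
     ],
     "tags": [
         ("railway", "*"),          # "complete railroad map"
         ("natural", "water"),      # "complete water borders", roughly
     ],
 }

 def wanted(element, interest):
     """Decide whether an element should be forwarded to this cache."""
     in_box = any(x1 <= element["lon"] <= x2 and y1 <= element["lat"] <= y2
                  for x1, y1, x2, y2 in interest["bboxes"])
     has_tag = any(k in element["tags"] and v in ("*", element["tags"][k])
                   for k, v in interest["tags"])
     return in_box or has_tag

 node = {"lon": 8.4, "lat": 49.0, "tags": {"amenity": "pub"}}
 assert wanted(node, interest)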
Open points
- Multiple bounding boxes (my local cache is interested in the data of Munich, Baghdad and Turkey)
- Chunking of update data? What if my local server wants to re-sync just after the complete TIGER data set was imported?
- The features / bounding boxes a client / cache is interested in should not be re-sent with each request. Create a separate call for that.
- What if a client changes the features it is interested in later? A cache that collected railroad data might ask for airports later. How do we force a re-sync? Re-seed with the current planet data?
- How does a client discover the cache it is supposed to read from? GeoIP?
- Push or pull? Steve prefers pull
- If the socket is already open, send additional data without being asked. New request type for this?
- The client can request a maximum amount of data it wants to receive (a chunked-pull sketch follows this list)
- The server / client has to be set up by an administrator so Joe Random User can't swamp the main server
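One way the maximum-amount request could combine with chunking, reusing the hypothetical Cache and Upstream objects from the earlier sketch and assuming changes_since() grows a limit parameter that returns only the pending changes, oldest first. After a huge import like TIGER, the backlog would then arrive in digestible pieces:

 def catch_up(cache, max_changes=10000):
     """Pull updates in chunks of at most max_changes until caught up."""
     while True:
         # assumed variant of changes_since() with a size limit
         changes = cache.upstream.changes_since(cache.id, cache.sync_point,
                                                limit=max_changes)
         if not changes:
             break                          # fully synced
         for sp, change in changes:
             cache.db.append(change)        # persist locally
         cache.sync_point = changes[-1][0]  # last applied sync point
         cache.upstream.ack(cache.id, cache.sync_point)
         if len(changes) < max_changes:
             break                          # backlog drained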