User talk:MaZderMind/Reading OSM History dumps
Wow, very interesting! Some idle thoughts.
A classic temporal database (we don't yet go bi-temporal) physical schema would look pretty much like the existing history dump, with the probable addition of another datestamp column. Physically its way easier to query & manage performance with a date_start and date_end values rather than relying on version numbers and window queries. So I would suggest add a date_end column containing an 'infinite' date in the future for the current version & the timestamp (minus 1 ms) of the subsequent version for older versions. Obviously point geometries can still be stored for nodes.
For ways & relations the same thing occurs, but the primary key must still incorporate version (just as it does on the API schema).
What does this give us: a) very simple extension of the schema; b) easy to query status of history for time N (all tables in query need additional predicate of BETWEEN date_start AND date_end; c) should be relatively easy to populate & index. What does it not give: a) all valid intermediate geometries for ways etc; b) immediate access to way geometries; c) easily constrained bbox queries. It is possible to build queries which return all potential geometries for a given time period from this type of schema (and indeed from bi-temporal & higher order temporalities), but of course the SQL is quite involved & its hard to do bbox access). Therefore it might be that one adds additional tables for way (& relation, where applicable) geometries, which are populated in some post-processing step.
Another possibility would be just to add a box2 geometry column to way & relation versions. This is quite cheap in PostGIS (about 16 bytes /row) and would avoid having to assemble actual geometries until required by user extracts etc. I'd hope it might be quite cheap computationally too.
These are rather incoherent, but I'm stretching my mind back to implementing bi-temporal (non-geometrical) schema & summary denormalised tables from those. IIRC there are various algorithms related to graphics which are of some use if you replace x & y axes with temporal ones.
An ability to have a diff of two snapshots of OSM would also be useful (e.g., I might have a year's worth of Geofabrik downloads) for many users without storing all intermediate histories.
Hope you dont mind my rambling. -- SK53 13:35, 12 April 2011 (BST)
- Hi SK53 - it is an attractive idea to have a simple extension to the existing schema. However, I think it is important to have that immediate access to all - also the intermediary - geometries of the ways (and relations...?). Only with that in place can you quickly answer questions like 'Give me all features that were in box X on date Y'. If you're interested, have a look at History_API_and_Database where we're also talking about a hack weekend to take this idea some steps further. -- Martijn van Exel 14:10, 12 April 2011 (BST)
- Hi Martin - yes I appreciate that: I think it might be more efficient to do it at the database end rather than trying to pre-process before adding to the database. The point here is that it is possible to still extract all geometries in single SQL queries from this type of schema: I don't know if it's efficient enough, but that can be remedied by adding a small number of additional tables (rather than trying to squeeze all the intermediate geometries into the existing schema). Think of the intermediate geometries as held in something like materialised views. It would be nice to have a decent sized bbox extract of the planet to play with (say medium city sized). I'd certainly be interested in contributing. -- SK53 15:01, 12 April 2011 (BST)
Berlin History Dump
Thanks for doing this: its really useful to have something concrete to play with.
I've extracted this a couple of ways and some versions of ways do not seem to be present: specifically w3996955 b j should have 16 versions, and I only seem to get 1-9 and 12. This is the first way in the data. Am I missing something obvious? SK53 15:35, 18 May 2011 (BST)