Original Changesets and Reverts Proposal 2008

From OpenStreetMap Wiki
Jump to navigation Jump to search

This page contains a proposal (originally by User:Frederik Ramm) and ideas for supporting changesets and reverts in OSM. Both topics are closely related.

The changeset concept was accepted and has been developed as the main new feature of API v0.6. See that page for details of the exact implementation we've gone for.

Changesets

A changeset is a logical unit that encompasses a series of individual edits by the same user with the same piece of software. Think of it as "an editing session".

Modifications to the database schema

A new database table is created for changesets, comprising

 changeset_id integer
 user_id integer
 timestamp integer
 client text
 comment text

where "client" describes the client software used, and "comment" is an (optional) commit comment.

Unlike ways, nodes, and relations, there will be no "history" table for changesets.

The user and timestamp fields are dropped from all way/node/relation tables that currently have them, and are replaced by one field referring to a changeset (where user and timestamp data can then be read). This saves space at the price of losing a little precision in the timestamp because you only have the changeset timestamp and not individual timestamps for edits.

Also, the "created_by" tag is deprecated; clients are asked to put their software name and version number in the client field of the changeset record instead.

At the risk of over-complicating things, I wonder whether we should have the ability to migrate other 'editor-use-only' tags into changesets. In particular, Potlatch needs to have an 'editor_background' tag (and arguably JOSM's YWMS plugin should, too), showing what background was in use at the time - e.g. Yahoo, Landsat, NPE. --Richard

Optional additions to changesets might include an affected bounding box (so you can easily filter out changes for an area) and a marker whether the changeset is still "open". The affected bounding box would probably be updated by a daemon which finds closed changesets that don't have a bounding box and then computes it. Another option is forcing the client to specify the bounding box when it opens the changeset, and later checking if all changes uploaded are really within that box. That would give us one central and convenient place (the open changeset call) to later implement checks whether or not the changeset infringes on a "protected area" or even if that particular user may have been banned from making changes in an area or so.

All tables containing changeset references must have proper indexes on them.

Migration of existing data

For existing data, changesets have to be synthesized to make them compatible with the new schema. We cannot support a mix of both, but don't want to lose user and timestamp data.

  • The most basic way of preserving existing data would be creating one synthesized changeset for every object we currently have (resulting in roughly 300 million changesets). Because we can remove the "created_by" tags for these objects, the overall database size would not change a big deal.
  • A smarter approach to converting existing data would mean trying to group existing objects by time window and user id (i.e. everything changed by user X in hour Y of day Z) and create only one synthesized changeset for that. In this case, the database is likely to become smaller by the migration.
Because Potlatch already groups 'putway'-type edits by timestamp, you could easily create a single changeset where: one or more ways are changed by user N at timestamp T, with created_by=Potlatch .*, and 0 or more nodes are changed by user N at timestamp T. --Richard
  • Other ways could be discussed.

Modifications to the API

In the new changeset world, the API would still accept the usual create/update/delete requests for objects but they would have to be augmented by a valid changeset ID.

A new call would be created that opens a new changeset (takes parameters: client and comment) and returns a changeset ID. Subsequent modifications to OSM objects would then specify that changeset ID and be recorded as referring to it.

  • Implementation/security note - when accepting an upload referring to a change ID, we must check wheter that changeset really exists and belongs to the user currently uploading data. Also we might want to "close" changesets after say an hour or so, so that it becomes impossible to add something to a changeset one day later. We might want to make sure that one user can only have one "open" changeset at a time, dunno. See Cryptogrphic Tokens, below.

The API will also have calls to find and inspect changesets, with or without referenced objects.

Migration of existing API clients

In order to provide a smooth migration path for clients, the API could, for a limited time, support creation of changesets on-the-fly, either creating one new changeset for every change uploaded "old-style", or even checking if there's an open changeset by that user and auto-adding changes to it (and create one if none exists).

We would however excpect clients to migrate quickly because explicitly specifying changesets also gives users a better chance of documenting their work.

Outlook: transcations

I don't think we will have them initially, but a proper changeset handling mechanism could actually also implement transactions, i.e. if things break down before the changeset is closed properly, the whole changeset is thrown away.

Clarification: Concurrent Edits

Some people have asked what would happen if you have two changesets concurrently open and each of these changes the same object.

This would be a problem if we had real transactions, but you must remember that for now, changesets are really only a rather informational grouping of edits. The decisive factor about an individual object is still its version number - the object with the highest version number is "current", regardless of the change set in which it was modified. So a scenario like

  1. . user a opens changeset 1
  2. . user b opens changeset 2
  3. . user b uploads change for node x, bumping x to version 91
  4. . user b closes changeset 2
  5. . user a uploads another change for node x, bumping x to version 92
  6. . user a closes changeset 1

is not really a problem. Well, it IS a problem because a overwrites a change made by b, but that problem is completely untouched by changesets. The result of the above is that node x is current in version 92 and this version has a timestamp that is slightly older than that of the historic version 91; this is a bit quirky but not a problem for anyone.

Changeset 2 could not be reverted cleanly because the node has changed after changeset 2 was applied. Changeset 1 could be reverted cleanly and would revert the node to... oops, I guess there's a little problem here after all ;-) I'd have said it would revert to the previous version but that would be wrong. This would perhaps need some additional thought.

Cryptographic Tokens

There is a little security or performance problem. If someone opens a changeset and subsequently records a number of edits against that, we would have to verify, on each update, whether the user is really the same user who also opened the changeset (and maybe verify whether the changeset exists at all). This would mean an extra read request on the changeset table for each write request sent by the user.

There is however a way around this. When opening the changeset, the API could generate a string consisting of the changeset id, the user id, and a secret string, then create an MD5 sum of the whole and send this to the user. For subsequent write requests against this transaction, the user would not specify user name and password, but instead user id, changeset id, and the MD5 hash. The API would take the specified user id and changeset id, append its secret string, build the MD5 sum and compare it to the one given in the request. If they match, then the request is genuine and can be processed. If not, the request is rejected. This idea is viable if the computation of a MD5 sum is cheaper than a database lookup.

Bounding Boxes

It is very important to be able to quickly access the bounding box affected by a change group. This will allow us to generate reports like "stuff that has changed in your neighbourhood recently". There are various ideas how this could be achieved:

  1. compute and store bounding boxes in an external service. The external service would have to have access to the latest change groups, then download all affected items and compute bounding boxes.
  2. compute bounding boxes on the central server, some time after a change group is closed or timed out; update the database with the results.
  3. have the client compute the bounding box and specify it together with the change comment when opening the change group. this would be ideal (uses least computing power on our part) but the client would have to be trusted not to "cloak" edits by specifying completely wrong bounding boxes. A check on our part whether an uploaded change is really inside the inditially specified bounding box would be too expensive.

But a possible extension to 3. is that we could use the aforementioned crypto tokens to also store the bounding box; at least node changes could then trivially be checked against the bounding box and rejected if they don't lie within.

Can we do 2 via the standard (very fast) quadtiles stuff? --Richard

Live Editors

An editor that does not have the concept of a "session" will perhaps not be able to specify a change comment when the change set is opened. (Think Potlatch: The user starts to edit and cannot be bothered to specify, in advance, what he plans to do!) Since an editor like Potlatch can also be terminated at any time without the user having to hit "save" or so, we cannot force the user to enter a change comment at the end either. This is a real problem because change comments are likely to be very valuable for humans looking through changes. Maybe an editor like Potlatch should have something like a "notepad" where the user can make notes while editing, and whenever a node is made, the current changeset description gets updated or so.

Possible as an advanced-user option but an unnecessary UI hurdle in general, I think. Generally a Potlatch commit will be sufficiently atomic that the comment ("realigned a mile of the A41 to follow GPS track more closely") is unlikely to give additional information other than that implicit in the edit. The trick, perhaps, is to provide tools that let you browse through previous versions so you can see what's changed - maybe an extension of Potlatch's existing history ('H') functionality. --Richard

Do we need "before/after" information?

The way I proposed stuff here, we will always know that a certain changeset resulted in a certain version of the object in our database. I.e. if you would download all details for a changeset, you would get the information "as part of this changeset, way XYZ was changed into the following: ..." - you would not recieve information about the state the way was in before. If you wanted to create a "before/after" view for a changeset - which we will likely want to have at some point, for people to quickly notice changes! - then you would have to look at the previous version of the changed object from history.

In the aforementioned overwrite scenario (user A downloads version 1 of way, user B uploads version 2, user A makes change and uploads version 3), user A would be under the impression that he changed version 1 into version 3, and this might reflect in his change comments ("moved bus stop by one block" or so). If you later retrieve and view the change, you would see that user A in fact changed version 2 into version 3, which may be someting completely different (if B has already moved the bus stop by one block for v2, then A's real action might only have been a tiny move).

I don't know if we can ignore the problem. The solution to this would be that editors, when downloading objects for editing, would have to receive the current version number (difficult as it would require history table access) and on upload the API would have to save the specified number, or it could also compare the specified number with the current version and reject the change if they don't match (opportunistic locking).

I'm leaning towards ignoring the problem.

Non-bounding box "subscriptions"

I've already written about bounding box "subscriptions" or "stuff that changed in your neighbourhood". What if somebody wants to e.g. put all German Youth Hostels on his personal watchlist because he's the authority in that field? I guess this should be possible somehow...

Problem with these "watchlists" - be they geographic or thematic - is that they will usually require looking at the old and new state of things. If someone thinks he's the authority on youth hostels and someone changes "tourism=youth_hostel" to "tourism=hotel" on a node, then that person will want to know!

Reverts

The revert mechanism is based on changesets. You get API support if you want to revert a full changeset. If you want finer granularity, e.g. revert only a few changes inside a changeset, then this has to be implemented in the editor using the normal API calls (i.e. retrieve old versions of individual objects and write them back).

The revert mechanism never truly reverts a change in the sense that it is made undone without a trace. Instead, the revert mechanism will act as if someone would manually change everything back to the state it had before. If a changeset consists of the following: "delete node #1, create node #2, modify way #a to include node #2", then reverting that would lead to the un-deletion of node #1, the deletion of node #2, and a second modification of way #a that puts the way back to the state it had before.

As such, the revert mechanism doesn't do anything magic or anything you could not do through individual API calls, it just makes things easier for you because you can request the reversion of a whole changeset. Everything that is done during a revert makes up a new changeset, so the revert itself can be reverted later.

Clean Reverts

A clean revert can be executed when none of the objects in the change set have been changed in the mean time. In this case the revert will not have any side effects.

(A possible side effect of a clean revert is of course in interaction with other data in the vicinity. Say you undelete something, but meanwhile someone else has mapped the same thing; it may be a clean revert but it will lead to the thing being there twice.)

Dirty Reverts

A dirty revert happens if some of the data to be reverted is changed or used in the mean time.

Note: in this discussion, "is changed" means that the object is NOW in a different state than it was after the changeset had been applied. We must not confuse this with "has been changed"; the object may have been changed and then changed back again, (i.e. it has (had?) been changed but is not changed!) in which case a "clean" revert is possible!

This is the real difficult thing. We have the following options:

  • refuse dirty reverts altogether, instead giving a detailed report on why the changeset cannot be reverted, and leave it to the user and/or the editor to bring each object into the state from where we can then undo the changeset in a clean fashion. A dirty revert would then probably lead to two new changesets, one comprising some individual edits that are meant to get certain objects back to the state from where an undo can be applied, and a second that is the mirror image of the original changeset.
  • when doing a dirty revert, only revert those individual objects that are still clean.
  • when doing a dirty revert, force-revert everything, thereby losing changes that have been applied to objects meanwhile; if the dirty revert breaks ways or relations by removing (some of) their members, alter these objects as required. This could be combined with editor support whereby the editor, before issuing such a request, would alert the user to the consequences.
  • a complex mechanism could be devised to try "merging" things, e.g. if reverting a change would mean that a node is moved back to where it was earlier, and meanwhile that node's tags have been changed, then this is not a real conflict and the move could be made while keeping the changed tags. However this is likely to become wildly complicated.
This is already what Potlatch's history ('H') functionality does. The proposal on this page is additional to that, IMO, so there is probably no need to duplicate it. However, the Potlatch model may be worth looking at if you do decide to implement something here. --Richard
  • possibly more?

Option 1 (refuse dirty) would be simplest but I feel that at least for emergencies one might want to have an "revert this --force" option.

API support

It is thinkable to only have changeset support in the API and do everything else client-side. However this might cause high loads on the API and wreak havoc if not implemented properly on the editor side and/or if transmissions break down mid-way and so on. I have a feeling that the revert stuff is so close to the core that it should be in the API itself.

The API would then basically support a call that says "revert changeset xy", and maybe another call that just analyses the changeset and returns whether or not it would be cleanly revertable.

See also