User:SK53/OS OpenData
Thoughts, ruminations, notes to self (and anyone else if interested).
Tools
A minimum toolset (i.e., w/o programming ones own) to manipulate OS data under MS windows is as follows:
- GDAL/OGR libraries, these are required for other tools and hugely useful in their own right. I have found the OSGeoW4 package is the best way to run these.
- ogr2ogr. Command line conversion between formats and projections. Included in OSGeoW4.
- ogr2osm. This is the best available python conversion program from ESRI shapefiles to osm formats. The most important thing with ogr2osm is that it automatically shares ways and nodes between objects eliminating the dup-node problem which afflicts the other widely used shapefile converters (shp2osm, ...). It does have limitations, specifically on attribute conversion, mapping attributes to OSM tags and long ways. I have not got attribute conversion working on windows, but this is fairly unimportant with the OSGB VectorMap District shapefiles. I use a slightly modified version which restricts ways to 500 nodes.
- Quantum GIS. Excellent tool for visualising data, and fairly straightforward manipulations of the data.
- cygwin. Essential for grep, awk etc.
- various GNU tools. for instance wget.
- Kosmos. Very useful for quick rendering of data-sets in OSM format, prior to upload. Enables checks that data is sensibly projected etc. Example here [1]
- Postgres with PostGIS option. Haven't used this in earnest yet, but plan to match OS Locator street locations to OSM streets at some stage.
Users in other environments (Linux, MacOS) etc will use pretty much the types of tools. The list above is specifically about working within the constraints imposed by Windows.
OS Locator
This is a large CSV file (colon separated) without headers. I have found it most useful to add a header line (essential if you want the Lat/Long conversion to work in qGIS etc).
Getting bits of the file
Subsections can be selected by searching for council names:
grep "NOTTINGHAM"
or in a more sophisticated way:
awk '{FS=":"} /$11 == "NOTTINGHAM"/}
Conversion to WGS84
A different process to that below is described here - OS Locator#Create GPX file of OSL data. This uses PHP which I am not familiar with so I tried something else. (See Discussion page.)
- Add a header to the file
- Open new project in qGIS
- Set project projection to the OSGB National Grid + fudge factors (technically Helmert Transformations, see the OSGB Guide.
+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +a=6377563 +b=6356256.161 +towgs84=446.448,-125.157,542.06,0.1502,0.247,0.8421,-20.4894 +units=m +no_defs
- Import OS Locator CSV file into QGIS using delimited text , identifing which columns to use as lat & long (2nd & 3rd denoting road centroid are probably most useful).
- Save shapefile (now in OSGB co-ordinates)
- Use Vector|Reproject to convert to WGS84.
Vector Data ShapeFiles
Projections
Editing the .prj file to add these (see the example on the wikipedia link, but the parameters given there are incorrect), makes a big difference to the .OSM output. Note these parameters still only give an accuracy in the order of 5m, but they do seem to allow GDAL-based utilities to be used.
Here's the format used, the addition being the TOWGS84 section:
PROJCS["British National Grid (ORD SURV GB)", GEOGCS["unnamed", DATUM["D_OSGB_1936", SPHEROID["Airy - 1848",6377563,299.319997677743], TOWGS84[446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894]], PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433]], PROJECTION["Transverse_Mercator"], PARAMETER["latitude_of_origin",49], PARAMETER["central_meridian",-2], P ARAMETER["scale_factor",0.9996012717], PARAMETER["false_easting",400000], PARAMETER["false_northing",-100000], UNIT["METER",1]]
Its useful to have a script which copies a template file and overwrites the files delivered in the OSGB download.
Things to Watch
- Custom Feature. can be almost anything. Not very useful.
- Areas not lines. Many types of data, but particularly waterways are delivered as areas even for small features such as streams. We really want a line for this type of data in OSM. I haven't found any general tools to transform selected linear waterways to mid-lines, although in general terms, Delauny triangulation clipped to the original polygon seems a reasonable starting point.
- False boundaries. The 10km tiles seem to be built from 1km square tiles as features which cross these boundaries are split at such boundaries (see e.g. [2]). At present I am hand editing the OSM file to merge these polygons, but this is pretty tedious. The only general solution I can see is to push the data into Postgres and identify lines where either northings or eastings do not change and mark these as suspect.
- Features not completely co-incident. Waterways and Woods may appear to be coincident, but are not converted by ogr2osm into shared ways. Detailed inspection shows that nodes are often a tiny distance apart (<<1 m). Probably an artefact of combining layers, or carelessness in original data entry.
Using OS Locator to name roads in VMD
I've started playing around with this in postGIS, so here are my first thoughts /notes.
I've done everything in EPSG 27700 (the uncorrected OSGB36 projection). Using a CSV file with header from the OS Locator data, and using QGIS to upload a single 10km tile of Vector Map District (SK53 of course)with the Sit plugin. The PostGIS populate_geometry _columns function needs to be called to ensure all the data is accessible to PostGIS manipulation (other techniques such as manual population of the geometry_columns table don't work for me).
My intial idea was just to attach a name based on the centroid. To do this I found all lines in the VMD data which were in the bounding box of the locator road and then found that closest to the centroid:
CREATE TABLE blabla AS SELECT a.rowid , a.osgb_name , a.bbox_min_e , a.bbox_max_e , a.bbox_min_n , a.bbox_max_n , a.osgb_bbox_geom , b.gid , b."FEATDESC" , b.osgb_line_geom , a.osgb_point_geom FROM SK53_osgb_Locator a, "SK53_Road_Line" b WHERE b.osgb_line_geom && a.osgb_bbox_geom
The various non-geometry columns are carried over to make it easier to investigate the data. Next we determine the distance of the centroid (osgb_point_geom) to the VMD road line (osgb_line_geom):
CREATE TABLE SK53_roads_distance AS SELECT a1.rowid , a1.osgb_name , a1.gid , a1."FEATDESC" , ST_Distance(a1.osgb_point_geom, a1.osgb_line_geom) as distance FROM blabla a1 ; CREATE TABLE new_road_line AS select a.rowid , a.osgb_name , d.* FROM SK53_roads_distance c, (SELECT a1.rowid, a1.osgb_name, min(a1.distance) as distance FROM SK53_roads_distance a1 GROUP BY a.rowid, a.osgb_name) a, "Road_Line" d WHERE c.distance= a.distance AND c.gid = d.gid;
This crude code works but two main problems emerge:
- Far too few line segments are successfully matched
- Short stretches of long roads may get assigned the wrong name, particularly if a road is present in Locator but not present in the vector data.
The basic principle being established the following approach seems more productive:
- Match all roads as above.
- Eliminate all roads already named in the VectorMap layer. This is likely to remove quite a few of the problems in case 2 above.
- Find all road segments within a suitably buffered bbox from Locator (e.g.,
ST_Within(a.osgb_line_geom, ST_Buffer(a.osgb_bbox_geom,2))
. - Match up all unambigously assigned segments (count(distinct gid) = 1.
- Eliminate all successfully named segments
- Iterate until no updates occur.
The initial matching is reasonably fast for this size dataset, and seems to substantially improve the ST_WITHIN step (probably because I have no idea how to tune/index for this type of query).
More to follow...