Python/Script for finding nearby nodes
Developed for the New Zealand bulk import of the LINZ dataset.
- Do not expect this script to be efficient or robust.
- It has hardly been tested at all.
Post-processing
- TODO: the script needs a second pass over the hits to suppress/combine/report only once for each cluster of duplicate hits, and/or to reduce parallel ways to reporting just the way IDs, not every node (a minimal clustering sketch is given at the end of this Post-processing section).
- Workarounds:
Save the output to a file:
  $ ./osm_find_nearby_matches.py > matches.txt
then parse that for coordinates and node IDs:
  $ grep lat= matches.txt | cut -f2- -d= | sed -e 's/&lon=/|/' -e 's/&zoom=.*=/|/' > matches.csv
and import that .csv file into a mapping program like QGIS to review for any clusters.
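The grep/cut/sed step can equally be done in Python; a minimal sketch, assuming the "Map:" URL lines the script writes and the same matches.txt/matches.csv file names as above (output is in the pipeline's lat|lon|node order):

import re

# pull lat, lon and node ID out of the "Map:" URLs written by the script
pattern = re.compile(r'lat=([-0-9.]+)&lon=([-0-9.]+)&zoom=\d+&node=(\d+)')
with open('matches.txt') as inf, open('matches.csv', 'w') as outf:
    for line in inf:
        m = pattern.search(line)
        if m:
            outf.write('|'.join(m.groups()) + '\n')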
or
Import the auto-generated .csv file into QGIS:
- Start QGIS
- Go to Plugins → Manage plugins
- Ensure that "Add delimited text layer" is selected
- Plugins → Delimited Text → Add delimited text layer
- Select the matching_nodes.csv file with the [...] button
- Set "|" as the delimiter string
- Click the [Parse] button at the bottom
- Select "#longitude" for the x-field and "latitude" for the y-field
- Click [OK]
- Explore vector point buffering to make discrete clumps from the points:
  o This may best happen after the points have been reprojected into a Cartesian system such as NZTM.
  o Vector → Geoprocessing Tools → Buffer(s)
  o A 500m buffer distance is about right; select "Dissolve buffer results"
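The same delimited-text import can be scripted from the QGIS Python console; a minimal sketch, assuming a modern QGIS 3.x API and an example file path (the "|" delimiter and the "#" in the field name are percent-encoded in the URI to be safe; the exact encoding may need adjusting):

from qgis.core import QgsProject, QgsVectorLayer

# delimited-text provider URI: %7C is "|", %23 is the "#" in "#longitude"
uri = ('file:///path/to/matching_nodes.csv'
       '?delimiter=%7C&xField=%23longitude&yField=latitude')
layer = QgsVectorLayer(uri, 'matching_nodes', 'delimitedtext')
if layer.isValid():
    QgsProject.instance().addMapLayer(layer)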
or
Import the auto-generated .csv file into GRASS GIS:
- Create a lat/lon WGS84 location (EPSG:4326), and a new mapset
- Launch GRASS in this location and import with:
  GRASS> v.in.ascii in=matching_nodes.csv out=matching_nodes cat=3
- Explore the cluster analysis modules, etc.
- To do a lot of analyses you'll want to reproject into a Cartesian map projection. Create an NZTM location (EPSG:2193), then pull in the points map from the lat/lon location with the following command:
  GRASS> v.proj in=matching_nodes location=ll_wgs84
or you can skip importing it into the lat/lon location first by reprojecting it on the fly into the NZTM location:
  GRASS> m.proj -ig in=matching_nodes.csv | v.in.ascii out=matching_nodes cat=3
One way is to buffer the nodes by 200m, then extract the lat/lon centroid of each of the merged buffer areas as the list to investigate:
  GRASS> v.buffer in=matching_nodes out=matches_200m_buff distance=200
  GRASS> v.out.ascii matches_200m_buff | m.proj -odg | \
      sed -e 's/.00000000$//' | awk -F'|' \
      '{print "http://www.openstreetmap.org/?lat=" $2 "&lon=" $1 "&zoom=16"}'
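The same modules can be driven from Python with GRASS's scripting library; a minimal sketch, assuming it runs inside a GRASS 7+ session (where the in=/out= options above are spelled input=/output=):

import grass.script as gs

# import the pipe-delimited points (category values from column 3),
# then merge any hits closer than 200m into dissolved buffer blobs
gs.run_command('v.in.ascii', input='matching_nodes.csv',
               output='matching_nodes', cat=3)
gs.run_command('v.buffer', input='matching_nodes',
               output='matches_200m_buff', distance=200)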
If vectors aren't your thing you can try the raster clumping modules; see the instructions on this page's Talk page.
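If neither GIS is handy, the cluster pass that the TODO above asks for can be approximated in pure Python; a minimal sketch, assuming the matching_nodes.csv written by the script and an arbitrary 200m merge radius:

import csv
from math import cos, radians

CLUSTER_M = 200.0
deg_lat = CLUSTER_M / (1852. * 60)   # merge radius in degrees of latitude

clusters = []   # each cluster is a list of (lon, lat, node_id) hits
with open('matching_nodes.csv') as f:
    for row in csv.reader(f, delimiter='|'):
        if not row or row[0].startswith('#'):
            continue   # skip blanks and the header line
        lon, lat, node_id = float(row[0]), float(row[1]), row[2]
        deg_lon = deg_lat / abs(cos(radians(lat)))
        for c in clusters:
            # greedy merge: join the first cluster with a nearby member
            if any(abs(lon - x) < deg_lon and abs(lat - y) < deg_lat
                   for x, y, _ in c):
                c.append((lon, lat, node_id))
                break
        else:
            clusters.append([(lon, lat, node_id)])

# report one OSM map URL per cluster, at the cluster's centroid
for c in clusters:
    clon = sum(p[0] for p in c) / len(c)
    clat = sum(p[1] for p in c) / len(c)
    print('http://www.openstreetmap.org/?lat=%.6f&lon=%.6f&zoom=16 (%d hits)'
          % (clat, clon, len(c)))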
TODO
- this script will not deal with </node> on a separate line in the new-comer input file
- this script will not deal with ' as the value quoting char in the new-comer input file
- update to do real XML parsing with the expat library or similar, which would fix both of the above (a minimal sketch is given below)
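A minimal sketch of that last item, using the standard library's (expat-backed) xml.etree.ElementTree in place of the string slicing below; it copes with either quote character and with </node> on its own line. The file name is the script's own placeholder:

import xml.etree.ElementTree as ET

nodes = {}   # node id -> (lon, lat)
for event, elem in ET.iterparse('existing_boatramp_data.osm'):
    if elem.tag == 'node':
        nodes[elem.get('id')] = (float(elem.get('lon')),
                                 float(elem.get('lat')))
        elem.clear()   # keep memory use flat on large files
print('%d nodes read' % len(nodes))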
The script
#!/usr/bin/env python
############################################################################
#
# MODULE: osm_find_nearby_matches.py
# AUTHOR: Hamish Bowman, Dunedin, New Zealand
# PURPOSE: Search a .osm file for nodes within a certain distance of
# nodes in another .osm file. i.e. point out near-duplicates.
# COPYRIGHT: (c) 2010 Hamish Bowman, and the OpenStreetMap Foundation
#
# This program is free software under the GNU General Public
# License (>=v2) and comes with ABSOLUTELY NO WARRANTY.
# See http://www.gnu.org/licenses/gpl2.html for details.
#
#############################################################################
# do not expect this script to be efficient or robust.
# this is hardly tested at all
# TODO:
# ? this will not deal with </node> on a separate line in the new-comer input file
# ? this will not deal with ' as the value quoting char in the new-comer input file
#
# !! Need a second pass over the hits to suppress/combine/only report once
#    for each cluster of duplicate hits, and/or to reduce parallel ways to
#    just reporting the way IDs, not every node.
#    In the meantime, if there are more than 8 hits write out a .csv file
#    with the coords for later analysis.
#
# arbitrarily use 100m as the search threshold.
# export node ID, and URLs to OSM map showing it and OSM db feature webpage
#
# ... run this BEFORE uploading new data ...
#
# Download any existing OSM data for comparison:
# URL="http://osmxapi.hypercube.telascience.org/api/0.6"
# BBOX="157.5,-59.0,179.9,-25.5"
# wget -O existing_data.osm "$URL/*[leisure=slipway][bbox=$BBOX]"
import sys
import os
from math import cos, radians

def main():
    input_new = 'boatramp_cl_with_slipway.osm'
    input_established = 'existing_boatramp_data.osm'
    csv_output = os.path.splitext(input_new)[0] + '_matching_nodes.csv'
    #output = 'boatramp_dupe_summary.txt'
    output = ''

    # threshold measured in meters (2.5cm ~ 1")
    threshold_m = 100.0
    # one degree of latitude is 60 nautical miles of 1852m each
    thresh_lat = threshold_m / (1852. * 60)
    # thresh_lon is calculated per node by cos(node_lat)

    #### set up input files
    infile_new = input_new
    if not os.path.exists(infile_new):
        print("ERROR: Unable to read new input data")
        sys.exit(1)

    infile_old = input_established
    if not os.path.exists(infile_old):
        print("ERROR: Unable to read established input data")
        sys.exit(1)
    inf_old = open(infile_old)
    print("existing OSM data input file=[%s]" % infile_old)

    # read in old file first and build a table of node positions and IDs
    #   node|lon|lat   long|double|double
    # init array
    old_id_lonlat = [[] for i in range(3)]

    while True:
        line = inf_old.readline()
        if not line:
            break
        if 'node id=' not in line:
            continue
        # slice the attribute values out by position, normalizing " to '
        id_i = line.find('id=') + 4
        lat_i = line.find('lat=') + 5
        lon_i = line.find('lon=') + 5
        old_id = line[id_i:].replace('"', "'").split("'")[0]
        old_lat = float(line[lat_i:].replace('"', "'").split("'")[0])
        old_lon = float(line[lon_i:].replace('"', "'").split("'")[0])
        old_id_lonlat[0].append(old_id)
        old_id_lonlat[1].append(old_lon)
        old_id_lonlat[2].append(old_lat)

    inf_old.close()
    #### read in new file and build a table of node positions and IDs which have a pair
    # open new-comer file.osm
    inf_new = open(infile_new)
    print("new-comer input file=[%s]" % infile_new)

    #   new_node|old_node|lon|lat   long|long|double|double
    # init array
    newid_oldid_lonlat = [[] for i in range(4)]

    lines = [l.rstrip('\n') for l in inf_new.readlines()]
    inf_new.close()

    for line in lines:
        if 'node id=' not in line:
            continue
        #elif line.strip() == '</node>':
        #    continue
        id_i = line.find('id=') + 4
        lat_i = line.find('lat=') + 5
        lon_i = line.find('lon=') + 5
        new_id = line[id_i:].replace('"', "'").split("'")[0]
        new_lat = float(line[lat_i:].replace('"', "'").split("'")[0])
        new_lon = float(line[lon_i:].replace('"', "'").split("'")[0])
        # degrees of longitude shrink by cos(latitude), so widen the
        # longitude half of the search box accordingly
        thresh_lon = thresh_lat / abs(cos(radians(new_lat)))
        #print(thresh_lat, thresh_lon, new_id)

        for i in range(len(old_id_lonlat[0])):
            if abs(new_lon - old_id_lonlat[1][i]) < thresh_lon and \
               abs(new_lat - old_id_lonlat[2][i]) < thresh_lat:
                # ?keep?: don't report a node as a duplicate of itself
                if new_id == old_id_lonlat[0][i]:
                    print('skip')
                    continue
                newid_oldid_lonlat[0].append(new_id)
                newid_oldid_lonlat[1].append(old_id_lonlat[0][i])
                newid_oldid_lonlat[2].append(old_id_lonlat[1][i])
                newid_oldid_lonlat[3].append(old_id_lonlat[2][i])
                #print('hit: node %s is %s' % (new_id, old_id_lonlat[0][i]))
    #### with those two tables populated, write output
    # set up output file
    if not output:
        outfile = None
        outf = sys.stdout
    else:
        outfile = output
        outf = open(outfile, 'w')
        print("output file=[%s] (possible duplicates to investigate)" % outfile)

    # loop through match table and display results
    hits = set(newid_oldid_lonlat[1])

    # until the clustering TODO above is done, also write the coords to a
    # .csv file for later analysis whenever there are more than 8 hits
    if len(hits) > 8:
        csv_outfile = csv_output
        csv_outf = open(csv_outfile, 'w')
        print("csv output file=[%s] (clusters of possible duplicates to investigate)" % csv_outfile)
        csv_outf.write('#longitude|latitude|node_id\n')

    if not output:
        outf.write('\n')

    for i in hits:
        matches = [j for j, x in enumerate(newid_oldid_lonlat[1]) if x == i]
        outf.write('Node')
        if len(matches) > 1:
            outf.write('s <')
        else:
            outf.write(' <')
        for node in matches:
            if node != matches[0]:
                outf.write(',')
            outf.write(newid_oldid_lonlat[0][node])
        outf.write('> ')
        if len(matches) > 1:
            outf.write('are')
        else:
            outf.write('is')
        outf.write(' within ' + str(threshold_m) + 'm of node <' + i + '>\n')
        latstr = str(newid_oldid_lonlat[3][newid_oldid_lonlat[1].index(i)])
        lonstr = str(newid_oldid_lonlat[2][newid_oldid_lonlat[1].index(i)])
        outf.write('  Info: http://www.openstreetmap.org/browse/node/' + i + '\n')
        outf.write('  Map:  http://www.openstreetmap.org/' +
                   '?lat=' + latstr + '&lon=' + lonstr + '&zoom=18&node=' +
                   i + '\n\n')
        if len(hits) > 8:
            csv_outf.write(lonstr + '|' + latstr + '|' + i + '\n')

    if outfile is not None:
        outf.close()
    if len(hits) > 8:
        csv_outf.close()


if __name__ == "__main__":
    main()