Script for finding nearby nodes
From OpenStreetMap Wiki
Developed for the New Zealand bulk import of the LINZ dataset.
- do not expect this script to be efficient or robust.
- this is hardly tested at all
Post-processing
- TODO: It needs a second pass over the hits to suppress/combine/only report once for each cluster of duplicate hits, &/or to reduce parallel ways to just reporting the way IDs, not every node.
- work arounds:
save the output to a file: $ ./osm_find_nearby_matches.py > matches.txt then parse that for coordinates and node IDs: $ grep lat= matches.txt | cut -f2- -d= | sed -e 's/&lon=/|/' -e 's/&zoom=.*=/|/' > matches.csv and import that .csv file into a mapping program like QGIS to review for any clusters.
or
Import auto-gen .csv file into QGIS: - start QGIS - go to Plugins → Manage plugins - ensure that "Add delimited text layer" is selected - Plugins → Delimited Text → Add delimited text layer - Select the matching_nodes.csv file with the [...] button - Set "|" as the delimiter string - Click the [Parse] button at the bottom - Select "#longitude" for the x-field, and "latitude" for the y-field - Click [ok] - Explore vector point buffering to make discrete clumps from points: o This may want to happen after the points have been reprojected into a Cartesian system such as NZTM. o Vector → Geoprocessesing Tools → Buffer(s) o 500m buffer distance is about right, select "Dissolve buffer results"
or
Import auto-gen .csv file into GRASS GIS: - Create a lat/lon WGS84 location (EPSG:4326), and a new mapset - Launch GRASS in this location and import with: GRASS> v.in.ascii in=matching_nodes.csv out=matching_nodes cat=3 - Explore the cluster analysis modules, etc.
- To do a lot of analyses you'll want to reproject into a Cartesian map
projection. Create a NZTM location (EPSG:2193), then pull in the points
map from the Lat/Lon location with the following command:
GRASS> v.proj in=matching_nodes location=ll_wgs84
or you can skip importing it into the lat/lon location first by
reprojecting it on the fly into the NZTM location:
GRASS> m.proj -ig in=matching_nodes.csv | v.in.ascii out=matching_nodes cat=3
One way is to buffer the nodes by 200m, then extract the lat/lon centroid
of each one of the merged buffer areas as the list to investigate.
GRASS> v.buffer in=matching_nodes out=matches_200m_buff distance=200
GRASS> v.out.ascii matches_200m_buff | m.proj -odg | \
sed -e 's/.00000000$//' | awk -F'|' \
'{print "http://www.openstreetmap.org/?lat=" $2 \
"&lon=" $1 "&zoom=16"}'
If vectors aren't your thing you can try the raster clumping modules. See the instructions on the Talk side of this page.
TODO
- this script will not deal with </node> on a separate line in the new-comer input file
- this script will not deal with ' as the value quoting char in the new-comer input file
- update to do real XML parsing with the expat library or similar
The script
#!/usr/bin/env python ############################################################################ # # MODULE: osm_find_nearby_matches.py # AUTHOR: Hamish Bowman, Dunedin, New Zealand # PURPOSE: Search a .osm file for nodes within a certain distance of # nodes in another .osm file. i.e. point out near-duplicates. # COPYRIGHT: (c) 2010 Hamish Bowman, and the OpenStreetMap Foundation # # This program is free software under the GNU General Public # License (>=v2) and comes with ABSOLUTELY NO WARRANTY. # See http://www.gnu.org/licenses/gpl2.html for details. # ############################################################################# # do not expect this script to be efficient or robust. # this is hardly tested at all # TODO: # ? this will not deal with </node> on a separate line in the new-comer input file # ? this will not deal with ' as the value quoting char in the new-comer input file # # !! Need a second pass over the hits to supress/combine/only report once # for each cluster of duplicate hits, &/or to reduce parallel ways to just # reporting the way IDs, not every node. # In the mean time, if there are more than 8 hits write out a .csv file # with the coords for later analysis. # # arbitrarily use 100m as the search threshold. # export node ID, and URLs to OSM map showing it and OSM db feature webpage # # ... run this BEFORE uploading new data ... # # Download any existing OSM data for comparison: # URL="http://osmxapi.hypercube.telascience.org/api/0.6" # BBOX="157.5,-59.0,179.9,-25.5" # wget -O existing_data.osm "$URL/*[leisure=slipway][bbox=$BBOX]" import sys import os from math import cos, radians def main(): input_new = 'boatramp_cl_with_slipway.osm' input_established = 'existing_boatramp_data.osm' csv_output= input_new.split('.osm')[-2] + '_matching_nodes.csv' #output = 'boatramp_dupe_summary.txt' output = '' # threshold measured in meters (2.5cm ~ 1") threshold_m = 100.0 thresh_lat = threshold_m / (1852. * 60) #thresh_lon calc'd per node by cos(node_lat) #### set up input files infile_new = input_new if not os.path.exists(infile_new): print("ERROR: Unable to read new input data") sys.exit(1) infile_old = input_established if not os.path.exists(infile_old): print("ERROR: Unable to read established input data") sys.exit(1) inf_old = file(infile_old) print("existing OSM data input file=[%s]" % infile_old) # read in old file first and build a table of node positions and IDs # node|lon|lat long|double|double # init array old_id_lonlat = [[] for i in range(3)] while True: line = inf_old.readline() #.strip() if not line: break if 'node id=' not in line: continue #bits = line.split('"') id_i = line.find('id=') + 4 lat_i = line.find('lat=') + 5 lon_i = line.find('lon=') + 5 old_id = line[id_i:].replace('"',"'").split("'")[0] old_lat = float(line[lat_i:].replace('"',"'").split("'")[0]) old_lon = float(line[lon_i:].replace('"',"'").split("'")[0]) old_id_lonlat[0].append(old_id) old_id_lonlat[1].append(old_lon) old_id_lonlat[2].append(old_lat) inf_old.close() #### read in new file and build a table of node positions and IDs which have a pair # open new-comer file.osm inf_new = file(infile_new) print("new-comer input file=[%s]" % infile_new) # new_node|old_node|lon|lat long|long|double|double # init array newid_oldid_lonlat = [[] for i in range(4)] lines = inf_new.readlines() for i in range(len(lines)): lines[i] = lines[i].rstrip('\n') for line in lines: if 'node id=' not in line: continue #elif line.strip() == '</node>': # continue else: id_i = line.find('id=') + 4 lat_i = line.find('lat=') + 5 lon_i = line.find('lon=') + 5 new_id = line[id_i:].replace('"',"'").split("'")[0] new_lat = float(line[lat_i:].replace('"',"'").split("'")[0]) new_lon = float(line[lon_i:].replace('"',"'").split("'")[0]) thresh_lon = thresh_lat / abs(cos(radians(new_lon))) #print thresh_lat, thresh_lon, new_id for i in range(len(old_id_lonlat[0])): if abs(new_lon - old_id_lonlat[1][i]) < thresh_lon and \ abs(new_lat - old_id_lonlat[2][i]) < thresh_lat: # ?keep?: if new_id == old_id_lonlat[0][i]: print 'skip' continue newid_oldid_lonlat[0].append(new_id) newid_oldid_lonlat[1].append(old_id_lonlat[0][i]) newid_oldid_lonlat[2].append(old_id_lonlat[1][i]) newid_oldid_lonlat[3].append(old_id_lonlat[2][i]) #print 'hit: node %s is %s' % (new_id, old_id_lonlat[0][i]) inf_new.close() ##### with those two tables populated, write output # set up output file if not output: outfile = None outf = sys.stdout else: outfile = output outf = open(outfile, 'w') print("output file=[%s] (possible duplicates to investigate)" % outfile) # loop through match table and display results hits = set(newid_oldid_lonlat[1]) if len(hits) > 8: csv_outfile = csv_output csv_outf = open(csv_outfile, 'w') print("csv output file=[%s] (clusters of possible duplicates to investigate)" % csv_outfile) csv_outf.write('#longitude|latitude|node_id\n') if not output: outf.write('\n') for i in hits: matches = [j for j,x in enumerate(newid_oldid_lonlat[1]) if x == i] outf.write('Node') if len(matches) > 1: outf.write('s <') else: outf.write(' <') for node in matches: if node != matches[0]: outf.write(',') outf.write(newid_oldid_lonlat[0][node]) outf.write('> ') if len(matches) > 1: outf.write('are') else: outf.write('is') outf.write(' within ' + str(threshold_m) + 'm of node <' + i + '>\n') latstr = str(newid_oldid_lonlat[3][newid_oldid_lonlat[1].index(i)]) lonstr = str(newid_oldid_lonlat[2][newid_oldid_lonlat[1].index(i)]) outf.write(' Info: http://www.openstreetmap.org/browse/node/' + i + '\n') outf.write(' Map: http://www.openstreetmap.org/' + \ '?lat=' + latstr + '&lon=' + lonstr + '&zoom=18&node=' \ + i + '\n\n') if len(hits) > 8: csv_outf.write(lonstr + '|' + latstr + '|' + i + '\n') inf_new.close() if outfile is not None: outf.close() if len(hits) > 8: csv_outf.close() if __name__ == "__main__": main()