User:AdamWill/Imports/Vancouver property addresses


Import Plan Outline

Goals

Import City of Vancouver's Open Data street address dataset to provide a near-complete set of street addresses in Vancouver for OSM.

Contact

I am subscribed to the talk-ca and osm-legal-talk lists, and will shortly subscribe to the import list. I can be found on Freenode IRC as adamw. Email and XMPP are both adamw A T happyassassin D 0 T net.

Schedule

This could potentially be completed quite quickly - in a matter of days - if approved and the necessary legal confirmation received from the City. But I am quite prepared to stay the course for longer if it becomes necessary.

Status

  • Suggestion / proposal mail sent to imports@ and talk-ca@ 2014-01-21

Import Data

Background


Data source site: http://data.vancouver.ca/datacatalogue/propertyInformation.htm
Data license: http://vancouver.ca/your-government/open-data-catalogue.aspx (click on Terms of use)
Type of license (if applicable): Open Government License (British Columbia) 2.0-based
Link to permission (if required): Statement received from the City 2014-02-03 (see the Legal section below)
OSM attribution (if required): Pending
ODbL Compliance verified: Partial

Legal

The licence for all the City of Vancouver's open data is the "Open Government Licence – Vancouver" version 1.0. This is identical in effect to the "Open Government Licence – British Columbia" version 2.0: the only differences are very minor punctuation, casing and layout changes (e.g. using "a)" instead of "a." and placing the hyperlinks below the licence text instead of inline), plus the obvious changes for jurisdiction.

Paul Norman advises me that the OGL-BC has previously been evaluated; see https://lists.openstreetmap.org/pipermail/legal-talk/2013-December/007685.html for reference. In short, opinion is split as to whether data published under this licence can be used as-is, because of an exemption stating: "This licence does not grant you any right to use: Information or Records not accessible under the Freedom of Information and Protection of Privacy Act (B.C.)". This can be resolved simply by obtaining a statement from the City that the dataset in question is accessible under the FIPPA, avoiding any possible ambiguity.

The City sent me and Paul Norman the following statement on 2014-02-03:

Data available in the “Property Information” location of the City of Vancouver’s Open Data site under the following location: http://data.vancouver.ca/datacatalogue/propertyinformation.htm is released in accordance with the Freedom of Information and Protection of Privacy Act of British Columbia.
 
Barbara J. Van Fraassen
Director, Access to Information
 
City Clerk’s Department
City of Vancouver

OSM Data Files

Import Type

The initial import would be a large one-time import of around 100,000 individual nodes with locations, addr:housenumber and addr:street tags generated from the City's dataset, plus static addr:city (Vancouver), addr:country (CA) and source ("City of Vancouver GIS data {DATE-OF-GENERATION}") tags. It would be relatively simple to script future updates if desired: the city says it updates the Open Data extract weekly, and I have a fairly efficient setup for processing the data (takes about ten seconds), so we should be able to capture new addresses this way. Once the initial mega-import is done, the volume of future address additions should be relatively small, and they could likely be manually verified (by survey or 'local knowledge' - almost every new address in Vancouver is a new condo tower, these days...) and tweaked at import time for higher quality.

The template asks: "Identify what method will be used for entering the imported data into the OSM database - e.g. API, JOSM, upload.py, etc." I am not sufficiently familiar with these different mechanisms to know which would be best, and I welcome advice.

Data Preparation

Data Reduction & Simplification

The dataset is already relatively clean and simple. As provided by the City it contains only single nodes (usually corresponding to the centres of property parcels), each with a house number and street name, plus a couple of other fields that I don't think are of any use to OSM - currently I'm discarding these, but we could easily keep them if desired.

The current CoV dataset contains 93,203 nodes, after all cleaning (see below). According to a simple overpass-turbo search, the Vancouver bounding box given at Canada:British_Columbia:Vancouver#Bounding_boxes contains 861 objects with an addr:housenumber tag, and 1704 with an addr:street tag. These are relatively small numbers and it should not be too difficult to resolve the conflations with manual oversight. Paul Norman has some tooling for this which I'll look into.
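The counts above came from a simple overpass-turbo search; a query along these lines would reproduce them. This is only an illustrative sketch - the bounding box coordinates below are rough placeholders, not the actual values from the Canada:British_Columbia:Vancouver wiki page.

```python
# Build an Overpass QL query counting existing objects that carry a given
# address tag inside a bounding box. The coordinates are placeholders, not
# the actual bbox from the Canada:British_Columbia:Vancouver wiki page.
def count_query(tag, bbox):
    """Return an Overpass QL query counting objects with the given tag."""
    south, west, north, east = bbox
    area = '%s,%s,%s,%s' % (south, west, north, east)
    return (
        '[out:json];'
        '(node["%(t)s"](%(a)s);'
        'way["%(t)s"](%(a)s);'
        'relation["%(t)s"](%(a)s));'
        'out count;' % {'t': tag, 'a': area}
    )

# Example: count objects already tagged addr:housenumber in (rough) Vancouver
query = count_query('addr:housenumber', (49.19, -123.23, 49.32, -123.02))
```

The resulting string can be pasted into overpass-turbo or POSTed to an Overpass API endpoint.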

No shapes are involved in this import, only single nodes.

Tagging Plans

Translation of the tags is relatively simple, and is achieved using Paul Norman's version of ogr2osm with a fairly simple translation file which I wrote (based on one of Paul's Surrey translations).

The original dataset contains nodes with correct and correctly-formatted house numbers in a tag named CIVIC_NO, all-upper-cased street names using a few abbreviations (ST, AV, and the compass points) in a tag named STREETNAME, and a couple of other tags whose purpose is unknown to me - TAX_COORD and SITE_ID.

The translation maps the CIVIC_NO and STREETNAME tags to addr:housenumber and addr:street respectively, throws away the tags I believe are of no interest (this would be trivial to change if it's desired to preserve these), and adds the tags addr:city "Vancouver", addr:country "CA" and source "City of Vancouver GIS data (DATE-OF-GENERATION)".
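The mapping can be sketched roughly as follows. This is a standalone illustration, not the actual vanaddress.py: the function name and the drop-empty behaviour follow ogr2osm's filterTags convention, but the details here are my assumptions about an equivalent implementation.

```python
from datetime import date

# Illustrative sketch of the CoV-to-OSM field mapping. ogr2osm translation
# files define a filterTags(attrs) callback that receives a dict of source
# attributes and returns the OSM tags for that feature (or None to drop it).
def filterTags(attrs):
    """Map City of Vancouver address fields to OSM addr:* tags."""
    if not attrs:
        return None
    civic_no = attrs.get('CIVIC_NO', '').strip()
    street = attrs.get('STREETNAME', '').strip()
    # Nodes lacking a house number or street name are dropped entirely.
    if not civic_no or not street:
        return None
    return {
        'addr:housenumber': civic_no,
        'addr:street': street,
        # TAX_COORD and SITE_ID are discarded; static tags are added.
        'addr:city': 'Vancouver',
        'addr:country': 'CA',
        'source': 'City of Vancouver GIS data (%s)' % date.today().isoformat(),
    }
```

Note that this sketch passes the all-upper-case STREETNAME values through unchanged.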

Changeset Tags

This is another area in which I could use guidance. The source tag will be used as described above. Going by Proposed_features/changeset_tags, we should presumably also apply a type=import tag, and probably a url tag.

Data Transformation

Beyond what is described above, the original dataset contains a little over 1,000 nodes that do not have a house number or street name (but do have the TAX_COORD and SITE_ID tags). These are thrown away by the ogr2osm translation.

The original dataset also contains around 6,000 cases where multiple nodes share the same house number and street name (but are not in precisely the same place); many of these are large buildings downtown. Manual inspection of several cases shows that none of the nodes seem to be outright incorrectly placed - they all lie within the correct property parcel. I wrote a small bash script to keep only one node for each such case (the first the script encounters), and it is applied to the OSM files linked on this page. If desired, this step can easily be left out and the duplicate entries left in the file to be examined and weeded out manually - it would also be trivial to provide a list of the relevant nodes, to make them easier to find.
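The de-duplication logic amounts to keeping the first node seen for each house number / street pair. The actual script is bash; this Python equivalent is purely illustrative, with each node reduced to a (housenumber, street, lat, lon) tuple for simplicity.

```python
# Illustrative Python equivalent of the bash de-duplication step:
# keep only the first node encountered for each (housenumber, street) pair.
def dedupe_addresses(nodes):
    """Drop all but the first node for each house number + street pair."""
    seen = set()
    kept = []
    for housenumber, street, lat, lon in nodes:
        key = (housenumber, street)
        if key in seen:
            continue  # a node with this address was already kept
        seen.add(key)
        kept.append((housenumber, street, lat, lon))
    return kept
```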

You should be able to perfectly reproduce my OSM file (modulo any changes by the City of Vancouver - they don't provide older datasets for download) by getting the CoV data in shp format, extracting the property_addresses.* files, and then running:

ogr2osm.py -t vanaddress.py property_addresses.shp
cov_duplicate_addresses.sh property_addresses.osm property_addresses_deduped.osm

ogr2osm, my translation file, and the de-duplication script are all freely licensed.

Data Transformation Results

The OSM file linked in #OSM Data Files is the processed data, after all translations and transformations described above.

Data Merge Workflow

Team Approach

I welcome any assistance with this effort. Paul Norman has been a great help so far.

References

The template states: "List all factors that will be evaluated in the import." I am not quite sure what this means. Obviously any conflations with existing data should be manually inspected, and the data can easily be checked by eye in an editor by anyone familiar with the City. I have already checked various areas I know quite well, and found the data accurate.

Workflow

Again, I welcome assistance with this section - I'm sure it's not as simple as 'upload the entire near-100,000 node dataset from josm and go make a cup of tea'. :)

Conflation

As mentioned in a couple of places above, there is a relatively small set of existing street address data within the CoV, and it should be feasible for a small number of people - or even just me - to inspect the conflations and resolve them appropriately. I can do any physical surveys needed for this.

I'm also looking at tool-based approaches - I'll evaluate various ways to do this and see what looks like the best time/accuracy trade-off. Paul has some tools for this, and there is also osm-addr-tools, which I'm currently playing with (as it doesn't require me to learn pgsql...)

As an initial set of thoughts:

  • It should be relatively simple to just drop addresses that already exist in correct form in the existing OSM data - both sets of tools have a function for this
  • The city can be split quite neatly into a few areas:
    • The downtown peninsula and UBC (heavily mapped)
    • A few other major corridors (moderately / heavily mapped)
    • Everywhere else - i.e. roughly all of South Vancouver (very lightly mapped)
  • We could use different merge strategies for the different areas, perhaps handling heavily mapped and obviously 'cared about' areas more manually
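The first bullet - dropping import addresses that already exist in identical form - can be sketched as below. This is a deliberately simplified illustration: Paul's tools and osm-addr-tools match more carefully (e.g. handling the abbreviated street names), whereas this sketch only normalises case, since the CoV data is all upper-case while OSM names are mixed-case.

```python
# Simplified sketch of the "drop exact matches" conflation step. Addresses
# are (housenumber, street) pairs; comparison is case-insensitive because
# the CoV data is all upper-case while existing OSM names are mixed-case.
def drop_existing(import_addrs, osm_addrs):
    """Remove import addresses already present (case-insensitively) in OSM."""
    existing = {(h.lower(), s.lower()) for h, s in osm_addrs}
    return [(h, s) for h, s in import_addrs
            if (h.lower(), s.lower()) not in existing]
```

Real conflation tooling would also need to expand abbreviations (ST, AV, compass points) before comparing, which this sketch does not attempt.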

QA

There is excellent free satellite data available for Vancouver (again, courtesy of our enlightened local governments) - high resolution and accurately aligned, as verified by many traces. This helps a lot with verification of the data; for areas of the city I'm familiar with it's pretty easy to inspect the data in an editor, and I have been doing so without yet finding any errors.

I don't know what the typical level of QA required for such an import is, and again, would welcome advice. I could certainly check quite a large amount of the data in an editor, and do some random 'spot check' physical surveys as well.

Drive-by QA notes:

  • There are a lot of building shapes and addresses on Broadway which don't quite line up with the CoV data; a manual survey is needed to see which is correct
  • 1150 Nelson Street has funky addressing