Import/Catalogue/MassGIS Addresses


MassGIS Addresses is an import of the MassGIS Master Address Data (Statewide Address Points for Geocoding) dataset, which is a personal geodatabase with comprehensive address data for Massachusetts, used for Next-Generation 911. As of December 2018, the import is at the approval-seeking stage.

The main communications mechanism is the talk-us-massachusetts list, which contains many active mappers. We also discuss some of this on the #massachusetts channel on the OSM Slack group.

The code to process the data and prepare upload files is at https://github.com/yyatsyn/MassGIS-address-import.

Goals

See Import/Guidelines for process related timeline

  • Add comprehensive address data to OSM for Massachusetts
  • Ensure that no hand-mapped data is broken or overwritten by the import process. (However, the process may produce QA information that leads to hand correction of data, with due care.)
  • Validate that the data source has a very high level of accuracy. (Of course, there may be a few errors, but we should check that the subset we import is at least 99.9% accurate.)
  • Ensure that all data added is 100% structurally correct, to minimize future cleanup work.
  • Ensure that all data added conforms to OSM best practices, such as putting address tags on buildings where appropriate, and representing unusual addresses (1R, 1 1/2, etc.) according to established tagging norms
  • Structure the import so that future imports are not made any more difficult than they have to be. Import a subset of the MassGIS data first if doing so gets better data into OSM without problems and deferring the issues with the remaining subsets causes no harm.

Steps

We are structuring the import into several phases, according to level of detail of the import and coding requirements for that phase. Before applying each phase, we will seek community feedback for that Phase's steps with opportunities to revise steps or defer problems until a later time or project. For more detailed information about the phases and tagging design, see further steps below in this document.

Overall Steps

  • Design, implement, publish and test a program to produce files to be imported, as well as produce information about inconsistencies between OSM and MassGIS data.
  • Document import logic here, as well as in comments in code.
  • Many members of the MA OSM community will check the proposed upload information as well as inconsistencies in their local areas.
  • Email proposal to imports@openstreetmap.org (perhaps after initial QA is successful)

Phase 0 - Planning

  • Identify the source data with the greatest amount of information available that will cause the fewest problems when manipulated: the MassGIS Data: Master Address Data - Statewide Address Points for Geocoding product. The product is downloadable as a single compressed archive, removing the step of downloading each town individually.
  • Verify legal compatibility
  • Discuss tagging schemes, further import phases, and potential issues
  • Discuss long-range update schemas

Phase 1 - Low Risk Import

Under Phase 1, we seek to import entries with the least chance of ambiguity, error, or conflict. The following tags and limitations would be used:

  • Only STATUS=ACTIVE;LINKED;MODIFIED - Other statuses could pose potential conflicts that will require manual review later.
  • Only POINT_TYPE=BC - This keeps only points located directly on the building polygon's centroid, so there is no ambiguity about which building the address node applies to.
  • UNIT=* - Individual addr:unit numbers are separated by the semi-colon delimiter.
  • addr:housenumber=FULL_NUMBE - Values may be integers, alphanumeric strings, or numbers separated by hyphens to denote a range. The data is directly compatible with OSM's tagging.
  • addr:street=UL_STREET_ - The street names have been converted from uppercase and checked for accidental erasure of double capitals. JOSM validation will verify this.
  • No addr:city - (an optional attribute in OSM)
  • No addr:state - (an optional attribute in OSM)
  • No addr:postcode - (should come from post office source)
  • No multipolygon buildings - (to avoid having to put tags on the relation)
  • No duplicate address nodes; duplicates are deleted via JOSM validation.
  • No changing existing OSM addresses; only add addresses to buildings that do not have addresses.
  • Cities like Barnstable will be split into administrative level 9 using the MAD community shape file.
  • There should not be multiple points on top of a centroid in this import stage, but should this arise, we will leave the conflicting points unimported, or give the importing party discretion to merge the data if it seems reasonable to do so.
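
The Phase 1 limitations above can be read as a simple record filter. The following is a minimal sketch of that reading, not the project's actual code: the function name phase1_tags and the dict-based record layout are hypothetical, while the field names come from the source tagging schema. It leaves unit handling and the multipolygon and existing-address checks aside.

```python
# Hypothetical sketch of the Phase 1 filter; field names follow the
# source schema table (STATUS, POINT_TYPE, FULL_NUMBE, UL_STREET_).
PHASE1_STATUSES = {"ACTIVE", "LINKED", "MODIFIED"}


def phase1_tags(record):
    """Return OSM address tags for a MassGIS record, or None if it is skipped."""
    if record.get("STATUS") not in PHASE1_STATUSES:
        return None
    if record.get("POINT_TYPE") != "BC":
        return None
    number = (record.get("FULL_NUMBE") or "").strip()
    street = (record.get("UL_STREET_") or "").strip()
    if not number or not street:
        # Assumption: incomplete records are left for manual review.
        return None
    # addr:city, addr:state and addr:postcode are deliberately not emitted.
    return {"addr:housenumber": number, "addr:street": street}
```

With the sample record from the schema table (STATUS=ACTIVE, POINT_TYPE=BC, FULL_NUMBE=90, UL_STREET_=Flaherty Way Extension), this returns only the addr:housenumber and addr:street tags; the POSTCODE value is dropped.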

Phase 2 - Medium Risk w/ Conflict Resolution

In Phase 2, we seek to import points with POINT_TYPE=BMP.

BMP points assign addresses to parcels that contain one, two or more buildings.

We import only BMP addresses that correspond to L3 parcels with only unaddressed buildings.

To select, on each parcel, the building that is most likely to be residential, we use the following approach.

Step 1: Download the data on areas of all buildings with addresses within boundaries of a given town. Use these data to estimate the kernel density of log-areas of buildings with addresses.

Step 2: For each building that is located on a parcel with a BMP address, test the null hypothesis that its area has been drawn from the distribution estimated in Step 1 (two-sided test for the mean). Save the resulting p-values.

Step 3: On each parcel, select the unique building for which the corresponding p-value exceeds 5% (the only building for which the null hypothesis cannot be rejected).

Illustration: the histograms below are built for log-areas of buildings with addresses (blue) and for buildings without addresses on BMP-addressed parcels (orange). All buildings in the left hump of the orange distribution are small buildings (garages, etc.) for which the null hypothesis will be rejected.

[1]

Step 4: If Step 3 results in multiple buildings for which the null hypothesis is not rejected, then among such buildings select the one that

a) has the log-area that is closest to the median of the reference (blue) distribution AND

b) has the largest number of points in its exterior among all buildings on the parcel.


MAD addresses for which no building is selected after Step 3 and Step 4 will be imported manually, because the test and criteria outlined above do not give conclusive results.
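
Steps 1 to 4 above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in the project repository: the kernel density estimate uses Gaussian kernels with a rule-of-thumb bandwidth, the two-sided test is approximated by the two-sided tail probability under that estimate, and Step 4 is read as requiring criteria (a) and (b) to point at the same building. The names two_sided_p and select_building and the tuple layout are hypothetical.

```python
import math
from statistics import NormalDist, median, stdev

ALPHA = 0.05  # the 5% threshold from Step 3


def kde_cdf(x, sample, bandwidth):
    """CDF of a Gaussian kernel density estimate built on `sample`."""
    std_normal = NormalDist()
    return sum(std_normal.cdf((x - s) / bandwidth) for s in sample) / len(sample)


def two_sided_p(log_area, ref_log_areas):
    """Two-sided tail probability of `log_area` under the reference KDE."""
    # Assumption: normal-reference (rule-of-thumb) bandwidth.
    h = 1.06 * stdev(ref_log_areas) * len(ref_log_areas) ** -0.2
    f = kde_cdf(log_area, ref_log_areas, h)
    return 2 * min(f, 1 - f)


def select_building(parcel_buildings, ref_areas):
    """Pick the building to receive the BMP address, or None for manual import.

    parcel_buildings: list of (building_id, area, n_exterior_points) tuples.
    ref_areas: areas of already-addressed buildings in the same town (Step 1).
    """
    ref = [math.log(a) for a in ref_areas]
    # Step 2 and Step 3: keep buildings whose log-area is not rejected.
    plausible = [b for b in parcel_buildings
                 if two_sided_p(math.log(b[1]), ref) > ALPHA]
    if len(plausible) == 1:
        return plausible[0][0]
    if not plausible:
        return None
    # Step 4: (a) closest to the reference median AND (b) most exterior points
    # among all buildings on the parcel; if they disagree, defer to a human.
    med = median(ref)
    closest = min(plausible, key=lambda b: abs(math.log(b[1]) - med))
    most_points = max(parcel_buildings, key=lambda b: b[2])
    return closest[0] if closest[0] == most_points[0] else None
```

For a parcel holding a 150 m² house and a 20 m² garage, with reference areas between roughly 90 and 280 m², the garage is rejected and the house is selected. When two buildings pass the test and criteria (a) and (b) disagree, the function returns None, matching the manual-import fallback described above.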


Phase n+1

This final phase will be ongoing, and will consist of periodic data updates driven by automated detection of changes. We require additional resources to help identify how this goal would be achieved.

Schedule

We do not have a firm schedule. A notional goal is to complete an initial import in early 2019, after a July 2018 start.

Import Data

Background

Data source site:

MassGIS has a "Master Address Database" (MAD) that is used to generate several products, available at: https://docs.digital.mass.gov/dataset/massgis-data-master-address-data.

There is a "Basic Address List" and an "Advanced Address List" version of the data available (as .xlsx, with per-town clicking required). There is a "Basic Address Points" version available (per-town, with clicking) that represents addresses as points, typically adjusted to building centroids or parcel centroids. There is also "Statewide Address Points for Geocoding", which is similar but has more fields and covers the entire state in one download. It is available at https://docs.digital.mass.gov/dataset/massgis-data-master-address-data-statewide-address-points-geocoding. (Note that this requires submitting an email address to get a download link, but it is obviously automated because the email appears in about a minute.) We plan to use the latter statewide geocoding-based file, as we believe it to be the most comprehensive export available.


Data license: CC-BY, plus explicit permission from MassGIS to use in OSM and for the form of attribution
Link to permission: copy of explicit permission from MassGIS is at https://lists.openstreetmap.org/pipermail/talk-us-massachusetts/2018-July/000258.html
OSM attribution (if required): "MassGIS (Bureau of Geographic Information), Commonwealth of Massachusetts EOTSS"
ODbL Compliance verified: \todo Add link to LWG discussion of use of CC-By data

OSM Data Files

Files present in the import have been downloaded from the MassGIS reference site, and made available to initial contributors working on the project via Dropbox. There are 355 source files, representing every admin_level=8 town/city boundary in Massachusetts.

Initially, we will work with the files manually, converting them within JOSM. However, we aim to eventually publish code that will extract data from OSM and from the MassGIS reference files and produce .osc changeset files, according to the project phase steps outlined and the limitations written into them. Source files and OSM changeset files will be made available to persons who choose not to use the import code, or who wish to perform the data merging manually.

Import Type

We are planning this as a multi-phase import, with a later phase able to be run occasionally to update data as necessary. We expect that the changesets for subsequent runs will be small. Initially, imports will be managed manually, but a distant goal is to have the resources available to digest the source data and create .osm/.osc files automatically.

Source File Tagging Schema

This tagging schema is what is contained on the source data we are importing. It is a merge of data from MassGIS Data: Master Address Data - Statewide Address Points for Geocoding and MassGIS Data: Master Address Data - Advanced Address List. Following the tagging schema are several subheadings with tabular information about the domain values for that specific key, and remarks about that key's nature. A sample data point is provided for reference.

ID | Name | Description | Sample
0 | MASTER_ADD | Unique ID for a standardized address record. Generated via sequence and trigger. | 2829417
1 | FULL_NUMBE | Standardized address number of the address. May be a range and/or contain characters indicating prefixes and suffixes. | 90
2 | STREET_NAM | Standardized street name. | FLAHERTY WAY EXTENSION
3 | UNIT | Standardized value for the subaddress portion of an address representing a 'Unit' designation (e.g. '1' in 'BLDG A, FLR 3, UNIT 1'). | 376
4 | BUILDING_N | Building designation or formal building name if it has one (e.g. 'A' from 'BLDG A, UNIT 1', or 'MUGAR SCIENCE CENTER'). | SAKOWICH CAMPUS CENTER
5 | COMMUNITY_ | Name of the MSAG community. | NORTH ANDOVER
6 | GEOGRAPHIC | Official municipality name. | NORTH ANDOVER
7 | GEOGRAPH_1 | TOWN_ID of the municipality in which the address is physically located. Should match the GEOGRAPHIC_TOWN_ID (1-351) of any associated address multipoint. | 210
8 | POSTCODE | Address 5-digit ZIP Code (likely derived from ZIP+4 and MSAG/street name lookup or HERE data, not necessarily the address' original source material). | 01845
9 | PC_NAME | Neighborhood name representing a sub-community location of the standardized street name. Adds an additional measure of geographic accuracy/uniqueness, especially for duplicated street names in a given MSAG community. [Note: Alternatively contains upstream data for COMMUNITY and TOWN fields where NEIGHBORHOOD data is not relevant.] | NORTH ANDOVER
10 | COUNTY | Official name for the county. | ESSEX
11 | cty_town | Concatenated County + Town name | ESSEX, NORTH ANDOVER
12 | ADDRESS_ID | Unique ID for a standardized address record. Generated via sequence and trigger. | 2829417
13 | STATUS | Domain values represent the address' status in regards to its association with any address multipoint, reviewability, recent edits, or obsolescence. | ACTIVE
14 | POINT_TYPE | The origin and nature of the address multipoint, especially distinguishing between structure vs. non-structure origins, and single vs. multiple structures. See the domain values below. | BC
15 | UL_STREET_ | Upper/Lowercase conversion of Standardized street name. | Flaherty Way Extension
16 | UL_UNIT | Upper/Lowercase conversion of Standardized value for the subaddress portion of an address representing a 'Unit' designation (e.g. '1' in 'BLDG A, FLR 3, UNIT 1'). | 376
17 | UL_BUILDIN | Upper/Lowercase conversion of Building designation or formal building name if it has one (e.g. 'A' from 'BLDG A, UNIT 1', or 'MUGAR SCIENCE CENTER'). | Sakowich Campus Center
18 | UL_COMMUNI | Upper/Lowercase conversion of Name of the MSAG community. | North Andover
19 | UL_TOWN | Upper/Lowercase conversion of Official Municipality Name | North Andover
20 | UL_PC_NAME | Upper/Lowercase conversion of Neighborhood name representing a sub-community location of the standardized street name. | North Andover

STATUS=* Limitations

The data we're using contains a status field indicating the overall "health" of the point in question. We plan on using this field to help us manage which entries should be considered for import, and which are likely erroneous.

Value | Description | OSM Import Phase
ACTIVE | Address is assigned to an address point. | Phase 1
INACTIVE | Address is obsolete. | N/A
UNASSIGNED | Address not yet assigned to an address point. | N/A
SITE_REVIEW | Address belongs to a site and requires review before becoming ACTIVE as the address can probably be further disaggregated to the building level. | N/A
REVIEW | Address requires review before becoming active (for any reasons other than site disaggregation). | N/A
NOT_FOUND | Address underwent review and a geographic location for it could not be found. | N/A
GEOCODED | Address temporarily assigned to a geocoded point; still requires review to try to find a structure-based point instead. | N/A
PARENT | Address record is temporary and represents an aggregation of other unit-specific addresses. This record's MASTER_ADDRESS_ID is equal to its PARENT_ADDRESS_ID. | N/A
DCAM_REVIEW | Address record belongs to a state-owned property and requires review (preferably by DCAMM) before becoming active. | N/A
MODIFIED | This address record is active and had one or more address component fields edited during FDC/local review. | Phase 1
ADDED | This address record represents an address that was added as a new or missing address identified during FDC. It should not have an address point ID. | N/A
UNLINKED | The address point ID previously associated with this address record was removed during FDC/local review. (but edit may have been overridden) | N/A
LINKED | This address is active and was associated with a non-geocoded address point during FDC/local review. | Phase 1

POINT_TYPE=* Sorting

We established through data analysis that the granularity of the data provided is linked to the attribute POINT_TYPE=* in the source data set. Two relevant point types exist in this data, BC and BMP. BC represents the confirmed centroid of a specific building polygon, and is located directly on top of that polygon. BMP represents a group of buildings, with the point not necessarily being close to the main structure on that parcel. We determined that importing BC points first poses little to no risk to data integrity, and that BMP points can wait until more advanced coders and contributors become available; this is the basis of our phase-based import schema.

POINT_TYPE value in import data | Description | OSM Import Phase
ABC | Address multipoint represents an Assumed/Approximate Building Centroid. Used for new points where a structure polygon does not currently exist, but should (e.g. new development where the structure necessary to create an accurately placed centroid is not visible in the most current available imagery base map). Eventually converts to 'BC' or 'BMP' after validation by MassGIS. | N/A
BC | Address multipoint represents a Building Centroid (single structure). | Phase 1
BMP | Address multipoint represents a Building Multipoint (multiple structures). Note: MassGIS is aware of a small number of instances where oddly distributed buildings or unusually configured property boundaries result in ‘BMP’ centroids existing in streets, water, or even other properties. As these are identified, we will consider shifting these points to more intuitive and suitable locations. | Phase 2
DBMP | Address multipoint represents a Dissolved Building Multipoint (multiple structures on two or more properties that share an identical address). | N/A
PC | Address multipoint represents a Parcel Centroid (no structure, vacant parcel). | N/A
BEP | Address multipoint represents a Building Entry Point. With an intrinsically higher geographic accuracy than a BC point, these locations represent points of entry/egress to a building, often when it has several primary entrances and/or addresses. Not extensively in use yet by MassGIS. | TBD
BMPC | Address multipoint represents a Building Multipoint Centroid, a centralized location on a given property where many addresses have been assigned to individual structures, but one or more remainder addresses could not be assigned to any specific structure and should retain a link to the property. These points provide a location for the unassigned addresses and a link to the original structure geometry without incorrectly associating the address with any particular structure. For example, the main address (portion of an address with no unit-level designations) of a condominium complex should not be assigned to any one building or unit, but is retained in the MAD because the complex as a whole can be referred to by that main address. | TBD

Data Preparation

Data Reduction & Simplification

The source data is essentially a list of addresses with coordinates, so there is no notion of geometry simplification. However, there may be multiple addresses with the same location. In this case, we will explore manual merging of the data, or correction of inaccurate data on either side. We do not plan on adding addr:unit numbers to address points, as we believe that's outside the scope of this project.

We can also accomplish data reduction through JOSM's integrated tools for skipping over buildings which already contain data matching the address POI, and by consolidating multiple apartment-style addresses into a single building through JOSM's integrated conflict resolution options: you can discard, concatenate, or use data from either set. In the case of apartment numbers on top of each other, you'd use the concatenate function to add them together, such as 20;36;17.
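
The concatenation described above is done interactively in JOSM; purely as an illustration of the resulting value, the sketch below groups stacked points and joins their unit values with semicolons. The function name merge_colocated_units and the dict keys are hypothetical, and the sketch assumes stacked points share exactly identical coordinates. Whether the joined value is ultimately uploaded is a separate decision that this sketch does not make.

```python
from collections import defaultdict


def merge_colocated_units(points):
    """Group address points sharing a location and join their units with ';'.

    points: iterable of dicts with hypothetical keys 'lat', 'lon',
    'housenumber', 'street' and 'unit' (unit may be empty or missing).
    """
    groups = defaultdict(list)
    for p in points:
        # Assumption: stacked points have exactly equal coordinates.
        key = (p["lat"], p["lon"], p["housenumber"], p["street"])
        groups[key].append(p.get("unit") or "")
    merged = []
    for (lat, lon, number, street), units in groups.items():
        seen = []
        for u in units:
            if u and u not in seen:  # keep first-seen order, drop duplicates
                seen.append(u)
        merged.append({"lat": lat, "lon": lon, "housenumber": number,
                       "street": street, "unit": ";".join(seen)})
    return merged
```

Three stacked points with units 20, 36 and 17 collapse into one entry whose unit value is 20;36;17.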

Tagging Plans

Generally, tagging is straightforward following the address schemas laid out within the phase plans.

We discussed housenumbers that are not just numbers, such as 1R and 1 1/2. The community consensus is that we should use the letters and forward-slash character, as provided.

We identified there are tagging inconsistencies along the border of towns where the street address for a residence may not match the physical town the residence is in. For the clarity of the end user, we feel that using the PC_NAME field would correctly identify the address of buildings in the majority of cases, with the exception of unique admin_level jurisdictions, like Barnstable.

Addresses near state borders: are all addresses in the MassGIS data set actually in Massachusetts both logically as well as contained in the border?

Changeset Tags

The changeset comment will include attribution to MassGIS as specified above in the Background section. We will also add this to the wiki (/Contributors, and the Massachusetts page).

Data Transformation

The source data has uppercase street names. We match these with nearby streets in OSM to find the proper capitalization, and produce a list of discrepancies for manual investigation. All source files we are working with contain fields for both uppercase and upper/lowercase variants of the data.

Changing STREETNAME to first-letter-capitalized form

Automated Method

MaxErickson created a python filter for use with ogr2osm to apply the title() function and rename fields all in one action.

Requires: simpleMAtranslation.py and ogr2osm

Use the command:

python ./ogr2osm/ogr2osm.py -t simpleMAtranslation.py AddressPts_M213.shp
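
For readers who have not seen an ogr2osm translation file, the following is a minimal sketch of what such a filter can look like. It is a hypothetical stand-in, not the contents of simpleMAtranslation.py; the field names ADDR_NUM and STREETNAME are taken from the manual method on this page and may differ in other MassGIS products.

```python
# Hypothetical stand-in for simpleMAtranslation.py, not the actual file.
# ogr2osm calls filterTags() once per feature with its attribute dict.

def filterTags(attrs):
    """Map shapefile attributes to OSM address tags."""
    if not attrs:
        return None
    tags = {}
    number = (attrs.get("ADDR_NUM") or "").strip()
    street = (attrs.get("STREETNAME") or "").strip()
    if number:
        tags["addr:housenumber"] = number
    if street:
        # str.title() capitalises after every non-letter, so "1ST" becomes
        # "1St" and "MCDONALD" becomes "Mcdonald"; such names need review.
        tags["addr:street"] = street.title()
    return tags
```

Because title() is purely mechanical, the output still needs the capitalization check against OSM street names described under Data Transformation.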

Manual Method
  1. Open QGIS and load the .shp Shapefile for the town into layers. You can do so by navigating to the .shp file within QGIS's browser section.
  2. Right click the AddressPoints_M* file, and Open Attribute Table.
  3. Open Field Calculator.
  4. Check Update Existing Field, and select STREETNAME.
  5. In the Expression editor below, copy and paste in the phrase "title(STREETNAME)". Click OK, and wait for the data to process.
  6. Exit Attribute Table, saving any changes.
  7. Right click the layer name, Save As. Set the format to ESRI Shapefile using WGS84, and pick a path and name to export the modified-case shapefile.

Import into JOSM

Requires OpenData and Conflation plugins

  1. Download the buildings with no or partial addresses for the town to be modified, using the query wizard, e.g. (type:way and ("addr:housenumber"!=* or "addr:street"!=*) and building=*) in "townname,MA"
  2. Open the modified .shp shapefile from before in JOSM, e.g. city_towns\mgis_MIDDLESEX.CARLISLE.shp. You should now have two layers.
  3. Manual Method: Select the shapefile layer into view, and under objects, edit the name of the field "ADDR_NUM" to be "addr:housenumber" and "STREETNAME" to be "addr:street". (Don't edit the values, just the labels of the values.)
  4. Manual Method: Select the "Data Layer 1" layer into view, and click Search under the Selection window. Start a search to select all features tagged building=yes.
  5. Select the .shp shapefile layer into view and do a Select All. Grab a coffee.
  6. You can either use Conflation to merge the data, or JOSM's Merge Address Points feature. With Conflation: Open the Conflation window with the button on the left. Click Configure within the Conflation window. Select into view the Data Layer 1, and click "Freeze" next to Subject within Conflation's popup. Select into view the shapefile and click "Freeze" next to Reference.
  7. In the conflation settings, keep the simple configuration, ensure you're using Disambiguating, Centroid of <10 or 20, and set Tags to addr:housenumber;addr:street. Under the merging area, uncheck Replace Geometry and All, and leave Merge Tags checked. In the box directly to the right of All, add addr:housenumber;addr:street.
  8. After some processing, you will see cyan arrows on the screen indicating correlations between the target and reference data layers, and which points will be applied to which buildings. You can review these if you want. Also, in the Conflation docked window, there's a list of nodes. Distance indicates how far away the node was from the building, score is the percentile of the possible match, and No Conflicts appears if no conflicts are found. If a conflict is found, a window comes up asking you what to do.
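
To give a feel for what centroid-distance matching does, here is a simplified sketch. It is not the Conflation plugin's algorithm: it assumes coordinates already projected to metres, approximates a building centroid by averaging the vertices of an open ring (no repeated closing vertex), and uses hypothetical names throughout. The 10-unit default mirrors the distance setting mentioned in step 7.

```python
import math


def centroid(ring):
    """Vertex-average centroid of a building outline [(x, y), ...] in metres."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def match_points_to_buildings(points, buildings, max_dist=10.0):
    """Assign each address point to the nearest building centroid.

    points: {point_id: (x, y)}; buildings: {building_id: ring}.
    Returns (matches, conflicts): matches maps building_id -> point_id for
    buildings claimed by exactly one point; conflicts maps building_id ->
    [point_ids] where several points compete for the same building.
    """
    centroids = {b: centroid(r) for b, r in buildings.items()}
    claims = {}
    for pid, (px, py) in points.items():
        best, best_d = None, max_dist
        for bid, (cx, cy) in centroids.items():
            d = math.hypot(px - cx, py - cy)
            if d <= best_d:
                best, best_d = bid, d
        if best is not None:  # points farther than max_dist stay unmatched
            claims.setdefault(best, []).append(pid)
    matches = {b: p[0] for b, p in claims.items() if len(p) == 1}
    conflicts = {b: p for b, p in claims.items() if len(p) > 1}
    return matches, conflicts
```

Buildings that end up in conflicts correspond to the cases where a human has to decide whether to discard, concatenate, or keep one of the competing addresses.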

Data Transformation Results

Post a link to your OSM XML files.

Data Merge Workflow

Team Approach

We have a team of people working on the problem, communicating through email and the Slack #massachusetts channel. Our mappers will work together to create reproducible processes for data transformation and merging, doing many "dry runs" of merging and analyzing the change file results for anomalies. We will have some programming developers assisting us later in the process to help create automated tools that make the import process easier in the future. We rely on community consensus on the talk-us-massachusetts mailing list to proceed with phases or developments on the project. As we get closer to the point of uploading data, we plan on uploading through a single account so that changes are revertible. Someone will run a script to upload a single town; we will watch for problems, and then upload more towns, accelerating as we accumulate more trouble-free uploads, and stopping to reassess if there are problems.

References

We are evaluating validity of the source data against data already mapped and entered by people on the ground within OSM.

Workflow

We plan on using JOSM to conflate the data, and upload using separate import accounts.

\todo Information to include:

  • Step by step instructions
  • Changeset size policy
  • Revert plans

Conflation

Identify your approach to conflation here.

QA

Future quality assurance for advanced merges in the later phases of the import will likely use a processing script, which will output information about addresses that could not be matched, about information that differs between OSM and MassGIS, and more. Individual mappers will spot-check addresses to be imported, both against data (L3 parcels) and by actually looking in the field. Until that tool is released, quality assurance can be achieved through JOSM validation, Conflation plugin reports, or the like.
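
As a rough idea of the kind of output such a script could produce, the sketch below compares proposed MassGIS tags with the addr:* tags already present on matched buildings. The function name qa_report, the report categories and the building-id keys are all hypothetical; the real tool may be organised quite differently.

```python
def qa_report(massgis, osm):
    """Compare proposed MassGIS addresses with existing OSM tags per building.

    massgis, osm: {building_id: {"addr:housenumber": ..., "addr:street": ...}}
    Building ids are hypothetical keys produced by an earlier matching step.
    """
    report = {"to_add": [], "mismatch": [], "agree": []}
    for bid, proposed in sorted(massgis.items()):
        existing = {k: v for k, v in osm.get(bid, {}).items()
                    if k.startswith("addr:")}
        if not existing:
            report["to_add"].append(bid)      # building has no address yet
        elif all(existing.get(k) == v for k, v in proposed.items()):
            report["agree"].append(bid)       # OSM already matches MassGIS
        else:
            # Differing or partial existing address: never overwritten,
            # only listed for a mapper to investigate.
            report["mismatch"].append((bid, existing, proposed))
    return report
```

The mismatch list is the part of most interest to local mappers, since it points at places where either OSM or MassGIS may need hand correction.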

See also

The email to the Imports mailing list was sent on 2018-12-30 and can be found in the archives of the mailing list https://lists.openstreetmap.org/pipermail/imports/2018-December/005869.html.

The email to the local community establishing community consensus was sent on 2018-12-17 and can be found in the archives of the mailing list https://lists.openstreetmap.org/pipermail/talk-us-massachusetts/2018-December/000429.html