New York (state)/NYS GIS SAM Address Points Import

NYS GIS SAM Address Points is an import of the New York State GIS SAM Address Points dataset, an ArcGIS Geodatabase covering all of New York State. As of June 2021, every county outside of New York City has had its initial import done. Updates of the data have not yet been implemented.

Goals

The goal is largely to fill in gaps of address information. The import should favor existing data and leave it alone.

Timeline

Import Timeline
2020-07 Planning begins
2021-01-23 Import begins on one county by hand
2021-02-04 Bulk import begins

Import Data

Background

Data source site: https://gis.ny.gov/gisdata/inventories/details.cfm?DSID=921
Data license: No explicit license
Link to permission: See New York/NYS GIS Clearinghouse
Conversion program source code: https://gitlab.com/dead10ck/nys-gis-sam-import-rs
ODbL Compliance verified: yes?

OSM Data Files

.osc files were generated by the nys-gis-sam-import-rs conversion program. The files that were generated and uploaded are available under a requester-pays AWS S3 bucket at s3://skyler-public-requester/osm/nys-gis-sam-import/.

Import Type

Initially, the focus will be on getting all the data. However, the address points are updated quarterly, and because the unique ID of each address point is stored in the tags, updating should be relatively straightforward to implement. There are several timestamp fields in the database that indicate when each point was created and last modified, so filtering down to just the new and updated data should be as simple as querying for all the points created or modified after the date of the last update.
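As a rough illustration of that filtering step, the sketch below keeps only the records created or modified after the previous import run. The field names DateCreated and DateModified, the record layout, and the sample IDs are hypothetical; the real column names are listed in the dataset's data dictionary.

```python
from datetime import date

def changed_since(points, last_update):
    """Return the points created or modified strictly after last_update."""
    changed = []
    for point in points:
        # Hypothetical field names; a point that was never modified
        # is assumed to have no modification date.
        stamps = (point.get("DateCreated"), point.get("DateModified"))
        latest = max((d for d in stamps if d is not None), default=None)
        if latest is not None and latest > last_update:
            changed.append(point)
    return changed

# Invented sample records, for illustration only.
points = [
    {"NYSAddressPointID": "A1", "DateCreated": date(2020, 5, 1), "DateModified": None},
    {"NYSAddressPointID": "A2", "DateCreated": date(2020, 5, 1), "DateModified": date(2021, 4, 2)},
    {"NYSAddressPointID": "A3", "DateCreated": date(2021, 3, 15), "DateModified": None},
]
new_ids = [p["NYSAddressPointID"] for p in changed_since(points, date(2021, 2, 4))]
```

Here only A2 and A3 would be candidates for an update run, since A1 has not changed since the date passed in.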

Initially, one county was done manually by using the .osc files in JOSM, and inspecting the data before upload. This was to discover corner cases or bugs.

After confidence was gained, the rest of the data was generated and uploaded one entire county at a time.

Data Preparation

Tagging Plans

The field names and definitions are defined in the PDF file advertised on the dataset web page here. Mappings are as follows:

Additionally, for the purposes of identifying data that comes from this import, and for use in tracking updates to the dataset in the future, these fields are added with a prefix particular for this import:

  • NYSAddressPointID = nysgissam:nysaddresspointid=*
  • nysgissam:review=* is added by the nys-gis-sam-import-rs program when it detects circumstances that warrant special attention, such as when existing address data conflicts with the data set, when the building it is in is part of a multipolygon, etc. The value is a short description of the reason it was flagged for review. See QA.

Numbered routes

There are many roads named simply after their numerical reference number, e.g. "State Highway 355", "County Road 35", etc. There is no standard naming convention for these roads, even in official government sources, so a road may be "County Road 35" in one source and "County Highway 35" in another. This is a problem that predates this import, and it is reflected in data all over the state, in addresses, road names, and other tags. For example, some of these numbered routes have name=State Highway X and ref=X, while others simply have ref=X with no name=* tag.

Given this, I considered fixing this problem outside the scope of this import, and therefore did not attempt any normalization of the addr:street=* tags.

Data Reduction & Conflation

The GDB of address points is read by the nys-gis-sam-import-rs program, and each point is then checked against an osm2pgsql Postgres database to see whether the address already exists within a short distance of the point.

For each address point, it will do the following to look for existing data:

If it finds an existing element, it checks whether the element is missing any of the addr:*=* tags, and adds them if so.

If nothing is found, then it looks for any buildings that the point lies inside, or within 1 meter of, and conflates with the building if:

  • There is only one address point that lies inside the building
  • The point is not also inside a building:part.

Otherwise, they are left as points.
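The two conditions above can be sketched as follows. This is a simplified, pure-Python stand-in and not the importer's actual code: it assumes planar coordinates, uses a plain ray-casting test instead of database queries, and leaves out the "within 1 meter" tolerance. All function names and sample geometries are hypothetical.

```python
def point_in_polygon(point, ring):
    """Ray-casting test; ring is a list of (x, y) vertices."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def conflation_targets(points, buildings, building_parts):
    """Map each point index to a building index, or None to leave it a node."""
    containing = [
        [b for b, ring in enumerate(buildings) if point_in_polygon(pt, ring)]
        for pt in points
    ]
    # Count how many address points fall inside each building.
    counts = {}
    for hits in containing:
        for b in hits:
            counts[b] = counts.get(b, 0) + 1
    result = {}
    for i, pt in enumerate(points):
        hits = containing[i]
        in_part = any(point_in_polygon(pt, ring) for ring in building_parts)
        alone = len(hits) == 1 and counts[hits[0]] == 1
        result[i] = hits[0] if alone and not in_part else None
    return result

def square(x0, y0, x1, y1):
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

buildings = [square(0, 0, 10, 10), square(20, 0, 30, 10), square(40, 0, 50, 10)]
parts = [square(42, 2, 48, 8)]  # a building:part inside the third building
points = [(5, 5), (22, 5), (28, 5), (45, 5), (100, 100)]
targets = conflation_targets(points, buildings, parts)
```

In this example only the first point is conflated (with the first building). The second and third points share a building, the fourth lies inside a building:part, and the fifth is not inside any building, so all four stay as nodes.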

Single point inside building gets conflated
Multiple points inside building are left as points

The dataset contains many points for large apartment buildings that lie on the same coordinate, with each unit listed as a separate point. When possible, these are combined into a single point, and the units are combined into addr:flats=*.

  • When the units are all purely numeric, and fall into linear sequences, these are combined into number ranges. e.g. if there are 5 points with units 1, 2, 3, 4, and 5, this is combined into addr:flats=1-5. However, units 2, 4, 6 would end up addr:flats=2;4;6, and units 1A, 2B, 3C would be addr:flats=1A;2B;3C, etc.
  • When there are multiple addr:floor=*'s for the same primary address on the same coordinate, they are combined with a ;.
  • When combined, the NYSAddressPointIDs are all joined with a ;. This can quickly run up against the 255-character limit on tag values, so when the combined nysgissam:nysaddresspointid=* value is too large to fit into one tag, it will "spill over" into multiple tags, suffixed with a number, e.g. nysgissam:nysaddresspointid:2=*, nysgissam:nysaddresspointid:3=*, etc.
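The unit combining and the ID "spill over" described above might look roughly like the sketch below. It is an illustration under assumptions, not the importer's code: it assumes that runs of two or more consecutive numbers are collapsed into a range, that numeric units are sorted and de-duplicated first, and that IDs are split only at ID boundaries. The function names and the sample ID format are made up.

```python
MAX_TAG_VALUE = 255  # OSM limit on the length of a tag value

def combine_flats(units):
    """Combine unit labels into a single addr:flats value."""
    if not units:
        return ""
    if not all(u.isascii() and u.isdigit() for u in units):
        # Any non-numeric unit: no ranges, just a semicolon-separated list.
        return ";".join(units)
    numbers = sorted(set(int(u) for u in units))
    runs = []
    start = prev = numbers[0]
    for n in numbers[1:]:
        if n != prev + 1:
            runs.append((start, prev))
            start = n
        prev = n
    runs.append((start, prev))
    return ";".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)

def spill_ids(ids, key="nysgissam:nysaddresspointid", limit=MAX_TAG_VALUE):
    """Pack IDs into key, key:2, key:3, ... keeping each value within limit."""
    tags = {}
    current = ""
    index = 1
    for id_ in ids:
        candidate = id_ if not current else current + ";" + id_
        if current and len(candidate) > limit:
            # The current tag is full: store it and start the next one.
            tags[key if index == 1 else f"{key}:{index}"] = current
            index += 1
            current = id_
        else:
            current = candidate
    if current:
        tags[key if index == 1 else f"{key}:{index}"] = current
    return tags

flats_a = combine_flats(["1", "2", "3", "4", "5"])
flats_b = combine_flats(["2", "4", "6"])
flats_c = combine_flats(["1A", "2B", "3C"])

sample_ids = [f"ID{i:08d}" for i in range(60)]  # invented 10-character IDs
id_tags = spill_ids(sample_ids)
```

With these inputs, the three addr:flats values come out as 1-5, 2;4;6 and 1A;2B;3C, matching the examples in the list above, and the 60 sample IDs are spread over three tags.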

Modifying existing elements

If an address already exists, the importer only modifies existing elements if they are missing some addr:*=* tags.
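A minimal sketch of this "only fill in what is missing" rule, assuming tags are held as plain dictionaries. The function name and the sample tag values are invented for illustration, and the sketch deals only with addr:* keys.

```python
def fill_missing_addr_tags(existing, imported):
    """Return (merged_tags, changed), adding only addr:* keys absent from existing."""
    merged = dict(existing)
    for key, value in imported.items():
        # Never overwrite a value a mapper has already set.
        if key.startswith("addr:") and key not in merged:
            merged[key] = value
    return merged, merged != existing

existing = {
    "building": "house",
    "addr:housenumber": "12",
    "addr:street": "Main Street",
    "addr:city": "Troy",
}
imported = {
    "addr:housenumber": "12",
    "addr:street": "Main Street",
    "addr:city": "Albany",
    "addr:postcode": "12207",
}
merged, changed = fill_missing_addr_tags(existing, imported)
```

In this example the element gains addr:postcode, while the existing addr:city value is left untouched even though the imported value differs.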

Partial data

Note from above that in order for an existing element to match, it must have both the addr:housenumber=* and addr:street=*. If an element only has an addr:housenumber=*, then it will not be found, and a new node is placed there beside the existing node.

House numbers near each other with different streets

This was an intentional decision. An early version of the importer originally fell back to just looking for addr:housenumber=* if there was nothing with both, but it turns out it is actually fairly common for two addresses to be close to each other that have the same house number, but different streets. This meant that good data was being skipped in cases where it found the same house number close to the address point, and in other cases, the wrong existing element was getting conflated with the incorrect address point and got the rest of the addr:*=* tags.

So the choice was between skipping good data, conflating with the wrong existing element, or "duplicating" the address alongside the existing element that only had an addr:housenumber=*. I chose the last option.
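The matching rule can be sketched as a simple predicate. This assumes an exact string comparison on both tags, which may differ from what the importer actually does; the function name and the sample data are hypothetical.

```python
def is_existing_match(element_tags, housenumber, street):
    """An element matches only if both the house number and the street agree."""
    return (
        element_tags.get("addr:housenumber") == housenumber
        and element_tags.get("addr:street") == street
    )

# Invented nearby elements: same house number, different or missing street.
nearby = [
    {"addr:housenumber": "10", "addr:street": "Elm Street"},
    {"addr:housenumber": "10", "addr:street": "Oak Avenue"},
    {"addr:housenumber": "10"},
]
matches = [is_existing_match(tags, "10", "Oak Avenue") for tags in nearby]
```

Only the second element matches. The third, which has just a house number, is not considered a match, so a new node would be placed beside it.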

Changeset Tags

Other things to know about the data

Addresses of new developments

This data contains addresses that have been registered by the addressing authority but for which no physical structure has yet been built, i.e. new developments that are still under construction, or have not yet begun. I've discussed this data with a couple of officers from the NYS GIS office, and this was Frank Winters' response:

Once an address is assigned by the addressing authority it technically exists. I picture these addresses being useful for 911 and other purposed at the very beginning of site development. For example, the delivery of a pre-fab home, or a 911 call that comes in when someone get’s hurt clearing the lot. Also, there are thousands of undeveloped parcels that are addressed. Those addresses are used for administration of taxes, or 911 call. The status of development is part of the assessment data attached to parcels. We collect and process this data annually although county and local assessors might have a more current version. Parcel centroids point might help: http://gis.ny.gov/parcels/

This makes sense to me, and I can't think of any way that it hurts to be in OSM. But additionally, the only way to exclude them would be to categorically exclude all address points that are not associated with a building structure, which would include addresses for buildings which do exist but do not yet have satellite imagery (since these points are based on parcel centroids until imagery is available), driveway points, etc. So I think the best course is to keep them.

However, sometimes reality is messy. Sometimes developments are planned and approved, but get abandoned, and the address rolls don't get updated. If you have local knowledge that a new development never happened, or otherwise that the address nodes are marking something that does not and likely will never exist, don't hesitate to delete the address nodes.

Point types

The dataset separates address points into different types, numbered 1 through 5, in the field named PointType, as described in the Data Dictionary. Type 5 is described as "Miscellaneous," and includes many different types of things, such as highway mile markers, benches, monuments, etc. Some of this is possibly useful, but sorting through each category of miscellaneous item to determine what is worth staying, and then trying to figure out how to potentially tag it differently than everything else, seems like more trouble than it's worth for the value it would give. So I am choosing to exclude PointType 5.

Data Merge Workflow

Edits were made under the nysgissam_dead10ck account.

QA

review tags

As mentioned earlier, the importer adds the nysgissam:review=* tag when it encounters a situation that a human being should look at. Below are the tag values, what they mean, and my thoughts about how to handle them.

In general, don't be afraid to fix mistakes in the addresses, or delete addresses that are wrong. Future updates will check for, and give favor to, user edits to the address points. This includes deletions. You need not worry that the address will come back in a later update (at least not by this import).

However, when moving the tags to a different OSM element (such as when turning a node into a building), please be sure to keep the nysgissam:nysaddresspointid=* tag intact so there remains a trace back to the import data, and so the importer can find it again when it's checking for updates.

no matching street nearby

This means there was no street nearby with a name that matches the addr:street=* tag in the address. It is by far the most common review reason. In my experience, they are mostly due to minor typos or simple spelling variations in the street name, e.g. Whittmore vs Whittemore, St Josephs Road vs Saint Josephs Road, etc. I tried to strike a balance here between generating lots of noise and spotting actual typos in the existing street names or address points; when comparing the names, it ignores apostrophes and letter case, but nothing else.
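The comparison described above could be sketched like this. It is an approximation, not the importer's code; in particular, treating the typographic apostrophe (’) the same as the ASCII one is an assumption, and the function names are hypothetical.

```python
def normalize_street(name):
    """Drop apostrophes and fold case; everything else is kept as is."""
    return name.replace("'", "").replace("\u2019", "").lower()

def streets_match(a, b):
    return normalize_street(a) == normalize_street(b)

same = streets_match("Saint Joseph's Road", "saint josephs road")
abbreviated = streets_match("St Josephs Road", "Saint Josephs Road")
typo = streets_match("Whittmore Road", "Whittemore Road")
```

Here only the first pair is treated as the same street. The abbreviation and the spelling difference both count as mismatches and would be flagged for review.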

To fix it, try to find what the actual street name is. Do a survey, check Mapillary, or consult the online map viewer for streets. In my experience, the address data has been correct, and the street name needed to be fixed, but there are also sometimes errors in the address data. Your mileage will vary. If you can determine which one is right, fix the tags accordingly. If official data sources conflict, follow the on-the-ground principle and let what's on the street sign win.

Because of the numbered route problem, this check was skipped for numbered routes.

existing element's addr:* has different addr:*

The import data conflicts with the address that already existed on the building. If you can determine which one is right, update the tags. If the import data is correct, copy all of the imported node's tags to the previously existing element (sans the review tag) and delete the imported node. If the imported address is wrong, just delete the node.

found > 1 existing matching address

The previously existing OSM data had more than one match found for the address. In this case, the importer can't know which element to conflate with. If possible, try to deduplicate the address tags somehow. Reasonable mappers may have different opinions about how (or whether) to do this, but my personally preferred approach if there are two buildings that share an address is to draw a way around the land that encompasses the whole property that shares the address and put the imported node's tags (sans the review tag) on this way.

repeat address

The import data had the same address more than once. Try to determine which node is the right (or better) one, and delete the other. If you don't mind, please copy the nysgissam:nysaddresspointid=* tag value and combine it with the one you're keeping, separating the values with a semicolon.

inside/near multiple ways

An address inside multiple ways

If the address point is inside a building and a building:part=*, it will not conflate automatically, because the building:part=* could be any valid value in the building=* tag (such as roof), so it's best to let a human decide which parts of the building get the address tags, or to leave it as a node.

An address point near multiple buildings

If the address point was not inside any way, but near multiple buildings very close to each other, such as in densely populated cities, the importer can't know which one is right, so a human must figure out which building the address goes on.

multipolygon

If the point is inside a building which is part of a multipolygon that did not already have at least the addr:housenumber=* and addr:street=*, then it is not conflated. Multipolygons vary widely in terms of which physical structures have the address. Sometimes it's a house and a detached garage; sometimes two offices, where each has a different address; sometimes two offices with the same address.

They are generally different enough that a human should decide what to conflate the address point with, whether it's one of the buildings, the multipolygon relation, or an area that surrounds both.

Suspicious data

Sometimes you will come across addresses that look questionable, and weren't automatically detected and marked with nysgissam:review=*. It can sometimes be tricky to determine if an address is garbage or not, and therefore it can be tricky to determine what to do with the data, and the nysgissam:nysaddresspointid=* tag. There is no hard and fast answer, but these are some things you may come across.

New address points alongside old ones

This can also come in the form of address tags added to a building way that had addresses tagged as nodes.

This is likely due to a difference in the street name, causing the existing elements not to be found during the automated import. It can also be due to the house numbers disagreeing. Generally speaking, this should be treated like any other conflict: try to determine which one is right if possible. Do your due diligence to determine if the address is actually a valid address or not. Note that there need not be an occupied building structure, or any structure at all, for an address to be valid according to USPS, emergency first responder services, etc. Also note that not all addresses are on a sign outside, or otherwise advertised in an obvious way. There are also occasionally times when a business makes a mistake in the address they advertise on their website.

However, if you come across data where you are very confident the imported nodes are wrong, here are some rules of thumb:

  • If the street name is only different because of a misspelling or small spelling variation, just fix the addr:street=* tag, merge the rest of the addr:*=* tags if any are missing from the existing elements, and include the nysgissam:nysaddresspointid=* tag.
  • If the street name is completely different, it's possible a nearby road has been officially renamed. If this is the case, and the imported street name needs to be changed, then fix it and merge the elements, including the nysgissam:nysaddresspointid=* tag.
  • If the house number is wrong, it becomes a bit more murky. From the perspective of the source data, there's an ambiguity: is the address number wrong, or the coordinate? For this reason, I recommend fixing the addr:housenumber=* and deleting the nysgissam:nysaddresspointid=* tag. This will signal to future updates not to modify this address.

Non-existent addresses

See also Addresses of new developments.

Sometimes you may encounter what looks like randomly placed address nodes in the middle of a forest, or in a park, etc. I again strongly suggest putting some effort into researching the town's local development plans, if possible, and otherwise trying to determine if it could possibly be a valid address.

However, local knowledge should take precedence here. If you know your neighborhood and you have a high confidence that these address nodes are nonsense, delete them.

Likewise for addresses added to buildings; if you're confident the imported address is nonsense, delete the address tags and the nysgissam:nysaddresspointid=* tag.

See also

An email to the following mailing lists was sent on 2021-01-09: