Allegheny County Building Import

From OpenStreetMap Wiki
Jump to navigation Jump to search

The Allegheny County Building Import is a stalled import of building footprints available from the county's GIS webpage. It was active from October 2017 to May 2018, when the community lost interest in the project. The OSM US tasking manager was later replaced, leaving us unable to continue working on the project. I hope to move the import to the new tasking manager and resume it in the future.

For context, I am a member of the community affected by this import. I live in and map the area covered by this import, and I have been mapping it for over 2 years. I have recently contributed to a couple of building imports, but this is the first import I've tried to organize.

Goals

Improve OpenStreetMap by adding buildings from this dataset. I plan on adding all the buildings, except where OSM users have already traced them, where the data is outdated, or where there is some other reason not to.

Schedule

The import started in October 2017.

Import Data

Data Source: http://openac-alcogis.opendata.arcgis.com/datasets/allegheny-county-building-footprint-locations

The data contains accurate building footprints for all of Allegheny County, plus a small buffer around it. It contains approximately 500,000 buildings!

Data License

I emailed the county about this data and they said, "Allegheny County GIS has no objections to geodata derived in part from Allegheny County Building Footprint Locations being incorporated into the OpenStreetMap project geodata database and released under a free and open license." They also expressed support for our efforts, so it looks like we have permission.

The license on the Web page doesn't mention anything about copying, just that the data is provided without any guarantees:

"Use Constraints: All parties acknowledge that the data is for informational purposes only, there is no guarantee as to its completeness or accuracy. Allegheny County Division of Computer Services Geographic Information Systems Group is not responsible for any reliance upon said data. Distribution Liability: The USER shall indemnify, save harmless, and, if requested, defend those parties involved with the development and distribution of this data, their officers, agents, and employees from and against any suits, claims, or actions for injury, death, or property damage arising out of the use of or any defect in the FILES or any accompanying documentation. Those parties involved with the development and distribution excluded any and all implied warranties, including warranties or merchantability and fitness for a particular purpose and makes no warranty or representation, either express or implied, with respect to the FILES or accompanying documentation, including its quality, performance, merchantability, or fitness for a particular purpose. The FILES and documentation are provided "as is" and the USER assumes the entire risk as to its quality and performance. Those parties involved with the development and distribution of this data will not be liable for any direct, indirect, special, incidental, or consequential damages arising out of the use or inability to use the FILES or any accompanying documentation."

Type of License (if applicable): Public domain? Either way we have permission.

ODbL Compliance Verified:

OSM Data Files

https://github.com/geokitten/allegheny-building-import has the .osm file for the whole dataset and a few samples.

Import Type

A one-time import. We will convert and divide up the import data set automatically, but conflating the data will be done in JOSM through the OSM-US tasking manager.

Data Issues

Our data is good but not perfect. In the interest of transparency, it's best to mention the known issues.

Position Offset

The buildings have a small but variable offset, a few meters south relative to Bing imagery. It's not enough to prevent the building outlines from lining up with the imagery, but it is noticeable at high zoom. I propose that we handle it by including directions for correcting the offset in the instructions the community will follow.

Cut-off buildings

Our dataset extends slightly beyond the borders of Allegheny County, which is a small bonus. On the other hand, the data cuts off at the edges in an inconvenient manner: if a building straddles the edge of the data coverage, it's simply cut in half. These cut-off buildings only occur right at the edge of the area, and they're a tiny fraction of the total import. Before making the data available to the community in the OSM Tasking Manager, I will manually remove the bad buildings in JOSM. This is practical because the buildings are in known locations and small in number.

Data Preparation

I will use ogr2osm to convert the shapefile into an OSM file. Then I will divide the data into chunks with osmconvert, based on TIGER census tract boundaries.

Tagging Plans

The data only contains two useful attributes: the shape of the buildings and the building type: residential, unknown, outbuilding, industrial/commercial, or public building. Residential and industrial/commercial are mostly compatible with our definitions of these terms, but I don't think outbuilding or public building fit into our tagging scheme. Thus, I will leave those as building=yes, just like unknown buildings. This is my own assessment of how the tagging should be done, but I'm open to suggestions.

FEATURECOD             | Government Tag        | OSM Tag
210                    | Residential           | building=residential
NULL, 200, 250, or 295 | Unknown               | building=yes
240                    | Outbuilding           | building=yes
220                    | Industrial/Commercial | building=commercial
230                    | Public                | building=yes
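The mapping above can be sketched as an ogr2osm translation function. This is a minimal illustration of what a translation file like allegheny-translation.py could contain; the real file is in the GitHub repository, and its exact handling of attributes may differ.

```python
# Minimal ogr2osm translation sketch: map the county's FEATURECOD
# attribute to OSM building tags. ogr2osm calls filterTags() once
# per feature with a dict of shapefile attribute values.

FEATURECOD_TAGS = {
    '210': {'building': 'residential'},   # Residential
    '220': {'building': 'commercial'},    # Industrial/Commercial
    '230': {'building': 'yes'},           # Public
    '240': {'building': 'yes'},           # Outbuilding
}

def filterTags(attrs):
    if attrs is None:
        return None
    # NULL, 200, 250, 295, and anything unrecognized become building=yes
    return FEATURECOD_TAGS.get(attrs.get('FEATURECOD'), {'building': 'yes'})
```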

Changeset Tags

Add source=Allegheny County GIS, as well as a hashtag in changeset comments.
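For example, a changeset from this import might carry tags like the following (the hashtag shown here is a placeholder, since the actual hashtag isn't specified in this document):

```
comment=Importing Allegheny County building footprints #AlleghenyBuildingImport
source=Allegheny County GIS
```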

Data Transformation

This is a thorough description of how I prepared the data for import. I tried to make it detailed enough that someone could reproduce my work if they wanted to.

Necessary Software

I did this on a PC running Linux Mint; some adaptations might be needed for other platforms.

  • QGIS
  • ogr2osm
  • JOSM
  • osmconvert

Gather Sources

Download the building footprints and the TIGER census tracts for Pennsylvania, which we will use to divide up the data for easy importing. Unzip the zip files. Now you can inspect the shapefiles in QGIS. Also download my GitHub repository, which contains several files I created.

Conversion to OSM Format

Now we need to convert the buildings shapefile to OSM format. We can do this with the ogr2osm utility, using the translation file I wrote:

./ogr2osm.py -t ./allegheny-translation.py "./building footprints/Allegheny_County_Building_Footprint_Locations.shp" -o "allegheny county buildings.osm"

Be advised that this command takes several minutes to run and requires around 4 GiB free RAM. After the conversion, you can open the resulting OSM file in JOSM given sufficient RAM (about 2 GiB).

Partitioning the Data

The import plan requires dividing the data into reasonably-sized chunks in order to load the import as a task on the OSM tasking manager. We will divide the data based on TIGER census tracts because this dataset is readily available, the census tracts are about the right size, and the boundaries between tracts almost never cut through buildings.

The census tract shapefile covers the whole state, so I opened it in QGIS and cut out only the portion we need for this import. That file is also in my GitHub repository, called census tracts clipped.shp.

Next we need to create a shapefile for each census tract that only contains the shape of that particular census tract. Open the file in QGIS, then go to Vector > Data Management Tools > Split Vector Layer.... Make sure the input layer is the census tract data, and set the Unique ID field to GEOID. Select a folder to put all the output shapefiles in and click OK. Now you have a directory that has one shapefile for each census tract.

Next, go to the folder with the shapefiles and run the shell script from my GitHub repo:

../cut-data.sh

The script will create an OSM extract of the building data for each census tract, 478 in total. The script requires about 1 GiB of RAM and takes about 30 minutes to run with 4 concurrent tasks. osmconvert gives warnings about wrong sequences; I don't know what these mean, but they don't seem to negatively affect anything. Once the conversion is finished, feel free to spot-check the output in JOSM.
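As a rough illustration of what the script automates, here is a Python sketch of the same loop. It assumes each census tract boundary has been exported as an osmconvert-compatible .poly file; the paths, file layout, and helper names are hypothetical and don't match the actual cut-data.sh.

```python
# Sketch: clip the county-wide OSM file to each census tract with
# osmconvert's -B polygon-clipping option. Hypothetical layout:
# one .poly boundary file per tract, named by its GEOID.
import glob
import os
import subprocess

def clip_command(osm_file, poly_file, out_dir):
    """Build the osmconvert invocation that clips one tract."""
    geoid = os.path.splitext(os.path.basename(poly_file))[0]
    out_file = os.path.join(out_dir, geoid + '.osm')
    return ['osmconvert', osm_file, '-B=' + poly_file,
            '--complete-ways', '-o=' + out_file]

def clip_all(osm_file, poly_dir, out_dir):
    for poly in sorted(glob.glob(os.path.join(poly_dir, '*.poly'))):
        subprocess.run(clip_command(osm_file, poly, out_dir), check=True)
```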

Adding the import to OSM Tasking Manager

If you've set up a local copy of the OSM Tasking Manager, you can now add the import to your local copy of the website. I've tested these instructions on my machine, and I plan to repeat them for the actual import.

Before you do this, you need to create a GeoJSON file of the census boundaries we used to divide up the data. You can do this with the following command:

ogr2ogr -f "GeoJSON" ./bounds.json "./census tracts clipped/census tracts clipped.shp"

Where "./census tracts clipped/census tracts clipped.shp" is the filename of the TIGER census tracts data covering only the area of the import.

Go to your copy of the OSM Tasking Manager site, click on the top menu where your username is, and go to "Create a new project". You'll end up on a page that says "Step 1". Where it says "Import a GeoJSON, KML, or zipped SHP file," click Import and open the bounds.json file you created with the previous command. When you get to "Step 2 - Type of project", select arbitrary geometries for the task shapes. Then on step 3, create the project; this will take about a minute to complete. On step 6, add information such as the directions for mappers. To make it possible to load the import data with each task, see the per-task instructions feature. Hit "Save the Modifications" and you're done.

Data Transformation Results

See https://github.com/geokitten/allegheny-building-import for several data samples.

Data Merge Workflow

Team Approach

I want to post this on the OSM-US tasking manager so people can work on it in chunks, but I do not have an admin account there yet. I've created the task in a local copy of the tasking manager so I can learn how to set up the task properly.

Workflow

  • Select a tile from the task manager.
  • Open the OSM data in JOSM and an OSM file of the building outlines for that area.
  • Fix all problems with the import data:
    • Run validator
    • The buildings have a small offset. Select all (Ctrl+A) and drag them until they line up very accurately with Bing imagery.
  • Copy and paste the data from the county data layer to the OSM layer.
  • Then fix any issues associated with conflating the data.
    • If the new buildings overlap with existing buildings, conflate manually: If the imported version is better, use replace geometry to replace the OSM way with it. If the OSM version is better, delete the import version. Make sure to preserve tags and err towards not harming existing data.
    • For any storage tanks you import, remove the building=* tag and tag them as man_made=storage_tank instead.
    • Spot check the data for sanity. For example, if a building from the import differs significantly from aerial imagery, consider deleting or retracing it.
    • Run validator and fix any issues related to the new data.
  • Upload to OSM and on to the next task.

Conflation

If the building already exists in OSM and the shape is of comparable quality, don't change it. If there is already a building node or a roughly traced building there, use replace geometry in JOSM. In general we want to respect what other people (like me) have already mapped by hand.

Current Status

The import is online at the OSM US Tasking Manager, but it is no longer possible to log in and edit it. As of May 11, 2018, it has been 78% done and 75% validated.

Possible Continuation

I intend to resume the import at some point in the future with several improvements:

  • Host it on the new OSM US task manager site
  • Rename it to "Greater Pittsburgh Building Import" because Pittsburgh is more familiar to people than Allegheny County
  • Fix the offset in the data before uploading the data files
  • Divide the import into smaller chunks; many pieces of the older import were hard to work with due to their size

When I resume the import, I will exclude areas that were already completed, contact the mailing list to ensure approval from other editors, and ask for help from other users. I expect to need some advice processing the data: it's a large dataset with over 500,000 buildings in it.