OpenHistoricalMap/Projects/Newberry Atlas of Historical County Boundaries Import

From OpenStreetMap Wiki

Project Discussion Forum Post

Status: Import Complete; Post-Processing Underway

An import file has been created and is available for review (4 Jan 2023).

Updated import file. (16 Jan 2023)

About

The Newberry Library's Atlas of Historical County Boundaries (AHCB) is an amazing collection of historical GIS information related to the evolution of counties and county equivalents in the United States, including those that predate the formation of the United States. This dataset includes ~17.7K shapes, covering the ~4K counties and county equivalents that have existed throughout the country's history.

The data from the collection is well-tagged (detailed start_date & source metadata) and has a public domain license. And... as of now, there aren't many county boundaries in OHM.

As such, it is a perfect candidate for an import into OHM.

A detailed discussion of this import is available in the related Forum Post.

Import Data

Source Data

Import Type

This is a largely human-reviewed import. That does not make it error-free, but it is not completely automated, for reasons described below.

Geometry Preparation

Segment consolidation

The AHCB source data contains a complete outline for every boundary it includes. As a result, any particular line segment of one county's border might be repeated in an earlier or later version of the same county, in an adjacent county, in its containing state, in an adjacent state, in a county that no longer exists, or in a boundary from when the area was a territory and not a state. And so on. So, for any particular connection between nodes, there might be dozens of overlapping shapes. These overlapping and indistinguishable segments can be a nightmare for anyone trying to edit or reuse these borders.

Luckily, Mark Connelly noticed this, understood the problems it might create, and then wrote an amazing script to break down the Newberry data into constituent and comprehensive atomic segments, as well as to provide the crosswalks to reassemble the original counties and states and associate them with their metadata. Script output files are found here.
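The idea of atomic segments plus a crosswalk can be illustrated with a minimal sketch. This is not Mark Connelly's script; it assumes each shape is a single closed ring of coordinate tuples and treats every node-to-node step as its own segment, and the names `build_edges`, `edge_ids`, and `crosswalk` are hypothetical.

```python
from collections import defaultdict

def build_edges(shapes):
    """Split closed rings into unique segments and record which shapes use them.

    shapes: dict mapping a shape id to a closed ring [(lon, lat), ...].
    Returns (edge_ids, crosswalk); both names are illustrative only.
    """
    edge_ids = {}                  # normalized segment -> sequential edge id
    crosswalk = defaultdict(list)  # shape id -> ordered list of edge ids
    for shape_id, ring in shapes.items():
        for a, b in zip(ring, ring[1:]):
            # Normalize direction so A->B and B->A count as the same segment.
            key = (a, b) if a <= b else (b, a)
            if key not in edge_ids:
                edge_ids[key] = len(edge_ids) + 1
            crosswalk[shape_id].append(edge_ids[key])
    return edge_ids, dict(crosswalk)

# Two made-up unit squares that share the border along x = 1.
shapes = {
    "county_a": [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)],
    "county_b": [(1, 0), (2, 0), (2, 1), (1, 1), (1, 0)],
}
edge_ids, crosswalk = build_edges(shapes)
shared = set(crosswalk["county_a"]) & set(crosswalk["county_b"])
```

Here the shared border is stored once and referenced by both shapes, which is the property that makes the imported ways easier to edit. A real pipeline would also need to merge runs of segments between junctions and handle multipolygons, which this sketch leaves out.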

Way segment uploads

698,428 county ways / segments were created as output of the script. Each way was assigned a unique EDGE_ID identifier. These ways were then tagged for source, source:name, and license and uploaded to OHM using JOSM. See: Source Tagging below.

The resulting ways, which only have a valid OHM OSM ID after being uploaded, were then downloaded as an .osm file. Then, the .osm file was stripped down to a lookup table .csv file that associated each uploaded way's OHM OSM ID with its EDGE_ID. This transformation was done with regular expressions in a text editor.
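The same lookup table could also be produced with a short script instead of editor regular expressions. The sketch below is an alternative, not the method used for the import; it assumes the EDGE_ID was kept as a tag with key `EDGE_ID` on each uploaded way, and the way ID and EDGE_ID in the sample are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

def edge_lookup_rows(osm_xml, edge_key="EDGE_ID"):
    """Yield (ohm_way_id, edge_id) for each way that carries the edge_key tag."""
    root = ET.fromstring(osm_xml)
    for way in root.iter("way"):
        for tag in way.iter("tag"):
            if tag.get("k") == edge_key:
                yield way.get("id"), tag.get("v")

def write_lookup_csv(osm_xml, out):
    """Write a two-column CSV lookup table to the file-like object out."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ohm_way_id", "edge_id"])
    writer.writerows(edge_lookup_rows(osm_xml))

# Made-up sample; real files come from downloading the uploaded ways.
sample = """<osm version='0.6'>
  <way id='1001'>
    <nd ref='1' /><nd ref='2' />
    <tag k='EDGE_ID' v='42' />
    <tag k='license' v='CC0-1.0' />
  </way>
</osm>"""

buffer = io.StringIO()
write_lookup_csv(sample, buffer)
lookup_csv = buffer.getvalue()
```

Parsing the XML avoids depending on attribute order or quoting style in the downloaded file, which is the main fragility of a regular-expression approach.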

Sample way: Way: 199780241 in Florida.

Label point generation

Boundary relations can be helped by the use of label nodes (where those nodes have a role=label).

Label points were generated in QGIS by creating a point shapefile containing centroids for every county boundary area. A label name tag was created by concatenating the source file's NAME and CNTY_TYPE fields. This created 17K labels.

Then, in order to use the same label across a variety of relations and to not proliferate a bunch of label points, entities with the same ID in the source metadata were combined into a single label. The pandas df.groupby function consolidated each entry into a single row using the minimum start_date, maximum end_date, average longitude, and average latitude to create a primary label point that could be shared across relations with the same source ID.
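The aggregation described above can be sketched without pandas to show exactly what each consolidated row contains. This is a dependency-free illustration of the same min / max / mean logic, not the original notebook; the field names `id`, `start_date`, `end_date`, `lon`, and `lat` and the sample rows are assumptions.

```python
from collections import defaultdict
from statistics import mean

def consolidate_labels(rows):
    """Collapse rows that share a source id into one label point each.

    rows: dicts with id, start_date, end_date (ISO YYYY-MM-DD strings), lon, lat.
    ISO dates of equal length sort correctly as plain strings.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["id"]].append(row)
    labels = {}
    for source_id, members in groups.items():
        labels[source_id] = {
            "start_date": min(m["start_date"] for m in members),
            "end_date": max(m["end_date"] for m in members),
            "lon": mean(m["lon"] for m in members),
            "lat": mean(m["lat"] for m in members),
        }
    return labels

# Made-up centroids for two versions of one hypothetical county.
rows = [
    {"id": "xx_example", "start_date": "1801-09-07", "end_date": "1810-01-30",
     "lon": -81.0, "lat": 40.0},
    {"id": "xx_example", "start_date": "1810-01-31", "end_date": "1813-01-12",
     "lon": -80.5, "lat": 40.5},
]
labels = consolidate_labels(rows)
```

In pandas, the equivalent would be a `groupby` on the ID column followed by an aggregation with `min`, `max`, and `mean` on the respective columns.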

Key note: this will not work well for any counties with geometries that change significantly over time, but these label points are designed to serve as a primary first pass at data consolidation.

In addition, these labels were scrubbed of parenthetical information in the source data (e.g., "(Ext)", "(2Nd)") and abbreviations (e.g., "Dist.", "Terr.", "P.", "Jud. Dist.", etc.). end_date values of 2000-12-31 were deleted, as those were placeholders for no end date in the source data.
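A minimal sketch of this cleanup is shown below. It covers only the parenthetical removal and the placeholder end date; abbreviation handling is left out because the exact treatment of each abbreviation is not documented here. The function name and sample names are hypothetical.

```python
import re

PLACEHOLDER_END = "2000-12-31"            # "no end date" marker in the source data
PARENTHETICAL = re.compile(r"\s*\([^)]*\)")

def scrub_label(name, end_date):
    """Return (clean_name, end_date), with end_date None for the placeholder."""
    clean = PARENTHETICAL.sub("", name)   # drop "(Ext)", "(2Nd)", and similar
    clean = " ".join(clean.split())       # collapse any leftover double spaces
    if end_date == PLACEHOLDER_END:
        end_date = None
    return clean, end_date

# Made-up inputs for illustration.
first = scrub_label("Example (Ext) County", "2000-12-31")
second = scrub_label("Example (2Nd) District", "1846-03-23")
```

Returning `None` for the placeholder lets the caller simply skip writing an end_date tag for boundaries that were still current in the source data.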

Per a prior forum discussion, label point name tags *do not* include any year information.

As with the way segments, these points were tagged with the source tags identified below, Wikidata and Wikipedia tags, and some more label-appropriate tags, and then uploaded to OHM using JOSM.

This ended up creating 4,117 label points. Example label point: Belmont County, Ohio.

Again like the way segments, after uploading, these points were then downloaded and stripped down to a .csv file that was used to associate relations with label nodes.

Relation reassembly

Once the way segments and label points had OHM OSM IDs assigned, they were joined with the metadata and converted to OSM XML to create a file with OSM relations for every county in the original Newberry source files.

Negative relation IDs were assigned to each relation prior to upload.
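As a rough illustration of the reassembly step, the sketch below builds one relation element from already-uploaded way IDs and a label node ID. It is not the conversion actually used; it assumes the JOSM file convention that new objects carry a negative `id` with `action='modify'`, assumes every way is an `outer` member, and uses made-up IDs and a made-up name.

```python
import xml.etree.ElementTree as ET

def build_relation(rel_id, way_ids, label_node_id, tags):
    """Return a <relation> element in JOSM-style OSM XML."""
    # Assumption: a negative id plus action='modify' marks a new object in JOSM.
    rel = ET.Element("relation", id=str(rel_id), action="modify", visible="true")
    for way_id in way_ids:
        # Assumption: no enclaves, so every boundary way is an outer member.
        ET.SubElement(rel, "member", type="way", ref=str(way_id), role="outer")
    ET.SubElement(rel, "member", type="node", ref=str(label_node_id), role="label")
    for key, value in tags.items():
        ET.SubElement(rel, "tag", k=key, v=value)
    return rel

root = ET.Element("osm", version="0.6", generator="sketch")
relation = build_relation(
    rel_id=-1,                      # negative placeholder id, replaced on upload
    way_ids=[1001, 1002],           # made-up OHM way ids from the lookup table
    label_node_id=2001,             # made-up OHM label node id
    tags={
        "type": "boundary",
        "boundary": "administrative",
        "admin_level": "6",
        "name": "Example County (1845-1846)",
        "license": "CC0-1.0",
    },
)
root.append(relation)
osm_xml = ET.tostring(root, encoding="unicode")
```

Because the ways and label nodes already exist in OHM, only the relations need placeholder IDs; the members refer to real positive IDs taken from the lookup tables.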

Tagging preparation

See also: OHM TagInfo Newberry AHCB Project Page (TagInfo project file for Newberry AHCB import; TagInfo project file documentation).

Source metadata / OHM tag translation

Newberry AHCB-OHM Tag Mapping

| AHCB Metadata | OHM Tag | Example |
| --- | --- | --- |
| NAME (ALL CAPS), CNTY_TYPE, START_DATE YYYY, END_DATE YYYY | name | Natchitoches Parish (1812-1827) |
| ID_NUM | nl_ahcb:id | 12858 |
| ID | nl_ahcb:id_text | nys_albany |
| VERSION | nl_ahcb:version | 8 |
| CITATION | nl_ahcb:source | Van Zandt, 141; U.S. Stat., vol. 18, part 3 [1876], p. 474 |
| START_DATE | start_date | 1845-12-29 |
| CHANGE | start_event | SUFFOLK lost to creation of WORCESTER. |
| END_DATE | end_date | 1846-03-23 |
| FIPS | nist:fips_code | 45057 |

name fields for county relations include years for ease of differentiating across the various shapes when looking at a list in the OHM inspector or in JOSM. The iD editor in OHM automatically appends the years to help alleviate any confusion.
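The name construction from the mapping table can be sketched as a small helper. This is an illustration under assumptions: it takes ISO-style dates, title-cases only the ALL CAPS NAME field, and passes CNTY_TYPE through unchanged; the function name and inputs are hypothetical.

```python
def relation_name(name, cnty_type, start_date, end_date):
    """Build a relation name such as 'Example County (1845-1846)'.

    name: ALL CAPS NAME field; cnty_type: CNTY_TYPE field, used as given;
    start_date / end_date: strings beginning with a four-digit year.
    """
    years = f"{start_date[:4]}-{end_date[:4]}"
    return f"{name.title()} {cnty_type} ({years})"

example = relation_name("EXAMPLE", "County", "1845-12-29", "1846-03-23")
```

Note that `str.title()` lowercases everything after the first letter of each word, so names beginning with "Mc" come out as "Mcdowell" and need a separate fix, which matches the capitalization item in the post-import task list.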

The tag import:county_type=* was not sourced directly from the AHCB, but is a derived field. It is intended to preserve information about administrative entities that may serve a county-like function but, for some historical reason or another, have not been called a "county." Over 20 different "types" of counties are included in this import.

Source tagging

Appropriately identifying the sources and redistribution policies for OHM-hosted data is critical for its use as a distribution source for consolidated historical GIS information. As such, all ways and relations associated with this import should be marked with the following tags:

   <tag k='source:name' v='Newberry Library Atlas of Historical County Boundaries' />
   <tag k='source' v='https://publications.newberry.org/ahcbp/downloads/united_states.html' />
   <tag k='license' v='CC0-1.0' />

`license=CC0-1.0` uses the SPDX identifier for the Creative Commons CC0 "No Rights Reserved" dedication.

OpenStreetMap/OHM-specific tagging

In addition to each county's historical metadata, each relation needs to be tagged with OSM/OHM-specific metadata used to let renderers and other systems know how to treat this entity. The admin_level=6 tag is part of OSM convention for counties in the United States.

   <tag k='type' v='boundary' />
   <tag k='boundary' v='administrative' />
   <tag k='admin_level' v='6' />
   <tag k='place' v='county' />

Note: not all of the places imported with this dataset are counties or even county-equivalents in the United States, but to match with OSM-style convention, they are tagged with `place=county`.

Wikimedia tagging

Linking objects in OHM to related entities in Wikidata and Wikipedia will enhance the richness of the data in both places and make OHM's data part of a wider fabric of Linked Open Data across the internet.

The Wikidata codes and Wikipedia pages for these relations were associated using Wikidata SPARQL queries and a fair amount of painstaking data cleansing.
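The exact queries used are not recorded here, so the following is only a sketch of one plausible approach: match on FIPS code, then turn each result into OHM tags. It assumes Wikidata property P882 is "FIPS 6-4 (US counties)" and that the query service returns entity URIs and English Wikipedia sitelink URLs; the helper name `wikimedia_tags` is hypothetical, and the query is shown as a string without being executed.

```python
from urllib.parse import unquote, urlparse

# Assumption: P882 is Wikidata's "FIPS 6-4 (US counties)" property.
QUERY = """
SELECT ?item ?fips ?article WHERE {
  ?item wdt:P882 ?fips .
  OPTIONAL {
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }
}
"""

def wikimedia_tags(item_uri, article_url=None):
    """Convert one query result into wikidata / wikipedia tag values."""
    tags = {"wikidata": item_uri.rsplit("/", 1)[-1]}
    if article_url:
        parsed = urlparse(article_url)
        lang = parsed.netloc.split(".")[0]             # "en" from en.wikipedia.org
        title = unquote(parsed.path.split("/wiki/", 1)[-1]).replace("_", " ")
        tags["wikipedia"] = f"{lang}:{title}"
    return tags

tags = wikimedia_tags(
    "http://www.wikidata.org/entity/Q16861",
    "https://en.wikipedia.org/wiki/Bexar_County,_Texas",
)
```

A FIPS join only helps for counties that still exist or had a code assigned, which is one reason manual data cleansing was still needed for historical entities.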

Wherever possible, objects have been tagged appropriately, such as:

   <tag k='wikidata' v='Q16861' />
   <tag k='wikipedia' v='en:Bexar County, Texas' />

Notes:

  1. Not every historical county has its own Wikidata entry or Wikipedia article. Where no appropriate entry could be identified, the fields have been left blank.
  2. Most relations are intended to be 1-way links to Wikidata. Most 2-way relation links should be through the chronology relations that will be created after the primary dataset import.

Source Data Errors

The source data is not 100% accurate. This is a known certainty. Hopefully, it is a "fairly" accurate dataset that can be used as a starting point – a basis – for further improvement.

Known error examples

For example: counties on the Great Lakes do not include their over-water areas; end dates listed as 2000-12-31 are just placeholders; and many boroughs in Alaska do not have accurate start_date values.

Renaming of Shannon County to Oglala Lakota County in 2015.

Alaska borough start_date tags, which were fixed before the related relations were imported.

Accuracy of various county boundary datasets

Import Impact Assessment

A small number of county relations (relatively speaking... in Michigan, the coverage is fantastic, thanks to users leonne & matteditmsts) had been created in the United States prior to this import.

Authors of these pre-existing counties have been contacted using OHM's built-in messages and no data will be deleted or destroyed without coordination from these original authors.

In case these users are not monitoring their site messages, notices for the import plan have also been put on Slack and Discord and a few other Internet fora.

In addition, 96 ways sourced from the AHCB had already been modified in some form. These have been marked for preservation with `preserve=*` so that the nature of their improvements can be understood.

Post-Import Processing

Chronology relation creation

After the county relations have all been uploaded, a type=chronology relation will be created for every county to show its territorial changes over time. Details to follow.

Wikidata updates

After the chronology relations have been uploaded, a link to that relation will be created for every county's Wikidata page.

Error correction and updating the imported ways and relations

After the import, we will work with OHM users to ensure that obvious oversights are corrected, including those related to water coverage.

  • Map the changes that the Newberry dataset noted but declined to map
  • Join state boundaries to county boundaries where there is overlap
    • This is IN PROGRESS. Changes being made under the newberry_import account with #StateCountyAlign changeset hashtag.
  • Join international boundaries to county boundaries where there is overlap
  • ✓ Extend state and county boundaries into the Great Lakes
  • Extend state boundaries into the Atlantic Ocean, Pacific Ocean, and Gulf of Mexico as part of the USDOT time zone boundary import
    • NOTE: Please hold off on this until further discussion. : )
  • Conflate North Carolina county boundaries imported from Carolana.com
  • Extend county boundaries into the Atlantic Ocean, Pacific Ocean, and Gulf of Mexico
    • ✓ New Jersey
  • Extend international boundaries into the Atlantic Ocean, Pacific Ocean, and Gulf of Mexico
  • Add Alaska, Hawaii, Puerto Rico, etc. to United States boundaries
  • ✓ Fix broken California state boundaries
  • ✓ Rename Shannon County, South Dakota, to Oglala Lakota County
  • ✓ Map substantial county boundary changes since 2001
    • ✓ Alaska
    • ✓ Colorado
    • ✓ Virginia
  • Map 195 county boundary changes between 2001 and 2013:
    • ✓ Alaska (2)
    • Arkansas (2)
    • California (14)
    • Colorado (14)
    • Connecticut (3)
    • Florida (2)
    • Georgia (15)
    • Kentucky (2)
    • Louisiana (4)
    • Maine (2)
    • Michigan (2)
    • Missouri (8)
    • Nebraska (13)
    • Nevada (4)
    • New Mexico (2)
    • New York (4)
    • North Carolina (14)
    • ✓ Ohio (1)
    • Oregon (12)
    • Pennsylvania (6)
    • South Carolina (6)
    • Tennessee (2)
    • Texas (8)
    • Utah (6)
    • Virginia (43)
    • ✓ Wisconsin (2)
    • Puerto Rico (2)
  • Map minor county boundary changes based on legal code histories:
    • ✓ Kentucky (no change)
    • ✓ Wisconsin
  • ✓ Map Connecticut's COG planning regions as county-equivalents
  • ✓ Retag Alaska census areas as boundary=census border_type=census_area
  • Retag "non-county areas" as not:boundary=administrative boundary=balance
  • Tag start_date:edtf=*/end_date:edtf=* on 53 boundaries
  • Fix capitalization on "Mc" in county names
  • Map Ohio River boundary disputes
  • Map changes to bancos along the Mexico–U.S. border according to the IBWC
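For the "Mc" capitalization item in the list above, a candidate fix could look like the sketch below. It assumes the affected names are title-cased strings such as "Mcdowell County"; the function name is hypothetical, and any automated rename should still be reviewed by hand before upload.

```python
import re

# Matches "Mc" at the start of a word when the next letter is lowercase.
MC_PREFIX = re.compile(r"\bMc([a-z])")

def fix_mc(name):
    """Capitalize the letter after a leading 'Mc', e.g. Mcdowell -> McDowell."""
    return MC_PREFIX.sub(lambda m: "Mc" + m.group(1).upper(), name)

fixed = fix_mc("Mcdowell County")
unchanged = fix_mc("McLean County")
```

Names that are already correct pass through untouched, so the function can be applied to every name tag without first filtering for affected relations.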

How to fix the old boundaries

In cases where more accurate locations of county boundaries are identified, care should be taken to replicate the source import edges where possible. These edges were created to help minimize the number of boundary segments in the OHM database. Thus, if a county boundary that was part of the original import has multiple subsegments where other historical intersections occurred, the new import should attempt to respect those segments as closely as possible. See diagram below for further explanation.