OpenHistoricalMap/Projects/Newberry Atlas of Historical County Boundaries Import
Status: Import Complete; Post-Processing Underway
An import file has been created and is available for review (4 Jan 2023).
Updated import file. (16 Jan 2023)
About
The Newberry Library's Atlas of Historical County Boundaries (AHCB) is an amazing collection of historical GIS information on the evolution of counties and county equivalents in the United States, including those that predate the formation of the United States. This dataset includes ~17.7K shapes covering the ~4K counties and county equivalents that have existed throughout the country's history.
The data from the collection is well-tagged (detailed `start_date` & `source` metadata) and has a public domain license. And... as of now, there aren't many county boundaries in OHM.
As such, it is a perfect candidate for an import into OHM.
A detailed discussion of this import is available in the related Forum Post.
Import Data
Source Data
- Data source site: https://digital.newberry.org/ahcb/downloads/gis/US_AtlasHCB_Counties.zip (~500MB)
- Data source license: "Creative Commons CC0 "No Rights Reserved" license. DLC https://creativecommons.org/share-your-work/public-domain/cc0/" - Library of Congress AHCB Project webpage
- Note: The AHCB data is distributed with a CC BY-NC-SA 2.5 Deed, which differs from the license identified for the same data on the Library of Congress website. Email communication between the OHM community and contacts at Newberry has clarified that CC0 is the governing license for the data.
- Attribution: Siczewicz, Peter. U.S. Historical States and Territories. Emily Kelley, digital comp. Dataset. Atlas of Historical County Boundaries, ed. by John H. Long. Chicago: The Newberry Library, 2011. Available online from http://publications.newberry.org/ahcbp.
Import Type
This is a largely human-reviewed import. That is not to say it is error-free, but it is not completely automated, for reasons described below.
Geometry Preparation
Segment consolidation
The AHCB source data contains complete outlines for every boundary it includes. As a result, any particular line segment for one county might be repeated in an overlapping border for the same county, for an adjacent county, for its containing state, for an adjacent state, for a county that no longer exists, or from when the area was a territory and not a state. And on and on. You get the picture. So, for any particular connection between nodes, there might be dozens of overlapping shapes. These overlapping and indistinguishable segments can be a nightmare for anyone trying to edit or reuse these borders.
Luckily, Mark Connelly noticed this, understood the problems it might create, and then wrote an amazing script to break down the Newberry data into constituent and comprehensive atomic segments, as well as to provide the crosswalks needed to reassemble the original counties and states and associate them with their metadata. Script output files are found here.
Way segment uploads
698,428 county ways / segments were created as output of the script. Each way was assigned a unique EDGE_ID identifier. These ways were then tagged for `source`, `source:name`, and `license` and uploaded to OHM using JOSM. See: Source Tagging below.
The resulting ways, which only have a valid OHM OSM ID after being uploaded, were then downloaded as an .osm file. The .osm file was then stripped down to a lookup table .csv file that associated each uploaded way's OHM OSM ID with its EDGE_ID. This transformation was done with regular expressions in a text editor.
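The lookup table described above was built with regular expressions in a text editor. As an alternative, the same transformation can be sketched in Python. This is only an illustration: it assumes each downloaded way still carries its identifier as a tag with the key `EDGE_ID`, and the function names, CSV column names, and sample IDs are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_edge_lookup(osm_xml):
    """Return (ohm_way_id, edge_id) pairs for every way that has an EDGE_ID tag."""
    root = ET.fromstring(osm_xml)
    pairs = []
    for way in root.iter("way"):
        # Assumption: the script's identifier is stored under the tag key "EDGE_ID".
        tag = way.find("tag[@k='EDGE_ID']")
        if tag is not None:
            pairs.append((way.get("id"), tag.get("v")))
    return pairs


def to_csv(pairs):
    """Serialize the pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["ohm_id", "edge_id"])  # hypothetical column names
    writer.writerows(pairs)
    return buf.getvalue()


# Made-up sample data: one way with an EDGE_ID tag and one without.
sample = """<osm version="0.6">
  <way id="1001"><nd ref="1"/><nd ref="2"/><tag k="EDGE_ID" v="42"/></way>
  <way id="1002"><nd ref="2"/><nd ref="3"/></way>
</osm>"""

print(to_csv(build_edge_lookup(sample)))
```

Parsing the XML rather than pattern-matching it avoids depending on attribute order or quoting style in the downloaded file.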
Sample way: Way: 199780241 in Florida.
Label point generation
Boundary relations can be helped by the use of label nodes (where those nodes have a `role=label`).
Label points were generated in QGIS by creating a point shapefile containing centroids for every county boundary area. A label `name` tag was created by concatenating the source file's NAME and CNTY_TYPE fields. This created 17K labels.
Then, in order to use the same label across a variety of relations and to avoid proliferating label points, entities with the same ID in the source metadata were combined into a single label. The pandas `df.groupby` function consolidated the entries for each ID into a single row using the minimum `start_date`, maximum `end_date`, average longitude, and average latitude to create a primary label point that could be shared across relations with the same source ID.
Key note: this will not work well for any counties whose geometries change significantly over time, but these label points are designed to serve as a first pass at data consolidation.
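The consolidation step can be sketched with pandas as follows. This is a minimal illustration rather than the script used for the import: the column names (`ID`, `lon`, `lat`) and the sample rows are made up, and dates are assumed to be ISO-formatted strings so that string minimum and maximum match chronological order.

```python
import pandas as pd

# Made-up centroid rows: two boundary versions share one source ID.
centroids = pd.DataFrame(
    {
        "ID": ["xx_example", "xx_example", "xx_other"],
        "start_date": ["1800-01-01", "1820-06-15", "1850-03-01"],
        "end_date": ["1820-06-14", "1840-12-31", "1860-02-28"],
        "lon": [-81.0, -81.2, -90.0],
        "lat": [40.0, 40.2, 35.0],
    }
)

# One row per source ID: earliest start, latest end, mean position.
labels = centroids.groupby("ID", as_index=False).agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
    lon=("lon", "mean"),
    lat=("lat", "mean"),
)

print(labels)
```

Averaging centroids is what makes the result sensitive to large geometry changes: the mean of two distant centroids can fall outside both shapes.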
In addition, these labels were scrubbed of parenthetical information in the source data (e.g., "(Ext)", "(2Nd)") and abbreviations (e.g., "Dist.", "Terr.", "P.", "Jud. Dist.", etc.). `end_date` values of `2000-12-31` were deleted, as those were placeholders for no end date in the source data.
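Two of the cleanup rules above are simple enough to sketch in Python. The helper names are hypothetical, and abbreviation handling is deliberately left out because this page does not say how each abbreviation was resolved.

```python
import re

# Placeholder used in the source data to mean "no end date".
PLACEHOLDER_END = "2000-12-31"


def scrub_label(name):
    """Remove parenthetical notes such as "(Ext)" or "(2Nd)" from a label name."""
    without_parens = re.sub(r"\s*\([^)]*\)", "", name)
    # Collapse any doubled spaces left behind by the removal.
    return " ".join(without_parens.split())


def clean_end_date(end_date):
    """Return None for the placeholder end date, otherwise the date unchanged."""
    return None if end_date == PLACEHOLDER_END else end_date


print(scrub_label("Example County (Ext)"))
print(clean_end_date("2000-12-31"))
```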
Per a prior forum discussion, label point `name` tags *do not* include any year information.
As with the way segments, these points were tagged with the source tags identified below, Wikidata and Wikipedia tags, and some additional label-appropriate tags, and then uploaded to OHM using JOSM.
This ended up creating 4,117 label points. Example label point: Belmont County, Ohio.
Again, as with the way segments, these points were downloaded after uploading and stripped down to a .csv file that was used to associate relations with label nodes.
Relation reassembly
Once the way segments and label points had OHM OSM IDs assigned, they were joined with the metadata and converted to OSM XML to create a file with OSM relations for every county in the original Newberry source files.
Negative relation IDs were assigned to each relation prior to upload.
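A relation of this kind can be sketched with Python's standard XML library. This is an illustration under stated assumptions, not the conversion actually used: the function name is hypothetical, the sample IDs are made up, and the `outer` role on way members is an assumption (this page only specifies `role=label` for label nodes).

```python
import xml.etree.ElementTree as ET


def build_relation(rel_id, way_ids, label_node_id, tags):
    """Build an OSM XML <relation> element for one county boundary version."""
    if rel_id >= 0:
        # Objects that do not exist on the server yet get negative placeholder IDs.
        raise ValueError("new relations need a negative placeholder ID")
    rel = ET.Element("relation", id=str(rel_id))
    for way_id in way_ids:
        # Assumption: boundary segments are added with the role "outer".
        ET.SubElement(rel, "member", type="way", ref=str(way_id), role="outer")
    ET.SubElement(rel, "member", type="node", ref=str(label_node_id), role="label")
    for key, value in tags.items():
        ET.SubElement(rel, "tag", k=key, v=value)
    return rel


# Made-up IDs; real ones would come from the two lookup .csv files.
relation = build_relation(
    -1,
    way_ids=[1001, 1002],
    label_node_id=2001,
    tags={"type": "boundary", "boundary": "administrative", "admin_level": "6"},
)
print(ET.tostring(relation, encoding="unicode"))
```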
Tagging preparation
See also: OHM TagInfo Newberry AHCB Project Page (TagInfo project file for Newberry AHCB import; TagInfo project file documentation).
Source metadata / OHM tag translation
AHCB Metadata | OHM Tag | Example |
---|---|---|
NAME (ALL CAPS), CNTY_TYPE, START_DATE YYYY, END_DATE YYYY | name | Natchitoches Parish (1812-1827) |
ID_NUM | nl_ahcb:id | 12858 |
ID | nl_ahcb:id_text | nys_albany |
VERSION | nl_ahcb:version | 8 |
CITATION | nl_ahcb:source | Van Zandt, 141; U.S. Stat., vol. 18, part 3 [1876], p. 474 |
START_DATE | start_date | 1845-12-29 |
CHANGE | start_event | SUFFOLK lost to creation of WORCESTER. |
END_DATE | end_date | 1846-03-23 |
FIPS | nist:fips_code | 45057 |
`name` fields for county relations include years for ease of differentiating across the various shapes when looking at a list in the OHM inspector or in JOSM. The iD editor in OHM automatically appends the years to help alleviate any confusion.
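The `name` construction in the first row of the table can be sketched as a small Python function. The function name is hypothetical, and the sketch assumes that CNTY_TYPE is already in display case and that both dates begin with a four-digit year.

```python
def relation_name(name, cnty_type, start_date, end_date):
    """Build a relation name such as "Natchitoches Parish (1812-1827)"."""
    # NAME is ALL CAPS in the source. str.title() is a simple first pass and
    # mishandles prefixes such as "Mc" ("MCDONALD" becomes "Mcdonald").
    base = f"{name.title()} {cnty_type}"
    return f"{base} ({start_date[:4]}-{end_date[:4]})"


# Only the years come from the table's example; month and day are placeholders.
print(relation_name("NATCHITOCHES", "Parish", "1812-01-01", "1827-01-01"))
```

The capitalization caveat in the comment corresponds to the "Mc" cleanup item listed under post-import processing.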
The tag `import:county_type=*` was not sourced directly from the AHCB, but is a derived field. It is intended to preserve information about administrative entities that may serve a county-like function but, for one historical reason or another, have not been called a "county." Over 20 different "types" of counties are included in this import.
Source tagging
Appropriately identifying the sources and redistribution policies for OHM-hosted data is critical for its use as a distribution source for consolidated historical GIS information. As such, all ways and relations associated with this import should be marked with the following tags:
<tag k='source:name' v='Newberry Library Atlas of Historical County Boundaries' />
<tag k='source' v='https://publications.newberry.org/ahcbp/downloads/united_states.html' />
<tag k='license' v='CC0-1.0' />
`license=CC0-1.0` uses the SPDX identifier for the Creative Commons CC0 "No Rights Reserved" license.
OpenStreetMap/OHM-specific tagging
In addition to each county's historical metadata, each relation needs to be tagged with OSM/OHM-specific metadata that tells renderers and other systems how to treat the entity. The `admin_level=6` tag is part of OSM convention for counties in the United States.
<tag k='type' v='boundary' />
<tag k='boundary' v='administrative' />
<tag k='admin_level' v='6' />
<tag k='place' v='county' />
Note: not all of the places imported with this dataset are counties or even county-equivalents in the United States, but to match with OSM-style convention, they are tagged with `place=county`.
Wikimedia tagging
Linking objects in OHM to related entities in Wikidata and Wikipedia will enhance the richness of the data in both places and make OHM's data part of a wider fabric of Linked Open Data across the internet.
The Wikidata codes and Wikipedia pages for these relations were associated using Wikidata SPARQL queries and a fair amount of painstaking data cleansing.
Wherever possible, objects have been tagged appropriately, such as:
<tag k='wikidata' v='Q16861' />
<tag k='wikipedia' v='en:Bexar County, Texas' />
Notes:
- Not every historical county has its own Wikidata entry or Wikipedia article. Where no appropriate entry could be identified, the fields have been left blank.
- Most relations are intended to be 1-way links to Wikidata. Most 2-way relation links should be through the chronology relations that will be created after the primary dataset import.
Source Data Errors
The source data is not 100% accurate. This is a known certainty. Hopefully, it is a "fairly" accurate dataset that can be used as a starting point – a basis – for further improvement.
Known error examples
For example: counties on the Great Lakes do not include their over-water areas; end dates listed as 2000-12-31 are just placeholders; and many boroughs in Alaska do not have accurate start_date values.
- Renaming of Shannon County to Oglala Lakota County in 2015.
- Alaska borough start_date tags, which were fixed before the related relations were imported.
Accuracy of various county boundary datasets
Import Impact Assessment
A small number of county relations (relatively speaking... in Michigan, the coverage is fantastic, thanks to users leonne & matteditmsts) had been created in the United States prior to this import.
Authors of these pre-existing counties have been contacted using OHM's built-in messages and no data will be deleted or destroyed without coordination from these original authors.
In case these users are not monitoring their site messages, notices of the import plan have also been posted on Slack, Discord, and a few other Internet fora.
In addition, 96 ways sourced from the AHCB have since been modified in some form. These have been marked with `preserve=*` so that they can be preserved and the nature of their improvements understood.
Post-Import Processing
Chronology relation creation
After the county relations have all been uploaded, a type=chronology relation will be created for every county to show its territorial changes over time, unless there is only 1 territorial boundary for that county (or state), in which case there will be no type=chronology relation.
The same will be done for territories and states.
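The rule for deciding which entities get a chronology relation can be sketched in Python. This is a hypothetical outline: it assumes boundary versions are grouped by their AHCB source ID and ordered by ISO-formatted `start_date`, and the function name and sample values are made up.

```python
from collections import defaultdict


def plan_chronologies(relations):
    """Map each source ID to its relation IDs in date order.

    `relations` is an iterable of (source_id, relation_id, start_date) tuples.
    Entities with only one boundary version are left out, since they get no
    type=chronology relation.
    """
    by_source = defaultdict(list)
    for source_id, relation_id, start_date in relations:
        by_source[source_id].append((start_date, relation_id))
    return {
        source_id: [relation_id for _, relation_id in sorted(versions)]
        for source_id, versions in by_source.items()
        if len(versions) > 1
    }


# Made-up data: "xx_a" has two boundary versions, "xx_b" has only one.
plan = plan_chronologies(
    [
        ("xx_a", 11, "1820-01-01"),
        ("xx_a", 10, "1800-01-01"),
        ("xx_b", 20, "1850-01-01"),
    ]
)
print(plan)
```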
Wikidata updates
After the chronology relations have been uploaded, a link to that relation will be created for every county's Wikidata page.
The same will be done for territories and states.
Error correction and updating the imported ways and relations
After the import, we will work with OHM users to ensure that obvious oversights are corrected, including those related to water coverage.
- Map the changes that the Newberry dataset noted but declined to map
- Join state boundaries to county boundaries where there is overlap. Changes made under the newberry_import account with #StateCountyAlign changeset hashtag.
- Extend state and county boundaries into the Great Lakes
- Extend state boundaries into the Atlantic Ocean, Pacific Ocean, and Gulf of Mexico as part of the USDOT time zone boundary import
- NOTE: Please hold off on this until further discussion. : )
- North Carolina
- Conflate pre-1783 North Carolina county boundaries imported from Carolana.com
- Extend county boundaries into the:
- US national-level boundary relations
- Add Alaska, Hawaii, Puerto Rico, etc. to United States boundaries
- Join international boundaries to county boundaries where there is overlap
- Map changes to bancos along the Mexico–U.S. border according to the IBWC
- Extend international boundaries into the Atlantic Ocean, Pacific Ocean, and Gulf of Mexico
- Fix broken California state boundaries
- Rename Shannon County, South Dakota, to Oglala Lakota County
- Map substantial county boundary changes since 2001
- Map 195 county boundary changes between 2001 and 2013:
- Alaska (2)
- Arkansas (2)
- California (14)
- Colorado (14)
- Connecticut (3)
- Florida (2)
- Georgia (15)
- Kentucky (2)
- Louisiana (4)
- Maine (2)
- Michigan (2)
- Missouri (8)
- Nebraska (13)
- Nevada (4)
- New Mexico (2)
- New York (4)
- North Carolina (14)
- Ohio (1)
- Oregon (12)
- Pennsylvania (6)
- South Carolina (6)
- Tennessee (2)
- Texas (8)
- Utah (6)
- Virginia (43)
- Wisconsin (2)
- Puerto Rico (2)
- Map 47 county boundary changes between 2014 and 2020:
- Alaska (23)
- Florida (2)
- North Carolina (4)
- Oregon (2)
- South Carolina (4)
- Tennessee (1)
- Utah (2)
- Virginia (6)
- Puerto Rico (4)
- Map 2 county boundary changes in 2021:
- Virginia (2)
- Map minor county boundary changes based on legal code histories:
- Map 792 county boundary changes in 2022:
- Map 792 county boundary changes in 2023:
- Maine (1)
- Retag Alaska census areas as boundary=census border_type=census_area
- Retag "non-county areas" as not:boundary=administrative boundary=balance
- Tag start_date:edtf=*/end_date:edtf=* on 53 boundaries
- Fix capitalization on "Mc" in county names - #118780 OHMCha
- Map Ohio River boundary disputes
How to fix the old boundaries
In cases where more accurate locations of county boundaries are identified, care should be taken to replicate the source import edges where possible. These edges were created to help minimize the number of boundary segments in the OHM database. Thus, if a county boundary that was part of the original import has multiple subsegments where other historical intersections occurred, the new edit should respect those segments as closely as possible. See diagram below for further explanation.