Import/All the Places US data

From OpenStreetMap Wiki

The All the Places data import is a one-time import of All the Places data in the US onto existing OSM objects. All the Places scrapes data from companies' public directories and parses it into a rough OSM tagging scheme. This method of information gathering was recently given an "it depends" green light for use in OSM, which appears to permit its use in at least the US, per this Licensing Working Group recommendation and this OSM Community thread. This import will not overwrite any existing data, even data that is incorrect or out of date; it will only add tags that are missing on the object but present in the reference data.

This import uses the Atlus API to parse raw address strings.

The import is currently (as of mid-February 2024) ongoing.

Goals

Primary goal:

  • enrich existing OSM POIs in the US with additional information, such as address, website, and phone data

Secondary goal:

  • identify locations of chains that are new and unmapped (to be added), or stale and shuttered (to be removed). I will put both of these types of non-matches into MapRoulette challenges for editors to investigate (see below)

Schedule

This import will be conducted in stages, and almost certainly will not include all All the Places data. I plan to work by consumer sector, starting with grocery stores, then hotels, and then other sectors if I'm feeling ambitious and things go well. Each import for a new sector will be advertised in the US OSM Community forum.

Import Data

Background

Data source site: https://www.alltheplaces.xyz/
Data license: https://creativecommons.org/publicdomain/zero/1.0/
Type of license: Creative Commons’ CC0-1.0
ODbL Compliance verified: yes, see Licensing Working Group recommendation

Import type

This will be a rolling, one-time import. Any non-matches will be uploaded as separate challenges to MapRoulette (see below) for users to check individually. I will seek to maintain ref values to facilitate future semi-automated syncing if such tooling becomes available.

Data preparation

Data prep will primarily be handled by a custom Python script as well as JOSM validation rules that the community has developed. These tools will split the address field appropriately (All the Places usually provides it as a single addr:street_address or addr:full tag), expand abbreviated street directions and types, format phone numbers, check ref and image tags, fix improper name and street capitalization, and run other QA checks. The files will be saved and processed in the GeoJSON file format, as that is how All the Places provides them and that format is easier to manipulate natively in Python.
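As a rough illustration of the address and phone clean-up described above, here is a minimal Python sketch. It is not the import's actual script (which relies on the Atlus API for address parsing); the function names and the small lookup tables are assumptions made for this example, and a real run would need many more abbreviations and edge cases.

```python
import re

# Hypothetical lookup tables; deliberately incomplete.
DIRECTIONS = {"N": "North", "S": "South", "E": "East", "W": "West",
              "NE": "Northeast", "NW": "Northwest",
              "SE": "Southeast", "SW": "Southwest"}
STREET_TYPES = {"Ave": "Avenue", "Blvd": "Boulevard", "Dr": "Drive",
                "St": "Street", "Rd": "Road", "Ln": "Lane"}


def expand_street(street: str) -> str:
    """Expand an abbreviated leading/trailing direction and the street type."""
    words = [w.rstrip(".") for w in street.split()]
    if len(words) < 2:
        return street
    # Leading direction; skipped for two-word names such as "E St".
    if len(words) > 2 and words[0].upper() in DIRECTIONS:
        words[0] = DIRECTIONS[words[0].upper()]
    # The street type is the last word unless a direction trails it.
    type_index = len(words) - 1
    if words[-1].upper() in DIRECTIONS:
        words[-1] = DIRECTIONS[words[-1].upper()]
        type_index -= 1
    elif words[-1] in DIRECTIONS.values():
        type_index -= 1
    if type_index > 0 and words[type_index] in STREET_TYPES:
        words[type_index] = STREET_TYPES[words[type_index]]
    return " ".join(words)


def split_street_address(raw: str) -> dict:
    """Split a unified street address into addr:housenumber and addr:street."""
    match = re.match(r"^\s*(\d+[A-Za-z]?)\s+(.+?)\s*$", raw)
    if not match:
        return {}  # leave anything unparseable for manual review
    housenumber, street = match.groups()
    return {"addr:housenumber": housenumber,
            "addr:street": expand_street(street)}


def format_us_phone(raw: str):
    """Format a US number as '+1 NNN-NNN-NNNN', or return None if unclear."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f"+1 {digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

For instance, split_street_address("210 W Marlboro Ave") yields addr:housenumber=210 and addr:street=West Marlboro Avenue. Addresses without a leading house number return an empty dict so they can be handled by hand.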

Tagging plans

How I handle tags will depend on the All the Places spider, as each scrapes different content and each brand provides data of differing quality. I will always defer to existing OSM data, except for overwriting brand:wikidata=* information when it is incorrect (some large hotel chains, for example, are tagged with the parent company's Wikidata object). I will provide greater detail for each brand I plan to import below. All the Places-specific tags, like @spider and nsi_id, will always be removed.
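Removing the All the Places-specific keys can be pictured as a simple filter over the GeoJSON properties. The sketch below is an assumption about how this might be done, not an excerpt from the import script; the function name is hypothetical, and only the two keys named above are listed.

```python
# Keys named in the text above; extend as needed per spider.
ATP_ONLY_KEYS = {"@spider", "nsi_id"}


def strip_atp_keys(feature_collection: dict) -> dict:
    """Return a copy of a GeoJSON FeatureCollection without ATP-only keys."""
    cleaned = []
    for feature in feature_collection.get("features", []):
        properties = {key: value
                      for key, value in feature.get("properties", {}).items()
                      if key not in ATP_ONLY_KEYS}
        cleaned.append({**feature, "properties": properties})
    return {**feature_collection, "features": cleaned}
```

The input is left unmodified, so the raw download can be kept alongside the cleaned file for comparison.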

Changeset tags

I will upload changesets from my standard OSM account, as this import should not result in huge changesets (by number of objects changed).

whammo (on osm, edits, contrib, heatmap, chngset com.)

Key | Value
comment | [description of the type of information conflated]
import | yes
source | All the Places
source:url | https://www.alltheplaces.xyz/
import:page | Import/All the Places US data
source:license | CC0-1.0

Data transformation and cleaning

Most of the data transformation and cleaning will be handled by a custom Python script, available to review here. In addition, ad-hoc data cleaning will be performed as needed, for example to split out branch information from the name, or remove ref information from the branch, using both one-time Python code and JOSM validation rules.
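To illustrate the kind of ad-hoc cleaning meant here, the following sketch splits a combined store name into name, branch, and ref. The pattern is an assumption modeled on names shaped like "Trader Joe's Arlington - Clarendon (640)"; other spiders format names differently and would need their own rules, and the function name is hypothetical.

```python
import re


def split_name(raw_name: str, brand: str) -> dict:
    """Split 'Brand City - Branch (ref)' into name, branch, and ref tags."""
    tags = {"name": brand}
    rest = raw_name
    # A trailing "(640)"-style token is treated as the store reference.
    ref_match = re.search(r"\((\w+)\)\s*$", rest)
    if ref_match:
        tags["ref"] = ref_match.group(1)
        rest = rest[:ref_match.start()]
    # Drop the brand prefix, then keep the text after the last " - ".
    if rest.startswith(brand):
        rest = rest[len(brand):]
    branch = rest.rsplit(" - ", 1)[-1].strip(" -")
    if branch:
        tags["branch"] = branch
    return tags
```

Called with "Trader Joe's Arlington - Clarendon (640)" and the brand "Trader Joe's", this returns name=Trader Joe's, branch=Clarendon, and ref=640.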

Workflow

Team approach

I plan on taking this import on alone, but if anyone is interested in helping, please let me know. Community assistance with the MapRoulette challenges for the non-matching objects will be critical, as that will not be my focus.

Uploading process

All upload steps will be taken in JOSM after initial cleaning in Python unless stated otherwise. I will work brand by brand within the same JOSM layer, and upload changes after finishing several brands from the same market segment (all supermarkets, all hotels, etc.).

  • Overpass query (sample) and download based on the primary tag and the brand:wikidata=*, within the US. I will also download objects that may be lacking a brand:wikidata=* tag with a narrow query based on regex of the brand name along with the primary tag.
  • JOSM Conflation plugin to find matches (with distance threshold of 500-1000m depending on the concentration of locations), and add data tag by tag, to ensure there are no conflicts and no overwriting of existing data.
    • Merge "Reference only" non-matches into a separate GeoJSON file for upload to a MapRoulette challenge.
    • Mark "Subject only" non-matches with a descriptive fixme:atp comment that will feed into a MapRoulette challenge as an Overpass query.
  • Address lingering validation errors.
  • Upload in regional chunks (perhaps six to eight across the US) to keep changeset sizes manageable.
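The first two steps above can be sketched in Python. This is an illustration under stated assumptions, not the actual workflow: the real matching is done interactively with the JOSM Conflation plugin, whose internals are not reproduced here. The function names, the name regex, and the greedy nearest-neighbor matching are all choices made for this example.

```python
from math import asin, cos, radians, sin, sqrt


def build_brand_query(key, value, wikidata_id, name_regex):
    """Overpass QL for one brand in the US, with a name-regex fallback."""
    primary = f'["{key}"="{value}"]'
    return "\n".join([
        "[out:json][timeout:300];",
        'area["ISO3166-1"="US"][admin_level=2]->.us;',
        "(",
        f'  nwr{primary}["brand:wikidata"="{wikidata_id}"](area.us);',
        # Objects lacking brand:wikidata, matched by name (case-insensitive).
        f'  nwr{primary}[!"brand:wikidata"]["name"~"{name_regex}",i](area.us);',
        ");",
        "out tags center;",
    ])


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(a))


def conflate(reference, subject, threshold_m=500):
    """Greedily pair (id, lat, lon) tuples that lie within threshold_m.

    Returns (matches, reference_only_ids, subject_only_ids).
    """
    unmatched = {item[0]: item for item in subject}
    matches, reference_only = [], []
    for ref_id, ref_lat, ref_lon in reference:
        best = None  # (subject_id, distance_m) of the closest candidate
        for sub_id, (_, sub_lat, sub_lon) in unmatched.items():
            distance = haversine_m(ref_lat, ref_lon, sub_lat, sub_lon)
            if distance <= threshold_m and (best is None or distance < best[1]):
                best = (sub_id, distance)
        if best is None:
            reference_only.append(ref_id)
        else:
            matches.append((ref_id, best[0]))
            del unmatched[best[0]]
    return matches, reference_only, list(unmatched)
```

The second and third return values correspond to the "Reference only" and "Subject only" non-matches that feed the two MapRoulette challenges. A larger threshold_m (up to 1000) would suit brands whose locations are sparse.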

MapRoulette challenges

Potentially missing POIs

The MapRoulette challenge for POIs that exist in All the Places data but not in OSM is available here. Please use the provided data, including the website if there is one, and/or local knowledge to verify whether the POI actually exists or if it is just a location that has since closed or is "coming soon."

Because this is a cooperative challenge where the data is pre-processed for users' use, users can only make edits in JOSM.

Potentially stale POIs

The MapRoulette challenge for POIs that exist in OSM but not in All the Places is available here. Please use your research skills, including any tags on the object, and/or local knowledge to research whether the location is still operational. If it is, please remove the fixme:atp comment, which is what MapRoulette uses in an Overpass query to build the challenge. If it isn't, please remove the POI from the database. This challenge works in any standard OSM editor, including JOSM, iD, and Rapid.

Tagging scheme by brand

This import will not touch keys for which there is already a value. As such, if an object has incorrect or out-of-date data on it, it will remain that way. The following schema and data will only come into play if an object does not already have a value for the given key.
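The add-only rule can be expressed compactly. The sketch below is an assumption about how such a merge might look in Python (the actual merge is performed tag by tag in JOSM); the function name and the overwrite_keys parameter are hypothetical, the latter standing in for the manual brand:wikidata corrections mentioned under Tagging plans.

```python
def add_missing_tags(osm_tags: dict, reference_tags: dict,
                     overwrite_keys=frozenset()) -> dict:
    """Copy reference tags onto an OSM object without replacing its values.

    Keys listed in overwrite_keys are the only ones allowed to be replaced.
    """
    merged = dict(osm_tags)
    for key, value in reference_tags.items():
        if not value:
            continue  # never add empty values
        if key not in merged or key in overwrite_keys:
            merged[key] = value
    return merged
```

With the default arguments, an existing phone or name on the OSM object always wins, and only keys the object lacks are filled in from the reference data.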

All cleaned reference data is available in the project's GitHub repository.

Grocery

A topic on the community forum about grocery store data was posted on 29 January 2024 and can be found here.

Albertsons

The file for Albertsons contains POIs for 17 different brands under the Albertsons umbrella, the largest of which is Albertsons. POIs tagged amenity=fuel and amenity=pharmacy were removed.

Key | ATP example | Processed example
@spider | albertsons | delete
addr:city | Easton | Easton
addr:country | US | delete
addr:street_address | 210 W Marlboro Ave | addr:street=West Marlboro Avenue, addr:housenumber=210, addr:unit (empty)
addr:postcode | 21601 | 21601
addr:state | MD | MD
brand | ACME Markets | ACME Markets
brand:wikidata | Q341975 | overwrite Q341975
image | https://dynl.mktgcdn.com/p/... | delete
name | ACME Markets | ACME Markets
nsi_id | -1 | delete
opening_hours | Mo-Sa 07:00-22:00; Su 06:00-22:00 | Mo-Sa 07:00-22:00; Su 06:00-22:00
phone | +1 410-822-7073 | +1 410-822-7073
ref | https://local.acmemarkets.com/#5603075 | delete
shop | supermarket | supermarket
website | https://local.acmemarkets.com/md/easton/210-w-marlboro-ave.html | https://local.acmemarkets.com/md/easton/210-w-marlboro-ave.html

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 1,325

ALDI

Key | ATP example | Processed example
@spider | aldi_sud_us | delete
addr:city | Alexandria | Alexandria
addr:country | US | delete
addr:street_address | 425 E Monroe Ave | addr:street=East Monroe Avenue, addr:housenumber=425, addr:unit (empty)
addr:postcode | 22301 | 22301
addr:state | VA | VA
brand | ALDI | ALDI
brand:wikidata | Q41171672 | overwrite Q41171672
contact:facebook | https://www.facebook.com/ALDI.USA | delete
contact:twitter | AldiUSA | delete
image | https://dynl.mktgcdn.com/p/... | delete
name | ALDI 425 E Monroe Ave | ALDI
nsi_id | aldi-68f0e3 | delete
opening_hours | Mo-Su 09:00-20:30 | Mo-Su 09:00-20:30
phone | +1 833-471-7067 | +1 833-471-7067
ref | https://stores.aldi.us/#4420173 | delete https://stores.aldi.us/#4420173
shop | supermarket | supermarket
website | https://stores.aldi.us/va/alexandria/425-e-monroe-ave | https://stores.aldi.us/va/alexandria/425-e-monroe-ave

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 2,357

Whole Foods

Key | ATP example | Processed example
@spider | whole_foods | delete
addr:city | Arlington | Arlington
addr:country | US | delete
addr:full | 520 12th St South | addr:street=12th Street South, addr:housenumber=520, addr:unit (empty)
addr:postcode | 22202 | 22202
addr:state | VA | VA
brand | Whole Foods | Whole Foods
brand:wikidata | Q1809448 | overwrite Q1809448
name | Pentagon City | branch=Pentagon City, name=Whole Foods Market
nsi_id | wholefoodsmarket-90050a | delete
opening_hours | Mo-Su 07:00-22:00 | Mo-Su 07:00-22:00
phone | +1 571-777-3948 | +1 571-777-3948
ref | pentagoncity | pentagoncity
shop | supermarket | supermarket
website | https://www.wholefoodsmarket.com/stores/pentagoncity | https://www.wholefoodsmarket.com/stores/pentagoncity

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 528 | 528

Safeway

POIs tagged amenity=fuel and amenity=pharmacy were removed.

Key | ATP example | Processed example
@spider | safeway | delete
addr:city | Arlington | Arlington
addr:country | US | delete
addr:street_address | 1525 Wilson Blvd | addr:street=Wilson Boulevard, addr:housenumber=1525, addr:unit (empty)
addr:postcode | 22209 | 22209
addr:state | VA | VA
brand | Safeway | Safeway
brand:wikidata | Q1508234 | overwrite Q1508234
contact:facebook | https://www.facebook.com/safeway | delete
contact:twitter | Safeway | delete
image | https://dynl.mktgcdn.com/p/... | delete
name | Safeway | Safeway
nsi_id | N/A | delete
opening_hours | Mo-Su 06:00-23:00 | Mo-Su 06:00-23:00
phone | +1 703-276-9315 | +1 703-276-9315
ref | https://local.safeway.com/#5603910 | delete
shop | supermarket | supermarket
website | https://local.safeway.com/safeway/va/arlington/1525-wilson-blvd.html | https://local.safeway.com/safeway/va/arlington/1525-wilson-blvd.html

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 1,935 | 915

Trader Joe's

Key | ATP example | Processed example
@spider | trader_joes_us | delete
addr:city | Arlington | Arlington
addr:country | US | delete
addr:street_address | 1109 N Highland St | addr:street=North Highland Street, addr:housenumber=1109, addr:unit (empty)
addr:postcode | 22201 | 22201
addr:state | VA | VA
brand | Trader Joe's | Trader Joe's
brand:wikidata | Q688825 | overwrite Q688825
name | Trader Joe's Arlington - Clarendon (640) | name=Trader Joe's, branch=Clarendon
nsi_id | traderjoes-dde59d | delete
opening_hours | Mo-Su 08:00-21:00 | Mo-Su 08:00-21:00
phone | +1 703-351-8015 | +1 703-351-8015
ref | 640 | 640
shop | supermarket | supermarket
website | https://locations.traderjoes.com/va/arlington/640/ | https://locations.traderjoes.com/va/arlington/640/

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 564 | 564

Kroger

The file for Kroger contains POIs for 25 different brands under the Kroger umbrella, the largest of which is Harris Teeter.

Key | ATP example | Processed example
@spider | kroger_us | delete
addr:city | Arlington | Arlington
addr:country | US | delete
addr:street_address | 900 Army Navy Dr | addr:street=Army Navy Drive, addr:housenumber=900, addr:unit (empty)
addr:postcode | 22202 | 22202
addr:state | VA | VA
branch | Pentagon Row | Pentagon Row
brand | Harris Teeter | Harris Teeter
brand:wikidata | Q5665067 | overwrite Q5665067
name | Harris Teeter | Harris Teeter
nsi_id | harristeeter-dde59d | delete
opening_hours | Mo-Su 06:00-23:00 | Mo-Su 06:00-23:00
operator | Harris Teeter Supermarkets, Inc. | delete
phone | +1 703-413-7112 | +1 703-413-7112
ref | 09700083 | 09700083
shop | supermarket | supermarket
website | https://www.harristeeter.com/stores/grocery/va/arlington/pentagon-row/097/00083 | https://www.harristeeter.com/stores/grocery/va/arlington/pentagon-row/097/00083

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 6,796 | 2,857

IGA

The file for IGA contains POIs for 2 different brands under the IGA umbrella, IGA and IGA Express.

Key | ATP example | Processed example
@spider | iga | delete
addr:city | Urbanna | Urbanna
addr:country | US | delete
addr:full | 335 Virginia Street, Urbanna, VA, 23175 | delete
addr:street_address | 335 Virginia Street | addr:street=Virginia Street, addr:housenumber=335, addr:unit (empty)
addr:postcode | 23175 | 23175
addr:state | VA | VA
branch | Pentagon Row | Pentagon Row
brand | IGA | IGA
brand:wikidata | Q3146662 | overwrite Q3146662
name | Urbanna Market IGA | Urbanna Market IGA
nsi_id | iga-166dbe | delete
phone | +1 803-854-5165 | +1 803-854-5165
ref | Urbanna Market IGA | Urbanna Market IGA
shop | supermarket | supermarket
website | https://urbannamarket.iga.com | https://urbannamarket.iga.com

Item | Source | Processed
Download | ATP direct download link | cleaned .geojson file
Count | 714 | 712

Regional chains

These grocery chains have significantly smaller distribution footprints.

Chain | Source | Processed | Count
Giant Eagle | direct download | cleaned file | 480
Hannaford | direct download | cleaned file | 187
Key Food | direct download | cleaned file | 411
Piggly Wiggly | direct download | cleaned file | 493
Publix | direct download | cleaned file | 1,423
Shoprite | direct download | cleaned file | 282
Sprouts | direct download | cleaned file | 413
Stater Bros | direct download | cleaned file | 169
Tops | direct download | cleaned file | 148
Winn Dixie | direct download | cleaned file | 546
Wegmans | direct download | cleaned file | 110

Hotels

The import of hotel data will be almost identical to that of grocery store POIs. I do not plan to overwrite any values except for brand and brand:wikidata where they are clearly outdated or wrong. The post about this stage of the import in the OSM Community forum is here.

The following are the hotel chains I plan to import, with multiple sub-brands in each:

Chain | Processed | Count | Sub-brands (not exhaustive)
Best Western | cleaned file | 2,283 | 'Best Western': 1083, 'Best Western Plus': 902, 'Surestay Plus': 118, 'Surestay': 117
Choice Hotels | cleaned file | 6,360 | 'Quality Inn': 1036, 'Econo Lodge': 697, 'Comfort Inn': 627, 'Comfort Inn & Suites': 605, 'Quality Inn & Suites': 577
Hilton | cleaned file | 5,534 | 'Hampton': 1411, 'Hampton Inn & Suites': 1022, 'Hilton Garden Inn': 777, 'Home2 Suites': 596
Hyatt | cleaned file | 771 | 'Hyatt Place': 353, 'Hyatt': 271, 'Hyatt House': 120, 'Destination': 27
IHG | cleaned file | 1,757 | 'Holiday Inn Express': 994, 'Candlewood Suites': 354
Marriott | cleaned file | 5,381 | 'Fairfield': 1170, 'Courtyard': 1057, 'Residence Inn': 863, 'SpringHill Suites': 553, 'Townplace Suites': 515
Wyndham | cleaned file | 6,519 | 'Super 8': 1534, 'Days Inn': 1355, 'La Quinta Inn & Suites': 750, 'Baymont': 537, 'Travelodge': 438, 'Ramada': 350
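A sub-brand tally like the one in the last column can be reproduced from any cleaned GeoJSON file with a frequency count. The sketch below is one way to do it, assuming each feature carries a brand property; the function name is hypothetical and this is not necessarily how the figures above were generated.

```python
from collections import Counter


def count_brands(feature_collection: dict, top: int = 5) -> list:
    """Return the most common brand values as (brand, count) pairs."""
    counts = Counter(
        feature.get("properties", {}).get("brand", "(no brand)")
        for feature in feature_collection.get("features", [])
    )
    return counts.most_common(top)
```

Loading a file with json.load and passing the result to count_brands gives the largest sub-brands first, which is a quick sanity check before conflating a multi-brand file.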

Fast Food and Cafes

The import of fast food and cafe data will be almost identical to that of other POIs. I will not overwrite any values except for brand and brand:wikidata where they are clearly outdated or wrong. Some brand POIs contain fast food relevant tags, like takeaway=* or drive_through=*. Otherwise, it's mostly the usual address tags, phone, website, etc. The post about this stage of the import in the OSM Community forum is here.

The following are the chains I plan to import:

Chain | Count | File
Carl's Jr. | 1067 | cleaned file
Domino's | 6909 | cleaned file
In-N-Out Burger | 402 | cleaned file
Peet's Coffee | 289 | cleaned file
Cook Out | 359 | cleaned file
Five Guys | 1476 | cleaned file
Subway | 21218 | cleaned file
Dairy Queen | 4269 | cleaned file
Long John Silver's | 523 | cleaned file
Dunkin' | 9241 | cleaned file
Wingstop | 1998 | cleaned file
Wendy's | 6031 | cleaned file
El Pollo Loco | 501 | cleaned file
Pizza Hut | 6780 | cleaned file
Shake Shack | 331 | cleaned file
Popeyes | 3034 | cleaned file
Burger King | 6713 | cleaned file
Panda Express | 2147 | cleaned file
Chipotle | 3387 | cleaned file
Papa John's | 3129 | cleaned file
Arby's | 3318 | cleaned file
Potbelly | 422 | cleaned file
Jamba | 721 | cleaned file
Qdoba | 745 | cleaned file
Bojangles' | 822 | cleaned file
Tim Hortons | 640 | cleaned file
Church's Chicken | 641 | cleaned file
Baskin-Robbins | 2197 | cleaned file
Jimmy John's | 2678 | cleaned file
Starbucks | 16425 | cleaned file
KFC | 4280 | cleaned file
Jack in the Box | 2196 | cleaned file
Chick-fil-A | 2970 | cleaned file
Whataburger | 1025 | cleaned file
Quizno's | 147 | cleaned file
Hardee's | 1621 | cleaned file
Taco Bell | 7947 | cleaned file
Moe's Southwest Grill | 618 | cleaned file
Culver's | 964 | cleaned file
Zaxby's | 924 | cleaned file
Panera Bread | 2127 | cleaned file
A&W | 448 | cleaned file
Dutch Bros. Coffee | 880 | cleaned file
MOD Pizza | 526 | cleaned file
Einstein Bros. Bagels | 688 | cleaned file
The Habit Burger Grill | 381 | cleaned file
Scooter's Coffee | 744 | cleaned file


See also