User:Minh Nguyen/South Bay parity

This study compares OpenStreetMap's coverage of area codes 408 and 669 (most of Santa Clara County, California, including San José) to Valley Yellow Pages, the local business telephone directory. You can find detailed data for this study on Google Sheets.

Preliminary findings

  • OpenStreetMap has 12,233 named POIs, which is 22.99% as many as the 53,212 phone numbers listed in the Business White Pages.
  • Among 373 Yellow Pages categories (11% of all categories) representing 32.86% of the entries in the Business White Pages:
    • OSM's coverage is strongest in cannabis dispensaries (not in the Yellow Pages), convenience stores (515.79% versus the Yellow Pages), shopping centers (400.00%), outpatient clinics (344.44%), campsites (333.33%), and theatres (281.82%).
    • OSM's coverage is weakest in tax advisors (2.85%), crematoria (3.33%), chiropractors (4.35%), accountants (5.01%), insurance agents (5.08%), lawyers (5.37%), and real estate agents (5.42%). It's unsurprising that OSM would be weak in these categories, as they tend to be clustered in professional buildings with small signs, making them difficult to survey when driving by. Additionally, professional offices haven't historically been as much of a priority for mappers as retail businesses.
    • OSM has 93.66% as many restaurants and 177.38% as many cafés as the Yellow Pages.
    • OSM has 77.90% as many places of worship as the Yellow Pages.
    • OSM has 75.53% as many schools as the Yellow Pages and Business White Pages.
    • The “OSM presets” tab of the spreadsheet breaks down the comparison into 79 different categories.
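Each percentage above is OSM's count divided by the directory's count. A minimal sketch of that arithmetic, using the overall figures from the first finding (the function name `parity_percent` is only illustrative and is not part of the study's spreadsheet):

```python
def parity_percent(osm_count: int, directory_count: int) -> float:
    """Return the OSM count as a percentage of the directory count."""
    if directory_count <= 0:
        raise ValueError("directory_count must be positive")
    return 100.0 * osm_count / directory_count


# Overall figures quoted in the first finding.
overall = parity_percent(12_233, 53_212)
print(f"{overall:.2f}%")  # 22.99%
```

A value above 100% means OSM has more features in a category than the directory has listings.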

Caveats

There are several caveats to the approach taken in this study. A more formal study would attempt to rigorously quantify the impact that these issues have on the results:

  • The name-based overpass queries include some named features that wouldn’t normally be thought of as POIs. In an attempt to weed out non-POIs, features with certain keys like highway=* and waterway=* have been excluded, as have all line features that don’t form part of an area. However, an insignificant number of non-POIs remain in the queries, slightly inflating the OSM feature counts.
  • This study counts the number of phone numbers in the Business White Pages, which isn’t exactly the same as the number of addresses, and doesn’t attempt to deduplicate redundant entries. These anomalies appear to be insignificant, but they do mean that the Business White Pages statistics are an upper bound on the number of features that would be needed for parity.
    • Some public services and utilities are listed multiple times in the Business White Pages. For example, the phone number 911 is listed multiple times under “Police Departments” and again under “Sheriff’s Department”. Pacific Gas & Electric’s multiple entries are duplicated under “PG&E”.
    • A phone number is given for each department of an educational institution, regardless of whether any departments are colocated. This treatment is most apparent for college campuses.
    • Some dispatch services are listed without addresses. For example, “Orkin” is listed with numbers for “From Campbell”, “From Gilroy”, etc.
    • A few other businesses lack any information other than a name and phone number, such as “PRPI” and “PRPII”. It isn’t clear whether these businesses have physical locations that would warrant coverage in OSM.
  • Some phone numbers outside the 408/669 area codes appeared in the Business White Pages. Most were toll-free numbers for local businesses. In a few cases, the listings included phone numbers within the 650 and 415 area codes. For example, “SRI Consulting” is listed with a Menlo Park address and a 650 number. “Sunbeam Appliance Service Co.” is listed with a San Francisco address and a 415 number. San Francisco–based “KGO Newstalk Radio AM 810” is listed with multiple 415 hotlines. However, these cases appear to be exceedingly few in number.
  • The Business White Pages primarily consists of landline phone numbers. If a business relies on a cell phone as its primary phone, the business may or may not be listed in the Business White Pages.
  • The “By category” portion of this study relies on the Yellow Pages, which has some uniqueness issues that required careful attention. These issues appeared to be insignificant in number and may be balanced out by analogous issues in OSM:
    • Some businesses are incorrectly categorized in the Yellow Pages. For example, the “Church of God Anderson, Indiana” subcategory under the “Churches” category lists “Dosa & Curry Cafe”. It’s unlikely that the church actually operates this cafe; more likely, this miscategorization was either a mistake or an Easter egg.

The biggest caveat may be that only 979 OSM POIs have phone numbers and only 3,023 have addresses, which compares unfavorably to the phone book, in which virtually every phone number has an address and every address has a corresponding phone number. On the other hand, 1,945 OSM POIs have websites and 100% have coordinates, which compares favorably to the zero coordinates and 406 websites in the phone book. Therefore, to the extent that OSM and the phone book both cover POIs, they currently complement each other rather than competing with each other.
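To put the tag counts in proportion, the sketch below expresses them as shares of OSM's named POIs. It assumes the denominator is the 12,233 named POIs reported under the preliminary findings; the text does not state that explicitly, so treat the resulting shares as approximate.

```python
# Assumption: the tag counts are subsets of the 12,233 named POIs
# reported in the preliminary findings.
OSM_NAMED_POIS = 12_233

# Counts quoted in the paragraph above.
tagged = {"phone": 979, "address": 3_023, "website": 1_945}

# Share of named POIs carrying each kind of tag, in percent.
shares = {tag: round(100 * count / OSM_NAMED_POIS, 2) for tag, count in tagged.items()}

for tag, share in shares.items():
    print(f"{tag}: {share:.2f}%")
```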

Methodology

For the phone book:

  1. Pick up a copy of the 2015–16 Valley Yellow Pages for San Jose and Santa Clara. (Judging from the list of communities served on pages 1 and 216, it's intended to cover all of area code 408.)
  2. Count the phone numbers under each letter heading of the Business White Pages.
  3. Count the phone numbers under each category heading of the Yellow Pages.
    • Choose categories among the top 750 tags in OSM globally, as well as any related categories.
    • There is significant overlap between some categories, because a proprietor can pay for placement in multiple categories. For example, an individual Pizza Hut location may be listed under both the Pizza and Restaurant categories (and also under the Pizza section of the Restaurant Guide, which this study excludes because it duplicates the Restaurant section). Redundant pizzeria entries were removed from the Restaurant count in favor of the Pizza count. On the other hand, some features like churches and schools tend to be named twice in OSM (once for the grounds and once more for the building).
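The overlap rule described in the last step can be sketched as follows. This is an illustration only: the study's counts were made by hand from the printed directory, the function name `dedupe_counts` is hypothetical, and the phone numbers are made-up placeholders rather than real listings.

```python
def dedupe_counts(listings: dict[str, set[str]], priority: list[str]) -> dict[str, int]:
    """Count each phone number once, crediting the first category in `priority`."""
    seen: set[str] = set()
    counts: dict[str, int] = {}
    for category in priority:
        unique = listings.get(category, set()) - seen
        counts[category] = len(unique)
        seen |= unique
    return counts


# Placeholder numbers; one pizzeria appears in both categories.
sample = {
    "Pizza": {"408-555-0101", "408-555-0102"},
    "Restaurant": {"408-555-0101", "408-555-0103", "408-555-0104"},
}

# Pizza comes first, so the shared number is dropped from the Restaurant count.
print(dedupe_counts(sample, ["Pizza", "Restaurant"]))
```

Listing the more specific category first in `priority` mirrors the choice to keep pizzerias under Pizza rather than Restaurant.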

For OpenStreetMap:

  1. Install jq and simplify-geojson:
    brew install jq
    npm install -g simplify-geojson
    
  2. Download the NANP area code boundaries in GeoJSON format (public domain).
  3. Extract the 408 area code boundary:
    jq '.features | .[] | select(.properties.NPA == "408").geometry.coordinates[0]' area-codes.geojson
    
  4. Simplify the boundary:
    cat 408.geojson | simplify-geojson -t 0.001
    
  5. Further hand-simplify the boundary in geojson.io, preserving as much precision as possible in densely populated areas while simplifying as much as possible in sparsely populated areas, such as in the mountains or salt marshes. (The overpass queries intersect the resulting boundary with the county boundaries. This is purely a way to reduce the length of the query so that overpass turbo can handle it.) Save as 408.geojson.
  6. Flatten the GeoJSON feature collection into a list of coordinates:
    jq '.features | .[] | select(.properties.NPA == "408").geometry.coordinates[0] | .[] | .[1], .[0]' 408.geojson
    
    Truncate each coordinate to six decimal places, which is ample precision (roughly 0.1 m of latitude).
  7. Query overpass turbo for POIs within the resulting polygon. Include nodes and areas but not lines, which are unlikely to be POIs.
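Steps 6 and 7 can also be sketched in Python. The sketch assumes 408.geojson is a feature collection whose features carry an `NPA` property, as the jq commands above imply. The query it builds is not the study's actual query: it counts only named nodes and omits areas, and the function names are illustrative.

```python
from decimal import Decimal, ROUND_DOWN

SIX_PLACES = Decimal("0.000001")


def truncate(value: float) -> str:
    """Truncate (not round) a coordinate to six decimal places."""
    return str(Decimal(str(value)).quantize(SIX_PLACES, rounding=ROUND_DOWN))


def poly_string(feature_collection: dict, npa: str = "408") -> str:
    """Flatten the outer ring of the matching feature into 'lat lon lat lon ...'."""
    for feature in feature_collection["features"]:
        if feature["properties"].get("NPA") == npa:
            ring = feature["geometry"]["coordinates"][0]
            # GeoJSON positions are [lon, lat]; the Overpass poly filter expects lat lon.
            return " ".join(f"{truncate(pos[1])} {truncate(pos[0])}" for pos in ring)
    raise ValueError(f"no feature with NPA {npa}")


def named_node_query(poly: str) -> str:
    """Build an Overpass QL query counting named nodes inside the polygon.

    Assumption: a simplified stand-in for the study's queries; it excludes
    highway=* and waterway=* but does not cover areas.
    """
    return (
        "[out:json][timeout:180];\n"
        f'node["name"][!"highway"][!"waterway"](poly:"{poly}");\n'
        "out count;"
    )


# Tiny made-up triangle standing in for the hand-simplified boundary.
sample = {
    "features": [{
        "properties": {"NPA": "408"},
        "geometry": {"type": "Polygon", "coordinates": [[
            [-121.9876543, 37.1234567], [-121.5, 37.5], [-121.9876543, 37.1234567],
        ]]},
    }]
}
print(named_node_query(poly_string(sample)))
```

With a real boundary, load the file with `json.load` and pass the result to `poly_string` in place of `sample`.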

Detailed statistics

Detailed statistics can be found in this spreadsheet.

To do

  • Assess more categories.
  • Query overpass turbo for historic data.
