BigQuery dataset

From OpenStreetMap Wiki
Jump to navigation Jump to search

Google created public dataset with OpenStreetMap data snapshot accessible from BigQuery. This dataset could be used to replace OverpassAPI to a certain extent. Google provides this data for free with limitation 1TB/mo of free tier processing.

Data should updat weekly but there's an issue with this. The GitHub project contains the Apache Beam-based ETL software used to load the database. Quality control concerns and bugs may be reported as GitHub issues.

The dataset contains:

  • OSM primary datatypes (nodes, ways, relations)
  • GDAL derived datatypes (lines, multilinestrings, multipolygons, other relations, points) using ogr2ogr
  • Geofabrik semantic "Layered GIS" derived datatypes using SQL
  • Both the OSM full history and weekly planet updates

To query OpenStreetMap Public Dataset one need to have Google Account and Google Console project.

How to setup Google Console project

Assume you have Google Account. Open console.cloud.google.com and login with Google username and password.

Press "Select a project" button in header:

In opened dialog click "New Project" button:

Google Cloud Project initialization takes some time. Wait for success notification. After that you can use this project for free.

How to query dataset

Open BigQuery UI, enter query in text area and press "Run" button.

Tables in Dataset

BigQuery tables are created using an ETL as illustrated in the project's GitHub repo.

OSM Planet file processing.png

Further there are two sets of tables.

  • One set for all OSM historical data (history file, with tables prefixed by `history_`.
  • One set for all OSM current data (planet file, with tables prefixed by `planet_`.

They can be grouped by ETL phase.

Phase 1 - OSM objects

Load the OSM native objects.

In addition to (nodes, ways, relations), changesets are also loaded.

In combination with the `history_layers` table, changesets can be used to analyze what *types of objects* were edited within a given time window.

Planet objects:

  • `geo_openstreetmap.planet_nodes` - TODO
  • `geo_openstreetmap.planet_ways` - TODO
  • `geo_openstreetmap.planet_relations` - TODO
  • `geo_openstreetmap.planet_chagesets` - TODO

History objects:

  • `geo_openstreetmap.history_nodes` - TODO
  • `geo_openstreetmap.history_ways` - TODO
  • `geo_openstreetmap.history_relations` - TODO
  • `geo_openstreetmap.history_changesets` - TODO

Phase 2 - GDAL objects

Convert OSM objects (nodes, ways, relations) to GDAL objects (points, lines, multilinestrings, polygons, multipolygons). Each object contains the original OSM id or way_id for reference.

Planet objects:

  • `geo_openstreetmap.planet_features` - TODO

History objects:

  • `geo_openstreetmap.history_features` - TODO (apparently missing, see relevant GitHub issue).

Phase 3 - Layered GIS

Use SQL-defined Layered GIS filters to add semantic information to GDAL objects. OSM id / way_id references are also present for linking back to original data.

Planet objects:

  • `geo_openstreetmap.planet_layers` - TODO

History objects:

  • `geo_openstreetmap.history_layers` - TODO

Pros and Cons of using BigQuery OSM dataset

Pros

  1. Could be faster than Overpass API
  2. More complex queries using powerful SQL
  3. Running queries from command line using bq cli
  4. Export to CSV, JSON, BigQuery table
  5. Could be used for analytics and basic statistics
  6. In addition to primary datatypes (nodes, ways, relations), data are also pre-processed to GDAL types using ogr2ogr.
  7. GDAL objects are typecast from OSM tags using Layered GIS spec for more compact and powerful queries.

Cons

  1. SQL is not designed to query OSM data and sometimes query looks more complicated than with Overpass QL
  2. No OSM map visualization of result set
  3. Limited support of GeoJSON

Query samples

Query 1: nodes with tag man_made=lighthouse

SELECT ST_ASGEOJSON(geometry) as feature_geojson
  FROM `bigquery-public-data.geo_openstreetmap.planet_nodes`
 WHERE ('man_made', 'lighthouse') in (select (key,value) from unnest(all_tags))
 LIMIT 10

Alternate query:

SELECT ST_ASGEOJSON(geometry) as feature_geojson
  FROM `bigquery-public-data.geo_openstreetmap.planet_nodes` planet_nodes, planet_nodes.all_tags all_tags
 WHERE all_tags.key = 'man_made' AND all_tags.value = 'lighthouse'
 LIMIT 10

Query 2: hospitals with no phone tag

Query map features with tag 'amenity=hospital' and no key 'phone'.

SELECT feature_type, osm_id, osm_timestamp, geometry
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
 WHERE ('amenity', 'hospital') IN (SELECT (key, value) FROM UNNEST(all_tags))
   AND 'phone' NOT IN (SELECT key FROM UNNEST(all_tags))
 LIMIT 10;

Query 3: multipolygons with more tags

Query multipolygons with tags boundary=administrative and admin_level=4.

SELECT feature_type, osm_id, osm_timestamp, geometry
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
 WHERE feature_type="multipolygons"
   AND ('boundary', 'administrative') IN (SELECT (key, value) FROM UNNEST(all_tags))
   AND ('admin_level', '4') IN (SELECT (key, value) FROM UNNEST(all_tags))
 LIMIT 10;

Query 4: ways with tag 'highway' and arbitary value

Query ways with key 'highway'.

SELECT id, version, username, changeset, osm_timestamp
  FROM `bigquery-public-data.geo_openstreetmap.planet_ways`
 WHERE 'highway' IN (SELECT key FROM UNNEST(all_tags))
 LIMIT 10;

Other version of the same query:

SELECT id, version, username, changeset, osm_timestamp
  FROM `bigquery-public-data.geo_openstreetmap.planet_ways` planet_ways, planet_ways.all_tags as all_tags
 WHERE all_tags.key = 'highway'
 LIMIT 10;

Query 5: buildings with more than 5 levels

Query ways with key 'building' and 'building:levels' > 5.

SELECT id, version, username, changeset, osm_timestamp
  FROM `bigquery-public-data.geo_openstreetmap.planet_ways`
 WHERE 'building' IN (SELECT key FROM UNNEST(all_tags))
   AND EXISTS (SELECT key FROM UNNEST(all_tags)
                WHERE key = 'building:levels'
                  AND SAFE_CAST(value as INT64) > 5)
 LIMIT 10;

Query 6: query over regexp

Select all shops opened 7 days a week using regexp over "opening_hours" field (Note: don't use this query on real project, "opening_hours" tag should be parsed other way).

SELECT feature_type, osm_id, osm_timestamp, geometry
  FROM `bigquery-public-data.geo_openstreetmap.planet_features`
 WHERE 'shop' IN (SELECT key FROM UNNEST(all_tags))
   AND EXISTS (SELECT key FROM UNNEST(all_tags)
                WHERE key = 'opening_hours'
                  AND REGEXP_CONTAINS(value, r'Mo-Su .+'))
 LIMIT 10;


Query 7: query all bridges in Netherland

Select all features with tag "man_made=bridge" and filter by Netherlands boundary poly.

This sample uses WITH expression to find bounding polygon first.

-- Define bounding_area by selecting administrative boundary multipolygon. One can use hardcoded OSM relation ID here.
WITH bounding_area as (SELECT geometry from `bigquery-public-data.geo_openstreetmap.planet_features`
        WHERE feature_type="multipolygons"
          AND ('name:en', 'Netherlands') in (SELECT (key, value) from unnest(all_tags))
          AND ('boundary', 'administrative') in (SELECT (key, value) from unnest(all_tags))
          AND ('admin_level', '3') in (SELECT (key, value) from unnest(all_tags))
     )
SELECT feature_type, osm_id, osm_timestamp, planet_features.geometry
  FROM `bigquery-public-data.geo_openstreetmap.planet_features` planet_features, bounding_area
 WHERE ('man_made', 'bridge') IN (SELECT (key, value) FROM UNNEST(all_tags)) -- Select features with 'man_made=bridge' tag
   AND ST_DWithin(bounding_area.geometry, planet_features.geometry, 0)  -- Filter only features within bounding_area
 LIMIT 10;

Query 8: count number of bus stops in San Francisco grouped by network operator

Select all points with tag "highway=bus_stop". Group them by "network" tag and count items in each group.

-- Define bounding area of San Francisco
WITH bounding_area AS (SELECT geometry FROM `bigquery-public-data.geo_openstreetmap.planet_features`
        WHERE feature_type="multipolygons"
          AND ('wikidata', 'Q62') IN (SELECT (key, value) FROM unnest(all_tags))
     )
SELECT count(*) AS stops_count,
       (SELECT value FROM unnest(all_tags) WHERE key='network') AS bus_network  -- Extract value of "network" tag
  FROM `bigquery-public-data.geo_openstreetmap.planet_features` planet_features, bounding_area
 WHERE feature_type = 'points'
   AND ('highway', 'bus_stop') IN (SELECT (key, value) FROM UNNEST(all_tags)) -- Select bus stops
   AND ST_DWithin(bounding_area.geometry, planet_features.geometry, 0)  -- Filter only features within bounding_area
 GROUP BY bus_network
 ORDER BY stops_count DESC;

Resultset:

stops_count bus_network
5900 Muni
120 Muni;GGT
84 PresidiGo
64 AC Transit
58 GGT
48 Muni;SamTrans
38 SFRP
... ...

Query 9: calculate length of ways with 'highway' tag in Japan

Calculate length and count of all roads in Japan grouped by type.

-- Define bounding area of Japan
WITH bounding_area AS (SELECT geometry FROM `bigquery-public-data.geo_openstreetmap.planet_features`
        WHERE feature_type="multipolygons"
          AND ('wikidata', 'Q17') IN (SELECT (key, value) FROM unnest(all_tags))
     )
SELECT SUM(ST_LENGTH(planet_features.geometry)) AS highway_length,
       format("%'d", CAST(SUM(ST_LENGTH(planet_features.geometry)) AS INT64)) AS highway_length_formatted,
       count(*) as highway_count,
       (SELECT value FROM unnest(all_tags) WHERE key='highway') AS highway_type  -- Extract value of "highway" tag
  FROM `bigquery-public-data.geo_openstreetmap.planet_features` planet_features, bounding_area
 WHERE feature_type = 'lines'
   AND 'highway' IN (SELECT key FROM UNNEST(all_tags)) -- Select highways
   AND ST_DWithin(bounding_area.geometry, planet_features.geometry, 0)  -- Filter only features within bounding_area
 GROUP BY highway_type
 ORDER BY highway_length DESC;

Resultset:

highway_length highway_length_formatted highway_count highway_type
515544744.3 515,544,744 2651488 residential
462757205.9 462,757,206 2460587 unclassified
208945963.9 208,945,964 637984 path
132285333 132,285,333 441447 tertiary
86336294.03 86,336,294 312449 track
73082827.68 73,082,828 158325 secondary
60538681.24 60,538,681 117100 primary
59189791.17 59,189,791 488955 service
58793335.55 58,793,336 115185 trunk
35814200.76 35,814,201 368688 footway
... ... ... ...

Query 10: find highways withing bounding box

Find all ways in bounding box defined by four points:

-- Define bounding_area from polygon
WITH bounding_area AS (
  SELECT ST_MAKEPOLYGON(ST_MAKELINE(
    [ST_GEOGPOINT(47.3812, -21.9073), ST_GEOGPOINT(47.3812, -21.4249),
     ST_GEOGPOINT(48.0734, -21.4249), ST_GEOGPOINT(48.0734, -21.9073)
    ]
  )) AS geometry )
SELECT feature_type, osm_id, osm_timestamp, planet_features.geometry
  FROM `bigquery-public-data.geo_openstreetmap.planet_features` planet_features, bounding_area
 WHERE 'highway' IN (SELECT key FROM UNNEST(all_tags)) -- Select features with 'highway' key
   AND ST_DWithin(bounding_area.geometry, planet_features.geometry, 0)  -- Filter only features within bounding_area
 LIMIT 100;

How to visualize with BigQuery

BigQuery-Wizard-visualization-schema.jpg

If you need to visualize huge amount of data with millions of geo objects you can do it with BigQuery:

Step 1: Prepare data

Select features (farmlands). Instead of each polygon use centroid because it's easier to work with points. Snap each point to grid. Count number of points at each grid node.


Step 2: Add to map

Now when we have points and number of features near that points we can add the data to map BigQuery-resultset-geovis.jpg

Step 3: Adjust styles

Depending on features count we should adjust color and size of grid nodes:

Screenshots
BigQuery-vis-style-fill-color.jpg BigQuery-vis-style-circle-radius.jpg

Final result

Farmlands in Madagascar