Import/South Australian Waterbodies

From OpenStreetMap Wiki

About

The South Australian government has provided a dataset "Waterbodies in South Australia" in shapefile format, along with explicit permission to use it in OpenStreetMap.

This page describes a plan to import some of this data into OpenStreetMap.

Distribution of data to be imported

Goals

The goal of this import is to take advantage of a valuable data source, while also ensuring that:

  • Imported features don't interfere or overlap with existing features of the same type on the map
  • Parts of the data that are unimportant, uninteresting or otherwise unsuitable are omitted
  • All data is prepared and tagged appropriately

Import Plan Outline

  1. Download data
    1. Download the Waterbodies shapefile from data.sa.gov.au
    2. Download auxiliary dataset to aid processing: all existing waterbody areas in SA in OpenStreetMap
    3. Download auxiliary dataset to aid processing: the "Gazetteer" dataset from data.sa.gov.au
  2. Process Data
    1. Load the three above-mentioned datasets as separate layers in a QGIS project. Convert all data to the WGS84 coordinate system.
    2. Using a Python script within QGIS, choose a subset of the data to keep, and perform some minor processing steps. Save this data as a shapefile.
    3. Use JOSM to convert the shapefile to OSM format.
    4. Use a Matlab script to perform some further minor processing on the OSM data.
    5. Use JOSM's data validation tool to perform some final corrections and preparation of the data
  3. Upload Data
    1. Use bulk_upload.py to upload the data


Schedule

The plan was announced to the talk-au list on 21 January 2015. https://lists.openstreetmap.org/pipermail/talk-au/2015-January/010493.html

It was announced to the imports list on 27 January 2015. https://lists.openstreetmap.org/pipermail/imports/2015-January/003628.html

The downloading and processing was done in January and February 2015.

The uploading was performed on 22 February 2015.

Import Data

Background

Data source site: http://data.sa.gov.au/dataset/waterbodies-in-south-australia

Data license: http://creativecommons.org/licenses/by/4.0/

Link to permission: Attribution/sa.data.gov.au_explicit_permission

ODbL Compliance verified: Explicit permission

OSM attribution: Contributors#South_Australian_Government_data


OSM Data Files

The proposed .osm file to be uploaded is processed_processed_fixed_v7.osm contained in this Google Drive shared folder: https://drive.google.com/folderview?id=0B1JwNHL1bER0bF9mcV9hZkJINWs&usp=sharing

The file is approximately 44MB in size, and contains approximately 460 relations, 8900 ways, and 380000 nodes.


Import Type

One-time import of a subset of the dataset.

It is possible that there will be future imports of further subsets of the data; however, they have not been planned yet. If they occur, each of these future imports will be of a similar size to the present import and will follow everything written on this page, except the description of the particular choice of "important features" that form the present subset.


Data Preparation

Description of the source data

The dataset "Waterbodies in South Australia" is dated 2 July 2014, and consists of a shapefile Waterbodies.shp and other associated files, approximately 200MB total size.

The shapefile contains 151432 separate features (individually tagged areas, consisting of a polygon or multipolygon.)

The features are tagged with a variety of keys. Here is the complete list of keys used: AHGFFEATUR, AHGFPERENN, ALBERSAREA, ATTRIBUTER, ATTRIBUTES, CAPTUREMET, CAPTURESOU, EDITDATE, FEATURECOD, FEATUREREL, FEATURESOU, FEATURES_1, HYDROID, MAPPEDNODE, MAXSCALE, MINSCALE, NETWORKNOD, OBJECTID, PERENNIALI, PLANIMETRI, SHAPE_AREA, SHAPE_LEN, SOURCEFEAT, SOURCEFE_1, SOURCEID, SOURCEIMAG, SOURCETYPE, SYMBOL, VOLUME, GAZRECNO, NAME, WATERSTORE, AUS_WETNR, WET_CODE, TEXTNOTE.

The value of "FEATURECOD" can be used to determine what type of waterbody each feature corresponds to. The table below shows the different types of waterbodies in the dataset, a count of each, and the corresponding value of FEATURECOD.

Type of waterbody          | Number of features | FEATURECOD
Reservoir                  | 151                | 3236
Dam[1]                     | 89691              | 4812
Lake (perennial)           | 3855               | 4401
Lake (intermittent)        | 3302               | 4402
Lake (mostly dry)          | 20868              | 4403
Land subject to inundation | 28665              | 4407
Flat                       | 4897               | 4427


The meaning of feature codes was found in: http://www.environment.sa.gov.au/files/641f0c9b-e759-45be-97c7-9e3001186a6e/lower_southeast.pdf


Data Reduction & Simplification

Of the 151432 features in the dataset, only approximately 9000 will be included in this import. This subset is the result of the following selection criteria:

  1. Take all features in the dataset that have a name.
  2. Add to this, all features that correspond to a permanent lake or a reservoir.
  3. Add to this, any feature that is connected to any feature already included.
  4. From the list constructed so far, remove any feature that overlaps any waterbody that is already in OSM
  5. Also, remove any feature that is clipped by the edges of the entire region

The first two criteria above are aimed at taking only the most important features from the dataset. The third criterion aims to make future imports of more of the dataset simpler: any feature that is not included in this import, and that does not overlap an existing feature in OSM, is guaranteed not to share any nodes with features in the current import.
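Criteria 1 to 3 can be sketched as a seed set followed by an expansion over connected features. This is only an illustration, not the code in process_data.py. The input structures (`features`, `neighbours`) are hypothetical, and the sketch assumes the expansion is repeated until no more features are added, which the guarantee about shared nodes seems to require.

```python
from collections import deque

# Feature codes treated as important regardless of name:
# 3236 = reservoir, 4401 = perennial lake.
IMPORTANT_CODES = {3236, 4401}


def select_features(features, neighbours):
    """Return the ids of features kept by criteria 1 to 3.

    features:   dict mapping id -> {"name": str, "code": int}  (hypothetical layout)
    neighbours: dict mapping id -> set of ids of connected features
    """
    # Criteria 1 and 2: named features, permanent lakes and reservoirs.
    selected = {
        fid
        for fid, feat in features.items()
        if feat["name"] or feat["code"] in IMPORTANT_CODES
    }
    # Criterion 3: breadth-first expansion over connected features.
    queue = deque(selected)
    while queue:
        fid = queue.popleft()
        for other in neighbours.get(fid, ()):
            if other not in selected:
                selected.add(other)
                queue.append(other)
    return selected
```

Criteria 4 and 5 (overlap with existing OSM waterbodies, and clipping at the region edge) need geometry tests and are not covered here.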

We run the "SimplifyArea" JOSM plugin on all ways, using the default settings. This reduces the total node count by approximately 23%.


Tagging Plans

Based on the FEATURECOD of each original feature, the following tags will be added:


FEATURECOD                          | Tags added
3236 (Reservoir)                    | natural=water, water=reservoir
4401 (Lake - Perennial)             | natural=water, water=lake
4402 (Lake - Intermittent)          | natural=water, water=lake, intermittent=yes
4403 (Lake - Mainly Dry)            | natural=water, water=lake, intermittent=yes
4407 (Land - Subject to Inundation) | natural=wetland
4812 (Dam)[1]                       | natural=water, water=reservoir


In addition, if the original feature has a tag TEXTNOTE=salt, then the tag "salt=yes" will be added to the uploaded feature.

Any relation or way that has tags will be given the additional tag "source=data.sa.gov.au"
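The tagging rules above amount to a lookup table plus two extra rules. A minimal Python sketch follows. The function name `osm_tags` is hypothetical, and the exact string comparison on TEXTNOTE is an assumption; the actual logic lives in the processing scripts.

```python
# Mapping from FEATURECOD to OSM tags, copied from the table above.
FEATURECOD_TAGS = {
    3236: {"natural": "water", "water": "reservoir"},
    4401: {"natural": "water", "water": "lake"},
    4402: {"natural": "water", "water": "lake", "intermittent": "yes"},
    4403: {"natural": "water", "water": "lake", "intermittent": "yes"},
    4407: {"natural": "wetland"},
    4812: {"natural": "water", "water": "reservoir"},
}


def osm_tags(featurecod, textnote=""):
    """Return the OSM tags for one source feature, or None if its code is not imported."""
    base = FEATURECOD_TAGS.get(int(featurecod))
    if base is None:
        return None
    tags = dict(base)  # copy so the lookup table is never modified
    if textnote == "salt":  # assumption: exact match on the source value
        tags["salt"] = "yes"
    tags["source"] = "data.sa.gov.au"
    return tags
```

For example, `osm_tags(4427)` returns None, because "Flat" features are not part of this import.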



Changeset Tags

Each changeset will have the following tags:

source=data.sa.gov.au
created_by=bulk_upload.py/22614 Python/2.7.8
comment=Import part of data.sa.gov.au waterbodies dataset
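For reference, these tags correspond to a changeset-creation payload in the OSM API 0.6 XML format. The sketch below builds that payload with Python's standard library. It is an illustration only and does not describe how bulk_upload.py assembles its requests; the function name `changeset_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Changeset tags as listed above.
CHANGESET_TAGS = {
    "source": "data.sa.gov.au",
    "created_by": "bulk_upload.py/22614 Python/2.7.8",
    "comment": "Import part of data.sa.gov.au waterbodies dataset",
}


def changeset_xml(tags):
    """Build an <osm><changeset>...</changeset></osm> document with one <tag> per entry."""
    osm = ET.Element("osm")
    changeset = ET.SubElement(osm, "changeset")
    for key, value in tags.items():
        ET.SubElement(changeset, "tag", k=key, v=value)
    return ET.tostring(osm, encoding="unicode")


payload = changeset_xml(CHANGESET_TAGS)
```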

Data Transformation

Summary of tools used: QGIS Desktop v2.6, JOSM, Matlab, and bulk_upload.py.

Part 1: loading the source data and auxiliary data into a QGIS project

The source file Waterbodies.shp is converted to the WGS84 coordinate system using QGIS, and added to a new layer of a new QGIS project. The auxiliary file Gazetteer.shp is likewise converted to WGS84 and added as a new layer to the same project.

The set of existing waterbodies in OSM in the region of interest is downloaded using JOSM, saved as .osm file, and added as a separate layer to the QGIS project. The following Overpass API script was used to do the download in JOSM, using a bounding box of "min lat: -39.5, max lat: -25.0, min lon: 127.5, max lon: 142.5" :

(way[natural=water];>;
 relation[natural=water];>;
 way[natural=wetland];>;
 relation[natural=wetland];>;
 way[waterway=riverbank];>;
 relation[waterway=riverbank];>;
 way[landuse=reservoir];>;
 relation[landuse=reservoir];>;
 way[waterway=dam];>;
 relation[waterway=dam];>;
);
out meta; 

(Before saving this data in JOSM, all "waterway=riverbank" areas are given the extra tag "natural=water", which is a hack that ensures QGIS treats riverbanks as areas.)
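The Overpass script above repeats the same pattern for each tag filter, and relies on JOSM to supply the bounding box. For use outside JOSM, the query could be generated with the bounding box included as a global `[bbox:south,west,north,east]` setting. The following Python sketch does this; the function name `build_query` is hypothetical and this was not part of the original workflow.

```python
# Tag filters taken from the Overpass script above.
TAG_FILTERS = [
    "natural=water",
    "natural=wetland",
    "waterway=riverbank",
    "landuse=reservoir",
    "waterway=dam",
]


def build_query(south, west, north, east, tag_filters=TAG_FILTERS):
    """Return an Overpass QL query for ways and relations matching the tag filters."""
    # Overpass expects the bounding box as south, west, north, east.
    lines = ["[bbox:{},{},{},{}];".format(south, west, north, east), "("]
    for tag in tag_filters:
        # ">" recurses down so member ways and nodes are included.
        lines.append(" way[{}];>;".format(tag))
        lines.append(" relation[{}];>;".format(tag))
    lines.append(");")
    lines.append("out meta;")
    return "\n".join(lines)


# Bounding box quoted in the text above.
query = build_query(-39.5, 127.5, -25.0, 142.5)
```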

An empty shapefile layer, "processed.shp", is also added to the project.

Part 2: Using a python script within QGIS to select and process the source data

Using the Python console built into QGIS, run the script "process_data.py". The steps performed by this script are summarised as follows:

  1. For each feature in the source dataset that has a non-empty "NAME" tag, attempt to detect whether the value is genuinely the name of the waterbody or a spurious value. (In the source dataset, all dams on a farm are sometimes given a "NAME" tag that matches the name of the nearby homestead. We want to remove these names.) The script compares the name of each waterbody with the names of nearby homesteads listed in the Gazetteer file, counts the number of nearby waterbodies with the same name, and looks for certain keywords in the name. If these checks indicate that the name is spurious, the NAME tag is replaced with the empty string.
  2. Delete any source feature that does not have one of the following values of FEATURECOD: 3236, 4401, 4402, 4403, 4407, 4812
  3. Find the subset of all source features that are "important". For the purposes of this import, include: features that have a non-empty name; any permanent lake; and any reservoir (feature code 3236).
  4. Find any source feature connected to an "important" feature, and add that to the list of important features.
  5. For each feature in the important list, detect whether any waterbody in the existing OSM database overlaps with it. If so, delete from the list.
  6. For each item on the list that is not a wetland, check whether it overlaps a wetland in the list, and if so replace the shape of the wetland with a version that has the first item's shape subtracted. (In the original dataset, lakes, dams, and reservoirs are allowed to overlap with wetlands. In OSM, they cannot.)
  7. For each remaining feature, add the required OSM tags, based on the FEATURECOD.
  8. Remove any item that touches the bounding box of the source data, since it is likely clipped.
  9. For each surviving feature, add it to the layer "processed.shp"
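Step 8 above can be sketched without any GIS library by testing whether any vertex of a feature lies on an edge of the source data's bounding box. This is a simplified illustration, not the code in process_data.py. The data layout (a dict of id to a list of (x, y) vertices), the function names and the tolerance are all assumptions.

```python
def touches_edge(ring, bbox, tol=1e-9):
    """Return True if any vertex of the ring lies on an edge of bbox."""
    min_x, min_y, max_x, max_y = bbox
    return any(
        abs(x - min_x) <= tol
        or abs(x - max_x) <= tol
        or abs(y - min_y) <= tol
        or abs(y - max_y) <= tol
        for x, y in ring
    )


def drop_clipped(features, bbox, tol=1e-9):
    """Keep only features with no vertex on the bounding box of the source data.

    features: dict mapping id -> list of (x, y) vertices  (hypothetical layout)
    bbox:     (min_x, min_y, max_x, max_y) of the whole source dataset
    """
    return {
        fid: ring
        for fid, ring in features.items()
        if not touches_edge(ring, bbox, tol)
    }
```

A feature that touches the edge is assumed to have been cut off there, so it is dropped rather than imported with an artificial straight boundary.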

Part 3: Conversion to .osm, and further processing using a Matlab script

Use JOSM to convert processed.shp to processed.osm.

Use the Matlab scripts readnodes.m and process_processed.m to perform further processing. The scripts do the following:

  1. Ways that have too many nodes (more than 2000, the API limit) are split into smaller ways.
  2. Tags are processed further. "datasa:" is prepended to some keys kept from the source. Keys that were clipped to the 10-character limit in a shapefile are fixed.
  3. The file processed_processed.osm is written.
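The way-splitting in step 1 was done in Matlab. The idea can be sketched in Python as follows. This is an illustration under assumptions, not a port of process_processed.m: the function name `split_way` is hypothetical, and the bookkeeping needed to put the pieces into a multipolygon relation is omitted.

```python
MAX_NODES = 2000  # maximum number of nodes per way, as stated in step 1


def split_way(node_ids, max_nodes=MAX_NODES):
    """Split a list of node ids into pieces of at most max_nodes nodes.

    Consecutive pieces share one node, so the pieces still join up end to end.
    """
    if len(node_ids) <= max_nodes:
        return [list(node_ids)]
    pieces = []
    start = 0
    while start < len(node_ids) - 1:
        end = min(start + max_nodes, len(node_ids))
        pieces.append(list(node_ids[start:end]))
        start = end - 1  # the next piece starts on the last node of this one
    return pieces
```

For a way with 4500 nodes this gives three pieces, and each piece begins on the node where the previous one ended.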

Part 4: Validation using JOSM

The file processed_processed.osm is loaded into JOSM, and its contents are copied to a new data layer. The data validation tool is run. A summary of the errors and warnings generated, and the steps taken to fix them, is as follows:

  • "Natural duplicated nodes (24)". Fixed automatically.
  • "Style for inner way equals multipolygon (36)". Fixed manually, by deleting inner way in most cases. Six of the warnings were ignored, as the inner way had a different name.
  • "Relations with the same member (1)". Deleted relations.
  • "Self-intersecting ways (6)". Manual tweaks to fix these glitches.
  • "Overlapping water areas (86)". Manual edits where overlap is very small, else delete one of the waterbodies involved.

The SimplifyArea plugin is run on all ways, using the default settings. This reduces the total node count by approximately 21%.

The resulting data is saved as processed_processed_fixed_v7.osm.


Data Transformation Results

The proposed .osm file to be uploaded is processed_processed_fixed_v7.osm, contained in this Google Drive shared folder: https://drive.google.com/folderview?id=0B1JwNHL1bER0bF9mcV9hZkJINWs&usp=sharing

The approximate number of waterbodies of each type selected from the original dataset are as follows:

datasa waterbody type               | Number of features (that have a name) | Number of features (with no name)
3236 (Reservoir)                    | 57                                    | 43
4401 (Lake - Perennial)             | 514                                   | 2649
4402 (Lake - Intermittent)          | 62                                    | 372
4403 (Lake - Mainly Dry)            | 412                                   | 27
4407 (Land - Subject to Inundation) | 1267                                  | 705
4812 (Dam)[1]                       | 1168                                  | 1733


Data Merge Workflow

The data to be uploaded is disjoint from all existing OSM waterbodies (lakes, ponds, wetlands, riverbanks).

Team Approach

This is solo work by OSM user Henry h (wiki User:Henryh).

The specially-created OSM account datasa_import will be used.

References

Data source site: http://data.sa.gov.au/dataset/waterbodies-in-south-australia

Data license: http://creativecommons.org/licenses/by/4.0/

Link to permission: Attribution/sa.data.gov.au_explicit_permission

ODbL Compliance verified: Explicit permission

OSM attribution: Contributors#South_Australian_Government_data

Interpretation of feature codes: http://www.environment.sa.gov.au/files/641f0c9b-e759-45be-97c7-9e3001186a6e/lower_southeast.pdf

Shared Google Drive containing proposed .osm upload file, intermediate files, and scripts: https://drive.google.com/folderview?id=0B1JwNHL1bER0bF9mcV9hZkJINWs&usp=sharing

Thread on talk-au mailing list: https://lists.openstreetmap.org/pipermail/talk-au/2015-January/010493.html

Thread on imports mailing list: https://lists.openstreetmap.org/pipermail/imports/2015-January/003628.html


Workflow

The script bulk_upload.py will be used to upload the data, as follows:

python bulk_upload.py -i processed_processed_fixed_v7.osm -u datasa_import -p ****** -c "Import part of data.sa.gov.au waterbodies dataset."

The script bulk_upload.py is designed to be tolerant of interruptions. If an error occurs during upload, the script will be re-run. The script is also designed to automatically break the data into separate changesets.

The script doesn't give the option to choose tags for the changesets. I will modify the source code of bulk_upload.py to force it to add the tag "source=data.sa.gov.au" to the changesets.

In the event that the upload needs to be reverted, the JOSM reverter plugin will be used.

An upload to the sandbox server will be tried first.


Conflation

A simple conservative approach to conflation is used: no data that overlaps with an existing OSM waterbody is included in the upload.


QA

Using JOSM, hundreds of features from random locations in the region were examined closely. The features were compared to Bing imagery. I paid particularly close attention to features in areas that I am familiar with.

Based on these observations I concluded that the dataset is, for the most part, of fairly high quality.

Data Updates

The source data doesn't appear to get updated often. (The latest version is dated July 2014).

If updated versions of the source data are made available in the future, I will perform the processing steps again, to import new features. Features that change shape, change tags, or are deleted won't be handled automatically.

[1] Note: a "dam" in Australian English is a small reservoir, usually on a farm.