Corine Land Cover Romania 2006

From OpenStreetMap Wiki

Introduction

This page aims to describe the phases of the CLC land cover import for Romania.

Important: this page is a draft; at this stage no data processing is performed on the real OpenStreetMap servers or data.

The target

The CLC land cover import for Romania aims to use the publicly available land cover data from the European Environment Agency (EEA). Additional data is available for major cities under the designation 'Urban Atlas'; although that data has almost the same format, this document mainly addresses the CLC import.

The accepted solution was approved unanimously, even though only a very small number of voters expressed an opinion (just like in modern politics). It ensures that existing land cover data is not deleted: it is retagged for later analysis and possible revert, while the new data is imported. See Romania CLC Import for more details.

The approach

The following stages have been established so far:

Stage | Description | Status | Notes | Who's involved (feel free to volunteer for any stage)
1 | Request for permission | Completed | See ro-talk list for details | stefanu
2 | Data preparation | Completed | Data is provided by the EEA as archived shapefiles; a one-pass coordinate conversion is required for easy processing by a C++ application that uses the shapelib library; the conversion is performed with the 'ogr2ogr' utility from the GDAL library. | stefanu
3 | OSM data corrections | Completed | Analysis of the existing data on the OSM server to look for inconsistencies (such as wrong landuse and natural tags). Corrections to objects that seem to be obvious errors. For now, this includes wrong tagging and layer corrections (overlapping polygons placed on different layers because of rendering issues are converted into relations). | stefanu
4 | Data conversion | Completed (see the details on the bot algorithm below) | Data has been converted to OSM XML format by a dedicated utility written in C++ and using the shapelib library; the conversion currently creates one OSM file per CLC land cover code, ultimately resulting in a large OSM file (approx. 1.5 GB). See the data testing section below for details. | stefanu
5 | Retagging bot development | Completed | Retagging bot development uses Java code from the tile splitter application by Steve Ratcliffe & Chris Miller. It uses simplified XML parsing and objects along with a simple OSM API client. This may well serve as a basis for developing a full API client, since I found none while working on this. | stefanu
6 | Retagging bot testing | Completed | Using the server at http://api06.dev.openstreetmap.org/ for this purpose | stefanu
7 | Import bot development | Completed | Simple OSM elements are imported and manipulated using a simple OSM API Java component written for this purpose. | stefanu
8 | Import bot testing | Completed | Using the server at http://api06.dev.openstreetmap.org/ for this purpose | stefanu
9 | Prepare existing OSM data for retagging | Completed | Some work has been done for this stage along with the development of a filtering application needed for the Romanian Garmin map. See the data testing section below for details. | stefanu
10 | Run the retagging bot | As needed, on a per-landuse type basis | | stefanu
11 | Run the import bot | As needed, on a per-landuse type basis | CLC code being imported: 311 (Broad-leaved forest). CLC codes imported so far: 112 (Discontinuous urban fabric), 124 (Airports), 131 (Mineral extraction sites), 221 (Vineyards), 312 (Coniferous forest), 313 (Mixed forest). | stefanu

Data tagging

CLC code | CLC description 1 | CLC description 2 | CLC description 3 | Tags | Number of polygons | Notes
111 | Artificial surfaces | Urban fabric | Continuous urban fabric | landuse=residential | 0 | No data
112 | Artificial surfaces | Urban fabric | Discontinuous urban fabric | landuse=residential | 10743 | Imported
121 | Artificial surfaces | Industrial, commercial and transport units | Industrial or commercial units | landuse=retail;industrial | 2707 |
122 | Artificial surfaces | Industrial, commercial and transport units | Road and rail networks and associated land | landuse=industrial | 74 |
123 | Artificial surfaces | Industrial, commercial and transport units | Port areas | landuse=harbour | 30 |
124 | Artificial surfaces | Industrial, commercial and transport units | Airports | aeroway=aerodrome | 26 | Imported
131 | Artificial surfaces | Mine, dump and construction sites | Mineral extraction sites | landuse=quarry | 213 | Imported
132 | Artificial surfaces | Mine, dump and construction sites | Dump sites | landuse=landfill | 112 |
133 | Artificial surfaces | Mine, dump and construction sites | Construction sites | landuse=construction | 47 |
141 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Green urban areas | leisure=park | 103 |
142 | Artificial surfaces | Artificial, non-agricultural vegetated areas | Sport and leisure facilities | leisure=park | 125 |
211 | Agricultural areas | Arable land | Non-irrigated arable land | landuse=farm | 10651 |
212 | Agricultural areas | Arable land | Permanently irrigated land | landuse=farm | 0 | No data
213 | Agricultural areas | Arable land | Rice fields | landuse=farm | 16 |
221 | Agricultural areas | Permanent crops | Vineyards | landuse=vineyard | 3233 | Imported
222 | Agricultural areas | Permanent crops | Fruit trees and berry plantations | landuse=orchard | 3265 |
223 | Agricultural areas | Permanent crops | Olive groves | landuse=orchard; trees=olives | 0 | No data
231 | Agricultural areas | Pastures | Pastures | landuse=meadow | 17692 |
241 | Agricultural areas | Heterogeneous agricultural areas | Annual crops associated with permanent crops | landuse=farm | 0 | No data
242 | Agricultural areas | Heterogeneous agricultural areas | Complex cultivation patterns | landuse=farm | 9274 |
243 | Agricultural areas | Heterogeneous agricultural areas | Land principally occupied by agriculture, with significant areas of natural vegetation | landuse=farm | 10718 |
244 | Agricultural areas | Heterogeneous agricultural areas | Agro-forestry areas | landuse=farm | 0 | No data
311 | Forest and semi natural areas | Forests | Broad-leaved forest | landuse=forest; wood=deciduous | 11421 |
312 | Forest and semi natural areas | Forests | Coniferous forest | landuse=forest; wood=coniferous | 2956 | Imported
313 | Forest and semi natural areas | Forests | Mixed forest | landuse=forest; wood=mixed | 2490 | Imported
321 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Natural grasslands | natural=grassland | 1771 |
322 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Moors and heathland | natural=heath | 331 |
323 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Sclerophyllous vegetation | natural=scrub | 0 | No data
324 | Forest and semi natural areas | Scrub and/or herbaceous vegetation associations | Transitional woodland-shrub | natural=wood; wood=mixed | 7509 |
331 | Forest and semi natural areas | Open spaces with little or no vegetation | Beaches, dunes, sands | natural=beach | 143 |
332 | Forest and semi natural areas | Open spaces with little or no vegetation | Bare rocks | natural=rock | 56 |
333 | Forest and semi natural areas | Open spaces with little or no vegetation | Sparsely vegetated areas | natural=scrub | 179 |
334 | Forest and semi natural areas | Open spaces with little or no vegetation | Burnt areas | landuse=rock | 0 | No data
335 | Forest and semi natural areas | Open spaces with little or no vegetation | Glaciers and perpetual snow | natural=glacier | 0 | No data
411 | Wetlands | Inland wetlands | Inland marshes | natural=wetland; wetland=marsh | 1515 |
412 | Wetlands | Inland wetlands | Peat bogs | natural=wetland; wetland=bog | 2 |
421 | Wetlands | Maritime wetlands | Salt marshes | natural=wetland; wetland=saltmarsh | 3 |
422 | Wetlands | Maritime wetlands | Salines | landuse=salt_pond | 0 | No data
423 | Wetlands | Maritime wetlands | Intertidal flats | water=tidal | 0 | No data
511 | Water bodies | Inland waters | Water courses | waterway=riverbank | 582 |
512 | Water bodies | Inland waters | Water bodies | natural=water | 895 |
521 | Water bodies | Marine waters | Coastal lagoons | natural=water | 9 |
522 | Water bodies | Marine waters | Estuaries | natural=coastline | 0 | No data
523 | Water bodies | Marine waters | Sea and ocean | natural=coastline | 1 |
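
As an illustration of how the Tags column can be applied during conversion, here is a minimal Python sketch that turns a cell such as 'landuse=forest; wood=deciduous' into a key/value dictionary. The actual conversion utility is written in C++; the names parse_tags, CLC_TAGS and tags_for are hypothetical, and only a handful of rows from the table are included.

```python
def parse_tags(spec):
    """Turn a 'key=value; key=value' cell from the Tags column into a dict."""
    tags = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip()] = value.strip()
    return tags


# A few rows copied from the table (CLC code -> Tags column).
CLC_TAGS = {
    "112": "landuse=residential",
    "124": "aeroway=aerodrome",
    "221": "landuse=vineyard",
    "311": "landuse=forest; wood=deciduous",
    "411": "natural=wetland; wetland=marsh",
}


def tags_for(clc_code):
    """Return the OSM tags for a CLC code (hypothetical helper)."""
    return parse_tags(CLC_TAGS[clc_code])
```

Row 121 (landuse=retail;industrial) uses the semicolon inside a single value rather than between tags, so it would need special handling that this sketch does not attempt.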

Data testing

Some data testing was performed while using the CLC data to create a Garmin-compatible map. OSM files created at stage 3 were successfully run through the tile splitter and mkgmap, producing an IMG file (Garmin's proprietary format) for compatible devices.

Steps taken in preparation of data

  1. translation of CLC data; a couple of batch files were created to translate all shapefiles to the WGS84 coordinate system; takes a couple of hours to complete.
  2. filtering of CLC data and dump to OSM format; a custom C++ app was written solely for this purpose; takes a couple of hours to complete.
  3. merging of OSM data; the standard DOS command 'copy' does a very good job at this; takes minutes to complete; however, the generated file ends with one extra byte that must be dropped; this is done with a very simple C++ app.
  4. import into a temporary database for duplicate merging; another small custom C++ app was created for this purpose; the very same app also does the reverse operation, at step 6; this also takes a few hours to complete.
  5. duplicate merging; two stored procedures in the PostgreSQL database fulfil this goal.
  6. export from the database into OSM format again; the very same custom app from step 4 does the reverse; this also takes a few hours to complete.
  7. import to the final import database; yet another custom app, this time written in Java (made with bits and pieces from the map tiling app targeted at Garmin devices), does the job; this also takes a few hours to complete.
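
The trailing-byte cleanup in step 3 can be sketched in a few lines of Python. This page does not say which byte is involved; the sketch assumes it is the Ctrl-Z (0x1A) end-of-file marker that 'copy' appends when it concatenates files in ASCII mode, and it leaves the file untouched if the last byte is anything else. The function name strip_trailing_eof is hypothetical; the original tool is a small C++ app.

```python
import os

DOS_EOF = b"\x1a"  # Ctrl-Z; assumed to be the extra byte left by 'copy'


def strip_trailing_eof(path):
    """Remove a single trailing 0x1A byte, if present.

    Returns True if a byte was removed, False otherwise.
    """
    size = os.path.getsize(path)
    if size == 0:
        return False
    with open(path, "r+b") as f:
        f.seek(size - 1)
        if f.read(1) != DOS_EOF:
            return False
        f.truncate(size - 1)
    return True
```

If the assumption holds, running 'copy' in binary mode (the /b switch) should avoid the extra byte in the first place.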

Algorithms used

Tools and applications used (details)

  1. ogr2ogr; this small utility from GDAL is used to translate from the original coordinate system to WGS84.
  2. shp2osm; custom-built C++ helper app based on the shapelib library to filter and dump data from the shapefiles into OSM format.
  3. shp2db; custom-built C++ helper app to import data from the shp2osm output into a very simple, temporary database, so that duplicate nodes can be removed. Also used to reconstruct the OSM file after duplicate nodes have been removed.
  4. one PostgreSQL stored procedure that eliminates node duplicates.
  5. filter; custom-built Java helper app that initially started as a filtering app for the daily planet extract for Romania. This app now performs several functions:
    1. filter: reads a daily country extract and removes all landcover polygons, while resampling object IDs; this function was needed to generate a Garmin map with no landcover data.
    2. border: extracts the Romanian border from the daily country extract; this was needed so that the shp2osm helper app uses a proper polygon to keep land use data within the country borders.
    3. polygon: converts the country border saved as an OSM file into polygon format.
    4. retag: function designed for retagging, but the test function is used instead.
    5. upload: function designed for uploading CLC data, but the test function is used instead.
    6. setup-db: creates the master import database used by the import process, from the OSM file.
    7. test: originally designed as a test function to be used with the dev server at openstreetmap.org, it now holds all the import code and acts as the retagging and uploading procedure.
  6. one PostgreSQL stored procedure that is used to examine import statistics.
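
For the coordinate translation done with ogr2ogr, the batch files presumably issue one command per shapefile. The Python sketch below only builds such a command line; it assumes the source coordinate system is read from the shapefile's .prj file and that WGS84 is requested as EPSG:4326 through the -t_srs option. The file and folder names are hypothetical, and the exact options used in the original batch files are not recorded on this page.

```python
from pathlib import Path


def ogr2ogr_command(src, dst_dir):
    """Build an ogr2ogr call that reprojects one shapefile to WGS84.

    Note that ogr2ogr takes the destination before the source.
    """
    src = Path(src)
    dst = Path(dst_dir) / src.name
    return ["ogr2ogr", "-t_srs", "EPSG:4326", str(dst), str(src)]
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where GDAL is installed.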

Technical info

This section describes in further detail the algorithm used by the bots outlined in the previous sections. It is not intended to be a fully detailed software specification, but rather to explain the concepts used.

  • How to find out what elements need retagging
Based on the code from the tile splitter by Steve Ratcliffe ([1]), a map filtering application has been developed; its main purpose is to filter a daily extract of the planet file for Romania and remove all land cover elements, then resample all element IDs; the same utility can be used to dump the filtered elements so they can be used as input by the retagging bot.
UPDATE 01/11/2010: the filtering app has been modified to do the dumping; several functions have also been added as helpers to reach the main goal.
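
A minimal Python sketch of the selection step, assuming that "land cover elements" means ways carrying a landuse or natural key (the two keys named in the OSM data corrections stage). The real filtering app is written in Java and may use different criteria; the names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: tag keys that mark a way as land cover.
LAND_COVER_KEYS = {"landuse", "natural"}


def land_cover_way_ids(osm_xml):
    """Return the IDs of all ways that carry at least one land cover key."""
    root = ET.fromstring(osm_xml)
    ids = []
    for way in root.iter("way"):
        keys = {tag.get("k") for tag in way.iter("tag")}
        if keys & LAND_COVER_KEYS:
            ids.append(int(way.get("id")))
    return ids
```

For a full country extract, a streaming parser such as xml.etree.ElementTree.iterparse would be a better fit than loading the whole document at once.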
  • How to do the retagging
The best way to do this seems to be direct API access; the bot should take the element IDs from the filtering app, add or modify the tags, then update each element directly on the server; given the restrictions on the size of changesets, the state of the operation needs to be persistent; this will also enable pause/resume. Osmosis also sounds like a good idea, but it requires the setup of a database and additional processing.
UPDATE 01/11/2010: direct API access it is; a very simple OSM API 0.6 Java component has been developed. It is not aimed at being 100% compliant with all the API functions, but rather at doing its job for importing landuse data.
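
To make the retagging idea concrete, here is a Python sketch that builds the XML body for updating one way through API 0.6. The retagging scheme itself is not specified on this page, so the key prefix used here (PREFIX) is purely hypothetical, as are the function names. In API 0.6, an update of this kind is sent with an HTTP PUT to /api/0.6/way/<id> inside an open changeset; the sketch makes no network call.

```python
import xml.etree.ElementTree as ET

LAND_COVER_KEYS = {"landuse", "natural"}  # assumption: keys to retag
PREFIX = "pre_clc:"  # hypothetical prefix, not the scheme actually used


def retag(tags):
    """Rename land cover keys so the old values survive for later analysis."""
    return {(PREFIX + k if k in LAND_COVER_KEYS else k): v for k, v in tags.items()}


def way_update_xml(way_id, version, changeset_id, node_refs, tags):
    """Build the <osm><way>...</way></osm> body for an API 0.6 way update."""
    osm = ET.Element("osm")
    way = ET.SubElement(
        osm, "way",
        id=str(way_id), version=str(version), changeset=str(changeset_id),
    )
    for ref in node_refs:
        ET.SubElement(way, "nd", ref=str(ref))
    for key, value in retag(tags).items():
        ET.SubElement(way, "tag", k=key, v=value)
    return ET.tostring(osm, encoding="unicode")
```

The version attribute must match the current version of the way on the server, otherwise the API rejects the update as a conflict.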
  • How to store the OSM data prior to importing
Given the volume of the data, the restrictions on the changesets and the need for the bot state to be persistent, one idea is to save one OSM file per way/relation and to process them one at a time; this avoids setting up a database and related access code. However, a database (not necessarily a PostGIS database) may solve other issues. See below for details.
UPDATE 01/11/2010 : Data preparation stage is completed. All landuse data has been filtered using a polygon created from Romanian boundaries from the OSM server, duplicate points have been removed and all elements have been imported into a database. This database will serve as a data source for the import bot, storing current import state, allowing it to resume, since it is expected to be a lengthy operation.
  • How to save the state of the import
The simplest way to do this seems to be to have two folders on disk: one for the ways that have already been uploaded to the server and one for those waiting to be uploaded; once a changeset has been successfully closed, the included ways/relations are moved from one folder to the other; the only issue here is that a large number of small files will be stored on disk. With a database, this is even simpler.
UPDATE 01/11/2010 : See the above paragraph for details; basically the import state is saved in a database.
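
A minimal Python sketch of database-backed import state that supports pause/resume. The import itself uses a PostgreSQL database whose schema is not described on this page; SQLite is used here only as a self-contained stand-in, and the table, column and function names are hypothetical.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS import_state ("
        "way_id INTEGER PRIMARY KEY, uploaded INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_pending(conn, way_ids):
    """Register ways that still have to be uploaded."""
    conn.executemany(
        "INSERT OR IGNORE INTO import_state (way_id) VALUES (?)",
        [(w,) for w in way_ids],
    )
    conn.commit()


def next_batch(conn, limit):
    """Return up to 'limit' way IDs that have not been uploaded yet."""
    rows = conn.execute(
        "SELECT way_id FROM import_state WHERE uploaded = 0 "
        "ORDER BY way_id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


def mark_uploaded(conn, way_ids):
    """Call only after the changeset containing these ways has been closed."""
    conn.executemany(
        "UPDATE import_state SET uploaded = 1 WHERE way_id = ?",
        [(w,) for w in way_ids],
    )
    conn.commit()
```

Because a batch is marked only after its changeset closes, an interrupted run simply picks up the same batch again the next time next_batch is called.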
  • What is needed to develop the bots
Time ;) Besides that, there is still a preliminary decision to make : C++ or Java. Or something else if anyone has a better idea. More to come on this issue later.

Known and possible issues

  • No longer an issue: all data is now stored in a database. Original concern: large numbers of small files would be stored on disk, and it was unclear whether that was the best approach; a database was suggested as a workaround.
  • No longer an issue: node duplicates were removed prior to import. Original concern: nodes are duplicated in 'touching' ways; two ways that share a common segment each have their own nodes, resulting in a large number of nodes, almost all of them duplicates; some intelligent algorithm must be used to avoid this, which is normally where a database comes in handy.
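
The duplicate-node problem can be illustrated with a short Python sketch that keeps one node per coordinate pair and rewrites the way references accordingly. It assumes that duplicates have exactly equal coordinates; the actual removal is done by PostgreSQL stored procedures whose logic is not described on this page, and the function name is hypothetical.

```python
def merge_duplicate_nodes(nodes, ways):
    """Collapse nodes that share the same coordinates.

    nodes: {node_id: (lat, lon)}
    ways:  {way_id: [node_id, ...]}
    Returns (kept_nodes, rewritten_ways); the lowest ID per coordinate is kept.
    """
    first_seen = {}  # (lat, lon) -> ID of the node that is kept
    remap = {}       # every node ID -> ID of the node that replaces it
    for node_id in sorted(nodes):
        coord = nodes[node_id]
        remap[node_id] = first_seen.setdefault(coord, node_id)
    kept = {nid: nodes[nid] for nid in set(remap.values())}
    new_ways = {wid: [remap[n] for n in refs] for wid, refs in ways.items()}
    return kept, new_ways
```

For example, two touching polygons that each carry their own copy of a shared corner end up referencing a single node after the merge.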