Import/Catalogue/Lantmäteriet GSD-Terrängkartans ortnamnsimport


Import of Swedish settlements names from Lantmäteriets GSD-Terrängkartan.


To improve the completeness of the OSM toponymical dataset for the territory of Sweden, using an official map supplied by the Swedish mapping, cadastral and land registration authority (Lantmäteriet).

This import considers OSM data representable as nodes tagged with the usual key/value pairs: "place=city", "place=town", "place=village", "place=hamlet", "place=isolated_dwelling", and "place=locality". However, it is not planned (though not completely excluded) to add or modify any nodes with "city" and "town" values; these are expected to be fully mapped already.

Physical geographical features associated with these tags are collectively called "settlements" in the text below. Names used for settlements are called "toponyms".


  • January 2020: start of the project.
  • January 2020: discuss at talk-se mailing list, address discovered issues
  • February 2020: discuss at imports mailing list, address discovered issues
  • (TODO date): upload the first batch of data, collect feedback on how well the final state looks.
  • (TODO summer-winter 2020) continue with uploads of individual regions
  • (TODO date 2021) Post-mortem of what worked and what did not.

Import Data

Import data comes in the form of vector SHP data files produced by ESRI ArcGIS software. The data covers most of Sweden's populated territory (excluding northern parts of Norrland).

Documentation for the imported data is provided inside the downloaded archives in PDF form (in Swedish). A copy of the PDF file is available here:

Specifically, data from gistext/TX layers corresponding to toponyms is extracted.


Data source site:

The actual SHP files are downloaded from after a free account is registered. A copy of them is placed here:

Import Type

This is a one-time import of the data available as of Q1 2020. The data is first processed with scripts, then loaded into JOSM, visually checked to be consistent and non-conflicting, and validated with existing tools. It is then uploaded through the JOSM upload interface.

Data Preparation

Later in the text we will be using words "old" to denote data already present in the OSM database, "new" to denote data extracted from the source, and "ready" to denote the result of automatic conflation of "old" and "new" sets. Of course, "ready" here means "ready for manual inspection", not "ready for blind upload".

Data Reduction & Simplification

From the source vector, points corresponding to toponyms' labels are extracted. The rest of the data is dropped from the source.

At the conflation stage, only points found not to be represented in the current OSM database are preserved.

Tagging Plans

Points in the source files are associated with a number of fields.

The KKOD field of the source SHP files is mapped to one of the OSM "place" values. The TEXT field is mapped to "name" tags, with possible transformations (see below). See the previously mentioned PDF documentation for the meanings of KKOD and TEXT.

KKODs are mapped as the following Python dict:

   tr_t = {
       1: {"place": "isolated_dwelling"},  # Bebyggelse, enstaka gård, hus
       2: {"place": "hamlet"},             # Bebyggelse, by, större gård, mindre stadsdel
       3: {"place": "hamlet"},             # Bebyggelse, by, stadsdel
       4: {"place": "hamlet"},             # Bebyggelse, samhälle, samlad by
       5: {"place": "village"},            # Tätort 200 - 499 inv., större stadsdel
       6: {"place": "town"},               # Tätort 500 - 1 999 inv.
       7: {"place": "town"},               # Tätort 2 000 - 9 999 inv.
       8: {"place": "city"},               # Tätort 10 000 - 49 999 inv.
       9: {"place": "city"},               # Tätort 50 000 och fler inv.
      14: {"place": "farm"},               # Herrgård, storleksklass 1
      16: {"place": "farm"},               # Herrgård, storleksklass 2
   }

An extra step is made to retag settlements placed within the borders of larger settlements as place=neighbourhood.

Remaining KKODs and associated points in source data are dropped.
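The translation step can be sketched as follows. This is a minimal illustration based on the mapping table above; the function and variable names are hypothetical, not taken from the actual import scripts.

```python
# Illustrative sketch of the KKOD-to-tags translation; names are hypothetical.
TR_T = {
    1: {"place": "isolated_dwelling"},
    2: {"place": "hamlet"},
    3: {"place": "hamlet"},
    4: {"place": "hamlet"},
    5: {"place": "village"},
    6: {"place": "town"},
    7: {"place": "town"},
    8: {"place": "city"},
    9: {"place": "city"},
    14: {"place": "farm"},
    16: {"place": "farm"},
}

def translate_point(kkod, text):
    """Return OSM tags for a source point, or None to drop the point."""
    base = TR_T.get(kkod)
    if base is None:
        return None  # unmapped KKOD: the point is dropped
    tags = dict(base)
    tags["name"] = text
    tags["lantmateriet:kkod"] = str(kkod)  # keep the original code for diagnostics
    return tags
```

For example, `translate_point(5, "Exempelby")` yields a "place=village" node tagged with the name and the original KKOD, while an unmapped KKOD returns `None` and the point never enters the output.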

Technical and diagnostic tags

In addition to the tags derived from the source dataset, auxiliary tags are added to all or some new nodes.

The following tags may be added.

  • import=yes
  • source="GSD-Terrängkartan"
  • "lantmateriet:kkod" to store the original KKOD value.
  • fixme=<description> for nodes with likely incorrect names, e.g. names ending with a dash or starting with a lower-case letter.
  • note=<description> for nodes with names reconstructed from parts or abbreviations.
  • short_name to keep the original abbreviated name
  • import:note = <description> for nodes having names similar to old (multi)polygons.
  • import:in_water = yes if a node is detected to be incorrectly placed into a body of water.
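The name-based fixme criteria from the list above can be sketched as a small check. This is an illustrative example only; the exact descriptions and the full set of checks in the real scripts are assumptions.

```python
# Illustrative sketch of the name checks that trigger a fixme tag.
# The description strings are hypothetical.
def fixme_for_name(name):
    """Return a fixme description for a suspicious name, or None."""
    if name.endswith("-"):
        return "name ends with a dash; possibly an unmerged split label"
    if name[:1].islower():
        return "name starts with a lower-case letter"
    return None  # name looks plausible, no fixme needed
```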

Changeset Tags

Changesets will be tagged with source = "GSD-Terrängkartan".

Data Transformation

Tools used:

  • osmconvert, osmfilter and ogr2osm to perform initial data format conversion and filtering.
  • Scripts and tools (link) to convert, split, clean up, conflate data and resolve issues at intermediate steps.
  • JOSM editor to manually fix remaining issues, visually and semi-automatically review changesets, and finally upload them to the OSM-database.
  • A copy of OSM API v0.6 instance to allow distributed collaboration over the dataset.

Data processing diagram

See the diagrams below. The conflation stage is described later in more detail. See also the section on collaboration below to learn how workload distribution is done.

   +-------------------+        +------------------+
   |                   |        |                  |
   |Lantmäteriet's SHP |        |Geofabrik country |
   |files              |        |extract           |
   |                   |        |                  |
   +---------+---------+        +--------+---------+
             |                           |
             |ogr2osm                    |osmconvert
             |                           |osmfilter
             v                           v
    +--------+---------+         +-------+---------+
    |                  |         |                 |
    |OSM file with     |         |OSM file with    |
    |settlements       |         |settlements      |
    |                  |         |                 |
    +---------+--------+         +-------+---------+
              |                          |
              +------------+-------------+
                           |conflation scripts
                           v
                  +--------+--------+
                  |                 |
                  |OSM file with    |
                  |only ready nodes |
                  |                 |
                  +--------+--------+
                           |Manual corrections
                           v
                     Upload to JOSM

The data movement flow:

  SHP file → (extraction) → OSM XML file → (osmosis) → private OSM DB → (two JOSM layers) → public OSM DB

Data Transformation Results

An API server with not yet merged nodes: contains all nodes presently not "live" in the main OSM DB. See notes below on its usage.

Older file sets

(An earlier iteration of) OSM files with new nodes (before conflation) and OSM filtered extract with all nodes with "place=*" within Sweden's borders (old nodes):

Previous sets of generated ready nodes, with a short history of changes:

- v1:

- v3 (more cleanup of names):

- v7 (even more name cleanup and smarts added):

- v8 (smaller tiles are available):

- v9 (fixed even more abbreviations):

- v10 (added "import=yes"):

- v11 (more name concatenation heuristics, note tag):

- v14 (added generation of files with "dropped" duplicates):

- v13? (added conflation against pseudo-nodes, fixed a bug with excessively big bboxes, and lowered false positive rates; as a result, more ready nodes survive):

- v24 (fresh OSM extract):

- v23 (fuzzy name comparison):

- v21 (more aggressive conflict detection, splitting into smaller tiles, promotion of nodes inside city borders):

- v27 (marked nodes outside land with import:in_water for manual adjustments):

- v31 (all nodes are dragged out of water):

Source files

Old nodes for the country (places.osm) and new nodes for regions (tx_*.osm) under Git control:

Explanation of included files

Input files:

  • places.osm is a file with the OSM extract filtered to contain only nodes with a "place=*" tag. (Multi)polygons with relevant tags were converted to nodes and included in it as well. The most recent version is generated from the OSM database extract.
  • tx_N.osm is an input file produced from Lantmäteriet's SHP file for region N.

Output files:

  • regions/tx_<number>.osm is a file for a region of the country (the number is chosen after those used in the source SHP file names, see below). A single OSM file contains from 100 to 15000 ready nodes.
  • tiles/tx_<number>_<number>_<number>.osm are the same data split into smaller chunks.

Each file should contain approximately 200-400 ready nodes. The exact number of nodes in a tile may vary, as the density of ready nodes is not taken into consideration when determining tile size and position. Tiles with 0 ready nodes are not created.
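The tiling step can be sketched as follows. This is an illustrative example, not the actual splitting script: a region's bounding box is cut into a fixed grid regardless of node density, and empty tiles simply never materialize.

```python
from collections import defaultdict

# Illustrative sketch of density-agnostic tile splitting; names are hypothetical.
def split_into_tiles(nodes, tiles_per_side):
    """nodes: list of (lat, lon, payload); returns {(i, j): [payload, ...]}."""
    lats = [n[0] for n in nodes]
    lons = [n[1] for n in nodes]
    min_lat, max_lat = min(lats), max(lats)
    min_lon, max_lon = min(lons), max(lons)
    dlat = (max_lat - min_lat) / tiles_per_side or 1.0
    dlon = (max_lon - min_lon) / tiles_per_side or 1.0
    tiles = defaultdict(list)
    for lat, lon, payload in nodes:
        i = min(int((lat - min_lat) / dlat), tiles_per_side - 1)
        j = min(int((lon - min_lon) / dlon), tiles_per_side - 1)
        tiles[(i, j)].append(payload)
    # Empty tiles never appear as keys, matching "tiles with 0 ready nodes
    # are not created".
    return dict(tiles)
```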

  • Log files contain warnings about unresolved names and statistic information about processed and generated files.

The mapping of the 21 regions to file numbers follows the numbering used by Lantmäteriet's original files. Number of ready nodes per region:

  Name               Code   Ready nodes
  Stockholm             1          3917
  Uppsala               3          4813
  Södermanland          4          7638
  Östergötland          5         10375
  Jönköping             6          8225
  Kronoberg             7          6796
  Kalmar                8          6498
  Gotland               9           225
  Blekinge             10          1473
  Skåne                12          7470
  Halland              13          4607
  Västra Götaland      14         24568
  Värmland             17          9796
  Örebro               18          5893
  Västmanland          19          3315
  Dalarna              20          3528
  Gävleborg            21          5301
  Västernorrland       22          2420
  Jämtland             23          2894
  Västerbotten         24          2518
  Norrbotten           25          2154
  • dropped/tx_<number>-dropped.osm contains new nodes that were marked as duplicates of already existing objects. All included nodes should have an "import:note" tag referencing the object they matched against, including its ID, coordinates and name.

Data Merge Workflow


A team of contributors collaborating through the mailing list talk-se will import data covering different parts of the country.

Private API server workflow

A private OSM API v0.6 server is created to host a live copy of import data. The server URL is and it can be specified in JOSM settings to download and upload ready nodes.

Notes about the private API server
  1. The private API server is a slow home computer behind an even weaker front-end VPS. Please do not overload it with work.
  2. There are no guarantees that the server is available at any time, or at all. Its power and/or network connection may be down for undefined periods of time.
  3. The server runs a rough copy of the official website software.
    1. Only the API v0.6 endpoint is supposed to work at the specified URL. If something else works, it does so by accident.
    2. There is no slippy map, no Overpass/Nominatim etc. services, no users (except one), no way to register users, no online editor etc.
    3. There is currently no tile server to visualize the DB contents. Having one would definitely help to see what areas are still not covered.
  4. A single account mapper is created to allow collaborators to make edits via OSM API. See the mailing list thread for the account's password.
  5. Please report your problems with the server to the talk-se mailing list.

The workflow is to download a group of nodes from the private API, edit them as needed, copy them to the public DB, and delete them from the private DB.

Step-by-step workflow

A recommended workflow is described below. JOSM and two data layers are used to download, edit and transfer nodes from the private API to the public API. Certain steps of the workflow may be adjusted when needed.

It is recommended to set up and activate a JOSM filter with query text "place=*" and inverted flag in order to shade everything not related to objects with place tags. It will shade a lot of visual clutter.

1. Use the JOSM download dialog with the slippy map to download a chunk of data from the public OSM API server into a new data layer.

2. Create a second empty data layer (keyboard shortcut Ctrl-N).

3. Change JOSM settings to expert mode. In JOSM settings change connection options to the private API URL.

4. Use the download dialog with the slippy map to download a chunk of data for the same bounding box from the private API server. Now you have two data layers: the first with "old" data and the second with "new" data.

5. Edit the "new" data in the second layer as you see fit: move, rename, delete, retag etc. Set up a satellite imagery background layer as a reference if needed.

6. When satisfied with the result, it is time to move the new nodes. Do not use JOSM's Merge function to copy nodes between layers! It won't work correctly as object IDs of the private API are incompatible with the main OSM DB.

7. Select all nodes (Ctrl-A), copy them to the clipboard (Ctrl-C). Switch to the first layer, and use Paste at source position (shortcut Ctrl-Alt-V) command to insert them. A copy of selected nodes (treated as newly created objects by JOSM) will be created.

8. Go back to the second layer and delete the selected nodes. Then upload your changes to the private API DB. To do that, the user account in the JOSM settings must be set to mapper, not your normal import account.

  1. Yes, it is awkward to switch back and forth between two API URLs and two accounts via the JOSM options. A feature request to address this inconvenience has been open since 2009.
  2. The deletion step is required to prevent other collaborators from working on the same nodes after you've moved them. Please mention your "real" user account in the changeset message.

9. Change back to the public OSM API and your import account in the JOSM settings. Make sure to use a separate account containing the word "import" when uploading data to the main DB, as required by the OSM import guidelines.

10. Open the first data layer and upload it to the public API DB.

As a result of these steps, a set of nodes has been moved from one DB to the other.

Old approach

Note that this approach is deprecated in favor of the private API server approach above.

The whole country area is split into 21 sub-units following the territorial scheme present in the original data source. To better balance the subsequent manual validation work, these files are also split into smaller tiles. The goal is to have about 100-200 new nodes per tile.

Script developed to perform the splitting:

The collaboration is guided through online spreadsheets or other convenient mechanisms to make sure that no two people attempt to upload the same data chunk twice. The main spreadsheet to track progress:

Changeset size policy

Individual changesets of this import should follow regular OSM policies on size limits. The total number of new nodes is expected to be about 118,000, meaning that multiple changesets will be required to upload everything.


Nodes positions and toponyms' names can be validated using the following sources:

  • Lantmäteriet's own raster tiles service available as background layer in JOSM.
  • Lantmäteriet's toponym search service
  • Historical maps of Sweden used as background layers in JOSM.
  • Existing OSM data (used to visually discover inconsistencies).
  • Publicly available information on toponyms (Wikipedia etc.) to verify names

when there is doubt.

Data extraction

Conversion of a Geofabrik extract for Sweden is done by the following script:

Conversion of SHP files to OSM files is done by the following script:

Note to self:

   for d in tk_*; do
       SHP=$d/terrang/*/gistext/tx*.shp
       echo ~/workspace/nmd-osm-tools/ $SHP ~/tmp/ortnamn/`basename $SHP .shp`.osm
   done

The following tag translation filter is supplied to ogr2osm:

Notes on names

1. The source SHP TEXT fields often contain slightly mangled toponym names. Humans can tolerate this, but it increases the risk of duplicates for automated scripts. E.g., "St. mosse" and "Stora mosse" would be treated as two different places. To counter this, a set of regular-expression-based conversion heuristics is applied to expand typical abbreviations, such as "St.", "L.", etc.
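The expansion step can be sketched with a small pattern table. Only "St." → "Stora" is taken from the example above; the other entries ("L." → "Lilla", etc.) are assumed common Swedish map abbreviations, and the real scripts' table is certainly larger.

```python
import re

# Illustrative abbreviation table; only "St." -> "Stora" comes from the text,
# the rest are assumptions.
ABBREVIATIONS = [
    (re.compile(r"^St\.\s*"), "Stora "),
    (re.compile(r"^L\.\s*"), "Lilla "),
    (re.compile(r"^N\.\s*"), "Norra "),
    (re.compile(r"^S\.\s*"), "Södra "),
]

def expand_abbreviations(name):
    """Expand leading abbreviations so variants compare equal later."""
    for pattern, replacement in ABBREVIATIONS:
        name = pattern.sub(replacement, name)
    return name

# expand_abbreviations("St. mosse") -> "Stora mosse"
```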

2. Longer toponym strings are sometimes split into two close points, each of which contains a hyphenated part of the original name. Such pairs have to be concatenated back into a single toponym. The criteria for merging: the nodes are close, one name ends with "-", and the other starts with a lower-case letter. The heuristic does not work all the time, e.g. "Öster-Övsjö" split at the dash will not be detected. However, all remaining names with dashes are reported and the relevant nodes are marked for human intervention.

A name can also be split at whitespace. This situation is likewise detected and fixed automatically. Very few new nodes are affected by this transformation (so far only 5 such nodes have been found in the whole new dataset).

Very few toponyms are split in three or more parts. Possible unmerged left-overs are considered suspicious and marked with "fixme" tags for later manual resolution.
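The dash-merge criteria above can be sketched as follows. The node names, the distance threshold, and the flat-earth distance approximation are illustrative assumptions; whether the real script keeps or drops the hyphen when joining may also differ.

```python
import math

# Illustrative sketch of the dash-merge heuristic; threshold and names are
# hypothetical. Keeps the hyphen when joining.
def try_merge(node_a, node_b, max_dist_m=500.0):
    """node_*: (lat, lon, name). Return the merged name, or None."""
    (lat_a, lon_a, name_a), (lat_b, lon_b, name_b) = node_a, node_b
    if not (name_a.endswith("-") and name_b[:1].islower()):
        return None  # e.g. "Öster-" + "Övsjö" is not detected (uppercase)
    # Crude flat-earth distance, adequate over a few hundred metres.
    dy = (lat_a - lat_b) * 111_320.0
    dx = (lon_a - lon_b) * 111_320.0 * math.cos(math.radians(lat_a))
    if math.hypot(dx, dy) > max_dist_m:
        return None
    return name_a + name_b
```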

3. Names with non-Swedish letter symbols (punctuation, numbers etc.) are marked with "fixme" for human inspection. E.g. "Günthers" is a valid but unusual toponym worth rechecking. This will also mark toponyms in minority languages to be verified by humans.
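The non-Swedish-symbol check can be sketched with a character class. The class below (Swedish letters plus space and hyphen) is an assumption; the real scripts likely allow a few more characters.

```python
import re

# Illustrative character class for "ordinary Swedish" names; an assumption.
SWEDISH_NAME = re.compile(r"[A-Za-zÅÄÖåäö \-]+")

def needs_fixme(name):
    """True when the name contains characters outside the assumed class."""
    return not SWEDISH_NAME.fullmatch(name)

# needs_fixme("Günthers") -> True ("ü" is outside the class)
# needs_fixme("Stora mosse") -> False
```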

Revert plan

If problems are discovered after "bad" data has been uploaded to the OSM database, it must first be reverted, then corrected, and an improved changeset reuploaded.

All participating users should maintain ranges of changeset numbers for their uploads in the private API history. This should assist with reverting faulty changes via the JOSM option "Revert changeset".

Changesets' and nodes' "source" tags can also be used to track down nodes participating in incorrect changesets.

It is recommended to document reasons why reverting was necessary. Later, develop a mitigation plan to address discovered issues, fix them and re-attempt uploading if deemed reasonable.

Conflation and final automatic preparation steps

The base script developed for automatic conflation is

Its algorithm operates on a set of old nodes (OSM-extract, nodes marked with "place=*", around 68 000 nodes for the country) and new nodes (produced earlier from SHP files). The script produces ready nodes, which is a strict subset of new nodes. No old nodes are modified in any way during the process. This means that existing data has absolute priority, even in cases it is likely of lower quality than new data.

The sequence of steps is as follows.

1. Create a spatial index structure with old nodes to have fast spatial lookup.

2. For all new nodes validation/correction of the "name" tag is performed.

3. For each new node, find old nodes close enough to it to be candidates for duplicates.

4. For each candidate node, compare its name against the current new node name. Comparison is fuzzy to allow for some text variation typical for names. Alternative old names are also checked if present.

5. If a name match is found, the current new node is marked as "duplicate" and is excluded from further analysis and results.

6. An OSM file with ready data is generated.

7. The OSM file is optionally split into smaller tiles to ease and speed up visual validation.
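The core of steps 1 and 3-5 can be sketched as a grid-based spatial index plus a fuzzy name check. The cell size, search radius and similarity threshold below are illustrative assumptions, not the script's actual values.

```python
import difflib
from collections import defaultdict

CELL = 0.01  # grid cell size in degrees (~1 km); an assumed value

def build_index(old_nodes):
    """Step 1: bucket old (lat, lon, name) nodes into grid cells."""
    index = defaultdict(list)
    for lat, lon, name in old_nodes:
        index[(int(lat / CELL), int(lon / CELL))].append((lat, lon, name))
    return index

def candidates(index, lat, lon):
    """Step 3: old nodes in the 3x3 block of cells around a new node."""
    ci, cj = int(lat / CELL), int(lon / CELL)
    for i in range(ci - 1, ci + 2):
        for j in range(cj - 1, cj + 2):
            yield from index.get((i, j), ())

def conflate(old_nodes, new_nodes, threshold=0.9):
    """Steps 4-5: keep only new nodes with no fuzzy name match nearby."""
    index = build_index(old_nodes)
    ready = []
    for lat, lon, name in new_nodes:
        duplicate = any(
            difflib.SequenceMatcher(None, name.lower(), old.lower()).quick_ratio() >= threshold
            for _, _, old in candidates(index, lat, lon)
        )
        if not duplicate:
            ready.append((lat, lon, name))
    return ready  # a strict subset of new_nodes; old nodes are never modified
```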

Notes on name comparison

In addition to the name sanitation presented earlier, the comparison of names between the old and new datasets needs to account for remaining possible variations. To reduce the probability of false negatives (erroneously deciding that two nodes are not aliases when they are), strings are brought to normalized forms before comparison.

The comparison itself uses Python's difflib.SequenceMatcher.quick_ratio() to estimate the similarity between the strings. The similarity threshold is set so that strings of about 10 characters are considered similar even if they differ by up to one character.
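A minimal sketch of this test, assuming a threshold of 0.9 (for two 10-character strings differing in one character, the ratio is 2·9/20 = 0.9, so they still match). The normalization shown is only lower-casing and trimming; the real scripts' normalization and cutoff may differ.

```python
import difflib

# Illustrative fuzzy name comparison; 0.9 threshold is an assumption derived
# from the "10 characters, one difference" rule of thumb.
def names_match(a, b, threshold=0.9):
    a, b = a.lower().strip(), b.lower().strip()
    return difflib.SequenceMatcher(None, a, b).quick_ratio() >= threshold
```

Note that quick_ratio() is an upper bound on ratio() based on character counts, so it trades some precision for speed, which suits a first-pass duplicate filter.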

Expected issues and their risk assessment

Classes of issues described below are assessed with respect to:

1. their estimated/measured probability of happening (how often will an imported node bring the issue with it?),

2. their negative impact on human map users if not fixed (how much harder would the map be to use if the error is admitted?),

3. an estimate of the effort needed to detect them (if an error is admitted, will it be easy to discover by tools or by humans?),

4. an estimate of the effort needed to fix them (how much human work will be needed to compensate for the issue?).

Note: So far, the most problematic issues seem to be those classified as "A duplicate of an existing node is added" and "A new node is added with incorrect classification". Discovering and fixing such problems is expected to require most of the manual editing.

A new node not corresponding to any toponym of real world is added to the map

Probability: very low, as the authorities' database is regularly updated.

Impact: medium as it creates confusion for map users trying to reach a ghost place.

Effort to detect and fix: high/high for tiny settlements, as a non-existing place would be impossible to detect and correct without physically visiting the coordinates. The bigger the settlement, however, the easier it is to discover and delete the mistake by simply looking at e.g. satellite imagery.

A new node with incorrect name is added

Probability: medium, mostly as a result of a typo in the source or unusual spelling used in name.

Impact: low. A misspelled name will still likely be recognizable by humans and/or correctable by computers.

Effort to detect: low. Cross-checking against other public toponym data should uncover the correct or preferable spelling.

Effort to fix: low. Just rename the node manually.

A new node with incorrect classification is added

E.g. adding "place=village" instead of "place=town" etc.

Probability: medium to high. The issue mostly reduces to the chosen source-to-destination tag remapping scheme and the tagging practices of a particular region. No administrative hierarchy information, which could have been used to derive the administrative classification of settlements, is present in the data source used.

The most controversy is expected around tagging with "place=locality". Officially this tag is reserved for named locations without population. In practice, the tag is sometimes used for settlements with unknown status, from isolated buildings to historical sections of cities that fall between the levels of the existing administrative hierarchy. In this import, "place=locality" is used to represent the smallest named entity, smaller than "isolated_dwelling".

Impact: low. A correct name and coordinates for a settlement are arguably more important than the decision whether it should be treated as e.g. a "village" or a "town".

If needed, a change in "place=*" scheme can be applied after additional classifications become available.

Effort to detect: medium. Consultation with external sources will be needed to cross-check the official position on a settlement's type.

Effort to fix: easy both for manual and automated retagging once a mistake is discovered and new classification level is known.

A new node is added with incorrect position

Probability: high for small errors, medium for big errors.

Ideally, a "place=*" node should be placed at the settlement's center (e.g. the main square, the main train station, the geometrical center etc.). However, Lantmäteriet's map often has its labels off to the side, so that a text label on the corresponding "paper map" does not cover the settlement's territory. For big cities with large areas and complex borders, this may create a node placement error of up to 2 kilometers relative to the "ideal" position. However, big settlements are already well mapped and won't receive updates during this import.

For small settlements (the majority of imported nodes), offset error is mostly small, as they have small linear dimensions.

Impact: medium. There is no official "center" for smaller settlements such as a single cottage or a tiny village. At the same time, finding them on a map without any sort of textual labeling is problematic.

Effort to detect: medium. Mostly visually comparing against aerial imagery is enough.

Effort to fix: low. If a node is discovered to be at a sub-optimal position, it is easy to move it closer to the optimal coordinates.

A duplicate of existing node is added

See also the specific situations analyzed below.

Probability: low to medium. Exact duplicates of existing nodes (both name and coordinates match) should be impossible (provided the conflation scripts are error-free). A deviation in coordinates and/or names between old and new nodes increases the probability of a duplicate slipping into the map. However, fine-tuned thresholds and distance algorithms (both for spatial and textual information) should reduce the error rate.

Impact: medium. Two nodes naming the same settlement is confusing, but easy to fix upon discovery. Both nodes are still likely to be easily associated with the same place. But it will definitely look annoying until fixed.

Effort to discover: medium. Two closely placed, similarly named nodes are obvious upon inspection.

Effort to fix: low. To manually delete a duplicate is easy.

Node having same alternative name as existing node

For example, adding a node with name="Gullåkra by" near an old node with name="Gullåkra".

Probability: low. There should not be many variations of names. Existing conflation script checks for alternative names.

Impact: low. A human will easily be able to recognize the error and dismiss it.

Effort to discover: medium. Map has to be visually scanned for suspicious node pairs.

Effort to fix: low. Delete one node, add "alt_name" to the other. If needed, the conflation script can deal with it by utilizing more advanced fuzzy name comparison.

Node having same name as existing closed way

Tag "name=*" can be placed not only on nodes, but also on (multi)polygons encircling settlements, such as landuse=residential, landuse=farmyard etc.

Probability: high. There are regions with hundreds of such (multi)polygons.

Impact: low to medium (currently being debated). It is customary for certain mappers to map a settlement with both a name on its closed way and a separate node with "place=*" inside its border. One reason is that a node can be placed at a "logical", "economical" or political center, such as the main square, train station etc. In contrast, the geometric center of a (multi)polygon is hard to control, and it may land somewhere completely non-representative for the settlement.

Effort to discover: low. Detection is automated (since b4973ffe): closed named ways are treated as pseudo-nodes, the same conflation strategy is applied, and matches are marked with import:note = *.

Effort to fix: low. If needed, the conflation script can be adjusted to address it.

A node for physically existing toponym is not added

Probability: unknown. Chances of having an unknown name for unknown place are hard to estimate.

Impact: very low. If a place was not present in OSM, and it remains unknown, then apparently nobody thinks it is interesting.

Effort to detect: unknown.

Effort to fix: hard, as an alternative source must be analyzed to find any missing settlements.

Toponyms in languages other than Swedish

Several minority languages are used in Sweden, and toponyms may be stated in several languages as well. Lantmäteriet's documentation mentions that the data does contain text using letters of the Sami languages.

The baseline conflation algorithm tries to match names of new nodes against multiple tags of old nodes ("name", "alt_name", "name:se" etc.).

Probability: low

Impact: low/medium. Additional nodes for the same settlement would be created where a single node with several tags for names ("alt_name", "name:fi" etc.) should be made instead.

Effort to detect: unknown

Effort to fix: medium (manual, upon detection).

Quality Assurance Plan

Common sense should be applied when visually inspecting ready data. Some visual/manual checks and corrections are expected.

  • Few large (place=town or place=city) settlements should be added; those should already be (almost) completely represented in the OSM database. It is expected that the vast majority of new nodes will be tagged with place=isolated_dwelling.
  • Generally, new nodes should be placed on land, not inside water. An exception in a form of named archipelago is theoretically possible; however, this import does not contain new "place=archipelago" primitives.
  • All ready nodes marked with "fixme" must be checked and acted upon, both after the data is loaded into JOSM and after it has been uploaded to the OSM database.
  • At all stages when vector data is loaded into JOSM, the standard JOSM/Validator shall be used to detect inconsistencies.
  • It is a requirement for this import that no errors detected by the validator are uploaded together with the new data. When possible, even older errors have to be fixed along with the upload (and committed in separate changesets). No new warnings caused by the new data being imported are allowed. It is encouraged to fix pre-existing warnings for areas that are being updated right before the import changeset uploading ("the boyscout rule").
  • After the individual changeset uploads are finished, the Osmose web service will be used to detect any remaining errors/warnings. To simplify detection of problems caused by this particular import, the per-account web page can be used:

See also

The email to the Imports mailing list was sent on TODO and can be found in the archives of the mailing list at TODO.