Osmosis/TagTransform


The tag transform Osmosis plugin allows arbitrary tag transforms to be applied to OSM data as a preprocessing step before using other tools. This lets those tools concentrate on their actual job, without having to handle numerous different tagging schemes and error corrections.

The transforms apply regular expressions to both the tag keys and values, and enable customising output tags based on sub-matches.

Current Status

Brett has added this plugin to the main codebase as part of the development branch [1], and it has been included in the stable releases since 0.42.

Downloading

The plugin is now part of the core Osmosis since version 0.42.

The plugin is currently available for older versions of Osmosis here

The code for the older version is GPL and available from the OpenStreetMap SVN repository. It has since been included in the core Osmosis, with a license change to public domain along with the rest of the Osmosis code base.

Installation

This plugin is now part of the core Osmosis, so this section is only kept as a historical reference for older versions of Osmosis.

You can put the plugin directly into lib/default. The following snippet installs the latest Osmosis and tagtransform.jar.

# fetch latest osmosis and unpack
wget -O - https://bretth.dev.openstreetmap.org/osmosis-build/osmosis-latest.tgz | tar xz

# get a precompiled tagtransform.jar (tested with 0.40.1)
wget -O $(echo osmosis*)/lib/default/tagtransform.jar http://www.imn.htwk-leipzig.de/~cmuelle8/tagtransform.jar

# run osmosis on YOURFILE, doing transformations in TRANSFORM.xml, writing to OUTFILE
./osmosis*/bin/osmosis --read-xml file=YOURFILE --tag-transform file=TRANSFORM.xml --write-xml file=OUTFILE

#end

Building the plugin yourself

Alternatively, you may compile and build tagtransform.jar yourself.
You need Ant, a recent Java, and Osmosis to do this. The following snippet

# fetches and unpacks latest osmosis
wget -O - https://bretth.dev.openstreetmap.org/osmosis-build/osmosis-latest.tgz | tar xz
svn co https://svn.openstreetmap.org/applications/utils/osmosis/plugins/tagtransform/

cd tagtransform
rm -fr libs
ln -fs ../osmosis-*/lib/ libs

# creates a proper osmosis-plugins.conf so it is found by the PluginLoader of latest osmosis
echo "uk.co.randomjunk.osmosis.transform.TransformPlugin" > src/osmosis-plugins.conf

ant -f build.xml

# moves tagtransform.jar to osmosis plugin directory
mv build/dist/*.jar ../osmosis-*/lib/default/
cd ..

# runs osmosis on YOURFILE, doing transformations in TRANSFORM.xml, writing to OUTFILE
./osmosis*/bin/osmosis --read-xml file=YOURFILE --tag-transform file=TRANSFORM.xml --write-xml file=OUTFILE

#end

Running a transform

All tasks are for API 0.6, and are available in the core Osmosis since Osmosis 0.42.

--tag-transform (--tt)

Transform the tags in the input stream according to the rules specified in a transform file.

Pipe      | Description
inPipe.0  | Consumes an entity stream.
outPipe.0 | Produces an entity stream.


Option | Description                                                     | Valid Values | Default Value
file   | The name of the file containing the transform description.      |              | transform.xml
stats  | The name of a file to output statistics of match hit counts to. |              | N/A

Specifying a transform

Transforms are specified as an XML file containing a series of translations. Each translation is made up of the following parts:

Part        | Required | Description
name        |          | Name of the translation -- used in stats output
description |          | Description of the translation for your own sanity and stats output
match       | Y        | Specifies the conditions that must be met for the output to be applied
find        |          | Specifies extra tags used in output that are not essential to achieve a match
data        |          | Specifies a list of (possibly external) data sources which can be used to transform matches before generating output tags
output      |          | Specifies the tags to be output when an entity is matched

Translations are executed on each entity, with the output of the first translation used as the input for the second etc.
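
As a rough sketch of this chaining, the following transform file contains two translations. The tag keys and the has_url key are made up for illustration. Because the first translation rewrites website=* to url=*, the second one should see the url tag even on entities that originally only carried website=*:

```xml
<?xml version="1.0"?>
<translations>
  <!-- First translation: rename website=* to url=* -->
  <translation>
    <name>Normalise website key</name>
    <match>
      <tag k="website" match_id="w" v=".*"/>
    </match>
    <output>
      <copy-unmatched/>
      <tag from_match="w" k="url" v="{0}"/>
    </output>
  </translation>
  <!-- Second translation: runs on the output of the first one -->
  <translation>
    <name>Flag entities with a URL</name>
    <match>
      <tag k="url" v=".*"/>
    </match>
    <output>
      <copy-all/>
      <!-- has_url is a hypothetical key used only in this sketch -->
      <tag k="has_url" v="yes"/>
    </output>
  </translation>
</translations>
```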

match and find

There are a couple of different match types. The top level element must be match or find.

match

The match element groups together other matches. It has two modes:

  • and (default) -- all contained matches must match (checking stops at the first non-match)
  • or -- at least one of the contained matches must match (all are checked regardless)

The mode is selected with the mode attribute. The entity type to which a match applies can be restricted with the type attribute; valid values are all (default), node, way, and relation.
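
A small sketch of how the two modes can be nested, using made-up tag values: the outer match is restricted to ways and uses or-mode, while the inner match relies on the default and-mode. It is intended to match ways that are either highway=cycleway, or highway=path combined with bicycle=designated.

```xml
<match type="way" mode="or">
  <tag k="highway" v="cycleway"/>
  <!-- no mode given, so both contained tags must match -->
  <match>
    <tag k="highway" v="path"/>
    <tag k="bicycle" v="designated"/>
  </match>
</match>
```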

find

This is a special case of match's or-mode and can only be used as a top level tag. The find section can be used to get matches for tags used in the output, but which are not essential.
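
For illustration, here is a sketch with hypothetical tag choices in which the match section decides whether the translation applies and the find section only collects an optional tag for the output. Entities without name:en should still match, and the output tag that references the find is then simply omitted:

```xml
<translation>
  <name>Prefer English names for cafes</name>
  <match>
    <tag k="amenity" v="cafe"/>
  </match>
  <find>
    <!-- optional: used in the output only when present -->
    <tag k="name:en" match_id="en" v=".*"/>
  </find>
  <output>
    <copy-all/>
    <!-- overwrites name=* only if name:en was found -->
    <tag from_match="en" k="name" v="{0}"/>
  </output>
</translation>
```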

tag

Matches individual or groups of tags. Tags are selected by regular expressions. These are standard Java regular expressions, and full information can be found at [2].

Attributes are used to specify the regexes:

  • k the key regex to match
  • v the value regex to match
  • match_id the ID to reference in output

The output may reference matches to output tags using the specified ID. Any groups extracted by the regex will be available to the output.
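
The following sketch shows a key regex with a group and a match_id that the output refers to. The address_ prefix is an arbitrary choice for this example. Assuming an entity carries addr:street=* and addr:city=*, both tags are matched, and the output should produce address_street=* and address_city=* with the original values:

```xml
<translation>
  <name>Rename addr keys</name>
  <match>
    <!-- group 1 of the key regex captures the part after "addr:" -->
    <tag k="addr:(.*)" match_id="addr" v=".*"/>
  </match>
  <output>
    <copy-unmatched/>
    <!-- {1} in k is the key group, {0} in v is the whole value -->
    <tag from_match="addr" k="address_{1}" v="{0}"/>
  </output>
</translation>
```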

notag

Matches on the absence of tags. If any tag is matched by the regexes, then a parent and-mode match will fail.

  • k the key regex to not match
  • v the value regex to not match
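
A minimal sketch of notag inside an and-mode match, with made-up tagging rules: major roads that have no surface tag at all get a default one, while roads that already carry any surface=* value are left alone.

```xml
<translation>
  <name>Default surface for major roads</name>
  <match type="way">
    <tag k="highway" v="motorway|trunk"/>
    <!-- fails the whole match if any surface=* tag is present -->
    <notag k="surface" v=".*"/>
  </match>
  <output>
    <copy-all/>
    <tag k="surface" v="paved"/>
  </output>
</translation>
```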

data

List of data sources which can be used to attach external information to the output tags defined in the following section.

source (shared)

  • type type of the data source (see below)
  • source_id unique id used to reference this data source in the output section

source (type="CSV")

  • file path to the CSV file
  • csvFormat one of the predefined CSV formats of Apache Commons CSV [3]
  • lookup comma-separated list of 0-based column indices with one entry for each regex match group. If all match groups equal the value of the respective cells in a row, this row is the lookup result.
  • return comma-separated list of 0-based column indices with one entry for each regex match group. If a row is the lookup result as defined above, the respective cells' values will replace the match groups in the output tag.
  • fallback comma-separated list of values with one entry for each regex match group. If the lookup fails, these values will replace the match groups in the output tag.

source (custom)

Additional data sources (e.g. PostgreSQL tables, web services, ...) can be added by implementing the interface org.openstreetmap.osmosis.tagtransform.DataSource and registering it in org.openstreetmap.osmosis.tagtransform.impl.DataSources.

output

The output is specified as a series of operations which are executed in order. Tag keys are considered unique, and so any operation writing to an existing key will overwrite that existing tag.

If no output section is specified, then any matching entities will be dropped entirely.
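
As a sketch of this behaviour, a translation that consists only of a match section acts as a filter. The tag used here is just an example; any node carrying it would be removed from the stream:

```xml
<translation>
  <name>Drop disused nodes</name>
  <description>No output section, so matching entities are dropped</description>
  <match type="node">
    <tag k="disused" v="yes"/>
  </match>
</translation>
```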

copy-all

Copies all the original tags to the output unchanged.

copy-unmatched

Copies any tags not matched by match or find expressions.

copy-matched

Copies any tags which were matched by match or find expressions.
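
To illustrate the difference between the copy operations, here is a sketch that uses copy-matched as the only output operation. Since nothing else is copied, the expectation is that matching entities keep their name tags and lose all other tags; this is an assumption based on the descriptions above and worth verifying on a small extract first.

```xml
<translation>
  <name>Keep only names</name>
  <match>
    <!-- matches name as well as name:en, name:de, etc. -->
    <tag k="name(:.*)?" v=".*"/>
  </match>
  <output>
    <copy-matched/>
  </output>
</translation>
```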

tag

Output a specific tag, or multiple tags if referencing a match. The key and value for the new tag(s) are specified using output expressions. Within an output expression, {n} is replaced with matched regex group number n: {0} represents the whole match string, and {1} the first matched group. If key_datasource or value_datasource are supplied, the matched regex groups will be transformed using the respective data source before being inserted.

The attributes used are:

  • from_match -- the match_id to take values from
  • k -- the key to output
  • v -- the value to output
  • key_datasource -- id of the data source used to transform the match groups before insertion into k
  • value_datasource -- id of the data source used to transform the match groups before insertion into v

If the referenced match doesn't exist (i.e. it was part of find or an "or" mode match and no matching tags were found), then the tag output is omitted (even if groups aren't used in the strings).

If no match is referenced at all then the key and value are treated as simple strings and output verbatim.
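
The two kinds of tag output can be sketched side by side. The keys maxspeed_mph and maxspeed_unit are invented for this example. For an entity tagged maxspeed=30 mph, the first output tag takes group 1 of the value regex and should yield maxspeed_mph=30, while the second one references no match and is written verbatim:

```xml
<translation>
  <name>Split mph speed limits</name>
  <match>
    <!-- group 1 of the value regex captures the number -->
    <tag k="maxspeed" match_id="mph" v="([0-9]+) ?mph"/>
  </match>
  <output>
    <copy-unmatched/>
    <!-- value built from the referenced match -->
    <tag from_match="mph" k="maxspeed_mph" v="{1}"/>
    <!-- no from_match: key and value are plain strings -->
    <tag k="maxspeed_unit" v="mph"/>
  </output>
</translation>
```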

Examples

Unify different tags to mark website URLs:

<?xml version="1.0"?>
<translations>
  <translation>
    <name>Unify different tags to mark website URLs</name>
    <description>This transformation searches for the tags url=*, website=* and contact:website=* and unifies them into the tag url=*. Since &lt;tag from_match="cw" ...&gt; comes last in &lt;output&gt;, the input tag contact:website=* has precedence over other input tags when being written to the output tag url=*.</description>
    <match mode="or">
      <tag k="url"             match_id="u"  v=".*"/>
      <tag k="website"         match_id="w"  v=".*"/>
      <tag k="contact:website" match_id="cw" v=".*"/>
    </match>
    <output>
      <copy-unmatched/>
      <tag from_match="u"  k="url" v="{0}"/>
      <tag from_match="w"  k="url" v="{0}"/>
      <tag from_match="cw" k="url" v="{0}"/>
    </output>
  </translation>
</translations>


Many applications require considerably fewer access types than are in use (some of which are frequently mistyped):

<?xml version="1.0"?>
<translations>

  <translation>
    <name>Simplify Access</name>
    <description>
      Should simplify the access restrictions to yes/no.
      As we use ".*" it will get "access", "foot", "vehicle" etc.
      but also keys which have nothing to do with access
      as long as their value is mentioned in v="...".
      We could limit for specific keys, but let's live dangerously.
    </description>
    <match mode="or">
      <tag k=".*" match_id="yes" v="true|designated|public|permissive"/>
      <tag k=".*" match_id="no" v="false|private|privat"/>
    </match>
    <output>
      <copy-all/>
      <tag from_match="yes" v="yes"/>
      <tag from_match="no" v="no"/>
    </output>
  </translation>

</translations>

(XML surround omitted from now on for clarity)

Convert a crossing tagged using the wiki-voted crossing scheme into the heavily used crossing=toucan used in rendering the cyclemap:

<translation>
  <name>->Toucan</name>
  <description>Convert wiki-voted crossings to toucans, and short-cut the crossing_ref case too</description>
  <match mode="or" type="node">
    <match>
      <tag k="crossing" v="traffic_signals"/>
      <tag k="bicycle" v="yes"/>
    </match>
    <tag k="crossing_ref" v="toucan"/>
  </match>
  <output>
    <copy-all/>
    <tag k="crossing" v="toucan"/>
  </output>
</translation>

There have been many suggested ways of entering cycle routes. We tend to use relations now, but let's regularise the legacy way tagging to ensure ncn=yes is placed on all ways:

<translation>
  <name>NCN</name>
  <description>Find all the ncn way variations and tag consistently</description>
  <match>
    <match mode="or">
      <!-- this matches route=ncn, as well as route=bus;ncn etc. -->
      <tag k="route" v="(.*;|^)ncn(;.*|$)" match_id="route"/>
      <!-- sometimes ncn_ref has been specified without ncn=yes -->
      <tag k="ncn_ref" v=".*"/>
    </match>
    <!-- don't match where ncn was already set to something else -->
    <notag k="ncn" v=".*"/>
  </match>
  <output>
    <copy-all/>
    <tag k="ncn" v="yes"/>
    <!-- output the route tag, but without the ncn part -->
    <tag k="route" from_match="route" v="{1}{2}"/>
  </output>
</translation>

I might not like the prefixes used by the piste:lift scheme for whatever reason. Let's remove them, but only on things which are definitely piste lifts:

<translation>
  <name>Arbitrary Piste Remapping</name>
  <description>Remap the piste:lift:* style tags to reduce tag length and remove colons which aren't playing nice with tool X</description>
  <match type="way">
    <tag k="piste:lift" v=".*" match_id="type"/>
  </match>
  <find>
    <tag k="piste:lift:(.*)" v=".*" match_id="piste_attr"/>
  </find>
  <output>
    <copy-unmatched/>
    <tag from_match="type" k="piste_lift"/>
    <tag from_match="piste_attr" k="{1}" v="{0}"/>
  </output>
</translation>

Sometimes it is useful to know the number of inhabitants of a certain area. This example takes the number of inhabitants from a CSV file with postal codes in the first column and number of inhabitants in the second column and creates a new tag with the name "inhabitants".

<translation>
  <name>Join postal codes</name>
  <description>Join postal codes from CSV file to relations which are annotated with one</description>
  <data>
    <!-- Look up postal_code in the first column, return the second. Fallback value is 0 inhabitants. -->
    <source source_id="postal_code" type="CSV" file="postal_code.csv" csvFormat="Excel" lookup="0" return="1" fallback="0"/>
  </data>

  <match type="relation" mode="or">
    <tag match_id="pc" k="postal_code" v=".*"/>
  </match>
  <output>
    <copy-all/>
    <!-- The postal code (in {0}) will be replaced by the number of inhabitants -->
    <tag from_match="pc" value_datasource="postal_code" k="inhabitants" v="{0}"/>
  </output>
</translation>

And when you master all of this you get to say "Everybody stand back..."