O5m

From OpenStreetMap Wiki

The .o5m data format was designed as a compromise between the .osm and .pbf formats. It has the same structure as the .osm format, so input and output procedures of existing OSM data processing applications can be adapted to it with little effort. The data coding shows a lot of similarities to the .pbf format; hence, there is nearly no difference in file size – gzip compression assumed.

Motivation

There are already several data formats for OSM, two of them well-established: .osm and .pbf. Why another one? Let's have a closer look at the two most common formats and try to determine their strengths and weaknesses.

Conventional formats

.osm

This data format is human-readable because it is written in XML. Usually, the objects in the file are ordered by type (node, way, relation) and then by id. The XML format comes with some advantages and disadvantages:

Advantages
  • human-readable
  • random access
  • can be edited easily with common text editors (when not too big)
  • can be compressed using every packer program you want (see below)
  • flat hierarchy, can be processed as data stream
  • two or more files can be merged easily
Disadvantages
  • huge file size
  • slow when being processed
  • slow when being written (because of enormous file size)

.osm.bz2

Advantages
  • smaller file size
  • some random access
  • flat hierarchy, can be processed as data stream
  • two or more files can be merged easily
Disadvantages
  • not human-readable
  • can't be edited easily with common text editors
  • slow when being processed
  • very slow when being written (due to compression)

.pbf

This optimized format was introduced to eliminate some of the .osm format's disadvantages. It is somewhat flexible; it could, for instance, allow geo-regional clustering, which would make it possible to extract your region's data from a larger file. (That would mean breaking the usual id-ordered sequence.)

The usual .pbf file which can be downloaded from various locations contains its objects in the same order as the conventional .osm file does.

Let's try to evaluate this format:

Advantages
  • small file size
  • can be processed much faster than .osm
  • built-in compression obviates need for external compression
Disadvantages
  • not human-readable
  • cannot be edited easily
  • no significant random access as currently implemented
  • applications need the zlib compression library

Why a new Format?

The new format tries to combine the advantages of both established formats, .osm and .pbf. Main goals are:

  • small file size
  • fast processing
  • flat hierarchy, processable as data stream
  • easy merging of two or more files
  • user may choose a compression method and compress the file

Unfortunately, fast data processing comes at a price. We will have to live with a few disadvantages:

  • no human-readability
  • no editing with common text editors

Also, there is currently no mechanism for random access in an o5m file, though this could easily be added if needed.

Format description

The structure of the new .o5m format is similar to the structure of conventional .osm format: Objects are stored using a strict sequence. First all nodes, then all ways, and finally all relations. Within each of these three groups, the objects are sequenced by their ids, in ascending order.

Every number is packed, using the Varint format you might know from .pbf files. Character strings are enclosed in zero-bytes, or referenced if repeated.

Basics

To understand the format it is most useful to know how numbers and strings are stored.

Numbers

To store numbers of different lengths, we sacrifice bit 7 (the most significant bit) of every byte and use it as a length indicator. This indicator – when set to 1 – tells us that the next byte belongs to the same number. The first byte of such a multi-byte number contains the least significant 7 bits, the last byte the most significant 7 bits. For example:

  • 0x05 = 5
  • 0x7f = 127
  • 0xc3 0x02 = 323
  • 0x80 0x80 0x01 = 16384
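The decoding rule above can be sketched in a few lines of Python (an illustrative sketch, not part of the format specification; the function name is invented):

```python
def read_uint(data):
    """Decode one unsigned o5m/Varint number from the start of `data`.

    Returns the value and the number of bytes consumed. Bit 7 of each
    byte is the continuation flag; the payload arrives least significant
    7 bits first.
    """
    value, shift = 0, 0
    for i, b in enumerate(data):
        value |= (b & 0x7f) << shift
        if not b & 0x80:        # continuation bit clear: last byte
            return value, i + 1
        shift += 7
    raise ValueError("truncated number")
```

For instance, `read_uint(bytes([0xc3, 0x02]))` yields `(323, 2)`, matching the example above.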

If a number is stored as "signed", we need 1 bit for the sign. For this purpose, the least significant bit of the least significant byte is taken as the sign bit: 0 means positive, 1 means negative. We do not need the number -0, of course, so we can shift the range of negative numbers by one. Some examples:

  • 0x08 = +4
  • 0x80 0x01 = +64
  • 0x03 = -2
  • 0x05 = -3
  • 0x81 0x01 = -65
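The corresponding signed decoding can be sketched as follows (again a hypothetical, self-contained helper):

```python
def read_sint(data):
    """Decode one signed o5m number.

    Bit 0 of the decoded unsigned value is the sign; negative values
    are shifted by one so that -0 is never needed.
    """
    value, shift, n = 0, 0, 0
    for b in data:
        n += 1
        value |= (b & 0x7f) << shift
        if not b & 0x80:
            break
        shift += 7
    else:
        raise ValueError("truncated number")
    if value & 1:               # sign bit set: negative
        return -(value >> 1) - 1, n
    return value >> 1, n
```

For instance, `read_sint(bytes([0x81, 0x01]))` yields `(-65, 2)`, matching the last example above.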

To interpret a stored number correctly, we must know if it is stored as a signed or as an unsigned number. For this reason, the format we want to use to store OSM information must be defined very accurately.

As you can see, small numbers require less space than large numbers. Our data format therefore tries to avoid large numbers. To accomplish this, we use a trick introduced by the .pbf format: so-called delta coding. Where numbers usually differ only slightly from each other, we store just the difference between them. For example, assume we want to store the ids of a few nodes, say 123000, 123050, 123055. These are stored as three signed integer values: +123000, +50, +5.

The described number formats support integers only. To store decimals we use fixed-point representation. Latitudes and longitudes are stored as 100-nanodegree values, i.e. the decimal point is moved 7 places to the right. Be aware that applications which create or read .o5m files use 32-bit signed integer values instead of 64-bit ones when delta-coding longitude values. For this reason you will not get any delta values beyond the range of 32-bit signed integers, even in cases where you would expect them – for example if the longitude jumps from 179 to -179 degrees (a delta of -358 degrees). The stored delta value for these -358 degrees is a positive value (714,967,296 or 0x2a9d8900), because in 32-bit arithmetic 1,790,000,000 + 714,967,296 gives an overflow, and the result is -1,790,000,000, which is exactly the value we want. In 64-bit arithmetic, we would store a negative value (-3,580,000,000 or 0xffffffff2a9d8900). Obviously we only need the last 32 bits of this value, which is exactly the value that is stored in 32-bit arithmetic.
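The 32-bit wraparound can be reproduced with plain integer masking; a sketch of the arithmetic described above (names are invented, this is not normative):

```python
M32 = 1 << 32

def to_int32(x):
    """Reduce an integer to the 32-bit two's-complement range."""
    x &= M32 - 1
    return x - M32 if x >= M32 // 2 else x

# Longitude jump from +179 to -179 degrees, in 100-nanodegree units:
delta64 = -1_790_000_000 - 1_790_000_000      # -3,580,000,000 in 64 bit
stored = to_int32(delta64)                    # wraps to +714,967,296
restored = to_int32(1_790_000_000 + stored)   # 32-bit addition wraps back
```

`stored` comes out as 714,967,296 (0x2a9d8900) and `restored` as -1,790,000,000, reproducing the numbers in the paragraph above.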

Strings

Character strings are not packed; they are stored "as is" (coded as UTF-8). In general, strings are stored in pairs. To mark the beginning, the end, and the position between the two elements of a string pair, we use zero-bytes. In case there is no pair but just a single string, the second element is omitted. User names are packed together with their user id into a single string pair. For example:

  • "oneway"/"yes" is coded as 0x00 0x6f 0x6e 0x65 0x77 0x61 0x79 0x00 0x79 0x65 0x73 0x00
  • "1inner"[1] is coded as 0x00 0x31 0x69 0x6e 0x6e 0x65 0x72 0x00
  • uid=1020,"John"[2] is coded as 0x00 0xfc 0x07 0x00 0x4a 0x6f 0x68 0x6e 0x00
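The zero-byte framing can be sketched like this (a simplified illustration; `encode_pair` is an invented name, and string-table references are ignored here):

```python
def encode_pair(first: bytes, second: bytes = None) -> bytes:
    """Frame a string pair (or a single string) with zero-bytes."""
    out = b"\x00" + first + b"\x00"
    if second is not None:
        out += second + b"\x00"
    return out
```

`encode_pair(b"oneway", b"yes").hex()` reproduces the first example above; for the uid case, the first element is the Varint-coded user id, e.g. `encode_pair(bytes([0xfc, 0x07]), b"John")`.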

To do this with every string would cost a lot of space. Fortunately, most character strings in OSM come in pairs and are repeated over and over – take "highway"/"residential" or "building"/"yes", for example. This allows us to use references whenever we reuse a previously encountered pair of strings.

To refer to a string pair which has already been defined in the way shown above, we count the string pair definitions from the present location in the file back to the definition which matches our string pair. This count is stored as an unsigned number, using the method described in the previous section.

To limit the maximum amount of memory the decoder program must allocate, there is a limit on the number of different string pairs to which we can refer, and a limit on their length:

  • maximum value of string reference number[3]: 15,000
  • maximum length of a string pair which is stored in the reference table[4]: 250 characters in total

When reading an .o5m coded file, we use a reference table which has 15,000 lines of 250+2 characters each (for performance reasons: 256 characters). Every string pair we encounter is copied into the table, with one exception: string pairs which are longer than 250 characters are interpreted but not copied into the table.

If there is no more space in the table to hold a new string pair, the oldest string pair is overwritten. The table works as FIRO (First In, Random Out). Addressing starts with 1, which means "latest stored string pair". Address 2 means "second latest stored string pair", and so on.

Note that strings are generally stored as string pairs, even if defined as single string.

Here is an example of how a few strings are stored:

  • "oneway"/"yes"
  • "atm"/"no"
  • "oneway"/"yes"
  • uid=1020,"John"
  • "atm"/"no"
  • "oneway"/"yes"
  • uid=1020,"John"

They are coded as follows:

  • 0x00 0x6f 0x6e 0x65 0x77 0x61 0x79 0x00 0x79 0x65 0x73 0x00
  • 0x00 0x61 0x74 0x6d 0x00 0x6e 0x6f 0x00
  • 0x02
  • 0x00 0xfc 0x07 0x00 0x4a 0x6f 0x68 0x6e 0x00
  • 0x02
  • 0x03
  • 0x01
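The FIRO table logic behind this example can be sketched as follows (an illustrative model, not a byte-exact encoder; table size and pair length follow the limits above):

```python
def encode_stream(pairs, table_size=15000, max_len=250):
    """Replace repeated string pairs by back-references.

    `pairs` is a sequence of (first, second) tuples; the result contains
    either the pair itself (first occurrence) or the back-reference
    number (1 = latest stored pair).
    """
    table = []   # insertion order; table[-1] is the latest entry
    out = []
    for pair in pairs:
        if pair in table:
            out.append(len(table) - table.index(pair))
        else:
            out.append(pair)
            if len(pair[0]) + len(pair[1]) <= max_len:
                table.append(pair)       # store for later references
                if len(table) > table_size:
                    table.pop(0)         # drop the oldest entry
    return out
```

Applied to the seven pairs above (with the packed uid standing in as an ordinary string), it yields the same reference numbers 2, 2, 3, 1.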

File

Each .o5m file starts with a byte 0xff and ends with byte 0xfe. Every dataset starts with a specific byte:

  • 0x10: node
  • 0x11: way
  • 0x12: relation
  • 0xdb: bounding box
  • 0xdc: file timestamp
  • 0xe0: header, should follow the first 0xff in the file; contents: 0x04 0x6f 0x35 0x6d 0x32 ("o5m2"), or 0x04 0x6f 0x35 0x63 0x32 ("o5c2") for .o5m change files
  • 0xee: Sync
  • 0xef: Jump
  • 0xff: (not a dataset) when somewhere in the file contents: Reset

The second byte of every dataset, and, if necessary because the value is larger than 127, the following byte(s), contain the length information, coded as unsigned number (see above). If the decoding program does not understand a specific dataset, it shall jump over its contents. The length information does not include the length byte(s) itself, nor the start byte of the dataset.

Note that there is no length information for datasets whose start byte is in the range 0xf0 to 0xff, so the program must simply skip that single byte.
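The dataset framing just described can be sketched as a small walker (illustrative only; the length field's varint decoding is inlined):

```python
def datasets(data):
    """Yield (type_byte, payload) for each dataset in `data`.

    Bytes 0xf0..0xff carry no length field and therefore no payload;
    every other dataset is skipped over by its length information, so
    unknown dataset types are handled automatically.
    """
    pos = 0
    while pos < len(data):
        dtype = data[pos]
        pos += 1
        if dtype >= 0xf0:            # e.g. 0xff (Reset): a single byte
            yield dtype, b""
            continue
        length, shift = 0, 0         # unsigned varint length field
        while True:
            b = data[pos]
            pos += 1
            length |= (b & 0x7f) << shift
            if not b & 0x80:
                break
            shift += 7
        yield dtype, data[pos:pos + length]
        pos += length
```

For example, `list(datasets(bytes([0xff, 0x10, 0x02, 0xaa, 0xbb])))` yields a Reset marker followed by one node dataset with a two-byte payload.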

Node

Every node dataset starts with 0x10, followed by the dataset length and the data. The data fields:

  • id (signed, delta-coded)
  • either 0x00 or version information:
    • version (unsigned)
    • timestamp (seconds since 1970, signed, delta-coded)
    • author information – only if timestamp is not 0:
      • changeset (signed, delta-coded)
      • uid, user (string pair)
  • longitude (signed, delta-coded)
  • latitude (signed, delta-coded)
  • tags:
    • key, val (string pair)
    • key, val (string pair)
    • ...

For example:

  • 0x10 – node
  • 0x21 – length of following data of this node: 33 bytes
  • 0xce 0xad 0x0f – id: 0+125799=125799
  • 0x05 – version: 5
  • 0xe4 0x8e 0xa7 0xca 0x09 – timestamp: 2010-09-30T19:23:30Z
  • 0x94 0xfe 0xd2 0x05 – changeset: 0+5922698=5922698
  • 0x00 – string pair:
    • 0x85 0xe3 0x02 0x00 – uid: 45445
    • 0x55 0x53 0x63 0x68 0x61 0x00 – user: "UScha"
  • 0x86 0x87 0xe6 0x53 – lon: 0+8.7867843=8.7867843
  • 0xcc 0xe2 0x94 0xfa 0x03 – lat: 0+53.0749606=53.0749606
  • 0x10 – node
  • 0x0d – length of following data of this node: 13 bytes
  • 0x02 – id: 125799+1=125800
  • 0x0a – version: 10
  • 0xd2 0x1f – timestamp: 2010-09-30T19:23:30Z+2025s=2010-09-30T19:57:15Z
  • 0xe2 0x04 – changeset: 5922698+305=5923003
  • 0x01 – string pair: reference 1
    • – uid: 45445
    • – user: "UScha"
  • 0x89 0xae 0x03 – lon: 8.7867843-0.0027525=8.7840318
  • 0xe5 0xd8 0x03 – lat: 53.0749606-0.0030259=53.0719347

The example in OSM XML format:

<node id="125799" lat="53.0749606" lon="8.7867843" version="5" changeset="5922698" user="UScha" uid="45445" timestamp="2010-09-30T19:23:30Z"/>
<node id="125800" lat="53.0719347" lon="8.7840318" version="10" changeset="5923003" user="UScha" uid="45445" timestamp="2010-09-30T19:57:15Z"/>
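A decoder for the first node dataset above can be sketched like this (illustrative only: delta bases start at 0, tags are ignored, and the string reference table is omitted, so only a node with an inline uid/user pair can be decoded):

```python
def read_uint(data, pos):
    """Unsigned varint starting at `pos`; returns (value, new_pos)."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7f) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def read_sint(data, pos):
    """Signed variant: bit 0 is the sign, negatives shifted by one."""
    u, pos = read_uint(data, pos)
    return (-(u >> 1) - 1 if u & 1 else u >> 1), pos

def decode_node(payload):
    node = {}
    node["id"], pos = read_sint(payload, 0)
    node["version"], pos = read_uint(payload, pos)
    if node["version"]:                           # 0x00 would mean: none
        node["timestamp"], pos = read_sint(payload, pos)
        if node["timestamp"]:                     # author info follows
            node["changeset"], pos = read_sint(payload, pos)
            pos += 1                              # 0x00: string pair start
            node["uid"], pos = read_uint(payload, pos)
            pos += 1                              # 0x00 after the uid string
            end = payload.index(0x00, pos)
            node["user"] = payload[pos:end].decode()
            pos = end + 1
    lon, pos = read_sint(payload, pos)
    lat, pos = read_sint(payload, pos)
    node["lon"], node["lat"] = lon / 10**7, lat / 10**7
    return node                                   # remaining bytes: tags

# The 33 payload bytes of the first example node:
payload = bytes([0xce, 0xad, 0x0f, 0x05, 0xe4, 0x8e, 0xa7, 0xca, 0x09,
                 0x94, 0xfe, 0xd2, 0x05, 0x00, 0x85, 0xe3, 0x02, 0x00,
                 0x55, 0x53, 0x63, 0x68, 0x61, 0x00,
                 0x86, 0x87, 0xe6, 0x53, 0xcc, 0xe2, 0x94, 0xfa, 0x03])
```

`decode_node(payload)` returns id 125799, version 5, changeset 5922698, user "UScha", lon 8.7867843, lat 53.0749606 – matching the XML above.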

Note that it is allowed to shorten a dataset by decreasing its length (the 2nd byte of the dataset). The decoding program must cope with such clipped datasets and accept that they do not contain key/val pairs, latitude/longitude, or even author information. If a dataset is clipped in such a way that only its id (and maybe version and author information) is left, the program shall perform a delete action and delete the object with this id (see below, section .o5c).

Way

Every way dataset starts with 0x11, followed by the dataset length and the data. The data fields:

  • id (signed, delta-coded)
  • either 0x00 or version information:
    • version (unsigned)
    • timestamp (seconds since 1970, signed, delta-coded)
    • author information – only if timestamp is not 0:
      • changeset (signed, delta-coded)
      • uid, user (string pair)
  • length of the references section (unsigned)
  • node references section (if length >0):
    • id of a referenced node (signed, delta-coded)
    • id of a referenced node (signed, delta-coded)
    • ...
  • tags:
    • key, val (string pair)
    • key, val (string pair)
    • ...

For example:

  • 0x11 – way
  • 0x20 – length of following data of this way: 32 bytes
  • 0xec 0x9b 0xe8 0x03 – id: 0+3999478=3999478
  • 0x00 – no version and no author information
  • 0x07 – length of node references area: 7 bytes
    • 0xce 0xb9 0xfe 0x13 – referenced node: 0+20958823=20958823
    • 0xce 0xeb 0x01 – referenced node: 20958823+15079=20973902
  • 0x00 – string pair:
    • 0x68 0x69 0x67 0x68 0x77 0x61 0x79 0x00 – key: "highway"
    • 0x73 0x65 0x63 0x6f 0x6e 0x64 0x61 0x72 0x79 0x00 – val: "secondary"

The example in OSM XML format:

<way id="3999478">
  <nd ref="20958823"/>
  <nd ref="20973902"/>
  <tag k="highway" v="secondary"/>
</way>

Relation

Every relation dataset starts with 0x12, followed by the dataset length and the data. The data fields:

  • id (signed, delta-coded)
  • either 0x00 or version information:
    • version (unsigned)
    • timestamp (seconds since 1970, signed, delta-coded)
    • author information – only if timestamp is not 0:
      • changeset (signed, delta-coded)
      • uid, user (string pair)
  • length of the references section (unsigned)
  • references section (if length >0):
    • reference information:
      • id of the referenced object (signed, delta-coded)
      • object type ("0".."2") concatenated with role (single string)
    • reference information:
      • id of the referenced object (signed, delta-coded)
      • object type ("0".."2") concatenated with role (single string)
    • ...
  • tags:
    • key, val (string pair)
    • key, val (string pair)
    • ...

For example:

  • 0x12 – relation
  • 0x28 – length of following data of this relation: 40 bytes
  • 0x90 0x2e – id: 0+2952=2952
  • 0x00 – no version and no author information
  • 0x11 – length of references section: 17 bytes
    • reference information:
      • 0xf4 0x98 0x83 0x0b – id: 0+11560506=11560506
      • 0x00 – string pair:
        • 0x31 – type: way
        • 0x69 0x6e 0x6e 0x65 0x72 0x00 – role: "inner"
    • reference information:
      • 0xca 0x93 0xd3 0x0d – id: 11560506+14312677=25873183
      • 0x01 – string pair: reference 1
        • – type: way
        • – role: "inner"
  • 0x00 – string pair:
    • 0x74 0x79 0x70 0x65 0x00 – key: "type"
    • 0x6d 0x75 0x6c 0x74 0x69 0x70 0x6f 0x6c 0x79 0x67 0x6f 0x6e 0x00 – val: "multipolygon"

The example in OSM XML format:

<relation id="2952">
  <member type="way" ref="11560506" role="inner"/>
  <member type="way" ref="25873183" role="inner"/>
  <tag k="type" v="multipolygon" />
</relation>

File Timestamp

This is an optional dataset. It starts with 0xdc, followed by the dataset length and the timestamp of the file, in seconds since Jan 01 1970:

  • file timestamp (signed)

If this dataset is used in a file, it must be stored before any OSM objects, i.e. before any nodes, ways, or relations.

Bounding Box

This (optional) dataset starts with 0xdb, followed by the dataset length and the bounding box coordinates:

  • south-western corner
    • x1 (longitude, signed)
    • y1 (latitude, signed)
  • north-eastern corner
    • x2 (longitude, signed)
    • y2 (latitude, signed)

Similar to the file timestamp (see above), this must be stored before any OSM objects, i.e. before any nodes, ways, or relations.

Reset

The reset byte 0xff tells the decoder to reset its counters: from this point on, every delta coding starts from 0 again, and no string references may refer back to file contents before the reset byte.

This mechanism makes it possible to split an .o5m file into parts which can be processed in parallel by different threads. Usually, at least the start of the ways section and the start of the relations section are preceded by Reset bytes.

Note that 0xff does not initiate a dataset; therefore no length information will follow the 0xff.

Sync

If you need to process OSM data as a stream, you need positions you can synchronize on. These sync points are specified in such a way that you can find them by scanning the data stream for runs of zero bytes. Every Sync dataset must be followed by a Reset byte; otherwise you would not be able to decode the subsequently stored OSM datasets if delta coding is used. Syntax:

  • 0xee 0x07 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff

If you are developing a parser which does not use this synchronization mechanism, you do not need to care about Sync datasets: they will be recognized as "unknown datasets" and therefore be skipped.

Jump

To get random access not only to the start of one of the sections (start of nodes, start of ways, start of relations), you can define additional jump points as Jump datasets. These jump points allow you to move forward and backward in the file very quickly.

The Jump dataset starts with 0xef, followed by the dataset length and the data. The data fields:

  • distance to the next Jump dataset (from 0xef to 0xef in bytes, unsigned)
  • distance to the previous Jump dataset (from 0xef to 0xef in bytes, unsigned)
  • some bytes of unused memory space (in case the distances are entered later and the space needed for these values is not known at the time the Jump dataset is written); these bytes must be set to 0xff

Every Jump dataset must be followed by a Reset byte. Otherwise you would not be able to decode the subsequently stored OSM datasets if delta coding is used. For Example:

  • 0xef 0x07 0x80 0x04 0x40 0xff 0xff 0xff 0xff 0xff – 512 forward, 64 backward

If you are developing a parser which does not use this jump mechanism, you do not need to care about Jump datasets: they will be recognized as "unknown datasets" and therefore be skipped.

File Size Comparison

The size of OSM data files depends on the format being used. The 2011-May-12 Germany file from geofabrik.de is used as the basis for this comparison. The .gz and .7z compressions were done with default parameters (medium compression strength); .pbf uses internal zlib compression.


OSM File Format            Size            Relative  Reading Time
and Compression Type       in Bytes        Size      (slow computer)
germany.osm                15,519,707,799  100.0 %   604 s
germany.osm.bz2             1,442,403,577    9.3 %
germany.o5m                 1,469,972,938    9.5 %    36 s
germany.o5m.gz                949,544,868    6.1 %
germany.o5m.7z                851,845,615    5.5 %
germany.pbf                   948,980,117    6.1 %    90 s
germany.pbf.gz[5]             949,117,868    6.1 %
germany.pbf.bz2[5]            953,355,361    6.1 %
germany.pbf.7z[5]             959,453,561    6.2 %


If you discard meta data (timestamp, user name, etc.), the file sizes will decrease:

OSM File Format                      Size           Relative
and Compression Type                 in Bytes       Size
germany.osm (excl. meta data)        7,937,414,195  51.1 %
germany.o5m (excl. meta data)        1,046,676,637   6.7 %
germany.o5m.gz (excl. meta data)       730,187,675   4.7 %
germany.o5m.7z (excl. meta data)       647,320,555   4.2 %
germany.pbf (excl. meta data)[6][7]    697,041,975   4.5 %

Further Information

Why that strange Name?

The name o5m was chosen because it looks a bit like osm. The digit 5 stands for "5 times smaller than .osm". By now we know that the factor is about 10, not 5 – but how silly would o10m sound?

.o5c as OSM Change Format

You can easily use the .o5m format to store an OSM change file. There is no difference in the file's data structure, with one exception: the file header contains "o5c2" instead of "o5m2". For the file name, it is recommended to use the extension .o5c instead of .o5m.

What about the <create>, <modify> and <delete> tags well-known from the .osc format? These tags serve no real purpose – unless you want to use them for plausibility checks. Create and modify result in the same action: the new version of the referred object is stored; so we simply take the object's version as it comes with the latest change file and store it. Delete is a special action for which we would otherwise have to define a dedicated .o5c object type. However, to delete an object we need nothing but its id. Therefore we decide: if an object is stored in the file with nothing but its id (and maybe its version or author information), the object referred to by this id shall be deleted.

Future Extensions

What will we do if a new OSM API imposes additional requirements on the data format? This should not be a problem. There are 223 possible o5m dataset ids in the range 0x01 to 0xdf; presently, only a few of them are in use, e.g. 0x10 for nodes, 0x11 for ways and 0x12 for relations. The Varint number format and the string format may also be used for any new purpose.

Is there an Example Implementation?

The file format has been described completely, hence you have the means to implement an .o5m reader or writer from scratch – in any programming language you like. Nevertheless it might be useful to have a look at .o5m reader example code (written in C) or at the sources of osmconvert, osmfilter, etc.

Software supporting .o5m

The data format is quite new, therefore this list is still short.

Programs that already support .o5m

  • osmconvert – format conversions between .osm, .osc, .osh, .pbf, .o5m, .o5c; applying change files and border polygons, creating diffs
  • osmfilter – filtering OSM objects and tags, considers hierarchical dependencies
  • osmupdate – updates OSM data files via Internet download
  • navit map tool – import interface for OSM data into navit
  • osm2po – import interface for OSM data into osm2po routing engine
  • osm2pgsql – import of OSM data into PostgreSQL database (experimental at present, not yet in svn)
  • o5mreader – C library that parses OpenStreetMap data in O5M format.
  • mkgmap – Making maps from OpenStreetMap for Garmin devices
  • splitter – Tile splitter for mkgmap
  • o5m4j – reader for o5m data, written in Java
  • OSMemory – Java library for validators

Toolchains using .o5m

Footnotes

  1. Within relations, the digit "1" in the string "1inner" indicates the referenced object's type. "0" means node, "1" way and "2" relation.
  2. The user id is coded as unsigned Varint and packed into the first string of this string pair. Coding the id as Varint helps to save a few bytes of space.
  3. The limit of 15,000 was chosen because from 16,384 and above, three instead of two bytes would be necessary to store the reference value.
  4. The limit of 250 characters was chosen because very few strings exceed this maximum. Internally, the string table will need a row size of 252 characters because the terminators must be stored too. It is recommended to define a row size of 256 characters due to faster calculations when accessing the table via index.
  5. As .pbf files are already compressed internally (with the zlib algorithm), they usually grow in file size if you try to compress them a second time. This is normal behaviour and applies to compressed files in general.
  6. File generated with osmosis option omitmetadata=true.
  7. Generated from a slightly newer, larger germany.osm; the number given here is a hypothetical value adjusted by the same factor.