O5m

From OpenStreetMap Wiki

The .o5m data format was designed as a compromise between the .osm and .pbf formats. It has the same structure as the .osm format, so the input and output procedures of existing OSM data processing applications can be adapted to it with little effort. The data coding has a lot in common with the .pbf format. Hence, there is almost no difference in file size, assuming gzip compression.

Motivation

There already are some data formats for OSM, two of them well-established: .osm and .pbf. Why another one? Let's have a closer look at the two most common formats and try to determine their strengths and weaknesses.

Conventional formats

.osm

This data format is human-readable because it is written in XML. Usually, the objects in the file are ordered by type (node, way, relation) and then by id. The XML format comes with some advantages and disadvantages:

Advantages
Disadvantages

.osm.bz2

Advantages
Disadvantages

.pbf

This optimized format was introduced to eliminate some of the .osm format's disadvantages. It is somewhat flexible: it could, for instance, allow geo-regional clustering, which would make it possible to pick your region's data from a larger file. (That would mean breaking the usual id-ordered sequence.)

The usual .pbf file, which can be downloaded from different locations, contains its objects in the same sequence as the conventional .osm file does.

Let's try to evaluate this format:

Advantages
Disadvantages

Why a new Format?

The new format tries to combine the advantages of both established formats, .osm and .pbf. Main goals are:

Unfortunately, in order to speed up data processing, certain goals cannot be reached. We will have to live with a few disadvantages:

Also, there is currently no mechanism for random access in an o5m file, though this could easily be added.

Format description

The structure of the new .o5m format is similar to the structure of conventional .osm format: Objects are stored using a strict sequence. First all nodes, then all ways, and finally all relations. Within each of these three groups, the objects are sequenced by their ids, in ascending order.

Every number is packed using the Varint format, which you may know from .pbf files. Character strings are enclosed in zero bytes, or referenced if repeated.

Basics

To understand the format it is most useful to know how numbers and strings are stored.

Numbers

To store numbers of different lengths we abandon bit 7 (most significant bit) of every byte and use this bit as a length indicator. This indicator – when set to 1 – tells us that the next byte belongs to the same number. The first byte of such a long number contains the least significant 7 bits, the last byte the most significant 7 bits. For example:
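The unsigned coding described above can be sketched in a few lines of Python. This is only an illustration, not code taken from an existing .o5m tool; the function names encode_uvarint and decode_uvarint are made up for this example.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer, least significant 7 bits first."""
    if value < 0:
        raise ValueError("unsigned numbers must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # bit 7 set: another byte follows
        value >>= 7
    out.append(value)  # last byte: bit 7 clear
    return bytes(out)


def decode_uvarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position of the first byte after the number)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # bit 7 clear: this was the last byte
            return value, pos
        shift += 7


print(encode_uvarint(5).hex())      # 05
print(encode_uvarint(127).hex())    # 7f
print(encode_uvarint(323).hex())    # c302
print(encode_uvarint(16384).hex())  # 808001
```

Values up to 127 fit into one byte, values up to 16,383 into two bytes.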

If a number is stored as "signed", we need 1 bit for the sign. For this purpose, the least significant bit of the least significant byte is used as the sign bit: 0 means positive, 1 means negative. We do not need the number -0, of course, so we can shift the range of negative numbers by one. Some examples:
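A minimal Python sketch of the signed coding follows. It assumes that "shifting the range of negative numbers by one" means that -1 is stored with magnitude 0, -2 with magnitude 1, and so on; the function names are hypothetical.

```python
def encode_svarint(value: int) -> bytes:
    """Encode a signed integer: sign in bit 0 of the first byte."""
    if value >= 0:
        raw = value << 1                 # sign bit 0
    else:
        raw = ((-value - 1) << 1) | 1    # sign bit 1, range shifted by one
    out = bytearray()
    while raw > 0x7F:
        out.append((raw & 0x7F) | 0x80)  # bit 7 set: another byte follows
        raw >>= 7
    out.append(raw)
    return bytes(out)


def decode_svarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, position of the first byte after the number)."""
    raw = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    magnitude = raw >> 1
    return (-magnitude - 1 if raw & 1 else magnitude), pos


print(encode_svarint(4).hex())    # 08
print(encode_svarint(64).hex())   # 8001
print(encode_svarint(-2).hex())   # 03
print(encode_svarint(-65).hex())  # 8101
```

Because one bit is spent on the sign, a single byte covers only the range -64 to +63.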

To interpret a stored number correctly, we must know if it is stored as a signed or as an unsigned number. For this reason, the format we want to use to store OSM information must be defined very accurately.

As you can see, small numbers require less space than large numbers, so our data format tries to avoid large numbers. To accomplish this, we use a trick introduced by the .pbf format: so-called delta coding. Especially where numbers usually differ only slightly from each other, we store only the difference between consecutive numbers. For example, assume we want to store the ids of a few nodes, say 123000, 123050, 123055. These are stored as three signed integer values: +123000, +50, +5.
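The node id example can be reproduced with a short Python sketch. The function names are made up for illustration; the starting value of 0 matches the statement in the Reset section that delta coding starts from 0.

```python
def delta_encode(values):
    """Replace each value by its difference to the previous one (start: 0)."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by summing up the differences."""
    total = 0
    values = []
    for delta in deltas:
        total += delta
        values.append(total)
    return values


print(delta_encode([123000, 123050, 123055]))  # [123000, 50, 5]
print(delta_decode([123000, 50, 5]))           # [123000, 123050, 123055]
```

The differences must be stored as signed numbers because a later value may be smaller than the previous one, for instance with node references inside a way.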

The described number formats support integers only. To store decimals we use fixed-point representation. Latitudes and longitudes are stored as 100 nanodegree values, i.e. the decimal point is moved 7 places to the right.
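As a small illustration of the fixed-point representation, the following Python sketch converts degrees to 100-nanodegree integers and back. The function names are hypothetical, and rounding to the nearest integer is an assumption; the text above does not say how surplus decimal places are handled.

```python
SCALE = 10_000_000  # 7 decimal places, i.e. units of 100 nanodegrees


def to_fixed(degrees: float) -> int:
    """Convert a coordinate in degrees to a 100-nanodegree integer."""
    return round(degrees * SCALE)


def from_fixed(value: int) -> float:
    """Convert a 100-nanodegree integer back to degrees."""
    return value / SCALE


print(to_fixed(53.0749606))  # 530749606
print(to_fixed(8.7867843))   # 87867843
```

The resulting integers are then stored as signed numbers, so that negative latitudes and longitudes can be represented as well.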

Strings

Character strings are not packed, they are stored "as is" (coded as UTF-8). In general, strings are stored in pairs. To mark the beginning, the end, and the position between the two elements of a string pair, we use zero-bytes. In case there is no pair but just a single string, the second element will be omitted. User names are packed with their user id into a single string. For example:
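The byte layout described above can be sketched in Python as follows. All function names are invented for this example, and the sketch does not treat the user id 0 specially, although its Varint coding is itself a zero byte.

```python
def encode_uvarint(value: int) -> bytes:
    """Unsigned Varint as described in the section on numbers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_pair(first: str, second: str) -> bytes:
    """Zero byte, first string, zero byte, second string, zero byte."""
    return b"\x00" + first.encode("utf-8") + b"\x00" + second.encode("utf-8") + b"\x00"


def encode_single_string(text: str) -> bytes:
    """A single string omits the second element: only two zero bytes."""
    return b"\x00" + text.encode("utf-8") + b"\x00"


def encode_user(uid: int, name: str) -> bytes:
    """The user id (unsigned Varint) takes the place of the first string."""
    return b"\x00" + encode_uvarint(uid) + b"\x00" + name.encode("utf-8") + b"\x00"


print(encode_string_pair("highway", "residential"))
# b'\x00highway\x00residential\x00'
print(encode_user(45445, "UScha"))
# b'\x00\x85\xe3\x02\x00UScha\x00'
```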

To do this with every string would cost a lot of space. Fortunately, most character strings in OSM come in pairs and are repeated over and over. Take "highway"/"residential" or "building"/"yes", for example. This allows us to use a reference every time we reuse a previously encountered pair of strings.

To refer to a string pair which has already been defined in the way shown above, we count the string pair definitions from the present location in the file back to the definition which matches our string pair. This count is stored as an unsigned number, using the method described in the previous section.

To limit the maximum amount of memory the decoder program must allocate, there is a limit for the number of different string pairs to which we can refer, and there is a limit for their length:

When reading an .o5m coded file, we use a reference table which has 15,000 lines, 250+2 characters each (for performance reasons: 256 characters). Every string pair we encounter is copied into the table, with one exception: string pairs which are longer than 250 characters are interpreted but not copied into the table. If there is no more space in the table to hold a new string pair, the oldest string pair must be deleted.

Note that strings are always stored as string pairs, even if defined as single string. The only time this difference takes effect is when writing to or reading from an .o5m file. In this case, for a single string only two zero bytes are used instead of three.
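The reference table can be sketched in Python with a bounded deque. This is a rough sketch under several assumptions: the reference value 1 denotes the most recently stored pair (a zero byte always starts an inline definition, so 0 cannot be a reference), the 250-character limit is applied to the combined byte length of both strings, and only full pairs are handled. The names read_uvarint and read_string_pair are hypothetical.

```python
from collections import deque

TABLE_SIZE = 15000   # maximum number of string pairs we can refer to
MAX_PAIR_LEN = 250   # longer pairs are decoded but not stored


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Unsigned Varint as described in the section on numbers."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_string_pair(buf: bytes, pos: int, table: deque) -> tuple[tuple[str, str], int]:
    """Read an inline pair or a back-reference; return (pair, new position)."""
    if buf[pos] == 0:                      # inline definition
        end1 = buf.index(0, pos + 1)       # zero byte between the two strings
        end2 = buf.index(0, end1 + 1)      # terminating zero byte
        first, second = buf[pos + 1:end1], buf[end1 + 1:end2]
        pair = (first.decode("utf-8"), second.decode("utf-8"))
        if len(first) + len(second) <= MAX_PAIR_LEN:
            table.appendleft(pair)         # a full deque drops the oldest pair
        return pair, end2 + 1
    ref, pos = read_uvarint(buf, pos)      # count back from the newest pair
    return table[ref - 1], pos


table = deque(maxlen=TABLE_SIZE)
data = b"\x00highway\x00residential\x00" b"\x00building\x00yes\x00" b"\x02" b"\x01"
pos = 0
while pos < len(data):
    pair, pos = read_string_pair(data, pos, table)
    print(pair)
```

The last two bytes of the sample data are references: 0x02 resolves to "highway"/"residential" and 0x01 to "building"/"yes".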

Here is an example of how to store a few strings:

They are coded as follows:

File

Each .o5m file starts with a byte 0xff and ends with byte 0xfe. Every dataset starts with a specific byte:

The second byte of every dataset, and, if necessary because the value is larger than 127, the following byte(s), contain the length information, coded as unsigned number (see above). If the decoding program does not understand a specific dataset, it shall jump over its contents. The length information does not include the length byte(s) itself, nor the start byte of the dataset.

Note that there is no length information for bytes in the range from 0xf0 to 0xff, so the program must jump just over one byte.
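The skipping rule makes a generic dataset walker very simple. The following Python sketch only relies on what is stated above: a start byte, a length coded as unsigned number, and single bytes in the range 0xf0 to 0xff without length information. The function names and the dataset type 0x42 in the sample data are made up for illustration.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Unsigned Varint as described in the section on numbers."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def iter_datasets(buf: bytes):
    """Yield (start byte, payload) for every dataset in the buffer."""
    pos = 0
    while pos < len(buf):
        kind = buf[pos]
        pos += 1
        if kind >= 0xF0:                  # 0xf0..0xff: no length, no payload
            yield kind, b""
            continue
        length, pos = read_uvarint(buf, pos)
        yield kind, buf[pos:pos + length]
        pos += length                     # length excludes start and length bytes


sample = bytes([0xFF, 0x10, 0x03, 0xAA, 0xBB, 0xCC, 0x42, 0x00, 0xFE])
for kind, payload in iter_datasets(sample):
    print(hex(kind), payload.hex())
```

A decoder built this way can dispatch on the start byte for the datasets it understands and ignore all others without knowing their contents.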

Node

Every node dataset starts with 0x10, followed by the dataset length and the data. The data fields:

For example:

The example in OSM XML format:

<node id="125799" lat="53.0749606" lon="8.7867843" version="5" changeset="5922698" user="UScha" uid="45445" timestamp="2010-09-30T19:23:30Z"/>
<node id="125800" lat="53.0719347" lon="8.7840318" version="10" changeset="5923003" user="UScha" uid="45445" timestamp="2010-09-30T19:57:15Z"/>

Note that it is allowed to shorten a dataset by decreasing its length (2nd byte of the dataset). The decoding program must cope with such clipped datasets and accept that they may not contain key/val pairs, latitude/longitude or even author information. If a dataset is clipped in such a way that only its body is left (id and maybe version and author information), the program shall perform a delete action and delete the object with this id (see below, section .o5c).

Way

Every way dataset starts with 0x11, followed by the dataset length and the data. The data fields:

For example:

The example in OSM XML format:

<way id="3999478">
  <nd ref="20958823"/>
  <nd ref="20973902"/>
  <tag k="highway" v="secondary" />
</way>

Relation

Every relation dataset starts with 0x12, followed by the dataset length and the data. The data fields:

For example:

The example in OSM XML format:

<relation id="2952">
  <member type="way" ref="11560506" role="inner"/>
  <member type="way" ref="25873183" role="inner"/>
  <tag k="type" v="multipolygon" />
</relation>

File Timestamp

This is an optional dataset. It starts with 0xdc, followed by the timestamp of the file. The unit is seconds since 1 January 1970.

If this dataset is used in a file, it must be stored before every OSM object, i.e. before every node, way and relation.
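For illustration, the following Python sketch converts between seconds since 1 January 1970 and the timestamp notation used in OSM XML. It assumes that the seconds are counted in UTC, which the "Z" suffix of the XML timestamps suggests; the function names are hypothetical.

```python
from datetime import datetime, timezone

XML_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_timestamp(seconds: int) -> str:
    """Seconds since 1 January 1970 (UTC) to an OSM XML timestamp."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(XML_FORMAT)


def parse_timestamp(text: str) -> int:
    """OSM XML timestamp to seconds since 1 January 1970 (UTC)."""
    moment = datetime.strptime(text, XML_FORMAT).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


print(format_timestamp(0))  # 1970-01-01T00:00:00Z
print(format_timestamp(parse_timestamp("2010-09-30T19:23:30Z")))
```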

Bounding Box

This (optional) dataset starts with 0xdb, followed by the dataset length and the bounding box coordinates:

Similar to the file timestamp (see above), this must be stored before every OSM object, i.e. before every node, way and relation.

Reset

The reset byte 0xff tells the decoder to reset its counters. From this point on, every delta coding starts from 0 again, and no string reference refers back to file contents before the reset byte.

This mechanism makes it possible to split an .o5m file into parts which can be processed in parallel by different threads. Usually, at least the start of the way section and the start of the relation section will be introduced by Reset bytes.

Note that 0xff does not initiate a dataset; therefore no length information will follow the 0xff.
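In a decoder, a Reset simply means discarding all state that depends on earlier file contents. The Python sketch below shows the idea with a hypothetical DecoderState class that tracks only one delta counter (the object id) and the string reference table; a real decoder would reset every delta-coded field in the same way.

```python
from collections import deque


class DecoderState:
    """State that must be cleared whenever a 0xff byte is read."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.last_id = 0                    # delta coding starts from 0 again
        self.strings = deque(maxlen=15000)  # no references to earlier pairs

    def next_id(self, delta: int) -> int:
        """Apply a decoded id difference and return the absolute id."""
        self.last_id += delta
        return self.last_id


state = DecoderState()
print(state.next_id(123000), state.next_id(50))  # 123000 123050
state.reset()                                    # 0xff encountered
print(state.next_id(7))                          # 7
```

Since nothing before the Reset byte is needed afterwards, each part of the file can be handed to a separate thread with a fresh state object.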

Sync

If you need to process OSM data as a stream, you will need some positions you can synchronize on. These sync points are specified in such a way that you can find them by scanning the data stream for 32-bit zeroes. Every Sync dataset must be followed by a Reset byte; otherwise you would not be able to decode the subsequently stored OSM datasets if delta coding is used. Syntax:

If you develop a parser which does not use this synchronization mechanism, you do not need to handle Sync datasets: they will be recognized as "unknown datasets" and therefore ignored.

Jump

To get random access not only to the start of one of the sections (start of nodes, start of ways, start of relations), you can define additional jump points as Jump datasets. These jump points allow you to move forward and backward in the file very quickly.

The Jump dataset starts with 0xef, followed by the dataset length and the data. The data fields:

Every Jump dataset must be followed by a Reset byte; otherwise you would not be able to decode the subsequently stored OSM datasets if delta coding is used. For example:

If you develop a parser which does not use this jump mechanism, you do not need to handle Jump datasets: they will be recognized as "unknown datasets" and therefore ignored.

File Size Comparison

The size of OSM data files depends on the format being used. The 2011-May-12 Germany file from geofabrik.de is used as the basis for this comparison. The .gz and .7z compressions were done with default parameters (medium compression strength); .pbf uses internal zlib compression.


OSM File Format and Compression Type | Size in Bytes  | Relative Size | Reading Time (slow computer)
germany.osm                          | 15,519,707,799 | 100.0 %       | 604 s
germany.osm.bz2                      |  1,442,403,577 |   9.3 %       |
germany.o5m                          |  1,469,972,938 |   9.5 %       | 36 s
germany.o5m.gz                       |    949,544,868 |   6.1 %       |
germany.o5m.7z                       |    851,845,615 |   5.5 %       |
germany.pbf                          |    948,980,117 |   6.1 %       | 90 s
germany.pbf.gz[5]                    |    949,117,868 |   6.1 %       |
germany.pbf.bz2[5]                   |    953,355,361 |   6.1 %       |
germany.pbf.7z[5]                    |    959,453,561 |   6.2 %       |


If you discard meta data (timestamp, user name, etc.), the file sizes will decrease:

OSM File Format and Compression Type | Size in Bytes | Relative Size | Reading Time (slow computer)
germany.osm (excl. meta data)        | 7,937,414,195 | 51.1 %        |
germany.o5m (excl. meta data)        | 1,046,676,637 |  6.7 %        |
germany.o5m.gz (excl. meta data)     |   730,187,675 |  4.7 %        |
germany.o5m.7z (excl. meta data)     |   647,320,555 |  4.2 %        |
germany.pbf (excl. meta data)[6][7]  |   697,041,975 |  4.5 %        |

Further Information

Why that strange Name?

The name o5m was chosen because it looks a bit like osm. The digit 5 stands for "5 times smaller than .osm". Meanwhile, we know that the factor is about 10, not 5. But how silly would o10m sound?

.o5c as OSM Change Format

You can easily use the .o5m format to store an OSM change file. There is no difference in the file's data structure, with one exception: the file header contains "o5c2" instead of "o5m2". For the file name, it is recommended to use the extension .o5c instead of .o5m.

What did we do with the <create>, <modify> and <delete> tags which are well-known from the .osc format? Well, these tags serve no real purpose, unless you want to use them for plausibility checks. Create and modify result in the same action: the new version of the referred object will be stored; so we simply take the object's version as it comes with the latest change file and store it. Delete is a special action, and we would have to define its own .o5c object type. However, to delete an object we need nothing but its id. Therefore we decide that if an object is stored in the file with nothing but its id (and maybe its version or author information), the object referred to by this id shall be deleted.

Future Extensions

What will we do if there are additional requirements for the data format due to a new OSM API? This should not be a problem. There are 239 o5m dataset ids (range 0x01 to 0xef); presently, only three of them are used for OSM objects: 0x10 for nodes, 0x11 for ways, 0x12 for relations. The Varint number format and the string format may also be used for any purpose.

Software supporting .o5m

The data format is very new, therefore this list is really short.

Programs that already support .o5m

Toolchains using .o5m

Footnotes

  1. Within relations, the digit "1" in the string "1inner" indicates the referenced object's type. "0" means node, "1" way and "2" relation.
  2. The user id is coded as unsigned Varint and packed into the first string of this string pair. Coding the id as Varint helps to save a few bytes of space.
  3. The limit of 15,000 was chosen because from 16,384 and above, three instead of two bytes would be necessary to store the reference value.
  4. The limit of 250 characters was chosen because very few strings exceed this maximum. Internally, the string table will need a row size of 252 characters because the terminators must be stored too. It is recommended to define a row size of 256 characters due to faster calculations when accessing the table via index.
  5. As .pbf files are already compressed internally (with the zlib algorithm), they usually grow in file size if you try to compress them a second time. This is normal behaviour and holds for compressed files in general.
  6. File generated with osmosis option omitmetadata=true.
  7. Generated from a slightly newer, larger germany.osm; the number given here is a hypothetical value adjusted by the same factor.