Talk:PBF Format

From OpenStreetMap Wiki

How can I look at the .proto file?

This page doesn't say (and IMO the PBF format is not clearly described). --Qwertie 22:22, 9 November 2011 (UTC)

First Fileblock Is HeaderBlock

This is mentioned in the .proto file but not in the wiki page. Is this part of the standard? pbf2osm does not check for it currently. --ChristianVetter 12:42, 29 September 2010 (BST)

To be honest, I don't know. That block exists to send metadata to the reader program. Following the philosophy of 'be conservative in what you send and liberal in what you accept', the rule should be at least this strong: "Every file SHOULD start with that block. A reader SHOULD accept a file even if it lacks that block, but MUST warn about the block being missing." It could be stronger: "Every file MUST start with that block. A reader SHOULD accept a file even if it lacks that block, but MUST warn about the block being missing." Your thoughts?

I can also imagine cases where a file might have a HeaderBlock both at the start and in the middle, e.g. after concatenating two .osm.pbf files. I think that should be supported. --Nutscrape 19:44, 17 November 2010 (UTC)

C++ Namespace Pollution

All created classes are added to the global namespace. This makes them unusable in most cases. You should add something like this to the protocol definitions:

package PBF;

--ChristianVetter 11:39, 29 September 2010 (BST)

Thank you for the suggestion. I have just done that and pushed. The package is OSMPBF. --Nutscrape 04:06, 17 October 2010 (BST)

int4 length

Can we consistently make this value in network byte order? --Skinkie 00:06, 24 September 2010 (BST)

Documentation fixed. It is in network byte order. --Nutscrape 12:07, 27 September 2010 (BST)
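
For anyone implementing a reader, here is a minimal Python sketch of decoding that 4-byte length in network (big-endian) byte order. The helper name read_length_prefix is hypothetical, and treating the value as unsigned is an assumption, not something this thread settles.

```python
import io
import struct

def read_length_prefix(stream):
    """Read the 4-byte length that precedes each header, in network byte order.

    Returns None at a clean end of file. The function name is illustrative only.
    """
    raw = stream.read(4)
    if not raw:
        return None
    if len(raw) != 4:
        raise ValueError("truncated length prefix")
    # ">" selects big-endian (network) byte order; "I" is an unsigned 32-bit int
    # (unsigned is an assumption made for this sketch).
    (length,) = struct.unpack(">I", raw)
    return length

# The bytes 00 00 00 0D decode to 13 in network byte order.
print(read_length_prefix(io.BytesIO(b"\x00\x00\x00\x0d")))
```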

Within each primitiveblock, I then divide entities into groups that contain consecutive messages all of the same type (node/way/relation).

How fixed will this be? I could imagine that if I want to compress a smaller file, why not pack them in one primitive group? --Skinkie

It is defined that each primitive group must contain entities of only a single type. Wrapping multiple primitive groups within a primitive block enables a single primitive block (with a common stringtable, granularity, etc.) to contain entities of different types, which can help in the case of small files. --Nutscrape 12:07, 27 September 2010 (BST)

But what is the reason that a primitive group may contain only a single type? --Skinkie 22:09, 26 September 2010 (BST)

To preserve the order of entities. The order of PrimitiveGroups in a PrimitiveBlock is well defined. If each primitive group has only one field set, and that field contains the entities, then the entity order in the file is unambiguous. If, on the other hand, multiple fields of a PrimitiveGroup were used to contain entities, their order in the file would be ambiguous. --Nutscrape 12:07, 27 September 2010 (BST)


//optional BBox bbox = 19;

It should not be in that block. The point is that it would still be necessary to completely decompress that block of data, which is a pain if you want to search for data quickly. One option is to add an extended index holding the bboxes of all OSMData blocks (that allows skipping every OSMData block that is not required). --Skinkie 04:12, 24 September 2010 (BST)

Your observation is correct. Having the bbox in OSMData does not avoid having to decompress each OSMData block, but that is why I designed the fileformat to go 'int4, BlockHeader, Blob'. BlockHeader has an opaque field 'indexdata' which was designed for storing arbitrary serialized protocol buffers containing such a bbox or any other forms of index data. If the bbox is in the BlockHeader.indexdata, then non-matching blocks can be skipped without being decompressed. I chose to postpone the development of this feature, because I did not have plans to write software for creating sorted and indexed *.osm.pbf files, and it is not useful without such software. If anyone is interested in building tools for creating, writing, reading, or using the indexdata field, please contact me. (BTW, such an extended index you propose can be as simple as a list of file offsets and BlockHeader messages of each block in the file.) --User:Nutscrape
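
To illustrate the "list of file offsets" idea, here is a rough Python sketch that walks a file and records each block's offset, type, indexdata and blob size without decompressing anything. It is only a sketch under assumptions: the field numbers (type = 1, indexdata = 2, datasize = 3) are taken from the published fileformat.proto rather than from this thread, the hand-written decoder handles only the two wire types that message needs, and the function names are made up.

```python
import io
import struct

def read_varint(buf, pos):
    """Decode one protobuf base-128 varint from buf starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def parse_header_fields(buf):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited (string or bytes)
            size, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + size])
            pos += size
        else:
            raise ValueError("unexpected wire type %d" % wire_type)
        fields[field_number] = value
    return fields

def build_block_index(stream):
    """Return a list of (offset, type, indexdata, datasize), one per block."""
    index = []
    while True:
        offset = stream.tell()
        prefix = stream.read(4)
        if len(prefix) < 4:
            break
        (header_len,) = struct.unpack(">I", prefix)
        fields = parse_header_fields(stream.read(header_len))
        block_type = fields[1].decode("utf-8")  # assumed field number: type = 1
        indexdata = fields.get(2)                # assumed: indexdata = 2 (optional)
        datasize = fields[3]                     # assumed: datasize = 3
        index.append((offset, block_type, indexdata, datasize))
        stream.seek(datasize, io.SEEK_CUR)       # skip the blob, no decompression
    return index
```

A reader could then inspect the indexdata entry of each tuple (once a bbox encoding for it is agreed on) and seek straight to the offsets of the blocks it needs.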

Though BBox is just a proposal: the tag 19 of BBox collides with "optional int64 lat_offset = 19" --Yaron 17:16, 24 October 2010 (BST)

Fixed. Thanks. --Nutscrape 19:21, 17 November 2010 (UTC)

HeaderBlock and BlockHeader

There is a HeaderBlock and a BlockHeader. That's a bit confusing. Can we find a different name for one of them? --Joto 09:40, 30 November 2010 (UTC)

Advantages / Disadvantages

Hello! Presently, two major formats are available for OSM data: the commonly used .osm format and the new .pbf format. Both formats have advantages and disadvantages. I would like to have a list in this article which elaborates these aspects – including a recommendation, when to use which of these formats. --Marqqs 03:19, 6 January 2011 (UTC)

pbf2osm

One of the most important pbf tools is the program pbf2osm, which is recommended by the article. I am afraid I might not be the only one who did not manage to install pbf2osm, so could someone please write a few lines on how to install it?
The makefile suggests that there are some additional files necessary:

protoc-c --proto_path=../OSM-binary/src --c_out=. ../OSM-binary/src/*.proto

What exactly are the requirements? What packages besides "protobuf" and "protobuf-c"? Please help... --Marqqs 04:23, 6 January 2011 (UTC)

A detailed description follows

First download, compile and install protobuf and protobuf-c as stated in pbf2osm's README file. Note that on some systems (e.g. Fedora) it might be necessary to create symbolic links from the generated protobuf libraries installed in /usr/local/lib/ to a place where the system finds them (e.g. /usr/lib64/ or just /usr/lib/, depending on your architecture). I did this with these short bash commands:

cd /usr/lib64/
for i in /usr/local/lib/lib*; do sudo ln -s "$i" .; done

You should now be able to compile protobuf-c and do the same thing again (create symbolic links) with the newly generated libs.

After that, get pbf2osm's git repository by typing:

git clone http://git.openstreet.nl/pbf2osm.git

and directly after that type the following lines (as stated in pbf2osm's README file):

cd pbf2osm
git submodule init
git submodule update

Now just type

make

which gives you the compiled program. Wicking 23:09, 7 January 2011 (UTC)

Great! Thanks a lot - again! This program certainly deserves a wiki page of its own. Maybe this installation description would be a good start. I will take care of it as soon as I have some time, but at the moment I don't know when that will be. :-( --Marqqs 23:31, 7 January 2011 (UTC)

kB, kiB, MB, Mib?

Hi, I'm not certain about the units, please help me to understand the exact amounts of memory needed:

"The length of the BlobHeader *should* be less than 32 kilobytes and *must* be less than 64 kilobytes. The uncompressed length of a Blob *should* be less than 16 megabytes and *must* be less than 32 megabytes."

Is one kilobyte assumed as 1024 bytes, one megabyte as 1024*1024 bytes, or do you really mean kilobyte and megabyte (1000 and 1,000,000)? --Marqqs 21:00, 24 September 2011 (BST)

1024 bytes = 1 kilobyte and 1024*1024 bytes = 1 megabyte. The current code rejects files that exceed these limits. --Nutscrape 03:34, 4 October 2011 (BST)
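
Spelled out as numbers, the limits quoted above could be checked roughly like this in Python. This is only a sketch: the constant and function names are invented for illustration, and the "warn"/"reject" labels are one possible reading of *should* and *must*, not the behaviour of any particular tool.

```python
KIB = 1024          # 1 kilobyte, as defined in the answer above
MIB = 1024 * 1024   # 1 megabyte

BLOB_HEADER_SOFT_LIMIT = 32 * KIB   # header *should* be smaller than this
BLOB_HEADER_HARD_LIMIT = 64 * KIB   # header *must* be smaller than this
BLOB_SOFT_LIMIT = 16 * MIB          # uncompressed blob *should* be smaller
BLOB_HARD_LIMIT = 32 * MIB          # uncompressed blob *must* be smaller

def classify_size(size, soft_limit, hard_limit):
    """Return "ok", "warn" or "reject" for a size given both limits.

    Both limits are exclusive ("less than"), so a size equal to a limit
    already violates it.
    """
    if size >= hard_limit:
        return "reject"
    if size >= soft_limit:
        return "warn"
    return "ok"

print(BLOB_HEADER_HARD_LIMIT, BLOB_HARD_LIMIT)  # 65536 33554432
```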

File Timestamp?

I think it would be nice to have a timestamp for the whole file – other formats support this feature. Since the PBF format is easily extensible, this could be added without any problems, couldn't it? I, for one, would be glad to have the possibility to store such a file timestamp in the PBF OSMHeader fileblock. --Marqqs 19:50, 2 October 2011 (BST)

Agreed. And that is exactly where I intended metadata to go. (We'd put a key-value dictionary into the OSMHeader fileblock.) However, support for that is really waiting on having true metadata handling for OSM files that is properly propagated through the osmosis processing stack. Once we include timestamps, we will want to include things like changeset replication source URLs, replication timestamps, version numbers, entity counts, sort order, and other forms of metadata. I didn't have the time to do the osmosis implementation of this, but if you do, I'd be happy to help with the design, and I'm sure others have lots of their own metadata ideas. --Nutscrape 03:34, 4 October 2011 (BST)

Thanks for your answers! I think I will wait until the standard implementation has been made in Osmosis. What I had in mind was a definition like this:
message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;
 
  optional string writingprogram = 16; 
  optional string source = 17; // From the bbox field.
  optional int64 filetimestamp = 18; // The file's timestamp (seconds since epoch).  <-------------
}
But your suggestion seems to be better for future extensions. --Marqqs 20:49, 4 October 2011 (BST)

Hello again. Today, a forum user asked about updating a PBF file with osmupdate. Since PBF does not support file timestamps, the whole file would have to be analyzed to find out how current it is. Unfortunately, I lack the knowledge to start the metadata "revolution" you proposed. Would it hurt just to add the Varint=18 as suggested above? How would other programs cope with this extension?
Alternatively I could use 'optional_features' to store this kind of metadata. I simply would add a pseudo "optional feature" with the name "timestamp=2011-10-16T15:45:00Z". Other programs would ignore this - hopefully. What do you think? --Marqqs 14:45, 16 October 2011 (BST)

No way should you encode it as an optional_features. I also don't like encoding it as a varint=18, because that will likely lead to a profusion of new keys. The more useful question is how are you planning on reading/writing that field, and through what software programs? Feel free to email me to continue this discussion and I will pose a question on osm-dev sometime this week or next week. --Nutscrape 00:09, 19 October 2011 (BST)

Hi Nutscrape, I'm very sorry, I already implemented the last suggestion because I really needed a file timestamp. Fortunately, there seem to be no side-effects. That's not a surprise to me, because you defined "If a program encounters an optional feature it does not know, it can still safely read the file.". The other programs seem to adhere to this rule.
You wanted to know what software program I'm going to use: it's osmupdate as I wrote in my last post. osmupdate cannot read the PBF file directly, it uses osmconvert as a subprogram for this purpose. osmconvert is able to read and write PBF files directly, and that's the way I write this "special" "optional feature" – in addition to "Sort.Type_then_ID".
As soon as you have a better solution to store the file timestamp I will be glad to change the implementation accordingly. --Marqqs 20:26, 19 October 2011 (BST)
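
For readers who want to pick up this interim timestamp, here is a small Python sketch that looks for the pseudo feature in a list of optional_features strings. It assumes the "timestamp=" prefix and the UTC format shown in the example above ("timestamp=2011-10-16T15:45:00Z"); the function name is hypothetical and this is not osmconvert's actual code.

```python
from datetime import datetime, timezone

TIMESTAMP_PREFIX = "timestamp="  # prefix taken from the example in this thread

def find_file_timestamp(optional_features):
    """Return the file timestamp as an aware UTC datetime, or None if absent."""
    for feature in optional_features:
        if feature.startswith(TIMESTAMP_PREFIX):
            text = feature[len(TIMESTAMP_PREFIX):]
            # Assumed format: YYYY-MM-DDTHH:MM:SSZ, always in UTC.
            parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None

features = ["Sort.Type_then_ID", "timestamp=2011-10-16T15:45:00Z"]
print(find_file_timestamp(features))
```

Programs that do not know the pseudo feature simply never look for the prefix, which matches the "can still safely read the file" rule quoted above.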

I do agree that we need that timestamp. Actually I want two timestamps, one for "start" and one for "end". That way we can support change files and partial history files, too. The exact semantics will have to be defined. I am a bit worried about a general metadata facility. General is nice, but we still have to define exactly how all this data is to be interpreted. -- Joto 14:36, 31 October 2011 (UTC)
Something new on this subject? If not, I would like to document the interim solution for the file timestamp which has been implemented in osmconvert. --Marqqs 23:59, 22 November 2011 (UTC)
Done. --Marqqs 15:38, 2 December 2011 (UTC)

lz4

Could we maybe extend Blob for LZ4 compression?

message Blob {
   optional bytes raw = 1; // No compression
   optional int32 raw_size = 2; // Only set when compressed, to the uncompressed size
   optional bytes zlib_data = 3;
   // optional bytes lzma_data = 4; // PROPOSED.
   // optional bytes OBSOLETE_bzip2_data = 5; // Deprecated.
   ---> // optional bytes lz4_data = 6; // PROPOSED.
}

In my tests, recompressing with lz4hc makes a zlib-compressed PBF file approx. 13% larger, but processing speed is vastly increased. In my test setup, zlib decompression speed is about 70 MB/s (compressed data), while lz4 decompression is so fast that it's hard to measure (at least 10x faster). For simple tasks, zlib decompression can easily dominate processing time. My test was to extract and dump all addr: tags from nodes for statistical purposes, which took 5.1s w/ lz4 instead of 11.5s w/ zlib for a 631MB country extract (which weighed in at 714MB when recompressed with lz4hc r94). Durga (talk) 20:39, 23 July 2013 (UTC)

Aren't there different compression levels available in zlib too? Did you try to compress the data using the parameter "--fast" (i.e. "-1")? You could also leave the data uncompressed, as the .pbf format description allows. A third option would be to use the .o5m format instead (see .o5m#File_Size_Comparison). This format allows very fast processing. Which country extract are you referring to? Which tags did you extract? --Marqqs (talk) 11:20, 28 July 2013 (UTC)
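
As a quick way to try the zlib levels mentioned here, the following Python sketch compresses the same bytes at several levels and reports the resulting sizes. The sample data is made up for illustration and is not real OSM data, so the numbers say nothing about actual PBF blobs; the function name is hypothetical.

```python
import zlib

def compressed_sizes(data, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} for each zlib level given."""
    return {level: len(zlib.compress(data, level)) for level in levels}

# Made-up, highly repetitive sample; real blobs will compress differently.
sample = b"addr:street=Main Street;addr:housenumber=1;" * 2000
print(len(sample), compressed_sizes(sample))
```

Level 1 corresponds to the "fastest" setting and level 9 to "best compression"; measuring decompression time on real blobs would still be needed to compare against the lz4 figures above.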

HeaderBBox ChangeSet StringTable

Hello

I have started to study PBF. I want to write a C++ parser and read data from *.osm.pbf.

I noticed that some keys are missing from the documentation.

( HeaderBBox  ChangeSet  StringTable )

Please enlighten me.

Thank you.

Alin