Just trying to contribute a little bit.
I am currently working on code to convert from .osm.bz2 to .h5. The objective is to enable fast access to the data, mainly with a view to running validation tests quickly.
I will perform all the tests with the map "spain.osm.bz2" (159 MB, downloaded in December 2010). All timings were taken on a Dell Inspiron 6400 laptop:
- RAM: 1 GB
- Processor: Intel® Core™ 2 Duo Processor T2400 @ 1.83 GHz
osmreader.py works both as a library and as a command line application. It is used to parse .osm.bz2 files.
As a command line utility it works by executing something like:
    $ python osmreader.py spain.osm.bz2
    $ python osmreader.py --onlynodes spain.osm.bz2
    $ python osmreader.py -h
It can be used as a library:
    from osmreader import Osm

    _osm = Osm('spain.osm.bz2')
    for _elem in _osm.parse():
        if _elem is not None:
            print(_elem[0])  # this is the type: node, way, relation
            print(_elem[1])  # this is the id
            print(_elem[2])  # this is a dictionary with the attribs.
            print(_elem[3])  # this can contain tags, nodes, members.
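osmreader.py itself is not reproduced here. As a rough illustration of how a streaming parser with this kind of output could be built, here is a minimal sketch that uses only the Python 3 standard library. It is not the actual osmreader implementation: the names `iter_osm` and `iter_osm_bz2`, the exact tuple layout and the tiny sample data are all assumptions made for the example.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Top-level OSM element types that are reported to the caller.
OSM_TYPES = ("node", "way", "relation")


def iter_osm(fileobj):
    """Yield (type, id, attribs, children) for each node, way and relation.

    `fileobj` is any binary file-like object containing OSM XML.
    (Hypothetical helper, not part of osmreader.py.)
    """
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # the first event is the start of <osm>
    for event, elem in context:
        if event == "end" and elem.tag in OSM_TYPES:
            attribs = dict(elem.attrib)
            # Children are <tag>, <nd> or <member>; keep name and attributes.
            children = [(child.tag, dict(child.attrib)) for child in elem]
            yield elem.tag, int(attribs["id"]), attribs, children
            root.clear()  # drop processed elements so memory use stays flat


def iter_osm_bz2(path):
    """Stream the elements of a .osm.bz2 file (hypothetical helper)."""
    with bz2.open(path, "rb") as fileobj:
        for item in iter_osm(fileobj):
            yield item


# Tiny in-memory sample so the sketch runs without a real extract.
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6">
  <node id="1" lat="40.4" lon="-3.7"><tag k="name" v="Madrid"/></node>
  <node id="2" lat="41.4" lon="2.2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="primary"/></way>
</osm>"""

compressed = bz2.compress(SAMPLE)
elements = list(iter_osm(bz2.BZ2File(io.BytesIO(compressed))))
for kind, uid, attribs, children in elements:
    print(kind, uid, attribs, children)
```

With a real file, `iter_osm_bz2('spain.osm.bz2')` would be used instead of the in-memory sample. Clearing the root after each element is what keeps memory use low on large extracts.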
The file can be found here:
- osm2hdf5.py: requires "osmreader.py" to be in the same directory.
To use it, the command line looks like:
    $ python osm2hdf5.py spain.osm.bz2 spain.h5
This tool converts from .osm.bz2 to .h5. It takes around 25.7 min to convert spain.osm.bz2 on my laptop. The resulting file is bigger, 247 MB, which is not bad at all.
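To put those figures in perspective, here is a quick back-of-the-envelope calculation in Python 3. It uses only the numbers quoted in this post (159 MB input, 247 MB output, 25.7 min); the variable names are just for illustration.

```python
# Figures quoted in the text.
source_mb = 159.0   # spain.osm.bz2
hdf5_mb = 247.0     # spain.h5
minutes = 25.7      # conversion time on the laptop described above

growth = hdf5_mb / source_mb               # size ratio, .h5 vs .osm.bz2
throughput = source_mb / (minutes * 60.0)  # compressed MB consumed per second

print("size ratio: %.2fx" % growth)
print("throughput: %.3f MB/s of compressed input" % throughput)
```

So the .h5 file is roughly one and a half times the size of the bzip2-compressed XML, and the conversion consumes about a tenth of a megabyte of compressed input per second on this machine.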
Using HDF5 is interesting because... read this. Access to the information is fast and easy. Besides, you are not bound to Python (if you feel so inclined). But the truth is that Python, PyTables and NumPy are an interesting combination.
To print the unique identifier, latitude and longitude of every node:
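    import tables

    # openFile is the PyTables 2.x name; newer releases call it open_file
    _h5 = tables.openFile('spain.h5')
    for _row in _h5.root.original.nodes.iterrows():
        print("%s %s %s" % (_row['uid'], _row['lat'], _row['lon']))
    _h5.close()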
You don't have to worry about whether the file is too big to fit into memory. You can read the ways directly without having to read the nodes first. Performance is really good. The following code creates a list with the IDs of the relations. It takes 0.077 s.
    import tables
    import time

    _ini = time.time()
    _h5 = tables.openFile('spain.h5')
    _list = []
    for _i in _h5.root.original.relations.iterrows():
        _list.append(_i['uid'])
    _h5.close()
    _end = time.time()
    #print(_list)
    print("%s s" % (_end - _ini))
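A single run can be skewed by disk caching or other processes, so it may be worth repeating the measurement a few times and keeping the best result. The following Python 3 sketch shows one way to do that. The helper names `best_of` and `collect_ids` are made up for this example, and the rows are an in-memory stand-in rather than a real PyTables table.

```python
import time


def best_of(func, repeats=5):
    """Return (best_seconds, last_result) over `repeats` calls of `func`."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


def collect_ids(rows):
    """Stand-in for the loop above: gather the 'uid' field of every row."""
    return [row["uid"] for row in rows]


# Hypothetical in-memory rows; with PyTables these would come from iterrows().
rows = [{"uid": i} for i in range(1000)]
seconds, ids = best_of(lambda: collect_ids(rows))
print("%.6f s for %d ids" % (seconds, len(ids)))
```

To time the real thing, the lambda would open spain.h5, walk the relations table and close the file, exactly as in the snippet above.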
I think it would be better to provide the data as .h5 files rather than as the XML-based planet.osm (just my 2 cents).