OsmosisPgsqlPlans

Overview

This page contains thoughts on how to extend the current Osmosis pgsql "simple" schema. One driver is to allow extensions to be made to the schema while maintaining support for the application of changesets.

Some of the sections below are snippets from emails and need to be fleshed out in more detail.

Support Custom Extensions

A solution would be to have a "status" column on the "nodes", "ways" and "relations" tables. This status would be set by the --write-pgsql-change tasks and could later be used to decide which changes need to be propagated. Something like this: a new node appears in the .osc file -> the change task creates a new row in the nodes table and sets status=new; later, a job finds all nodes with status=new and does whatever it has to do to propagate the change further on. When all is done, the status is changed to "old" or similar. It could be implemented so that Osmosis first does its thing and then calls a stored procedure named in a command line option. This stored procedure could then be used to do whatever else the user needs done. The default stored procedure would just delete all rows marked status="deleted" etc. to implement the normal behaviour. I haven't looked into all the details, but I think it would be doable and would not add too much overhead.
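
A minimal sketch of what such a consuming job could look like, written in Java with JDBC. The status values and cleanup statements follow the description above; the connection details and the propagation step itself are placeholders, and none of this is agreed design:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StatusPropagationJob {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/osm", "osm", "password");
            conn.setAutoCommit(false);
            Statement stmt = conn.createStatement();

            // Find everything the change task flagged as new and propagate it.
            ResultSet rs = stmt.executeQuery("SELECT id FROM nodes WHERE status = 'new'");
            while (rs.next()) {
                long nodeId = rs.getLong("id");
                // ... do whatever has to be done to propagate node nodeId ...
            }
            rs.close();

            // Default cleanup: drop rows flagged for deletion, mark the rest as handled.
            stmt.executeUpdate("DELETE FROM nodes WHERE status = 'deleted'");
            stmt.executeUpdate("UPDATE nodes SET status = 'old' WHERE status <> 'old'");

            conn.commit();
            conn.close();
        }
    }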

This is a great idea and people definitely need this! I think of it more like an additional parameter similar to 'authFile', e.g. 'postExecFile="./generateGeometriesFromOSMData.sql"'. After the data has made it into the original tables, every user can pass in his own SQL file that calls a stored procedure (recommended) or contains plain SQL statements (for those who are not familiar with stored procedures). This also means that every user is responsible for tracking which entries in the main tables are new. Additionally, users who don't want to manipulate the original data are not burdened with a useless field. Osmosis would have to execute the SQL statements from that file. I just think this gives Osmosis users a little more freedom, without creating useless columns in tables or useless "default" stored procedures. Any thoughts? Ralf
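
A rough sketch of how Osmosis could execute such a file. The option name postExecFile is taken from the suggestion above; the naive splitting on ';' is a simplification and would break on semicolons inside function bodies or string literals:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.Statement;

    public class PostExecFileRunner {
        // Reads the user-supplied SQL file and runs each statement on the
        // same connection that was used for the import.
        public static void run(Connection conn, String postExecFile) throws Exception {
            String sql = new String(
                    Files.readAllBytes(Paths.get(postExecFile)), StandardCharsets.UTF_8);
            Statement stmt = conn.createStatement();
            for (String statement : sql.split(";")) {
                if (!statement.trim().isEmpty()) {
                    stmt.execute(statement);
                }
            }
            stmt.close();
        }
    }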

--Brett 12:26, 24 September 2008 (UTC) I've added an "action" column to all tables which is populated during import and changeset application. Osmosis then kicks off an osmosisUpdate stored proc which is empty by default but can be updated as required. After the stored proc completes, the action columns are all reset. Everything occurs within a single transaction. It is only available in the 0.6 tasks as of this writing.
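
In outline, the per-run sequence would look something like the sketch below. Only the osmosisUpdate name and the single-transaction behaviour are stated above; the exact form of the reset statements is an assumption:

    import java.sql.Connection;
    import java.sql.Statement;

    public class OsmosisUpdateHook {
        // Runs inside the same transaction that applied the changeset.
        public static void runHookAndReset(Connection conn) throws Exception {
            Statement stmt = conn.createStatement();

            // User-defined processing; the shipped osmosisUpdate proc is empty.
            stmt.execute("SELECT osmosisUpdate()");

            // Reset the action markers afterwards (assumed form of the reset).
            stmt.executeUpdate("UPDATE nodes SET action = NULL");
            stmt.executeUpdate("UPDATE ways SET action = NULL");
            stmt.executeUpdate("UPDATE relations SET action = NULL");

            stmt.close();
            conn.commit();
        }
    }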

Improve Created_By Handling

Fully 2/3 of all tags on nodes are "created_by". For most nodes this is the only tag. For ways the ratio is different, but still 3/4 of all roads have a "created_by" tag. How about making "created_by" a special case and putting it into a column on the nodes, ways, and relations tables? For the nodes this would mean the node_tags table shrinks to 1/3 of its size. For the ways and relations it doesn't matter that much, but it should be done for consistency. The reading and writing tasks in Osmosis would get a bit more complicated, but not much. In theory this could be done for other tags as well, but we would hit diminishing returns quickly: for ways the most common keys are "highway", "name" and "created_by". I think "created_by" is a special case anyway, so it would make sense to do this only for it.
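
As a one-off migration, this could look roughly like the following sketch. The node_tags column names (node_id, k, v) are assumed to match the simple schema; everything else is illustrative, not an agreed schema change:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreatedByMigration {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/osm", "osm", "password");
                 Statement stmt = conn.createStatement()) {

                // Give nodes their own created_by column ...
                stmt.execute("ALTER TABLE nodes ADD COLUMN created_by text");

                // ... copy the tag values over ...
                stmt.executeUpdate(
                        "UPDATE nodes SET created_by = t.v"
                        + " FROM node_tags t"
                        + " WHERE t.node_id = nodes.id AND t.k = 'created_by'");

                // ... and drop the now-redundant tag rows.
                stmt.executeUpdate("DELETE FROM node_tags WHERE k = 'created_by'");
            }
        }
    }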

Improve BBox Creation Speed

Lots of time is spent on the bbox UPDATE (in my case I am not only creating a bbox, but also building the whole way geometry from the node geometries). The idea for speeding this up is as follows: while reading in the nodes, we keep all node lat/lon values in memory. When writing the ways later, we retrieve the needed locations and calculate the bbox and/or way geometry on the fly. Storing all the coordinates in memory shouldn't be a big deal: lat and lon can each be stored in a 4-byte integer. The largest node id currently used is about 275 million (with 249 million nodes in actual use), so a plain lookup table would need 2 * 4 * 275 million bytes = 2.2 GB. That fits in memory on many machines these days. I know how to implement this in C, but I am not sure how it can be done in Java, because we don't want much overhead here. Of course the whole thing needs to be optional for people without enough memory; if you are not importing the whole planet but only one little town, it doesn't make much sense.
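
A sketch of the flat lookup table in Java, assuming a fixed-point encoding with seven decimal places (the scale factor is an assumption, not part of the proposal):

    public class NodeLocationStore {
        private static final double SCALE = 10000000.0; // fixed-point factor (assumed)

        // Two flat arrays indexed directly by node id; for a maximum id of
        // ~275 million this needs 2 * 4 * 275 million bytes = roughly 2.2 GB.
        private final int[] lats;
        private final int[] lons;

        public NodeLocationStore(int maxNodeId) {
            lats = new int[maxNodeId + 1];
            lons = new int[maxNodeId + 1];
        }

        public void put(int nodeId, double lat, double lon) {
            lats[nodeId] = (int) Math.round(lat * SCALE);
            lons[nodeId] = (int) Math.round(lon * SCALE);
        }

        public double getLat(int nodeId) {
            return lats[nodeId] / SCALE;
        }

        public double getLon(int nodeId) {
            return lons[nodeId] / SCALE;
        }
    }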

Further discussion on this has brought up the possibility of using memory-mapped files for this purpose. That should allow efficient array-like storage with minimal overhead, without being subject to Java memory management limitations.
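
A sketch of the memory-mapped variant. Note that a single MappedByteBuffer can only address 2 GB (a 31-bit index), so a full-planet store would need several mapped segments; this sketch covers only ids whose offsets fit into one mapping:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedNodeLocationStore {
        private static final int RECORD_SIZE = 8; // 4-byte lat + 4-byte lon

        private final MappedByteBuffer buffer;

        public MappedNodeLocationStore(String fileName, int maxNodeId) throws Exception {
            RandomAccessFile file = new RandomAccessFile(fileName, "rw");
            buffer = file.getChannel().map(
                    FileChannel.MapMode.READ_WRITE, 0, (maxNodeId + 1) * (long) RECORD_SIZE);
        }

        public void put(int nodeId, int fixedLat, int fixedLon) {
            buffer.putInt(nodeId * RECORD_SIZE, fixedLat);
            buffer.putInt(nodeId * RECORD_SIZE + 4, fixedLon);
        }

        public int getFixedLat(int nodeId) {
            return buffer.getInt(nodeId * RECORD_SIZE);
        }

        public int getFixedLon(int nodeId) {
            return buffer.getInt(nodeId * RECORD_SIZE + 4);
        }
    }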

--Brett 12:30, 24 September 2008 (UTC) I've spent many hours on this and think I have something reasonable working now. Memory mapping didn't work on a 32-bit system because Java only allows a 31-bit address space to be mapped in. I now use a file that I randomly seek over to find the required nodes, with a reasonably intelligent caching layer on top to make this faster. I can process the entire planet on my laptop (with the file writing code commented out due to disk space limitations) in less than 6 hours. It is available in the --write-pgsql and --write-pgsql-dump classes. Only available in 0.6 tasks at the time of writing.
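
A toy version of that idea: random seeks over a node location file, with an access-ordered LinkedHashMap standing in for the caching layer. This is only an illustration of the approach, not the actual Osmosis code:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SeekingNodeLocationStore {
        private static final int RECORD_SIZE = 8;      // 4-byte lat + 4-byte lon
        private static final int CACHE_SIZE = 1000000; // arbitrary choice

        private final RandomAccessFile file;

        // Access-ordered LinkedHashMap as a simple LRU cache of recently read
        // records; the real caching layer in Osmosis is more elaborate.
        private final Map<Long, int[]> cache =
                new LinkedHashMap<Long, int[]>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<Long, int[]> eldest) {
                        return size() > CACHE_SIZE;
                    }
                };

        public SeekingNodeLocationStore(String fileName) throws IOException {
            file = new RandomAccessFile(fileName, "r");
        }

        public int[] getLatLon(long nodeId) throws IOException {
            int[] latLon = cache.get(nodeId);
            if (latLon == null) {
                file.seek(nodeId * RECORD_SIZE); // long offset, not limited to 2 GB
                latLon = new int[] { file.readInt(), file.readInt() };
                cache.put(nodeId, latLon);
            }
            return latLon;
        }
    }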