Google Summer of Code/2017/Project Ideas/Nominatim postcode project

From OpenStreetMap Wiki
Jump to navigation Jump to search

Objective

Improve Nominatim database table structure which will result in better search results for queries containing postcodes.

Also tracked in github issue #403 where further issues and examples are linked.

How it works currently

During database initial import Nominatim checks addresses for postcode values and generates postcode places (table record) in the placex table (osm_type=P, class=place, type=postcocde). Each place is linked to a parent place, this builds a hierarchy for every address. Streets are linked to postcodes and postcodes to cities (parent_place_id column).

This approach has drawbacks

  • often the postcode exists only as single node (point)
  • postcode records are not updated when addresses change in the database
  • in most countries postcodes follow their own hierarchy defined by the country's post office administration. It's not usual that a postcode goes about city boundaries. Postcodes can be smaller than a city or contain multiple cities.
  • streets (in some countries even buildings) can belong to more than one postcode

The new approach should be

  • place postcode records in a separate table
  • make it possible to update that table
  • don't link places to postcodes. Instead during query-time ignore postcodes in the query and check all found addresses against the postcode (distance) later


Remember: OpenStreetMap data can be imperfect, absent, unclean (typos) and we support 200 countries and 100+ languages. Try to improve Nominatim's algorithms without changing OSM data itself.


Rough outline of steps involved

Create separate table

Postcode records are currently in the placex table

  osm_type=p
  class=place
  type=postcode
  osm_id=seq_postcodes
  always only centroid

A new table could look like

  country_code
  postcode
  prefix (boolean)
  centroid
  geometry


Centroid and geometry columns are types of the PostgreSQL extension.

The maximum length of postcodes needs to be determined. United States has up to 10 characters (12345-6789) for example.


Write to new table during initial import

Currently that happens in the calculate-postcodes step in utils/setup.php. It is run once during the initial import.

The logic should move into a (testable) PHP library.

Write the postcode, but also the derived prefixes into the table.


Change forward search logic to query that table

Currently that happens in lib/Geocode.php where a postcode is another phrase gets checked in the words table. Addresses are linked to postcodes and postcodes are treated as having a certain place rank.

Instead we should ignore the postcode in the query (remove it, treat it different), try to find places (place candidates) as before and do a final check which of the places is near the postcodes.

Ideally if the postcode cannot be found we look at its prefixes as well. For example for the GB postcode SW14 4RX it would be great to check the nearby SW14 4.

Write tests

Apart from the PHP test suite Nominatim also uses Python behave test suite to evaluate results of queries against the API.


Write during incremental update

Currently not done.

We were thinking about a PHP script that regularly updates the data. Scale (number of database transactions, data size) needs to be considered. Eventually that logic should move into PL/SQL (see functions.sql). PL/SQL is to complex (and harder to test) to be part of this project proposal.


Optional: Update GB postcodes

On nominatim.org we supply a pre-generated list of GB postcodes. It hasn't been updated in a while. Write a PHP script that converts Ordnance Survey Code-Point Open into an intermediate format, e.g. postcode|lat|lon or JSON/GeoJSON, and as a second step into SQL files that write to the new table.



Ideas for PHP methods to handle postcodes

After the generic table and updates works it makes sense to move logic into testable methods. Below ideas how that could be structured. A lot of the logic will be country specific and should be easy to change by local interested mappers. Ideally via configuration files per country. Don't write logic for each country yourself, too time-consuming.

  • A method to normalize postcodes. For example in NL-1000 the NL is the country and can be removed. For other countries there might be prefixes, optional spaces, dashes.
  • A method to deal with OSM values containing multiple postcodes. The usual though rare convention is split by semicolon, e.g. 1067 JR;1067JR (that's a special case because both should be normalized to the same string and thus be treated as duplicate).
  • A method to check if a string follows the postcode format of that country. E.g. French postcodes are 5 digits. Anything not 5 digits (after normalizing) should be refused.
  • A method that recognizes a postcode inside a string. Given a full address 17 Rue du Médecin-Colonel Calbairac, 31000 Toulouse, France it would find 31000. We may be able to recognize the country already (or the user has specified that) so that can be given as hint to the method.
  • A method to derive a prefix from a postcode. Sometimes called area postcode or district. For example for 1234 in The Netherlands that's 12. In Spain it's the first two digits. In the United Kingdom the part before the space (actually there are more complicated rules). Some postcodes might have zero, others two prefixes.
  • A method to combine multiple postcode coordinates into an area (geometry, polygon). For example if 10 addresses share the same postcode we want to create a boundary and a center point for that postcode. Ideally this has outlier detection, we don't want a typo in an address to create a boundary that covers half the country. Checks what is considered an outlier should probably be its own method so it can be tweaked for different countries.
  • PHP tests for each. Make it easy to add more test cases with little to no coding (e.g. configuration files).


Postcode coverage in worldwide Nominatim database

  SELECT calculated_country_code AS iso, count(*) AS c
  FROM placex 
  WHERE class='place' AND type='postcode' AND osm_type='P'
  GROUP BY calculated_country_code
  ORDER BY c DESC
  LIMIT 10;
iso |   c
------------
nl  | 462088
gb  | 114690 (excluding the 'gb_postcodes' table)
us  |  80433
ca  |  26967
ro  |  23586
br  |  21897
ru  |  18358
pl  |  16682
ua  |  15791
jo  |  11140
de  |   8475
jp  |   7950
fr  |   7174
in  |   7117
pt  |   6277
se  |   5583
es  |   5148
sg  |   4924
it  |   4015
mx  |   3786