Nominatim/Development overview


Overview

Nominatim comprises:

  • osm2pgsql output routine
  • PostgreSQL module
  • Set of PL/pgSQL functions
  • PHP management / interface

Nominatim indexes all named features and a selection of points of interest, and provides search, reverse geocoding and some gazetteer functionality.

Tags processed

The following tags are currently imported:

Names

  • ref, int_ref, nat_ref, reg_ref, loc_ref, old_ref, ncn_ref, rcn_ref, lcn_ref
  • iata, icao
  • pcode:1, pcode:2, pcode:3
  • un:pcode:1, un:pcode:2, un:pcode:3
  • name, name:*
  • int_name, int_name:*
  • nat_name, nat_name:*
  • reg_name, reg_name:*
  • loc_name, loc_name:*
  • old_name, old_name:*
  • alt_name, alt_name:*
  • official_name, official_name:*
  • commonname, commonname:*
  • common_name, common_name:*
  • place_name, place_name:*
  • short_name, short_name:*
  • operator (a bit of an oddity, needed for many shops)

Country

  • country_code_iso3166_1_alpha_2
  • country_code_iso3166_1
  • country_code_iso3166
  • country_code
  • iso3166-1
  • ISO3166-1
  • iso3166
  • is_in:country_code
  • addr:country (if 2 characters long)
  • addr:country_code

Postcodes

  • postal_code
  • post_code
  • postcode
  • addr:postcode

Is in

  • is_in
  • is_in:*
  • addr:country (if longer than 2 characters)
  • addr:county
  • addr:city
  • addr:state

House numbers

  • addr:conscriptionnumber, addr:streetnumber
  • addr:housenumber
  • addr:interpolation

Street

  • addr:street

Indexing/address calculation

Country to street level

All indexed features are assigned a rank in a simple hierarchy of importance, scored between 0 and 30 (where 0 is most important). The rank takes account of differences in interpretation between countries but is generally calculated as:

Feature                                                                   Rank
Administrative boundaries (admin_level * 2)                               4–22
Continent, sea                                                            2
Country                                                                   4
State                                                                     8
Region                                                                    10
County                                                                    12
City                                                                      16
Island, town, moor, waterways                                             17
Village, hamlet, municipality, district, borough, airport, national park  18
Suburb, croft, subdivision, farm, locality, islet                         20
Hall of residence, neighbourhood, housing estate, landuse (polygon only)  22
Airport, street, road                                                     26
Paths, cycleways, service roads, etc.                                     27
House, building                                                           28
Postcode                                                                  11–25 (depends on country)
Other                                                                     30
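
As a rough illustration of the rank table, the following Python sketch maps a feature to its rank. The feature labels and the function name feature_rank are assumptions made for this example, not Nominatim identifiers. Postcodes are left out because their rank depends on the country, and airport is left out because the table lists it at both 18 and 26.

```python
# Illustrative only: rank lookup following the table above.
# The keys are hypothetical labels, not necessarily the tag values
# that Nominatim matches on.
FEATURE_RANKS = {
    "continent": 2, "sea": 2,
    "country": 4,
    "state": 8,
    "region": 10,
    "county": 12,
    "city": 16,
    "island": 17, "town": 17, "moor": 17, "waterway": 17,
    "village": 18, "hamlet": 18, "municipality": 18, "district": 18,
    "borough": 18, "national_park": 18,
    "suburb": 20, "croft": 20, "subdivision": 20, "farm": 20,
    "locality": 20, "islet": 20,
    "hall_of_residence": 22, "neighbourhood": 22, "housing_estate": 22,
    "landuse": 22,
    "street": 26, "road": 26,
    "path": 27, "cycleway": 27, "service_road": 27,
    "house": 28, "building": 28,
}

OTHER_RANK = 30  # everything not listed in the table


def feature_rank(kind, admin_level=None):
    """Return the rank of a feature; 0 is most important, 30 least."""
    if admin_level is not None:
        # Administrative boundaries: admin_level * 2 (4-22 in the table).
        return admin_level * 2
    return FEATURE_RANKS.get(kind, OTHER_RANK)
```

With these assumptions, feature_rank("boundary", admin_level=6) and feature_rank("county") both give 12.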

For each feature down to rank 26 (street level), a list of parents is calculated using the following algorithm:

  1. All polygon/multi-polygon areas which contain this feature (in order of size).
  2. All items listed by name in the is_in tag, searched for within the current country (in no particular order).
  3. The nearest feature for each higher rank, and all others within 1.5 times the distance to the nearest (in order of distance).

A list of keywords is then generated from those features.

During the indexing process an address is also calculated using the first feature found for each level. Where an is_in value is provided it is used to filter the address.
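
Step 3 of the parent calculation (the nearest feature for each higher rank, plus all others within 1.5 times that distance) can be sketched as follows. This is an illustration under simplifying assumptions, not the PL/pgSQL that Nominatim runs: positions are treated as planar (x, y) points, "higher rank" is read as a numerically lower rank value, and the name nearby_parents is hypothetical.

```python
import math
from collections import defaultdict


def nearby_parents(feature_xy, feature_rank, candidates):
    """Pick parent candidates per rank by distance.

    candidates is an iterable of (name, rank, (x, y)) tuples.
    Returns (name, rank, distance) tuples, ordered by rank and then
    by distance.
    """
    by_rank = defaultdict(list)
    for name, rank, xy in candidates:
        if rank < feature_rank:  # only more important (higher) ranks
            by_rank[rank].append((math.dist(feature_xy, xy), name))

    parents = []
    for rank in sorted(by_rank):
        entries = sorted(by_rank[rank])
        nearest = entries[0][0]
        # Keep the nearest feature and all others within 1.5x its distance.
        for distance, name in entries:
            if distance <= 1.5 * nearest:
                parents.append((name, rank, distance))
    return parents
```

For a street at (0, 0) with cities at distances 5, 6 and 10, the first two are kept (6 is within 1.5 × 5 = 7.5) and the third is dropped.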

Building indexing

Buildings, houses and other features below street level (e.g., bus stops, phone boxes, etc.) are indexed by relating them to their most appropriate nearby street.

The street is calculated as:

  1. The street member of an associatedStreet relation
  2. If the node is part of a way:
    1. If this way is street level, then that street
    2. The street member of an associatedStreet relation that this way is in
    3. A street way within 50/100 meters and parallel with the way we are in
  3. A nearby street with the name given in addr:street of the feature we are in or the feature we are part of
  4. The nearest street (up to 3 miles)
  5. Not linked

All address information is then obtained from the street. As a result addr:* tags on low level features are not processed (except as above).
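
The street lookup above is a priority chain: the first rule that yields a street wins. The following Python sketch shows only that ordering. It assumes the candidate street for each rule has already been found by some other means; choose_street and its parameter names are hypothetical.

```python
def choose_street(relation_street=None, own_way_street=None,
                  way_relation_street=None, parallel_street=None,
                  named_street=None, nearest_street=None):
    """Return (rule, street) for the first rule that found a street.

    Each argument is the street found by one rule, or None if that
    rule found nothing. The order mirrors the numbered list above.
    """
    candidates = [
        ("associatedStreet relation", relation_street),
        ("containing way is street level", own_way_street),
        ("associatedStreet relation of containing way", way_relation_street),
        ("parallel street near containing way", parallel_street),
        ("nearby street matching addr:street", named_street),
        ("nearest street", nearest_street),
    ]
    for rule, street in candidates:
        if street is not None:
            return rule, street
    return "not linked", None
```

A house with both a matching addr:street and a different nearest street is therefore attached to the addr:street match, since that rule comes first.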

For interpolated ways, simple numerical sequences are filled in (alphanumeric sequences are not currently handled) and additional building nodes are inserted into the way by duplicating the first (lowest) house number in the sequence.
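
Filling in a simple numerical sequence might look like the following Python sketch. It assumes the addr:interpolation values all, even and odd from the OSM tagging conventions, and only produces the house numbers strictly between the two end nodes. How Nominatim places the resulting nodes along the way is not covered, and the function name interpolated_numbers is hypothetical.

```python
def interpolated_numbers(start, end, mode="all"):
    """Return the house numbers strictly between start and end.

    mode is "all", "even" or "odd". Alphanumeric sequences are not
    handled, matching the limitation described above.
    """
    low, high = sorted((start, end))
    if mode == "all":
        return list(range(low + 1, high))
    if mode not in ("even", "odd"):
        raise ValueError("unsupported interpolation mode: %r" % mode)
    parity = 0 if mode == "even" else 1
    first = low + 1
    if first % 2 != parity:
        first += 1  # skip ahead to the first number with the right parity
    return list(range(first, high, 2))
```

For example, interpolated_numbers(2, 10, "even") gives [4, 6, 8].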

Search algorithm

TODO

Wikipedia

 createdb wikipedia

Edit settings/local.php to add the database DSN.

Edit utils/import_wikipedia.sh to set the psql command.

 mkdir wikipediadata
 cd wikipediadata
 ../utils/importWikipedia.php --create-tables
 ../utils/import_wikipedia.sh
 ../utils/importWikipedia.php --parse-articles

import_wikipedia.sh will take on the order of 24 hours to run; importWikipedia.php --parse-articles will take several days to run.

Scores

Scores are calculated using the number of references to the article from other Wikipedia articles.

The importance value is calculated as log(totalcount)/log(max totalcount), i.e.:

 update wikipedia_article set importance = log(totalcount)/log((select max(totalcount) from wikipedia_article));

Unfortunately this includes all links to an article, including navigation links (see the bottom of http://en.wikipedia.org/wiki/Pinellas_County,_FL for an example). These navigation boxes significantly increase the scores of the articles they include and need to be removed from the count. This requires reparsing all Wikipedia articles as part of the import instead of using the pagelinks database dumps.
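
To get a feel for the importance formula, here is a small Python sketch of the same calculation. The function name importance and the input checks are assumptions made for this example. Because the result is a ratio of two logarithms, the logarithm base does not matter.

```python
import math


def importance(totalcount, max_totalcount):
    """Compute log(totalcount) / log(max_totalcount).

    Yields 0.0 for an article with a single link and 1.0 for the most
    linked article. The checks below are an assumption of this sketch;
    the SQL update above does not guard against these cases.
    """
    if totalcount < 1 or max_totalcount <= 1:
        raise ValueError("need totalcount >= 1 and max_totalcount > 1")
    return math.log(totalcount) / math.log(max_totalcount)
```

With a maximum of 1,000,000 links, an article with 1,000 links gets an importance of about 0.5, which shows how the logarithm compresses large differences in link counts.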