Name finder

From OpenStreetMap Wiki

Name finder was a tool to search for names and related items like road numbers in the OSM database. For a long time it was the main geocoding search engine powering the search box on the openstreetmap.org front page, as well as running at gazetteer.openstreetmap.org/namefinder.

As of August 2010 (See announcement) Name finder is no longer in active service on the front page, or at gazetteer.openstreetmap.org.

The name finder was built by David.earl, but the code is (possibly) no longer actively developed or deployed anywhere.

Searching for things with name finder

You can find anything in the OpenStreetMap database which has been given a name (e.g. a place, street, pub, school etc., e.g. Bakers Arms) or a reference (e.g. M11), or by its type or its plural (e.g. pub or churches).

You can give just one word or several (e.g. Hinton or Hinton Road), but they must appear in the same order in what you are looking for.

In particular, you can restrict your search to a place using a comma or the word 'near' (lower case only) to separate the two (e.g. Hinton Rd, Cambridge). You can distinguish between several places of the same name (e.g. Cambridge in Cambs or Gloucs, UK; Ontario, Canada; Mass., USA etc.) by giving the county or country after another comma (e.g. Coronation, Cambridge, England), assuming the place includes the information (with the OSM 'is_in' tag, and possibly later with the 'is_in' relation).

You can also use a latitude,longitude either as the search target or the place qualifying the search, hence pubs near 52.2,0.19.

Furthermore, you can find the distance between two results by putting a colon between them. So Hinton Road, Fulbourn: Crossings Road, Chapel-en-le-Frith will look up both and tell you how far it is between them.
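
The syntax described so far (comma or 'near' for the place, a second comma for the is_in qualifier, a colon for distance searches, and lat,lon pairs) can be illustrated with a small parser. This is only a sketch of the rules as stated on this page, not the Name finder's own code; the function names and the returned dictionary layout are assumptions made for the example.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def split_terms(text):
    """Split one search into name, place and is_in terms."""
    # The lower-case word 'near' is treated like the first comma.
    text = re.sub(r"\s+near\s+", ", ", text, count=1)
    raw = [part.strip() for part in text.split(",")]
    terms = []
    i = 0
    while i < len(raw):
        # Two adjacent numbers form a latitude,longitude pair.
        if (i + 1 < len(raw) and NUMBER.fullmatch(raw[i])
                and NUMBER.fullmatch(raw[i + 1])):
            terms.append((float(raw[i]), float(raw[i + 1])))
            i += 2
        else:
            terms.append(raw[i])
            i += 1
    return terms


def parse_query(query):
    """Parse a full search string; a colon makes it a distance search."""
    searches = []
    for half in query.split(":"):
        terms = split_terms(half.strip())
        terms += [None] * (3 - len(terms))  # pad missing place / is_in
        searches.append(
            {"name": terms[0], "place": terms[1], "is_in": terms[2]}
        )
    return {"distance_search": len(searches) > 1, "searches": searches}


print(parse_query("Hinton Rd near Fulbourn"))
print(parse_query("pubs near 52.2,0.19"))
print(parse_query("Hinton Road, Fulbourn: Crossings Road, Chapel-en-le-Frith"))
```

The sketch does not attempt the type discrimination or plural handling described in the following paragraphs.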

If you want to discriminate by type where things have the same name, put this after the name. For example: Chinley Station will give you the railway station (and in this case, also Station car park, as station appears in the name of the car park) while Chinley village will give you the village. This doesn't apply to streets: you can't say Hinton Road highway for example.

If you search for cities near Islington, for example, you will only get cities back, not named items like City Road. Likewise, towns, suburbs, villages, hamlets, and places. These are recognised and constrained only to match the relevant items. This does not extend to their singulars or capitalized forms, so Towns near 52.0,0.0 unusually produces a different result from towns near 52.0,0.0.

I am considering a more analytic approach to search terms, so we don't rely on a syntax that isn't necessarily obvious. To that end it would be helpful if people could add to the namefinder's address format page so I get a good idea of what street addresses look like across the world.

Example searches

Fulbourn

  • finds anything called or containing Fulbourn in its name:
  village Fulbourn in Cambridgeshire, England, UK (which is about 
  7km east of city Cambridge, ditto) found.
  
  unclassified road Fulbourn Old Drift (east) about 2km south-east 
  of middle of village Teversham in Cambridgeshire, England, UK (which
  is about 4km east of city Cambridge, ditto) found ...

places near 52.18,0.20

  • where is this latitude,longitude?
   requested location found about 1km west of middle of village 
   Fulbourn in Cambridgeshire, England, UK (which is about 7km east 
   of city Cambridge, ditto)

Hinton Road, Fulbourn

Hinton Rd near Fulbourn

  • equivalent to above

Hinton Road near 52.18,0.20

hospitals near Fulbourn

  • churches, pubs, supermarkets, stations, atms ...

pubs near 52.18,0.20

post offices near Cambridge, UK

  • qualifying which one, using is_in

places near Fulbourn

villages near 52.18,0.20

suburbs near Cambridge

airports near Cambridge

airports

  • not that helpful, you'll just get a random 30 of them

Victoria pub, Oxford

  • as opposed to Victoria Station, Victoria Road etc, even though 'pub' doesn't appear in the name

Ipswich : 52.18,0.20

  • how far is it between the two locations?

Abbreviations, Accented characters etc.

Words like Road and Rd, Avenue and Ave, East and E are equivalent. There is a list of what abbreviations are recognised. Please add to the list for other languages.

Letters like ß and ss are equivalent, æ and ae, å and aa etc (so Aarhus will find the Danish city of Århus and vice-versa).

Names with accented characters will be located even if you omit the accent (e.g. Sodertalje finds Södertälje). ø is recognised by o. Pretty much every diacritical 'latin-like' character in the Unicode character set is matched by its non-accented equivalent.

Also, if you search for the linguistically correct transliteration, you'll get a hit too. For example München, Munchen and Muenchen are all equivalent.

Punctuation and case are pretty much ignored, except that spaces divide words, so Field Fare Way will not match Fieldfare Way. However, certain suffixes which are commonly quoted either as part of a longer word or as separate words in languages like German and Dutch - for example Straße and Straat - are recognised to be special and will match whether or not concatenated and/or abbreviated. So for example, the following are all equivalent to the Name finder: Budapester Platz, Budapester Pl, Budapesterpl, Budapesterplatz. See the abbreviations page for a list of recognised suffixes.

English possessives are also ignored (e.g. St Andrew or St Andrews will find St Andrew’s and vice-versa).

Where things on the map have been given names in more than one language (using a 'name:languagecode' tag), the name finder will find them in any of the languages.

Definite articles are omitted both from the index and in searches. For example, if the surveyor has included only Bakers Arms, a search for The Bakers Arms will match. Definite articles treated in this way are the, le, la, der, die, das, el, il.
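
To make the rules in this section concrete, here is a minimal sketch of how a name might be reduced to a single canonical form: lower-casing, folding accents and special letters, dropping apostrophes and other punctuation, and removing definite articles. The function name and the small SPECIAL table are assumptions for illustration. The real index also stores alternate forms (abbreviations, split and joined suffixes, transliterations such as Muenchen, and singular forms of possessives), which this sketch does not attempt.

```python
import re
import unicodedata

# Letters without a Unicode decomposition, folded by hand (illustrative subset).
SPECIAL = {"ß": "ss", "æ": "ae", "ø": "o", "å": "aa"}
# The definite articles listed above.
ARTICLES = {"the", "le", "la", "der", "die", "das", "el", "il"}


def canonicalise(name):
    """Return one canonical form of a name (a sketch, not the real algorithm)."""
    text = unicodedata.normalize("NFC", name).lower()
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    # Decompose remaining accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Apostrophes vanish, so "St Andrew's" and "St Andrews" coincide.
    text = re.sub(r"['’]", "", text)
    # Any other punctuation just separates words.
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in ARTICLES]
    return " ".join(words)


print(canonicalise("The Bakers Arms"))
print(canonicalise("Södertälje"))
print(canonicalise("Århus"))
```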

Limitations

Because types of item and their names use the same index, places [or place] near Cambridge will yield street Abbey Place... as well as suburb Chesterton, and likewise churches near Waterbeach may yield street Church Lane... among the results. For this reason, place of worship is not included in the index at the moment: if I put that term in the index, places near Cambridge would also yield all the places of worship. Searches such as place of worship near Cambridge therefore don't work. However, place of worship is translated into church or mosque if there is enough information in the place_of_worship node to determine this; Hindu temples and the like will have to wait for now.

In general, an item has to have a name (including name:language), ref or airport code in order to be found. However, there are some exceptions to this to put certain anonymous items in the index, for example supermarkets and cinemas, so cinemas near Cambridge will give you them all, not just the named ones.

Highways are not indexed by type at all, so you can't currently say highways near Cambridge.

Types of things are only indexed in English. Kirchen nahe München and Hôpitaux pres Besançon sadly don't work. You have to say churches near München or hospitals near Besançon. In a future version I will attempt to support multiple languages here, if you will help with the translations. There is a translations page to gather these.

UK Postcodes

The Namefinder can also make limited use of UK postcodes. Searches for postcode alone are considered experimental: see below.

You can use either a UK postcode or postcode prefix (the bit before the space) in the following contexts:

  • as a term to search for, e.g. CB21 5DZ or CB21
  • as a qualifier instead of a place e.g. Hinton Road, CB21 5DZ or Hinton Road, CB21
  • as a restriction on a qualifying place instead of a country or county, e.g. Hinton Road, Fulbourn, CB21 5DZ or Hinton Road, Fulbourn, CB21. Exceptionally the second comma can also be omitted, as in Adelaide Road, London NW11

A postcode prefix search merely searches for the name in the index. The postcode areas are on the map as nodes.

A full postcode used as a qualifier or restrictor is abbreviated to the partial postcode. This is then used either to limit the search term to results near to the postcode area centroid or limit instances of the place name given to those near the postcode area centroid respectively.
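
As an illustration of abbreviating a full postcode to its prefix, the following sketch recognises a UK postcode or prefix and returns the part before the space. The regular expression is a simplified approximation of the UK postcode format and the function name is made up for this example; neither is taken from the Name finder source.

```python
import re

# Outward code (prefix), optionally followed by the inward code.
# A simplification of the real UK postcode rules.
UK_POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)(?:\s+\d[A-Z]{2})?$")


def postcode_prefix(text):
    """Return the postcode prefix, or None if text is not a postcode."""
    match = UK_POSTCODE.match(text.strip().upper())
    return match.group(1) if match else None


print(postcode_prefix("CB21 5DZ"))
print(postcode_prefix("NW11"))
print(postcode_prefix("Fulbourn"))
```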

Full postcode searches work around the problem of copyright databases by translating the postcode into a street name (or possibly a building name or the like), place name and postcode prefix, and then using those to do the search, i.e. to geolocate the postcode, in the usual way as above. The translation is done by a Google search for the postcode and analyzing the results to see if a street address precedes it, and if so to extract the salient data from it. This does, of course, depend on some address in a postcode being indexed by Google. Not all are, but it is surprisingly common - lots of people publish addresses, estate agents in particular have a good spread of postcodes for example.

How it works

Outline

An index database is built initially from the planet export and then updated regularly from the incremental differences. This indexes names in a canonical form which makes them easy to look up by name and variations, and this is further divided into individual words for efficiency of matching. The index building process is in three stages: the first updates a database of the entire planet data; the second regenerates all the indexes which share the same canonical string (both old and new when this changes) and the third updates the separate index database.

The earth's surface is divided into regions (like big map tiles) about 111km square which are identified by a single number easily derivable from latitude and longitude, and the region number for the location of each indexed name is stored in the index. Because regions are roughly equal in size, their number reduces as one goes further away from the equator and they don't share common lines of longitude as edges.
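
The exact numbering formula is not given on this page. The sketch below shows one scheme with the properties described: bands one degree of latitude high (about 111km), each divided into a number of cells that shrinks with the cosine of the latitude, so cell edges in adjacent bands do not line up. The encoding band * 1000 + cell is purely an assumption for illustration and is not expected to reproduce the region numbers seen in real results.

```python
import math


def cells_in_band(band):
    """Number of roughly 111km-wide cells in a 1-degree latitude band (0..179)."""
    mid_lat = band - 90 + 0.5
    return max(1, round(360 * math.cos(math.radians(mid_lat))))


def region_number(lat, lon):
    """Hypothetical region number for a position (not the real formula)."""
    band = min(int(math.floor(lat)) + 90, 179)
    cells = cells_in_band(band)
    cell = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    return band * 1000 + cell


print(cells_in_band(90), cells_in_band(142))
print(region_number(52.18, 0.20), region_number(52.20, 0.19))
```

Because the number comes straight from the coordinates, it can be stored with every index entry and used as a cheap first filter before any distance is calculated.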

When a search is constrained to a place (that is, after a comma or "near"), the place is looked up in the index by canonical name. If found, then its region number is determined. Then if it is "close" to the edge of a region, neighbouring region numbers adjoining those edges are also determined.

Where the 'place' is a lat,lon pair, a place is artificially constructed for it instead of looking it up.

The index doesn't store duplicate names. A duplicate name is defined as exactly the same canonical string within a given distance of another (currently 3km). This means long roads will be found at intervals, but you only get one hit for a road in an area. If there really are two different roads identically named in the same locality, this is likely to lead to confusion elsewhere anyway, so it is suggested that they are disambiguated in the name (for example, 'Silver Street (central Wakefield)' and 'Silver Street (Newton Hill)').

Then names matching the part before the comma are looked up (or constructed from a lat,lon pair), constrained to those entries with the determined region numbers (which is what makes it efficient and scalable), and sorted by increasing distance from the located place, so the database gives us back a list of potential named items close to the place. However, there are some place names which are remarkably common around the world (Cambridge, London, ...) and each has to be searched, so this can be slow.

If no nearby place was specified, or we didn't find any nearby names near the place and the 'any' parameter is on (see below), then we look up any instance of the name. This database query is constructed to provide exact matches first, then by decreasing importance of place (cities, towns, villages etc), then other non-place names, and limited to an arbitrary 30 results (or max results if the max parameter is given, see below).

In all cases, when a name is located, its context is established: that is we determine firstly the nearest place (if that is not the place we asked for - e.g. if the street located is in a nearby village not the village we asked for), and then the nearest town and/or city. This means we can give quite a lot of contextual information about what we found.

The distances between all these are computed using Pythagoras' theorem. This is adequate over what is by definition a local area, and distances are rounded to 1, 5 and 10km as they get further away. The sort order for results determined by distance also uses the planar formula to speed up the database queries. However, the distance between different search results (using a colon in the search), which can be very large, is the great circle distance.
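
The two kinds of distance can be compared with a short sketch. The planar version scales the longitude difference by the cosine of the latitude before applying Pythagoras' theorem, and the great circle version uses the standard haversine formula. The Earth radius value and the thresholds at which rounding switches from 1 to 5 to 10km are assumptions, since this page does not state them.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, assumed


def planar_km(lat1, lon1, lat2, lon2):
    """Flat-earth distance, adequate for nearby points."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_KM
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_KM
    return math.hypot(dx, dy)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance, suitable for widely separated points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def approx_km(distance):
    """Round to 1, 5 or 10km; the 10km and 50km thresholds are assumed."""
    step = 1 if distance < 10 else 5 if distance < 50 else 10
    return round(distance / step) * step


near = planar_km(52.18, 0.20, 52.20, 0.12)
print(near, great_circle_km(52.18, 0.20, 52.20, 0.12), approx_km(near))
```

Over a few kilometres the two functions agree closely, which is why the cheaper planar form is sufficient for sorting nearby results.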

In more detail

Tables used in searching

named: a named item from the OSM data, where name includes references, IATA codes, old names, foreign language names etc., and the info derived from the kind of item it is (church, pub, school, residential street...) Note that this is for all potentially searchable items, which includes duplicates both nearby and distant.

   id:        A unique identifier derived from OSM node, way and relation ids by multiplying by 
              4 and adding 1, 2 or 3 for node, way and relation respectively. In this way, all names 
              can have a single space for numeric identifiers, but still refer back to the OSM
              original if necessary.
   region:    number as described in wiki
   lat:       ) of the named item
   lon:       )
   name:      the name of the item as displayed to the user (includes 
              concatenation of bits of name mentioned above)
   canonical: a canonical form of the name (lower case, punctuation stripped etc)
   category:  the key part of the main tag describing the object (e.g. highway)
   is_in:     an is_in tag value, tidied up for human readability
   rank:      0 except for places, which are 10 through 60 for hamlet through to city
   info:      a string derived from various tags describing the object (e.g. 
              "church", "secondary road", "pub"), intended for readability.

placeindex: abstracts out some of the information in the Named table where entries are places. This is purely for efficiency - a smaller set to search when locating nearby places. Joined on id with Named when textual info then needed.

   id         )
   region     ) all as above
   lat        )
   lon        )
   rank       )
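
The id column shared by these tables follows the rule given above: multiply the OSM id by 4 and add 1, 2 or 3 for a node, way or relation. The arithmetic, and its reversal, can be sketched as follows; the function names are illustrative only.

```python
TYPE_CODES = {"node": 1, "way": 2, "relation": 3}
CODE_TYPES = {code: name for name, code in TYPE_CODES.items()}


def encode_id(osm_type, osm_id):
    """Combine an OSM type and id into a single index id."""
    return osm_id * 4 + TYPE_CODES[osm_type]


def decode_id(named_id):
    """Recover the OSM type and id from an index id."""
    osm_id, code = divmod(named_id, 4)
    return CODE_TYPES[code], osm_id


print(encode_id("way", 123456))
print(decode_id(493826))
```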

word: a list of words derived from the canonical names of named items including alternate forms (like Strasse and Straße, Muenchen and Munchen and München), split forms (like Bahnhofweg vs. Bahnhof Weg) and the info string (so searches for "pubs" etc work - in English only though at present). Note that Word entries reference only one instance of a similarly named Named of the same type within any 3km radius, so nearby duplicates are not included in any search results, but distant ones are. Hence "M11, Harlow" gives a different result from "M11, Cambridge", but you don't get dozens of hits for either just because the road is split up into many ways, at bridges and the like, or for branches of estate roads.

   id:        as above
   ordinal:   where in the sequence of words in the name the word appears 
              (e.g. in "Kilburn High Road [A5]", kilburn=1, high=2, road=3, 
              a5=1 - it starts again for alternates like the ref), used to
              determine exact matches in preference to approximate ones
   firstword: )
   lastword:  ) bools indicating that condition, ditto
   region:    as above. Not vital here, but allows searches to home in on a region here
              rather than the database engine having to join to named first, 
              though in practice MySQL only uses one index at a time
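
As an illustration of the ordinal, firstword and lastword columns, this sketch generates word rows for a display name such as "Kilburn High Road [A5]", restarting the numbering for the bracketed reference. The column holding the word itself is called word here, which is an assumption (it is not named in the list above), and alternate and split forms are left out.

```python
import re


def word_rows(named_id, display_name):
    """Build simplified Word rows for one named item."""
    # Bracketed parts such as the ref are treated as separate alternates.
    alternates = [a for a in re.split(r"[\[\]]", display_name) if a.strip()]
    rows = []
    for alternate in alternates:
        words = alternate.lower().split()
        for ordinal, word in enumerate(words, start=1):
            rows.append({
                "id": named_id,
                "word": word,
                "ordinal": ordinal,
                "firstword": ordinal == 1,
                "lastword": ordinal == len(words),
            })
    return rows


for row in word_rows(493826, "Kilburn High Road [A5]"):
    print(row)
```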

Search procedure

Searching first looks up any place qualifier if given (including an artificial place derived from lat,lon if that's given), then culls the result according to is_in for any third qualifier given ("Kilburn High Road, London, UK"), and then looks up the first part of the search by joining Word with itself on id as many times as there are words in the user's *canonicalised* search term and also with Named, using the region and neighbours of the located place(s), ordered by distance from it, or if no qualifying place then by exact match and then by approx match (Kilburn High Road only then Kilburn Road would match).

After that it gets the neighbouring settlements to provide context, hence Kilburn High Road is near Kilburn (presumably) even though the search was for London.
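
The self-join at the heart of this procedure can be tried out with a toy database. The sketch below uses SQLite in place of MySQL and heavily simplified tables (the word column name, the sample rows and the region numbers are all made up), and it only covers the case where a qualifying place has already been located. Each search word gets its own join on Word, the words must appear in the given order, and the results are sorted by planar distance.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE named (id INTEGER PRIMARY KEY, name TEXT, region INTEGER,
                        lat REAL, lon REAL);
    CREATE TABLE word (id INTEGER, word TEXT, ordinal INTEGER, region INTEGER);
""")

# Made-up sample rows: (id, name, region, lat, lon)
SAMPLE = [
    (10, "Kilburn High Road", 1, 51.54, -0.19),
    (14, "Kilburn Road", 1, 51.60, -0.30),
    (18, "High Road", 2, 51.59, -0.02),
]
for named_id, name, region, lat, lon in SAMPLE:
    conn.execute("INSERT INTO named VALUES (?, ?, ?, ?, ?)",
                 (named_id, name, region, lat, lon))
    for ordinal, word in enumerate(name.lower().split(), start=1):
        conn.execute("INSERT INTO word VALUES (?, ?, ?, ?)",
                     (named_id, word, ordinal, region))


def find_near(conn, words, regions, lat, lon):
    """Return (id, name) rows matching all words in order, nearest first."""
    joins = " ".join(f"JOIN word w{i} ON w{i}.id = n.id"
                     for i in range(len(words)))
    conds = [f"w{i}.word = ?" for i in range(len(words))]
    conds += [f"w{i}.ordinal > w{i - 1}.ordinal" for i in range(1, len(words))]
    conds.append("n.region IN (%s)" % ", ".join("?" * len(regions)))
    sql = (
        "SELECT n.id, n.name, "
        "(n.lat - ?) * (n.lat - ?) + (n.lon - ?) * (n.lon - ?) * ? AS d2 "
        f"FROM named n {joins} WHERE {' AND '.join(conds)} ORDER BY d2"
    )
    # Squared cosine of the latitude scales the longitude term.
    shrink = math.cos(math.radians(lat)) ** 2
    params = [lat, lat, lon, lon, shrink, *words, *regions]
    return [(row[0], row[1]) for row in conn.execute(sql, params)]


print(find_near(conn, ["kilburn", "road"], [1], 51.5, -0.1))
print(find_near(conn, ["high", "road"], [1, 2], 51.5, -0.1))
```

Restricting the query to a handful of region numbers is what keeps the join small; without it every Kilburn Road in the world would have to be considered.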

Tables for update management

node: stores position info (only) for all nodes

 id, lat, lon

relation_node

relation_relation

relation_way

way_node

All of those just have two columns, relating one id to others so a search on Way_node for example by way_id will return all nodes in that way, and conversely a search on node_id will return all ways the node is a part of. Way_node contains ids for all ways, and the Relation_Xs for all relations.

canonical: just a list of canonical strings affected by an update and the containing region number.

changedid: a list of the ids of anonymous but interesting items changed during an update. There are relatively few of these - most pubs do have names, for example.

Update procedure

Currently, a cron job runs updates daily.

Updates are in two phases.

import phase: read the XML file (planet, or planet difference file). For each affected object, find the related affected objects (the ways containing a node etc., using the way_node table and its siblings), and record every name found this way in the canonical table (which is emptied at the start of the update).

Where an object is modified, record both the old and new canonical names. Also edit the Named, Node and relationship tables to reflect the new state, applying deletions, modifications and additions. The position assigned to a Way is that of the middle node, except for known areas it is the centre of the bounding box, and for a relation is one of the constituent nodes or ways.
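
The rule for positioning a way can be sketched as below. The input is assumed to be the way's nodes in order as (lat, lon) pairs, and the choice of which node counts as the middle one when the count is even is an assumption.

```python
def way_position(nodes, is_area=False):
    """Position assigned to a way: middle node, or bounding-box centre for areas."""
    if is_area:
        lats = [lat for lat, _ in nodes]
        lons = [lon for _, lon in nodes]
        return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)
    return nodes[len(nodes) // 2]


road = [(52.0, 0.0), (52.1, 0.1), (52.2, 0.5)]
square = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.0), (0.0, 0.0)]
print(way_position(road))
print(way_position(square, is_area=True))
```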

More than one difference file can be imported, one after the other, and then the update phase applied once for all of them.

update phase: for each distinct Canonical/Region pair found in the import phase, find all Nameds with that pair. Clear out the word index for all those ids just found. For that list then take the first (or in the case of nodes vs ways, the way) and eliminate any others sharing that same pair within the 3km radius. Repeat until all nearby duplicates eliminated.

Create alternate forms for words in the canonical and insert them in the word index - that is, only for the remaining non-nearby-duplicates. In this way, if one way named (say) Kilburn High Road is deleted in an update, another remaining nearby would take its place so it doesn't disappear from the index, even though that Way was not affected by the update, or if a new part of Kilburn High Road is added, the whole of the nearby Kilburn High Road entries are re-evaluated to see which is the best to choose.
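
The duplicate elimination in the update phase amounts to a greedy selection, which might look like the following sketch. Each item is assumed to be a dictionary with type, lat and lon keys and all items are assumed to share one canonical/region pair; the distance helper uses a flat approximation of 111km per degree.

```python
import math


def planar_km(a, b):
    """Approximate distance in km between two items with lat/lon keys."""
    mean_lat = math.radians((a["lat"] + b["lat"]) / 2)
    dy = (b["lat"] - a["lat"]) * 111.0
    dx = (b["lon"] - a["lon"]) * 111.0 * math.cos(mean_lat)
    return math.hypot(dx, dy)


def pick_representatives(nameds, radius_km=3.0):
    """Keep one item per 3km neighbourhood, preferring ways over nodes."""
    # sorted() is stable, so ways move to the front in their original order.
    remaining = sorted(nameds, key=lambda n: n["type"] != "way")
    kept = []
    while remaining:
        head = remaining.pop(0)
        kept.append(head)
        # Drop everything within the radius of the item just kept.
        remaining = [n for n in remaining if planar_km(head, n) > radius_km]
    return kept


items = [
    {"id": 1, "type": "node", "lat": 52.000, "lon": 0.0},
    {"id": 2, "type": "way", "lat": 52.001, "lon": 0.0},
    {"id": 3, "type": "way", "lat": 52.100, "lon": 0.0},
]
print([n["id"] for n in pick_representatives(items)])
```

Only the items that survive this selection would then have their words inserted into the word index.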

(Note that by using region like this, items either side of a region boundary will not be compared for similarity. Region is not essential in theory and could remove this minor boundary case, but at the expense of considering massively more similar names worldwide. You might not think this was much of a problem, but nearly every US settlement has a 1st street and a Main street so the problem expands from a few dozen comparisons in the neighbour culls to many hundreds and becomes prohibitively expensive.)

Anonymous nodes are directly attributable to ids so their 'info' detail can simply be added or removed from the word index directly.

Initial priming is simply an update from an entire planet file consisting purely of additions, though for efficiency Word and other deletions can be turned off, as these are remarkably slow and unnecessary when the Word table is known to be empty.

XML interface

You can also get the results back in an XML file for further processing or to implement the search on another site or in a different language. The url for this is
http://gazetteer.openstreetmap.org/namefinder/search.xml?find=<urlencodedstring>
where the urlencoded string is what you would otherwise put in the search box (including commas and colons, as above). You can also optionally give parameters 'max' and 'any', as in
http://gazetteer.openstreetmap.org/namefinder/search.xml?find=<urlencodedstring>&max=20&any=1
where max gives the maximum number of results to be returned and 'any' (which can have any value) indicates that if the search finds nothing near the place specified in the search, then it will proceed to search worldwide. Previously (before September 2008) this was always the case, but the default is now not to do this (for performance reasons). It is always an error if any qualifying place is not found at all, and it is still the case that the search is relaxed to look for any place of the given name if any qualifying is_in is not satisfied.
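
Building such a URL from a search string is straightforward with a standard URL-encoding routine. The sketch below only constructs the URL and does not fetch it, since the service is no longer running; the function and argument names are made up for the example.

```python
from urllib.parse import urlencode

BASE_URL = "http://gazetteer.openstreetmap.org/namefinder/search.xml"


def search_url(find, max_results=None, search_anywhere=False):
    """Build a search.xml URL for the given search string."""
    params = {"find": find}
    if max_results is not None:
        params["max"] = max_results
    if search_anywhere:
        params["any"] = 1  # any value switches the worldwide fallback on
    return BASE_URL + "?" + urlencode(params)


print(search_url("Hinton Road, Fulbourn", max_results=20, search_anywhere=True))
```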

Though the whole result is provided in structured form, most of it can be ignored for simple presentation: just iterate the top-level named items extracting description, and in the case of distance searches, combine the referenced named elements by id and the distance between them. To locate on the map, also extract the lat and lon from the top level named elements.

In the event of an error, only the top-level searchresults element is returned, which will include an error attribute and as much other context as is available, as in

<searchresults date='2007-05-08 12:56:33' error='updating index, back soon'/>
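
A client following the two paragraphs above might be sketched like this: check for the error attribute, iterate the top-level named elements for description, lat and lon, and pick up any distance elements (described further down). The sample document is made up for the example and much shorter than a real response.

```python
import xml.etree.ElementTree as ET

# A made-up, abbreviated response used only to exercise the parser.
SAMPLE = """
<searchresults find='Ipswich : Fulbourn' distancesearch='yes'>
  <named type='node' id='111' lat='52.06' lon='1.15' name='Ipswich'>
    <description>town Ipswich found</description>
  </named>
  <named type='node' id='222' lat='52.18' lon='0.22' name='Fulbourn'>
    <description>village Fulbourn found</description>
    <nearestplaces>
      <named type='node' id='333' lat='52.2' lon='0.12' name='Cambridge'/>
    </nearestplaces>
  </named>
  <distance fromtype='node' from='111' totype='node' to='222'>64.0</distance>
</searchresults>
"""


def parse_results(xml_text):
    """Return (results keyed by (type, id), list of distances)."""
    root = ET.fromstring(xml_text.strip())
    if root.get("error"):
        raise RuntimeError(root.get("error"))
    results = {}
    # findall("named") only sees the top-level results, not nested context.
    for named in root.findall("named"):
        key = (named.get("type"), named.get("id"))
        results[key] = {
            "description": (named.findtext("description") or "").strip(),
            "lat": float(named.get("lat")),
            "lon": float(named.get("lon")),
        }
    distances = [
        ((d.get("fromtype"), d.get("from")),
         (d.get("totype"), d.get("to")),
         float(d.text))
        for d in root.findall("distance")
    ]
    return results, distances


results, distances = parse_results(SAMPLE)
print(results)
print(distances)
```

Keying the results by both type and id matters because OSM ids are only unique within a type.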

The XML is as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<searchresults 
  date='2007-05-08 12:56:33' 
        <!-- date of search -->
  sourcedate='2007-05-02'
        <!-- date of index data (planet file) -->
  find='Newmarket Road, Cambridge, England' 
        <!-- the original search string requested -->
  distancesearch='no'
        <!-- or yes; says whether to expect <distance> items in response to the 
	     colon in the find string -->
  findname='Newmarket Road'
        <!-- the name part of the string (before the first comma) -->
  findplace='Cambridge'
        <!-- the place part of find, after the first comma, possibly omitted -->
  findisin='England'
        <!-- the is_in part of the search string, after the second comma, 
             possibly omitted -->
  foundnearplace='yes'
        <!-- or no; if no, says that though the place was found, the name was 
	     not found in or near it, but somewhere else; <place> below 
	     will then be absent in all cases -->
    <!-- in the case of a distancesearch, these will be findname1 and findname2 etc, and 
         foundnearplace1,foundnearplace2 -->
  >
    <named ...>...</named>
    <named ...>...</named>
    ...
    <distance ...>...</distance>
    <distance ...>...</distance>
    ...
</searchresults>

where <named> is a result, also used recursively to describe the context, as follows:

<named
  type='way'
        <!-- 'node', 'way' or 'relation' -->
  id='123456' 
        <!-- the osm id from which the item was obtained -->
  lat='52.123456' 
        <!-- latitude of the named item. Note: for segments this is the midpoint 
             of the segment and for ways is the midpoint of a segment 
             selected from the middle of the list of segments for the way -->
  lon='0.123456' 
        <!-- longitude, ditto -->
  name='Newmarket Road [A1304]' 
        <!-- the name of the item. Note that this is an amalgam of name, ref, 
             name:lang and possibly others -->
  category='highway'
        <!-- the principal type key -->
  info='primary road'
        <!-- a sanitised readable version of the principal tag key/value pair -->
  rank='0'
        <!-- 0 except for places, which are numerically ranked by importance -->
  is_in='Cambridgeshire, England, UK'
        <!-- a tidied up equivalent of the item's is_in tag value, generally only 
             applied to places -->
  region='52231'
        <!-- a region number (used internally for matching) -->
  distance='1.234'
        <!-- distance in km to the parent named -->
  approxdistance='0'
        <!-- as above, rounded to 1, 5 or 10km depending on distance. 
             0 means < 1km -->
  direction='188'
        <!-- the direction, in degrees anticlockwise from East, of the target _to_ named
             so 188 is nearly west. However, you'll probably want to express it in terms
             of direction from named to the target, so you'll probably want to reverse it -->
  zoom='16'
        <!-- a suggested zoom level (for top-level nameds) when constructing a map url -->
  >
    <description>
      Street Newmarket Road less than 1km from middle of suburb Barnwell 
      in Cambridge, Cambridgeshire, England, UK (which is about 3km from 
      city Cambridge in Cambridgeshire, England, UK) found
      <!-- text describing the match in context, only in the outermost named items -->
    </description>
    <place>
      <!-- the target place, if any. Note that this will have its own context of 
           nearestplaces as well, which are often different. Also place may differ 
           from result to result as named items are found in different places with 
           the same name -->
      <named ...>...</named>
    </place>
    <nearestplaces>
      <!-- the nearest town and/or city to the parent. Note that the
           distance (and approxdistance) attributes in the subordinate nameds
           say how far it is back to the parent. Note that these places may
           be nearer to the target name than the target place -->
      <named ...>...</named>
      <named ...>...</named>
    </nearestplaces>
</named>

and <distance> is (if requested using a colon in the search) the distance between two of the top-level nameds in the xml, identified by their ids:

<distance
  fromtype='node'
    <!-- the type of the item at one end of the great circle: you need to know 
         this because ids are only unique within type -->
  from='123456'
    <!-- its id -->
  totype='node'
  to='345678'
    <!-- the id of the named item at the other end -->
  >
    227.3
    <!-- the great circle distance in km between from and to -->
</distance>    
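
The direction attribute of named, described above as degrees anticlockwise from East, usually needs converting before it can be shown to a user. This sketch converts it to a conventional compass bearing (clockwise from North), reverses it, and names the nearest of eight compass points. The function names are illustrative.

```python
POINTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def to_compass_bearing(direction):
    """Degrees anticlockwise from East -> degrees clockwise from North."""
    return (90 - direction) % 360


def reverse(direction):
    """The same line viewed from the other end."""
    return (direction + 180) % 360


def compass_point(bearing):
    """Nearest of the eight principal compass points."""
    return POINTS[int((bearing + 22.5) // 45) % 8]


direction = 188  # the value from the example above
print(to_compass_bearing(direction), compass_point(to_compass_bearing(direction)))
print(compass_point(to_compass_bearing(reverse(direction))))
```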

Wiki template

For linking to the resulting maps from within this wiki there is now a Template:NameFinder.


What about the source code?

This is now checked in to svn at http://svn.openstreetmap.org/sites/ in the namefinder directory.

This is the currently running version with incremental updates (after June 7, 2008, and including performance changes from September 1, 2008)

It is written in PHP 5 and uses MySQL 5.