User:I guess not/OpenCellID

From OpenStreetMap Wiki
Jump to navigation Jump to search


Introduction

OpenCellIDs is an OpenSource project complementary to OpenStreetMap with the aim of creating a world wide collection of GSM cell tower locations. It enables location services in situations where GPS is unavailable. Please see check the official OpenCellIds web site for more details.

Measurements can by collected by a number of different mobile apps, including Keypad Mapper 3 (or short: KPM3). Berlin based company ENAiKOON provides the collected data free of charge to the community either as (aggregated) collection of base cell tower locations or as (non-aggregated) raw measurement file including actual GPS measurements by anonymous users.

On this page we import raw measurements data into PostGIS database and try to run different (rather basic) data analysis steps on the data. This is by no means an official documentation. It is rather intended as a possible starting point if you're interested in experimenting with the GPS measurement data. Note that GSM cell coverage it outside of the scope of this page. Please feel free to correct any bugs/omissions you find.

One aspect we will cover in more detail is an estimation of the actual KPM3 coverage in today's OSM house numbers. We assume that every house number should have a nearby OpenCellID GPS measurement in order to qualify. Also measurements should be taken by pedestrians rather than cars. Besides the OSM object's last changed timestamp should not be earlier than OpenCellID measurements.

Background information

This section provides some background information about the data to be imported.

File structure

The official measurement file structure documentation can be found here. However the actual file contains a number of additional (still unknown) fields or some fields might be in a different sequence. Please enhance this section, if you have more details about the file format.

Additional Information on the format can be found at downloads.opencellid.org/ [1]

Example file:

558130865;-1;48.222237;15.351143;232;1;24;1935487;0;12;0;149;gpsSuite;2013-01-01 00:00:16;2013-01-01 00:00:28
558130867;1165655209;51.057615;7.062815;262;3;6756;53500;0;0;0;0;kcs_unit;2013-01-01 00:00:17;2013-01-01 00:00:43
558130869;1165655211;51.744363;13.951198;262;2;592;28570;0;0;0;0;kcs_unit;2013-01-01 00:00:30;2013-01-01 00:00:43
558130871;1165655213;51.471757;7.006777;262;7;31101;45008;0;0;0;0;kcs_unit;2013-01-01 00:00:17;2013-01-01 00:00:48
558130873;1165655215;48.696094;11.461272;262;3;32439;57689;2;0;0;0;cell_id_pos_table;2013-01-01 00:00:38;2013-01-01 00:00:48
558130875;1165655217;51.076585;13.301915;262;2;356;10982;0;0;0;0;kcs_unit;2013-01-01 00:00:37;2013-01-01 00:00:48
558130877;1165655219;48.314758;10.057298;262;3;3719;46479;11;0;0;0;kcs_unit;2013-01-01 00:00:23;2013-01-01 00:00:53
558130879;-1;27.493640;-109.957645;334;20;21053;134547157;0;17;0;106;gpsSuite;2012-12-31 23:58:53;2013-01-01 00:00:53
558130881;-1;27.493629;-109.957843;334;20;21053;134591542;0;18;0;106;gpsSuite;2012-12-31 23:59:01;2013-01-01 00:00:53


Field Description (based on official documentation where available)
id this is NOT (!) the id_of_the_cell. Instead, this is the id (the sequence number) in measurement table.
unknown1
latitude GPS latitude of the measurement
longitude GPS longitude of the measurement
mcc mobile country code where the cell is located this measurement belongs to (details see http://en.wikipedia.org/wiki/Mobile_country_code)
mnc mobile network code of the cell this measurement belongs to (details see http://en.wikipedia.org/wiki/Mobile_network_code)
lac local area code of the cell this measurement belongs to (details see http://en.wikipedia.org/wiki/Location_area#Location_area)
cell_id this is the offical cellid of the cell tower this measurement belongs to
unknown2
unknown3
unknown4
unknown5
client See table below
created timestamp when the measurement was recorded in the mobile device (UTC)
sent UTC time of inserting the information into database.


Client field contains a number of undocumented strings. networkSurvey seems to refer to the lat/lon information transmitted by cell towers.

Client field Description Approx. number of rows in raw measurements file
cellByLcdDataInSrv unknown 373 Mio.
gpsSuite Assumed to be actual GPS measurements (confirmed) 25 Mio.
kcs_unit unknown 95 Mio.
networkSurveyInSrv unknown 33 Mio.
networkSurveySrv unknown 25 Mio.
previous_position unknown 66k
cell_id_pos_table unknown 5.5 Mio
openbmap.org 19.1 Mio

Data preparation

Importing OpenCellID measurements into Postgres DB

It is assumed that you already installed a Postgres 9.1 database with PostGIS 1.5, created a database called gis enabled postgis + hstore extensions. If you're unfamiliar with these steps, you should take a look at the Switch2OSM instructions to set up a tile server on Ubuntu.

Download OpenCellID measurements

OpenCellID measurement file is available for download on http://www.opencellids.org/en/download/. As the file is very large (~7GB) a download manager might come in handy.

wget -c http://opencellid.enaikoon.de:8080/gpsSuiteCellIDServer/exportFiles/measurements.tar.gz

Creating database table for import

Start the Postgres command line tool...

psql -d gis

and create a new table called opencellids to hold all raw measurement data.

 CREATE TABLE public.opencellids
 (
   id bigint, 
   unknown1 bigint, 
   latitude double precision, 
   longitude double precision, 
   mcc integer,
   mnc integer,
   lac integer,
   cell_id integer,
   unknown2 integer,
   unknown3 integer,
   unknown4 integer,
   unknown5 integer,
   client varchar(30),
   created timestamp,
   sent timestamp,
   primary key(id)
 ) 
 WITH (
   OIDS = FALSE
 );

Extracting all measurements from year 2013

Measurements are restricted on purpose to the year 2013 as KPM3 was only introduced early 2013. Earlier measurements are out of scope for this analysis (although you easily include them by removing/adjusting the grep statement).

tar -xz --to-stdout --file measurements.tar.gz | grep ";2013-" | gzip -c > m2013.gz

Importing compressed CSV data

Import all GPS measurements into our Postgres table.

Note: As outlined in the table above we assume "gpsSuite" to match GPS measurements by users.

gunzip -c m2013.gz | grep "gpsSuite" | psql -d gis -c "COPY opencellids from stdin DELIMITER ';' CSV;"

Removing Duplicate Entries

The measurements file contains about 21 mio. actual GPS measurements in total (client gpsSuite, no date restriction applied). For some unknown reason some raw measurements appear up to 8429 times in the database with all fields identical except for measurement id and the sent timestamp. This could be a result of some client malfunction, sending the same raw measurements over and over to the OpenCellIDs server.

  latitude  |  longitude  | mcc | mnc |    lac     |  cell_id   |       created       | cnt  
------------+-------------+-----+-----+------------+------------+---------------------+------
  52.713632 |    8.172709 | 262 |   2 |        434 |      56091 | 2013-11-15 17:13:01 | 8429
  53.085208 |    9.020729 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:31:56 |  625
  53.084388 |    9.020796 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:25:38 |  625
  53.084609 |    9.020524 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:24:27 |  625
  53.084521 |     9.02086 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:25:28 |  625
  53.084609 |    9.020524 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:51:02 |  625
  53.084395 |    9.021246 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:54:01 |  625
  53.084636 |    9.020573 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:22:53 |  625
  53.084371 |    9.020952 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:04:08 |  625
  53.084615 |    9.020386 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:44:13 |  625
  53.084372 |    9.020801 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:25:06 |  625
  53.084673 |    9.019985 | 262 |   2 |       1434 |   39595416 | 2013-09-11 11:10:52 |  625
  53.084375 |    9.022042 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:53:32 |  625
  53.084644 |      9.0208 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:52:37 |  625
   53.08443 |    9.021024 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:32:30 |  625
  53.084775 |     9.02111 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:31:23 |  625
  53.084473 |    9.020687 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:58:17 |  625
  53.084703 |    9.021011 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:42:31 |  625
  53.084378 |    9.021256 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:59:55 |  625
   53.08472 |    9.020839 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:19:28 |  625
  53.084381 |    9.020891 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:15:57 |  625
  53.084778 |    9.020832 | 262 |   2 |       1434 |   39555249 | 2013-09-11 10:32:19 |  625
  53.084382 |    9.020877 | 262 |   2 |       1434 |   39595416 | 2013-09-11 11:05:27 |  625
  53.084603 |    9.020644 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:30:05 |  625
  53.084401 |    9.020789 | 262 |   2 |       1434 |   39555249 | 2013-09-11 11:22:46 |  625
[...]

Detail for first entry:

    id     | unknown1 | latitude  | longitude | mcc | mnc | lac | cell_id | unknown2 | unknown3 | unknown4 | unknown5 |  client  |       created       |        sent         
-----------+----------+-----------+-----------+-----+-----+-----+---------+----------+----------+----------+----------+----------+---------------------+---------------------
 786047623 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:26:39
 786047709 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:26:40
 786050755 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:26:59
 786052479 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:27:25
 786102945 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:39:00
 786131161 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 20:45:17
 786605045 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-18 21:31:34
 787145427 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 08:38:12
 787145623 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 08:38:30
 787149667 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 08:39:21
 787224531 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 09:33:38
 787257421 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 09:54:23
 787257427 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 09:54:23
 787257431 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 09:54:23
 787257439 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 09:54:23
 787712075 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 15:30:12
 787729289 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-19 15:40:10
 788440347 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 08:24:59
 788440487 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 08:25:31
 788441541 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 08:28:06
 789175201 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 15:39:22
 789176221 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 15:40:16
 789176391 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 15:40:52
 789176605 |       -1 | 52.713632 |  8.172709 | 262 |   2 | 434 |   56091 |       11 |       27 |       23 |      178 | gpsSuite | 2013-11-15 17:13:01 | 2013-11-20 15:41:28

...(8000+ identical entries for same location / creation timestamp omitted)...

Initial starting point for analysis:

\pset format wrapped
select latitude, longitude, mcc, mnc,lac,cell_id, created, array_agg(sent), count(*) as cnt 
       from opencellids group by latitude, longitude, mcc, mnc,lac,cell_id,created
       having count(*) > 1
       order by cnt desc;

As these redundant measurements don't add any value to our analysis, we will remove these duplicate entries. This way about 7 mio. out of 21 mio. entries can be deleted.

DELETE FROM opencellids o1 USING opencellids o2
       WHERE o1.latitude  = o2.latitude
         AND o1.longitude = o2.longitude
         AND o1.mcc       = o2.mcc
         AND o1.mnc       = o2.mnc
         AND o1.lac       = o2.lac
         AND o1.cell_id   = o2.cell_id
         AND o1.created   = o2.created
         AND o1.id        < o2.id;

Adding geometry column

The additional geometry column in needed to store GPS measurements and create a GIST index on the data (also needed for performance reasons).

select AddGeometryColumn('public','opencellids','geom', 900913, 'POINT', 2);

Calculating Point values for geometry column

update opencellids set geom = st_transform(st_setsrid(st_makepoint(longitude, latitude),4326),900913);

Create a DB index on geometry column

 alter table opencellids alter column geom set not null;
 create index opencellids_gix ON opencellids USING GIST(geom);

Vacuum

This step may be optional. YMMV.

 vacuum analyse opencellids;

Importing existing housenumbers

To match OpenCellId measurements with OSM addr:housenumber, a Geofabrik Germany PBF extract is downloaded, converted to o5m, filtered and imported into Postgres using osm2pgsql.

 wget http://download.geofabrik.de/openstreetmap/europe/germany.osm.pbf
 osmconvert germany.osm.pbf --out-o5m > germany.o5m
 osmfilter germany.o5m --keep="addr:housenumber= " --out-osm | bzip2 > germany_housenumbers.osm.bz2
 osm2pgsql -c --slim -x --unlogged -I  -S house.style --hstore -d gis germany_housenumbers.osm.bz2

Note: house.style is just a stripped down version of the default.style, where most (unimportant) tags have been commented out.

Data Analysis

QGIS could be used to further analyze the data. In the following screenshot a sample KPM3 mapping session is depicted. Coloring schema used is based on creation timestamp's calendar week (red=later in 2013). You can clearly identify a mapper walking along the streets.

Sample Keypad Mapper 3 mapping session in Germany.

In a more sophisticated analysis we could estimate housenumber mapping coverage by KPM3. Our assumption here is that housenumbers created in OSM were mapped after the corresponding OpenCellID measurement occurred. Also all housenumbers should be well within a 30-50m distance to actual OpenCellId measurements in our database table.

Identifying Hotspots

When working with OpenCellID measurements you will notice that in some areas, a huge number of measurements were collected over a course of up to several months. The following screenshot highlights a location where 100.000 (!) measurements were taken.

Roughly 100.000 measurements collected over a period of several months

We can find out more about similar locations using the following SQL statement:

select st_astext(st_transform(geo,4326)), cnt from 
   ( select st_snaptogrid(geom,500) as geo, count(*) cnt 
     from opencellids 
     where geom && st_transform(ST_MakeEnvelope(4, 46, 16, 56, 4326),900913) 
     group by geo 
     order by cnt desc
     limit 50) q;

Sample output (data redacted for privacy reasons):

                st_astext                 |  cnt   
------------------------------------------+--------
 POINT(13.xxx 52.xxx)                     | 108623
 POINT(13.xxx 48.xxx)                     |  40009
 POINT(13.xxx 52.xxx)                     |  33599
 POINT(9.xxx 49.xxx)                      |  29651
[...]

Preparing the top 50 location with the most measurements based on a 500m grid. This table will be later used to exclude some OpenCellID measurement from more expensive calculations, such as 'number of nearby probes in a given timeframe'.

create table hotspots as 
  select st_snaptogrid(geom,500) as geo, count(*) cnt 
  from opencellids 
  where geom && st_transform(ST_MakeEnvelope(4, 46, 16, 56, 4326),900913) 
  group by geo 
  order by cnt desc
  limit 50;

Calculating number of nearby measurements in a given timeframe

First we add another column nearcnt to store the number of GPS measurements in a 2 minutes timeframe within 100m distance. The purpose of the number is to distinguish cars and pedestrians collecting cell tower information. The interval was chosen to approximate typical walking speed.

alter table opencellids add column nearcnt int;

As calculating these values is very expensive, we restrict the analysis to a bbox which fully covers Germany. But before starting this step we mark all OpenCellID measurements in our hotspot table with a very large nearcnt value. This way we can exclude these points in the next analysis step as they seriously impact the overall processing time.

update opencellids o1 
    set nearcnt = 1000 
  where exists(
        select * from hotspots 
        where geo = st_snaptogrid(o1.geom,500))
   and  geom && st_transform(ST_MakeEnvelope(4, 46, 16, 56, 4326),900913);

Also the workload is split up in several hundred chunks to be processed in parallel using GNU parallel. Note that we exclude all measurements where we already determined a nearcnt value. This script can be restarted in case things go seriously wrong. If a chunk was already processed in a previous run (=a nearcnt value <> 0 exists), it will simply be skipped. Parameter offset controls the size of each tile, inner defines an inner margin in a tile (avoids skipping of tiles because of rouding issues on tile borders), overlap defines a small overlap to neighbour tiles (to make sure no values are ignored on tile borders). In addition nearcnt may be null, if the column was just added by an alter table statement.

export offset=0.5
export overlap=0.05
export inner=0.1
export LC_ALL=C
doit() {
  echo "Processing $1 $2"
  export stat=`psql -d gis -A -t -c "select exists(select 1 from opencellids where nearcnt is not null and nearcnt > 0 and nearcnt < 1000 and geom && st_transform(ST_MakeEnvelope($1 + $inner, $2 + $inner, $1 + $offset - $inner, $2 + $offset - $inner, 4326),900913) limit 1);"`
  if [[ $stat == 't' ]] ; then
      echo "Already processed $1 $2"
      exit
  fi
  psql -d gis -c "update opencellids o1 
    set nearcnt = (select count(*) nearcnt 
                   from opencellids o2 
                   where st_dwithin(o1.geom, o2.geom,100) 
                   and o1.id <> o2.id
                   and o2.created >= o1.created - INTERVAL '1 min'
                   and o2.created <= o1.created + INTERVAL '1 min') 
    where geom && st_transform(ST_MakeEnvelope($1 - $overlap, $2 - $overlap, 
                    $1 + $offset + $overlap, $2 + $offset + $overlap, 4326),900913)
      and (nearcnt = 0 or nearcnt is null);"
}
export -f doit
parallel --gnu doit ::: $(seq 4 $offset 16) ::: $(seq 46 $offset 56)

After this process step all opencellids in our bbox should have been processed, i.e. no nearcnt value with the default value null is left behind.

select count(*) from opencellids 
     where nearcnt is null 
       and geom && st_transform(ST_MakeEnvelope(4,46,16,56,4326),900913);

As different parallel processing areas are slightly overlapping and could be updating the same table rows, deadlocks cannot be excluded completely. Postgres will detect this situation automatically and aborts the current transaction. If you receive a non-zero value for the nearcnt query, simply re-run the shell script until the row count drops down to zero.

Calculating number of OpenCellIDs in close proximity to housenumbers

In the last step we look for OpenCellID measurements in up to 50m distance to addr:housenumber polygons + points. GPS probes might origin from cars as well as pedestrian mappers. As the raw data doesn't include any kind of identifier we use a heuristic based on nearcnt introduced before. This way we can count the number of GPS measurements in a 2 minute time interval / 100m distance. As KPM3 frequently takes GPS probes (at least every 10s), this turns out to be a reasonable (albeit not perfect) filter for pedestrian mappers.

 alter table planet_osm_polygon add column cnt int;
 update planet_osm_polygon p set cnt = (select count(*) cnt from opencellids o where st_dwithin(p.way, o.geom, 50) and nearcnt > 3);
 alter table planet_osm_point add column cnt int;
 update planet_osm_point p set cnt = (select count(*) cnt from opencellids o where st_dwithin(p.way, o.geom, 50) and nearcnt > 3);


Same screenshot as above with addr:housenumber polygons.


As discussed before we introduce another level of validation and only accept a mapped housenumber, if the last changed timestamp in the OSM data is later than the surrounding OpenCellIDs measurements. On our sample screenshot a number of houses are now colored in white, as the last change occured in year 2011, while the OpenCellId measurements are from July 2013. Clearly, KPM3 cannot be the original source of those housenumbers.

UPDATE planet_osm_polygon p SET cnt = (SELECT COUNT(*) cnt FROM opencellids o 
       WHERE st_dwithin(p.way, o.geom, 50) AND nearcnt > 3 
       AND created < (to_timestamp(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'));

UPDATE planet_osm_point p SET cnt = (SELECT COUNT(*) cnt FROM opencellids o 
       WHERE st_dwithin(p.way, o.geom, 50) AND nearcnt > 3 
       AND created < (to_timestamp(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'));


Timestamps are considered for polygons


Refined calculation: Check existence of relevant OpenCellIDs in close proximity to housenumbers

After publishing the first version of this analysis some mappers remarked that they were listed in the Top 25 list, although they never used KPM3. After looking into the data a large gap of several months was identified between the OpenCellID measurement and the added housenumber on the OSM object. A more reasonable upper limit between collecting house numbers and adding them to OSM might be 2 weeks - in some cases OSM was updated a couple of hours later already. House numbers added after this cut-off time are likely not related to OpenCellID measurements.

Unfortunately the Germany Extract only includes the latest version of an OSM object. Up-to-date OSM Full History extracts don't seem to be available at the moment. Effectively we can only apply the 14 day restriction to v1 OSM objects. For later version we don't have the actual date when the house number was added in the first place. Consequently we cannot exclude false positives for OSM v2 and later versions by this kind of time restriction.

Also note that column cnt was changed from type int to boolean, i.e. you need to replace cnt > 0 by cnt is true for Top25 table + History chart.

alter table planet_osm_point add column cnt boolean;
alter table planet_osm_polygon add column cnt boolean;

UPDATE planet_osm_polygon p SET cnt = 
    (SELECT exists(
             select 1 FROM opencellids o 
                  WHERE st_dwithin(p.way, o.geom, 50) 
                    AND o.nearcnt > 3 
                    AND o.created < TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'
                    AND TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') - o.created < 
                        (CASE WHEN (p.tags->'osm_version' = '1') THEN INTERVAL '14 DAYS'
                             ELSE INTERVAL '1 YEAR'
                        END)
     ));

UPDATE planet_osm_point p SET cnt = 
    (SELECT exists(
             select 1 FROM opencellids o 
                  WHERE st_dwithin(p.way, o.geom, 50) 
                    AND o.nearcnt > 3 
                    AND o.created < TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'
                    AND TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') - o.created < 
                        CASE WHEN p.tags->'osm_version'= '1' THEN INTERVAL '14 DAYS'
                             ELSE INTERVAL '1 YEAR'
                        END
    ));


Missing History information revisited

Due to the lack of an up-to-date Full History Dump for Germany we resort to downloading the respective information from the Main API. Note that this procedure is not at all recommended for larger amounts of requests as it may impact other mappers. In any case please refer to the API documentation.

In a previous analysis step we've already cut down the number of objects from 3 mio. down to 50.000. First we extract the OSM object ids for those nodes and ways where history information should be retrieved for. This step assumes that column cnt was already populated in the previous processing step.

copy (select osm_id from planet_osm_polygon where cnt is true) to '/tmp/ids.csv' csv;
copy (select osm_id from planet_osm_point where cnt is true) to '/tmp/ids2.csv' csv;

In two different directories we mass download the full history details. Again, please read the API documentation first.

cat /tmp/ids.csv | parallel --gnu -j 1 "wget http://api.openstreetmap.org/api/0.6/way/{}/history -O {}.hist"
cat /tmp/ids2.csv | parallel --gnu -j 1 "wget http://api.openstreetmap.org/api/0.6/node/{}/history -O {}.hist"

A small XSLT mapping extracts version details about first addr:housenumber appearance, such as OSM Object id, timestamp, version number, user and user id. It doesn't check if the house number matches the latest version. We will use this information to get a more precise picture when a house number was originally added to OSM.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:output method="text" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/osm/*[tag[@k = 'addr:housenumber']][1]"> 
<xsl:value-of select="@id"/>,<xsl:value-of select="@timestamp"/>,<xsl:value-of select="@version"/>,<xsl:value-of select="@user"/>,<xsl:value-of select="@uid"/>
<xsl:text>
</xsl:text> 
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

In each directory we mass process history files. To collect the results of this script, you could use script or something similar.

ls -1 | xargs xsltproc  ../conv.xslt

To include the history information in our analysis, we import the XSLT processing result as CSV file into two Postgres tables.

create table polygon_version(osm_id bigint,ts timestamp,osm_version integer,osm_user text,osm_uid bigint);
create table point_version(osm_id bigint,ts timestamp,osm_version integer,osm_user text,osm_uid bigint);

copy polygon_version from 'res.hist' delimiter ',' CSV;
copy point_version from 'respoint.hist' delimiter ',' CSV;

Now, based on the previous SQL statements we incorporate full history details if available (same 14 day cut off time):

UPDATE planet_osm_polygon p SET cnt =
    (SELECT EXISTS(
             SELECT 1 FROM opencellids o
                  LEFT OUTER JOIN polygon_version pv
                  ON p.osm_id = pv.osm_id
                  WHERE st_dwithin(p.way, o.geom, 50)
                    AND o.nearcnt > 3
                    AND o.created < TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'
                    AND TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') - o.created <
                        (CASE WHEN (p.tags->'osm_version' = '1') THEN INTERVAL '14 DAYS'
                             ELSE INTERVAL '1 YEAR'
                        END)
                    AND (CASE WHEN pv.ts IS NOT NULL THEN pv.ts ELSE o.created END) - o.created < INTERVAL '14 DAYS'
     ));

UPDATE planet_osm_point p SET cnt =
    (SELECT EXISTS(
             SELECT 1 FROM opencellids o
                  LEFT OUTER JOIN point_version pv
                  ON p.osm_id = pv.osm_id
                  WHERE st_dwithin(p.way, o.geom, 50)
                    AND o.nearcnt > 3
                    AND o.created < TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') + INTERVAL '2 HOURS'
                    AND TO_TIMESTAMP(p.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ') - o.created <
                        (CASE WHEN p.tags->'osm_version'= '1' THEN INTERVAL '14 DAYS'
                             ELSE INTERVAL '1 YEAR'
                        END)
                    AND (CASE WHEN pv.ts IS NOT NULL THEN pv.ts ELSE o.created END) - o.created < INTERVAL '14 DAYS'
    ));

Sample screenshots

Depending on the area some mappers might have tried Keypad Mapper 3 at home but never actually used it for mapping house numbers (why?). The first screenshot gives such an example, where most housenumbers don't have any nearby OpenCellId measurements. On second screenshot we can see an area with lots of GPS traces, a potential mapping area. Looking at the timestamps, we find out that most buildings were already tagged a few years ago. The last screenshot shows a rather widespread use of KPM3 in Southern parts of Munich where almost whole suburbs were mapped using this tool. We didn't evaluate ourselves if this was a result of a mapping party or one dedicated house mapper. However, most numbers were created by a single account in 2012 and 2013.


Estimated Daily Active users and contributed housenumbers per day

The following chart looks at all addr:housenumber objects previously identified as likely mapped by a Keypad Mapper 3 user. Throughout the year 2013 a constant increase in daily active contributing KPM3 mappers is visible (daily values in blue, 14 day moving average values are shown in black). The actual number of relevant OSM addr:housenumber objects last changed on a given day is shown in orange as 14 day moving average value. Interesting to note is that this number peaks at about 400 house numbers per day regardless of the steady increase of active users since May 2013. Daily peak value so far in 2013: 1500 added house numbers on May 12.


.


Daily active users and last changed housenumbers per day



Based on the following SQL statement:

SELECT DAY, COUNT(DISTINCT UID), COUNT(*)
       FROM (
     SELECT tags->'osm_user' AS u,
            tags->'osm_uid' AS UID,
            DATE(TO_TIMESTAMP(pl.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ')) AS DAY
     FROM planet_osm_polygon pl
     WHERE cnt > 0 
  UNION ALL
     SELECT tags->'osm_user' AS u,
            tags->'osm_uid' AS UID,
            DATE(TO_TIMESTAMP(pnt.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ')) AS DAY
     FROM planet_osm_point pnt
     WHERE cnt > 0) AS S
GROUP BY DAY ORDER BY DAY;

Estimated Top 25 Keypad Mapper 3 Users in Germany

For our Germany extract (approximate bbox: 4, 46, 16, 56) we could also give a very rough estimate on top KPM3 users. We count addr:housenumber nodes and polygons previously identified as KPM3 candidates. Nevertheless these numbers are really impressive for a single mapper contribution!



OSM User ID OSM User Name(s) KPM3 House Numbers % of total KPM3 House Numbers Comment
194746 {Hurlie,xpla} 5893   13,930
51978 {Parkscheinautomat} 2640    6,240
78613 {heilbron} 2514    5,943
39457 {Markus59} 2356    5,569
288045 {ThG_} 1951    4,612
458641 {uvi} 1425    3,368
156896 {lebsanbe} 960    2,269
715371 {cracklinrain} 767    1,813
385428 {cyankahly} 688    1,626
81983 {Ogmios} 616    1,456
888598 {ulfewa} 575    1,359
579793 {Achterin} 563    1,331
71231 {wolkacer} 550    1,300
550300 {Klumbumbus} 500    1,182
421055 {Adam_Ndula} 471    1,113
8703 {rolandg} 434    1,026 No KPM3 user
290680 {wheelmap_visitor} 416    0,983 Edits for wheelmap, updating wheelmap=* tags only. No KPM3 user
77109 {eknus} 399    0,943
62338 {Weltenbummler} 388    0,917
15035 {grungelborz} 369    0,872
289483 {"Uwe Eggert"} 359    0,849
134846 {Vanrook} 353    0,834
89073 {eimer42} 348    0,823
173844 {Antikalk} 342    0,808
85475 {miche101} 337    0,797


Based on the following SQL statement:

select uid, us, cnt, to_char(100 * cnt / sum(cnt) over (),'990D999') as perc 
   from
   (select uid, array_agg(distinct u) us, count(*) cnt,
       count(distinct day) as days from (
     select tags->'osm_user' as u,
            tags->'osm_uid' as uid,
            date(to_timestamp(pl.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ')) as day
     from planet_osm_polygon pl
     where cnt > 0 
  union all
     select tags->'osm_user' as u,
            tags->'osm_uid' as uid,
            date(to_timestamp(pnt.tags->'osm_timestamp','YYYY MM DD HH24 MI SS TZ')) as day
     from planet_osm_point pnt
     where cnt > 0) as q
  group by uid order by cnt desc) 
as r limit 25;

FAQ

  • Is this an official usage statistic by ENAiKOON?
No, this page is not in any way related to the makers of Keypad-Mapper 3, it is a private fun project only. ENAiKOON provides global download statistics at this time (see talk page) but no usage statistics. In a way this wiki page tries to close that gap.
  • The number of house numbers appears to be too low, hdyc reports much higher numbers.
To qualify as KPM3 housenumber some nearby OpenCellId measurements need to be available. If you added some housenumbers without walking around with KPM3, chances are that these housenumbers are not counted in this statistic. Also (similar to hdyc) only the last changed user is taken into account. If you added a housenumber and some other mapper added some 3D tags, this housenumber won't show up in your total figure.
  • I never used KPM3. Why does my username appear in the Top25 KPM3 mapper list?
For OSM v2 and later objects, it is currently not possible to find out when the housenumber was originally added to OSM. Assuming someone walked around in your area with KPM3 and you added some house numbers months later to an OSM v2+ object, the OpenCellID measurements will be linked to your user name. Due to the lack of an OSM full history file, this kind of false positives can only be handled for OSM v1 objects - in this case, there's a cut off time of 14days between OpenCellID measurement and OSM edit.
Real life example: way 117086617 was last edited by rolandg on Sep 29 (version 4), the house number was added by ysae on July 30 (v3). Nearby OpenCellId measurements indicate that someone used KPM3 on the evening of July 29. Using a full history file, this way would have been credited to ysae instead of rolandg.
Update Oct 25: Now that the chart/top25 list also takes the full history information into account, the number of false positives should be smaller. However, if you happen to have added a new house number in an area where someone collected some OpenCellID probes in the last 2 weeks before your edit, you might still be considered as a KPM3 housenumber mapper.

Summary

That's the end of our little quick and dirty OpenCellID experiment. We tried to match actual addr:housenumber objects to OpenCellId measurements and filter out measurements by cars, which are rather unlikely to contribute to existing housenumbers (creating false positives) and also consider timestamps, etc.

Have fun and happy housenumber mapping!