Andhra Pradesh/Notes/Arjunaraoc/Improving geodata accuracy on OSM and Wikidata


Author:arjunaraoc

Last major content update: 2024-04-10

Wikidata location (left) on OSM map and OSM location map (right) for Ananthavaram, based on the Wikidata attribute of the OSM node, as of 8 April 2024. (The error was fixed subsequently.)

Geodata of places is available from multiple crowdsourced platforms such as Wikipedia, Wikidata and OSM. Using Wikidata identifiers on OSM helps in identifying inaccuracies and fixing them. This post is a how-to guide.

Andhra Pradesh places data - Background info

Ananthavaram search results - Wikidata
Anantavaram (similar to Ananthavaram) search in Nominatim tool of OSM

Places can be categorized as urban or rural. Rural places consist of revenue villages and the hamlets affiliated to some of them. The 2011 census documents list about 15K revenue villages for Andhra Pradesh. Hamlets number approximately twice that, taking the total place count to more than 40K. Hamlets are more common in hilly areas.

Wikipedians started creating articles for places around 2007. While the articles on towns and cities had substantial content, village articles were very brief, limited mostly to the location and the administrative hierarchy of the village, such as mandal and district. Article generation continued for several years until notability review enforcement was launched. Soon after Wikidata was launched in 2012, Wikidata items were created based on the infobox data of the village articles. Further information was added from census data. During 2019, Telugu Wikipedians took up a project to create articles for all revenue villages, with content based on census data. Several hamlet articles were added to English Wikipedia and Telugu Wikipedia as well.

All revenue villages are identified by their census 2011 location code, which is recorded as a property on Wikidata. During 2018, one OSM editor uploaded data about places from the Bhuvan portal to OSM, bringing the number of places to 30K. During 2019, mandals and their administrative headquarters (670) were added to OSM along with their Wikidata items. During the same period, about 800 places on OSM were populated with Wikidata values manually. Because the data on all these platforms is crowdsourced, and because several places share similar names, sometimes within a district and more often across districts, there are several mismatches between the location shown by Wikidata and the actual location in OSM. On 4 April 2022, Andhra Pradesh was reorganised, with 13 districts becoming 26 districts. While all Wikidata items and Telugu Wikipedia articles were updated systematically to reflect this, English Wikipedia articles may still have errors.

As of April 2024, there were 33939 places in OSM, 1739 of them with a Wikidata attribute. There were 16058 revenue villages in Wikidata, 1594 of them with English Wikipedia articles, amounting to 9.93% of all revenue villages. Only 2-3 people actively contribute across Wikipedia, Wikidata, and OSM, so it is not feasible to run projects with fixed timelines to improve coverage and accuracy. At the same time, working efficiently with Wikidata and OSM requires esoteric scripting skills and complex software tools.

While commercial map providers offer map data in local languages, the local language labels are usually transliterated from English, resulting in errors. The OSM, Wikidata and Wikipedia platforms provide a way to improve local language maps by leveraging Wikidata values on OSM, through semi-automated updates of Telugu names. This work grew from such a need. The idea is to document as much and as clearly as possible, so that even users with limited programming skills and limited exposure to web-based OSM tools, but with an interest in improving the maps, can do a lot of work. The initial scope is to work on the revenue villages among the 1739 places with Wikidata on OSM.

Analysis of places data in AP

Scripts

Data

as on 2024-04-23

Wikidata | District name | osm_places | osm_places_wd | wd_places_RV | enwiki_places_RV | enwiki% of wd
Q110714850 | Alluri Sitharama Raju | 3785 | 2351 | 2956 | 24 | 0.88%
Q110714857 | Anakapalli | 1176 | 60 | 651 | 38 | 5.84%
Q15212 | Anantapur | 1143 | 39 | 486 | 33 | 6.79%
Q110714854 | Annamayya | 3095 | 38 | 443 | 36 | 8.13%
Q110876712 | Bapatla | 735 | 117 | 268 | 75 | 27.99%
Q15213 | Chittoor | 2558 | 41 | 779 | 49 | 6.29%
Q110714859 | Dr. B. R. Ambedkar Konaseema | 666 | 108 | 303 | 104 | 34.32%
Q15338 | East Godavari | 521 | 57 | 259 | 53 | 20.46%
Q110714851 | Eluru | 1568 | 137 | 647 | 149 | 23.03%
Q15341 | Guntur | 371 | 114 | 192 | 71 | 36.98%
Q110714860 | Kakinada | 575 | 64 | 385 | 71 | 18.44%
Q15382 | Krishna | 998 | 117 | 455 | 118 | 25.93%
Q15381 | Kurnool | 715 | 35 | 432 | 26 | 6.02%
Q110714861 | Nandyal | 717 | 46 | 441 | 36 | 8.16%
Q110876763 | NTR | 505 | 81 | 292 | 86 | 29.45%
Q110714862 | Palnadu | 723 | 118 | 350 | 87 | 24.86%
Q110714856 | Parvathipuram Manyam | 1196 | 147 | 902 | 36 | 3.99%
Q15390 | Prakasam | 1632 | 76 | 784 | 79 | 10.08%
Q15383 | Sri Potti Sriramulu Nellore | 1708 | 76 | 637 | 62 | 9.73%
Q110714863 | Sri Sathya Sai | 1899 | 38 | 445 | 28 | 6.29%
Q15395 | Srikakulam | 2376 | 112 | 1237 | 65 | 5.25%
Q110714853 | Tirupati | 2096 | 61 | 994 | 58 | 5.84%
Q15394 | Visakhapatnam | 407 | 119 | 87 | 8 | 9.20%
Q15392 | Vizianagaram | 1239 | 590 | 914 | 89 | 9.74%
Q15404 | West Godavari | 691 | 70 | 272 | 73 | 26.84%
Q15342 | YSR | 1436 | 56 | 666 | 40 | 6.01%
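The last column appears to be enwiki_places_RV divided by wd_places_RV, expressed as a percentage. A minimal sketch of that arithmetic, using the Guntur and Bapatla rows and the statewide totals quoted earlier (the function name is hypothetical, not part of any script from this page):

```python
def coverage_percent(enwiki_rv, wd_rv):
    """Share of Wikidata revenue villages that have an English Wikipedia article."""
    return round(100 * enwiki_rv / wd_rv, 2)

print(coverage_percent(71, 192))      # Guntur row -> 36.98
print(coverage_percent(75, 268))      # Bapatla row -> 27.99
print(coverage_percent(1594, 16058))  # statewide totals -> 9.93
```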

Identifying errors in Geodata

Simple Visual identification of potential errors

Error in Wikidata location, as the place name is not seen on the background OSM map, before the error is fixed
Wikipedia page with corrected Wikidata location shown on OSM map for Gurajala

The geodata is presented on the Wikidata page and on the corresponding English Wikipedia article page, using OSM as the background map. If the marker is not near the place name shown on the OSM map, there is a possibility of an error. Even if the name appears on the OSM background map, selecting different zoom levels allows checking whether the place is in the correct location.

Query based on the distance between Wikidata and OSM locations

Table view of mismatch between Wikidata and OSM locations, before error fixing
Map view of mismatch between Wikidata and OSM locations before error fixing

A combined Wikidata and OSM query is useful for identifying potential errors. A sample query that compares village location data in a district of Andhra Pradesh and displays the top 10 places by distance between the Wikidata and OSM locations is available at https://w.wiki/9hYy. The query provides a map view of the places. A table view shows the Wikidata link, place name, Wikidata location, OSM URL for the place, OSM location, and the distance between the Wikidata and OSM locations in kilometres. Places are sorted by decreasing distance. Villages are usually small, with an area of less than about 1 sq km, and are separated from the nearest village by at least a kilometre. So all places with an error of more than 2 kilometres are suspects. The table view is useful for examining the data.
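The distance screen can also be reproduced offline once the query results are exported. The following is a minimal sketch, assuming each row carries a name plus (lat, long) pairs for Wikidata and OSM; the haversine formula and the function names are assumptions for illustration and are not necessarily what the linked query uses. The coordinates are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; accurate enough for a 2 km screen

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def suspects(rows, threshold_km=2.0):
    """Return (name, distance_km) pairs above the threshold, largest distance first."""
    flagged = []
    for name, wd, osm in rows:
        d = haversine_km(wd[0], wd[1], osm[0], osm[1])
        if d > threshold_km:
            flagged.append((name, round(d, 2)))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical rows: (name, Wikidata (lat, long), OSM (lat, long))
rows = [
    ("Place A", (16.30, 80.45), (16.30, 80.45)),  # identical locations
    ("Place B", (16.30, 80.45), (16.40, 80.45)),  # 0.1 degree of latitude apart
]
print(suspects(rows))
```

Only Place B is flagged here, since 0.1 degree of latitude is roughly 11 km.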

Wikidata location and OSM location maps for Ananthavaram side by side as of 8 April 2024. (The error was fixed subsequently)
Wikidata showing the mandal property of Ananthavaram, it is also shown in description at the top
OSM map for Ananthavaram at lower zoom to identify administration hierarchy info, by moving the mouse pointer to a nearby location and clicking the locate menu on the right

To fix these errors, open the OSM map for the village and the Wikimedia map from the Wikidata coordinates property. On the OSM map, find the mandal and district by selecting Query features and clicking a location close to the original location. Compare these with the values listed on the Wikidata page. If the administrative hierarchy matches but the two locations are more than 2 km apart, one of the locations is in error.

Query points overlaid on boundary of district for errors

Places outside Guntur district borders resulting from mismatch between Wikidata and OSM highlighted in JOSM

Let's consider an example to understand the need for this. Jonnalagadda is part of Palnadu district, which was newly created from the erstwhile Guntur district. There were at least two places by that name in Guntur district. The Wikidata code added in OSM, matched only on the name within the district, turned out to be attached to the wrong place. The error distance between the Wikidata and OSM locations cannot uncover such an error. So all the Wikidata and OSM locations need to be overlaid on the boundary of the district to find such errors. For Guntur district, I uncovered three such errors, two in a nearby district and one in a faraway district.

Though the Wikidata map view displays the point results from Wikidata and OSM, it cannot add the boundary of the district. So the output of the Wikidata query is transformed into GeoJSON using the OpenRefine templating export.[1] (See the appendix for the template and customise the query as needed.) That file and the boundary of the district are overlaid in JOSM to identify the errors.
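The overlay in JOSM is a visual check. The same test can be sketched in code as a point-in-polygon check. This is an illustration under simplifying assumptions: the district boundary is taken as a single closed ring of (long, lat) pairs (real boundaries may be multipolygons with holes), and the function names and sample data are hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the closed ring of (lon, lat) pairs."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def outside_district(feature_collection, boundary_ring):
    """Names of GeoJSON point features that fall outside the boundary ring."""
    names = []
    for feature in feature_collection["features"]:
        lon, lat = feature["geometry"]["coordinates"]
        if not point_in_ring(lon, lat, boundary_ring):
            names.append(feature["properties"]["name"])
    return names

# Hypothetical square "district" and two made-up places
ring = [(80.0, 16.0), (81.0, 16.0), (81.0, 17.0), (80.0, 17.0), (80.0, 16.0)]
places = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [80.5, 16.5]},
     "properties": {"name": "Inside place"}},
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [82.0, 16.5]},
     "properties": {"name": "Outside place"}},
]}
print(outside_district(places, ring))
```

Places reported by such a check are candidates for review only; the boundary and the place data still need to be inspected on the map.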

Fixing errors

Locating Anantavaram on Bharatmaps state GIS portal for Andhra Pradesh
Spelunker search for Ananthavaram
Spelunker when the correct match for Ananthavaram is selected

To fix the errors in geodata, we need a way to gather the place data from Wikidata and OSM and present the distance between the data points. For villages, if the distance is more than two kilometres, there is potential for error. The data in OSM and Wikidata needs to be verified against primary, open and freely licensed sources such as Bharatmaps (the AP portal from Bharatmaps, which is used by government departments and includes several POI layers) and Who's on First (https://spelunker.whosonfirst.org), which was updated in 2023 and draws on various free information sources from the Government of India.

Use the StateGIS portal of Bharatmaps (the AP portal from Bharatmaps) to find the location of the village by searching in the Geocode locator menu. As soon as you start typing the name, potential matches are shown in the drop-down with details of the administrative hierarchy, such as mandal and district. Select the best match. Then use the Measure tool and select the node corresponding to the place to get its location in the measurement window. The location is shown in long, lat order, which needs to be converted to lat, long order when entering it in the Wikidata property.
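The order swap is easy to get wrong when done by hand. A minimal sketch of a helper for it follows; it assumes the value is copied as comma-separated text such as "80.45, 16.30", which is an assumption about the portal's output, and the function name is hypothetical.

```python
def to_lat_long(long_lat_text):
    """Convert 'long, lat' text to a (lat, long) tuple of floats."""
    parts = [p.strip() for p in long_lat_text.split(",")]
    if len(parts) != 2:
        raise ValueError("expected two comma-separated numbers")
    lon, lat = float(parts[0]), float(parts[1])
    # Basic range check to catch values that are not valid coordinates
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    return lat, lon

print(to_lat_long("80.45, 16.30"))  # made-up coordinates
```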

Fixing errors in mismatch between Wikidata and OSM

When using the error distance, fix the locations in Wikidata and OSM as required, if the error exceeds 2 km. Some examples are provided in the following sections.

Wikidata location is incorrect

Update Wikidata coordinates for Ananthavaram

The Wikidata location is far from the actual place on OSM. Updating the Wikidata location alone is sufficient to fix the issue. Use the StateGIS portal of Bharatmaps to find and update the location, as it is the easiest option. Use the Who's on First spelunker tool to get a unique id for the place and add it as an additional identifier on Wikidata. The images show how the Wikidata item for a place called Ananthavaram was fixed.

OSM location is incorrect

OSM update by removing Wikidata for a place

The OSM location is far from the actual place as shown by Wikidata. Locate the village in the likely place through the Bharatmaps state portal, then create or update the OSM place with the Wikidata value.

Wikidata, OSM location at wrong place outside district as correct OSM location does not have Wikidata

Modification in JOSM for Mamillapalle

In this case, Wikidata and OSM locations are at wrong place outside the district, as the correct OSM place does not have Wikidata attribute. Update with the proper Wikidata value for both places.

Wikidata and OSM incorrect, No such place with Wikidata within district

This represents a case where the Wikidata location and the corresponding OSM location are pointing to a place outside the district. As discussed in the previous section, this error can be identified only by overlaying the district boundary with the points from Wikidata+OSM query.

The Wikidata value of the corresponding point in OSM needs update, along with creation of new place inside district.

Changeset visualisation in Achavi tool, nodes affected
Changeset visualisation in Achavi tool, node tags modified
Changeset visualisation in Achavi tool, node created

Lessons learnt from a trial on Guntur district

Corrections data

Wikidata corrections for fixing mismatches between Wikidata and OSM in Guntur district
OSM corrections for fixing mismatches between Wikidata and OSM in Guntur district
Mismatch between Wikidata and OSM locations for Guntur district after fixes
  • Number of places for the district in OSM: About 80
  • Wikidata location property corrections: About 12 inside district, About 6 outside district
  • OSM corrections: About 10
  • Effort spent: About 16 person hours (includes effort spent towards learning different tools for the first time such as spelunker, Bharatmaps, Openrefine export)

Lessons Learnt

  1. Clean up Wikidata as much as possible before starting work on fixing mismatches between Wikidata and OSM
    1. If more than one value is present, reduce it to one as far as possible (coordinate location, instance of, located in the administrative territorial entity)
    2. While deleting additional values, retain those whose sources are from GNS and delete those imported from Wikipedia.
    3. If the same value is present more than once, delete the one without a source
  2. Fix villages falling outside the selected district first, as they may not be detected properly based on distance measure
  3. Take up fixing the mismatches based on distance measure
  4. The Bharatmaps state GIS portal is a good source for fixes, as it has information from several government departments. POIs such as banks and post offices are available, which helps confirm the right match for the place. Purple triangles represent information from habitations, while green circles or stars represent information from the census. The purple triangle is to be preferred for location information. In OSM, look for any nearby POIs with the same village name and delete them, as they are nearby habitations and part of the same village.
    1. While copying location from Bharatmaps state GIS portal to Wikidata, change the order to lat, long
  5. The Spelunker tool of Who's on First is also useful, though it is not user friendly, as it does not have a geocoder.
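The Wikidata clean-up rules in lesson 1 can be sketched as a small planning function. This is only one possible reading of those rules, not a tool used in the trial: claims are modelled as hypothetical dictionaries with a value and a source label ("GNS", "Wikipedia" or None), and the output is a list for manual review rather than an automated edit.

```python
def plan_cleanup(claims):
    """Split the claims of one property into (keep, delete) following the lessons above."""
    delete = []
    by_value = {}
    for claim in claims:
        by_value.setdefault(claim["value"], []).append(claim)

    survivors = []
    for group in by_value.values():
        sourced = [c for c in group if c["source"]]
        if len(group) > 1 and sourced:
            # Same value present more than once: drop the copies without a source
            survivors.extend(sourced)
            delete.extend(c for c in group if not c["source"])
        else:
            survivors.extend(group)

    if len(survivors) > 1:
        # Several values remain: drop Wikipedia imports, keep GNS and other sources
        not_wikipedia = [c for c in survivors if c["source"] != "Wikipedia"]
        if not_wikipedia:
            delete.extend(c for c in survivors if c["source"] == "Wikipedia")
            survivors = not_wikipedia
    return survivors, delete

# Made-up coordinate claims for one item
claims = [
    {"value": (16.30, 80.45), "source": "GNS"},
    {"value": (16.30, 80.45), "source": None},
    {"value": (16.90, 80.10), "source": "Wikipedia"},
]
keep, delete = plan_cleanup(claims)
print(len(keep), len(delete))
```

If more than one value still remains after these rules, the final choice needs manual verification against a source such as Bharatmaps.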

References

Bibliography

See Also

Appendix

OpenRefine GeoJSON template for use with a Wikidata query (revenue village locations in an admin area as per Wikidata), used to generate GeoJSON through the OpenRefine templating export. You can edit the query for the desired admin area and run it.

prefix:

{"features": [

row template:

{
    "geometry": {
        "coordinates": [
            {{cells["long"].value}},
            {{cells["lat"].value}}
        ],
        "type": "Point"
    },
    "id": {{jsonize(cells["item"].value)}},
    "properties": {
        "name": {{jsonize(cells["itemLabel"].value)}},
        "wikidata": {{jsonize(cells["item"].value)}},
        "osmid": {{jsonize(cells["osmid"].value)}}
    },
    "type": "Feature"
}

row separator:

,

suffix:


], "type": "FeatureCollection"}
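For readers who prefer a script to OpenRefine, the same output can be sketched in Python. This is an illustrative equivalent, assuming the query results are available as rows with the column names the template uses (item, itemLabel, long, lat, osmid); the function name and the sample row are hypothetical.

```python
import json

def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection with the same fields as the template above."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            "id": row["item"],
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered longitude first, then latitude
                "coordinates": [float(row["long"]), float(row["lat"])],
            },
            "properties": {
                "name": row["itemLabel"],
                "wikidata": row["item"],
                "osmid": row["osmid"],
            },
        })
    return {"type": "FeatureCollection", "features": features}

# Made-up row; the identifiers are placeholders, not real items
rows = [{"item": "Q0000", "itemLabel": "Example village",
         "long": "80.45", "lat": "16.30", "osmid": "node/0"}]
print(json.dumps(rows_to_geojson(rows), indent=2))
```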