User:Zhijie Shen

Hi everyone,

welcome to my OpenStreetMap Wiki page! I am a Ph.D. candidate in the School of Computing, National University of Singapore, currently working on a geo-referenced video search project. For this project, I leverage OpenStreetMap, which gives me more freedom to access and process geo-referenced data.

Retrieving Wikipedia Entries Automatically

Now I want to retrieve Wikipedia entries from OpenStreetMap to enrich the data source for my research project. The Wikipedia links are stored as wikipedia tags. However, I found that there are actually few such tags in the OpenStreetMap dataset, even though corresponding Wikipedia entries exist for OpenStreetMap to link to. I guess that one reason for the lack of wikipedia tags is the considerable amount of manual effort involved. For my own project, I wrote a Wikipedia entry crawler that automatically retrieves the URL of the Wikipedia entry corresponding to a node or way in OpenStreetMap. I am eager to share it here and hope it can be helpful. The single Java class file can be downloaded here: [source code].
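
For context, such a link looks roughly as follows in OSM XML: a wikipedia tag in the language:title format next to the name tag of a node (the id and coordinates below are made up for illustration):

 <node id="123456789" lat="1.2868" lon="103.8545">
   <tag k="name" v="Merlion"/>
   <tag k="wikipedia" v="en:Merlion"/>
 </node>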

The crawler implements the Sink interface of Osmosis, leveraging its OSM XML parsing functionality. It extracts the name of an entity (e.g., a node or way) from the name tag (entities without a name are therefore omitted), uses it as the query parameter to search for candidate Wikipedia entries by calling the Wikipedia API [1], and then judges which entry among the returned results is the true one for the corresponding entity. To do this, the crawler checks the string similarity between the entity name and the Wikipedia entry title using the Levenshtein distance algorithm [2]. Moreover, since many of the Wikipedia entries that entities may link to carry geo-coordinates [3], the crawler also takes advantage of this knowledge to select the true entry: it uses the Wikipedia API again to retrieve the entry content, extracts the geo-coordinates if they exist, and computes the distance between them and the coordinates of the entity. These two metrics are then combined into a score for each candidate entry, and the crawler chooses the first entry whose score is above a pre-defined threshold (assuming that the search functionality of the Wikipedia API ranks the returned results appropriately). During the crawling, the OSM XML file is parsed twice: first to retrieve candidate entries for each named entity, and second to record the coordinates of the entities to be checked, which is especially necessary for ways, since a way references only node IDs and its coordinates cannot be resolved in the first pass.
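
To illustrate the Osmosis side, below is a minimal sketch of a Sink implementation that collects named nodes and ways during one parsing pass. The class name and method bodies are illustrative rather than the actual project code, and the exact set of Sink methods differs between Osmosis versions (older releases use release() instead of close()):

 import java.util.Map;
 
 import org.openstreetmap.osmosis.core.container.v0_6.EntityContainer;
 import org.openstreetmap.osmosis.core.domain.v0_6.Entity;
 import org.openstreetmap.osmosis.core.domain.v0_6.EntityType;
 import org.openstreetmap.osmosis.core.domain.v0_6.Tag;
 import org.openstreetmap.osmosis.core.task.v0_6.Sink;
 
 // Collects the (type, id, name) of every named node and way in the file.
 public class NamedEntityCollector implements Sink {
 
     @Override
     public void initialize(Map<String, Object> metaData) {
         // no setup needed for this sketch
     }
 
     @Override
     public void process(EntityContainer container) {
         Entity entity = container.getEntity();
         // only nodes and ways are matched against Wikipedia
         if (entity.getType() != EntityType.Node
                 && entity.getType() != EntityType.Way) {
             return;
         }
         for (Tag tag : entity.getTags()) {
             if ("name".equals(tag.getKey())) {
                 // queue the entity for the Wikipedia search step
                 System.out.println(entity.getType() + "\t"
                         + entity.getId() + "\t" + tag.getValue());
                 return;
             }
         }
         // entities without a name tag are omitted
     }
 
     @Override
     public void complete() {
         // flush any buffered output here
     }
 
     @Override
     public void close() {
     }
 }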
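
The candidate search can be issued through the standard MediaWiki search API (action=query&list=search). The following sketch fetches the raw JSON response with plain Java; the User-Agent string is a made-up placeholder, and the real crawler may call a different endpoint or use a JSON library to parse the result:

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.URL;
 import java.net.URLConnection;
 import java.net.URLEncoder;
 
 public class WikipediaSearch {
 
     // Returns the raw JSON of a full-text search for the given entity name.
     static String searchCandidates(String entityName) throws Exception {
         String url = "https://en.wikipedia.org/w/api.php"
                 + "?action=query&list=search&format=json&srlimit=10&srsearch="
                 + URLEncoder.encode(entityName, "UTF-8");
         URLConnection conn = new URL(url).openConnection();
         // the Wikimedia API asks clients to identify themselves
         conn.setRequestProperty("User-Agent", "osm-wikipedia-crawler-sketch");
         StringBuilder json = new StringBuilder();
         try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
             for (String line; (line = in.readLine()) != null;) {
                 json.append(line);
             }
         }
         return json.toString(); // the "search" array holds candidate titles
     }
 }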
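
Finally, the two matching signals and the combined score might look as follows. The Levenshtein routine is the standard dynamic-programming algorithm; the [0, 1] normalization of the similarity, the haversine distance, and the equal 0.5/0.5 weighting are illustrative choices of mine, not necessarily what the crawler actually uses:

 public class EntryScorer {
 
     // Levenshtein distance via the standard dynamic-programming recurrence.
     static int levenshtein(String a, String b) {
         int[][] d = new int[a.length() + 1][b.length() + 1];
         for (int i = 0; i <= a.length(); i++) d[i][0] = i;
         for (int j = 0; j <= b.length(); j++) d[0][j] = j;
         for (int i = 1; i <= a.length(); i++) {
             for (int j = 1; j <= b.length(); j++) {
                 int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                 d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                         d[i - 1][j - 1] + cost);
             }
         }
         return d[a.length()][b.length()];
     }
 
     // Name similarity normalized to [0, 1]: 1 means identical strings.
     static double nameSimilarity(String entityName, String entryTitle) {
         int maxLen = Math.max(entityName.length(), entryTitle.length());
         if (maxLen == 0) return 1.0;
         return 1.0 - (double) levenshtein(entityName.toLowerCase(),
                 entryTitle.toLowerCase()) / maxLen;
     }
 
     // Great-circle distance in kilometres (haversine formula).
     static double distanceKm(double lat1, double lon1,
                              double lat2, double lon2) {
         double dLat = Math.toRadians(lat2 - lat1);
         double dLon = Math.toRadians(lon2 - lon1);
         double h = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
         return 2 * 6371.0 * Math.asin(Math.sqrt(h));
     }
 
     // Combined score in [0, 1]; the equal weighting is an illustrative choice.
     static double score(double similarity, double distanceKm) {
         double proximity = 1.0 / (1.0 + distanceKm); // 1 when co-located
         return 0.5 * similarity + 0.5 * proximity;
     }
 }

With these pieces in place, the candidates can be scanned in the order ranked by the API, taking the first one whose score exceeds the threshold.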

Note that the source code is part of my research project, so it uses some classes defined in other parts of the project. In addition, the crawler is just a quick solution for my research project, so more work is required to make it comprehensive and robust. Furthermore, I am not an IR expert, and I believe there must be more accurate methods to select the true Wikipedia entry. By introducing it, I just hope to inspire some ideas on collaboration between OpenStreetMap and Wikipedia.