User talk:Yurik


SPARQL question

I see that you created SPARQL examples. I have two questions:

Is it feasible to generate a list of Wikidata entries that

  • have interesting (long or featured) articles on the Polish or English Wikipedia
  • have coordinates or are otherwise described as mappable objects
  • are without a wikipedia/wikidata tag in OSM
  • are located in Poland or around some specific location

?

To avoid an XY problem: I want to add wikidata entries, but I am bored by processing a bunch of substubs without any interesting content at https://osm.wikidata.link/. Also, trying to match major articles may help to detect missing OSM data.

Is it possible to get query results from http://88.99.164.208 using an API? https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_Help/Result_Views mentions only manual download.

Mateusz Konieczny (talk) 08:32, 9 September 2017 (UTC)

Hi Mateusz, yes to all of the above :) First, you may want to take a look at the main WDQS (click "Examples" there) - it has a lot of Wikidata-only examples and info, plus some help links. You may use the API directly (look in the browser debugger at the request it sends, and do the same) - it's a simple GET request. Or you can use my python code with the http://88.99.164.208/bigdata/sparql endpoint. I will post the queries here in a bit. --Yurik (talk) 01:06, 10 September 2017 (UTC)
P.S. Mateusz, I wrote a query per the above and added it to the examples. --Yurik (talk) 02:44, 10 September 2017 (UTC)
P.P.S. Wikidata does not store article length, but it has page view counts and the number of different wiki languages that have an article on the topic. Both are very good indicators of which objects should be shown first.
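A rough sketch of what such a combined query might look like (not necessarily the one added to the examples page; it uses the sitelink count mentioned above as a proxy for "interesting", country = Poland (wdt:P17 wd:Q36) for the location requirement, and the osmt:wikidata pattern used further down this page to skip items already linked from OSM - the exact predicates available on this endpoint may differ):
SELECT ?item ?itemLabel ?sitelinks WHERE {
   ?item wdt:P17 wd:Q36 ;                 # country = Poland
         wdt:P625 ?coord ;                # has coordinates, so presumably mappable
         wikibase:sitelinks ?sitelinks .  # number of wiki languages with an article
   FILTER( ?sitelinks >= 10 )             # rough proxy for "interesting" articles
   # no OSM object should already point to this item
   FILTER NOT EXISTS { ?osm osmt:wikidata ?item . }
   SERVICE wikibase:label { bd:serviceParam wikibase:language "pl,en". }
}
ORDER BY DESC(?sitelinks)
LIMIT 100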
Is it possible to exclude events? Note that I am not interested in limiting results to subclasses of Q618123 (that should be easier, but many entries in Wikidata lack "instance of"). I tried the following in the Wikidata Query Service:
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P625 ?location.
  MINUS { ?location wdt:P31/wdt:P279* wd:Q1190554. }  # excludes events
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10

but it times out for some reason. I looked at the examples, but I failed to find one that excludes items that are subclasses of something, and once I tried to adapt others I ended up with the query above, which for some reason has performance problems. What is a CPU-safe way of excluding events, without excluding objects missing "instance of"? I prefer false positives; I have no problem with adding a missing "instance of".

Modifying the query at SPARQL examples (tinyurl y9ma6e3f or tinyurl 993o5sy - no direct URL, Wikidata for some reason uses a public link shortener typically used to hide spam) resulted in a 504 Gateway Time-out.

Mateusz Konieczny (talk) 06:08, 13 September 2017 (UTC)

You are using the wrong subject - it should be ?item (or, in my example, ?wd): FILTER NOT EXISTS { ?wd wdt:P31/wdt:P279* wd:Q1190554 . } But yes, it does take too long. It might work faster if you replace the circle service with the coordinate filter, like I did here (last commented portion). --Yurik (talk) 06:45, 13 September 2017 (UTC)
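For reference, the query above with just that subject fix applied would read as follows (untested sketch; as noted, it may still run into the timeout without a geographic restriction):
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P625 ?location.
  FILTER NOT EXISTS { ?item wdt:P31/wdt:P279* wd:Q1190554. }  # excludes events
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10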
Thanks! I tried bbox, but it is failing even in the simplest case:
SELECT ?osmId ?wdLabel WHERE {
   ?osmId osmm:loc ?loc .
   BIND( geof:longitude(?loc) as ?longitude )
   BIND( geof:latitude(?loc) as ?latitude )
   FILTER( ?longitude > 19 && ?longitude < 20 && ?latitude > 50 && ?latitude < 51)
} LIMIT 10

Is it a hardware limitation or is something wrong with my query? And thanks for the featured articles query - I already used it to add some links and read interesting Wikipedia articles. Mateusz Konieczny (talk) 08:30, 13 September 2017 (UTC)

I think it's failing because there are so many points, and that query requires a sequential scan through the whole DB. That's why there are the box and circle geo services developed by Wikidata. The sequential filtering works well once the results are already small enough; the service, on the other hand, seems to use geo-indexing, but it runs first, before the filtering. Need to look at it more. --Yurik (talk) 10:41, 13 September 2017 (UTC)
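A sketch of the box-service approach for the bbox test above, assuming this endpoint supports the same wikibase:box service as WDQS and that osmm:loc values are covered by its geo index (untested):
SELECT ?osmId ?loc WHERE {
   SERVICE wikibase:box {
     ?osmId osmm:loc ?loc .
     # south-west and north-east corners of the same bbox as in the failing query
     bd:serviceParam wikibase:cornerSouthWest "Point(19 50)"^^geo:wktLiteral .
     bd:serviceParam wikibase:cornerNorthEast "Point(20 51)"^^geo:wktLiteral .
   }
} LIMIT 10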

SPARQL question II

Sorry for bothering you, but I have another query that fails for an unknown reason:

I wanted to find, on Wikidata, human settlements in Poland with a TERYT code, excluding those already linked from OSM, to check whether a Wikidata import using TERYT ids for matching may be useful:

SELECT ?item ?itemLabel
 WHERE
 {
   ?item wdt:P31 wd:Q486972.
   FILTER EXISTS {
     ?item wdt:P4046 ?teryt
   }
   # There must not be an OSM object with this wikidata id
   FILTER NOT EXISTS { ?osm1 osmt:wikidata ?wd . }
 
   # There must not be an OSM object with this wikipedia link
   FILTER NOT EXISTS { ?osm2 osmt:wikipedia ?sitelink . }
   
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
 } LIMIT 10

The query returns 0 elements, which is surprising given that there are elements that should match.

For example https://www.openstreetmap.org/node/1589969137#map=19/53.97112/14.54503 and https://osm.wikidata.link/Q673875 (I checked the lack of matches with http://overpass-turbo.eu/s/rGW).

I used the following query, adapted from one of the examples, to check that both the OSM node representing Grodno and its Wikidata entry are in the database:

SELECT ?marketLoc ?marketName (?amenity as ?layer) ?osmid WHERE {
   VALUES ?place { "hamlet" }
   ?osmid osmt:place ?place ;
         osmt:name ?marketName ;
         osmm:loc ?marketLoc .
   # Get the location of Grodno from Wikidata
   wd:Q673875 wdt:P625 ?myLoc .
   # Calculate the distance,
   # and filter to just those within 5km
   BIND(geof:distance(?myLoc, ?marketLoc) as ?dist)
   FILTER(?dist < 5)
 }

So it seems that there is a bug in my TERYT query. Is it something obvious? Mateusz Konieczny (talk) 16:35, 13 September 2017 (UTC)

Mateusz, the first query is a bit wrong - you used different variable names (?item and ?wd) instead of the same one, so the filter is not connected to the rest of the query and matches any OSM object with a wikidata tag, which is why you get 0 results. Also, you don't need FILTER EXISTS - you can simply list both statements. The sitelink filter can be dropped for the same reason: ?sitelink is not connected to anything else. And lastly, you don't want just "instance of a human settlement", you want "instance of a human settlement or anything that is a sub-sub-sub... class of a human settlement".
SELECT ?item ?teryt ?itemLabel WHERE {
   ?item wdt:P31/wdt:P279* wd:Q486972 .
   ?item wdt:P4046 ?teryt .
   FILTER NOT EXISTS { ?osm1 osmt:wikidata ?item . }
 
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} LIMIT 10
--Yurik (talk) 22:00, 13 September 2017 (UTC)

Wikidata fixing - my noSPARQL tool

It seems that you are interested in the topic, and it may give ideas for more quality checks and data import possibilities: http://www.openstreetmap.org/user/Mateusz%20Konieczny/diary/42385 https://github.com/matkoniecz/OSM-wikipedia-tag-validator. I created it before I was aware of SPARQL; the main benefit is that it allows a thorough listing of issues in a given location without running multiple SPARQL queries. The main problem is that it does not use a proper database import (it downloads individual Wikidata entries), so it is not feasible to run worldwide reports. Mateusz Konieczny (talk) 08:03, 2 October 2017 (UTC)

Hi @Mateusz Konieczny, yes, it seems we do have some functional overlap there :) Please take a look at Wikipedia Link Improvement Project - I'm putting together all sorts of issues that have been discovered so far. I think your github README should point just to that page, not my old OSM_wiki tag problems or quality control queries. The SPARQL_examples page is mostly used from inside the service itself - it shows up as the "examples" dialog there (try it, it's cool :)). Also, there is a big (and somewhat... heated) discussion on the @talk mailing list that you might be interested in.
Lastly, let's see if we can create SPARQL queries for all of your validations - allowing people to query directly and see up-to-date info is fairly important. BTW, if you know Ruby, it would be awesome to help improve MapRoulette a little bit, so that we can upload some of these challenges there. Currently MapRoulette is missing one important feature - the ability to store OSM ids. If it supported that, we could upload objects and allow users to link to those objects.

P.S. I know it sucks to duplicate efforts - let's coordinate better :) --Yurik (talk) 18:23, 2 October 2017 (UTC)

  • Wikipedia Link Improvement Project - thanks! (I had already found it; in fact, that is why I am writing this).
  • "should point just to that page, not my old OSM_wiki tag problems or quality control queries" - fixed!
  • "Lastly, let's see if we can create SPARQL queries for all of your validations" - that would be a good idea. Mateusz Konieczny (talk) 05:08, 3 October 2017 (UTC)
@Mateusz Konieczny, LOL: "rely on service that may be hard to replicate once it stops working" - that's a very strange argument. Every service is only good while it works! :) On the other hand, anyone can set up a clone of that service: https://github.com/nyurik/osm2rdf --Yurik (talk) 05:16, 3 October 2017 (UTC)
"that's a very strange argument. Every service is only good while it works!" - yes, but I think you will admit that this service is probably less stable than wikidata API. Thanks for the link! I will try setting this up. Mateusz Konieczny (talk) 05:28, 3 October 2017 (UTC)
On the topic of mailing lists - I looked at it and I would really advise stopping the push for worldwide mechanical edits. It is obvious that people are really unhappy with that idea, and it has the potential to end not merely with no edits done and a massive amount of discussion, but also with a backlash ("let's delete all wikidata", "completely ban bots"). Maybe try discussing these ideas with a local community? Mateusz Konieczny (talk) 05:29, 3 October 2017 (UTC)
@Mateusz Konieczny, I agree - at this point I am simply trying to educate people about what it is and what benefits it provides. It seems like the loudest are the ones who also have the least understanding of it. Funny how all the more advanced data consumers have already switched to it - Mapbox, OpenMapTiles, etc. It has a huge benefit, but because it looks like a "number", and not all tools support it yet, people are afraid of "some number of the devil" being added. I thought that my initial request, followed by a discussion, followed by four days of quiet time, could be considered settled. But apparently some people decided to jump on it after the discussion. Oh well. As for your github, please change the wording about "only works while it works" - the same can be said about OSM itself :) --Yurik (talk) 05:42, 3 October 2017 (UTC)
I removed the tautology from the github readme. "people are afraid of" - I think that most people are afraid of ending up bot-happy like Wikidata or the Cebuano Wikipedia, which is understandable given the different style/targets/situation. Mateusz Konieczny (talk) 06:28, 3 October 2017 (UTC)
@Mateusz Konieczny, it's always important to go for the "golden middle". English Wikipedia has used countless bots, and has become the most successful project. OSM is extremely anti-bot. ceb-wiki is a joke. I think it's a mistake to go to either extreme. --Yurik (talk) 06:39, 3 October 2017 (UTC)
I'm sure pushing wikidata in such a large-scale effort will change the overall attitude towards bots, or more generally speaking what is considered to be automated edits. NOT. People are per se not anti-wikidata (even though you keep reiterating the same argument over and over). Maybe you should use a more piecemeal approach so people have a chance to gain more trust and confidence in more automation. This, however, might end up in a way where a more local, mapper-community-driven approach would be favored, and automation is still considered to be fundamentally flawed (e.g. because the data you use for reasoning is already crappy). I really wonder why you take the burden of adding all this wikidata on your own. It's really the local communities who should take ownership, and you can support them via your toolset (like you already do). There were several bots in the past that really messed up data, such as xybot. I believe people have good reason to be very sceptical, given how easily you can screw up data. Leveraging wikidata for multilanguage labels like Mapbox does looks like the first step; I'm sure there are plenty of other use cases ahead :) (see https://blog.mapbox.com/support-for-arabic-and-portuguese-in-mapbox-streets-5a9690dabff4) Mmd (talk) 06:53, 3 October 2017 (UTC)