Sophox

From OpenStreetMap Wiki
(Redirected from User:Yurik/rdf)
Jump to navigation Jump to search
Sophox
Sophox logo.svg
Author: Sophox/sophox/graphs/contributors
Website: sophox.org
Source code: Sophox/sophox

A collection of services exposing OSM data, metadata, and other microservices

Short video on getting started

Sophox is a collection of data services to work with OSM data, OSM tag metadata (from this wiki), and use external (federated) data sources like Wikidata. The service uses a well known SQL-like language called SPARQL. The service can also be used for direct OSM editing, either in a manual "search & replace", or as a challenge similar to MapRoulette. All Wikidata SPARQL documentation is at Wikidata Query Help. Similarly to Overpass-Turbo, this service can be used directly from JOSM (enable Wikipedia plugin).

This short getting-started video (thumbnailed on the right) should give a quick intro. You can also use the "examples" button at https://sophox.org/ to try the service using these example queries.

This service is made available thanks to a generous hardware donation by Elastic (of Elasticsearch fame).

Important Changes

Sophox no longer contains a copy of the Wikidata data due to space and performance limitations. Use federated query to call out into Wikidata's sparql endpoint.

Current Challenges

See Wikipedia Improvement Tasks

Hello (OSM SPARQL) World

Here is a simple example of using SPARQL, the Wikibase RDF format and OSM RDF format. This query lists a few OSM place objects, and can be executed by clicking "run it" underneath it. (Compare it with a similar Overpass Query.)

There are many more query examples on how to cross-reference OSM data with Wikidata, use Wikipedia page popularity, perform quality control queries, and other.

# A sample query to list all OSM objects with a place tag.
# The * will show all variables that are used in the WHERE clause
select * where {
  # The first expression matches the three columns of the Triple Store with:
  #   "find any subject that has an OSM 'place' tag, and any value of that tag"
  #
  # Question mark represents a variable. ?osm becomes a URI for an OSM object, e.g.
  #    <https://www.openstreetmap.org/node/2681940767>  (Everest)
  # The same URI may also be written with a prefix - osmnode:2681940767
  # All OSM tags use osmt: prefix. Using osmt:place constant only
  # matches OSM object with a place tag. The ?place variable will
  # become the value of the place tag. You can use a constant instead of a variable,
  # e.g. substitute ?place with a constant string 'city'.

  ?osm osmt:place ?place.

  # osmm: prefix means "meta values", e.g. OSM object type, location (loc), version, ...
  # Uncomment to limit to a specific type of an OSM object ('n', 'r', 'w').
  #   ?osm osmm:type 'r'.

  # osmm:loc is a centroid point for most OSM objects. If included, the results
  # can be shown on a map by switching to the map view (right under the query)
  #   ?osm osmm:loc ?coordinates.

  # The FILTER expressions allow complex filtering beyond a simple "must match".
  #   filter(?place = 'town' || regex(?place, '[a-z][0-9]', 'i'))

  # This will limit results to places which do not have a `name:en` tag.
  #   filter not exists { ?osm osmt:name:en ?nameen. }

  # For OSM objects with the wikidata tag, find the English name of that object in Wikidata
  # if it exists.
     ?osm osmt:wikidata ?wd.
  # With that get from the Wikidata SPARQL endpoint service what type of the object it
  # is (P31 is the property instance of in Wikidata), and the label for that type,
  # also if exists in English. For labels and descriptions, you may instead use a
  # service that picks first available language based on a list.
  #   service <https://query.wikidata.org/sparql> {
  #     ?wd wdt:P31 ?type.
  #     optional { ?wd rdfs:label ?label. filter(lang(?label) = "en") }
  #     optional { ?type rdfs:label ?typelabel. filter(lang(?typelabel) = "en") }
  #  }
}
# Similar to SQL, queries can use aggregations, subqueries, and limit the results
limit 10

Run it (edit query)

Using from JOSM

This service can be used directly from JOSM editor.

  • Install Wikipedia plugin and enable expert mode (in the view menu)
  • Use File / Download data / Download from Sophox API tab.

JOSM will download all OSM object IDs it finds in the query result. All other values will be ignored, and should not be requested to reduce server load. There are several magic keywords to help with the query:

  • {{boxParams}}: use it with wikibase:box service. Your map selection will be converted to two lines for cornerWest and cornerEast parameters.
  • {{center}}: the center of the map selection will be inserted as a point constant.

Examples:

information sign

These queries only run from JOSM because they use {{*}} parameters.
Running a query with a box service on large areas is often slower than getting all objects worldwide with a specific condition

OSM objects are not humans, but humans could be their subject, so subject:wikidata should be used. This query finds objects tagged with human in wikidata tag within the selected area. This query is very slow if used on a large area.

# Find objects tagged as humans in a selected area (very slow on large areas)
SELECT ?osmid WHERE {
  SERVICE wikibase:box {
    ?osmid osmm:loc ?location .
    {{boxParams}}
  }
  ?osmid osmt:wikidata ?wd .
  ?wd wdt:P31/wdt:P279* wd:Q5 .
}

This query use s a simple calculation and filter to find objects that form a ring around the selection's center point.

# Find educational places at least 2, and at most 3 km from the center of the selection
SELECT ?osmid WHERE {
  VALUES ?amenity { "kindergarten" "school" "university" "college" }
  ?osmid osmt:amenity ?amenity ;
         osmm:loc ?loc .
  BIND(geof:distance({{center}}, ?loc) as ?distance)
  FILTER(?distance > 2 && ?distance < 3)
}

Same as above, but using "wikibase:around" service.

# Find educational places at least 2, and at most 3 km from the center of the selection
SELECT ?osmid WHERE {
  VALUES ?amenity { "kindergarten" "school" "university" "college" }
  ?osmid osmt:amenity ?amenity .
  SERVICE wikibase:around { 
      ?osmid osmm:loc ?loc . 
      bd:serviceParam wikibase:center {{center}} . 
      bd:serviceParam wikibase:radius "3" . 
      bd:serviceParam wikibase:distance ?distance .
  } 
  FILTER(?distance > 2)
}

How OSM data is stored

All data is stored in a Triplestore as subject predicate object. statements. For example, a statement relation #123 has a tag "name" set to "value" can be represented as subject (relation #123), predicate (a tag "name"), and object ("value"). Both the subject and the predicate parts of the statement must always be a complete URI, e.g. <https://www.openstreetmap.org/way/42> (way #42), or <https://wiki.openstreetmap.org/wiki/Key:wikidata> (tag wikidata). To make the URI more readable, we use prefixes to shorten it, e.g. osmway:42 and osmt:wikidata. The object part of the statement can be either a value (string/number/boolean/geo coordinate/...), or, just like the first two parts, a URI. This way one statement's object could be another statement's subject, creating a linked graph. In SPARQL, each statement must end with a period, but if multiple statements have the same subject, we can separate them with a semicolon. The prefixes are defined in the engine, but could be specified manually for clarity.

Note: All RDF data is stored in a single giant "subject/predicate/object" table. That data includes OSM data, OSM tags metadata from this wiki, Wikipedia pageviews statistics, and other. So the subject could be osmnode:1234, osmd:Q42, sitelinks, or many other types of URIs. The easiest way to get just the OSM objects is to use osmm:type predicate - all OSM subjects will have one. But for performance reasons, don't use more filters than you need - if you are querying for any OSM objects with a osmt:wikipedia, you don't need to specify the osmm:type, because osmt:wikipedia already implies an OSM object.

osmnode:1234  osmm:type      'n'
osmnode:1234  osmm:loc       'Point(32.1 44.5)'^^geo:wktLiteral  # longitude/latitude
osmnode:1234  osmt:name      'node name tag'
osmnode:1234  osmt:name:en   'node name:en tag'
osmnode:1234  osmt:wikipedia <https://en.wikipedia.org/wiki/Article_name>
osmnode:1234  osmt:wikidata  wd:Q34
 
osmway:2345  osmm:type       'w'
osmway:2345  osmm:loc        'Point(32.1 44.5)'^^geo:wktLiteral  # way's centroid
osmway:2345  osmm:isClosed   true # is this way an area or a line?
osmway:2345  osmt:name       'way name tag'
osmway:2345  osmt:name:en    'way name:en tag'
osmway:2345  osmt:wikipedia  <https://en.wikipedia.org/wiki/Article_name>
osmway:2345  osmt:wikidata   wd:Q34
 
osmrel:3456  osmm:type       'r'
osmrel:3456  osmm:has        osmway:2345  # relation contains way #2345
osmrel:3456  osmway:2345     "inner"      # way #2345 has an "inner" role
osmrel:3456  osmm:loc        'Point(32.1 44.5)'^^geo:wktLiteral  # centroid of all members
osmrel:3456  osmt:name       'way name tag'
osmrel:3456  osmt:name:en    'way name:en tag'
osmrel:3456  osmt:wikipedia   <https://en.wikipedia.org/wiki/Article_name>
osmrel:3456  osmt:wikidata    wd:Q34


# Other common values:
...  osmm:version    42         # Last object's version
...  osmm:changeset  1234567    # Last edit of this object in this changeset
...  osmm:timestamp  '2017-08-23T01:02:03+00:00'^^xsd:dateTime    # Last edit time
...  osmm:user       'UserName' # Name of the user - could change

Other tags:

  • osmm:badkey "@Sammelmütze" - any unsupported tag names will be stored here as "osmm:badkey" objects, without their own values. For a tag to be supported, it must match /^[0-9a-zA-Z_]([-:0-9a-zA-Z_]*[0-9a-zA-Z_])?$/ regex -- contain only Latin letters, digits, and underscore. Dashes "-" and colons ":" may also be used except in the first and last positions.
  • osmm:loc:error "Error: ..." - whenever node or way's location could not be calculated, this property contains relevant error info.

Other Data Sources

OSM Metadata

This wiki's data items, such as key and tag description, is also imported into Sophox using the same method as Wikidata Query Service, except that it uses slightly different namespaces, e.g. wd → osmd and wdt → osmdt. Check out an excellent SPARQL wikibook.

A special osmd:Q... osmm:key osmt:... is automatically added to tie the data item with the OSM data.. For example bridge (Q103) has osmd:Q103 osmm:key osmt:bridge.

Key Usage Statistics

All metadata keys include their usage statistics from Taginfo. For example, bridge (Q103) has these extra values, as defined in Taginfo API:

osmd:Q103  osmm:count_all                 3621295  # Number of objects in the OSM database with this key.
osmd:Q103  osmm:count_all_fraction        0.0007   # Number of objects in relation to all objects.
osmd:Q103  osmm:count_nodes               4790     # Number of nodes in the OSM database with this key.
osmd:Q103  osmm:count_nodes_fraction      0        # Number of nodes in relation to all tagged nodes.
osmd:Q103  osmm:count_relations           742      # Number of ways in the OSM database with this key.
osmd:Q103  osmm:count_relations_fraction  0.0001   # Number of ways in relation to all ways.
osmd:Q103  osmm:count_ways                3615763  # Number of relations in the OSM database with this key.
osmd:Q103  osmm:count_ways_fraction       0.0066   # Number of relations in relation to all relations.
osmd:Q103  osmm:users_all                 62161    # Number of users owning objects with this key.
osmd:Q103  osmm:values_all                297      # Number of different values for this key.

Wikipedia Page View Counters

Most popular pages have pageview counters stored in Sophox, making it possible to order results by popularity in a specific wiki. Note that the count has no meaning in itself, only as relative to other pageviews. This way one can see the importance/prominence of a city vs another city for labeling purposes.

  • <https://en.wikipedia.org/wiki/Universe> pageviews: 12345

External Data Sources

Sophox query may pull data from an external source, and combine it with the other data. The data must be in a "tabular" format (e.g. CSV), and accessible via a URL (e.g. stored in a Gist). The exact data format may be configuredwith additional parameters.

SELECT * WHERE {

  # This service will pull in the external data from 
  SERVICE wikibase:tabular {
    # URL of the data: https://gist.github.com/nyurik/bc0e305248a66cfac565349e41f1ae99
    # Use the "RAW" link
    bd:serviceParam wikibase:url <https://gist.githubusercontent.com/nyurik/bc0e305248a66cfac565349e41f1ae99/raw/fbb0d844a8cb068faec9b90adecdae16e99da26f/museums-in-JC.csv> .

    # Our data contains a header line, allowing us to use "tabular:columnName" instead of "tabular:index"
    bd:serviceParam wikibase:firstRowIsHeader true .

    # Get these columns into correspondingly named variables
    ?url tabular:url 'uri' .
    ?type tabular:type 'string' .
    ?id tabular:id 'integer' .
    ?name tabular:name 'string' .
  }
}

Run it (edit query)

Overpass Integration

The Overpass service can produce CSV-formatted results by using [csv:...] directive. You can create a simple data URL in Overpass Turbo by clicking "export" and copying the "raw data directly from Overpass API" link. This produces a hard-to-read URL-encoded query string. Alternatively, you can use the original query as a otQuery parameter. Specifying otQuery sets other defaults to parse tab-separated results.

[out:csv(::id,::type,'name')][timeout:25]; (
  node["tourism"="museum"]({{bbox}});
  way["tourism"="museum"]({{bbox}});
  relation["tourism"="museum"]({{bbox}});
); out;

Use Export / Query / convert compact link to get the query as a single string to convert the extended shortcuts like {{bbox}} into a regular Overpass QL query. The above query produced a list of museums, and we use its output in this SPARQL query to find the "instance of" (P31) of the corresponding Wikidata entries.

SELECT * WHERE {
  # We need to set query optimizer to either "None" or "Runtime". The default "Static" optimizer
  # will try to get data from the service after pulling all locations with wikidata, which results
  # in Out Of Memory error. Instead, it should get the service first (same order as the query = None),
  # or do on-the-fly performance optimization with "Runtime".  See Blazegraph help:
  # https://wiki.blazegraph.com/wiki/index.php/QueryOptimization#Join_Order_Optimization
  hint:Query hint:optimizer "None" .

  # Get a list of museums in a predefined area from Overpass
  SERVICE wikibase:tabular {
    bd:serviceParam wikibase:otQuery '[out:csv(::id,::type,"name")][timeout:25];(node["tourism"="museum"](40.69,-74.09,40.76,-74.02);way["tourism"="museum"](40.69,-74.09,40.76,-74.02);relation["tourism"="museum"](40.69,-74.09,40.76,-74.02););out;' .
    ?type tabular:%40type 'string' .
    ?id tabular:%40id 'string' .  # Use string instead of integer to avoid another conversion in ?osmID creation below
    ?name tabular:name 'string' .
  }
  # Dynamically create OSM object ID from Overpass type & id fields
  BIND(URI(CONCAT("https://www.openstreetmap.org/", ?type, "/", ?id)) as ?osmId)

  # Match Overpass results with the RDF data
  # (We could have retrieved wikidata & location from OT as well)
  ?osmId osmm:loc ?loc .

  # If exists, get Wikidata ID, and the type (instance of) of that ID with a label
  OPTIONAL { 
    ?osmId osmt:wikidata ?wikidata .
    ?wikidata wdt:P31 ?instanceOf .
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
      ?instanceOf rdfs:label ?instanceOfLabel .
    }
  }
}

Run it (edit query)

Parameters

Input parameters:

Each parameter must be in the form bd:serviceParam wikibase:paramName <value> .

name type Description
url string URL to the data source. Service must have either url or otQuery parameter.
otQuery string Overpass QL query that returns a CSV-formatted results. The entire query must be written as a single line. In the output, use url-encoded column names for special characters. (e.g. tabular:%40id for @id).
format string Which CSVFormat profile to use for parsing. Uses DEFAULT for regular urls, and TDF for the Overpass queries
firstRowIsHeader boolean If true, uses first row for column names. By default, OT queries will set this to true, and url requests will use the default from the format.
ignoreSurroundingSpaces boolean The trimming behavior, true to remove the surrounding spaces, false to leave the spaces as is. See doc.
commentMarker char Sets comment start marker to the specified character. Comment start character is only recognized at the start of a line. See doc.
delimiter char Sets delimiter character. See doc.
escape char Sets escape character. See doc.
quote char Sets quote character. See doc.
useColumnNames boolean If true, use tabular:columnName to access columns. Otherwise, use tabular:N, where N is a 0-based column index. By default, this value is the same as firstRowIsHeader.

Output parameters: All output parameters must be in the form of ?variable tabular:columnName "dataType" or ?variable tabular:columnIndex "dataType". The useColumnNames parameter controls if it should be the columnName or columnIndex. Allowed data types: string, double, integer, uri.

Generating Polygon Files

If your query contains an ?id field, and each of its values is unique, you can use it to generate a file with polygons, where each polygon's ID will be set to the ?id field, and any additional fields will be used as polygon's properties. For example, your query could generate the list of US states, and for each state get the name of the governor as a ?governor field. The result will be a file with 50 features, and each feature will have a "governor" property. The result will be shown using Mapshaper. You can inspect and simplify the result, add data files, and do many other geometry manipulations.

To use this mode, add #defaultView:MapRegions line at the top of your query.

This query shows all Canadian provinces, with their names in English and French, their ISO 3166-2 code, and their flags (URL on Wikimedia Commons).

#defaultView:MapRegions
SELECT 
  ?id
  (SAMPLE(?label_en) as ?label_en)
  (SAMPLE(?label_fr) as ?label_fr)
  (SAMPLE(?iso_3166_2) as ?iso_3166_2)
  (SAMPLE(?flag) as ?flag)
WHERE {
  # List of regions, whose sub-regions we want
  VALUES ?entity { wd:Q16 } 

  SERVICE <https://query.wikidata.org/sparql> { 
    # P150 = "contains administrative territorial entity"
    # but must not have a P582 (end date) qualifier
    ?entity p:P150 ?statement .
    ?statement ps:P150 ?id .
    FILTER NOT EXISTS { ?statement pq:P582 ?x }

    # Get labels, ISO code, and flag image
    # for each sub-region, if available
    OPTIONAL { ?id rdfs:label ?label_en . FILTER(LANG(?label_en) = "en") }
    OPTIONAL { ?id rdfs:label ?label_fr . FILTER(LANG(?label_fr) = "fr") }
    OPTIONAL { ?id wdt:P300 ?iso_3166_2 }
    OPTIONAL { ?id wdt:P41 ?flag }
  }
}
# remove possible ID duplicates
GROUP BY ?id

Run it (edit query)

Quick-fix Editor

Quick-fix editor has been fully re-implemented based on community feedback. See Sophox Editor

Other Queries

Current limitations

  • Only includes OSM nodes that have at least one tag. All ways and relations, even empty ones, are included.
  • The database only stores values for simple tag keys - those that only English letters, digits and symbols - : _. Tag keys that don't satisfy this criteria, e.g. those that contain spaces or accented characters, are stored as osmm:badkey "bad tag key", without their values.
  • OSM geometries are not imported, but most objects have osmm:loc (center point), and ways have osmm:isClosed (true/false).

See also

External links

Wikidata
Wikidata has a Q-item on Sophox.