State Of The Map U.S. 2016/Hands-On Day/OSM Analysis SOTMUS 2016

During the hands-on day, the analysis group split into two rooms. One room worked on defining use cases and common questions people have about OSM statistics and analysis. The second room worked on engineering challenges related to enabling these analytics.

OSM Analytics Use Cases

As evidenced by the many presentations and Birds-of-a-Feather sessions at SOTMUS 2016, there is growing interest in capabilities for analyzing OSM data. The motivations range from understanding and supporting community growth to validating data and assessing its quality.

1. Assessing Current State of the Map / Coverage

Why/Who is interested?

  • Institutions that feed data into OSM and need to know where coverage is lacking
  • Community organizers deciding where to target OSM outreach efforts to build community and recruit new mappers
  • Groups that ingest and serve OSM data

Questions

  • What is on the landscape that can be mapped? (relates more to machine learning, image processing, etc. than OSM analytics)
  • What is the hierarchy of needs for what should be mapped?
  • What is the percentage of already-mapped features vs. features that still need to be mapped?
  • What is the percentage of current mappers compared with available mappers (mappers per capita)? (A toy calculation follows this list.)
  • How do the above figures vary across space (e.g., which areas are overmapped vs. undermapped)?
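
To make these metrics concrete, here is a toy calculation. Every figure below is an invented placeholder; real values would come from OSM extracts and external sources such as census data or imagery-derived estimates.

```python
# Hypothetical coverage metrics for one area. All inputs are placeholders.
buildings_mapped = 12500      # features already in OSM for the area
buildings_estimated = 40000   # e.g., estimated from imagery or census data
active_mappers = 85           # contributors with recent edits in the area
population = 250000           # pool of potentially available mappers

pct_mapped = 100 * buildings_mapped / buildings_estimated
mappers_per_100k = 100000 * active_mappers / population

print(f"{pct_mapped:.1f}% of estimated buildings mapped")        # 31.2%
print(f"{mappers_per_100k:.1f} active mappers per 100k people")  # 34.0
```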


2. Snapshot Before/After Comparison

Why/Who is interested?

  • Paid mappers or companies feeding data into OSM (e.g., Mapbox)
  • Teachers evaluating assignments
  • Mapathon organizers tracking recurring events (e.g., regular Missing Maps mapathons): how much have we done since date X?

Questions

  • How can we look at the changes between two dates (before and after a mapathon or HOT Task Activation, for example)? (A minimal sketch follows this list.)
  • Can we quantify/qualify ground-truthing efforts?
  • Are these the same methods we can use for armchair-contributed data (such as mapathon output)?
  • How much have tagging and geometry changed in a place over time?
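
One way to approach the before/after question is the Overpass API's attic data (the [date:...] setting), which lets you query the map as it stood at a past timestamp. The sketch below counts building ways in a bounding box at two dates and diffs the counts; the endpoint, bounding box, and dates are assumptions for illustration, and a real mapathon report would want much more than a single count.

```python
# Minimal before/after sketch against the Overpass API. Requires the
# `requests` package. The bounding box and dates are hypothetical.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
BBOX = "38.88,-77.05,38.92,-77.00"  # south,west,north,east (hypothetical)

def count_buildings(date):
    """Count building ways in BBOX as the map stood at `date` (ISO 8601)."""
    query = f'[out:json][date:"{date}"];way["building"]({BBOX});out count;'
    resp = requests.get(OVERPASS_URL, params={"data": query})
    resp.raise_for_status()
    # `out count;` yields one element of type "count" whose tags hold totals.
    return int(resp.json()["elements"][0]["tags"]["ways"])

before = count_buildings("2016-07-01T00:00:00Z")
after = count_buildings("2016-08-01T00:00:00Z")
print(f"Net buildings added in the window: {after - before}")
```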


3. Real-Time Quality Assessment

Why/Who is interested?

  • HOT and other mapathon organizers
  • Teachers guiding new mappers
  • Any company relying on OSM and concerned about data quality (e.g., Mapbox, Facebook)

Questions

  • Which mappers are consistently making mistakes or low-quality edits?
  • Can we integrate analysis within the Tasking Manager?
  • Can we automate building validation (overlaps, unsquared corners, etc.) and link it to the Tasking Manager messaging interface to actively intervene? (A sketch of one such check follows this list.)
  • How can we quantify/summarize micro-level mistakes to understand whether an issue is systematic (documentation, tools, etc.) or individual?
  • How can we message/inform users about how to improve their editing practices (e.g., an image of how to properly square buildings, avoid topology inconsistencies, etc.)?
  • Can we update the HOT Tasking Manager tools to automatically QA/validate?
  • Can we integrate HOT messaging with OSM messaging to contact users?
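
As a minimal sketch of what one automated building check could look like, the function below flags footprints whose corners deviate from 90 degrees by more than a tolerance. It treats coordinates as planar, which is a simplification (a real validator would project to a local plane first), and the sample footprint is made up.

```python
# Toy "unsquared building" detector. Planar coordinates assumed.
import math

def corner_angles(ring):
    """Angles in degrees at each vertex of a closed ring of (x, y) tuples."""
    pts = ring[:-1] if ring[0] == ring[-1] else ring
    angles = []
    n = len(pts)
    for i in range(n):
        ax, ay = pts[i - 1]
        bx, by = pts[i]
        cx, cy = pts[(i + 1) % n]
        v1 = (ax - bx, ay - by)
        v2 = (cx - bx, cy - by)
        cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (
            math.hypot(*v1) * math.hypot(*v2))
        # Clamp to guard against floating-point error outside [-1, 1].
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_angle)))))
    return angles

def is_unsquared(ring, tolerance_deg=15):
    """True if any corner deviates from 90 degrees by more than the tolerance."""
    return any(abs(a - 90) > tolerance_deg for a in corner_angles(ring))

# Hypothetical footprint with two corners roughly 17 degrees off square:
footprint = [(0, 0), (10, 3), (10, 10), (0, 10), (0, 0)]
print(is_unsquared(footprint))  # True
```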


4. User and Group-level Statistic Summaries

Why/Who is interested?

  • All users: such summaries would foster a sense of pride in production, motivating mappers to do more
  • HOT and mapathon organizers looking to set up friendly competitions
  • Educators looking to assess classroom impact as a performance unit: Missing Maps teams, teaching objectives, etc.

Questions

  • Can/should we report aggregate user stats in profiles (similar to Pascal Neis's "How did you contribute?" stats)? Many users already link to these pages from their profiles
  • What would it look like to include Missing Maps/HOT stats in OSM user profiles (as an option)?
  • How do we allow anyone to generate reports of aggregated stats for a class or mapathon team?


5. Historical Data Analysis

Why/Who is interested?

  • Researchers
  • The OSM community: how to foster healthy growth/expansion

Questions

  • How is the map created?
  • What are OSM editing practices, and how have they changed over time?
  • What does collaboration in OpenStreetMap look like? How does it compare to other peer-production systems?
  • What different editing groups exist, historically (institutions, paid mappers, data-team QA)?
  • What development phase are we in: exploration or map gardening? (cf. Alan McConchie's work)


Further notes on above use cases

Macro vs. micro scale of edits:

The complication is how to link these two scales. Tools and queries optimized for one level are not going to perform well across both domains, so how can we establish a useful link between them?

Macro Level results are good for:

  • Scope of questions: "How many? Where? When?"
  • Compelling visualizations to inspire OSM contributors

Micro Level results are good for:

  • Questions about "Who?"
    • types of edits
    • types of conflicts
  • Recurring individual errors that become macro-level issues



Low-level analytics functions

The functions below could be part of an API or query framework for answering the higher-level questions above. Some of them are already available to a degree in the epic-osm project (not scalable, but it currently works for historical data bounded to smaller areas). Two of these functions are sketched against the public OSM API after the list below.

Possibly useful functions for answering higher-level questions above

  • Return all contributors in bounding box
  • Return all changeset metadata for contributor
  • Return all geometries for a contributor
  • Return all geometries in time window
  • Return all changeset metadata in time window
  • Return all changeset metadata with a substring (i.e., a hashtag) in the comment
  • Return all changeset data (the actual changes) with a substring (i.e., a hashtag) in the comment
  • Return all users with tag
  • Return all tags for user
  • Return all diff geometries between two timestamps
  • Return all comments for users with x percent of changes in bounding box y
  • Return/mine all profile or wiki text for contributors in a bounding box, especially badges/widgets/templates users have placed on their wiki pages, such as the Babel language template. See the example at http://wiki.openstreetmap.org/wiki/User:Geogast
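
As a proof of concept, two of these functions can already be approximated against the OSM API's changeset query endpoint (/api/0.6/changesets). The sketch below is an illustration, not a scalable implementation: the API caps results at 100 changesets per call, so real use would paginate by time window, and the user name is hypothetical.

```python
# Rough sketches of "return all changeset metadata for contributor" and
# "return all contributors in bounding box" via the OSM API 0.6 changeset
# query. First page of results only. Requires the `requests` package.
import requests
import xml.etree.ElementTree as ET

API = "https://api.openstreetmap.org/api/0.6/changesets"

def changesets_for_contributor(display_name):
    """Return changeset metadata elements for a contributor (first page only)."""
    resp = requests.get(API, params={"display_name": display_name})
    resp.raise_for_status()
    return ET.fromstring(resp.content).findall("changeset")

def contributors_in_bbox(min_lon, min_lat, max_lon, max_lat):
    """Return the set of user names with recent changesets in a bounding box."""
    bbox = f"{min_lon},{min_lat},{max_lon},{max_lat}"
    resp = requests.get(API, params={"bbox": bbox})
    resp.raise_for_status()
    return {cs.get("user") for cs in ET.fromstring(resp.content).findall("changeset")}

for cs in changesets_for_contributor("SomeMapper"):  # hypothetical mapper
    comment = cs.find("tag[@k='comment']")
    print(cs.get("id"), cs.get("created_at"),
          comment.get("v") if comment is not None else "(no comment)")
```

Geometry-oriented functions (diffs between timestamps, geometries per contributor) would instead need the full history planet file or a service built on top of it.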


What is the best base object for some of these questions?

Decisions regarding the data model from the beginning will shape the types of questions that we are able to ask in the future. This is a brief discussion of the pros/cons of different data models for these questions.

Changeset

Pro: Includes user comments; a more 'user-centric' object that gives insight into a logical grouping of edits and their purpose.

Con: Difficult to obtain at scale (the best current methods are the planet file or planet-stream)

Individual Geometries (i.e., Tiles)?

Pro: Data-driven information that we are more interested in: kilometers of roads, number of amenities, number of buildings, etc. Tile-reduce exists, is awesome, and gives us these answers at scale.

Con: Does not include changeset comments; the vector tile model is not centered on the changeset, so it requires a second collection.

An OSM Historical Object (Historical Tile)?

What is a historical object in OSM? Is it an object with all of its history embedded into it? Would these be built in real time? Is this worth it? Would this be doable with the vector tile model? One possible shape is sketched after the pros and cons below.

Pro: A community standard that could define how we handle historical queries once and for all (no more processing of `full-planet-history.pbf`). It would only need to be generated once and could then be cumulative; it does not need to be a single file. Weekly dumps aggregated monthly and then annually, for example, would be enough granularity for the historically oriented questions.

Con: What would it look like? Would it be scalable?
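
To make the discussion concrete, here is one possible in-memory shape for such a historical object. This is purely a sketch of the idea, not an existing format; every field name is invented for illustration.

```python
# Hypothetical "historical object": one OSM element with all versions embedded.
from dataclasses import dataclass, field

@dataclass
class ElementVersion:
    version: int
    timestamp: str    # ISO 8601 UTC, e.g. "2016-07-25T09:00:00Z"
    changeset: int
    user: str
    tags: dict        # full tag state at this version
    geometry: list    # resolved coordinates at this version

@dataclass
class HistoricalElement:
    element_type: str              # "node", "way", or "relation"
    element_id: int
    versions: list = field(default_factory=list)

    def as_of(self, timestamp):
        """Latest version at or before `timestamp` (ISO 8601 UTC strings
        in a uniform format compare correctly as plain strings)."""
        candidates = [v for v in self.versions if v.timestamp <= timestamp]
        return max(candidates, key=lambda v: v.timestamp, default=None)
```

Under this sketch, the weekly/monthly/annual dumps mentioned above would simply append to each object's versions list, and tiling becomes a question of how to bucket these objects spatially.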


Possible Data Sources

  • OSM (Map Data: All Geometries, All Changesets)
  • Mapillary
  • OSM Wiki (helps explain the events/behavior we see in the data)
  • Mapathon organizers
  • Maptime meetups, mapathon schedules, etc., which could help direct analysts on where to look
  • OSM user pages (being able to link to a user page when that user has added a personal profile would be really useful)
  • TagInfo / TagWatch (Can we include any additional information from these resources into these results?)