GDPR/Planet.osm Migration

Outline (moved over from GDPR/Affected_Services)

  • all .osm.pbf and .osm.bz2 files, as well as minutely, hourly, and daily diffs: only logged-in users get to see the full files; non-logged-in users only see a version with user-related attributes removed. This will require either extensive laundering of old files lying around or removing them from public view.
  • "changesets" files (full dumps and replication files) as well as "discussions" files are only available to logged-in users.
  • full history files could simply be removed from public view and limited to logged-in users. If they remain publicly available, user-related attributes must be removed.

Remark:

  • planet.openstreetmap.org doesn't really have a concept of "logging in". It is rather unclear how this is supposed to work for external data consumers who want to log in and get the full data.

The following is a draft plan, open for comments; it has also been posted to the dev list.

Phase 1 - Introduction of no-userdata files

This does not require core software development and could start immediately, although some scripting is required.

1a. set up a new domain for OSM internal data downloads, e.g. "osm-internal.planet.openstreetmap.org", initially duplicating all data (one way to do this is sketched after the issues below).

  • Issue: name of domain?
  • Issue: disk usage on ironbelly is at 70%; is it possible to add space?
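
One way to realize 1a's "initially duplicating all data" without doubling disk usage on ironbelly would be to hard-link the existing public tree into the new internal one. This is only a sketch; the paths and the hard-link approach itself are assumptions, not a decided design:

 # Sketch: populate the new internal tree by hard-linking the existing
 # public files (GNU cp; uses no extra space for file contents).
 mkdir -p /store/planet/internal
 cp -al /store/planet/public/. /store/planet/internal/

If this route is taken, any later in-place scrubbing (phase 2b) has to replace files via a temporary file plus mv, so that the hard-linked internal copies remain untouched.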

1b. modify planetdump.erb in the planet chef cookbook to generate, in addition to the versions with user information, versions of all the weekly dumps without user information; store the versions without user information in the old "planet.openstreetmap.org" tree and the versions with user information in the new "osm-internal" tree (the stripping step is sketched after the issue below).

  • Issue: should files have the same names on internal and public site, or should they be called "planet-with-userdata" and "planet" or something?
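
For the stripping itself, reasonably recent versions of osmium-tool support granular metadata options on the output format, which lets version and timestamp be kept while uid, user, and changeset are dropped. A sketch with hypothetical paths and file names, assuming such a version is installed:

 # Produce the public (no-userdata) planet from the internal (full) one.
 # add_metadata=version+timestamp keeps version and timestamp but drops
 # uid, user, and changeset; append "+changeset" to also keep changeset ids.
 osmium cat /store/planet/internal/planet-latest.osm.pbf \
     --output-format pbf,add_metadata=version+timestamp \
     -o /store/planet/public/planet-latest.osm.pbf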

1c. modify replication.cron.erb as follows (a sketch of the post-processing script follows the list):

  • have osmosis write minutely replication files to the new "internal" tree
  • run a shell script after generating the replication files that finds the newly generated file, pipes it through osmium to strip user information, and writes the result to the old "planet" tree, copying the state.txt files as needed (sketched below)
  • run the osmosis "merge-diff" tasks separately on both trees, or run them on the internal tree only and pipe the result through osmium as above
  • write changeset replication XMLs to the new "internal" tree only
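
The post-processing script in 1c could look roughly as follows. The aaa/bbb/ccc.osc.gz layout is osmosis' standard replication scheme; everything else (paths, retained attributes) is an assumption:

 #!/bin/sh
 # Sketch of a post-replication hook: strip user data from the newest
 # minutely diff and publish it, plus its state files, to the public tree.
 INTERNAL=/store/planet/internal/replication/minute
 PUBLIC=/store/planet/public/replication/minute
 
 # The top-level state.txt names the newest sequence number.
 SEQ=$(grep sequenceNumber "$INTERNAL/state.txt" | cut -d= -f2)
 
 # osmosis stores sequence N as aaa/bbb/ccc.osc.gz (nine digits, split 3/3/3).
 D1=$(printf "%03d" $((SEQ / 1000000)))
 D2=$(printf "%03d" $((SEQ / 1000 % 1000)))
 F=$(printf "%03d" $((SEQ % 1000)))
 
 mkdir -p "$PUBLIC/$D1/$D2"
 
 # Keep version and timestamp; drop uid, user, and changeset.
 osmium cat "$INTERNAL/$D1/$D2/$F.osc.gz" \
     --output-format osc.gz,add_metadata=version+timestamp \
     -o "$PUBLIC/$D1/$D2/$F.osc.gz"
 
 # Copy the per-file and top-level state files so consumers can follow.
 cp "$INTERNAL/$D1/$D2/$F.state.txt" "$PUBLIC/$D1/$D2/$F.state.txt"
 cp "$INTERNAL/state.txt" "$PUBLIC/state.txt"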

For step 1c, it might make sense to announce a maintenance window beforehand during which the changes will be made, so that consumers who rely on user data can stop their replication for a few hours and then make the switch.

1d. modify the planet.openstreetmap.org index pages to point to the internal page in case people wish to download files with user data; place a notice on the internal page that these files contain user data.

At the end of phase 1, we will have this situation:

  • new changeset diffs only on the "internal" tree
  • regular diffs come in two flavours, with and without user data
  • planet dumps etc. also come in two flavours
  • old files are unchanged
  • consumers will automatically get the stuff without user data
  • consumers who need user data will have to change their URLs

Phase 2 - Cleaning out old files that contain user data

This can be done slowly in the background over the course of however long it takes:

2a. remove all changeset dumps and changeset diffs from the public tree.

2b. run all .osc, .osm.pbf, and .osm.bz2 files on the public tree through osmium, scrubbing user data (retaining file timestamps if possible) and re-creating .md5 files where necessary (a sketch follows).
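
A sketch for 2b, assuming GNU userland and the same hypothetical paths as above. Replacing each file via a temporary file plus mv keeps its inode distinct from any hard-linked internal copy, and touch -r carries over the original timestamp:

 #!/bin/sh
 # Scrub one archived file in place: strip user data, keep the mtime,
 # and refresh the .md5 checksum file if one exists.
 scrub() {
     f="$1"; fmt="$2"
     osmium cat "$f" -f "$fmt,add_metadata=version+timestamp" -o "$f.tmp" --overwrite
     touch -r "$f" "$f.tmp"          # retain the original file timestamp
     mv "$f.tmp" "$f"
     [ -f "$f.md5" ] && ( cd "$(dirname "$f")" &&
         md5sum "$(basename "$f")" > "$(basename "$f").md5" )
 }
 
 find /store/planet/public -name '*.osm.pbf' | while read -r f; do scrub "$f" pbf;     done
 find /store/planet/public -name '*.osm.bz2' | while read -r f; do scrub "$f" osm.bz2; done
 find /store/planet/public -name '*.osc.gz'  | while read -r f; do scrub "$f" osc.gz;  done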

Phase 3 - Controlling access to files with user data

Once the parallel systems are up and running, we will want to

3a. issue guidelines about what you are allowed to do with the user data files,

3b. ensure that everyone who has an OSM account agrees to these guidelines one way or the other,

3c. start requiring an OSM login for all downloads from the internal, "with userdata" tree.

One possible technical solution for 3c is https://github.com/geofabrik/sendfile_osm_oauth_protector, which also comes with a guide for users on how to run it in a scripted setup.
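
From the data consumer's side, a scripted download with that tool would look roughly like the following; the internal host name is hypothetical here, and the exact options of the cookie client should be taken from the repository's own guide:

 # Obtain an authentication cookie once, using the helper script shipped
 # with sendfile_osm_oauth_protector (see its README for the exact options):
 python3 oauth_cookie_client.py -u <osm-user> -p <password> \
     -c https://osm-internal.planet.openstreetmap.org/get_cookie -o cookie.txt
 
 # Present the cookie on every download from the protected tree:
 wget --no-cookies --header "Cookie: $(cat cookie.txt)" \
     https://osm-internal.planet.openstreetmap.org/replication/minute/state.txt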