Automated Edits/b-jazz-bot

From OpenStreetMap Wiki
HTTPS All The Things!

https_all_the_things is a manually run script that converts http website values to https where appropriate. A user who clicks on a link provided in the website tag will often be redirected to an https link anyway, so this script only does what the end user would be doing, while saving them the time of going through the redirect first. The script modifies a tag only when the site itself issues such a redirect, and only in the specific circumstances described below.

  • Who is making the change:
    • b-jazz (OSM user account) and later b-jazz-bot. You can send messages to either account and I will respond.
  • Your motivation for making the change and why it is important
    • http is the unencrypted protocol while https is encrypted. Security experts and many others have been urging people for years to move from http links to https. The wiki page for the website key even states that https should be used if it exists. There have been reported instances of companies performing man-in-the-middle attacks and injecting traffic into users' web browsing, which isn't possible when https is used. A large number of websites redirect users from http to https automatically. However, this costs tens to hundreds of milliseconds for anyone visiting the http link. By crawling all of the website URLs in the planet and updating the website tag, those milliseconds can be saved over and over again.
  • A detailed description of the algorithm you will use to decide which objects are changed how
    • The script takes a geohash as input
    • Overpass is queried for objects with a tag of "website" that starts with "http://" or no protocol (i.e. "www.example.com")
    • A "HEAD" request is made to each of the given http URLs. The script batches up to a given number of redirect URLs to rewrite. I prefer to keep this number small (5-10) so that any reverts are easier and less disruptive.
    • If the server returns a redirect status code (301/302), the URL in the Location header is compared to the original URL.
    • If the Location URL and the original website value are essentially the same, but using the https protocol instead, the value is updated and saved back to the database. (By "essentially the same", I mean that adding/removing "www" or a trailing slash is allowed, whereas redirecting to a completely different domain or deep linking is not.)
  • Information about any consultation that you have conducted, with links to mailing list/forum posts or Wiki discussion pages
  • When the change was made, or how frequently it is going to be repeated
    • The script will be run on a semi-regular basis to make updates as needed.
  • Information on how to "opt out"
    • Contact me and we can discuss why you would want to opt out. I'm more than happy to accommodate any reasonable request for opting out.
  • Scope
    • I am currently planning on running this across all objects in the United States. At some point, I'll broaden the scope and run it worldwide if I don't get any strong objections to doing so.
  • Affected Tags
    • I'm currently running this against the "website" key. After that is "complete", I'll probably expand to the "contact:website" key (which has about 1/10th as many tags as "website" worldwide) and then the "url" key. I'm not sure how likely it is that I will run against the other website/url tags out there. If someone feels strongly, I would consider adding their tags to the list.
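
The redirect check described in the algorithm above can be sketched roughly as follows. This is a minimal illustration and not the code from the actual script: the function names head_location and https_upgrade are hypothetical, and the exact rules (ignoring a leading "www.", a trailing slash, and host-name case, while rejecting a different domain, path, or query string) are one reading of "essentially the same".

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def head_location(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request without following redirects; return Location on 301/302."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
        try:
            conn.request("HEAD", target)
            response = conn.getresponse()
            if response.status in (301, 302):
                return response.getheader("Location")
        finally:
            conn.close()
    except (OSError, ValueError, http.client.HTTPException):
        # DNS failures, timeouts, and broken servers are simply ignored.
        return None
    return None


def _normalize_host(host: str) -> str:
    """Lower-case the host and drop a leading 'www.' so both spellings compare equal."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def https_upgrade(original: str, location: str) -> Optional[str]:
    """Return the new tag value if `location` is just the https form of `original`."""
    if "://" not in original:
        # Assumption: scheme-less values such as "www.example.com" are treated as http.
        original = "http://" + original
    old, new = urlsplit(original), urlsplit(location)
    if old.scheme != "http" or new.scheme != "https":
        return None
    if old.hostname is None or new.hostname is None:
        return None
    if _normalize_host(old.hostname) != _normalize_host(new.hostname):
        return None  # redirect to a different domain
    if old.path.rstrip("/") != new.path.rstrip("/"):
        return None  # deep link or otherwise different path
    if old.query != new.query:
        return None
    return location
```

In this sketch a tag would be rewritten only when https_upgrade(value, head_location(value)) returns a URL; every other outcome (errors, non-redirect responses, redirects elsewhere) leaves the object untouched.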

Slack discussion

  • I've removed the user names since they came from a semi-private server. If the comments come from you, and if you care, feel free to unredact your username.
b-jazz
I'm working on a project to change website tags that are http:// (unsecure) to https:// (encrypted/"secure") and wanted to get some feedback before proceeding.
b-jazz 
I've written the code to query small bits of overpass at a time (4-5 digit geohashes), do a query of the listed website and if it offers a redirect to https, I save the new website back to the n/w/r.
b-jazz 
I typically only make 5 changes to a geohash at a time
b-jazz 
I'm starting off with websites that strictly redirect from http://example.com to exactly https://example.com. In the future I'll loosen that up to handle redirects to https://www.example.com and probably ones that append a trailing "/".
b-jazz 
I haven't decided on what to do with ones that do significant differences in the redirect url. I could argue either way. I mean, the user will be redirected themselves. Might as well save the roundtrip time of a couple extra packets. But I'll worry about that later.
b-jazz 
I don't make any assumptions and force https unless the site specifically does a redirect. So the user won't be forced to https if the site offers an unencrypted site that doesn't redirect.
b-jazz 
If you have any thoughts/questions, I'd love to entertain them.
redacted1 
trailing slash should be matched, that's not a significant part of the URL. Just like domain names are case-insensitive. Even in the strictest matching, you should allow those differences to still be considered a match. (edited)
b-jazz 
i agree. i just haven't coded that part yet. i will.
redacted2 
I would manually handle urls that change a lot. Your bot could be getting redirected to a 404 search page or whatever.
b-jazz 
No, it specifically looks for redirects with an https:// scheme. I’m ignoring all 404s, 503s, etc. I don’t make any assumptions. Redirects are in the 3xx range of HTTP response codes. (edited)
b-jazz 
(I’d like to take up a project some day to force urls to https even when the listed http doesn’t redirect in order to push a more secure internet, but that is not a battle I’m starting with this project.)
redacted2 
restating: manually handle poor matches because people do dumb things on websites.
b-jazz 
ok, i see what you're saying now. thanks for the input.
redacted3 
If there's a large enough difference, it could well be a sign that the URL's no longer valid, even if you're not getting a 404.
b-jazz 
which is why i'm starting with making sure there are sane differences (http -> https, adding or removing www, adding trailing slash).
b-jazz 
i can certainly tackle the large difference problem at a later date. there is plenty of low hanging fruit to tackle for now.
redacted4 
Is your code available somewhere? Also I would make a condition for error codes and manually review those. Some might be gone and need deleted or moved to a `historic:website=` if it was a redirect, &c.
redacted4 
I also prefer manually reviewing websites, as a human can interpret things better than an algorithm in terms of tagging. Websites also usually include a lot of useful data that isn't in OSM yet.
b-jazz 
The source code isn’t available, yet. I’ll certainly make it available in the near future. (edited)
b-jazz 
I agree a cleanup of websites that are gone is needed. But that’s out of scope for this project.
b-jazz 
As for manually reviewing websites, that is also out of scope. I’m just trying to get the tag to correctly show the https version where the user is already being redirected there.
redacted4 
ah ok I see. Bot with limited scope that will just ignore any irregularities?
b-jazz 
Correct. I’m seeing lots of DNS errors, servers not handling HEAD requests correctly, requests timing out, and authorization errors. Those are all ignored. The program only looks for actual redirects to an https version of the site.
redacted3 
If someone's looking to do a general website cleanup, KeepRight flags sites that return a server error, that appear to have been redirected to domain parking, or that simply don't contain any keywords related to the POI.
b-jazz 
Yup. I'm a big fan of KeepRight.
redacted5 
What about urls missing the scheme (http:// or https://) altogether? These could also be tested and fixed accordingly.
b-jazz 
true. i'll add it to the list of future features (though that is a bit of feature creep, so i'm not 100% sure i should add it. it might be a better task for KeepRight and the like.) (edited)
redacted6 
Have you looked at this? https://wiki.openstreetmap.org/wiki/Automated_Edits_code_of_conduct
redacted6 
Which tags do you want to change? `website` and/or `contact:website`?
redacted6 
There's some social media URLs (`contact:facebook` etc.). We can easily figure out (say) FB's HTTPS policy, so you could edit those ones more confidently (edited)
redacted6 
The AECoC does *strongly* recommend some things, like posting to the mailing lists & making a wiki page.
redacted4 
It would also be nice to have the source code available for review somewhere, eg on github or similar
redacted4 
(in addition to project page in osm wiki)
b-jazz 
Yes, I’ve read the AECoC. It calls for discussing with a forum like this. I’ve started a wiki, but it pretty much says what I’ve said here so far. I’ll move to tags other than `website` at some point, but I think `website` is 100x more prevalent than the others according to taginfo. Certainly 10x more common.
b-jazz 
I’ll post the source this morning as soon as I get up.
redacted6 
`website` is a popular enough tag, and you're talking about a global scope, right? I'd advise posting to some of the mailing lists. I don't think you'll get enough feedback from just (this) slack.
b-jazz 
I’m only going to do the US. Others can grab the code and run it on their countries or more countries.
b-jazz 
Source code is now available: https://gitlab.com/b-jazz/https_all_the_things/
redacted7 
too bad there's no way to test drive your code on the dev instance (at least not for the data fetching part of the exercise). Still you could try to upload some fake data to the dev instance and test the data update part in more detail. (edited)
b-jazz 
yup. that's exactly what i did. tested writing to the dev instance. which was even a problem because after creating a way, and then reading the way, i would get a different way back from the one i created. and none of the existing ways that i tried to download would have associated tags. the dev instance is kind of a mess. but that's a conversation for a different thread.

Mailing list discussion

Starts at: https://lists.openstreetmap.org/pipermail/talk/2019-February/082083.html

Execute with caution

  • Execute only a small number of edits with a new bot before requesting and waiting for feedback before proceeding with larger edits.
    • I developed this and tested pieces against the dev server and then moved to production and did edits in my home state. Verifying along the way and improving the process, I then posted to the US Slack server and addressed any comments and concerns that were brought up.
  • Ensure that you only update based on the current dataset. Ensure that you will never accidentally overwrite something that has just been modified by someone else by using an earlier planet file.
    • As the script only modifies a single value and uses the API, there is no way I can conceive of that I would "overwrite" someone else's work.
  • Ensure that you keep all data you need in case you have to revert your change when something goes awry.
    • I've been storing log files of all ids/types/old-values/new-values. I can't imagine anyone wanting to revert this, but it should be possible.
  • Plan your changesets sensibly. If your bot creates one changeset for each edit, that becomes extremely hard to read for people. If your bot creates one changeset for a bunch of changes covering the whole planet, that, too, becomes hard to read. Changes grouped into small regions are easiest to digest for human mappers (e.g. "fixed highway tags in Orange County").
    • I break the country up into geohashes of at least 3 digits. Any geohash that has more than 100 websites returned by overpass will be broken down further into 32 geohash children. I repeat this process after every run of the script. And I only change a max of 5-10 values on each run.
  • Make sure that there is some way of identifying that a certain change has been made by your script. You could create a special user account for the script, or you could add a "source", "created_by", or "note" tag or something.
    • The user account is mine (b-jazz) with "-bot" appended, so "b-jazz-bot". The changeset "created_by" key is set to "https_all_the_things/$VERSION"
  • A "comment" tag to the changeset that describes the changes made in this changeset in a human-readable way. You must also add the tag mechanical=yes (or bot=yes), and you must link to the wiki page or user page documenting your changes from the description=* tag (e.g. description=https://wiki.openstreetmap.org/wiki/Mechanical Edits/John Doe#Tag Fixup January 2013).
    • Version 0.0.3 and later include these changeset tags.
  • Provide a means for mappers to "opt out" of your changes, i.e. if someone contacts you and asks you to stop making automated edits to things that they have edited, you must comply with that wish, and you must modify your software or procedure to leave those objects untouched in the future.
    • Simply contact me and let me know a boundary that should not be edited, or other criteria that can be programmed. I'll happily work with you to restrict the area of my changes.
  • For major changes (in the six-digit range or more), check with the admins (try IRC) to ensure that your change will not interfere with any other operations at a sys admin level, or check the Munin graphs to find out at which time the servers are not busy.
    • The bot runs sequentially and might be able to make about 10,000 changes on a good day. This shouldn't be a rate that would cause any concern.
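
As an illustration of the changeset tagging described above, the following sketch builds the tag set and the XML body that the OSM API 0.6 changeset-creation call expects. It is not taken from the script itself: the function names, the wording of the comment tag, and the choice of mechanical=yes over bot=yes are assumptions, and the description URL is passed in by the caller rather than hard-coded.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def changeset_tags(version: str, geohash: str, description_url: str) -> Dict[str, str]:
    """Assemble changeset tags; the comment wording here is hypothetical."""
    return {
        "created_by": f"https_all_the_things/{version}",
        "comment": (
            "Update website tags from http to https where the site "
            f"already redirects (geohash {geohash})"
        ),
        "mechanical": "yes",
        "description": description_url,
    }


def changeset_xml(tags: Dict[str, str]) -> str:
    """Render <osm><changeset><tag k=... v=.../></changeset></osm>."""
    osm = ET.Element("osm")
    changeset = ET.SubElement(osm, "changeset")
    for key, value in tags.items():
        ET.SubElement(changeset, "tag", k=key, v=value)
    return ET.tostring(osm, encoding="unicode")
```

Including the geohash in the comment is one way to keep each small, regional changeset easy to identify when reviewing or reverting.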

Source Code

If you'd like to review the source code being used, you can find it at https://gitlab.com/b-jazz/https_all_the_things. The script handles the basic work of making changes within a single small area (geohash). The batching of geohashes for a given area is just some cobbled-together shell commands for looping, logging, and trimming the list of geohashes. Implementation of that is left up to the end user, who I imagine will only be me. But if you end up running this yourself, I'd love to hear about it, of course.
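
For anyone wanting to reproduce the geohash batching, the subdivision rule mentioned earlier (split any geohash with more than 100 websites into its 32 children) could look something like the sketch below. The original batching is done with shell commands, so this Python version is only an illustration; the names children and refine and the shape of the counts mapping are hypothetical.

```python
from typing import Dict, List

# Standard geohash base-32 alphabet (no "a", "i", "l", or "o").
GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz"
MAX_WEBSITES_PER_GEOHASH = 100


def children(geohash: str) -> List[str]:
    """Return the 32 geohashes that are one character more precise."""
    return [geohash + char for char in GEOHASH_ALPHABET]


def refine(worklist: List[str], counts: Dict[str, int]) -> List[str]:
    """Replace every over-full geohash in the work list with its children.

    `counts` maps a geohash to the number of website tags that Overpass
    returned for it on the last run; unknown geohashes count as zero.
    """
    refined: List[str] = []
    for geohash in worklist:
        if counts.get(geohash, 0) > MAX_WEBSITES_PER_GEOHASH:
            refined.extend(children(geohash))
        else:
            refined.append(geohash)
    return refined
```

Running refine after every pass keeps each work unit small, so a single run rarely touches more than a compact, easily reviewed area.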