Automated edits/404remover

From OpenStreetMap Wiki

Documentation of 404remover bot account.

Bot not activated, based on community feedback. Feel free to use its code to make something better!


Who is behind this bot?

zabop (on OSM: edits, contributions, heatmap, changeset comments)

Email: 404remover@protonmail.com

Issues: https://github.com/zabop/osm-url-screener/issues

Motivation & importance

Some URLs associated with businesses become obsolete. The business could change its web address, it could go bankrupt, etc. The changes happen continually.

For many users, it is important to have up to date URLs (see Organic Maps for example).

Many editors would also benefit from having fewer obsolete URLs in the DB. Take EveryDoor as an example. If there is an obsolete website associated with a place, it won't ask for a website address. If the obsolete address is removed, then EveryDoor will ask for an up to date address, making it easier for editors to contribute. It is much easier to add a new website than to figure out that the existing website is wrong, then add a new website.

What is edited and how?

The bot searches for features whose contact:website tag has a value starting with https://www. and sends a GET request to each such address. If:

  1. the response has status code 404, AND
  2. the response text contains the string 404, AND
  3. the response text contains the string page not found or the string url was not found (case-insensitive search),

then the bot concludes that the website is obsolete and removes it. (Manually checked examples of such changesets: here, here.)

I found that checking the status code alone is not enough: some websites respond with a 404 status code but are perfectly functional. Steps 2 and 3 counter this. Later, it might be a good idea to expand the list of typical error messages beyond page not found and url was not found.
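The three checks above boil down to a small predicate over the response. As a rough sketch (not the bot's actual code; the function name and signature are my own), it could look like this:

```python
def is_obsolete(status_code: int, text: str) -> bool:
    """Decide whether a GET response looks like a genuine 'page gone' error.

    All three conditions must hold: a 404 status code, the string "404"
    in the body, and one of the typical error phrases. The phrase list
    is the starting set described above and could be extended later.
    """
    lowered = text.lower()
    return (
        status_code == 404
        and "404" in lowered
        and ("page not found" in lowered or "url was not found" in lowered)
    )
```

Keeping the predicate pure (status code and body text in, boolean out) makes it easy to test against saved responses without hitting any live website.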

Consultation

Community forum

How frequently are changes made?

The bot is launched twice a day (at 10 minutes past 1am and 1pm). It is given the relation IDs of the UK's boundary=administrative relations with admin_level=6 (i.e. it is given localities). The bot randomly selects one and checks whether any URLs pass the obsolescence test above. If it finds at least one such URL, it removes it and exits. If it does not, it moves on to the next locality. If it finds nothing in 10 localities, it exits.
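The per-launch loop can be sketched as follows. This is an illustration only: find_obsolete_urls and remove_urls are hypothetical callbacks standing in for the actual Overpass query and OSM API edit.

```python
import random

def run_once(locality_ids, find_obsolete_urls, remove_urls, max_tries=10):
    """One bot launch: scan up to max_tries randomly chosen localities,
    edit the first one that contains obsolete URLs, then stop.

    Returns the relation ID that was edited, or None if nothing was
    found within max_tries localities.
    """
    candidates = random.sample(locality_ids, min(max_tries, len(locality_ids)))
    for relation_id in candidates:
        obsolete = find_obsolete_urls(relation_id)
        if obsolete:
            remove_urls(relation_id, obsolete)
            return relation_id  # at most one locality edited per launch
    return None
```

Editing at most one locality per launch is what keeps the pace slow and reviewable.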

This is a very slow pace of change. I have rarely seen localities with more than 2 obsolete URLs, so running twice a day changes only a couple of features on average, which gives room for intervention if someone is not happy with the changes made. This is my first automated edit, so I'm extra vigilant.

How to stop the bot?

You can email me (404remover@protonmail.com), message me on OSM, or submit an issue on GitHub. Submitting an issue on GitHub will automatically halt all further launches of the bot until the issue is closed, even if I happen to have been hit by a bus. (This is made possible by this GitHub Workflows step.)
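The actual kill switch is the workflow step linked above; as a rough illustration of the idea (not the bot's real code), a launch gate could query the public GitHub API for open issues and refuse to run if any exist. Note that GitHub's issues endpoint also returns pull requests, which need to be filtered out:

```python
import json
import urllib.request

def count_open_issues(items):
    """Count real issues in the parsed JSON from GitHub's
    GET /repos/{owner}/{repo}/issues?state=open endpoint.

    Pull requests also appear in that endpoint (they carry a
    "pull_request" key), so they are excluded from the count.
    """
    return sum(1 for item in items if "pull_request" not in item)

def should_run(repo="zabop/osm-url-screener"):
    """Return True only when the repo has no open issues (the kill switch)."""
    url = f"https://api.github.com/repos/{repo}/issues?state=open"
    with urllib.request.urlopen(url) as resp:
        return count_open_issues(json.load(resp)) == 0
```

Because the gate runs before every launch, anyone with a GitHub account can stop the bot without my involvement.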