Mechanical Edits/Mateusz Konieczny - bot account/remove tracking parameters

Page content created as advised on Automated_Edits_code_of_conduct#Document_and_discuss_your_plans.

Who

I, Mateusz Konieczny, using my bot account.

contact

Message me via OSM. I will also respond to PMs sent to the bot account. In both cases I will be notified about incoming PMs via email and via notifications in OSM editors.

What

URLs often have unnecessary parts, typically added for tracking purposes. These tracking parameters should never appear in any OSM tags.

Facebook (FB), Google and others add tracking parameters to links for various purposes.

It means that it is beneficial to turn the tag

website=http://paris.intersquat.org/les-lieux/le-satellite/?fbclid=de58e340d6aa79a584552a2055042d004b9b19454bc0d7a6046fc81fc90f51

into

website=http://paris.intersquat.org/les-lieux/le-satellite/

Usually tracking links are added by people who just searched for a website and copied the link from FB/Google, unaware of the tracking parameters.

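There are rare cases of links created specifically to track OSM users, see for example:

  • https://www.openstreetmap.org/way/754704241/history
    • https://www.cronauerlaw.com/?utm_source=openstreetmap
  • https://www.openstreetmap.org/node/1063808111/history
    • http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link
  • https://www.openstreetmap.org/node/6817678019/history
    • https://www.resotainer.fr/agence-bonneuil-sur-marne?utm_source=open-street-map&utm_medium=recherche-locale&utm_content=openstreetmap&utm_campaign=open-street-map-garde-meubles-bonneuil-sur-marne
  • https://www.openstreetmap.org/node/1684317522
    • http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link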

In general I have not noticed a correlation between the presence of tracking links and additional issues that would not be detected automatically.

Therefore automatic removal of tracking parameters does not cause a loss of useful indicators of areas that should be reviewed.

Osmose, the JOSM validator and StreetComplete offer better indicators.

(If anyone is interested in a list of more systematic issues that are automatically detectable but require a human to fix, please contact me. I have found more broken imports, data with suspicious copyright status and bad tagging than I can process.)

Automatic removal would allow me to spend time on something more useful than reviewing all cases where these links are present and confirming them one by one.

The proposed bot edit would clean links where all used parameters are tracking parameters and may be removed. Other links will be reviewed manually, which also helps to catch currently unknown tracking parameters.

Anchors (#section) will be preserved.

Parameters for removal across OSM: fbclid, gclid, campaign_ref, mc_id, utm_source, utm_medium, utm_term, utm_content, utm_campaign
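The intended transformation can be illustrated with a minimal sketch built on Python's standard urllib.parse module. This is not the bot's actual code (the bot source further down uses regular expressions); the names TRACKING_PARAMETERS and strip_tracking_parameters are made up for this illustration. The sketch filters the raw query string so that remaining parameters keep their original encoding, and it leaves the anchor untouched.

```python
from urllib.parse import urlsplit, urlunsplit

# Parameter names taken from the list above.
TRACKING_PARAMETERS = {
    "fbclid", "gclid", "campaign_ref", "mc_id",
    "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign",
}


def strip_tracking_parameters(url):
    """Return url without tracking parameters, keeping the anchor (#section)."""
    parts = urlsplit(url)
    kept = [
        piece
        for piece in parts.query.split("&")
        # piece looks like "name=value"; compare only the name
        if piece and piece.split("=", 1)[0] not in TRACKING_PARAMETERS
    ]
    return urlunsplit(parts._replace(query="&".join(kept)))


print(strip_tracking_parameters("https://www.example.com/?utm_medium=referral#anchor"))
print(strip_tracking_parameters("https://www.example.com/?page=2&fbclid=abc"))
```

Unlike the proposed bot edit, this sketch also cleans links that carry other parameters; in the proposal such links are left for manual review.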

The code is tested; I am currently using it in a manual review mode. The sole difference in the bot run will be that manual confirmation is disabled.

I have experience with automated edits, see https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account

Yes, editing an element will change its "last edited" date. The effect will be exactly the same for a bot edit and for a manual edit (which I will do anyway if this automated edit proposal is rejected). Note that bot edits are marked as automatic, so you may filter them out.

Why

Tracking parameters are not welcome and are explicitly discouraged in links added as values to the OSM database. For a start, such parameters add nothing useful and make links more complex. Additionally, such tracking is unwanted, undesirable and unacceptable.

Numbers

About 1000 objects. See the planned edit changes at https://gist.github.com/matkoniecz/6710d066fea6596533f5013040eb5dc1 (impossible to publish on the OSM Wiki because it triggers the spam filter).

How

Changesets will be split into parts to avoid covering huge areas or a massive number of objects. If an object itself is extremely large, larger than the desired bounding box, some oversized changeset areas are unavoidable (for example, when editing a country boundary).

The bot will sleep between changesets to reduce the risk of unexpected behavior, to give more time to react if things are not going well, and to avoid affecting OSM performance by making many edits at the same time.
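The batching and pausing described above can be sketched roughly as follows. This is a hypothetical illustration only: the real splitting is handled inside osm_bot_abstraction_layer, and this sketch splits by element count alone, not by area. The names split_into_batches, process_in_batches, upload_batch and pause_seconds are assumptions, the limit of 500 mirrors max_count_of_elements_in_one_changeset in the bot source, and the 60 second pause is an arbitrary placeholder.

```python
import time


def split_into_batches(elements, max_count):
    """Yield consecutive lists holding at most max_count elements each."""
    for start in range(0, len(elements), max_count):
        yield elements[start:start + max_count]


def process_in_batches(elements, upload_batch, max_count=500,
                       pause_seconds=60, sleep=time.sleep):
    """Call upload_batch once per batch, pausing between batches.

    upload_batch is a hypothetical callable standing in for "open a
    changeset, upload the edits, close the changeset".
    Returns the number of batches processed.
    """
    batches = list(split_into_batches(elements, max_count))
    for index, batch in enumerate(batches):
        upload_batch(batch)
        if index + 1 < len(batches):
            # no pause is needed after the final batch
            sleep(pause_seconds)
    return len(batches)


# Tiny demonstration with a fake uploader and no real waiting.
process_in_batches(list(range(5)), print, max_count=2, sleep=lambda seconds: None)
```

Passing the sleep function as a parameter keeps the sketch easy to try out without actually waiting.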

  • The bot will edit links to remove undesirable parts.
  • The following are considered tracking parameters: fbclid, gclid, campaign_ref, mc_id, utm_source, utm_medium, utm_term, utm_content, utm_campaign
  • A link in any tag value will be checked:
    • The edit will not be done if the URL has no parameters
    • The edit will not be done if the URL has any parameters other than tracking parameters
    • The edit will be done if the URL has parameters and all of them are tracking ones
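The three cases in the list above can be expressed as a small decision function. This is a sketch under assumptions, not the bot's code: the names parameter_names and decide, and the returned labels, are invented for illustration, and parsing is done with the standard urllib.parse module instead of the regular expressions used in the bot source.

```python
from urllib.parse import urlsplit

# Parameter names taken from the list above.
TRACKING_PARAMETERS = {
    "fbclid", "gclid", "campaign_ref", "mc_id",
    "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign",
}


def parameter_names(url):
    """Return the names of all query parameters in url, in order."""
    query = urlsplit(url).query
    return [piece.split("=", 1)[0] for piece in query.split("&") if piece]


def decide(url):
    """Classify a link according to the three rules listed above."""
    names = parameter_names(url)
    if not names:
        return "skip: no parameters"
    if all(name in TRACKING_PARAMETERS for name in names):
        return "edit automatically"
    return "leave for manual review"


for link in [
    "https://www.example.com/",
    "https://www.example.com/?utm_source=evil&utm_medium=referral",
    "https://www.example.com/?page=2&utm_source=evil",
]:
    print(link, "->", decide(link))
```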

state before a mechanical edit - example based on https://www.openstreetmap.org/node/4636662880 :

state after a mechanical edit:


Discussion

https://lists.openstreetmap.org/pipermail/talk/2020-May/084677.html

Bot source code

The bot uses the https://github.com/matkoniecz/osm_bot_abstraction_layer library. This code is licensed under GNU GPLv3.

from osm_bot_abstraction_layer.generic_bot_retagging import run_simple_retagging_task
import re
import time
import datetime

def main():
    run_in_bot_mode_may_2020()

def run_in_manual_mode():
    test_expectations()
    print(datetime.datetime.now())
    print(query_of_affected_items())
    run_simple_retagging_task(
        max_count_of_elements_in_one_changeset=500,
        objects_to_consider_query=query_of_affected_items(),
        objects_to_consider_query_storage_file='/media/mateusz/5bfa9dfc-ed86-4d19-ac36-78df1060707c/OSM-cache/overpass/osm_elements_with_trackers.osm',
        is_in_manual_mode=True,
        changeset_comment='remove tracking parameters',
        discussion_url='not necessary, as edit was manually reviewed and tracker parameters are clearly unwanted',
        osm_wiki_documentation_page='not necessary, as edit was manually reviewed',
        edit_element_function=edit_element,
    )
    print(datetime.datetime.now())

def run_in_bot_mode_may_2020():
    test_expectations()
    print(datetime.datetime.now())
    print(query_of_affected_items())
    run_simple_retagging_task(
        max_count_of_elements_in_one_changeset=500,
        objects_to_consider_query=query_of_affected_items(),
        objects_to_consider_query_storage_file='/media/mateusz/5bfa9dfc-ed86-4d19-ac36-78df1060707c/OSM-cache/overpass/osm_elements_with_trackers.osm',
        is_in_manual_mode=False,
        changeset_comment='remove tracking parameters',
        discussion_url='https://lists.openstreetmap.org/pipermail/talk/2020-May/084677.html',
        osm_wiki_documentation_page='https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account/remove_tracking_parameters',
        edit_element_function=edit_element,
    )
    print(datetime.datetime.now())

"""
URLs often have unnecessary parts, typically added for tracking purposes.
These tracking parameters should never appear in any OSM tags.

FB, Google and others add tracking links for various purposes.

It means that it is beneficial to turn tag
website=http://paris.intersquat.org/les-lieux/le-satellite/?fbclid=de58e340d6aa79a584552a2055042d004b9b19454bc0d7a6046fc81fc90f51
into
website=http://paris.intersquat.org/les-lieux/le-satellite/

These URLs can often be fixed using an automated script, allowing
human time to be spent on something more productive.

Human-made edit will also result in changing "last edited by"
(while not allowing to filter out such edits unlike marked bot edit),
there are better ways to spot areas requiring fixes and we are not lacking
places with QA indicators that manual review is needed.

Usually tracking links are added by clueless people who just searched for 
a website and copied it from FB/Google.

There are rare cases of links created to specifically track OSM users
see for example
* https://www.openstreetmap.org/way/754704241/history
** https://www.cronauerlaw.com/?utm_source=openstreetmap
* https://www.openstreetmap.org/node/1063808111/history
** http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link
* https://www.openstreetmap.org/node/6817678019/history
** https://www.resotainer.fr/agence-bonneuil-sur-marne?utm_source=open-street-map&utm_medium=recherche-locale&utm_content=openstreetmap&utm_campaign=open-street-map-garde-meubles-bonneuil-sur-marne
* https://www.openstreetmap.org/node/1684317522
** http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link

In general I have not noticed correlation between presence of tracking links
and additional issues that would not be detected automatically.

Therefore automatic removal of tracking parameters is not causing loss of 
useful indicators of areas that should be reviewed.
Osmose and JOSM validators and StreetComplete are offering better indicators.

Automatic removal would allow me to spend time on something more useful
than reviewing all cases where these links are present and confirming them one by one.

Proposed bot edit would remove links where all used parameters are tracking
users and may be removed. Other links will be reviewed manually to catch
also currently unknown tracking parameters.

Anchors (#section) will be preserved.

Parameters for removal across OSM: fbclid, gclid, campaign_ref, mc_id,
utm_source, utm_medium, utm_term, utm_content, utm_campaign

The code is tested; I am currently using it in a manual review mode.
The sole difference in the bot run will be that manual confirmation is disabled.

I have experience with automated edits, see
https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account

Yes, editing element will cause it to be edited and change "last edited" date.
Effect will be exactly the same in case of using bot and manual edit
(which I will do anyway in case of rejecting this automated edit proposal).
Note that in case of bot edits you may filter out bot edits marked as automatic.
"""

def malicious_parameters_for_eradication():
    # igshid is not on the list - it looks like an Instagram tracking parameter
    # (not just me - see https://www.bradymoritz.com/igshid-the-new-instagram-click-tracking-id/ )
    return ["fbclid", "gclid", "campaign_ref", "mc_id", "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign"]

def evil_parameters_group():
    return "(" + "|".join(malicious_parameters_for_eradication()) + ")"
 
def remove_malicious_parameters(link):
    # repeat until a pass removes nothing more
    old_link = None
    while old_link != link:
        old_link = link
        # re.search, not re.match: this pattern is never at the start of the link
        if re.search("&" + evil_parameters_group() + "=[^&#]*", link):
            # inner or trailing parameter (preceded by another parameter)
            link = re.sub("&" + evil_parameters_group() + "=[^&#]*", "", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*$", link):
            # sole parameter
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*$", "", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*#", link):
            # sole parameter with anchor at the end
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*#", "#", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*&", link):
            # leading parameter
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*&", "?", link)
    return link

def edit_element(tag_dictionary):
    old_tags = dict(tag_dictionary)
    for key in tag_dictionary.keys():
        if tag_dictionary[key].startswith("http"):
            cleaned_link = remove_malicious_parameters(tag_dictionary[key])
            if tag_dictionary[key] != cleaned_link:
                if "?" in cleaned_link:
                    # remaining parameters may also be tracking ones or candidates for removal
                    # leave the whole element unchanged and review it manually
                    return old_tags
            tag_dictionary[key] = cleaned_link
    return tag_dictionary


def query_for_limited_keys():
    return """
[out:xml][timeout:25000];
(
  nwr["website"~""" + '"' + evil_parameters_group() + '"' + """];
  nwr["url"~""" + '"' + evil_parameters_group() + '"' + """];
  nwr["source"~""" + '"' + evil_parameters_group() + '"' + """];
);
out body;
>;
out skel qt;
"""

def query_for_all_keys_but_slow():
    return """[out:xml][timeout:25000];
    (
        nwr[~".*"~""" + '"' + evil_parameters_group() + '"' + """];
    );
    out body;
    >;
    out skel qt;
    """

def query_of_affected_items():
    return query_for_all_keys_but_slow()
    
def test_expectations():
    expected = [        
        {
        "input": "https://www.example.com/?utm_medium=referrall#anchor",
        "output": "https://www.example.com/#anchor"
        },
        {
        "input": "https://www.example.com?utm_medium=referrall#anchor",
        "output": "https://www.example.com#anchor"
        },
        {
        "input": "https://www.example.com/?utm_medium=referrall",
        "output": "https://www.example.com/"
        },
        {
        "input": "https://www.example.com/?utm_source=evil&utm_medium=referral",
        "output": "https://www.example.com/"
        },
        {
        "input": "https://clubhaus-olympic.business.site/?utm_source=gmb&utm_medium=referral",
        "output": "https://clubhaus-olympic.business.site/"
        },
        {
        "input": "https://www.enrichinghappiness.com/branch/bickford-of-clinton?utm_source=local&utm_medium=yext&utm_campaign=website",
        "output": "https://www.enrichinghappiness.com/branch/bickford-of-clinton"
        },
        {
        "input": "https://www.wanderservice-schwarzwald.de/de/tour/wanderungen/rundwanderung-grillhuette/112160527/?utm_medium=referral&utm_source=embed&utm_campaign=embed-plugin-referral",
        "output": "https://www.wanderservice-schwarzwald.de/de/tour/wanderungen/rundwanderung-grillhuette/112160527/"
        },
        {
        "input": "https://www.greeneking-pubs.co.uk/pubs/greater-london/shepherds-tavern/?utm_source=g_places&utm_medium=locations&utm_campaign=",
        "output": "https://www.greeneking-pubs.co.uk/pubs/greater-london/shepherds-tavern/"
        },
    ]
    for test in expected:
        cleaned = remove_malicious_parameters(test["input"])
        # compare against the expected output directly
        if cleaned != test["output"]:
            print(cleaned, "vs", test["output"], "for input", test["input"])
            # Python 3 cannot raise a plain string, an exception object is required
            raise Exception("failing to make a proper edit")

main()

Repetition

This edit will be done once. The next run will require separate permission.

Opt-out

Please write in the mailing list thread linked in the Discussion section. Note that in case of an opt-out the same edit will be done manually; tracking parameters cannot stay in OSM.