Mechanical Edits/Mateusz Konieczny - bot account/remove tracking parameters
Page content created as advised on Automated_Edits_code_of_conduct#Document_and_discuss_your_plans.
If you are looking for source code to use, see Mechanical Edits/Mateusz Konieczny - bot account/remove tracking parameters2/ for an updated version.
Who
I, Mateusz Konieczny, using my bot account
contact
Message me via OSM. I will also respond to PMs sent to the bot account. In both cases I will be notified about incoming PMs via email and notifications in OSM editors.
What
URLs often have unnecessary parts, typically added for tracking purposes. These tracking parameters should never appear in any OSM tags.
FB, Google and others add tracking parameters to links for various purposes.
It means that it is beneficial to turn the tag
website=http://paris.intersquat.org/les-lieux/le-satellite/?fbclid=de58e340d6aa79a584552a2055042d004b9b19454bc0d7a6046fc81fc90f51
into
website=http://paris.intersquat.org/les-lieux/le-satellite/
Usually tracking links are added by clueless people who just searched for a website and copied it from FB/Google.
There are rare cases of links created specifically to track OSM users, for example:
- https://www.openstreetmap.org/way/754704241/history ( https://www.cronauerlaw.com/?utm_source=openstreetmap )
- https://www.openstreetmap.org/node/1063808111/history ( http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link )
- https://www.openstreetmap.org/node/6817678019/history ( https://www.resotainer.fr/agence-bonneuil-sur-marne?utm_source=open-street-map&utm_medium=recherche-locale&utm_content=openstreetmap&utm_campaign=open-street-map-garde-meubles-bonneuil-sur-marne )
- https://www.openstreetmap.org/node/1684317522 ( http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link )
In general, I have not noticed a correlation between the presence of tracking links and additional issues that would not be detected automatically.
Therefore, automatic removal of tracking parameters does not cause a loss of useful indicators of areas that should be reviewed.
Osmose, the JOSM validator and StreetComplete offer better indicators.
(If anyone is interested in a list of more systematic issues that are automatically detectable but require a human to fix, please contact me. I have found more broken imports, data with suspicious copyright status and bad tagging than I can process.)
Automatic removal would allow me to spend time on something more useful than reviewing all cases where these links are present and confirming them one by one.
The proposed bot edit would clean links where all parameters in use are tracking parameters and may be removed. Other links will be reviewed manually, which also helps to catch currently unknown tracking parameters.
Anchors (#section) will be preserved.
Parameters for removal across OSM: fbclid, gclid, campaign_ref, mc_id, utm_source, utm_medium, utm_term, utm_content, utm_campaign
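The removal rule above can be illustrated with a minimal sketch built on Python's standard urllib.parse module. This is not the bot's code (the bot uses regular expressions, see the Bot source code section); the names TRACKING_PARAMETERS and strip_tracking_parameters are hypothetical and chosen only for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names taken from the list above.
TRACKING_PARAMETERS = {
    "fbclid", "gclid", "campaign_ref", "mc_id",
    "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign",
}


def strip_tracking_parameters(url):
    """Return url without tracking parameters; the path and #anchor are kept."""
    parts = urlsplit(url)
    # keep_blank_values=True so that parameters such as "utm_campaign=" are seen
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if name not in TRACKING_PARAMETERS]
    # Caveat: urlencode may re-encode the values of the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_parameters("https://www.example.com/?utm_medium=referral#anchor"))
```

Because remaining parameters may be re-encoded by urlencode, a sketch like this is only safe for links where nothing is left in the query after cleaning, which matches the scope of the proposed edit.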
The code is tested; I am currently using it in manual review mode. The sole difference in the bot run will be disabling manual confirmation.
I have experience with automated edits, see https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account
Yes, editing an element will change its "last edited" date. The effect will be exactly the same for a bot edit and a manual edit (which I will do anyway if this automated edit proposal is rejected). Note that bot edits marked as automatic can be filtered out.
Why
Tracking parameters are not welcome and are explicitly discouraged in links added as values to the OSM database. For a start, such parameters add nothing useful and make links more complex. Additionally, such tracking is unwanted, undesirable and unacceptable.
Numbers
About 1000 objects. See the planned changes at https://gist.github.com/matkoniecz/6710d066fea6596533f5013040eb5dc1 (impossible to publish on OSM Wiki because it triggers the spam filter)
How
Changesets will be split into parts to avoid covering huge areas or a massive number of objects. If an object itself is extremely large, larger than the desired bounding box, some oversized changeset areas are unavoidable (for example, when editing a country boundary).
The bot will sleep between changesets to reduce the risk of unexpected behavior, to give more time to react if things are not going well, and to eliminate the risk of affecting OSM performance by making many edits at the same time.
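The splitting and pausing described above could look roughly like the following sketch. It only covers splitting by object count, not by bounding box, and it is not how osm_bot_abstraction_layer implements it. The names split_into_batches, process_in_batches, upload_batch and pause_seconds are hypothetical; the default of 500 mirrors max_count_of_elements_in_one_changeset in the bot source, while the pause length is an arbitrary assumption.

```python
import time


def split_into_batches(elements, max_batch_size):
    """Yield consecutive slices of elements, each at most max_batch_size long."""
    for start in range(0, len(elements), max_batch_size):
        yield elements[start:start + max_batch_size]


def process_in_batches(elements, upload_batch, max_batch_size=500, pause_seconds=60):
    """Call upload_batch once per batch and sleep between batches.

    upload_batch is a hypothetical callback standing in for "open a changeset,
    upload these elements, close the changeset".
    Returns the number of batches processed.
    """
    batches = list(split_into_batches(elements, max_batch_size))
    for index, batch in enumerate(batches):
        upload_batch(batch)
        if index < len(batches) - 1:
            # pause gives time to notice problems and stop before the next changeset
            time.sleep(pause_seconds)
    return len(batches)


uploaded = []
count = process_in_batches(list(range(5)), uploaded.append, max_batch_size=2, pause_seconds=0)
print(count, uploaded)
```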
- The bot will edit links to remove undesirable parts.
- The following are considered tracking parameters: fbclid, gclid, campaign_ref, mc_id, utm_source, utm_medium, utm_term, utm_content, utm_campaign
- Links in any tag value will be checked.
- No edit will be made if the URL has no parameters.
- No edit will be made if the URL has any parameters other than tracking parameters.
- An edit will be made if the URL has parameters and all of them are tracking ones.
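The three rules above amount to a simple classification of each link. The following is a sketch of that decision only, assuming standard query-string syntax; classify_url and its return values are hypothetical names used for illustration and do not appear in the bot's code.

```python
from urllib.parse import urlsplit, parse_qsl

TRACKING_PARAMETERS = {
    "fbclid", "gclid", "campaign_ref", "mc_id",
    "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign",
}


def classify_url(url):
    """Return "no_parameters", "edit" or "manual_review" for a link."""
    query = urlsplit(url).query
    names = [name for name, _ in parse_qsl(query, keep_blank_values=True)]
    if not names:
        return "no_parameters"   # nothing to remove
    if all(name in TRACKING_PARAMETERS for name in names):
        return "edit"            # every parameter is a known tracking parameter
    return "manual_review"       # at least one other parameter is present


print(classify_url("https://example.com/?id=5&utm_source=x"))
```

Links classified as "manual_review" are left untouched by the automated run, so unknown parameters are never removed without a human looking at them.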
state before a mechanical edit - example based on https://www.openstreetmap.org/node/4636662880 :
- website=https://www.yd.com/locations/on/toronto-pape-danforth?utm_source=google&utm_medium=local&utm_content=Toronto-PapeDanforth&utm_campaign=Google-BusinessLocal
- addr:city=Toronto
- addr:housenumber=915
- addr:postcode=M4J 1L8
- addr:street=Danforth Avenue
- amenity=driving_school
- name=Young Drivers of Canada - Toronto School Danforth
state after a mechanical edit:
- website=https://www.yd.com/locations/on/toronto-pape-danforth
- addr:city=Toronto
- addr:housenumber=915
- addr:postcode=M4J 1L8
- addr:street=Danforth Avenue
- amenity=driving_school
- name=Young Drivers of Canada - Toronto School Danforth
Discussion
https://lists.openstreetmap.org/pipermail/talk/2020-May/084677.html
Bot source code
The bot uses the https://github.com/matkoniecz/osm_bot_abstraction_layer library. This code is GNU GPLv3 licensed.
from osm_bot_abstraction_layer.generic_bot_retagging import run_simple_retagging_task
import re
import time
import datetime


def main():
    run_in_bot_mode_may_2020()


def run_in_manual_mode():
    test_expectations()
    print(datetime.datetime.now())
    print(query_of_affected_items())
    run_simple_retagging_task(
        max_count_of_elements_in_one_changeset=500,
        objects_to_consider_query=query_of_affected_items(),
        objects_to_consider_query_storage_file='/media/mateusz/5bfa9dfc-ed86-4d19-ac36-78df1060707c/OSM-cache/overpass/osm_elements_with_trackers.osm',
        is_in_manual_mode=True,
        changeset_comment='remove tracking parameters',
        discussion_url='not necessary, as edit was manually reviewed and tracker parameters are clearly unwanted',
        osm_wiki_documentation_page='not necessary, as edit was manually reviewed',
        edit_element_function=edit_element,
    )
    print(datetime.datetime.now())


def run_in_bot_mode_may_2020():
    test_expectations()
    print(datetime.datetime.now())
    print(query_of_affected_items())
    run_simple_retagging_task(
        max_count_of_elements_in_one_changeset=500,
        objects_to_consider_query=query_of_affected_items(),
        objects_to_consider_query_storage_file='/media/mateusz/5bfa9dfc-ed86-4d19-ac36-78df1060707c/OSM-cache/overpass/osm_elements_with_trackers.osm',
        is_in_manual_mode=False,
        changeset_comment='remove tracking parameters',
        discussion_url='https://lists.openstreetmap.org/pipermail/talk/2020-May/084677.html',
        osm_wiki_documentation_page='https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account/remove_tracking_parameters',
        edit_element_function=edit_element,
    )
    print(datetime.datetime.now())


"""
URL often have unnecessary parts, typically added for tracking purposes.
This tracking parameters should never appear in any osm tags.

FB, Google and other add tracking links for various purposes.

It means that it is beneficial to turn tag
website=http://paris.intersquat.org/les-lieux/le-satellite/?fbclid=de58e340d6aa79a584552a2055042d004b9b19454bc0d7a6046fc81fc90f51
into
website=http://paris.intersquat.org/les-lieux/le-satellite/

This urls can be often fixed using an automated script, allowing to use human time on something more productive.

Human-made edit will also result in changing "last edited by" (while not allowing to filter out such edits unlike marked bot edit), there are better ways to spot areas requiring fixes and we are not lacking places with QA indicators that manual review is needed.

Usually tracking links are added by clueless people who just searched for a website and copied it from FB/Google.

There are rare cases of links created to specifically track OSM users see for example
* https://www.openstreetmap.org/way/754704241/history
** https://www.cronauerlaw.com/?utm_source=openstreetmap
* https://www.openstreetmap.org/node/1063808111/history
** http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link
* https://www.openstreetmap.org/node/6817678019/history
** https://www.resotainer.fr/agence-bonneuil-sur-marne?utm_source=open-street-map&utm_medium=recherche-locale&utm_content=openstreetmap&utm_campaign=open-street-map-garde-meubles-bonneuil-sur-marne
* https://www.openstreetmap.org/node/1684317522
** http://www.travelerscoffee.ru?utm_campaign=geo&utm_source=openstreetmap&utm_medium=link

In general I have not noticed correlation between presence of tracking links and additional issues that would not be detected automatically.

Therefore automatic removal of tracking parameters is not causing loss of useful indicators of areas that should be reviewed.

Osmose and JOSM validators and StreetComplete are offering better indicators.

Automatic removal would allow me to spend time on something more useful, than reviewing all cases where this links are present and confirming them one by one.

Proposed bot edit would remove links where all used parameters are tracking users and may be removed. Other links will be reviewed manually to catch also currently unknown tracking parameters.

Anchors (#section) will be preserved.

Parameters for removal across OSM: fbclid, gclid, campaign_ref, mc_id, utm_source, utm_medium, utm_term, utm_content, utm_campaign

Code is tested, I am currently using it in a manual review mode. Sole difference in bot run will be disabling of manual confirmation.

I have experience with automated edits, see https://wiki.openstreetmap.org/wiki/Mechanical_Edits/Mateusz_Konieczny_-_bot_account

Yes, editing element will cause it to be edited and change "last edited" date. Effect will be exactly the same in case of using bot and manual edit (which I will do anyway in case of rejecting this automated edit proposal). Note that in case of bot edits you may filter out bot edits marked as automatic.
"""


def malicious_parameters_for_eradication():
    return ["fbclid", "gclid", "campaign_ref", "mc_id", "utm_source", "utm_medium", "utm_term", "utm_content", "utm_campaign"]
    # igshid - looks like instagram tracking link
    # (not just me - see https://www.bradymoritz.com/igshid-the-new-instagram-click-tracking-id/ )


def evil_parameters_group():
    # regex alternation matching any of the tracking parameter names
    return "(" + "|".join(malicious_parameters_for_eradication()) + ")"


def remove_malicious_parameters(link):
    old_link = None
    # repeat until a pass removes nothing more
    while old_link != link:
        old_link = link
        # re.search, not re.match: re.match only matches at the start of the string,
        # so a check starting with "&" could never succeed
        if re.search("&" + evil_parameters_group() + "=[^&#]*", link):
            # inner parameter
            link = re.sub("&" + evil_parameters_group() + "=[^&#]*", "", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*$", link):
            # sole parameter
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*$", "", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*#", link):
            # sole parameter with anchor at the end
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*#", "#", link)
        if re.match(r"http.*\?" + evil_parameters_group() + "=[^&#]*&", link):
            # leading parameter
            link = re.sub(r"\?" + evil_parameters_group() + "=[^&#]*&", "?", link)
    return link


def edit_element(tag_dictionary):
    old_tags = dict(tag_dictionary)
    for key in tag_dictionary.keys():
        if tag_dictionary[key].find("http") == 0:
            cleaned_link = remove_malicious_parameters(tag_dictionary[key])
            if tag_dictionary[key] != cleaned_link:
                if cleaned_link.find("?") != -1:
                    return old_tags  # other tags also may be tracking or for removal, review manually
                tag_dictionary[key] = cleaned_link
    return tag_dictionary


def query_for_limited_keys():
    return """
[out:xml][timeout:25000];
(
  nwr["website"~""" + '"' + evil_parameters_group() + '"' + """];
  nwr["url"~""" + '"' + evil_parameters_group() + '"' + """];
  nwr["source"~""" + '"' + evil_parameters_group() + '"' + """];
);
out body;
>;
out skel qt;
"""


def query_for_all_keys_but_slow():
    return """[out:xml][timeout:25000];
(
  nwr[~".*"~""" + '"' + evil_parameters_group() + '"' + """];
);
out body;
>;
out skel qt;
"""


def query_of_affected_items():
    return query_for_all_keys_but_slow()


def test_expectations():
    expected = [
        {
            "input": "https://www.example.com/?utm_medium=referrall#anchor",
            "output": "https://www.example.com/#anchor"
        },
        {
            "input": "https://www.example.com?utm_medium=referrall#anchor",
            "output": "https://www.example.com#anchor"
        },
        {
            "input": "https://www.example.com/?utm_medium=referrall",
            "output": "https://www.example.com/"
        },
        {
            "input": "https://www.example.com/?utm_source=evil&utm_medium=referral",
            "output": "https://www.example.com/"
        },
        {
            "input": "https://clubhaus-olympic.business.site/?utm_source=gmb&utm_medium=referral",
            "output": "https://clubhaus-olympic.business.site/"
        },
        {
            "input": "https://www.enrichinghappiness.com/branch/bickford-of-clinton?utm_source=local&utm_medium=yext&utm_campaign=website",
            "output": "https://www.enrichinghappiness.com/branch/bickford-of-clinton"
        },
        {
            "input": "https://www.wanderservice-schwarzwald.de/de/tour/wanderungen/rundwanderung-grillhuette/112160527/?utm_medium=referral&utm_source=embed&utm_campaign=embed-plugin-referral",
            "output": "https://www.wanderservice-schwarzwald.de/de/tour/wanderungen/rundwanderung-grillhuette/112160527/"
        },
        {
            "input": "https://www.greeneking-pubs.co.uk/pubs/greater-london/shepherds-tavern/?utm_source=g_places&utm_medium=locations&utm_campaign=",
            "output": "https://www.greeneking-pubs.co.uk/pubs/greater-london/shepherds-tavern/"
        },
    ]
    for test in expected:
        cleaned = remove_malicious_parameters(test["input"])
        if (cleaned != remove_malicious_parameters(test["output"])):
            print(cleaned, "vs", test["output"], "for input", test["input"])
            # raising a plain string is a TypeError in Python 3
            raise Exception("failing to make a proper edit")


main()
Repetition
This edit will be done once. Any further run will require separate permission.
Opt-out
Please write to the mailing list thread linked in the Discussion section. Note that in case of an opt-out the same edit will be done manually; keeping tracking parameters in OSM is not an option.