User talk:Ris/link audit

From Gentoo Wiki

Excluding redirects and explicitly setting the namespace in the first script

Talk status
This discussion is still ongoing.

I don't know if you intend to support and re-execute this script, so I add the following as a note here, without using a Talk template.

CODE
URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages" #split to allow posting to wiki

should be replaced with :

CODE
URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages?hideredirects=1&namespace=0" #split to allow posting to wiki

because :

  • hideredirects=1 ensures that redirects will not be listed, which means the list will only contain "unique" pages (example: ALSA, Alsa, Alsa-lib, Alsa-utils are the same page). With this URL parameter, the script finds 2145 pages ; without it, the script finds 2789 pages.
  • namespace=0 explicitly ensures that only the pages from the "(Main)" namespace will be listed. Note: namespace=4 is for the "Gentoo Wiki" namespace ; namespace=560 is for the "Handbook" namespace.


--Blacki (talk) 21:46, 15 August 2022 (UTC)
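As a rough illustration of the URL change proposed above, the query string can be assembled from a dictionary instead of being typed by hand. This is only a sketch: `build_allpages_url` is a hypothetical helper name that does not appear in the original script, and only the two URL parameters discussed above are handled.

```python
from urllib.parse import urlencode

# Split as in the original script so the snippet can be posted to the wiki.
BASE_URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages"


def build_allpages_url(namespace=0, hide_redirects=True):
    """Return the Special:AllPages URL restricted to one namespace.

    Hypothetical helper: the keys mirror the URL parameters discussed
    above (hideredirects, namespace).
    """
    params = {}
    if hide_redirects:
        params["hideredirects"] = "1"
    params["namespace"] = str(namespace)
    return BASE_URL + "?" + urlencode(params)


print(build_allpages_url())
```

Calling `build_allpages_url(560)` would target the "Handbook" namespace mentioned above in the same way.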

Very good idea! I'm rerunning the script now. As far as I can see, it only checked pages strictly from the main namespace, but it wanders into other namespaces when there is a redirect, so just removing the redirects should keep it to the editable namespace.
Another change comes to mind, maybe exclude links detected as "broken" from the "http" list ?
The idea was to replace the User:Ris/link_audit/http_links completely when running this script again - is that a good idea ? I'm not sure how to deal with any http links that can only be http and not https, when updating the list... have you come across any ? If there are _very_ few maybe we could just ignore them and risk manually checking twice ?
I was ill a few weeks back and I've got really behind, I haven't even checked over the changes made to the wiki while I was away... but from skimming over the list, looks like you have been making some good changes, thanks :).
-- Ris (talk) 22:55, 15 August 2022 (UTC)
Haha thank you, and I hope you recover(ed) quickly ! I was quite surprised to see that there was no wiki feature already tracking broken links and non-HTTPS links, so contributing a bit to your link audit seemed to be the right thing to do :-)
I wholeheartedly agree with excluding links detected as broken from the non-HTTPS links list : I started crossing them out of your list, but that would be better if the script didn't list them in the first place.
Replacing your lists completely in User:Ris/link_audit/http_links and User:Ris/link_audit/broken_links is also what I had in mind. As users should only be crossing out links, and not writing important stuff on these lists, there should be no problem in replacing everything. I did add some comments on some crossed out links, like "(irrelevant link)", "(no HTTPS available)", "(broken link)", but that's not important actually ; I should have contributed on the script instead.
About HTTP links for which no HTTPS version exists, I've seen several like that (and didn't cross out anything). The script should probably test by itself whether an HTTP link has an HTTPS version, and list such a link on your page only if an HTTPS version is available (as nothing can be done otherwise).
--Blacki (talk) 10:24, 17 August 2022 (UTC)
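The idea of letting the script check for an HTTPS counterpart could look roughly like the sketch below. The names `to_https`, `default_probe` and `has_https_version` are hypothetical and not part of either script. The probe uses a plain `HEAD` request from the standard library, which is an assumption; the scripts in this discussion use `requests.get`.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def to_https(url):
    """Return url with its scheme switched from http to https."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    return parts._replace(scheme="https").geturl()


def default_probe(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        return None


def has_https_version(url, probe=default_probe):
    """True if url is an http link whose https counterpart answers 200."""
    if urlsplit(url).scheme != "http":
        return False
    return probe(to_https(url)) == 200
```

The `probe` argument is there so the decision logic can be exercised without network access. Some servers reject `HEAD`, so a real run might need to fall back to `GET`.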
I'm fine now thanks. I'm actually not certain that the functionality to detect bad links does not already exist in the wiki - I just don't know of it ;).
The current version of the script I have locally now should exclude broken links from the http list, and I've started reworking it towards using the "api".
Automatically testing for https versions of the http links is a great idea! It never crossed my mind, doh. Anything crossed off the list should then no longer appear when an update is done, so nothing should ever get manually checked twice - seems great :).
Pushing it further, it could actually diff the old http and an available https version and in theory update the link automatically if they match... Not sure this is a good idea though. Also, there is a point where it is easier to do the work manually than to write automation (I have about 2700 http links on the list currently).
-- Ris (talk) 10:58, 17 August 2022 (UTC)
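For the "diff the old http and an available https version" idea, one possible comparison is a similarity ratio from the standard library's `difflib`. This is a sketch only: `bodies_match` is a hypothetical name and the 0.95 threshold is an arbitrary assumption, not a tested value.

```python
import difflib


def bodies_match(http_body, https_body, threshold=0.95):
    """Return True if the two page bodies are nearly identical.

    The threshold is an illustrative assumption; dynamic pages (ads,
    timestamps, session tokens) may need a lower value or manual review.
    """
    ratio = difflib.SequenceMatcher(None, http_body, https_body).ratio()
    return ratio >= threshold


print(bodies_match("<html>same</html>", "<html>same</html>"))
```

`SequenceMatcher` can be slow on large pages, which is one more reason an automatic rewrite may not be worth the effort for a list of this size.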

Please use the API instead

Talk status
This discussion is still ongoing.

Please don't crawl pages like this as it can severely tax the wiki. Instead, use the API built in. See https://www.mediawiki.org/wiki/API:Allpages and https://www.mediawiki.org/wiki/API:Extlinks as references. --Grknight (talk) 13:02, 16 August 2022 (UTC)
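A minimal sketch of what the two API calls referenced above could look like, including the continuation handling that the API requires for long result sets. The function names (`query_api`, `iter_query`, `list_main_pages`, `list_external_links`) are hypothetical. The endpoint is the one used by the script posted further down this page, and the parameter names are taken from the linked MediaWiki documentation.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://" + "wiki.gentoo.org/api.php"


def query_api(params):
    """Send one GET request to the MediaWiki Action API and decode the JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_query(params, query=query_api):
    """Yield every response batch, following the API's "continue" block."""
    request = dict(params, action="query", format="json", formatversion="2")
    while True:
        result = query(request)
        yield result
        if "continue" not in result:
            break
        request = {**request, **result["continue"]}


def list_main_pages(query=query_api):
    """Yield titles of non-redirect pages in the main namespace."""
    params = {"list": "allpages", "apnamespace": "0",
              "apfilterredir": "nonredirects", "aplimit": "500"}
    for batch in iter_query(params, query):
        for page in batch["query"]["allpages"]:
            yield page["title"]


def list_external_links(title, query=query_api):
    """Yield the external URLs recorded for one page."""
    params = {"prop": "extlinks", "titles": title, "ellimit": "500"}
    for batch in iter_query(params, query):
        for page in batch.get("query", {}).get("pages", []):
            for link in page.get("extlinks", []):
                # formatversion=2 is assumed to use "url"; older output uses "*".
                yield link.get("url") or link.get("*")
```

The `query` argument makes it possible to try the pagination logic against canned responses before pointing it at the wiki. A real run should still pause between requests.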

The first script waits 7 seconds before fetching another page of the search results on Special:AllPages, it doesn't use parallel requests, and it doesn't fetch CSS files, JS files, or images, which means it fetches at least 10 times less data than a normal visit (I tested that on several wiki pages of different sizes) when the content is not already cached.
Same thing with the second script, except it waits 4 seconds before fetching another page of the (Main) namespace.
Moreover, before the "wait time", as the second script needs to test the connection to each link it has found in the page, it adds several seconds to the total time between its page visits on the wiki.
These scripts are not meant to be run frequently.
Also, these scripts will currently visit about 2 200 pages.
At the same time, for wiki.gentoo.org, the estimated monthly visits are approximately 300 000 (cf: https://www.similarweb.com/fr/website/wiki.gentoo.org/#overview). It means there are 300 000 visits in 30*24*60*60 = 2 592 000 seconds, so approximately 1 visit every 8.6 seconds on average.
The wiki will be fine.
P.Fox (Ris) , just in case, depending on the total time it takes to scrape what is needed, do you think the "wait time" for the second script could/should be increased to 7 seconds too ? or even increase the "wait time" of both scripts to 10 seconds ?
Thank you for the mediawiki links, I will probably look at the API built-in later.
--Blacki (talk) 15:45, 16 August 2022 (UTC)
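The load estimate above can be reproduced with a few lines of arithmetic. The figures are the ones quoted in the comment (300 000 monthly visits, about 2 200 pages); `seconds_between_visits` and `crawl_hours` are hypothetical helper names, and the crawl duration is a lower bound because it ignores the time spent testing each external link.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2 592 000
MONTHLY_VISITS = 300_000               # estimate quoted above
PAGES_TO_VISIT = 2_200                 # approximate page count quoted above


def seconds_between_visits(visits, period=SECONDS_PER_MONTH):
    """Average gap between two regular visits, in seconds."""
    return period / visits


def crawl_hours(pages, wait_seconds):
    """Minimum crawl duration in hours, counting only the wait time."""
    return pages * wait_seconds / 3600


print(round(seconds_between_visits(MONTHLY_VISITS), 2))  # 8.64
for wait in (4, 7, 10):
    # prints 2.4, 4.3 and 6.1 hours respectively
    print(wait, round(crawl_hours(PAGES_TO_VISIT, wait), 1))
```

With a 10 second wait, the crawl therefore adds roughly one request per 10 seconds for a little over six hours, on top of an average of one regular visit every 8.64 seconds.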
Yeah, I did this figuring it should be quite gentle on the server - it's orders of magnitude less taxing than any of the preexisting tools for link checking that I looked into. I'll rework it to use the api. As far as I can see, https://www.mediawiki.org/wiki/API:Extlinks won't yield the "link text" though, which seems useful when looking through the page to find what to edit.
Blacki I'll increase the wait time to 10s, it won't hurt. Though the second script takes a long time to run - hours and hours xD.
-- Ris (talk) 11:14, 17 August 2022 (UTC)

Suggestions for the second script

Talk status
This discussion is still ongoing.

Replacing :

CODE
if href.startswith("/"):

with :

CODE
if href.startswith("/") or href.startswith("#"):

because : the href of an internal link may also start with '#'.
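The same check can be written with a tuple of prefixes, since `str.startswith` accepts one. This is just a sketch; `is_internal_link` is a hypothetical name.

```python
INTERNAL_PREFIXES = ("/", "#")


def is_internal_link(href):
    """True for site-relative paths and in-page anchors."""
    return href.startswith(INTERNAL_PREFIXES)


print(is_internal_link("#Installation"))      # True
print(is_internal_link("/wiki/ALSA"))         # True
print(is_internal_link("https://gentoo.org")) # False
```

Note that a protocol-relative link such as `//example.org/` also starts with "/" and would be treated as internal by this test, exactly as in the original condition.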


Replacing :

CODE
if validators.url(href.strip()):

with :

CODE
if validators.url(href):

and replacing :

CODE
if href and not href.endswith(lang_suffixes):

with :

CODE
if not href:
    continue

href = href.strip()

if href and not href.endswith(lang_suffixes):

because : href will be stripped for the rest of the script.


Replacing :

CODE
if validators.url(href.strip()):
    external_urls.add((href, tag.get_text()))

    if href.startswith("http:"):
        add_error(href, tag.get_text(), "no HTTPS")
else:
    add_error(href, tag.get_text(), "invalid URL")

with :

CODE
external_urls.add((href, tag.get_text()))

and also, replacing :

CODE
print(f"{href[0]}: 200 OK")

with :

CODE
if urlparse(href[0]).scheme == "http":
    add_error(*href, "No HTTPS")
else:
    print(f"\033[32m{href[0]}: 200 OK\033[39m")

and also, replacing :

CODE
# test each external link in page
for href in external_urls:
    try:

with :

CODE
# test each external link in page
for href in external_urls:
    # test special links
    # ("localhost" and private/loopback/reserved/... IPs)

    hostname = urlparse(href[0]).hostname

    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        hostname_is_ip = 0
    else:
        hostname_is_ip = 1

    if hostname == "localhost" \
    or (hostname_is_ip         \
    and (not ip.is_global or ip.is_multicast)):
        if urlparse(href[0]).scheme == "http":
            add_error(*href, "No HTTPS")
        else:
            print(f"\033[36m{href[0]}: Special link\033[39m")
        continue

    # test invalid links
        
    if not validators.url(href[0]):
        add_error(*href, "Invalid URL")
        continue
    
    # test normal links

    try:

and adding :

CODE
from urllib.parse import urlparse
import ipaddress

because :

  • The URL validation is moved from the link fetching (# gets all the links from the "core" of the page) to the link testing (# test each external link in page).
  • The non-HTTPS testing is moved from the link fetching to the link testing.
  • The link fetching will not print "duplicates" anymore for "no HTTPS" and "invalid URL". It printed the link in white, then it printed the link in red immediately after ; but with this, a link will be printed once in white in the link fetching, then once in red in the link testing.
  • Special links (ex: https://localhost:8080, https://127.0.0.1, https://192.168.0.1/index.html) will not be listed anymore, except if they are non-HTTPS links.
  • Broken links will not be listed anymore in the non-HTTPS list.
  • In the link testing, special links will be displayed in cyan (ANSI code 36) if they don't have any relevant error ; 200 OK links will be displayed in green.
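The special-link test described in the list above can be isolated into a small function, which makes it easier to try against sample URLs. This is a sketch under the same rules as the suggested code ("localhost", or an IP address that is not global or is multicast); `is_special_link` is a hypothetical name.

```python
import ipaddress
from urllib.parse import urlparse


def is_special_link(url):
    """True for localhost and for non-global or multicast IP addresses."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return False
    if hostname == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        # Not an IP literal: an ordinary host name.
        return False
    return (not ip.is_global) or ip.is_multicast


for url in ("https://localhost:8080", "https://127.0.0.1",
            "https://192.168.0.1/index.html", "https://wiki.gentoo.org/"):
    print(url, is_special_link(url))
```

The first three URLs are the examples from the list above and are reported as special; the last one is an ordinary host name and is not.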


Replacing :

CODE
for pagename in urls:

with :

CODE
with open("wikigo_link_errors_out.json", "a") as f:
    f.write("{")

for pagename in urls:

and also, replacing :

CODE
f.write(json.dumps({pagename: errors}) + "\n")

with :

CODE
f.write("\n" + json.dumps(pagename) + ": " + json.dumps(errors) + ",")

and also, adding at the end of the script, outside the for loop :

CODE
# remove the last comma
with open("wikigo_link_errors_out.json", "rb+") as f:
    f.seek(-1, 2)
    last_char = f.read(1).decode("utf-8")

    if last_char == ',':
        f.seek(-1, 2)
        f.truncate()

with open("wikigo_link_errors_out.json", "a") as f:
    f.write("\n}")

because : the JSON file will then contain valid JSON. It had a content of form :

FILE wikigo_link_errors_out.json
{"page1": [[p1-link1], [p1-link2], ...]}
{"page2": [[p2-link1], [p2-link2], ...]}
...

when it should be :

FILE wikigo_link_errors_out.json
{
"page1": [[p1-link1], [p1-link2], ...],
"page2": [[p2-link1], [p2-link2], ...],
...
}

--Blacki (talk) 13:12, 17 August 2022 (UTC)
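A different way to reach the same result, sketched here as an alternative rather than as what either script does: since every line of the old file is itself a valid JSON object, the lines can be merged afterwards and written back as one document, without having to manage trailing commas. `merge_json_lines` and `convert_to_single_object` are hypothetical names.

```python
import json


def merge_json_lines(lines):
    """Merge one-object-per-line JSON records into a single dictionary."""
    merged = {}
    for line in lines:
        line = line.strip()
        if line:
            merged.update(json.loads(line))
    return merged


def convert_to_single_object(source_path, target_path):
    """Rewrite a line-per-page file as one valid JSON document."""
    with open(source_path, encoding="utf-8") as source:
        merged = merge_json_lines(source)
    with open(target_path, "w", encoding="utf-8") as target:
        json.dump(merged, target, indent=1)
        target.write("\n")
    return merged
```

Keeping the per-page append during the crawl has one advantage: an interrupted run still leaves a readable file, and the conversion can be done once at the end.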

Looks like there are some great changes there!! I'll post the current version of the script - I think you might have preempted some of the changes that I did this morning. It's in testing - I don't know if it will run properly, there are probably errors... In this version it uses the api, and regular expressions to pull out links from the mediawiki markup - probably not a good idea, but there it is :). How about I make it a bit nicer to read, test it, integrate your suggestions, and post it to github or something ?
I'm having trouble getting the codebox to show up properly, check the source of the page for the code :).
-- Ris (talk) 13:33, 17 August 2022 (UTC)
CODE
import requests
import time
import re
import validators
import json

lang_suffixes = ('/ab', '/abs', '/ace', '/ady', '/ady-cyrl', '/aeb', '/aeb-arab', '/aeb-latn', '/af', '/ak', '/aln', '/alt', '/am', '/ami', '/an', '/ang', '/anp', '/ar', '/arc', '/arn', '/arq', '/ary', '/arz', '/as', '/ase', '/ast', '/atj', '/av', '/avk', '/awa', '/ay', '/az', '/azb', '/ba', '/ban', '/bar', '/bbc', '/bbc-latn', '/bcc', '/bcl', '/be', '/be-tarask', '/bg', '/bgn', '/bh', '/bho', '/bi', '/bjn', '/bm', '/bn', '/bo', '/bpy', '/bqi', '/br', '/brh', '/bs', '/btm', '/bto', '/bug', '/bxr', '/ca', '/cbk-zam', '/cdo', '/ce', '/ceb', '/ch', '/chr', '/chy', '/ckb', '/co', '/cps', '/cr', '/crh', '/crh-cyrl', '/crh-latn', '/cs', '/csb', '/cu', '/cv', '/cy', '/da', '/de', '/de-at', '/de-ch', '/de-formal', '/default', '/din', '/diq', '/dsb', '/dtp', '/dty', '/dv', '/dz', '/ee', '/egl', '/el', '/eml', '/en', '/en-ca', '/en-gb', '/eo', '/es', '/es-formal', '/et', '/eu', '/ext', '/fa', '/ff', '/fi', '/fit', '/fj', '/fo', '/fr', '/frc', '/frp', '/frr', '/fur', '/fy', '/ga', '/gag', '/gan', '/gan-hans', '/gan-hant', '/gcr', '/gd', '/gl', '/glk', '/gn', '/gom', '/gom-deva', '/gom-latn', '/gor', '/got', '/grc', '/gsw', '/gu', '/gv', '/ha', '/hak', '/haw', '/he', '/hi', '/hif', '/hif-latn', '/hil', '/hr', '/hrx', '/hsb', '/ht', '/hu', '/hu-formal', '/hy', '/hyw', '/ia', '/id', '/ie', '/ig', '/ii', '/ik', '/ike-cans', '/ike-latn', '/ilo', '/inh', '/io', '/is', '/it', '/iu', '/ja', '/jam', '/jbo', '/jut', '/jv', '/ka', '/kaa', '/kab', '/kbd', '/kbd-cyrl', '/kbp', '/kg', '/khw', '/ki', '/kiu', '/kjp', '/kk', '/kk-arab', '/kk-cn', '/kk-cyrl', '/kk-kz', '/kk-latn', '/kk-tr', '/kl', '/km', '/kn', '/ko', '/ko-kp', '/koi', '/krc', '/kri', '/krj', '/krl', '/ks', '/ks-arab', '/ks-deva', '/ksh', '/ku', '/ku-arab', '/ku-latn', '/kum', '/kv', '/kw', '/ky', '/la', '/lad', '/lb', '/lbe', '/lez', '/lfn', '/lg', '/li', '/lij', '/liv', '/lki', '/lld', '/lmo', '/ln', '/lo', '/loz', '/lrc', '/lt', '/ltg', '/lus', '/luz', '/lv', '/lzh', '/lzz', '/mai', '/map-bms', '/mdf', '/mg', '/mhr', '/mi', 
'/min', '/mk', '/ml', '/mn', '/mni', '/mnw', '/mo', '/mr', '/mrj', '/ms', '/mt', '/mwl', '/my', '/myv', '/mzn', '/na', '/nah', '/nan', '/nap', '/nb', '/nds', '/nds-nl', '/ne', '/new', '/niu', '/nl', '/nl-informal', '/nn', '/nov', '/nqo', '/nrm', '/nso', '/nv', '/ny', '/nys', '/oc', '/olo', '/om', '/or', '/os', '/pa', '/pag', '/pam', '/pap', '/pcd', '/pdc', '/pdt', '/pfl', '/pi', '/pih', '/pl', '/pms', '/pnb', '/pnt', '/prg', '/ps', '/pt', '/pt-br', '/qqq', '/qu', '/qug', '/rgn', '/rif', '/rm', '/rmy', '/ro', '/roa-tara', '/ru', '/rue', '/rup', '/ruq', '/ruq-cyrl', '/ruq-latn', '/rw', '/sa', '/sah', '/sat', '/sc', '/scn', '/sco', '/sd', '/sdc', '/sdh', '/se', '/sei', '/ses', '/sg', '/sgs', '/sh', '/shi', '/shn', '/shy-latn', '/si', '/sk', '/skr', '/skr-arab', '/sl', '/sli', '/sm', '/sma', '/smn', '/sn', '/so', '/sq', '/sr', '/sr-ec', '/sr-el', '/srn', '/ss', '/st', '/stq', '/sty', '/su', '/sv', '/sw', '/szl', '/szy', '/ta', '/tay', '/tcy', '/te', '/tet', '/tg', '/tg-cyrl', '/tg-latn', '/th', '/ti', '/tk', '/tl', '/tly', '/tn', '/to', '/tpi', '/tr', '/tru', '/trv', '/ts', '/tt', '/tt-cyrl', '/tt-latn', '/tw', '/ty', '/tyv', '/tzm', '/udm', '/ug', '/ug-arab', '/ug-latn', '/uk', '/ur', '/uz', '/ve', '/vec', '/vep', '/vi', '/vls', '/vmf', '/vo', '/vot', '/vro', '/wa', '/war', '/wo', '/wuu', '/xal', '/xh', '/xmf', '/xsy', '/yi', '/yo', '/yue', '/za', '/zea', '/zgh', '/zh', '/zh-cn', '/zh-hans', '/zh-hant', '/zh-hk', '/zh-mo', '/zh-my', '/zh-sg', '/zh-tw', '/zu')  # fmt: skip

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
}

pagenames= []

# set of tuples with all external links for a given page: url, wikisource of link
external_urls = set()
# list of tuples with failed external links for page: wikisource of link, error type string
errors = []


def add_error(wikisource_link_tag, error_type):
    errors.append((wikisource_link_tag, error_type))
    print(f"\033[31m{wikisource_link_tag}: {error_type}\033[39m")

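list_pages_params = {
    # "apnamespace" and "apfilterredir" are the list=allpages equivalents of the
    # Special:AllPages URL parameters "namespace" and "hideredirects".
    "action": "query",
    "format": "json",
    "list": "allpages",
    "apnamespace": "0",
    "aplimit": "500",
    "apfilterredir": "nonredirects",
}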

get_url_params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "timestamp|user|comment|content",
    "rvslots": "main",
    "formatversion": "2",
    "format": "json"
}


def query_server(params):
    return requests.get("https://" + "wiki.gentoo.org/api.php", params=params).json()


"""result = query_server(list_pages_params)

while True:
    for pagename in result["query"]["allpages"]:
        if not pagename["title"].endswith(lang_suffixes):
            pagenames.append(pagename["title"].replace(' ', '_'))

    if "continue" in result:
        result = query_server(list_pages_params | result["continue"])
    else:
        break

    time.sleep(7)

for pagename in pagenames:
    print(pagename) """


# load list of wiki pages
with open("all_wikigo_pages") as f:
    pagenames = f.readlines()
    pagenames = [line.rstrip() for line in pagenames]


for pagename in pagenames:
    print(f"\n******\n** Fetching page to check - {pagename}\n")

    result = query_server(get_url_params | {"titles" : pagename})
    
    page_content = result["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

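    # raw strings keep the regex backslashes from being read as string escapes
    wikisource_link_tags = re.findall(r"(\[http[^\]]*\])", page_content)

    print("** links in page:")

    # gets all the links from the "core" of the page
    for tag in wikisource_link_tags:
        # the URL ends at the first whitespace, or at "]" when there is no link text
        href = re.search(r"\[([^\s\]]+)", tag).group(1)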

        if href:
            if validators.url(href.strip()):
                external_urls.add((href, tag))
            else:
                add_error(tag, "invalid URL")

    if external_urls:
        print(f"\n** Testing external links for [{pagename}]:")

    # test each external link in page
    for href in external_urls:
        try:
            result = requests.get(href[0], timeout=(5), headers=headers)
        except requests.exceptions.ConnectionError:
            add_error(href[1], "ConnectionError")
        except requests.exceptions.ReadTimeout:
            add_error(href[1], "timed out")
        except requests.exceptions.ContentDecodingError:
            pass
        except requests.exceptions.TooManyRedirects:
            add_error(href[1], "too many redirects")
        else:
            if result.status_code == 200:
                print(f"{href[1]}: 200 OK")
                if href[0].startswith("http:"):
                    # add_error() takes (wikisource_link_tag, error_type)
                    add_error(href[1], "no HTTPS")
            else:
                add_error(href[1], str(result.status_code))

    external_urls.clear()

    # write json file with all errors
    if errors:
        with open("wikigo_link_errors_out.json", "a") as f:
            f.write(json.dumps({pagename: errors}) + "\n")

    https_errors = [e for e in errors if e[1] == "no HTTPS"]

    if https_errors:
        with open("wikigo_https_errors_out", "a") as f:
            f.write(f"== [[:{pagename}]] ==" + "\n\n")

            for e in https_errors:
                f.write(f"{e[0]} : {e[1]}\n\n")

    other_errors = [e for e in errors if e[1] != "no HTTPS"]

    if other_errors:
        with open("wikigo_link_errors_out", "a") as f:
            f.write(f"== [[:{pagename}]] ==" + "\n\n")

            for e in other_errors:
                f.write(f"{e[0]} : {e[1]}\n\n")

    errors.clear()

    time.sleep(4)

    print("\n")
Actually, trying this just now - it doesn't seem to be working at all. Consider this just a gist of what I was doing with it, it may well be completely broken xD. I'll try and have a look this evening. -- Ris (talk) 13:43, 17 August 2022 (UTC)
So the (main) problem is that I supposed that links in the mediawiki markup were enclosed in square brackets, but that is only true for some of the links. Parsing the markup for "http://" and "https://", trying to reproduce what mediawiki considers to be links, seems fraught with complications - it'll probably be safer to get html from the api and parse it like before. I'll try and fix this up in the coming days. -- Ris (talk) 19:04, 17 August 2022 (UTC)
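Parsing rendered HTML rather than wiki markup could be sketched with the standard library's `html.parser`, keeping the link text alongside each URL. This assumes the HTML comes from the API's `action=parse` with `prop=text`, and that MediaWiki marks external links with an `external` CSS class. `ExternalLinkParser` and `extract_external_links` are hypothetical names; the earlier scripts used a different HTML parser.

```python
from html.parser import HTMLParser

# Assumed request parameters for fetching rendered HTML; add "page": <title>.
PARSE_PARAMS = {"action": "parse", "prop": "text",
                "format": "json", "formatversion": "2"}


class ExternalLinkParser(HTMLParser):
    """Collect (href, link text) pairs for <a class="external ..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "external" in classes and attributes.get("href"):
            self._href = attributes["href"]
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an external link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_external_links(html):
    """Return a list of (url, link text) tuples found in html."""
    parser = ExternalLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

This sidesteps the question of what MediaWiki considers a link in the markup, because the wiki has already made that decision when rendering the page.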
Hi Blacki ! I've started moulding the script into something less rough and rushed out, into something more structured and rational - hopefully it'll be much easier to read and work on. I've either integrated your changes, or circumvented the issues you brought up - thanks :). Do you have a github account ? I'm figuring that if you, or anyone else for that matter, wanted to make any changes it would be better to put this somewhere more serious for code than the wiki ;). -- Ris (talk) 22:45, 18 August 2022 (UTC)

Script resurrection

Talk status
This discussion is still ongoing.

Hi again P.Fox (Ris) , sorry for not answering for a long time. As you mentioned me elsewhere and talked about the script, that reminded me about it ; I have also had a complete rewrite of the script for some time, that I was too lazy to put on Github ^^

My script has the same functionalities as your existing one (I don't know how it compares to your rewrite), with some differences, as it :

  • gets links either by fetching data from the MediaWiki Action API, or by loading data from a dump file resulting from a previous fetch
  • uses a default wait time of 10 seconds between requests on the same host
  • tests, for any HTTP external link that returned a HTTP 200 code, whether there is a HTTPS version available (if yes, the link is added to the list of non-HTTPS links that need to be converted to HTTPS, otherwise, it's not, since there is nothing that can be done)
  • adds broken HTTP external links to the list of broken links that need to be fixed, without duplicating them in the list of non-HTTPS links that need to be fixed


As you said, the MediaWiki API doesn't yield the "link text". However, I found that it's not that useful when looking through the page to find what to edit, since I just edit the whole page ("Edit source" button) and then brutally "Ctrl+F" all of it for "http:" or for part/all of the link that needs to be fixed.
But I have to admit, it sure is prettier to have the "link text" on the user pages' lists of links ^^

Also, I suggest that when someone fixed a link, the link should be removed from the list on your user page instead of being crossed out, since :

  • this wiki edit is simpler + faster
  • it decreases the size of the list of links that need to be fixed, while crossed out links add a bit of noise


I ran the script for the last time on 2022/10/05 ; should I update the lists of links on your user pages with my results, or put them on my user pages, or something else ?

Here is my code : https://github.com/oblackio/auditlinks

--Blacki (talk) 20:54, 11 October 2022 (UTC)

Hi Blacki . Thanks for the contribution. I should have got back to you on this sooner, but the days turn into weeks, into months...
About updating the lists of bad links, I'd say there is enough work there already. The lists are actually incomplete because of what looks like a bug in mediawiki - see discussion.
Definitely agree that the best thing to do is to remove any "fixed" links from the list.
Hope you are well, missing your edits ;). -- Ris (talk) 12:30, 24 March 2023 (UTC)