User talk:Ris/link audit

Note
This is a Talk page - please see the documentation about using talk pages. Add newer comments below older ones, sign comments using four tildes (~~~~), and indent successive comments with colons (:). Add new sections at the bottom of the page, under a heading (== ==). Please remember to mark sections as "open for discussion" using {{talk|open}}, so they will show up in the list of open discussions.

Excluding redirects and explicitly setting the namespace in the first script

Talk status
This discussion is still ongoing.

I don't know if you intend to support and re-execute this script, so I'm adding the following as a note here, without using a Talk template.

CODE
URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages" #split to allow posting to wiki

should be replaced with:

CODE
URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages?hideredirects=1&namespace=0" #split to allow posting to wiki

because:

  • hideredirects=1 ensures that redirects will not be listed, which means the list will only contain "unique" pages (example: ALSA, Alsa, Alsa-lib, Alsa-utils are the same page). With this URL parameter, the script finds 2145 pages; without it, the script finds 2789 pages.
  • namespace=0 explicitly ensures that only pages from the "(Main)" namespace will be listed. Note: namespace=4 is the "Gentoo Wiki" namespace; namespace=560 is the "Handbook" namespace.
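
For reference, a minimal sketch (same requests-based approach as the script, not part of it) of fetching the first page of the filtered listing:

CODE
# minimal sketch: fetch the first page of the filtered Special:AllPages listing
import requests
from urllib.parse import urlencode

params = {"hideredirects": "1", "namespace": "0"}  # no redirects, "(Main)" namespace only
URL = "https://" + "wiki.gentoo.org/wiki/Special:AllPages?" + urlencode(params)  # split to allow posting to wiki

response = requests.get(URL, timeout=10)
print(response.status_code, len(response.text))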


--Blacki (talk) 21:46, 15 August 2022 (UTC)

Very good idea! I'm rerunning the script now. As far as I can see, it only checked pages strictly from the main namespace, but it wanders into other namespaces when there is a redirect, so just removing the redirects should keep it to the editable namespace.
Another change comes to mind: maybe exclude links detected as "broken" from the "http" list?
The idea was to replace User:Ris/link_audit/http_links completely when running this script again - is that a good idea? I'm not sure how to deal with any http links that can only be http and not https when updating the list... have you come across any? If there are _very_ few, maybe we could just ignore them and risk manually checking twice?
I was ill a few weeks back and I've fallen really behind - I haven't even checked over the changes made to the wiki while I was away... but from skimming over the list, it looks like you have been making some good changes, thanks :).
-- Ris (talk) 22:55, 15 August 2022 (UTC)
Haha thank you, and I hope you recover(ed) quickly! I was quite surprised to see that there was no wiki feature already tracking broken links and non-HTTPS links, so contributing a bit to your link audit seemed to be the right thing to do :-)
I wholeheartedly agree with excluding links detected as broken from the non-HTTPS links list: I started crossing them out of your list, but it would be better if the script didn't list them in the first place.
Replacing your lists completely in User:Ris/link_audit/http_links and User:Ris/link_audit/broken_links is also what I had in mind. As users should only be crossing out links, and not writing important stuff on these lists, there should be no problem in replacing everything. I did add some comments on some crossed-out links, like "(irrelevant link)", "(no HTTPS available)", "(broken link)", but that's not actually important; I should have contributed to the script instead.
About HTTP links for which no HTTPS version exists, I've seen several like that (and didn't cross out anything). The script should probably test by itself whether an HTTP link has an HTTPS version, and list such a link on your page only if an HTTPS version is available (as nothing can be done otherwise).
--Blacki (talk) 10:24, 17 August 2022 (UTC)
I'm fine now thanks. I'm actually not certain that the functionality to detect bad links does not already exist in the wiki - I just don't know of it ;).
The current version of the script I have locally now should exclude broken links from the http list, and I've started reworking it towards using the "api".
Automatically testing for https versions of the http links is a great idea! It never crossed my mind, doh. Anything crossed off the list should then no longer appear when an update is done, so nothing should ever get manually checked twice - seems great :).
Pushing it further, it could actually diff the old http and an available https version and in theory update the link automatically if they match... Not sure this is a good idea though. Also, there is a point where it is easier to do the work manually than to write automation (I have about 2700 http links on the list currently).
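A rough sketch of that idea (a hypothetical helper, not part of either script): try the https:// form of an http link and only keep the link on the list when the secure version actually answers.

CODE
import requests

def https_alternative(url, timeout=5):
    """Return the https:// form of an http URL if it responds with 200, else None."""
    if not url.startswith("http:"):
        return None
    candidate = "https:" + url[len("http:"):]
    try:
        if requests.get(candidate, timeout=timeout).status_code == 200:
            return candidate
    except requests.exceptions.RequestException:
        pass
    return None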
-- Ris (talk) 10:58, 17 August 2022 (UTC)

Please use the API instead

Talk status
This discussion is still ongoing.

Please don't crawl pages like this as it can severely tax the wiki. Instead, use the API built in. See https://www.mediawiki.org/wiki/API:Allpages and https://www.mediawiki.org/wiki/API:Extlinks as references. --Grknight (talk) 13:02, 16 August 2022 (UTC)
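For illustration, a rough sketch (untested; parameter names taken from the API pages linked above) of pulling the same data through the API instead of crawling the rendered pages:

CODE
# untested sketch: list non-redirect pages in the main namespace via the API,
# then ask for the external links MediaWiki already tracks for one of them
import requests

API = "https://" + "wiki.gentoo.org/api.php"  # split to allow posting to wiki

def api_get(params):
    return requests.get(API, params={"format": "json", "formatversion": "2", **params}).json()

# list=allpages with apfilterredir/apnamespace mirrors hideredirects=1&namespace=0
result = api_get({"action": "query", "list": "allpages", "apnamespace": "0",
                  "apfilterredir": "nonredirects", "aplimit": "500"})
titles = [p["title"] for p in result["query"]["allpages"]]

# prop=extlinks returns the external links already registered for a page
links = api_get({"action": "query", "prop": "extlinks", "ellimit": "500",
                 "titles": titles[0]})
for page in links["query"]["pages"]:
    for link in page.get("extlinks", []):
        print(link.get("url") or link.get("*"))  # key name depends on formatversion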

The first script waits 7 seconds before fetching another page of the search results on Special:AllPages, it doesn't use parallel requests, and it doesn't fetch CSS files, JS files, or images, which means it fetches at least 10 times less data than a normal visit (I tested that on several wiki pages of different sizes) when the content is not already cached.
Same thing with the second script, except it waits 4 seconds before fetching another page of the (Main) namespace.
Moreover, on top of that "wait time", the second script needs to test the connection to each link it has found in the page, which adds several seconds to the total time between its page visits on the wiki.
These scripts are not meant to be run frequently.
Also, these scripts will currently visit about 2,200 pages.
At the same time, for wiki.gentoo.org, the estimated monthly visits are approximately 300,000 (cf. https://www.similarweb.com/fr/website/wiki.gentoo.org/#overview). That means 300,000 visits in 30*24*60*60 = 2,592,000 seconds, so approximately 1 visit every 8.6 seconds on average.
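As a quick check of that estimate (the similarweb figure is an outside approximation, not a measurement of the wiki itself):

CODE
monthly_visits = 300_000
seconds_per_month = 30 * 24 * 60 * 60      # 2,592,000
print(seconds_per_month / monthly_visits)  # ~8.6 seconds between visits on average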
The wiki will be fine.
Ris, just in case, depending on the total time it takes to scrape what is needed, do you think the "wait time" for the second script could/should be increased to 7 seconds too? Or even increase the "wait time" of both scripts to 10 seconds?
Thank you for the mediawiki links, I will probably look at the built-in API later.
--Blacki (talk) 15:45, 16 August 2022 (UTC)
Yeah, I did this figuring it should be quite gentle on the server - it's orders of magnitude less taxing than any of the preexisting tools for link checking that I looked into. I'll rework it to use the api. As far as I can see, https://www.mediawiki.org/wiki/API:Extlinks won't yield the "link text" though, which seems useful when looking through the page to find what to edit.
Blacki, I'll increase the wait time to 10s, it won't hurt. Though the second script takes a long time to run - hours and hours xD.
-- Ris (talk) 11:14, 17 August 2022 (UTC)

Suggestions for the second script

Talk status
This discussion is still ongoing.

Replacing:

CODE
if href.startswith("/"):

with:

CODE
if href.startswith("/") or href.startswith("#"):

because: the href of an internal link may also start with '#'.


Replacing:

CODE
if validators.url(href.strip()):

with:

CODE
if validators.url(href):

and replacing:

CODE
if href and not href.endswith(lang_suffixes):

with:

CODE
if not href:
    continue

href = href.strip()

if href and not href.endswith(lang_suffixes):

because: href is stripped once, and the stripped value is used for the rest of the script.


Replacing:

CODE
if validators.url(href.strip()):
    external_urls.add((href, tag.get_text()))

    if href.startswith("http:"):
        add_error(href, tag.get_text(), "no HTTPS")
else:
    add_error(href, tag.get_text(), "invalid URL")

with:

CODE
external_urls.add((href, tag.get_text()))

and also, replacing:

CODE
print(f"{href[0]}: 200 OK")

with:

CODE
if urlparse(href[0]).scheme == "http":
    add_error(*href, "No HTTPS")
else:
    print(f"\033[32m{href[0]}: 200 OK\033[39m")

and also, replacing:

CODE
# test each external link in page
for href in external_urls:
    try:

with:

CODE
# test each external link in page
for href in external_urls:
    # test special links
    # ("localhost" and private/loopback/reserved/... IPs)

    hostname = urlparse(href[0]).hostname

    try:
        ip = ipaddress.ip_address(hostname)
    except ValueError:
        hostname_is_ip = 0
    else:
        hostname_is_ip = 1

    if hostname == "localhost" \
    or (hostname_is_ip         \
    and (not ip.is_global or ip.is_multicast)):
        if urlparse(href[0]).scheme == "http":
            add_error(*href, "No HTTPS")
        else:
            print(f"\033[36m{href[0]}: Special link\033[39m")
        continue

    # test invalid links
        
    if not validators.url(href[0]):
        add_error(*href, "Invalid URL")
        continue
    
    # test normal links

    try:

and adding:

CODE
from urllib.parse import urlparse
import ipaddress

because:

  • The URL validation is moved from the link fetching (# gets all the links from the "core" of the page) to the link testing (# test each external link in page).
  • The non-HTTPS testing is moved from the link fetching to the link testing.
  • The link fetching will not print "duplicates" anymore for "no HTTPS" and "invalid URL". It printed the link in white, then it printed the link in red immediately after; but with this change, a link will be printed once in white during the link fetching, then once in red during the link testing.
  • Special links (e.g. https://localhost:8080, https://127.0.0.1, https://192.168.0.1/index.html) will not be listed anymore, unless they are non-HTTPS links.
  • Broken links will not be listed anymore in the non-HTTPS list.
  • In the link testing, special links will be displayed in cyan when they don't have any relevant error; 200 OK links will be displayed in green.
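
To illustrate what the ipaddress check above treats as a "special" host (hypothetical URLs):

CODE
import ipaddress
from urllib.parse import urlparse

for url in ("https://127.0.0.1", "https://192.168.0.1/index.html", "https://1.1.1.1"):
    ip = ipaddress.ip_address(urlparse(url).hostname)
    print(url, "-> special" if (not ip.is_global or ip.is_multicast) else "-> normal")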


Replacing:

CODE
for pagename in urls:

with:

CODE
with open("wikigo_link_errors_out.json", "a") as f:
    f.write("{")

for pagename in urls:

and also, replacing:

CODE
f.write(json.dumps({pagename: errors}) + "\n")

with:

CODE
f.write("\n" + json.dumps(pagename) + ": " + json.dumps(errors) + ",")

and also, adding at the end of the script, outside the for loop:

CODE
# remove the last comma
with open("wikigo_link_errors_out.json", "rb+") as f:
    f.seek(-1, 2)
    last_char = f.read(1).decode("utf-8")

    if last_char == ',':
        f.seek(-1, 2)
        f.truncate()

with open("wikigo_link_errors_out.json", "a") as f:
    f.write("\n}")

because: the JSON file will then contain valid JSON. It had content of the form:

FILE wikigo_link_errors_out.json
{"page1": [[p1-link1], [p1-link2], ...]}
{"page2": [[p2-link1], [p2-link2], ...]}
...

when it should be :

FILE wikigo_link_errors_out.json
{
"page1": [[p1-link1], [p1-link2], ...],
"page2": [[p2-link1], [p2-link2], ...],
...
}
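
A quick way to confirm the result, after the script finishes:

CODE
# the output file should now parse as a single JSON object
import json

with open("wikigo_link_errors_out.json") as f:
    data = json.load(f)
print(len(data), "pages with link errors")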

--Blacki (talk) 13:12, 17 August 2022 (UTC)

Looks like there are some great changes there!! I'll post the current version of the script - I think you might have preempted some of the changes that I made this morning. It's in testing - I don't know if it will run properly, there are probably errors... In this version it uses the API, and regular expressions to pull out links from the mediawiki markup - probably not a good idea, but there it is :). How about I make it a bit nicer to read, test it, integrate your suggestions, and post it to github or something?
I'm having trouble getting the codebox to show up properly, check the source of the page for the code :).
-- Ris (talk) 13:33, 17 August 2022 (UTC)
CODE
import requests
import time
import re
import validators
import json

lang_suffixes = ('/ab', '/abs', '/ace', '/ady', '/ady-cyrl', '/aeb', '/aeb-arab', '/aeb-latn', '/af', '/ak', '/aln', '/alt', '/am', '/ami', '/an', '/ang', '/anp', '/ar', '/arc', '/arn', '/arq', '/ary', '/arz', '/as', '/ase', '/ast', '/atj', '/av', '/avk', '/awa', '/ay', '/az', '/azb', '/ba', '/ban', '/bar', '/bbc', '/bbc-latn', '/bcc', '/bcl', '/be', '/be-tarask', '/bg', '/bgn', '/bh', '/bho', '/bi', '/bjn', '/bm', '/bn', '/bo', '/bpy', '/bqi', '/br', '/brh', '/bs', '/btm', '/bto', '/bug', '/bxr', '/ca', '/cbk-zam', '/cdo', '/ce', '/ceb', '/ch', '/chr', '/chy', '/ckb', '/co', '/cps', '/cr', '/crh', '/crh-cyrl', '/crh-latn', '/cs', '/csb', '/cu', '/cv', '/cy', '/da', '/de', '/de-at', '/de-ch', '/de-formal', '/default', '/din', '/diq', '/dsb', '/dtp', '/dty', '/dv', '/dz', '/ee', '/egl', '/el', '/eml', '/en', '/en-ca', '/en-gb', '/eo', '/es', '/es-formal', '/et', '/eu', '/ext', '/fa', '/ff', '/fi', '/fit', '/fj', '/fo', '/fr', '/frc', '/frp', '/frr', '/fur', '/fy', '/ga', '/gag', '/gan', '/gan-hans', '/gan-hant', '/gcr', '/gd', '/gl', '/glk', '/gn', '/gom', '/gom-deva', '/gom-latn', '/gor', '/got', '/grc', '/gsw', '/gu', '/gv', '/ha', '/hak', '/haw', '/he', '/hi', '/hif', '/hif-latn', '/hil', '/hr', '/hrx', '/hsb', '/ht', '/hu', '/hu-formal', '/hy', '/hyw', '/ia', '/id', '/ie', '/ig', '/ii', '/ik', '/ike-cans', '/ike-latn', '/ilo', '/inh', '/io', '/is', '/it', '/iu', '/ja', '/jam', '/jbo', '/jut', '/jv', '/ka', '/kaa', '/kab', '/kbd', '/kbd-cyrl', '/kbp', '/kg', '/khw', '/ki', '/kiu', '/kjp', '/kk', '/kk-arab', '/kk-cn', '/kk-cyrl', '/kk-kz', '/kk-latn', '/kk-tr', '/kl', '/km', '/kn', '/ko', '/ko-kp', '/koi', '/krc', '/kri', '/krj', '/krl', '/ks', '/ks-arab', '/ks-deva', '/ksh', '/ku', '/ku-arab', '/ku-latn', '/kum', '/kv', '/kw', '/ky', '/la', '/lad', '/lb', '/lbe', '/lez', '/lfn', '/lg', '/li', '/lij', '/liv', '/lki', '/lld', '/lmo', '/ln', '/lo', '/loz', '/lrc', '/lt', '/ltg', '/lus', '/luz', '/lv', '/lzh', '/lzz', '/mai', '/map-bms', '/mdf', '/mg', '/mhr', '/mi', '/min', '/mk', '/ml', '/mn', '/mni', '/mnw', '/mo', '/mr', '/mrj', '/ms', '/mt', '/mwl', '/my', '/myv', '/mzn', '/na', '/nah', '/nan', '/nap', '/nb', '/nds', '/nds-nl', '/ne', '/new', '/niu', '/nl', '/nl-informal', '/nn', '/nov', '/nqo', '/nrm', '/nso', '/nv', '/ny', '/nys', '/oc', '/olo', '/om', '/or', '/os', '/pa', '/pag', '/pam', '/pap', '/pcd', '/pdc', '/pdt', '/pfl', '/pi', '/pih', '/pl', '/pms', '/pnb', '/pnt', '/prg', '/ps', '/pt', '/pt-br', '/qqq', '/qu', '/qug', '/rgn', '/rif', '/rm', '/rmy', '/ro', '/roa-tara', '/ru', '/rue', '/rup', '/ruq', '/ruq-cyrl', '/ruq-latn', '/rw', '/sa', '/sah', '/sat', '/sc', '/scn', '/sco', '/sd', '/sdc', '/sdh', '/se', '/sei', '/ses', '/sg', '/sgs', '/sh', '/shi', '/shn', '/shy-latn', '/si', '/sk', '/skr', '/skr-arab', '/sl', '/sli', '/sm', '/sma', '/smn', '/sn', '/so', '/sq', '/sr', '/sr-ec', '/sr-el', '/srn', '/ss', '/st', '/stq', '/sty', '/su', '/sv', '/sw', '/szl', '/szy', '/ta', '/tay', '/tcy', '/te', '/tet', '/tg', '/tg-cyrl', '/tg-latn', '/th', '/ti', '/tk', '/tl', '/tly', '/tn', '/to', '/tpi', '/tr', '/tru', '/trv', '/ts', '/tt', '/tt-cyrl', '/tt-latn', '/tw', '/ty', '/tyv', '/tzm', '/udm', '/ug', '/ug-arab', '/ug-latn', '/uk', '/ur', '/uz', '/ve', '/vec', '/vep', '/vi', '/vls', '/vmf', '/vo', '/vot', '/vro', '/wa', '/war', '/wo', '/wuu', '/xal', '/xh', '/xmf', '/xsy', '/yi', '/yo', '/yue', '/za', '/zea', '/zgh', '/zh', '/zh-cn', '/zh-hans', '/zh-hant', '/zh-hk', '/zh-mo', '/zh-my', '/zh-sg', '/zh-tw', '/zu')  # fmt: skip

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
}

pagenames = []

# set of tuples with all external links for a given page: url, wikisource of link
external_urls = set()
# list of tuples with failed external links for page: url, link text, error type string
errors = []


def add_error(wikisource_link_tag, error_type):
    errors.append((wikisource_link_tag, error_type))
    print(f"\033[31m{wikisource_link_tag}: {error_type}\033[39m")

list_pages_params = {
    "hideredirects": "1",
    "namespace": "0",
    "action": "query",
    "format": "json",
    "list": "allpages",
    "aplimit": "500",
    "apfilterredir": "nonredirects",
}

get_url_params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "timestamp|user|comment|content",
    "rvslots": "main",
    "formatversion": "2",
    "format": "json"
}


def query_server(params):
    return requests.get("https://" + "wiki.gentoo.org/api.php", params=params).json()


"""result = query_server(list_pages_params)

while True:
    for pagename in result["query"]["allpages"]:
        if not pagename["title"].endswith(lang_suffixes):
            pagenames.append(pagename["title"].replace(' ', '_'))

    if "continue" in result:
        result = query_server(list_pages_params | result["continue"])
    else:
        break

    time.sleep(7)

for pagename in pagenames:
    print(pagename) """


# load list of wiki pages
with open("all_wikigo_pages") as f:
    pagenames = f.readlines()
    pagenames = [line.rstrip() for line in pagenames]


for pagename in pagenames:
    print(f"\n******\n** Fetching page to check - {pagename}\n")

    result = query_server(get_url_params | {"titles" : pagename})
    
    page_content = result["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

    wikisource_link_tags = re.findall(r"(\[http[^\]]*\])", page_content)

    print("** links in page:")

    # gets all the links from the "core" of the page
    for tag in wikisource_link_tags:
        href = re.search(r"\[(.*?)\s", tag).group(1)

        if href:
            if validators.url(href.strip()):
                external_urls.add((href, tag))
            else:
                add_error(tag, "invalid URL")

    if external_urls:
        print(f"\n** Testing external links for [{pagename}]:")

    # test each external link in page
    for href in external_urls:
        try:
            result = requests.get(href[0], timeout=(5), headers=headers)
        except requests.exceptions.ConnectionError:
            add_error(href[1], "ConnectionError")
        except requests.exceptions.ReadTimeout:
            add_error(href[1], "timed out")
        except requests.exceptions.ContentDecodingError:
            pass
        except requests.exceptions.TooManyRedirects:
            add_error(href[1], "too many redirects")
        else:
            if result.status_code == 200:
                print(f"{href[1]}: 200 OK")
                if href[0].startswith("http:"):
                    # the link answered, but it is still plain http
                    add_error(href[1], "no HTTPS")
            else:
                add_error(href[1], str(result.status_code))

    external_urls.clear()

    # write json file with all errors
    if errors:
        with open("wikigo_link_errors_out.json", "a") as f:
            f.write(json.dumps({pagename: errors}) + "\n")

    https_errors = [e for e in errors if e[1] == "no HTTPS"]

    if https_errors:
        with open("wikigo_https_errors_out", "a") as f:
            f.write(f"== [[:{pagename}]] ==" + "\n\n")

            for e in https_errors:
                f.write(f"{e[0]} : {e[1]}\n\n")

    other_errors = [e for e in errors if e[1] != "no HTTPS"]

    if other_errors:
        with open("wikigo_link_errors_out", "a") as f:
            f.write(f"== [[:{pagename}]] ==" + "\n\n")

            for e in other_errors:
                f.write(f"{e[0]} : {e[1]}\n\n")

    errors.clear()

    time.sleep(4)

    print("\n")
Actually, trying this just now - it doesn't seem to be working at all. Consider this just a gist of what I was doing with it, it may well be completely broken xD. I'll try and have a look this evening. -- Ris (talk) 13:43, 17 August 2022 (UTC)
So the (main) problem is that I supposed that links in the mediawiki markup were enclosed in square brackets, but that is only true for some of the links. Parsing the markup for "http://" and "https://", trying to reproduce what mediawiki considers to be links, seems fraught with complications - it'll probably be safer to get html from the api and parse it like before. I'll try and fix this up in the coming days. -- Ris (talk) 19:04, 17 August 2022 (UTC)
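Something along these lines could work (untested sketch; the page title is just an example): let the API render the page, then parse the HTML as before rather than second-guessing the wikitext syntax.

CODE
import requests
from bs4 import BeautifulSoup

result = requests.get("https://" + "wiki.gentoo.org/api.php", params={
    "action": "parse", "page": "Chromium",  # example page name
    "prop": "text", "format": "json", "formatversion": "2",
}).json()

soup = BeautifulSoup(result["parse"]["text"], "html.parser")

# MediaWiki marks rendered external links with the "external" class
for tag in soup.find_all("a", class_="external"):
    print(tag.get("href"), "|", tag.get_text())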
Hi Blacki! I've started moulding the script into something less rough and rushed, something more structured and rational - hopefully it'll be much easier to read and work on. I've either integrated your changes or circumvented the issues you brought up - thanks :). Do you have a github account? I'm figuring that if you, or anyone else for that matter, wanted to make any changes, it would be better to put this somewhere more serious for code than the wiki ;). -- Ris (talk) 22:45, 18 August 2022 (UTC)