User:Ris/link audit

From Gentoo Wiki

See User:Ris/link_audit/broken_links and User:Ris/link_audit/http_links.


Note
Looking into hosting the latest version of this on a VCS, as it is accruing changes.

I've been running link checks on the wiki. I tried existing tools at first, but had trouble keeping them from overloading the server and restricting them to the pertinent sections of the wiki.

Here is some throwaway code to check links in the main namespace of the wiki. I know it's badly written and totally inefficient, but it is efficient with my time, supposing it is for one use only :). Writing code for this is probably overkill and reinventing the wheel, but I got frustrated trying to make existing tools work.

One advantage of doing it this way is that it could be adapted to check other things, such as articles without "Article description::", if that is not already possible directly on the wiki. It could also be adapted to check for internal links to sections of articles that don't exist, though I figure there is plenty of work with the broken external links for now. It could also be used to check for other issues or things that need changing across the wiki.
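As a rough illustration of the section-link idea, here is a minimal sketch, not part of the scripts on this page. It only handles links that point at a section of the same page (href starting with "#"), and it uses the standard library html.parser rather than BeautifulSoup so that it runs without dependencies. The names AnchorCollector and missing_section_links are hypothetical, and it assumes fragments match element ids exactly, which may not hold for percent-encoded anchors.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect element ids and same-page fragment links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.ids = set()        # every id="..." value seen on the page
        self.fragments = []     # targets of <a href="#..."> links, in order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        href = attrs.get("href") or ""
        # Only same-page links are considered; "#" on its own is ignored.
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragments.append(href[1:])


def missing_section_links(html):
    """Return fragments linked from the page that match no id on it."""
    collector = AnchorCollector()
    collector.feed(html)
    return [f for f in collector.fragments if f not in collector.ids]
```

Checking links to sections of other articles would need the target page fetched as well, which this sketch does not attempt.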

Written for Python 3.9, with BeautifulSoup as a dependency. If anyone wants to use/modify this, get in touch on a talk page or IRC and I'll post it to GitHub. I guess by posting this here I'm releasing it under the Creative Commons Attribution-ShareAlike license.

The first script prints a list of all the pages in the main namespace. Run python <scriptname>.py | tee all_wikigo_pages to make a file to be consumed by the second script. A file is used to store the list of pages because they don't change often, and if the second script were to get interrupted, it avoids loading the server for the same thing twice.
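The hand-off through the file could look something like the following sketch. It is an illustration, not the actual second script. It assumes the file holds one page title per line, which is a guess about the first script's output format, and the helper names load_page_list and paced, as well as the one-second default delay, are hypothetical.

```python
import time
from pathlib import Path


def load_page_list(path):
    """Read one page title per line, skipping blank lines and duplicates."""
    seen = set()
    pages = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        title = line.strip()
        if title and title not in seen:
            seen.add(title)
            pages.append(title)
    return pages


def paced(pages, delay_seconds=1.0):
    """Yield pages one at a time, pausing between them to spare the server."""
    for index, page in enumerate(pages):
        if index:
            # No pause before the first page, only between requests.
            time.sleep(delay_seconds)
        yield page
```

A checker would then loop over paced(load_page_list("all_wikigo_pages")), so a restart after an interruption rereads the local file instead of asking the wiki for the page list again.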