Python: Find Any Broken Links on Websites


If you’re a web developer, chances are good that you’ve had to deal with broken links at some point. Broken links are a pain for developers and users alike: they prevent users from reaching the content they’re looking for and make your site look unprofessional.

One way to deal with broken links is to find them before they go live on your site. That way, you can fix them before your users ever come across them. In this blog post, we’ll show you how to use the Python package LinkChecker to find broken links on your website.

What is LinkChecker?

LinkChecker is a Python package that makes it easy to find broken links on your website. It does this by checking all of the links on your website and flagging any that are broken.

How Does LinkChecker Work?

LinkChecker works by first crawling your website to find all of the links on it. Once it has a list of links, it checks each one to see whether it’s working. If a link is broken, LinkChecker flags it as such.
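
Conceptually, the per-link check boils down to requesting each URL and looking at the HTTP status code. The snippet below is not LinkChecker’s actual implementation, just a minimal sketch of the idea using the requests library:

import requests

def is_broken(url):
    # Treat 4xx/5xx responses and connection errors as broken.
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.status_code >= 400
    except requests.RequestException:
        return True

print(is_broken("http://example.com/"))  # False if the page is reachable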

How Can I Use LinkChecker?

There are two ways you can use LinkChecker: through the command line or through its Python API. We’ll show you how to use both methods in this blog post.

Using LinkChecker Through the Command Line
To use LinkChecker through the command line, first install it using pip:

$ pip install linkchecker

Once installed, you can then check for broken links on your website by running the following command:

$ linkchecker http://example.com/ --ignore-url=http://example.com/admin/

This will crawl your website and print out a report of any broken links it finds.
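
LinkChecker also has options for tuning the crawl. For example, the following command (option names may vary between versions, so run linkchecker --help to confirm) limits the recursion depth to two levels, checks external links too, and prints the report as CSV:

$ linkchecker --recursion-level=2 --check-extern -o csv http://example.com/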

The --ignore-url option tells LinkChecker to ignore any links that match the given URL pattern (a regular expression). This can be useful for skipping admin pages or other pages that you don’t want included in the report.
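
As far as I can tell, the option can also be given multiple times to skip several patterns at once, for example:

$ linkchecker http://example.com/ --ignore-url='/admin/' --ignore-url='\.pdf$'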

Using LinkChecker Through Its Python API

If you want more control over how LinkChecker runs, you can use its Python API instead of the command line interface. To do so, first import LinkChecker into your Python script:

>>> from linkcheck import *  # import the module

Next, create a new instance of the Link class:

>>> obj = Link('http://example.com/')  # create the object

Finally, call the check() method on your object to crawl the website and check for broken links:

>>> obj.check()  # crawl the site and check each URL

This will return a list of all the broken links found on your website. Each item in the list will be a tuple containing the URL of the broken link and an error message describing why it’s broken.
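
Assuming check() really does return (URL, error message) tuples as described (this API sketch follows the snippet above, so double-check it against your installed LinkChecker version), printing a simple report could look like this:

>>> results = obj.check()
>>> for url, error in results:
...     print(url + ": " + error)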

If you want a quick standalone script to find broken links on a website, you can use the following Python source code:

# Title: finding broken links
# Author: hruday007

import requests
import sys
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from urllib.parse import urljoin

searched_links = []
broken_links = []

def getLinksFromHTML(html):
    def getLink(el):
        return el["href"]
    return list(map(getLink, BeautifulSoup(html, features="html.parser").select("a[href]")))

def find_broken_links(domainToSearch, URL, parentURL):
    if (not (URL in searched_links)) and (not URL.startswith("mailto:")) and \
       (not ("javascript:" in URL)) and (not URL.endswith(".png")) and \
       (not URL.endswith(".jpg")) and (not URL.endswith(".jpeg")):
        try:
            requestObj = requests.get(URL)
            searched_links.append(URL)
            if requestObj.status_code == 404:
                broken_links.append("BROKEN: link " + URL + " from " + parentURL)
                print(broken_links[-1])
            else:
                print("NOT BROKEN: link " + URL + " from " + parentURL)
                # Only recurse into pages on the same domain.
                if urlparse(URL).netloc == domainToSearch:
                    for link in getLinksFromHTML(requestObj.text):
                        find_broken_links(domainToSearch, urljoin(URL, link), URL)
        except Exception as e:
            print("ERROR: " + str(e))

searched_links.append(domainToSearch)
find_broken_links(urlparse(sys.argv[1]).netloc, sys.argv[1], "")

print("\n--- DONE! ---\n")
print("The following links were broken:")
for link in broken_links:
    print("\t" + link)

How to Run the Script and Get All the Broken Links from a Website via the CLI

$ pip install -r requirements.txt
$ python brokenlinksfinder.py http://example.com/

The script reads the website URL from its first command-line argument (sys.argv[1]).


Another Script to Find Broken Links With Python

When we were at work fixing broken links on our blog, I thought it would be fun to make my own broken link checker. It didn’t turn out to be very hard, and I’m glad I no longer have to open a web browser and navigate to a website full of ads to see if a page has broken links.

If you want to use it, the code is below.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def get_broken_links(url):
    # Set root domain (replace with your own site's domain).
    root_domain = "domain.com"

    # Internal function for validating HTTP status code.
    def _validate_url(url):
        r = requests.head(url)
        if r.status_code == 404:
            broken_links.append(url)

    # Make request to URL.
    data = requests.get(url).text

    # Parse HTML from request.
    soup = BeautifulSoup(data, features="html.parser")

    # Create a list containing all links with the root domain.
    links = [link.get("href") for link in soup.find_all("a", href=True)
             if f"//{root_domain}" in link.get("href")]

    # Initialize list for broken links.
    broken_links = []

    # Loop through links checking for 404 responses, and append to list.
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(_validate_url, links)

    return broken_links
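
Here’s a quick usage sketch; example.com is a placeholder, and you’d first set root_domain inside the function to your own domain:

broken = get_broken_links("https://example.com/")
for link in broken:
    print(link)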

I got this source code from this site.

Finding and fixing broken links before they go live on your site is an important part of being a web developer. Thankfully, tools like LinkChecker and short Python scripts like the ones above make it easy to find broken links so that you can fix them before they cause any problems for your users.
