
Vinay Khatri

How to Check Internal and External links on a Webpage using Python

Python is one of the most popular programming languages for extracting, processing, and analyzing data. Its built-in and third-party libraries make it easy for a developer to pull specific data from a web page and build results on top of it.
In this article, I cover a simple Python script that extracts the links from a given web page URL and creates a CSV file listing every link on that page, along with a column telling whether each link is internal or external.

Prerequisite

As this is a Python article with a program, it goes without saying that you need basic knowledge of Python, and Python installed on your system, to test the program for yourself.
If you are on a new system, you can easily install the latest version of Python from the official Python downloads page.

To build the program, I will use 4 Python libraries: two of them are third-party libraries, and the other two are built-in.

Libraries

1. requests:
requests is a popular Python HTTP library. We will use it to make an HTTP request to the URL whose links we want to check.
Because requests is a third-party library, we need to install it into our Python environment using the pip command.
pip install requests
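
For example, a quick request (using https://example.com as a stand-in URL) looks like this:

import requests

#fetch the page and check that the request succeeded
response = requests.get("https://example.com")
print(response.status_code)    #e.g. 200
print(response.url)            #final url after any redirects
print(response.text[:100])     #first 100 characters of the page HTML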

2. Beautiful Soup:
Beautiful Soup is a third-party Python library that can extract data from HTML and XML files. A web page is generally an HTML document, so we can use Beautiful Soup to extract the links from it.

Use the following command to install Beautiful Soup:
pip install beautifulsoup4
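
As a minimal sketch, this is how Beautiful Soup pulls the <a> tags out of a small HTML string (the HTML below is made up just for illustration):

from bs4 import BeautifulSoup

html = '<p><a href="/about">About</a> <a href="https://example.com">Example</a></p>'
soup = BeautifulSoup(html, 'html.parser')

#find_all('a') returns every <a> element in the document
for anchor in soup.find_all('a'):
    print(anchor.get('href'))   #prints /about, then https://example.com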

3. csv
The csv module ships with Python; we can use it to read, write, and append data in .csv files.
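
For instance, here is a minimal sketch of csv.DictWriter, which the program uses later (the file name links.csv and the field names here are just examples):

import csv

with open('links.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['Link', 'Type'])
    writer.writeheader()
    writer.writerow({'Link': '/about', 'Type': 'Internal'})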

4. datetime
datetime is also a built-in Python module, used for working with dates and times.
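
For example, datetime.datetime.now() returns the current timestamp, and its parts can be combined into a unique file name like the one the program generates:

import datetime

now = datetime.datetime.now()
#e.g. Links-16-7-2022 114403.csv
filename = f"Links-{now.day}-{now.month}-{now.year} {now.hour}{now.minute}{now.second}.csv"
print(filename)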

Program

Now let's use all these 4 Python modules and write a program that can list all the internal and external links of a web page and export that data into a .csv file.

I have divided this program into three functions to make it modular.

Function 1: requestMaker(url)

The requestMaker(url) function accepts the url as a string and sends a GET request to it using the requests.get() method.
After making the request, inside the requestMaker() function, I collect the response web page's HTML content and final URL using the .text and .url properties,
and then call the parseLinks(pageHtml, pageUrl) function.



#to make the HTTP request to the give url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)

        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            #call the parseLink function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry, could not fetch the result, status code {response.status_code}!")

    except requests.exceptions.RequestException:
        print(f"Could not connect to url {url}")




Function 2: parseLinks(pageHtml, pageUrl)

The parseLinks() function accepts pageHtml and pageUrl as strings and parses the pageHtml string with the BeautifulSoup module, using the HTML parser, into a soup object. With the soup object, we collect a list of all the <a> tags present in the HTML page using the .find_all('a') method.
Then, inside the parseLinks() function, I call the extIntLinks(allLinks, pageUrl) function.



#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)



Function 3: extIntLinks(allLinks, pageUrl)

The extIntLinks(allLinks, pageUrl) function does the following things:

  1. Creates a unique .csv file name using the datetime module.
  2. Creates that .csv file in write mode.
  3. Loops through all the extracted <a> links.
  4. Checks whether each link is internal or external.
  5. Writes the data into the csv file.


def extIntLinks(allLinks, pageUrl):
    #filename 
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename =  f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url','Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0 

        #go through all the <a> elements list 
        for anchor in allLinks:
            link = anchor.get("href")   #get the link from the <a> element

            #skip <a> elements that have no href attribute
            if link is None:
                continue

            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#") :
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'Internal'})
                internalLinks+=1
            #if the link is external
            else:
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'External'})
                externalLinks+=1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {url} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")



The complete Program:

Now we can put the complete program together and run it.



import requests
from bs4 import BeautifulSoup
import csv
import datetime 


def extIntLinks(allLinks, pageUrl):
    #filename 
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename =  f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url','Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0 

        #go through all the <a> elements list 
        for anchor in allLinks:
            link = anchor.get("href")   #get the link from the <a> element

            #skip <a> elements that have no href attribute
            if link is None:
                continue

            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#") :
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'Internal'})
                internalLinks+=1
            #if the link is external
            else:
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'External'})
                externalLinks+=1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {url} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")


#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)

#to make the HTTP request to the give url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)

        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            #call the parseLink function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry, could not fetch the result, status code {response.status_code}!")

    except requests.exceptions.RequestException:
        print(f"Could not connect to url {url}")



if __name__ == "__main__":
    url = input("Enter the URL eg. https://example.com:  ")
    requestMaker(url)



Output



Enter the URL eg. https://example.com:  https://techgeekbuzz.com
The page https://techgeekbuzz.com has 126 Internal Link(s) and 7 External Link(s)
And data has been saved in the Links-16-7-2022 11644.csv



The CSV File

(Screenshot of the generated CSV file.)
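
If you prefer to inspect the generated file from Python instead of a spreadsheet, here is a small sketch using csv.reader (replace the file name with the one printed by the program):

import csv

with open('Links-16-7-2022 11644.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        print(row)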

You can also download this code from my GitHub.

HAPPY CODING!!
