Python is one of the most popular programming languages for extracting, processing, and analyzing data. Its built-in and third-party libraries make it easy for a developer to pull specific data from a web page and build results on top of those data sets.
In this article, I cover a simple Python script that extracts the links from a given web page URL and creates a CSV file containing every link on that page, along with a column telling whether each link is internal or external.
Prerequisites
Since this is a Python article built around a program, it goes without saying that you need basic knowledge of Python and a Python installation on your system to test the program for yourself.
If you are on a new system, you can easily install the latest version of Python from the official Python downloads page.
To build the program, I will use four Python libraries: two of them are third-party libraries and the other two are built-in.
Libraries
1. requests:
requests is the popular Python HTTP library. We will use it to make an HTTP request to the URL whose links we want to check.
As requests is a third-party library, we need to install it for our Python environment using the pip command.
pip install requests
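Just to show the shape of the API, here is a minimal sketch of a GET request with requests (the URL is only a placeholder):

import requests

#fetch a page and look at the response (example.com is only a placeholder)
response = requests.get("https://example.com")
print(response.status_code)  #e.g. 200 on success
print(response.text[:200])   #the first 200 characters of the HTML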
2. Beautiful Soup:
Beautiful Soup is a third-party Python library that can extract data from HTML and XML files. A web page is generally an HTML document, and we can use Beautiful Soup to extract the links from it.
Use the following command to install Beautiful Soup:
pip install beautifulsoup4
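To give an idea of what Beautiful Soup does, here is a minimal sketch that pulls the href values out of a made-up HTML snippet:

from bs4 import BeautifulSoup

html = '<p><a href="/about">About</a> <a href="https://example.com">Example</a></p>'
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a'):
    print(anchor.get('href'))  #prints /about, then https://example.com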
3. csv
The csv module ships with Python, and we can use it to read, write, and append to .csv files.
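For illustration, a minimal sketch of writing rows with csv.DictWriter (the file and field names are just placeholders):

import csv

with open('demo.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['Link', 'Type'])
    writer.writeheader()
    writer.writerow({'Link': '/about', 'Type': 'Internal'})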
4. datetime
datetime is also a built-in Python module for working with dates and times.
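In this program it is only used to build a unique, timestamped file name, roughly like this:

import datetime

now = datetime.datetime.now()
#same format as the file name used later in the program
filename = f"Links-{now.day}-{now.month}-{now.year} {now.hour}{now.minute}{now.second}.csv"
print(filename)  #e.g. Links-16-7-2022 11644.csv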
Program
Now let's use these four Python modules to write a program that lists all the internal and external links of a web page and exports that data into a .csv file.
I have divided this program into three functions to make it modular.
Function 1: requestMaker(url)
The requestMaker(url) function accepts the URL as a string and sends a GET request to it using the .get() method of requests. After making the request, inside requestMaker() I collect the HTML content of the response page and its URL using the .text and .url properties, and then call the parseLinks(pageHtml, pageUrl) function.
#to make the HTTP request to the given url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)
        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url
            #call the parseLinks function
            parseLinks(pageHtml, pageUrl)
        else:
            print(f"Sorry, could not fetch the result, status code {response.status_code}!")
    except requests.exceptions.RequestException:
        print(f"Could not connect to url {url}")
Function 2: parseLinks(pageHtml, pageUrl)
The parseLinks() function accepts pageHtml and pageUrl as strings and parses the pageHtml string with the BeautifulSoup constructor, using the HTML parser, into a soup object. With that soup object, it collects a list of all the <a> tags present in the HTML page using the .find_all('a') method. Then, inside parseLinks(), I call the extIntLinks(allLinks, pageUrl) function.
#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')
    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')
    extIntLinks(allLinks, pageUrl)
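Note that .find_all('a') returns Tag objects, and an <a> element does not have to carry an href attribute; in that case anchor.get("href") returns None, which is why the next function skips such anchors. A tiny made-up example:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a name="top">Anchor without href</a>', 'html.parser')
print(soup.find('a').get('href'))  #prints None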
Function 3: extIntLinks(allLinks, pageUrl)
The extIntLinks(allLinks, pageUrl) function does the following things:
- Creates a unique .csv file name using the datetime module.
- Creates that .csv file in write mode.
- Loops through all the extracted <a> links.
- Checks whether each link is internal or external.
- Writes the data into the csv file.
def extIntLinks(allLinks, pageUrl):
    #filename
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        internalLinks = 0
        externalLinks = 0
        #go through all the <a> elements list
        for anchor in allLinks:
            link = anchor.get("href")  #get the link from the <a> element
            #skip <a> elements that have no href attribute
            if link is None:
                continue
            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            #if the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])
    print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
    print(f"And data has been saved in the {filename}")
The Complete Program
Now we can put the whole program together and run it.
import requests
from bs4 import BeautifulSoup
import csv
import datetime
def extIntLinks(allLinks, pageUrl):
    #filename
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        internalLinks = 0
        externalLinks = 0
        #go through all the <a> elements list
        for anchor in allLinks:
            link = anchor.get("href")  #get the link from the <a> element
            #skip <a> elements that have no href attribute
            if link is None:
                continue
            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            #if the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])
    print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
    print(f"And data has been saved in the {filename}")

#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')
    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')
    extIntLinks(allLinks, pageUrl)

#to make the HTTP request to the given url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)
        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url
            #call the parseLinks function
            parseLinks(pageHtml, pageUrl)
        else:
            print(f"Sorry, could not fetch the result, status code {response.status_code}!")
    except requests.exceptions.RequestException:
        print(f"Could not connect to url {url}")

if __name__ == "__main__":
    url = input("Enter the URL eg. https://example.com: ")
    requestMaker(url)
Output
Enter the URL eg. https://example.com: https://techgeekbuzz.com
The page https://techgeekbuzz.com has 126 Internal Link(s) and 7 External Link(s)
And data has been saved in the Links-16-7-2022 11644.csv
The CSV File
You can also download this code from my GitHub.
HAPPY CODING!!