Introduction
Web scraping is a technique for extracting data from websites. It involves making an HTTP request to a website's server, downloading the HTML content of the page, and parsing that content to extract the data you are interested in. Web scraping lets you collect the raw data you need from websites that have no API, or that do not allow access via their API.
There are a ton of great tools out there for web scraping. Different tools work with different languages and support different types of scraping. This article looks at web scraping using Python and Beautiful Soup. Beautiful Soup is a Python library for extracting data from HTML and XML files. It parses a document into a tree of Python objects, letting you call methods on those objects and access the contents of HTML elements.
This article will demonstrate web scraping by using the Beautiful Soup library on the Big Square site. We will scrape the website's product names and prices and save the data to a CSV file.
Prerequisites
The following are the requirements for this tutorial:
- Familiarity with the Python programming language.
- Knowledge of HTML and CSS.
- Familiarity with the command line.
- An understanding of the HTTP request and response cycle.
What is the Process of Web Scraping?
The process of web scraping using Python and Beautiful Soup typically involves the following steps:
1. Installing the required libraries, i.e. Beautiful Soup and Requests.
We will begin by creating a virtual environment before installing the packages. To do that, create a project folder, navigate into it, and run the following command in the terminal:
python -m venv scrapy
Here, `scrapy` is the name of the virtual environment we have created, but you can name it anything you want.
To activate the virtual environment on Windows, type the following command in the terminal:
scrapy\Scripts\activate.bat
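On macOS or Linux, the virtual environment is activated with a slightly different command:

source scrapy/bin/activate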
You can learn more about virtual environments in the official Python documentation.
Now that we have created and activated a virtual environment, we are ready to install the libraries using pip. Install the Requests library by running the following command:
python -m pip install requests
2. Sending a request to the URL of the website to scrape.
We will make an HTTP GET request to the website and print the HTML content of the webpage. To do that, create a file named main.py and paste in the code below:
import requests
content = requests.get("https://bigsquare.co.ke/").text
print(content)
This code imports the `requests` library, which allows you to send HTTP requests using Python. The code then uses the `get` function from the `requests` library to send a GET request to the Big Square URL. The raw response body is in bytes format, and the `.text` attribute decodes it into a string, which is assigned to a variable called `content`. Finally, the code prints the content, which is the HTML source code of the website.
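In practice, it is also worth checking that the request actually succeeded before parsing anything. Below is a minimal, optional sketch; the User-Agent value is just an illustrative example, since some sites reject requests that do not look like they come from a browser:

import requests

# Some sites reject requests without a browser-like User-Agent header
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get("https://bigsquare.co.ke/", headers=headers, timeout=10)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
content = response.text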
3. Parsing the HTML content using the Beautiful Soup library
We will install Beautiful Soup from the terminal as we did with Requests. Note that the package is published on PyPI as beautifulsoup4:

python -m pip install beautifulsoup4
We can replace the existing code with the code below to create a BeautifulSoup object:
from bs4 import BeautifulSoup
import requests
content = requests.get("https://bigsquare.co.ke/").text
info = BeautifulSoup(content, "html.parser")
print(info.prettify())
In this code snippet, the HTML content is passed as an argument to the BeautifulSoup constructor, along with the "html.parser" argument, which tells Beautiful Soup to use Python's built-in HTML parser to parse the content.
The resulting BeautifulSoup object, stored in the 'info' variable, is then printed using the 'prettify()' method, which formats the HTML content in a readable way.
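Once you have the BeautifulSoup object, you can start probing the tree. The snippet below is a quick, illustrative check, assuming the page has a title tag and at least one link; the exact tags present depend on the site's markup:

# Print the text of the page's <title> tag
print(info.title.text)

# Find the first link on the page and print its destination
first_link = info.find("a")
if first_link is not None:
    print(first_link.get("href"))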
4. Navigating the parse tree
We can use various methods to extract the data we want from the HTML or XML document. For example, we can use the `find` or `find_all` methods to search for specific HTML tags and extract their contents.
product_divs = info.find_all("div", class_="product")
This will return a list of Tag objects representing all the div elements on the page with the class "product". We can then iterate over this list and extract the data we want from each element:
for div in product_divs:
    name = div.find("h3").text
    price = div.find("span", class_="price").text
    print(name, price)
This script goes through each element in the product_divs list using a for loop. Within the loop, the code uses the `find()` method provided by Beautiful Soup to locate specific elements within the current div element.

The first line of the loop body uses `find()` to locate an h3 element within the current div element and extracts its text content, which is the name of a product. The second line uses `find()` again, this time to locate a span element with the class "price" within the current div element, and extracts its text content, which is the price of the product.
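Note that `find()` returns None when no matching element exists, so calling `.text` will raise an AttributeError for any product whose markup differs. A more defensive version of the loop (an optional sketch, not part of the original code) might look like this:

for div in product_divs:
    name_tag = div.find("h3")
    price_tag = div.find("span", class_="price")
    # Skip products that are missing a name or a price
    if name_tag is None or price_tag is None:
        continue
    print(name_tag.text.strip(), price_tag.text.strip())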
5. Saving the data
Once we have extracted the data we want, we can save it to a CSV or Excel file using the csv or openpyxl libraries respectively. Here we will save it to a CSV file.
import csv

with open("products.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Name", "Price"])
    for div in product_divs:
        name = div.find("h3").text
        price = div.find("span", class_="price").text
        writer.writerow([name, price])
In this code snippet we use Python's csv library to write the data to a file. The first line opens a file named "products.csv" in write mode ("w") and creates a csv_file object. The newline="" argument stops the csv module from writing extra blank lines between rows on some platforms.
Then, a CSV writer object "writer" is created using the csv_file object, and it writes the headers "Name" and "Price" to the file.
Next, the script loops through all the div elements in the product_divs list and extracts the text of the h3 element (product name) and the text of the span element with a class of "price" (product price) as before, but this time, instead of printing the data, it writes each row to the CSV file.
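As a quick sanity check, you can read the file back with the same csv library and print each row:

import csv

with open("products.csv", newline="") as csv_file:
    for row in csv.reader(csv_file):
        print(row)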
Conclusion
This is a basic example of web scraping with Python and Beautiful Soup. Together they make web scraping easy and can be used to extract data from a wide range of websites. There are many more advanced techniques and features to explore, such as handling pagination, AJAX requests, and cookies.
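For instance, when a site spreads its products across several numbered pages, pagination often comes down to looping over page URLs. The sketch below uses a purely hypothetical URL pattern (example.com with a page query parameter); check the target site's actual structure before adapting it:

from bs4 import BeautifulSoup
import requests

all_products = []
for page in range(1, 6):  # scrape pages 1 through 5
    # Hypothetical URL pattern; adjust it to match the real site
    url = f"https://example.com/products?page={page}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    all_products.extend(soup.find_all("div", class_="product"))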
You can download the source code for this sample project here.