This is a guide on how to scrape a webpage using urllib and Beautiful Soup.
Code for this tutorial can be found on GitHub.
Introduction
Web scraping is a useful tool for extracting data from websites, especially those that don’t provide an API. In this post, I’ll show you how you can use web scraping to generate a dataset from a webpage.
For this example we will be using trendogate, a website that displays trending Twitter hashtags for a given day and region. Our goal will be to retrieve the hashtags that are currently trending in the US.
To do this we will primarily be using two libraries:
1. Urllib
urllib is a Python module for working with URLs. It defines functions and classes that help you open URLs and retrieve data from the web, such as HTML, XML, or JSON.
urllib is going to help us retrieve the web page we want to scrape.
urllib is part of the Python standard library, so there is nothing to install.
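To get a feel for it, here is a tiny example (example.com is just a placeholder URL) that opens a page and reads the raw HTML:
import urllib.request

# Open a URL and read the raw HTML bytes
with urllib.request.urlopen("https://example.com") as response:
    html = response.read()
print(html[:100])  # first 100 bytes of the page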
2. Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for the parsed page, which can then be used to extract data from the HTML; this is what makes it useful for web scraping.
Once we have the page we need from Urllib, we’re going to use Beautiful Soup to create a parse tree and extract the information we need from the page.
To install Beautiful Soup:
pip install beautifulsoup4
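To get a feel for it, here is a tiny example that parses a made-up snippet of HTML and pulls out the link text:
from bs4 import BeautifulSoup

# Parse a small HTML snippet and extract the text inside the <a> tag
snippet = '<li class="list-group-item"><a href="/trend/1"> #Example</a></li>'
soup = BeautifulSoup(snippet, "html.parser")
print(soup.a.get_text(strip=True))  # prints: #Example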
Analysing the web page
To scrape a webpage you need to be familiar with the structure of its HTML. Right-click on the page you want to scrape and select Inspect to open the browser's developer tools.
On the trendogate page, we are interested in the "Trending today" section. If we inspect it, we can see its HTML structure.
Today's trending hashtags are in an unordered list with the class list-group, which looks something like this:
<ul class="list-group">
<li class="list-group-item"><a href="/trend/79334795"> #TriviaTuesday</a></li>
<li class="list-group-item"><a href="/trend/79334794"> #IA02</a></li>
<li class="list-group-item"><a href="/trend/79334793"> #livelocaldigital</a></li>
<li class="list-group-item"><a href="/trend/79334792"> #MLW19</a></li>
<li class="list-group-item"><a href="/trend/79334791"> #TuesdayTip</a></li>
<li class="list-group-item"><a href="/trend/79334790"> #VoteCox</a></li>
<li class="list-group-item"><a href="/trend/79334789"> #OnePlus7Series</a></li>
<li class="list-group-item"><a href="/trend/79334788"> #CelebrateWomen</a></li>
<li class="list-group-item"><a href="/trend/79334787"> #FelizMartes</a></li>
<li class="list-group-item"><a href="/trend/79334786"> #PESummit</a></li>
</ul>
First, we import the libraries we'll need:
from bs4 import BeautifulSoup
import urllib.request
Then we'll use urllib to fetch the webpage. (You can read more about urllib in the official Python documentation.)
URL = "https://trendogate.com/place/23424977"
# Open the URL
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
# Store the HTML page in a variable
resulttext = result.read()
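Note that some sites reject requests that don't look like they come from a browser, or the request may simply fail. If that happens, one optional variation is to send a browser-like User-Agent header and catch request errors (the header value below is just an example):
import urllib.request
import urllib.error

headers = {"User-Agent": "Mozilla/5.0"}  # example value; adjust as needed
try:
    request = urllib.request.Request(URL, headers=headers)
    result = urllib.request.urlopen(request)
    resulttext = result.read()
except urllib.error.URLError as e:
    print("Request failed:", e)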
Now we'll create a BeautifulSoup object from the HTML page we just retrieved. (You can read more about Beautiful Soup in its official documentation.)
# Creates a nested data structure
soup = BeautifulSoup(resulttext, 'html.parser')
# Since we are interested only in elements with the class "list-group",
# we search for all elements with that class in the soup
list_groups = soup.find_all(class_="list-group")
# list_groups now contains all elements on the page with the class "list-group"
# Since the "Trending today" list is the first one on the page, its index is 0
trending_list = list_groups[0]
# Now we will iterate through the <li> tags nested inside the <ul>
trending_tags = []
for li in trending_list.find_all("li"):
    # Each <li> contains an <a> tag; its text is the hashtag
    tag = li.a.get_text(strip=True)
    trending_tags.append(tag)
print(trending_tags)
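As a side note, the same extraction can be written more compactly with a CSS selector. This is just an equivalent alternative to the loop above, assuming the page structure shown earlier:
# Grab the first "list-group" <ul> and collect the text of every <a> inside it
first_group = soup.select_one("ul.list-group")
trending_tags = [a.get_text(strip=True) for a in first_group.select("li a")]
print(trending_tags)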
Now we have scraped all of today’s trending hashtags from the website.
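Since the introduction mentioned generating a dataset, you could also save the results to a file. Here's a small optional sketch using Python's csv module (the filename is just an example):
import csv

# Write each hashtag as a row in a CSV file
with open("trending_hashtags.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["hashtag"])
    for tag in trending_tags:
        writer.writerow([tag])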
Conclusion
This is just one of the ways you can scrape information from websites. Another approach is to use a tool like Selenium, which lets you drive a real browser instance and automate tasks with code.
Here's an article you can refer to if you want to get started with Selenium: https://www.toptal.com/python/web-scraping-with-python
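For reference, here is a minimal sketch of fetching the same page with Selenium instead of urllib (this assumes you have Selenium 4 and Chrome installed; the Beautiful Soup parsing step stays the same):
from bs4 import BeautifulSoup
from selenium import webdriver

# Launch a browser, load the page, and hand the rendered HTML to Beautiful Soup
driver = webdriver.Chrome()
driver.get("https://trendogate.com/place/23424977")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()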