Introduction
Web scraping is essentially a way to automate the process of extracting data from the web, and as a Python developer, you have access to some of the best libraries and frameworks available to help you get the job done.
We're going to take a look at some of the most popular Python libraries and frameworks for web scraping and compare their pros and cons, so you know exactly what tool to use to tackle any web scraping project you might come across.
HTTP Libraries - Requests and HTTPX
First up, let's talk about HTTP libraries. These are the foundation of web scraping since every scraping job starts by making a request to a website and retrieving its contents, usually as HTML.
Two popular HTTP libraries in Python are Requests and HTTPX.
Requests is easy to use and great for simple scraping tasks, while HTTPX offers some advanced features like async and HTTP/2 support.
Their core functionality and syntax are very similar, so I would recommend HTTPX even for smaller projects since you can easily scale up in the future without compromising performance.
Feature | HTTPX | Requests |
---|---|---|
Asynchronous | ✅ | ❌ |
HTTP/2 support | ✅ | ❌ |
Timeout support | ✅ | ✅ |
Proxy support | ✅ | ✅ |
TLS verification | ✅ | ✅ |
Custom exceptions | ✅ | ❌ |
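To illustrate how close the two APIs are, here's a minimal sketch (using Hacker News as a stand-in URL) that fetches the same page with Requests, with synchronous HTTPX, and with HTTPX's async client, the feature Requests lacks:

```python
import asyncio

import httpx
import requests

URL = "https://news.ycombinator.com/news"

# Synchronous GET with Requests
print(requests.get(URL, timeout=10).status_code)

# The synchronous HTTPX API is nearly identical
print(httpx.get(URL, timeout=10).status_code)

# HTTPX also provides an async client, which Requests does not
async def fetch(url: str) -> int:
    async with httpx.AsyncClient(timeout=10) as client:
        response = await client.get(url)
        return response.status_code

print(asyncio.run(fetch(URL)))
```

Because the synchronous calls mirror each other almost line for line, starting with HTTPX costs you nothing now and leaves the async door open later.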
Parsing HTML with Beautiful Soup
Once you have the HTML content, you need a way to parse it and extract the data you're interested in.
Beautiful Soup is the most popular HTML parser in Python, allowing you to easily navigate and search through the HTML tree structure. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small to medium web scraping projects as well as web scraping beginners.
The two major drawbacks of Beautiful Soup are its inability to scrape JavaScript-heavy websites and its limited scalability, which results in low performance in large-scale projects. For large projects, you would be better off using Scrapy, but more about that later.
Next, let's take a look at how Beautiful Soup works in practice:
from bs4 import BeautifulSoup
import httpx

# Send an HTTP GET request to the specified URL using the httpx library
response = httpx.get("https://news.ycombinator.com/news")

# Save the content of the response
yc_web_page = response.content

# Use the BeautifulSoup library to parse the HTML content of the webpage
soup = BeautifulSoup(yc_web_page, "html.parser")

# Find all elements with the class "athing" (which represent articles on Hacker News)
articles = soup.find_all(class_="athing")

# Loop through each article and extract relevant data, such as the URL, title, and rank
for article in articles:
    data = {
        "URL": article.find(class_="titleline").find("a").get("href"),  # First "a" tag inside the "titleline" element
        "title": article.find(class_="titleline").getText(),  # Text content of the "titleline" element
        "rank": article.find(class_="rank").getText().replace(".", ""),  # Text of the "rank" element, minus the trailing period
    }
    # Print the extracted data for the current article
    print(data)
Explaining the code:
1 - We start by sending an HTTP GET request to the specified URL using the HTTPX library. Then, we save the retrieved content to a variable.
2 - Now, we use the Beautiful Soup library to parse the HTML content of the webpage.
3 - This enables us to manipulate the parsed content using Beautiful Soup methods, such as `find_all`, to find the content we need. In this particular case, we are finding all elements with the class `athing`, which represent articles on Hacker News.
4 - Next, we simply loop through all the articles on the page and use Beautiful Soup's `find` method to pick out the data we want to extract from each article. Finally, we print the scraped data to the console.
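As a side note, Beautiful Soup also supports CSS selectors through its `select()` and `select_one()` methods. Here's a minimal alternative sketch of the same extraction using selectors instead of `find` calls:

```python
from bs4 import BeautifulSoup
import httpx

response = httpx.get("https://news.ycombinator.com/news")
soup = BeautifulSoup(response.content, "html.parser")

# select() accepts CSS selectors, so the same queries can be written more tersely
for article in soup.select("tr.athing"):
    link = article.select_one(".titleline a")
    rank = article.select_one(".rank")
    print({
        "URL": link.get("href") if link else None,
        "title": link.getText() if link else None,
        "rank": rank.getText().rstrip(".") if rank else None,
    })
```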
Browser automation libraries - Selenium and Playwright
What if the website you're scraping relies on JavaScript to load its content? In that case, an HTML parser won't be enough: you'll need to spin up a browser instance to render the page's JavaScript using a browser automation tool like Selenium or Playwright.
These are primarily testing and automation tools that allow you to control a web browser programmatically, including clicking buttons, filling out forms, and more. However, they are also often used in web scraping as a means to access dynamically generated data on a webpage.
While Selenium and Playwright are very similar in their core functionality, Playwright is more modern and complete than Selenium.
For example, Playwright offers some unique built-in features, such as automatically waiting for elements to be visible before performing actions, and an asynchronous version of its API built on `asyncio`.
To show how we can use Playwright for web scraping, let's quickly walk through a code snippet where we use it to extract data from an Amazon product page and save a screenshot of the page while we're at it.
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.firefox.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C")

        # CSS selectors for the data we want to scrape
        selectors = ['#productTitle', 'span.author a', '#productSubtitle', '.a-size-base.a-color-price.a-color-price']

        # Query all selectors concurrently and store the matched elements
        book_data = await asyncio.gather(*(page.query_selector(sel) for sel in selectors))

        # Create a dictionary with the scraped data, skipping any selector that did not match
        keys = ["book_title", "author", "edition", "price"]
        book = {key: await elem.inner_text() for key, elem in zip(keys, book_data) if elem}
        print(book)

        await page.screenshot(path="book.png")
        await browser.close()

asyncio.run(main())
Explaining the code:
1 - We import the necessary modules: `asyncio` and `async_playwright` from Playwright's async API.
2 - We then define an async function called `main` that launches a Firefox browser instance with `headless` mode set to `False` so we can actually see the browser working, creates a new page using the `new_page` method, and navigates to the Amazon product page using the `goto` method.
3 - Next, we define a list of CSS selectors for the data we want to scrape. Then, we use `asyncio.gather` to run the `page.query_selector` method on all the selectors concurrently and store the results in the `book_data` variable.
4 - Now we zip the field names with `book_data` to populate the `book` dictionary with the scraped data. Note that we also check that each element is not `None` and only add the elements that exist. This is considered good practice, since websites can make small changes that will affect your scraper. You could even expand on this example and write more complex tests to ensure the data being extracted is not missing any values.
5 - Finally, we print the `book` dictionary contents to the console and take a screenshot of the page, saving it as a file called `book.png`. As a last step, we make sure to close the browser instance.
But wait! If browser automation tools can scrape virtually any webpage and, on top of that, make it easier for you to automate tasks, test your code, and watch it working, why don't we just always use Playwright or Selenium for web scraping?
Well, despite being powerful scraping tools, these libraries and frameworks have a noticeable drawback. Spinning up a browser instance is a very resource-heavy operation compared to simply retrieving the page's HTML. This can easily become a huge performance bottleneck for large scraping jobs, which will not only take longer to complete but also become considerably more expensive. For that reason, we usually want to limit the use of these tools to the tasks that actually require them and, when possible, combine them with Beautiful Soup or Scrapy.
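One common pattern, for instance, is to let the browser handle only the JavaScript rendering and hand the resulting HTML to a lightweight parser. The sketch below (reusing the Hacker News URL purely as a placeholder, since that site doesn't actually need JavaScript) renders the page with Playwright, closes the browser as soon as the HTML is available, and parses it with Beautiful Soup:

```python
import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

async def fetch_rendered_html(url: str) -> str:
    # Use the browser only for the expensive part: rendering JavaScript
    async with async_playwright() as p:
        browser = await p.firefox.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
    return html

async def main():
    html = await fetch_rendered_html("https://news.ycombinator.com/news")
    # Hand the rendered HTML to Beautiful Soup for fast, cheap parsing
    soup = BeautifulSoup(html, "html.parser")
    for article in soup.find_all(class_="athing"):
        print(article.find(class_="titleline").getText())

asyncio.run(main())
```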
Scrapy
Next up, we have the most popular and, arguably, most powerful web scraping framework for Python.
If you find yourself needing to scrape large amounts of data regularly, then Scrapy could be a great option.
The Scrapy framework offers a full-fledged suite of tools to aid you even in the most complex scraping jobs.
On top of its superior performance when compared to Beautiful Soup, Scrapy can also be easily integrated into other data-processing Python tools and even other libraries, such as Playwright.
Not only that, but it comes with a handy collection of built-in features catered specifically to web scraping, such as:
Feature | Description |
---|---|
Powerful and flexible spidering framework | Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need. |
Fast and efficient | Scrapy is designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage. |
Support for handling common web data formats | Export scraped data in multiple formats, such as CSV, JSON, and XML. |
Extensible architecture | Easily add custom functionality through middleware, pipelines, and extensions. |
Distributed scraping | Scrapy supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines. |
Error handling | Scrapy has robust error-handling capabilities, allowing you to handle common errors and exceptions that may occur during web scraping. |
Support for authentication and cookies | Supports handling authentication and cookies to scrape websites that require login credentials. |
Integration with other Python tools | Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data processing pipelines. |
Here's an example of how to use a Scrapy Spider to scrape data from a website:
import scrapy

class HackernewsSpiderSpider(scrapy.Spider):
    name = 'hackernews_spider'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        articles = response.css('tr.athing')
        for article in articles:
            yield {
                "URL": article.css(".titleline a::attr(href)").get(),
                "title": article.css(".titleline a::text").get(),
                "rank": article.css(".rank::text").get().replace(".", ""),
            }
We can use the following command to run this script and save the resulting data to a JSON file:
scrapy crawl hackernews_spider -o hackernews.json
Explaining the code:
The code example uses Scrapy to scrape data from the Hacker News website (news.ycombinator.com). Let's break down the code step by step:
After importing the necessary modules, we define the Spider class we want to use:
class HackernewsSpiderSpider(scrapy.Spider):
Next, we set the Spider properties:
- `name`: The name of the spider (used to identify it).
- `allowed_domains`: A list of domains that the spider is allowed to crawl.
- `start_urls`: A list of URLs to start crawling from.

name = 'hackernews_spider'
allowed_domains = ['news.ycombinator.com']
start_urls = ['http://news.ycombinator.com/']
Then, we define the `parse` method. This method is the entry point for the spider and is called with the response of the URLs specified in `start_urls`.
def parse(self, response):
In the `parse` method, we extract data from the HTML response: the `response` object represents the HTML page received from the website, and the spider uses CSS selectors to extract relevant data from its structure.
articles = response.css('tr.athing')
Now we use a for loop to iterate over each article found on the page.
for article in articles:
Finally, for each article, the spider extracts the URL, title, and rank information using CSS selectors and yields a Python dictionary containing this data.
yield {
    "URL": article.css(".titleline a::attr(href)").get(),
    "title": article.css(".titleline a::text").get(),
    "rank": article.css(".rank::text").get().replace(".", ""),
}
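Alternatively, if you'd rather keep everything in a single Python file instead of using the scrapy crawl command, Scrapy lets you run a spider programmatically. Here's a minimal sketch, assuming the HackernewsSpiderSpider class above is defined in the same file, that uses CrawlerProcess with the FEEDS setting to write the results to hackernews.json:

```python
from scrapy.crawler import CrawlerProcess

# Configure a feed export so the scraped items end up in a JSON file
process = CrawlerProcess(settings={
    "FEEDS": {
        "hackernews.json": {"format": "json", "overwrite": True},
    },
})

# Schedule the spider defined above and start the (blocking) crawl
process.crawl(HackernewsSpiderSpider)
process.start()
```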
Which Python scraping library is right for you?
So, which library should you use for your web scraping project? The answer depends on the specific needs and requirements of your project. Each web scraping library and framework presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try each of them before deciding!
Whether you are scraping with BeautifulSoup, Scrapy, Selenium, or Playwright, the Apify Python SDK helps you run your project in the cloud at any scale.