Pierre

Posted on Jan 22, 2020 • Originally published at scrapingbee.com on Jan 21, 2020

Python web scraping with BeautifulSoup

#python #webdev #webscraping #tutorial

In this article, we will see how to extract structured information from a web page using BeautifulSoup and CSS selectors.

Web scraping with BeautifulSoup

Pulling the HTML out

BeautifulSoup is not a web scraping library per se. It is a library that allows you to efficiently and easily pull information out of HTML. In the real world, it is very often used for web scraping projects.

So to begin, we'll need HTML. We will start by fetching the HTML of the Hacker News landing page with the requests Python package.

import sys

import requests

response = requests.get("https://news.ycombinator.com/")
# Stop early if the server did not answer with 200 OK
if response.status_code != 200:
    print("Error fetching page")
    sys.exit(1)

# Raw bytes of the page body
content = response.content
print(content)

> b'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" 
> content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css? ...

Parsing the HTML with BeautifulSoup

Now that the HTML is accessible, we will use BeautifulSoup to parse it. If you haven't done so already, install the package with pip install beautifulsoup4. In the rest of this article, we will refer to BeautifulSoup4 as BS4.

We now need to parse the HTML and load it into a BS4 structure.

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

This soup object is very handy and allows us to easily access many useful pieces of information such as:

# The HTML title of the page
print(soup.title)
> <title>Hacker News</title>

# The text of the page title
print(soup.title.string)
> Hacker News

# All links in the page
nb_links = len(soup.find_all('a'))
print(f"There are {nb_links} links in this page")
> There are 231 links in this page

# Text from the page
print(soup.get_text())
> Hacker News
> Hacker News
> new | past | comments | ask | show | jobs | submit
> login
> ...

Targeting DOM elements

You might begin to see a pattern in how to use this library. This library allows you to quickly and elegantly target the DOM elements you need.

If you need to select a DOM element by its tag (<p>, <a>, <span>, ...), you can simply do soup.<tag>. The caveat is that it will only select the first HTML element with that tag.

For example, if I want the first link, I just have to do:

 first_link = soup.a
 print(first_link)
 ><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>

This element also has many useful methods to quickly extract information:

# The text of the link
print(first_link.text)
# Empty because first link is an <img>
>""

# The href of the link
print(first_link.get('href'))
> https://news.ycombinator.com
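
The accessors above can be tried without any network call. Here is a minimal sketch on a made-up snippet (the markup, URL and class names are assumptions for illustration, not the real Hacker News HTML). It also shows that .get() returns None when the attribute is missing, which is worth knowing before you collect href values in bulk.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not the real Hacker News markup
html = '<a href="https://example.com" class="story external">Example</a><a>No href</a>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

href = first.get("href")      # value of the href attribute
classes = first["class"]      # class is multi-valued, so it comes back as a list
missing = second.get("href")  # .get() returns None when the attribute is absent
text = first.get_text()       # text content of the tag

print(href, classes, missing, text)
```

Indexing with first["href"] also works, but it raises a KeyError when the attribute is absent, so .get() is usually the safer choice on scraped pages.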

This is a simple example. If you want to select the first element based on its id or class, it is not much more difficult:

pagespace = soup.find(id="pagespace")
print(pagespace)
> <tr id="pagespace" style="height:10px" title=""></tr>

# class is a reserved keyword in Python, hence the '_'
athing = soup.find(class_="athing")
print(athing)
> <tr class="athing" id="22115671">
> ...

It is that simple.

And if you don't want the first matching element but all matching elements, just replace find by find_all.
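
To make the difference concrete, here is a small sketch on a made-up table (the ids and cell text are assumptions for illustration). It also shows what each method gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration
html = """
<table>
  <tr class="athing" id="1"><td>First story</td></tr>
  <tr class="spacer"></tr>
  <tr class="athing" id="2"><td>Second story</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find(class_="athing")     # first match only
all_rows = soup.find_all(class_="athing")  # every match, in document order

first_id = first_row["id"]
all_ids = [row["id"] for row in all_rows]

# When nothing matches, find returns None and find_all returns an empty list
no_match = soup.find(id="does-not-exist")
no_matches = soup.find_all(id="does-not-exist")

print(first_id, all_ids, no_match, len(no_matches))
```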

This simple and elegant interface allows you to quickly write short and powerful Python snippets. For example, let's say that I want to extract all the links in this page and find the 3 links that appear most often. All I have to do is this:

from collections import Counter
all_hrefs = [a.get('href') for a in soup.find_all('a')]
top_3_links = Counter(all_hrefs).most_common(3)
print(top_3_links)
> [('from?site=github.com', 3), ('item?id=22115671', 2), ('item?id=22113827', 2)]

Advanced usage

BeautifulSoup is a great example of a library that is both easy to use and powerful. There is much more you can do to select elements. We won't cover those cases in this article, but here are a few examples of advanced things you can do, with the relevant documentation links:

Select elements with regexp
Select elements with a custom function (links that have Google in it for example)
Iterate over sibling elements
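
As a rough illustration of these three techniques, here is a minimal sketch on a made-up document. The markup and the helper name links_to_google are assumptions for this example only.

```python
import re

from bs4 import BeautifulSoup, Tag

# Made-up HTML for illustration
html = """
<div>
  <h2>Links</h2>
  <a href="https://www.google.com/search?q=python">Search</a>
  <a href="https://news.ycombinator.com/">HN</a>
  <a href="/item?id=1">Item 1</a>
  <p>Footer</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Regular expression: keep <a> tags whose href matches the pattern
item_links = soup.find_all("a", href=re.compile(r"^/item\?id=\d+$"))

# 2. Custom function: called with each tag, return True to keep it
def links_to_google(tag):
    return tag.name == "a" and "google" in tag.get("href", "")

google_links = soup.find_all(links_to_google)

# 3. Siblings: walk the nodes that follow the <h2> at the same level,
#    skipping the whitespace text nodes between tags
h2 = soup.find("h2")
sibling_names = [s.name for s in h2.next_siblings if isinstance(s, Tag)]

print([a.get_text() for a in item_links])
print([a.get_text() for a in google_links])
print(sibling_names)
```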

We only covered how to target elements. There is also a whole section of the documentation about updating and writing HTML, but again, we won't cover it in this article.

Let's now talk about CSS selectors.

CSS selectors

But why learn about CSS selectors if BeautifulSoup can select all elements with its built-in methods?

Well, you'll soon understand.

Hard DOM

Sometimes the HTML document you work with won't have useful classes and ids. Selecting elements with BS4 without relying on that information can be quite verbose.

For example, let's say that you want to extract the score of each post on the HN homepage and that you can't use class names or ids in your code. Here is how you could do it:

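results = []
all_tr = soup.find_all('tr')
for tr in all_tr:
    # Keep only rows with exactly two children: an empty first cell
    # and a second cell holding 13 child nodes (the score row)
    if len(tr.contents) == 2:
        if len(tr.contents[0].contents) == 0 and len(tr.contents[1].contents) == 13:
            # The cell text starts with the score, e.g. "168 points ..."
            points = tr.contents[1].text.split(' ')[0].strip()
            results.append(points)
print(results)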
>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

Not that great, right?

If you rely on CSS selectors, it becomes easier.

all_results = soup.select('td:nth-child(2) > span:nth-child(1)')
results = [r.text.split(' ')[0].strip() for r in all_results]
print(results)

>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

Much clearer and simpler, right? Of course, this example artificially highlights the usefulness of CSS selectors, but you will quickly see that the DOM structure of a page is often more reliable than its class names.
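
If you want to try this selector without hitting the network, here is a minimal sketch on a simplified, made-up table that only loosely mimics a story listing (the real Hacker News markup is more complex and may change). It uses select for all matches and select_one for the first one.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup for illustration
html = """
<table>
  <tr><td></td><td><span>168 points</span> by alice</td></tr>
  <tr><td></td><td><span>80 points</span> by bob</td></tr>
  <tr><td></td><td><span>1 point</span> by carol</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

selector = "td:nth-child(2) > span:nth-child(1)"

# select returns every match, select_one only the first (or None)
spans = soup.select(selector)
scores = [int(span.get_text().split()[0]) for span in spans]
first_span = soup.select_one(selector)

print(scores, first_span.get_text())
```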

Easily debuggable

Another thing that makes CSS expressions great for web scraping is that they are easy to debug. I'll show you how. Open Chrome's developer tools (right-click -> "Inspect"), click on the Elements panel and press Ctrl-F (or Cmd-F on macOS) to enter search mode.

In the search bar, you can write any CSS expression you want and Chrome will instantly find all elements matching it.

Iterate over results by pressing Enter to check that you are correctly getting everything you need.

What is great with Chrome is that it works the other way around too: you can right-click an element in the Elements panel, choose "Copy -> Copy selector", and the selector will be copied to your clipboard.

Powerful

CSS selectors, and pseudo-classes in particular, allow you to select almost any element you want with a single string.

Child and descendants

You can select direct children and descendants with:

# all <p> directly inside an <a>
a > p

# all <p> descendant of an <a>
a p

And you can mix them together:

a > p > .test .example > span

This will totally work.
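
Here is a minimal sketch of the difference between the two combinators, run through soup.select on a made-up snippet (the markup is an assumption for illustration).

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <p> directly inside the <a>, one nested deeper
html = """
<a id="outer">
  <p>direct child</p>
  <div><p>nested deeper</p></div>
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# "a > p" only matches <p> elements whose parent is the <a>
direct = [p.get_text() for p in soup.select("a > p")]

# "a p" matches every <p> anywhere below the <a>
descendants = [p.get_text() for p in soup.select("a p")]

print(direct)
print(descendants)
```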

Siblings

This one is one of my favorites because it allows you to select elements based on the elements on the same level in the DOM hierarchy, hence the sibling expression.

#html example
<p>...</p>
<section>
    <p>...</p>
    <h2>...</h2>
    <p>This paragraph will be selected</p> <!-- matches h2 ~ p and h2 + p -->
    <div>
        <p>...</p>
    </div>
    <p>This paragraph will be selected</p> <!-- matches h2 ~ p only -->
</section>

To select all p coming after an h2 you can use the h2 ~ p selector (it will match two p). You can also use h2 + p if you only want to select p coming directly after an h2 (it will match only one p)
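
The same markup can be checked with soup.select. This is a minimal sketch; the paragraph texts are placeholders added so the matches are easy to tell apart.

```python
from bs4 import BeautifulSoup

# Same structure as the HTML example, with placeholder texts
html = """
<p>outside</p>
<section>
    <p>before h2</p>
    <h2>title</h2>
    <p>first after h2</p>
    <div>
        <p>nested</p>
    </div>
    <p>second after h2</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# General sibling: every <p> after the <h2> at the same level
general = [p.get_text() for p in soup.select("h2 ~ p")]

# Adjacent sibling: only the <p> immediately following the <h2>
adjacent = [p.get_text() for p in soup.select("h2 + p")]

print(general)
print(adjacent)
```

The nested paragraph inside the div is not matched by either selector, because it is not a sibling of the h2.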

Attribute selectors

Attribute selectors allow you to select elements with particular attribute values. For example, p[data-test="foo"] will match:

<p data-test="foo"></p>
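
Besides exact matches, standard CSS also lets you test for the presence of an attribute or for a prefix of its value. Here is a minimal sketch with soup.select on made-up paragraphs (the data-test values are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<p data-test="foo">exact</p>
<p data-test="foobar">prefix</p>
<p data-test="bar">other</p>
<p>no attribute</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact value
exact = [p.get_text() for p in soup.select('p[data-test="foo"]')]

# Attribute is present, whatever its value
has_attr = [p.get_text() for p in soup.select("p[data-test]")]

# Value starts with "foo"
starts_with = [p.get_text() for p in soup.select('p[data-test^="foo"]')]

print(exact, has_attr, starts_with)
```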

Position pseudo classes

If you want to select the last p inside a section, you can also do it in "pure" CSS by leveraging position pseudo-classes. For this particular example, you just need this selector: section p:last-child (written without parentheses; it matches a p that is the last child of its parent). If you want to learn more about this I suggest you take a look at this article.
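
One subtlety is worth a quick sketch: :last-child only matches a p that is the very last element of its parent, while :last-of-type matches the last p among its siblings even when other elements follow it. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: in the first section a <div> follows the last <p>,
# in the second section the <p> is the last child
html = """
<section>
    <p>first</p>
    <p>last paragraph</p>
    <div>trailing div</div>
</section>
<section>
    <h2>title</h2>
    <p>closing</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches only a <p> that is the last element child of its parent
last_child = [p.get_text() for p in soup.select("section p:last-child")]

# Matches the last <p> among its siblings, whatever comes after it
last_of_type = [p.get_text() for p in soup.select("section p:last-of-type")]

print(last_child)
print(last_of_type)
```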

Maintainable code

I also personally think that CSS expressions are easier to maintain. For example, at ScrapingBee, when we do custom web scraping tasks, all our scripts begin like this:

    TITLE_SELECTOR = "title"
    SCORE_SELECTOR = "td:nth-child(2) > span:nth-child(1)"
    ...

It makes it easy and quick to fix scripts when the DOM changes. The laziest way to do it is to simply copy/paste what Chrome gives you when you right-click on an element and copy its selector. If you do this, be careful: Chrome tends to add a lot of useless selectors when you use this trick, so do not hesitate to clean them up a bit before using them in your script.
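
To give an idea of how such a script could look in full, here is a minimal sketch. The extract function and the sample document are hypothetical and only for illustration; the two constants are the ones shown above.

```python
from bs4 import BeautifulSoup

# Selectors live at the top, so a DOM change means editing one line
TITLE_SELECTOR = "title"
SCORE_SELECTOR = "td:nth-child(2) > span:nth-child(1)"


def extract(html):
    """Hypothetical helper: return the page title and the list of scores."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.select_one(TITLE_SELECTOR)
    scores = [
        span.get_text().split()[0]
        for span in soup.select(SCORE_SELECTOR)
        if span.get_text().strip()  # skip empty spans
    ]
    return {
        "title": title_tag.get_text() if title_tag else None,
        "scores": scores,
    }


# Made-up sample document for illustration
sample = """
<html><head><title>Hacker News</title></head>
<body><table>
<tr><td></td><td><span>168 points</span></td></tr>
<tr><td></td><td><span>80 points</span></td></tr>
</table></body></html>
"""
data = extract(sample)
print(data)
```

Keeping the parsing logic in one function and the selectors in constants means the function body rarely needs to change when the target page is redesigned.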