Max Humber
BeautifulSoup is so 2000-and-late: Web Scraping in 2020

BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.

In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!

1. No Dependencies

gazpacho is installed at command line:

pip install gazpacho 

With no extra dependencies:

pip freeze
# gazpacho==1.1

In contrast, bs4 pulls in soupsieve as a dependency and is usually paired with a parser like lxml. I won't tell you how to write software, but minimizing dependencies is usually a good idea...

2. Batteries Included

The html for this blog post can be fetched and made parse-able with Soup.get:

from gazpacho import Soup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)

Unfortunately, you'll need requests on top of bs4 to do the same thing:

import requests
from bs4 import BeautifulSoup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
bsoup = BeautifulSoup(html, "html.parser")

3. Simple Finding

bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, making it hard to know what to use and when to use it:

len(dir(BeautifulSoup()))
# 184

In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:

[m for m in dir(Soup()) if not m.startswith("_")]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Looking at that list, it's clear that to find the title of this post (nested inside an h1 tag), for example, we'll need to use .find:

soup.find('h1')
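That call returns a Soup object, and its .text attribute holds the title string. A quick sketch (assuming the page hasn't changed since this was written):

title = soup.find("h1", mode="first").text
print(title)
# 'BeautifulSoup is so 2000-and-late: Web Scraping in 2020'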

4. Prototyping to Production

gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.

To guarantee and enforce return types in production, the mode= argument of .find can be set manually:

title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)
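For quick prototyping, the automatic default is usually enough. Roughly, the three modes behave like this (a sketch, assuming this page contains more than one a tag):

links = soup.find("a")                # mode="automatic": a Soup if one match, a list if several
first = soup.find("a", mode="first")  # always a single Soup (or None if nothing matches)
every = soup.find("a", mode="all")    # always a list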

In contrast, bs4 has 27 find methods and they all return something different:

[method for method in dir(BeautifulSoup()) if 'find' in method] 

5. PEP 561 Compliant

As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) codebase:

help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]

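Because the type hints ship with the package, a checker like mypy can verify code that consumes Soup objects. A minimal sketch (the helper name and the isinstance narrowing are mine, not part of gazpacho):

from typing import Optional
from gazpacho import Soup

def get_title(soup: Soup) -> Optional[str]:
    # .find can return a Soup, a list, or None, so narrow before touching .text
    h1 = soup.find("h1", mode="first")
    return h1.text if isinstance(h1, Soup) else None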

6. Automatic Formatting

The html on dev.to (and for this post) is well formatted. But if it weren't:

header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags">  <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">    <span class="crayons-tag__prefix">#</span>    python  </a>  <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    webscraping  </a>  <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    gazpacho  </a>  <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">    <span class="crayons-tag__prefix">#</span>    hacktoberfest  </a></div>

gazpacho automatically formats and indents the bad/malformed html:

tags = Soup(bad_html)

Making things easier to read:

print(tags)
# <div class="mb-4 spec__tags">
#   <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
#     <span class="crayons-tag__prefix">#</span>
#         python
#   </a>
#   <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         webscraping
#   </a>
#   <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         gazpacho
#   </a>
#   <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
#     <span class="crayons-tag__prefix">#</span>
#         hacktoberfest
#   </a>
# </div>

7. Speed

gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:

%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

While bs4 takes nearly twice as long to do the same thing:

%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

8. Partial Matching

gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:

<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">

It can be matched exactly with:

soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)

Or partially (the default behaviour) with:

sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)

# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text 

9. Debt-free

gazpacho is Python 3 first, formatted with Black, typed with mypy, and roughly 400 lines of code. It's easy to read through the source:

import inspect

source = inspect.getsource(Soup.find)
print(source)

And, unlike bs4, it isn't riddled with Python 2 technical debt.

10. Open (and Friendly)!

Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform) and looking for contributors.

If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!

Top comments (9)

Mohiuddin Sumon

Yes, it's true that bs4 has a lot of methods, and knowing and remembering them is tough, but for most work we can simply use the common ones.

However, where I faced the most difficulty is with dynamic pages; for example, going to an ecommerce site and scraping a search result page.

In this type of scenario, how would gazpacho work? @maxhumber maybe you can make a tutorial video with selenium and gazpacho?

Max Humber

There are some examples of gazpacho + selenium on this website: scrape.world/
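One common pattern is to let Selenium render the page and hand the result to gazpacho. A rough sketch (assuming Selenium and a Chrome driver are installed; the URL and class name are placeholders):

from gazpacho import Soup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/search?q=shoes")  # placeholder URL
soup = Soup(driver.page_source)                   # hand the rendered html to gazpacho
driver.quit()

products = soup.find("div", {"class": "product"}, mode="all")  # hypothetical class name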

Guilherme Bauer-Negrini

I've used bs4 in 3 projects by now and it had never occurred to me to search for alternatives. I'll certainly give gazpacho a chance; it seems pretty easy.

Robert

Cool!

Tom Quirk

One thing I've never understood about Beautiful Soup is how user-unfriendly the docs are. Gazpacho looks so simple in comparison - I'll definitely check it out!

Max Humber

Right? 🙈

Pacharapol Withayasakpunt

Why not just simply lxml with xpath? (Who says we have to use BeautifulSoup?)

My favorite is Cheerio (in Node.js / web browser), a jQuery analog, though.

mehmeh-ctrl

I remember attending your workshops (including the one where you talked about how one of your cousins plunged into a barn without his parachute open), and I wonder why I won't be teaching people how to use gazpacho tomorrow. I've prepared some extended Pandas exercises (it's only an Intro to Big Data course). Gazpacho deserves a tutorial along the lines of the Django Girls website so it can conquer the world.

Yuri Costa

Boom Boom Pow hahahaha