What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
Separate virtual environment
In short, a virtual environment creates an independent set of installed libraries, optionally with a different Python version, that can coexist with others on the same system, which prevents library or Python version conflicts.
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.
📌Note: this is not a strict requirement for this blog post.
Install libraries:
pip install requests parsel
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; it covers eleven methods to bypass blocks from most websites.
Full Code
from parsel import Selector
import requests, json, re
params = {
    "q": "richard branson",
    "tbm": "bks",
    "gl": "us",
    "hl": "en"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)
books_results = []
# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)
for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
    # the href already starts with "/search", so only the domain is prepended
    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()
    books_results.append({
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet,
        "author": author,
        "author_link": author_link,
        "date_published": date_published,
        "preview_link": preview_link,
        "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
        "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    })
print(json.dumps(books_results, indent=2))
Import libraries:
from parsel import Selector
import requests, json, re
- parsel is a library to extract and remove data from HTML and XML using XPath and CSS selectors. It's similar to beautifulsoup4, except it supports full XPath and has its own CSS pseudo-elements support, for example ::text or ::attr(<attribute_name>).
Create search query parameters and request headers:
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "richard branson", # search query
    "tbm": "bks", # book results
    "gl": "us", # country to search from
    "hl": "en" # language
}
# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
- user-agent is used to make the request look like a visit from a "real" user, so the website thinks it comes from a browser and not from a bot or script. It's the most basic form of avoiding being blocked by a website.
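To see roughly what the params dict turns into, here is a small sketch that builds the query string with the standard library. It uses urllib.parse.urlencode as a stand-in; requests does its own encoding when params is passed, so treat the result as an approximation of the final URL rather than the exact internals.

```python
from urllib.parse import urlencode

params = {
    "q": "richard branson",  # search query
    "tbm": "bks",            # book results
    "gl": "us",              # country to search from
    "hl": "en",              # language
}

# urlencode joins the pairs with "&" and encodes the space in the query as "+"
query_string = urlencode(params)
url = f"https://www.google.com/search?{query_string}"

print(url)
```

Passing a dict instead of hand-writing the URL keeps the encoding of spaces and special characters out of your own code.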
Pass query params and request headers to the request and create a Selector object:
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)
- timeout=30 tells requests to stop waiting for a response after 30 seconds.
- Selector() is like BeautifulSoup(), except you get full XPath support, and every CSS selector query is translated to XPath by the cssselect package, which represents a pseudo-element with arguments, such as ::attr(href), as a FunctionalPseudoElement.
Create a temporary list to store the data:
books_results = []
Match the thumbnail data using a regular expression:
# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)
The reason we need to parse the data from <script> tags is that if you parse the book thumbnail from the <img> src attribute, you'll get a 1x1 placeholder instead of a thumbnail.
- re.findall() returns a list of all matches.
- selector.css("script") returns a SelectorList of all found <script> tags, and getall() returns their markup as a list of strings.
- re.DOTALL makes the . in the pattern match any character, including a newline. Without this flag, . matches every character except a newline.
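The effect of re.DOTALL is easy to check on a tiny string. The script content below is hypothetical, and the pattern is a simplified version of the one above (without the escaped backslashes), so this is only an illustration of the flag.

```python
import re

# Hypothetical script content with the payload split across two lines.
script_text = "var s='data:image/jpg;base64,AAAA\nBBBB';"
pattern = r"s='data:image/jpg;base64,(.*?)'"

# Without the flag, "." stops at the newline, so the closing quote is never reached
without_flag = re.findall(pattern, script_text)

# With re.DOTALL, "." also matches the newline and the whole payload is captured
with_flag = re.findall(pattern, script_text, re.DOTALL)

print(without_flag)
print(with_flag)
```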
Iterate over matched thumbnails and CSS container with all the needed data and extract it:
for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
    # the href already starts with "/search", so only the domain is prepended
    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()
- zip() iterates over multiple iterables in parallel and yields a tuple with one item from each, stopping at the shortest one.
- css(".Yr5TG") is like calling soup.select(".Yr5TG") with bs4, which returns a list of matches.
- css(".DKV0Md::text"), where the parsel-specific pseudo-element ::text selects the text nodes, and get() returns the first match as a string (or None if nothing matched). Without get(), you'll get a SelectorList or Selector instance built from the translated XPath.
- ::attr(href) is also a pseudo-element; it grabs an attribute value.
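One thing worth knowing about zip() is that it pairs items purely by position and silently drops whatever is left over in the longer iterable. The sketch below uses made-up lists to show that, with itertools.zip_longest as one possible alternative when you would rather keep every result; the scraper in this post uses plain zip().

```python
from itertools import zip_longest

# Hypothetical data: three results were found but only two thumbnails.
thumbnails = ["thumb_1", "thumb_2"]
titles = ["Book A", "Book B", "Book C"]

# zip() stops at the shortest iterable, so "Book C" is dropped
paired = list(zip(thumbnails, titles))

# zip_longest() keeps every item and fills the gaps with fillvalue
padded = list(zip_longest(thumbnails, titles, fillvalue=None))

print(paired)
print(padded)
```

If the number of matched thumbnails and result containers can differ, it may be worth comparing their lengths before trusting the pairing.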
Append the data to the temporary list as a dict:
    # still inside the for loop
    books_results.append({
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet,
        "author": author,
        "author_link": author_link,
        "date_published": date_published,
        "preview_link": preview_link,
        # prepend the domain only when a link was found; otherwise keep None instead of building "https://www.google.comNone"
        "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
        "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    })
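- bytes().decode("unicode-escape") decodes escape sequences such as \x3d. We have to do it twice, most likely because the regex runs on str() of the list of scripts, which doubles every backslash, so the first pass only turns \\x3d into \x3d and the second pass turns it into =.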
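A small sketch of the double decoding, using a made-up, doubly escaped string in place of a real thumbnail payload:

```python
# Hypothetical payload: "==" padding that has been escaped twice,
# i.e. each "=" appears as backslash, backslash, x3d.
raw = r"AAAA\\x3d\\x3d"

# First pass: every doubled backslash collapses into a single one
once = bytes(raw, "ascii").decode("unicode-escape")

# Second pass: the remaining \x3d sequences become "="
twice = bytes(once, "ascii").decode("unicode-escape")

print(raw)
print(once)
print(twice)
```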
Print the data:
print(json.dumps(books_results, indent=2))
Part of the JSON output:
[
{
"title": "The Virgin Way: How to Listen, Learn, Laugh and Lead",
"link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ6AF6BAgIEAI",
"displayed_link": "books.google.com",
"snippet": "This is not a conventional book on leadership. There are no rules \u2013 but rather the secrets of leadership that he has learned along the way from his days at Virgin Records, to his recent work with The Elders.",
"author": "Sir Richard Branson",
"author_link": "https://www.google.com/search?gl=us&hl=en&tbm=bks&tbm=bks&q=inauthor:%22Sir+Richard+Branson%22&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ9Ah6BAgIEAU",
"date_published": "2014",
"preview_link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQuwV6BAgIEAc",
"more_editions_link": "https://www.google.com/books/edition/The_Virgin_Way/Jkp1AgAAQBAJ?hl=en&gl=us&kptab=editions&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQmBZ6BAgIEAg",
"thumbnail": ""
}, ... other results
]
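The \u2013 in the snippet above is an en dash that json.dumps() escapes by default. If you would rather see the character itself, ensure_ascii=False is one option; the sketch below uses a shortened, hand-written result dict for illustration.

```python
import json

# Shortened, hand-written result used only for this example
result = {"snippet": "There are no rules \u2013 but rather the secrets of leadership"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(result, indent=2)

# ensure_ascii=False keeps the original characters in the output
readable = json.dumps(result, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings load back to the same data with json.loads(), so the choice only affects how the output reads.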
Links
Outro
If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub, or @serp_api.
Yours,
Dmitriy, and the rest of the SerpApi Team.