Prerequisites
Install libraries:
pip install requests parsel google-search-results
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.
If you haven't scraped with CSS selectors before, I have a dedicated blog post about how to use CSS selectors when web scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
What will be scraped
📌 Note: only this layout is covered in this blog post. There are at least 3 different carousel result layouts.
Full Code
import requests, re, json
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36"
}

params = {
    "q": "dune actors",  # search query
    "gl": "us",          # country to search from
}


def parsel_get_top_carousel():
    html = requests.get('https://www.google.com/search', headers=headers, params=params, timeout=30)
    selector = Selector(text=html.text)

    carousel_name = selector.css(".yKMVIe::text").get()
    all_script_tags = selector.css("script::text").getall()

    data = {f"{carousel_name}": []}
    decoded_thumbnails = []

    # match every image id against the inline scripts to find its data:image
    for _id in selector.css("img.d7ENZc::attr(id)").getall():
        # https://regex101.com/r/YGtoJn/1
        thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
        thumbnail = [
            bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
        ]
        decoded_thumbnails.append("".join(thumbnail))

    for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):
        title = result.css(".JjtOHd::text").get()
        link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
        extensions = result.css(".ellip.AqEFvb::text").getall()

        # getall() returns an empty list (never None), so test for truthiness
        if title and link and extensions:
            data[carousel_name].append({
                "title": title,
                "link": link,
                "extensions": extensions,
                "thumbnail": image
            })

    print(json.dumps(data, indent=2, ensure_ascii=False))


parsel_get_top_carousel()
Output:
{
"Dune": [
{
"title": "Zendaya", ... first results
"link": "https://www.google.com/search?gl=us&q=Zendaya&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3SElJM7So0BLKTrbST8vMyQUTVsmJxSWLWNmjUvNSEisTAY7G9vs7AAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAC",
"extensions": [
"Chani"
],
"thumbnail": ""
}, ... other results
{
"title": "Javier Bardem", ... last results
"link": "https://www.google.com/search?gl=us&q=Javier+Bardem&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLUz9U3MDQ3NE7WEspOttJPy8zJBRNWyYnFJYtYeb0SyzJTixScEotSUnMBeUccjEAAAAA&sa=X&ved=2ahUKEwjp99fw1972AhXXXM0KHeWoAX4Q9OUBegQIAxAQ",
"extensions": [
"Stilgar"
],
"thumbnail": ""
}
]
}
Code Explanation
Thumbnail extraction
Parsing thumbnails by grabbing the src attribute from the img.d7ENZc CSS selector returns a 1x1 placeholder instead of the actual thumbnail.
Thumbnails are located in the <script> tags. In order to grab them, we need to:
- Locate the image element via Dev Tools.
- Copy its id value.
- Open the page source (CTRL+U), press CTRL+F, and paste the id value to find it.
Most likely you'll see two occurrences, and the second one will be somewhere in the <script> tags. That's what we're looking for.
Now we need to match the image id with the data:image extracted from the <script> elements to get the right image:
selector = Selector(text=html.text)

# grabs every script element
all_script_tags = selector.css("script::text").getall()

# list to temporarily store thumbnail data
decoded_thumbnails = []

# iterating over each image ID
# using _id because id is a Python built-in name
for _id in selector.css("img.d7ENZc::attr(id)").getall():
    # https://regex101.com/r/YGtoJn/1
    thumbnails = re.findall(r"var\s?s=\'([^']+)\'\;var\s?ii\=\['{_id}'\];".format(_id=_id), str(all_script_tags))
    thumbnail = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in thumbnails
    ]
    decoded_thumbnails.append("".join(thumbnail))
Code | Explanation |
---|---|
css("img.d7ENZc::attr(id)") | grabs every image id. |
getall() | returns a list of matches. |
re.findall() | finds all matches of a regular expression. |
r"<expression>" | a raw string that holds the regular expression. |
([^']+) | a regex capture group that matches everything up to the next single quote. |
\['{_id}'\] | the place where the parsed image id is passed into the regular expression to match the correct image. |
format(_id=_id) | fills the {_id} placeholder. String interpolation would look a bit awkward here. |
bytes().decode() | decodes the escape sequences left in the matched string into the characters they represent. |
"".join(thumbnail) | joins the elements of the list into a single string. |
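To try the matching step in isolation, here is a minimal sketch that runs the same kind of pattern against a made-up script string. The sample text, the dimg_1 and dimg_2 ids, and the find_thumbnail helper are assumptions for illustration only; real Google markup may look different. It also wraps the id in re.escape(), a small precaution in case an id ever contains regex metacharacters.

```python
import re

# Made-up stand-in for the text of the inline <script> tags (assumption).
script_text = (
    "(function(){var s='data:image/jpeg;base64,AAAA';var ii=['dimg_1'];})();"
    "(function(){var s='data:image/jpeg;base64,BBBB';var ii=['dimg_2'];})();"
)


def find_thumbnail(image_id, text):
    """Return the data:image string paired with image_id, or None if absent."""
    # Same idea as the pattern above: capture s='...' that sits next to ii=['<id>']
    pattern = r"var\s?s='([^']+)';var\s?ii=\['" + re.escape(image_id) + r"'\];"
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(find_thumbnail("dimg_2", script_text))   # data:image/jpeg;base64,BBBB
print(find_thumbnail("missing", script_text))  # None
```

Because [^']+ cannot cross a single quote, the capture group can only ever hold the one string that directly precedes the requested id.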
Output from decoded_thumbnails:
# data:image is shortened on purpose,
# so the output would not cover the entire page
[
'',
"other images ..."
]
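The double decode can look odd, so here is a small sketch of what appears to be going on: str() on a list returns the repr of its items, which doubles every backslash, so one extra decode pass is needed to undo that. The raw value below is a made-up, shortened string, and decode_escapes is a hypothetical helper name, not part of the code above.

```python
def decode_escapes(text):
    # Turn literal escape sequences (backslash + x3d, etc.) into real characters.
    return bytes(text, "ascii").decode("unicode-escape")


# Assumed shape of a value inside a script tag: the backslashes are literal characters.
raw = r"data:image/jpeg;base64,AAAA\x3d\x3d"

# str() on a list gives "['...']" with every backslash doubled; strip the [' and '].
inner = str([raw])[2:-2]

print(decode_escapes(raw))                    # data:image/jpeg;base64,AAAA==
print(decode_escapes(inner) == raw)           # True: one pass only undoes the doubling
print(decode_escapes(decode_escapes(inner)))  # data:image/jpeg;base64,AAAA==
```

In principle, joining the script texts into one string with "".join() instead of calling str() on the list would leave the backslashes undoubled, so a single decode pass should then be enough.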
Title, link and extensions extraction
The next step is to iterate over the CSS container that holds the title, link, and extensions together with decoded_thumbnails:
for result, image in zip(selector.css('.QjXCXd.X8kvh'), decoded_thumbnails):
    title = result.css(".JjtOHd::text").get()
    link = f"https://www.google.com{result.css('.QjXCXd div a::attr(href)').get()}"
    extensions = result.css(".ellip.AqEFvb::text").getall()
Code | Explanation |
---|---|
zip() | iterates over multiple iterables in a single for loop. |
::text | a parsel pseudo-element that extracts the text nodes, equivalent to XPath <node>/text(). |
::attr(<attribute>) | a parsel pseudo-element that grabs an attribute value from the node, equivalent to XPath <node>/@<attribute>. |
get() | returns the first match. |
getall() | returns a list of all matches. |
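One thing to keep in mind with zip() is that it stops at the shortest iterable, so a result without a matching thumbnail would be dropped silently. The sketch below shows this with plain lists. The thumbnail strings are made-up placeholders, and using itertools.zip_longest is only one possible way to keep every result, not what the code above does.

```python
from itertools import zip_longest

titles = ["Zendaya", "Javier Bardem", "Timothée Chalamet"]
# Placeholder values (assumption); one thumbnail is deliberately missing.
thumbnails = ["data:image/jpeg;base64,AAAA", "data:image/jpeg;base64,BBBB"]

# zip() stops as soon as the shorter list runs out
paired = list(zip(titles, thumbnails))
print(len(paired))  # 2

# zip_longest() keeps every title and pads the missing thumbnail with None
padded = list(zip_longest(titles, thumbnails, fillvalue=None))
print(padded[-1])   # ('Timothée Chalamet', None)
```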
The next step is to check that the extracted title, link, and extensions have values, append them to the temporary list, and print the data:
data = {f"{carousel_name}": []}

# inside the loop over results:
if title and link and extensions:
    data[carousel_name].append({
        "title": title,
        "link": link,
        "extensions": extensions,
        "thumbnail": image
    })

# after the loop:
print(json.dumps(data, indent=2, ensure_ascii=False))
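A detail worth knowing for this check: get() returns None when nothing matches, getall() returns an empty list rather than None, and an f-string built around a missing href is still a non-empty string. The sketch below uses plain values to show the effect. The is_complete helper and the href variable are hypothetical names used only for illustration.

```python
title = "Zendaya"
href = None       # what get() returns when nothing matches
extensions = []   # what getall() returns when nothing matches

# The f-string turns None into the text "None", so the link is always truthy
link = f"https://www.google.com{href}"
print(link)                    # https://www.google.comNone

print(extensions is not None)  # True, even though the list is empty
print(bool(extensions))        # False


def is_complete(title, href, extensions):
    """Hypothetical check: every field must hold a real, non-empty value."""
    return bool(title and href and extensions)


print(is_complete(title, href, extensions))                 # False
print(is_complete(title, "/search?q=Zendaya", ["Chani"]))   # True
```

Checking the raw href before building the link is one way to avoid storing a URL that ends in "None".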
Using Google Top Carousel API from SerpApi
SerpApi is a paid API with a free plan. It lets the end user forget about figuring out how to bypass blocks from search engines and focus on which data to extract.
from serpapi import GoogleSearch
import os, json


def serpapi_get_top_carousel():
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"),  # your SerpApi key in the environment variable
        "engine": "google",               # search engine
        "q": "dune actors",               # search query
        "hl": "en",                       # language
        "gl": "us"                        # country
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['knowledge_graph']['cast']:
        # ensure_ascii=False keeps names such as "Timothée" readable
        print(json.dumps(result, indent=2, ensure_ascii=False))


serpapi_get_top_carousel()
Part of the output:
{
"name": "Timothée Chalamet",
"extensions": [
"Paul Atreides"
],
"link": "https://www.google.com/search?hl=en&gl=us&q=Timoth%C3%A9e+Chalamet&stick=H4sIAAAAAAAAAONgFuLVT9c3NEzLqko2ii8xUOLSz9U3KDDKM0wr0BLKTrbST8vMyQUTVsmJxSWPGJcycgu8_HFPWGo246Q1J68xTmHkwqJOyJCLzTWvJLOkUkhQip8L1RIjEahAtll2hpFZXqHAwmWzGJWcjUx2XZp2jk1P8FkoA0Ndb4iDkiLnFCHrhswn7-wFXd__299ywsBBgkWBQYPB8JElq8P6KYwHtBgOMDI17VtxiI2Fg1GAwYpJg6mKiYOFZxGrUEhmbn5JxuGVqQrOGYk5ibmpJRPYGAHILgFT8gAAAA&sa=X&ved=2ahUKEwiMxLi-ksXzAhUAl2oFHf88AN0Q-BZ6BAgBEDQ",
"image": "https://serpapi.com/searches/6165a3dcfa86759a4fa42ba4/images/94afec67f82aa614bb572a123ec09cf051cf10bde8e0bc8025daf21915c49798.jpeg"
} ... other results
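If a query returns no knowledge graph or no cast list, indexing the dictionary directly would raise a KeyError. A minimal sketch of a more defensive lookup is below. The results dictionary is a made-up stand-in shaped like the output above, not a real API response, and get_cast is a hypothetical helper name.

```python
import json

# Made-up stand-in for search.get_dict(), shaped like the output shown above.
results = {
    "knowledge_graph": {
        "cast": [
            {"name": "Timothée Chalamet", "extensions": ["Paul Atreides"]},
        ]
    }
}


def get_cast(results):
    # .get() with defaults returns an empty list instead of raising KeyError
    return results.get("knowledge_graph", {}).get("cast", [])


for member in get_cast(results):
    print(json.dumps(member, indent=2, ensure_ascii=False))

print(get_cast({}))  # []
```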
Outro
If you have any questions or suggestions, or something isn't working correctly, reach out via Twitter at @dimitryzub or @serp_api.
Yours,
Dimitry, and the rest of the SerpApi Team.