Contents: intro, imports, what will be scraped, code, links, outro.
Intro
This blog post continues the Google web scraping series. Here you'll see examples of how to scrape Google Organic Search Results using Python. An alternative solution using SerpApi is shown as well.
Imports
import lxml, requests
from bs4 import BeautifulSoup
from serpapi import GoogleSearch
What will be scraped
Title, link, displayed link, snippet, and sitelinks (inline and expanded)
Headers
Make sure you pass a User-Agent HTTP header with your request. It makes a block less likely; without it, Google will eventually block your requests. Why? Without custom headers, the requests library identifies itself with its default User-Agent (python-requests), so Google can tell that the request comes from a bot (script), which it does.
Example of passing headers into a request:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('https://www.google.com/search', headers=headers)
# other code
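To inspect what such a request looks like before sending it, the sketch below builds the URL and attaches the header using only Python's standard library (`urllib`), so nothing goes over the network. It is an illustration rather than part of the original script, and the `q` and `hl` query parameters are example values.

```python
from urllib.parse import urlencode
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {"q": "apple", "hl": "en"}  # example query values

# Encode the query parameters into the URL's query string.
url = "https://www.google.com/search?" + urlencode(params)

# Build (but do not send) the request with the custom header attached.
request = Request(url, headers=headers)

print(request.full_url)
# urllib stores header names in capitalized form, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Printing the prepared request this way is a quick check that the header and the query string are what you expect before any real request is made.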
Code
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "apple",  # search query
    "hl": "en",    # interface language
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    snippet = result.select_one('#rso .lyLwlc').text
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# inline sitelinks
for i in soup.select('.HiHjCd a'):
    print(i.text)
# expanded sitelinks
for t in soup.select('.usJj9c'):
    text = t.select_one('.r').text
    text_link = t.select_one('.r a')['href']
    snippet = t.select_one('.st').text
    print(f'{text}\n{text_link}\n{snippet}\n')
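The class names used as selectors can change or be absent for a given result. When that happens, `select_one()` returns `None` and calling `.text` on it raises `AttributeError`. Below is a minimal sketch of a guard, run against a small hand-written HTML fragment. The fragment and the `text_or_none` helper are made up for illustration (this is not real Google markup), and Python's built-in `html.parser` is used so the example does not need lxml.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a single search result; not real Google markup.
sample_html = """
<div class="tF2Cxc">
  <div class="yuRUbf">
    <a href="https://example.com/"><h3 class="DKV0Md">Example title</h3></a>
  </div>
</div>
"""

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

soup = BeautifulSoup(sample_html, "html.parser")
result = soup.select_one(".tF2Cxc")

title = text_or_none(result, ".DKV0Md")    # present in the fragment
snippet = text_or_none(result, ".lyLwlc")  # absent, so None instead of an exception
link = result.select_one(".yuRUbf a")["href"]

print(title, link, snippet)
```

With a helper like this, a missing snippet leaves one field empty instead of stopping the whole loop.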
Using Google Organic Results API
SerpApi is a paid API with a free trial of 5,000 searches.
The difference is that you only need to iterate over ready-made, structured JSON instead of coding everything from scratch and hunting for the right selectors, which can be time-consuming.
from serpapi import GoogleSearch
import json  # just for prettier output
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "buy trampoline",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    # sitelinks are optional, so fall back to None when the key is missing
    try:
        inline_sitelinks = result['sitelinks']['inline']
    except KeyError:
        inline_sitelinks = None
    try:
        expanded_sitelinks = result['sitelinks']['expanded']
    except KeyError:
        expanded_sitelinks = None
    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
    print(json.dumps(expanded_sitelinks, indent=2, ensure_ascii=False))
    print(json.dumps(inline_sitelinks, indent=2, ensure_ascii=False))
-----------------
# organic
'''
Trampolines - Sports & Outdoors - The Home Depot
https://www.homedepot.com/b/Sports-Outdoors-Trampolines/N-5yc1vZc455
https://www.homedepot.com › Sports & Outdoors
Get free shipping on qualified Trampolines or Buy Online Pick Up ... Round Trampoline with Safety Enclosure Basketball Hoop and Ladder.
'''
# expanded sitelinks
'''
[
  {
    "title": "Store",
    "link": "https://m.dji.com/",
    "snippet": "Mavic Series - Refurbished Products - Buy Osmo Series - ..."
  }
]
'''
# inline sitelinks
'''
[
  {
    "title": "Pure Fun",
    "link": "https://www.homedepot.com/b/Sports-Outdoors-Trampolines/Pure-Fun/N-5yc1vZc455Zdeo"
  }
]
'''
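The nested `try`/`except` lookups for sitelinks can also be written with `dict.get()`. The sketch below runs on a hand-made dictionary that only mimics the keys used in the loop (`organic_results`, `sitelinks`, `inline`, `expanded`). It is not a real SerpApi response, and `get_sitelinks` is a hypothetical helper, not part of the SerpApi library.

```python
# Hand-made sample shaped like the keys used in the loop; not a real API response.
results = {
    "organic_results": [
        {
            "title": "Trampolines - Sports & Outdoors - The Home Depot",
            "link": "https://www.homedepot.com/b/Sports-Outdoors-Trampolines/N-5yc1vZc455",
            "sitelinks": {
                "inline": [
                    {
                        "title": "Pure Fun",
                        "link": "https://www.homedepot.com/b/Sports-Outdoors-Trampolines/Pure-Fun/N-5yc1vZc455Zdeo",
                    }
                ]
            },
        },
        {"title": "Result without sitelinks", "link": "https://example.com/"},
    ]
}

def get_sitelinks(result, kind):
    """Return result['sitelinks'][kind] if both keys exist, otherwise None."""
    return result.get("sitelinks", {}).get(kind)

for result in results.get("organic_results", []):
    inline = get_sitelinks(result, "inline")
    expanded = get_sitelinks(result, "expanded")
    print(result["title"], inline, expanded)
```

Chaining `.get()` with an empty-dict default keeps the lookup on one line and returns `None` for results that have no sitelinks at all.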