Ian Kerins

Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider)

Google is the undisputed king of search engines in just about every aspect, making it the ultimate source of data for a whole host of use cases.

If you want access to this data, you either need to extract it manually, pay a third party for an expensive data feed, or build your own scraper to extract the data for you.

In this article I will show you the easiest way to build a Google scraper that can extract millions of pages of data each day with just a few lines of code.

By combining Scrapy with Scraper API's proxy and autoparsing functionality, we will build a Google scraper that can extract the search engine results from any Google query and return the following for each result:

  • Title
  • Link
  • Related links
  • Description
  • Snippet
  • Images
  • Thumbnails
  • Sources, and more

You can also refine your search queries with parameters, by specifying a keyword, the geographic region, the language, the number of results, results from a particular domain, or even to only return safe results. The possibilities are nearly limitless.

The code for this project is available on GitHub here.

For this guide, we're going to use:

  • Scraper API as our proxy solution, as Google has pretty aggressive anti-scraping measures in place. You can sign up for a free account here, which will give you 5,000 free requests.
  • ScrapeOps to monitor our scrapers for free and alert us if they run into trouble. Live demo here: ScrapeOps Demo

ScrapeOps Dashboard


How to Query Google Using Scraper API’s Autoparse Functionality

We will use Scraper API for two reasons:

  1. Proxies, so we won't get blocked.
  2. Parsing, so we don't have to worry about writing our own parsers.

Scraper API is a proxy management API that handles everything to do with rotating and managing proxies so our requests don't get banned, which is great for a difficult-to-scrape site like Google.

However, what makes Scraper API extra useful for sites like Google and Amazon is that they provide auto parsing functionality free of charge so you don't need to write and maintain your own parsers.

By using Scraper API’s autoparse functionality for Google Search or Google Shopping, all the HTML will be automatically parsed into JSON format for you, greatly simplifying the scraping process.

All we need to do to make use of this handy capability is to add the following parameter to our request:

 "&autoparse=true"

We’ll send the HTTP request with this parameter via Scrapy, which will scrape Google results based on the specified keywords. The results will be returned in JSON format, which we will then parse using Python.
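To get a feel for what this returns before wiring it into Scrapy, here is a minimal sketch using the requests library (the keyword, the num value, and the YOUR_KEY placeholder are just illustrative; the full Scrapy version is built below):

import requests
from urllib.parse import urlencode

API_KEY = 'YOUR_KEY'  # replace with your own Scraper API key

# Ask Scraper API to fetch the Google URL for us and, because
# autoparse=true, return the results already parsed into JSON.
payload = {
    'api_key': API_KEY,
    'url': 'http://www.google.com/search?q=scrapy&num=10',
    'autoparse': 'true',
}
response = requests.get('http://api.scraperapi.com/?' + urlencode(payload))
data = response.json()
print(data['organic_results'][0]['title'])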


Scrapy Installation and Setup

First things first, the requirements for this tutorial are very straightforward:

• You will need Python version 3 or later
• And pip, to install the necessary software packages

So, assuming you have both of those things, you only need to run the following command in your terminal to install Scrapy:

pip install scrapy

Scrapy will automatically create the default project folders and files for you when you run the startproject command. So navigate to the folder where you want your project to live, and then run the following commands:

scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com

First, Scrapy will create a new project folder called “google_scraper,” which is also the project name. We then navigate into this folder and run the “genspider” command, which will generate a web scraper for us with the name “google.”

You should now see a bunch of configuration files, a “spiders” folder with your scraper(s), and a Python modules folder with some package files.
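If everything worked, the project folder should look roughly like this (the standard layout Scrapy generates, with google.py being the spider we just created):

google_scraper/
├── scrapy.cfg              # project configuration file
└── google_scraper/         # the project’s Python module
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py         # project settings (we’ll touch this later)
    └── spiders/
        ├── __init__.py
        └── google.py       # the spider generated by genspider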


Building URLs to Query Google

As you might expect, Google uses a very standard and easy-to-query URL structure. To build a URL to query Google with, you only need to know the URL parameters for the data you need. In this tutorial, I’ll use some of the parameters that will be the most useful for the majority of web scraping projects.

Every Google Search query will start with the following base URL:

http://www.google.com/search

You can then build out your query simply by adding one or more of the following parameters:

  • q: the search query keywords
  • num: the number of results to return per page (up to 100)
  • as_sitesearch: restrict results to a particular domain
  • hl: the interface language (e.g. en)
  • gl: the country to geolocate the search to
  • safe: set to active to only return safe results

There are many more parameters to use for querying Google, such as date, encoding, or even operators such as ‘or’ or ‘and’ to implement some basic logic.
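For example, a hand-built query URL that searches github.com for “scrapy” in English and asks for 20 results would look something like this (illustrative values):

http://www.google.com/search?q=scrapy&num=20&hl=en&as_sitesearch=github.com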


Building the Google Search Query URL

Below is the function I’ll be using to build the Google Search query URL. It creates a dictionary with key-value pairs for the q, num, and as_sitesearch parameters. If you want to add more parameters, this is where you could do it.

If no site is specified, it will return a URL without the as_sitesearch parameter. If one is specified, it will first extract the network location using netloc (e.g. amazon.com), then add this key-value pair to google_dict, and, finally, encode it into the returned URL with the other parameters:

from urllib.parse import urlparse
from urllib.parse import urlencode

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)
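For example, calling the function with a keyword and a site (illustrative values) produces a ready-to-use query URL:

create_google_url('scrapy', site='https://www.amazon.com/')
# 'http://www.google.com/search?q=scrapy&num=100&as_sitesearch=www.amazon.com'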

Connecting to a Proxy via the Scraper API

When scraping an internet service like Google, you will need to use a proxy if you want to scrape at any reasonable scale. If you don’t, you could get flagged by its anti-bot countermeasures and get your IP banned. Thankfully, you can use Scraper API’s proxy solution for free for up to 5,000 API calls, using up to 10 concurrent threads. You can also use some of Scraper API’s more advanced features, such as geotargeting, JS rendering, and residential proxies.

To use the proxy, just head here to sign up for free. Once you have signed up, find your API key in the dashboard, as you’ll need it to set up the proxy connection.

The proxy is incredibly easy to implement into your web spider. In the get_url function below, we’ll create a payload with our Scraper API key and the URL we built in the create_google_url function. We’ll also enable the autoparse feature here, as well as set the proxy location to the U.S.:

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

To send our request via one of Scraper API’s proxy pools, we only need to append our query URL to Scraper API’s proxy URL. This will return the information that we requested from Google and that we’ll parse later on.
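Concretely, passing the query URL we built earlier through get_url wraps it, URL-encoded, inside a Scraper API request along these lines (illustrative output with a placeholder key):

get_url('http://www.google.com/search?q=scrapy&num=100')
# 'http://api.scraperapi.com/?api_key=YOUR_KEY&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dscrapy%26num%3D100&autoparse=true&country_code=us'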


Querying Google Search

The start_requests function is where we will set everything into motion. It will iterate through a list of queries that will be sent through to the create_google_url function as keywords for our query URL.

def start_requests(self):
    queries = ['scrapy', 'beautifulsoup']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

The query URL we built will then be sent as a request to Google Search using Scrapy’s yield, via the proxy connection we set up in the get_url function. The result (which should be in JSON format) will then be sent to the parse function to be processed. We also add the {'pos': 0} key-value pair to the meta parameter, which is used to keep track of the position of each result as we scrape.


Scraping the Google Search Results

Because we used Scraper API’s autoparse functionality to return data in JSON format, parsing is very straightforward. We just need to select the data we want from the response dictionary.

First of all, we’ll load the entire JSON response and then iterate through each result, extracting some information and then putting it together into a new item we can use later on.

This process also checks to see if there is another page of results. If there is, it invokes yield scrapy.Request again and sends the results to the parse function. In the meantime, pos is used to keep track of the position of each result we have scraped:

def parse(self, response):
    di = json.loads(response.text)
    pos = response.meta['pos']
    dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for result in di['organic_results']:
        title = result['title']
        snippet = result['snippet']
        link = result['link']
        item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
        pos += 1
        yield item
    next_page = di['pagination']['nextPageUrl']
    if next_page:
        yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
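For reference, the parts of the autoparse JSON that this parse method relies on look roughly like this (an illustrative, heavily trimmed sketch; the field values are placeholders):

# Rough shape of the parsed response (trimmed to the fields we use):
di = {
    'organic_results': [
        {'title': '...', 'snippet': '...', 'link': '...'},
        # ... one entry per result, up to num results per page
    ],
    'pagination': {
        'nextPageUrl': '...',  # URL of the next results page, if any
    },
    # plus other fields such as related links, images and thumbnails
}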

Putting it All Together and Running the Spider

You should now have a solid understanding of how the spider works and how it all fits together. The spider we created, google.py, should now have the following contents:

import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_KEY'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
                       'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})

Before testing the scraper we need to configure the settings to allow it to integrate with the Scraper API free plan with 10 concurrent threads.

To do this, we define the following custom settings in our spider class:

custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 
                       'RETRY_TIMES': 5}

We set the concurrency to 10 threads to match the Scraper API free plan and set RETRY_TIMES to tell Scrapy to retry any failed requests 5 times. In the settings.py file, we also need to make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled, as these will lower your concurrency and are not needed with Scraper API.
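In practice, that just means leaving those two settings disabled (commented out) in the generated settings.py, a sketch of which looks like this:

## settings.py

# Leave these disabled; they throttle Scrapy's concurrency and are not
# needed when every request is routed through Scraper API.
# DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY = True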

To test or run the spider, just make sure you are in the project directory and then run the following crawl command, which will also output the results to a .csv file:

scrapy crawl google -o test.csv

If all goes according to plan, the spider will scrape Google Search for all the keywords you provide. By using a proxy, you’ll also avoid getting banned for using a bot.


Setting Up Monitoring

To monitor our scraper we're going to use ScrapeOps, a free monitoring and alerting tool dedicated to web scraping.

With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling, and data validation functionality you need for web scraping straight out of the box.

Live demo here: ScrapeOps Demo

Getting setup with ScrapeOps is simple. Just install the Python package:

pip install scrapeops-scrapy

And add 3 lines to your settings.py file:

## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

From there, our scraping stats will be automatically logged and shipped to our dashboard.

ScrapeOps Dashboard


If you would like to run the spider for yourself or modify it for your particular Google project, then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API key (the API_KEY in the code above) by signing up for a free account here.

Top comments (5)

Kai Gray

This is really great, thank you! How could I get the original query to be written out alongside the result? For instance, if the search query is "soap" I'd be interested in the CSV having a column that listed that search query. I see ['search_information']['query_displayed'] in the response.text but cannot print it out as part of "item"

Many thanks

ana123kots

Hi, I had the same issue. I solved it by adding 2 lines after creating the item dict:

query_displayed = di['search_information']
item['query_displayed'] = query_displayed['query_displayed']

iottrends

Good article... Got a start... I was looking to build serpbot kind of feature to track my serps...
Observed the queries don't have the location tag...
Right now I am running it on a Raspberry Pi Model 3B+. I will continue to enhance it... In the meantime, is there a way to do social media posting using similar kind of stuff...

urbanpharming

I need it to follow the links and collect all of the pages' h2s and h3s. How would I approach that? I think we're not using the classic shell, but as far as I know it's needed for link following.
Many thanks!

djmystica

Hi,
great post!
is there a way to scrape the information on the right side when they show up?
monosnap.com/file/5PkiCRXw0MWAPicx...
thanks!