Dmitriy Zub ☀️

Posted on • Edited on • Originally published at serpapi.com

Scrape Google Scholar Case Law Results to CSV with Python and SerpApi

What will be scraped

[Image: Google Scholar case law results that will be scraped]

Prerequisites

Separate virtual environment

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar.

In short, a virtual environment is an independent set of installed libraries, optionally with its own Python version, that can coexist with other environments on the same system. This prevents library or Python version conflicts.
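To confirm that the environment is actually active before installing anything, a quick check from Python can help. This is a minimal sketch for environments created with python -m venv; in_virtualenv is a hypothetical helper name and not part of the scraper code.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

If this prints False after activation, the shell is most likely still using the system-wide interpreter.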

Install libraries:

pip install pandas google-search-results  

Scrape and save Google Scholar Case Law results to CSV

If you don't need an explanation, try it in the online IDE.

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd

def case_law_results():

    print("Extracting case law results..")

    params = {
        "api_key": os.getenv("API_KEY"),  # SerpApi API key
        "engine": "google_scholar",       # Google Scholar search results
        "q": "minecraft education ",      # search query
        "hl": "en",                       # language
        "start": "0",                     # first page
        "as_sdt": "6"                     # case law results. Weird, huh? Try without it.
    }
    search = GoogleSearch(params)

    case_law_results_data = []

    while True:
        results = search.get_dict()

        # stop on any backend service error or failed search
        if "error" in results:
            break

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            title = result.get("title")
            publication_info_summary = result.get("publication_info", {}).get("summary")
            result_id = result.get("result_id")
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            # "resources" is absent when a result has no attached file
            try:
                file_title = result["resources"][0]["title"]
            except (KeyError, IndexError):
                file_title = None

            try:
                file_link = result["resources"][0]["link"]
            except (KeyError, IndexError):
                file_link = None

            try:
                file_format = result["resources"][0]["file_format"]
            except (KeyError, IndexError):
                file_format = None

            # a missing key falls back to an empty dict {}
            cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
            cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
            cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
            total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
            all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
            all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})

            case_law_results_data.append({
                "page_number": results['serpapi_pagination']['current'],
                "position": result["position"] + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                "cited_by_count": cited_by_count,
                "cited_by_link": cited_by_link,
                "cited_by_id": cited_by_id,
                "total_versions": total_versions,
                "all_versions_link": all_versions_link,
                "all_versions_id": all_versions_id,
                "file_format": file_format,
                "file_title": file_title,
                "file_link": file_link
            })

        # next page present -> merge its query parameters into the current search params
        # no next page -> exit the while loop
        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            break

    return case_law_results_data


def save_case_law_results_to_csv():
    print("Waiting for case law results to save..")
    pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)

    print("Case Law Results Saved.")


if __name__ == "__main__":
    save_case_law_results_to_csv()

Code explanation

Import libraries:

import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
  • pandas is used to save the extracted data to a CSV file.
  • urllib is used to parse the next page URL during pagination.
  • os is used to read the SerpApi API key from an environment variable.
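Since os.getenv() quietly returns None when the variable is not set, it can be useful to fail early with a readable message. The following is a minimal sketch, not part of the original script; read_api_key is a hypothetical helper name, and API_KEY is the environment variable name used in this post.

```python
import os


def read_api_key(env_var: str = "API_KEY") -> str:
    # os.getenv returns None when the variable is not set,
    # so stop here instead of sending a request without a key.
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return api_key
```

With this helper, the "api_key" entry of the search parameters could be written as read_api_key() instead of os.getenv("API_KEY").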

Create the search parameters, pass them to SerpApi, and create a temporary list to store the extracted data:

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # Google Scholar search results
    "q": "minecraft education ",      # search query
    "hl": "en",                       # language
    "start": "0",                     # first page
    "as_sdt": "6"                     # case law results
}
search = GoogleSearch(params)

case_law_results_data = []

as_sdt determines which court(s) an API call targets. Refer to the supported SerpApi Google Scholar courts, or select courts on Google Scholar and pass the resulting value to the as_sdt parameter.

Note: to search results for the Missouri Court of Appeals, the parameter becomes as_sdt=4,204. Pay attention to the leading 4, because without it, article results will appear instead.
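The rule above can be wrapped in a small helper so the 4, prefix is never forgotten. This is only a sketch: build_as_sdt is a hypothetical name, and joining several court IDs with commas is an assumption that should be checked against the SerpApi list of supported courts.

```python
def build_as_sdt(court_ids=None):
    # No specific court: "6" is the value used in this post for case law results.
    if not court_ids:
        return "6"
    # Specific court(s): the value has to start with "4," followed by the court ID(s).
    # Passing more than one ID this way is an assumption, not confirmed by this post.
    return ",".join(["4", *map(str, court_ids)])


print(build_as_sdt())       # 6
print(build_as_sdt([204]))  # 4,204
```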

Set up a while loop and add if statements to exit it:

while True:
    results = search.get_dict()

    # stop on any backend service error or failed search
    if "error" in results:
        break

    # extraction code here...

    # next page present -> merge its query parameters into the current search params
    # no next page -> exit the while loop
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        break

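urlsplit() splits the next page URL into parts, parse_qsl() turns its query string into key-value pairs, and search.params_dict.update() merges those values into the parameters of the existing search object, so the next search.get_dict() call requests the next page.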
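To see what that one-liner does, here is a standalone sketch that uses only the standard library. The next_page_url value is a made-up example shaped like results["serpapi_pagination"]["next"], and a plain dictionary stands in for search.params_dict.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next page URL, shaped like results["serpapi_pagination"]["next"]
next_page_url = (
    "https://serpapi.com/search.json"
    "?as_sdt=6&engine=google_scholar&hl=en&q=minecraft+education&start=10"
)

# Stand-in for search.params_dict on the first page
params = {"engine": "google_scholar", "q": "minecraft education", "hl": "en", "start": "0", "as_sdt": "6"}

# urlsplit() isolates the query string, parse_qsl() turns it into (key, value) pairs
next_page_params = dict(parse_qsl(urlsplit(next_page_url).query))

# update() overwrites matching keys, so "start" moves from "0" to "10"
params.update(next_page_params)
print(params["start"])  # 10
```

Only the values that differ, here start, actually change. The rest are overwritten with identical values.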

Extract results in a for loop and handle exceptions:

for result in results["organic_results"]:
    title = result.get("title")
    publication_info_summary = result.get("publication_info", {}).get("summary")
    result_id = result.get("result_id")
    link = result.get("link")
    result_type = result.get("type")
    snippet = result.get("snippet")

    # "resources" is absent when a result has no attached file
    try:
        file_title = result["resources"][0]["title"]
    except (KeyError, IndexError):
        file_title = None

    try:
        file_link = result["resources"][0]["link"]
    except (KeyError, IndexError):
        file_link = None

    try:
        file_format = result["resources"][0]["file_format"]
    except (KeyError, IndexError):
        file_format = None

    # a missing key falls back to an empty dict {}
    cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
    cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
    cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
    total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
    all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
    all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})
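The chained .get() calls can be tried on plain dictionaries to see how the fallbacks behave. This is a small sketch with made-up data; cited_by_total is a hypothetical helper, and using None as the final default is a variation on the snippet above, which stores {} instead.

```python
# Hypothetical results: one with citation data, one without "inline_links"
with_links = {"inline_links": {"cited_by": {"total": 12}}}
without_links = {"title": "No inline links here"}


def cited_by_total(result, default=None):
    # Each intermediate .get() falls back to {} so the next .get() still works;
    # only the last .get() decides what a missing value looks like.
    return result.get("inline_links", {}).get("cited_by", {}).get("total", default)


print(cited_by_total(with_links))         # 12
print(cited_by_total(without_links))      # None
print(cited_by_total(without_links, {}))  # {}
```

A final default of None ends up as an empty cell in the CSV, while {} is written out as the literal text {}.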

Append results to temporary list() as a dictionary {}:

case_law_results_data.append({
    "page_number": results['serpapi_pagination']['current'],
    "position": result["position"] + 1,
    "result_type": result_type,
    "title": title,
    "link": link,
    "result_id": result_id,
    "publication_info_summary": publication_info_summary,
    "snippet": snippet,
    "cited_by_count": cited_by_count,
    "cited_by_link": cited_by_link,
    "cited_by_id": cited_by_id,
    "total_versions": total_versions,
    "all_versions_link": all_versions_link,
    "all_versions_id": all_versions_id,
    "file_format": file_format,
    "file_title": file_title,
    "file_link": file_link
})

Return extracted data:

return case_law_results_data

Save returned case_law_results() data to_csv():

pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)
  • data is the list of dictionaries returned by case_law_results().
  • encoding="utf-8" makes sure everything is saved correctly. I set it explicitly even though it's the default value.
  • index=False drops the default pandas row numbers.
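To preview the CSV layout without calling the API, the same to_csv() call can be pointed at an in-memory buffer. This is a minimal sketch with made-up rows; only three of the columns are used to keep it short.

```python
import io

import pandas as pd

# Made-up rows shaped like the dictionaries appended to case_law_results_data
rows = [
    {"position": 1, "title": "Case A", "file_link": None},
    {"position": 2, "title": "Case B", "file_link": "https://example.com/b.pdf"},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
pd.DataFrame(data=rows).to_csv(buffer, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

The dictionary keys become the header row, each dictionary becomes one line, and a None value is written as an empty cell. With index=False there is no extra leading column of row numbers.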

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
