What will be scraped
Prerequisites
Separate virtual environment
If you're on Linux:
python -m venv env && source env/bin/activate
If you're on Windows and using Git Bash:
python -m venv env && source env/Scripts/activate
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar.
In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, thus preventing library or Python version conflicts.
Install libraries:
pip install pandas google-search-results
Scrape and save Google Scholar Case Law results to CSV
If you don't need an explanation, try it in the online IDE.
import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd


def case_law_results():
    print("Extracting case law results..")

    params = {
        "api_key": os.getenv("API_KEY"),  # SerpApi API key
        "engine": "google_scholar",       # Google Scholar search results
        "q": "minecraft education",       # search query
        "hl": "en",                       # language
        "start": "0",                     # first page
        "as_sdt": "6"                     # case law results. Weird, huh? Try without it.
    }

    search = GoogleSearch(params)

    case_law_results_data = []

    while True:
        results = search.get_dict()

        if "error" in results:
            break

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            title = result.get("title")
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result.get("result_id")
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            try:
                file_title = result["resources"][0]["title"]
            except (KeyError, IndexError):
                file_title = None

            try:
                file_link = result["resources"][0]["link"]
            except (KeyError, IndexError):
                file_link = None

            try:
                file_format = result["resources"][0]["file_format"]
            except (KeyError, IndexError):
                file_format = None

            cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
            cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
            cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
            total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
            all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
            all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})

            case_law_results_data.append({
                "page_number": results["serpapi_pagination"]["current"],
                "position": result["position"] + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                "cited_by_count": cited_by_count,
                "cited_by_link": cited_by_link,
                "cited_by_id": cited_by_id,
                "total_versions": total_versions,
                "all_versions_link": all_versions_link,
                "all_versions_id": all_versions_id,
                "file_format": file_format,
                "file_title": file_title,
                "file_link": file_link
            })

        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            break

    return case_law_results_data


def save_case_law_results_to_csv():
    print("Waiting for case law results to save..")

    pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)

    print("Case Law Results Saved.")
Code explanation
Import libraries:
import os
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import pandas as pd
- pandas will be used to easily save the extracted data to a CSV file.
- urllib will be used in the pagination process.
- os is used to return the value of the SerpApi API key environment variable.
Create the search parameters, pass them to SerpApi, and create a temporary list() to store the extracted data:
params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # Google Scholar search results
    "q": "minecraft education",       # search query
    "hl": "en",                       # language
    "start": "0",                     # first page
    "as_sdt": "6"                     # case law results
}

search = GoogleSearch(params)

case_law_results_data = []
as_sdt is used to determine and filter which court(s) are targeted in an API call. Refer to the supported SerpApi Google Scholar Courts, or select courts on Google Scholar and pass them to the as_sdt parameter.
Note: if you want to search results for the Missouri Court Of Appeals, the as_sdt parameter would become as_sdt=4,204. Pay attention to the 4, prefix; without it, article results will appear instead.
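As a quick sketch, the only change needed to target a specific court is the as_sdt value; everything else stays the same (the 204 court code is the Missouri Court Of Appeals example from the note above):

```python
import os

# Same parameters as before, but targeting the Missouri Court Of Appeals
params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",
    "q": "minecraft education",
    "hl": "en",
    "start": "0",
    "as_sdt": "4,204"  # "4," selects case law; "204" is the Missouri Court Of Appeals
}

print(params["as_sdt"])  # -> 4,204
```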
Set up a while loop and add an if statement to be able to exit the loop:
while True:
    results = search.get_dict()

    # if there is any backend service error or the search fails
    if "error" in results:
        break

    # extraction code here...

    # if the next page is present -> update previous results to new page results.
    # if the next page is not present -> exit the while loop.
    if "next" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
    else:
        break
urlsplit() and parse_qsl() split the next-page URL into its query parameters, and search.params_dict.update() merges them back into the search parameters as a dictionary, so the next search.get_dict() call requests the next page.
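A minimal sketch of what that update does, using a made-up next-page URL in the shape that serpapi_pagination["next"] returns:

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical next-page URL (the real one comes from serpapi_pagination["next"])
next_page_url = "https://serpapi.com/search.json?engine=google_scholar&q=minecraft+education&hl=en&start=10&as_sdt=6"

# Split the URL, keep only its query string, and turn it into a dict
new_params = dict(parse_qsl(urlsplit(next_page_url).query))

print(new_params["start"])  # -> 10 (the offset of the next results page)
print(new_params["q"])      # -> minecraft education
```

Passing this dict to search.params_dict.update() overwrites only the changed values, such as start, while keeping the rest of the original parameters.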
Extract results in a for loop and handle exceptions:
for result in results["organic_results"]:
    title = result.get("title")
    publication_info_summary = result["publication_info"]["summary"]
    result_id = result.get("result_id")
    link = result.get("link")
    result_type = result.get("type")
    snippet = result.get("snippet")

    try:
        file_title = result["resources"][0]["title"]
    except (KeyError, IndexError):
        file_title = None

    try:
        file_link = result["resources"][0]["link"]
    except (KeyError, IndexError):
        file_link = None

    try:
        file_format = result["resources"][0]["file_format"]
    except (KeyError, IndexError):
        file_format = None

    # if something is missing, the chained .get() calls return an empty {} dict
    cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
    cited_by_id = result.get("inline_links", {}).get("cited_by", {}).get("cites_id", {})
    cited_by_link = result.get("inline_links", {}).get("cited_by", {}).get("link", {})
    total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
    all_versions_link = result.get("inline_links", {}).get("versions", {}).get("link", {})
    all_versions_id = result.get("inline_links", {}).get("versions", {}).get("cluster_id", {})
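The difference between the try/except blocks and the chained .get() calls can be shown on a minimal hypothetical result (made-up data, not a real API response):

```python
# Hypothetical result with no "versions" inline link and no "resources" list
result = {
    "title": "Some case",
    "inline_links": {
        "cited_by": {"total": 7}
    }
}

# Present key chain -> the actual value
cited_by_count = result.get("inline_links", {}).get("cited_by", {}).get("total", {})
print(cited_by_count)  # -> 7

# Missing key chain -> the {} default instead of a KeyError
total_versions = result.get("inline_links", {}).get("versions", {}).get("total", {})
print(total_versions)  # -> {}

# Indexing into a missing "resources" list raises, hence the try/except
try:
    file_title = result["resources"][0]["title"]
except (KeyError, IndexError):
    file_title = None
print(file_title)  # -> None
```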
Append the results to the temporary list() as a dictionary {}:
case_law_results_data.append({
    "page_number": results["serpapi_pagination"]["current"],
    "position": result["position"] + 1,
    "result_type": result_type,
    "title": title,
    "link": link,
    "result_id": result_id,
    "publication_info_summary": publication_info_summary,
    "snippet": snippet,
    "cited_by_count": cited_by_count,
    "cited_by_link": cited_by_link,
    "cited_by_id": cited_by_id,
    "total_versions": total_versions,
    "all_versions_link": all_versions_link,
    "all_versions_id": all_versions_id,
    "file_format": file_format,
    "file_title": file_title,
    "file_link": file_link
})
Return the extracted data:
return case_law_results_data
Save the returned case_law_results() data with to_csv():
pd.DataFrame(data=case_law_results()).to_csv("google_scholar_case_law_results.csv", encoding="utf-8", index=False)
- the data argument inside DataFrame is your data.
- the encoding="utf-8" argument makes sure everything is saved correctly. It's used explicitly even though it's the default value.
- the index=False argument drops the default pandas row numbers.
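A small sketch of index=False in action, writing a made-up row to an in-memory buffer instead of a file:

```python
import pandas as pd
from io import StringIO

# Hypothetical extracted row (real rows have many more fields)
sample_data = [{"title": "Some case", "cited_by_count": 7}]

buffer = StringIO()
pd.DataFrame(data=sample_data).to_csv(buffer, index=False)

print(buffer.getvalue())
# title,cited_by_count
# Some case,7
```

Without index=False, an extra unnamed first column of row numbers (0, 1, 2, ...) would appear in the CSV.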
Links
- Code in the online IDE
- Google Scholar Organic Results API
- SerpApi supported Google Scholar Courts
- List of Google Scholar Courts
Add a Feature Request💫 or a Bug🐞