Dmitriy Zub ☀️ for SerpApi

Posted on May 2, 2022 • Edited on Nov 28, 2022 • Originally published at serpapi.com

Scrape Google Scholar Publications within a certain website using Python

#programming #database #python #tutorial

What will be scraped
How filtering works
Prerequisites
Full Code
- Code Explanation
Links
Outro

What will be scraped

How filtering works

To filter results by a certain website, you need to use site: operator which restricts search results to papers published by websites containing <website_name> in their name.

This operator can be used in addition to OR operator i.e site:cabdirect.org OR site:<other_website>. So the search query would become:

search terms site:cabdirect.org OR site:<other_website>

Prerequisites

Basic knowledge scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to thus allowing to extract of data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what it is, pros and cons, and why they're matter from a web-scraping perspective and show the most common approaches of using CSS selectors when web scraping.

Separate virtual environment

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

In short, it's a thing that creates an independent set of installed libraries including different Python versions that can coexist with each other in the same system thus preventing libraries or Python version conflicts.

If you didn't work with a virtual environment before, have a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests parsel google-search-results

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, there are eleven methods to bypass blocks from most websites.

Full Code

from parsel import Selector
import requests, json, os


def check_websites(website: list or str):
    if isinstance(website, str):
        return website                                           # cabdirect.org
    elif isinstance(website, list):
        return " OR ".join([f'site:{site}' for site in website]) # site:cabdirect.org OR site:cab.net


def scrape_website_publications(query: str, website: list or str):

    """
    Add a search query and site or multiple websites.

    Following will work:
    ["cabdirect.org", "lololo.com", "brabus.org"] -> list[str]
    ["cabdirect.org"]                             -> list[str]
    "cabdirect.org"                               -> str
    """

    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": f'{query.lower()} {check_websites(website=website)}',  # search query
        "hl": "en",                                                 # language of the search
        "gl": "us"                                                  # country of the search
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    publications = []

    # iterate over every element from organic results from the first page and extract the data
    for result in selector.css(".gs_r.gs_scl"):
        title = result.css(".gs_rt").xpath("normalize-space()").get()
        link = result.css(".gs_rt a::attr(href)").get()
        result_id = result.attrib["data-cid"]
        snippet = result.css(".gs_rs::text").get()
        publication_info = result.css(".gs_a").xpath("normalize-space()").get()
        cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
        all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
        related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'

        publications.append({
            "result_id": result_id,
            "title": title,
            "link": link,
            "snippet": snippet,
            "publication_info": publication_info,
            "cite_by_link": cite_by_link,
            "all_versions_link": all_versions_link,
            "related_articles_link": related_articles_link,
        })

    # print or return the results
    # return publications

    print(json.dumps(publications, indent=2, ensure_ascii=False))


scrape_website_publications(query="biology", website="cabdirect.org")

Code Explanation

Import libraries and define a function:

from parsel import Selector
import requests, json, os

Create a function to check if website argument is either a list of str or a string:

# check if returned website argument is string or a list

def check_websites(website: list or str):
    if isinstance(website, str):
        return website                                           # cabdirect.org
    elif isinstance(website, list):
        return " OR ".join([f'site:{site}' for site in website]) # site:cabdirect.org OR site:cab.com

Define a parse function:

def scrape_website_publications(query: str, website: list or str):
    # further code

Code	Explanation
`query: str`/`website: list or str`	to tell Python that `query` and `website` arguments should be with a type of `list` of `strings` or a `string`

Create search query parameters, request headers, pass them to request:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": f'{query.lower()} site:{website}',  # search query
    "hl": "en",                              # language of the search
    "gl": "us"                               # country of the search
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

Code	Explanation
`params`	is a query parameters that passed to `requests.get()` as a `dict`
`haeders`	is request headers, and `user-agent` is a thing that is used to act as a "real" user visit so websites (not all and in all cases) don't block the request. We need to pass our `user-agent` because the default `requests` `user-agent` is `python-requests` so websites understand that it's a script.
`timeout`	to tell requests to stop waiting for a response after 30 seconds.

Create a temporary list, iterate over all organic results, and extract the data:

publications = []

# iterate over every element from organic results from the first page and extract the data
for result in selector.css(".gs_r.gs_scl"):
    title = result.css(".gs_rt").xpath("normalize-space()").get()
    link = result.css(".gs_rt a::attr(href)").get()
    result_id = result.attrib["data-cid"]
    snippet = result.css(".gs_rs::text").get()
    publication_info = result.css(".gs_a").xpath("normalize-space()").get()
    cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
    all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
    related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'

Code	Explanation
`css(<selector>)`	to extarct data from a given CSS selector. In the background `parsel` translates every CSS query into XPath query using `cssselect`.
`xpath("normalize-space()")`	to get blank text nodes as well. By default, blank text nodes will be skipped resulting not a complete output.
`::text`/`::attr()`	is a `parsel` pseudo-elements to extract text or attribute data from the HTML node.
`get()`	to get actual data.

Append extracted data to the list as a dict, and return or print the results:

publications.append({
    "result_id": result_id,
    "title": title,
    "link": link,
    "snippet": snippet,
    "publication_info": publication_info,
    "cite_by_link": cite_by_link,
    "all_versions_link": all_versions_link,
    "related_articles_link": related_articles_link,
})

# print or return the results
# return publications

print(json.dumps(publications, indent=2, ensure_ascii=False))


# call the function
scrape_website_publications(query="biology", website="cabdirect.org")

Outputs:

[
  {
    "result_id": "6zRLFbcxtREJ",
    "title": "The biology of mycorrhiza.",
    "link": "https://www.cabdirect.org/cabdirect/abstract/19690600367",
    "snippet": "In the second, revised and extended, edition of this work [cf. FA 20 No. 4264], two new ",
    "publication_info": "JL Harley - The biology of mycorrhiza., 1969 - cabdirect.org",
    "cite_by_link": "https://scholar.google.com/scholar/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en",
    "all_versions_link": "https://scholar.google.com/scholar/scholar?cluster=1275980731835430123&hl=en&as_sdt=0,5",
    "related_articles_link": "https://scholar.google.com/scholar/scholar?q=related:6zRLFbcxtREJ:scholar.google.com/&scioq=biology+site:cabdirect.org&hl=en&as_sdt=0,5"
  }, ... other results
]

Alternatively, you can do the same thing using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to create the parser from scratch, maintain it, figure out how to scale it, how bypass blocks from Google, and figure out which proxy/captcha providers are good.

import os, json
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl


def serpapi_scrape(query: str, website: str):
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"), # your serpapi API key
        "engine": "google_scholar",      # search engine
        "q": f"{query} site:{website}",  # search query
        "hl": "en",                      # language
        # "as_ylo": "2017",              # from 2017
        # "as_yhi": "2021",              # to 2021
        "start": "0"                     # first page
    }

    search = GoogleSearch(params)

    publications = []

    publications_is_present = True
    while publications_is_present:
        results = search.get_dict()

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            position = result["position"]
            title = result["title"]
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result["result_id"]
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            publications.append({
                "page_number": results.get("serpapi_pagination", {}).get("current"),
                "position": position + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                })


        if "next" in results.get("serpapi_pagination", {}):
            # splits URL in parts as a dict and passes it to a GoogleSearch() class.
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            papers_is_present = False

    print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

Links

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞

Top comments (1)

Dmitriy Zub ☀️ • May 18 '22

Thank you 💫

DEV Community

Scrape Google Scholar Publications within a certain website using Python

What will be scraped

How filtering works

Prerequisites

Full Code

Code Explanation

Links

Top comments (1)

Read next

Setting Up C/C++ Development Environment in Visual Studio Code Using MinGW

How to create own Python project in 5 minutes

HubSpot offers a powerful platform for creating user interfaces (UI) and serverless functions using React, allowing users to develop highly customizable pages and forms directly within the CRM itself.

Part 1: Master Authentication and Role-Based Access Control (RBAC) with Kinde and Convex in a File-Sharing Application