DEV Community

Scrape Google Scholar with Python

Dmitriy Zub ☀️ on May 30, 2021

Mohammad Reza

Thank you, dear Dmitriy, for a great post.

I have two problems when scraping profiles:

1. When I use your code, it returns 0 pages (I don't want to use SerpApi) or an "unusual traffic" message, and I couldn't solve it (maybe I have been blocked by Google).

2. I wrote code that works for me using urllib, but I've noticed that the query only returns exact label matches, not substring matches. For example, when I look for "ecology" at Michigan State University, the results don't include "Scott C Stark", but if I write exactly "Tropical_Forest_Ecology", they do.
How can I make the query match any word, even as a substring of a label?

Dmitriy Zub ☀️ • Edited

Hey, @mohammadreza20 🙂

  1. When I use your code, it returns 0 pages (I don't want to use SerpApi) or an "unusual traffic" message, and I couldn't solve it (maybe I have been blocked by Google).

Yes, most likely you've been blocked by Google, but I don't have enough context to give a proper answer 🙂

Have a look at the custom backend solution in scrape-google-scholar-py. It uses selenium-stealth under the hood, which bypasses Cloudflare CAPTCHA and other CAPTCHAs (and IP rate limits).

It's a package of mine. Note that it's in early alpha. Open an issue if you find any bugs.

  2. I wrote code that works for me using urllib, but I've noticed that the query only returns exact label matches, not substring matches. For example, when I look for "ecology" at Michigan State University, the results don't include "Scott C Stark", but if I write exactly "Tropical_Forest_Ecology", they do.

Without seeing your code, I can only give a very generic answer, which wouldn't be helpful. Show the code where you're having difficulties if you want a deeper answer.

Mohammad Reza • Edited

@dmitryzub Thank you for your response.

Question 2 is the important one for me, but I couldn't solve it.

My code:

from time import sleep
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

label = "ecology"
university_name = "Michigan University"

# `url` was undefined in the original snippet; assumed here to be the
# Google Scholar citations endpoint used elsewhere in this thread
url = "https://scholar.google.com/citations?"

params = {
    "view_op": "search_authors",                       # author results
    "mauthors": f'label:{label} "{university_name}"',  # search query
    "hl": "en",                                        # language
    "astart": 0                                        # page number
}

data = urllib.parse.urlencode(params)
req = urllib.request.Request(url + data)
resp = urllib.request.urlopen(req).read()
soup = BeautifulSoup(resp, "html5lib")

proftags = soup.find_all("div", {"class": "gsc_1usr"})
for mytag in proftags:
    quote = {}  # fresh dict for every profile
    quote["name"] = mytag.find("h3", {"class": "gs_ai_name"}).text
    quote["email"] = mytag.find("div", {"class": "gs_ai_eml"}).text
    quote["affiliations"] = mytag.find("div", {"class": "gs_ai_aff"}).text
    quote["cited_by"] = mytag.find("div", {"class": "gs_ai_cby"}).text
    quote["interests"] = [item.text for item in mytag.find_all("a", {"class": "gs_ai_one_int"})]
    sleep(2)
    print(quote)
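
One possible workaround for the substring question, sketched under assumptions: if label: only matches a whole label, a broader query (for example the university name without label:) could be scraped first and the interests filtered locally. filter_by_interest is a hypothetical helper, and the second profile below is made-up sample data in the same shape as the quote dict above.

```python
def filter_by_interest(profiles, keyword):
    """Keep profiles with at least one interest containing `keyword` (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    for profile in profiles:
        # Labels may use underscores instead of spaces, e.g. "Tropical_Forest_Ecology"
        interests = [i.replace("_", " ").lower() for i in profile.get("interests", [])]
        if any(needle in interest for interest in interests):
            matches.append(profile)
    return matches


# Sample data in the same shape as the scraped dicts (second entry is made up)
profiles = [
    {"name": "Scott C Stark", "interests": ["Tropical_Forest_Ecology"]},
    {"name": "Example Author", "interests": ["Particle Physics"]},
]

ecologists = filter_by_interest(profiles, "ecology")
print([p["name"] for p in ecologists])
```

The trade-off is more requests up front (the broader query returns more profiles), so this only makes sense if the broader query is itself manageable.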
 
rafambraga

Hi Dmitriy,
Thank you for your very useful article!
However, when using BeautifulSoup I can only scrape the first page, for example when scraping a profile. Likewise, if I need to scrape citations for a specific author, it only scrapes the first few citations instead of all of them. How can I scrape all the results (across multiple pages) using BeautifulSoup or SerpApi?
Thanks!

Dmitriy Zub ☀️ • Edited

Hi @rafambraga,

Thank you, I'm glad you found it helpful! Yes, the example shown in this blog post scrapes only the first page of profiles. This blog post is old and needs an upgrade.

I've answered a Stack Overflow question about scraping profile results from all pages using both bs4 and SerpApi, with an example in the online IDE.

About citations: I've also written a code snippet to scrape citations with bs4. Note that the example scrapes only BibTeX data. You'd need to add a few lines of code to scrape all of them.

Besides that, there are also dedicated blog posts on scraping historic Google Scholar results using Python and on scraping all Google Scholar Profile and Author results to CSV with Python and SerpApi.

If you need to scrape profiles from a certain university, there's also a dedicated blog post, "Scrape Google Scholar Profiles from a certain University in Python", with a step-by-step explanation 👀

Dlittlewood12

Do you have any recommendations for scraping all pages of an organic search result? I tried adding "start": 0 to the parameters and changing it manually, but it seems to repeat results occasionally. I also tried to follow your "Scrape historic Google Scholar results using Python" script but keep getting the error KeyError: 'serpapi_pagination'.

Thanks for any help you can provide!

Dmitriy Zub ☀️

@dlittlewood12 Thank you for reaching out!

Most likely you're getting the KeyError: 'serpapi_pagination' error because the script can't find your API key. It reads the key with os.getenv("API_KEY"), so you need to set that environment variable. In the terminal, type API_KEY=<your-api-key> python your-script-file.py

Or remove os.getenv() completely and pass the API key as a string inside the params dict, for example: "api_key": "2132414122asdsadadaa"
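
A defensive way to handle both the KeyError and the occasionally repeated results, as a minimal sketch rather than the blog post's actual code: read the pagination key with .get(), stop when it's missing, and skip results that were already seen. collect_all_pages and fetch_page are hypothetical names; fetch_page(start) stands in for whatever call returns one page of results as a dict. The organic_results, link, title, next, and error keys are assumptions about the response shape.

```python
def collect_all_pages(fetch_page, page_size=10, max_pages=50):
    """Collect de-duplicated results page by page until pagination runs out."""
    seen = set()
    collected = []
    start = 0

    for _ in range(max_pages):  # hard upper bound, so the loop can never run forever
        results = fetch_page(start)

        # Assumed: a failed request (e.g. missing API key) carries an "error" message
        if "error" in results:
            raise RuntimeError(results["error"])

        for item in results.get("organic_results", []):
            key = item.get("link") or item.get("title")
            if key not in seen:  # skip results repeated across pages
                seen.add(key)
                collected.append(item)

        # .get() avoids KeyError: 'serpapi_pagination' on the last page
        pagination = results.get("serpapi_pagination") or {}
        if "next" not in pagination:
            break
        start += page_size

    return collected


# Usage with made-up pages instead of real requests
fake_pages = {
    0: {"organic_results": [{"link": "a"}, {"link": "b"}], "serpapi_pagination": {"next": "page-2"}},
    10: {"organic_results": [{"link": "b"}, {"link": "c"}]},
}
all_results = collect_all_pages(lambda start: fake_pages[start])
print([r["link"] for r in all_results])
```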

Let me know if that answers your question 🎈

Dmitriy Zub ☀️

@rafambraga I've just published major updates to this blog post, which include:

  1. Added DIY pagination to organic, profile, and author article results, plus code updates in every section.
  2. Added DIY cite results extraction.
  3. All code snippets now return JSON.

And other changes 🐱‍👤🐱‍🏍

SIM-777 • Edited

Hello Dmitriy,
I would like to know if there is any parameter (using SerpApi) that lets us scrape only author profiles with a certain minimum number of citations.

Dmitriy Zub ☀️

Hey @sim777, thank you for your question. SerpApi doesn't have such a parameter, and (as far as I know) Google Scholar itself doesn't have one either.

As a workaround, you can do the check manually with an if condition: access the cited_by key of each profile in the SerpApi response and compare it with the threshold you want. If the condition is true, keep the profile.

Example code of what I mean:

from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_profiles",
    "hl": "en",
    "mauthors": "Mike",           # search query
    "api_key": "secret_api_key"   # your SerpApi API key
}

search = GoogleSearch(params)
results = search.get_dict()

profiles_with_2000_citations = []

# .get() guards against responses or profiles that lack these keys
for profile in results.get("profiles", []):
    # keep only profiles cited at least 2000 times
    if (profile.get("cited_by") or 0) >= 2000:
        profiles_with_2000_citations.append(profile)
 
SIM-777

What parameter can I use in the search query so that I can scrape authors with a certain number of citations?

Dmitriy Zub ☀️

Hey @sim777, I've answered your question in the response above. I showed a SerpApi example, but you can do the same thing with your own solution without SerpApi 👍
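
For a DIY scraper, the same filter could look roughly like this. It's a sketch that assumes each scraped profile stores the citation count as text such as "Cited by 2345" (which is how the cited_by field is read from the page in the bs4 snippet earlier in this thread). parse_cited_by and filter_by_citations are hypothetical helper names, and the profiles are made-up sample data.

```python
def parse_cited_by(text):
    """Turn text like 'Cited by 2345' into an int; return 0 if no number is present."""
    digits = "".join(ch for ch in (text or "") if ch.isdigit())
    return int(digits) if digits else 0


def filter_by_citations(profiles, minimum):
    """Keep only profiles whose citation count is at least `minimum`."""
    return [p for p in profiles if parse_cited_by(p.get("cited_by")) >= minimum]


# Made-up sample data
profiles = [
    {"name": "Author A", "cited_by": "Cited by 2345"},
    {"name": "Author B", "cited_by": "Cited by 150"},
    {"name": "Author C", "cited_by": ""},  # profile without a citation count
]

highly_cited = filter_by_citations(profiles, 2000)
print([p["name"] for p in highly_cited])
```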

Georg • Edited

Hi Dmitriy, I noticed that you only scrape the shortened descriptions of the papers, not the entire description. On the Google Scholar search results page, only a short excerpt (from the abstract, the title, or the authors if there are many) is shown, ending with an ellipsis (...). The scraper scrapes only this, leaving the rest of the information out. Do you know of a solution to this?

Dmitriy Zub ☀️

Hi, Georg! Thank you for reaching out! Unfortunately, the Google backend provides only part of the snippet, and, as you wrote, that is what the scraper scrapes.

To get the full text, you have to make another request to the target website, check whether it contains the same text as the snippet from the Google Scholar organic results, and if so, scrape the rest of it.

But this only works if the full text lives on the same website (with the same page structure) over and over again, which in most cases it doesn't. I believe it can be done, but it's a tricky task.
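
The matching step described above could be sketched like this, assuming the target page has already been downloaded and reduced to plain text (that part differs per website and is not shown). expand_snippet is a hypothetical helper, and the snippet and page text are made up.

```python
def expand_snippet(snippet, page_text, min_chars=30):
    """Return the paragraph of `page_text` that contains the snippet, or None."""
    # Snippets are truncated with "..." (or "…"); the longest piece is the best probe
    pieces = [piece.strip() for piece in snippet.replace("…", "...").split("...")]
    probe = " ".join(max(pieces, key=len).split())
    if len(probe) < min_chars:
        return None  # too short to match reliably

    for paragraph in page_text.split("\n"):
        normalized = " ".join(paragraph.split())  # collapse runs of whitespace
        if probe in normalized:
            return normalized
    return None


# Made-up example data
snippet = "... methods for scraping search results have been studied for many years ..."
page_text = (
    "Title of the page\n"
    "In this paper, methods for scraping search results have been studied for many years and compared.\n"
    "References"
)

print(expand_snippet(snippet, page_text))
```

An exact substring match is the simplest possible check; pages that reformat or hyphenate the text would need fuzzier matching.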

All the best,
Dmitriy

@george_z

shwetat26

Hi Dimitriy,

Thank you for the very useful articles!
I was following the 'Scrape historic Google Scholar results using Python' article on serpapi.com/blog/scrape-historic-g....
However, whenever I try to call cite_results() I get KeyError: 'citations'. And if I change the query I get KeyError: 'serpapi_pagination'. I also tried running my script from the terminal with 'API_KEY=my_api_key python test11.py', but nothing seems to work, especially when I change the query.
Suggestions are highly appreciated.
Thanks.

Dmitriy Zub ☀️

Hey, @shwetat26 🙂 Thank you for reaching out.

First question: are you actually replacing my_api_key with your actual API key, and test11.py with your actual Python file, in the API_KEY=my_api_key python test11.py command you've shown?

It should be something like this:

API_KEY=6d4113sdsdas7d865asdask66s79aaasa0a0s87d6794b2642es python <your_script.py>

Let me know if that makes sense.

Mohamed Hachaichi 🇺🇦 • Edited

Can we scrape the number of citations per year from the bar plot (the "cited by" graph)?

Dmitriy Zub ☀️

Do you mean the value of each individual cell?

Dmitriy Zub ☀️

@datum_geek the blog post has received major updates including graph extraction if it's something you still need 🙂

Mohamed Hachaichi 🇺🇦

Hi @dmitryzub, I'm facing issues scraping data on "samsung" from all pages of Google Scholar, including the title of each publication, its authors, the year, and the full abstract. Your code (the first piece) does not work; it renders nothing!

Dmitriy Zub ☀️ • Edited

Hi, @datum_geek.

What error do you receive? Pagination and data extraction work as expected without changes. I've just tested it in the online IDE that is linked in this post.

Ivan Jaya

Dmitriy, thanks for your tutorial. I'm a newbie at Python and I've already tried your "Scrape Google Scholar All Author Articles" script. I have some questions:

  • How can I export the result to a CSV file?
  • If there is more than one user profile, can we scrape them all at once?
 
Dmitriy Zub ☀️ • Edited

Hey @jayaivan, thank you 🙂

Great questions! We can build a solution for both.

Have a look at the examples in the online IDE: replit.com/@DimitryZub1/Google-Sch...

How to export the result into a CSV file?

We can use the pandas to_csv() method, or Python's built-in context manager together with the built-in csv library.

The main difference is that pandas is an additional dependency (an extra thing to install, which leads to a larger project size); however, pandas simplifies this task a lot.

📌Note: I'll be using json.dumps() at the very end of the code just to show what is being extracted. Delete it if you don't need it.

Using pandas (don't forget to pip install it):

pd.DataFrame(all_articles[:-1]).to_csv(f'user-{params["user"]}-articles.csv', index=False)

Actual example using code in the blog post (the line you're looking for is almost at the end of the script):

import json

import pandas as pd
import requests
from parsel import Selector

def parsel_scrape_all_author_articles():
    params = {
        'user': '_xwYD2sAAAAJ',       # user-id
        'hl': 'en',                   # language
        'gl': 'us',                   # country to search from
        'cstart': 0,                  # articles page. 0 is the first page
        'pagesize': '100'             # articles per page
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
    }

    all_articles = []

    while True:
        html = requests.get('https://scholar.google.com/citations', params=params, headers=headers, timeout=30)
        selector = Selector(text=html.text)

        for index, article in enumerate(selector.css('.gsc_a_tr'), start=1):
            article_title = article.css('.gsc_a_at::text').get()
            article_link = f"https://scholar.google.com{article.css('.gsc_a_at::attr(href)').get()}"
            article_authors = article.css('.gsc_a_at+ .gs_gray::text').get()
            article_publication = article.css('.gs_gray+ .gs_gray::text').get()

            cited_by_count = article.css('.gsc_a_ac::text').get()
            publication_year = article.css('.gsc_a_hc::text').get()

            all_articles.append({
                'position': index,
                'title': article_title,
                'link': article_link,
                'authors': article_authors,
                'publication': article_publication,
                'publication_year': publication_year,
                'cited_by_count': cited_by_count
            })

        # this selector is checking for the .class that contains: 'There are no articles in this profile.'
        # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
        if selector.css('.gsc_a_e').get():
            break
        else:
            params['cstart'] += 100  # paginate to the next page

    # [:-1] drops the last element, which is not what we want and doesn't contain any data.
    pd.DataFrame(all_articles[:-1]).to_csv(f'user-{params["user"]}-articles.csv', index=False)
    print(json.dumps(all_articles[:-1], indent=2, ensure_ascii=False))


parsel_scrape_all_author_articles()

Using context manager:

# newline='' is what the csv docs recommend, so no blank rows appear on Windows
with open(f'user-{params["user"]}-articles.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['position', 'title', 'link', 'authors', 'publication', 'publication_year', 'cited_by_count']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    # data extraction here...

    # this for loop should be inside the with open() statement
    # keep an eye on your indentation level
    for article in all_articles[:-1]:
        writer.writerow(article)

Actual example from the blog post code:

import csv
import json

import requests
from parsel import Selector

def parsel_scrape_all_author_articles():
    params = {
        'user': '_xwYD2sAAAAJ',       # user-id
        'hl': 'en',                   # language
        'gl': 'us',                   # country to search from
        'cstart': 0,                  # articles page. 0 is the first page
        'pagesize': '100'             # articles per page
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
    }


    with open(f'user-{params["user"]}-articles.csv', mode='w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['position', 'title', 'link', 'authors', 'publication', 'publication_year', 'cited_by_count']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()

        all_articles = []

        while True:
            html = requests.get('https://scholar.google.com/citations', params=params, headers=headers, timeout=30)
            selector = Selector(text=html.text)

            for index, article in enumerate(selector.css('.gsc_a_tr'), start=1):
                article_title = article.css('.gsc_a_at::text').get()
                article_link = f"https://scholar.google.com{article.css('.gsc_a_at::attr(href)').get()}"
                article_authors = article.css('.gsc_a_at+ .gs_gray::text').get()
                article_publication = article.css('.gs_gray+ .gs_gray::text').get()

                cited_by_count = article.css('.gsc_a_ac::text').get()
                publication_year = article.css('.gsc_a_hc::text').get()

                all_articles.append({
                    'position': index,
                    'title': article_title,
                    'link': article_link,
                    'authors': article_authors,
                    'publication': article_publication,
                    'publication_year': publication_year,
                    'cited_by_count': cited_by_count
                })

            # this selector is checking for the .class that contains: 'There are no articles in this profile.'
            # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
            if selector.css('.gsc_a_e').get():
                break
            else:
                params['cstart'] += 100  # paginate to the next page

        for article in all_articles[:-1]:
            writer.writerow(article)


    # [:-1] drops the last element, which is not what we want and doesn't contain any data.
    print(json.dumps(all_articles[:-1], indent=2, ensure_ascii=False))


parsel_scrape_all_author_articles()

If there is more than one user profile, can we scrape them all at once?

You need a list of user IDs. Iterate over it and extract the data as already shown in the blog post.

Keep in mind that each user means a new request, so more users means more time to extract the data. If it takes a lot of time, think about asynchronous requests, as they will speed things up quite a lot.

Iterate over the list of user IDs and pass each user_id value to params["user"], which is then passed to the search URL:

user_ids = ['_xwYD2sAAAAJ', 'OBf4YnkAAAAJ', 'xBHVqNIAAAAJ']

for user_id in user_ids:
    params = {
        'user': user_id,       # user-id
        'hl': 'en',                   # language
        'gl': 'us',                   # country to search from
        'cstart': 0,                  # articles page. 0 is the first page
        'pagesize': '100'             # articles per page
    }

    # further data extraction

Actual code:

import csv
import json

import requests
from parsel import Selector

def parsel_scrape_all_author_articles():
    user_ids = ['_xwYD2sAAAAJ', 'OBf4YnkAAAAJ', 'xBHVqNIAAAAJ']

    for user_id in user_ids:
        params = {
            'user': user_id,       # user-id
            'hl': 'en',                   # language
            'gl': 'us',                   # country to search from
            'cstart': 0,                  # articles page. 0 is the first page
            'pagesize': '100'             # articles per page
        }

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
        }


        with open(f'user-{params["user"]}-articles.csv', mode='w', newline='', encoding='utf-8') as csv_file:
            fieldnames = ['position', 'title', 'link', 'authors', 'publication', 'publication_year', 'cited_by_count']
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()

            all_articles = []

            while True:
                html = requests.get('https://scholar.google.com/citations', params=params, headers=headers, timeout=30)
                selector = Selector(text=html.text)

                for index, article in enumerate(selector.css('.gsc_a_tr'), start=1):
                    article_title = article.css('.gsc_a_at::text').get()
                    article_link = f"https://scholar.google.com{article.css('.gsc_a_at::attr(href)').get()}"
                    article_authors = article.css('.gsc_a_at+ .gs_gray::text').get()
                    article_publication = article.css('.gs_gray+ .gs_gray::text').get()

                    cited_by_count = article.css('.gsc_a_ac::text').get()
                    publication_year = article.css('.gsc_a_hc::text').get()

                    all_articles.append({
                        'position': index,
                        'title': article_title,
                        'link': article_link,
                        'authors': article_authors,
                        'publication': article_publication,
                        'publication_year': publication_year,
                        'cited_by_count': cited_by_count
                    })

                # this selector is checking for the .class that contains: 'There are no articles in this profile.'
                # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
                if selector.css('.gsc_a_e').get():
                    break
                else:
                    params['cstart'] += 100  # paginate to the next page

            for article in all_articles[:-1]:
                writer.writerow(article)


        # [:-1] drops the last element, which is not what we want and doesn't contain any data.
        print(json.dumps(all_articles[:-1], indent=2, ensure_ascii=False))


parsel_scrape_all_author_articles()
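
The asynchronous idea mentioned earlier could be sketched like this. It's one possible approach, not the blog post's code: the blocking per-user scraper is run in worker threads via asyncio.to_thread (Python 3.9+), with a semaphore limiting how many run at once. scrape_many_users and fake_scrape are hypothetical names; in practice, a function wrapping the requests + parsel code above for a single user ID would be passed instead of fake_scrape.

```python
import asyncio


async def scrape_many_users(user_ids, scrape_one, max_concurrency=3):
    """Run the blocking `scrape_one(user_id)` for many users concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)  # limit simultaneous requests

    async def worker(user_id):
        async with semaphore:
            # asyncio.to_thread runs the blocking function in a worker thread
            articles = await asyncio.to_thread(scrape_one, user_id)
            return user_id, articles

    pairs = await asyncio.gather(*(worker(user_id) for user_id in user_ids))
    return dict(pairs)


def fake_scrape(user_id):
    # Stand-in for the real scraper: returns made-up data, no network access
    return [{"user": user_id, "title": "placeholder"}]


user_ids = ['_xwYD2sAAAAJ', 'OBf4YnkAAAAJ', 'xBHVqNIAAAAJ']
results = asyncio.run(scrape_many_users(user_ids, fake_scrape))
print(list(results))
```

Keeping max_concurrency low is deliberate: sending many requests at once is likely to get you blocked sooner.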

Let me know if any of this makes sense 🙂

deepeshsagar

Hi, this is a very useful article. It definitely saves time. I'm planning to get all articles published within the last month. Can you tell me how to do that? I need two things: 1) The default search display is sorted by relevance. How do I set it to sort by date? 2) There is information on how many days ago an article was published. How do I get it?

Dmitriy Zub ☀️

Hi, @deepeshsagar! I'm glad that the article helped you!

Use the sortby=pubdate query parameter, which sorts results by publication date.

For the articles example, the link would look like this: https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ&sortby=pubdate

Or you can put the parameters in a params dict to make them more readable:

import requests

params = {
    "user": "m8dFEawAAAAJ",   # user-id
    "sortby": "pubdate",      # sort articles by publication date
    "hl": "en"                # language
}

html = requests.get('https://scholar.google.com/citations', params=params, timeout=30)
# further code..

I've updated the code on Replit so you can test it in the browser (try removing the sortby param and see the difference in the first articles).

Dmitriy Zub ☀️ • Edited

@deepeshsagar I've just updated the blog post, and now you're able to extract all available articles from the author page. This is possible because of the pagination I've added.

🐱‍👤

cdelosriosru

Thanks, Dmitriy, for a great and useful post. I'm currently having a problem using your code for "Scrape Google Scholar Organic Results using SerpApi with Pagination". My only edit was to make it loop through different search terms. However, for one particular search term, the code is not able to get past the last page. The code works for all the other search terms, but not for this one (regardless of its position in the loop). It simply stays forever at "Currently extracting page #6" and never advances or ends. I'm guessing it has something to do with what is on the last page, but I haven't been able to fix it or identify the problem. Below is a snapshot of what that last page shows. I hope you can help me with this. Thanks!

Image description