The Python package bibliometrics, with source code on GitHub and available for installation from PyPI, is a command line utility implemented entirely in Python. It extracts common bibliometrics (total citations, h-index, i10-index) from a researcher's Google Scholar profile, calculates others (g-index, i100-index, i1000-index) from the first page of their profile, and generates an SVG summarizing the metrics, which can then be displayed, for example, alongside a list of publications on their website. Here is an example (colors are user-configurable) of what this produces when pointed at my Scholar profile:
The intended use-case, and the one the utility is designed around, is for a researcher to monitor their own publications. For example, I currently run this in a cron job twice per month (once per month is probably also sufficient). It is also designed to respect Google Scholar's current robots.txt, which allows accessing the first page of a profile while disallowing virtually everything else. It has no dependencies, and does not use any of the existing Python libraries that collect Scholar data. This is not a tool for scraping such data more generally. If you are looking for more general scraping functionality, you can find several such Python libraries by searching PyPI.
This post is organized as follows:
- Supported Citation Metrics: explains the supported bibliometrics, as well as why these are included.
- How to Use
- Info for Potential Contributors
- Where You Can Find Me
Supported Citation Metrics
This application supports the following bibliometrics:
- Total number of citations.
- Total number of citations in past 5 years.
- Number of citations to most-cited article.
- h-index: An h-index equal to h means that h is the largest number such that the researcher's h most-cited articles have been cited at least h times each.
- g-index: A g-index equal to g means that g is the largest number such that the researcher's g most-cited articles have been cited at least g times each on average (equivalently, at least g squared times in total).
- i10-index: A researcher's i10-index is the number of their articles cited at least 10 times.
- i100-index, i1000-index, i10000-index: These are like the i10-index, but instead are the numbers of articles cited at least 100 times, 1000 times, and 10000 times, respectively.
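To make the definitions above concrete, here is a minimal sketch that computes them from a plain list of per-article citation counts. The function names and the sample data are my own illustration and are not taken from the bibliometrics source code.

```python
def h_index(citations):
    """Largest h such that the h most-cited articles each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break  # every later article has even fewer citations
        h = rank
    return h


def g_index(citations):
    """Largest g such that the g most-cited articles total at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    g = 0
    total = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total < rank * rank:
            break  # once the running total falls behind rank squared, it never catches up
        g = rank
    return g


def i_index(citations, threshold=10):
    """Number of articles cited at least `threshold` times (i10, i100, ...)."""
    return sum(1 for count in citations if count >= threshold)


# Hypothetical citation counts for eight articles.
example = [25, 12, 10, 4, 3, 1, 0, 0]
print(h_index(example))       # 4
print(g_index(example))       # 7
print(i_index(example, 10))   # 3
print(i_index(example, 100))  # 0
```

With these sample counts, the four most-cited articles each have at least 4 citations (h-index of 4), while the top seven total 55 citations, which is at least 49 (g-index of 7).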
Why These Citation Metrics for This Application?
Several of these can be extracted from the researcher's Google Scholar profile directly, while respecting Scholar's robots.txt. The others (g-index, i100-index, i1000-index, i10000-index), likewise while respecting Scholar's robots.txt, can be calculated using only the first page (top 100 publications) of the researcher's Google Scholar profile (provided the metric is at most 100). For any of these that would require retrieving more than the first page of results to compute, the application simply skips them. For example, if the researcher's g-index is actually 105, the application won't be able to compute this since it can only retrieve a list of the top 100 publications of that researcher without violating Scholar's robots.txt, and thus the SVG that is produced simply won't show the g-index.
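The first-page limitation can be sketched as follows. This is only an illustration of the idea, under my own assumption about where the cutoff lies: a metric is treated as undeterminable when the page is full and the computed value reaches the page size, since unseen publications could then raise it. The function names and the `None` return convention are hypothetical, and the actual utility may decide this differently.

```python
PAGE_SIZE = 100  # the first page of a profile lists at most the top 100 publications


def g_index_first_page(first_page_citations, page_size=PAGE_SIZE):
    """Return the g-index if the first page alone determines it, else None."""
    ranked = sorted(first_page_citations, reverse=True)
    g = 0
    total = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total < rank * rank:
            break
        g = rank
    # Assumption: with a full page where every rank qualified, articles on
    # later pages might push g higher, so the value cannot be trusted.
    if len(ranked) >= page_size and g >= page_size:
        return None
    return g


def i_index_first_page(first_page_citations, threshold, page_size=PAGE_SIZE):
    """Return the count of articles with >= threshold citations, or None if unknown."""
    count = sum(1 for c in first_page_citations if c >= threshold)
    # Assumption: if every article on a full page qualifies, more may follow.
    if len(first_page_citations) >= page_size and count >= page_size:
        return None
    return count


# Hypothetical profiles: a short one, and a full page of heavily cited articles.
small_profile = [25, 12, 10, 4, 3, 1, 0, 0]
full_page = [150] * 100

print(g_index_first_page(small_profile))    # 7
print(g_index_first_page(full_page))        # None
print(i_index_first_page(full_page, 100))   # None
print(i_index_first_page(full_page, 1000))  # 0
```

A caller that renders the SVG could then simply omit any metric whose value is `None`.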
How to Use
Installing from PyPI
To install from PyPI:
python3 -m pip install bibliometrics
Or on Windows:
py -m pip install bibliometrics
Configuring
You configure the utility with a JSON file. The JSON configuration file must be named .bibliometrics.config.json. The . at the start is not a typo. Its rationale is my own personal use-case, where I run this in a directory containing the contents of a GitHub Pages site, and GitHub Pages by default doesn't serve files with names beginning with a dot. Here is an example of the configuration (explanation follows):
{
  "scholarID": "YOUR-SCHOLAR-ID-HERE",
  "jsonOutputFile": "bibliometrics.json",
  "svgConfig": [
    {
      "background": "#010409",
      "border": "rgba(56,139,253,0.4)",
      "filename": "images/bibliometrics2.svg",
      "text": "#c9d1d9",
      "title": "#58a6ff"
    },
    {
      "background": "#f6f8fa",
      "border": "rgba(84,174,255,0.4)",
      "filename": "images/bibliometrics.svg",
      "text": "#24292f",
      "title": "#0969da"
    }
  ]
}
The above example configures the bibliometrics utility to generate two SVG files, one of them with a light color theme, and the other with a dark color theme. The "svgConfig" field can be used to configure as many SVGs as you want to generate (all for the same Scholar ID). If you only want one SVG, just provide a list there with a single JSON object describing the various color properties. The fields text, title, border, and background can all be specified via any valid method of defining a color in an SVG, such as 6-digit hex colors (most of the colors in the example), 3-digit hex colors, rgba (see example), as well as named colors. If it is valid as a color in an SVG, you can use it. The bibliometrics utility simply inserts it for the color.
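If you would rather generate the configuration file than write it by hand, a short script along these lines should work. It is a sketch that uses only the standard library and the field names from the example above; the helper name `write_config` and its parameters are my own and are not part of the bibliometrics package.

```python
import json
from pathlib import Path

CONFIG_NAME = ".bibliometrics.config.json"  # the required file name, leading dot included


def write_config(directory, scholar_id, svg_filename="images/bibliometrics.svg"):
    """Write a single-SVG configuration (light theme) into `directory`."""
    config = {
        "scholarID": scholar_id,
        "svgConfig": [
            {
                # Any color string that is valid in an SVG can be used here.
                "background": "#f6f8fa",
                "border": "rgba(84,174,255,0.4)",
                "filename": svg_filename,
                "text": "#24292f",
                "title": "#0969da",
            }
        ],
    }
    path = Path(directory) / CONFIG_NAME
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, `write_config(".", "YOUR-SCHOLAR-ID-HERE")` writes the file into the current directory.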
The "jsonOutputFile" field is optional. If provided, then in addition to generating an SVG, a JSON file will also be generated containing the extracted and computed bibliometrics.
You can specify your Scholar ID in one of two ways. In the above example, the field "scholarID" is used. Alternatively, the bibliometrics utility will also check for an environment variable SCHOLAR_ID. A single Scholar ID is used no matter how many SVGs you are generating. The intention of this application is for use by a researcher for their own bibliometrics, and one of the design criteria was to make it inconvenient to use for extracting bibliometrics for multiple researchers.
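The two ways of supplying the Scholar ID could be resolved with logic like the sketch below. Note the assumption: I have the "scholarID" field take precedence over the SCHOLAR_ID environment variable, but the post does not state which one wins, so check the project documentation before relying on the order. The function name `resolve_scholar_id` is hypothetical.

```python
import json
import os


def resolve_scholar_id(config_text, environ=None):
    """Return the Scholar ID from the config JSON text or the environment, else None."""
    environ = os.environ if environ is None else environ
    config = json.loads(config_text)
    # Assumed precedence: config field first, then the environment variable.
    return config.get("scholarID") or environ.get("SCHOLAR_ID")


# Passing a plain dict in place of os.environ keeps the examples self-contained.
print(resolve_scholar_id('{"scholarID": "from-config"}', {}))  # from-config
print(resolve_scholar_id('{}', {"SCHOLAR_ID": "from-env"}))    # from-env
print(resolve_scholar_id('{}', {}))                            # None
```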
Running
Once you have completed configuration, change your working directory to the directory containing the .bibliometrics.config.json file, and execute the following:
python3 -m bibliometrics
Or on Windows:
py -m bibliometrics
Info for Potential Contributors
The bibliometrics package is licensed under the MIT license. Source code is maintained on GitHub here:
cicirello / bibliometrics
Summarize your Google Scholar bibliometrics in an SVG
bibliometrics
This command line utility does the following:
- retrieves the first page of your Google Scholar profile;
- parses from that page your total citations, your five-year citation count, your h-index, your i10-index, and the number of citations of your most-cited paper;
- computes your o-index (https://arxiv.org/abs/1511.01545);
- computes your g-index provided it is less than 100 (reason for limitation later);
- computes your i100-index, i1000-index, and i10000-index (doi:10.1007/s11192-020-03831-9), hiding any that are 0, and provided they are less than 100 (reason for limitation later);
- computes your w-index (doi:10.1002/asi.21276), hiding if equal to 0, and provided it is less than 100 (reason for limitation later);
- computes your e-index (doi:10.1371/journal.pone.0005429), your r-index (doi:10.1007/s11434-007-0145-9), and your a-index provided that your h-index is at most 100 (reason for…
If you are interested in submitting issues or contributing code, any proposed new features must be implementable while respecting Scholar's robots.txt. This largely means limited to what can be extracted or computed from the first page of a profile (up to the first 100 publications). Additionally, proposed new features must not be solely for the purpose of making it easier to scrape multiple profiles. For example, the use of a configuration file (with a limit of one scholar ID) rather than command line arguments deliberately makes it less convenient (though not impossible) to use within a script that processes several profiles. The name of the configuration file, and its location relative to current working directory, are not configurable for that same reason.
Where You Can Find Me
You can find me on the web:
Here on DEV:
On GitHub:
Vincent A Cicirello
View My Detailed GitHub Activity
If you want to generate the equivalent to the above for your own GitHub profile, check out the cicirello/user-statistician GitHub Action.