As a Ph.D. student studying Deep Learning (DL) from the perspective of a Software Engineer, I rely upon academic resources to learn about DL models, techniques, and methods. arXiv is arguably the largest host of the latest academic (but not peer-reviewed) DL manuscripts.
However, as it relies upon community donations to support the service, there are limitations to the service. One of them is that only the last week of manuscripts are browsable at any given time with the rest being searchable.
As someone who checks the service often for the latest information, it can become irritating when I'm casually browsing the site, find an interesting manuscript, and (for one reason or another) forget to bookmark it and then not find the paper as I can't nail down the exact keywords to search for it. Additionally, I'd like to leverage the data on the site for other projects like testing retrieval augmented generation (RAG) techniques for finding information from manuscripts.
To support users like me, the arXiv team releases the metadata of all papers submitted to the platform weekly on Kaggle as JSON. So for today's blog post, let's convert the JSON file into a queriable SQLite3 database!
Project Setup
I'll leverage Python 3.10 and bash for this project primarily for the pandas
library. pandas
provides convenient read_json
and to_sql
methods for reading JSON files and writing to SQL databases respectfully.
To start, I created a GitHub repository based on my Python template repo. You can find all the project code here.
Getting And Cleaning The Data
As the arXiv Dataset is hosted on Kaggle, we can use their kaggle
Python library to download and unzip the data. Wrapping this as a bash script, we get:
#!/bin/bash
kaggle datasets download --unzip Cornell-University/arxiv -p $1
Where the --unzip
argument decompresses the data, and the -p
argument specifies a path to download the data. We can improve this further by leveraging optparse
to provide command-line arguments for our script. All said and done, we have a download script that looks like this:
#!/bin/bash
source optparse.bash
optparse.define short=p long=path desc="Directory to store dataset" variable=PATH default="."
source $( optparse.build )
ABS_PATH=$(realpath $PATH)
kaggle datasets download --unzip Cornell-University/arxiv -p $ABS_PATH
Now that we have the data in JSON format, we can further optimize it by converting it into JSON Lines (JL) format. JL is a format for storing JSON data where each line is a single object. This effectively removes top-level arrays of objects which is how the arXiv Dataset is stored. By converting the data into a JL format, Pandas can read the file in chunks, thereby reducing the memory overhead by loading only portions of data into memory.
We can leverage jq
to do the conversion with this script:
#!/bin/bash
source optparse.bash
optparse.define short=i long=input desc="Input JSON file" variable=inputPath
optparse.define short=o long=output desc="Output JSON Lines file" variable=outputPath
source $( optparse.build )
if [[ -z $inputPath ]]; then
echo "No input provided."
exit 1
fi
if [[ -z $outputPath ]]; then
echo "No output provided."
exit 1
fi
absInputPath=$(realpath $inputPath)
absOutputPath=$(realpath $outputPath)
jq -c . $absInputPath > $absOutputPath
With that, our data is finally in a format where we can start loading it into a database!
Creating The Database
We will define our database schema using SQLAlchemy. First, we will store a subset of the information in a single table called documents
. This is to test that our database configuration is correct and avoid storing nested data now. The code is fairly simple to create a SQLite3 database with SQLAlchemy:
from pathlib import Path
from sqlalchemy import (
Column,
Engine,
MetaData,
PrimaryKeyConstraint,
String,
Table,
create_engine,
)
class DB:
def __init__(self, path: Path) -> None:
self.path: Path = path
self.engine: Engine = create_engine(url=f"sqlite:///{path}")
self.metadata: MetaData = MetaData()
self.documentTable: str = "documents"
self.createTables()
def createTables(self) -> None:
_: Table = Table(
self.documentTable,
self.metadata,
Column("id", String),
Column("title", String),
Column("submitter", String),
Column("comments", String),
Column("journal-ref", String),
Column("doi", String),
Column("report-no", String),
Column("categories", String),
Column("license", String),
Column("abstract", String),
Column("update_date", DateTime),
PrimaryKeyConstraint("id"),
)
self.metadata.create_all(bind=self.engine, checkfirst=True)
Running this code and checking the database schema we see that the table and columns have been created successfully:
We will extend this later by adding tables and relationships between nested values and the documents table.
Inserting Data Into The Database
With Pandas, we can read the data in from the JL file as chunks:
from pathlib import Path
from typing import Iterator
import pandas
from pandas import DataFrame
def readJSON(fp: Path, chunksize: int = 10000) -> Iterator[DataFrame]:
return pandas.read_json(
path_or_buf=fp,
lines=True,
chunksize=chunksize,
engine="ujson",
)
This Iterator[DataFrame]
object lazily reads the file into memory which we can do with a for
loop.
def loadData(dfs: Iterator[DataFrame], db: DB) -> None:
df: DataFrame
for df in dfs:
print(df.columns)
quit()
As our current database schema doesn't take all of the fields captured in the JSON objects, we need to parse our DataFrame for the columns that are captured:
def getDocuments(df: DataFrame) -> DataFrame:
documentsDF: DataFrame = df[
[
"id",
"title",
"submitter",
"comments",
"journal-ref",
"doi",
"report-no",
"categories",
"license",
"abstract",
"update_date",
]
].copy()
documentsDF["update_date"] = pandas.to_datetime(
arg=documentsDF["update_date"],
)
return documentsDF
Then we can load the document data into the database:
def loadData(dfs: Iterator[DataFrame], db: DB) -> None:
df: DataFrame
for df in dfs:
documentsDF: DataFrame = getDocuments(df=df)
documentsDF.to_sql(
name=db.documentTable,
con=db.engine,
if_exists="append",
index=False,
)
quit()
Checking our testing database, we can see that the first set of documents was loaded correctly:
However, when we try to import the entire dataset into the database, we get a sqlalchemy.exc.IntegrityError
because some of the primary keys are duplicated in the JL file. Rather than handling this when converting the data, we can extend our DB
class to support reading DataFrames into a table while checking for duplicates should an error arise:
...
from pandas import DataFrame
import pandas
from sqlalchemy.exc import IntegrityError
from typing import List
class DB:
...
def toSQL(self, tableName: str, df: DataFrame) -> int:
try:
df.to_sql(
name=tableName,
con=self.engine,
if_exists="append",
index=False,
)
except IntegrityError as error:
ids: List[str] = [param[0] for param in error.params]
df = df[~df["id"].isin(values=ids)]
df.to_sql(
name=tableName,
con=self.engine,
if_exists="append",
index=False,
)
return df.shape[0]
If an error occurs, a unique DataFrame is created from the list of primary keys not reported by the IntegrityError
, another attempt is made to reinsert them into the database. Additionally, we now return the number of rows committed to the database.
So our updated loadData
method now looks like this (with a Spinner object to help report progress from the progress
library):
def loadData(dfs: Iterator[DataFrame], db: DB) -> None:
with Spinner(f"Loading data into {db.path}... ") as spinner:
df: DataFrame
for df in dfs:
documentsDF: DataFrame = getDocuments(df=df)
db.toSQL(tableName=db.documentTable, df=documentsDF)
spinner.next()
Wrapping Up
Now that the basic structure of the application has been created, all that's left is to add the other tables.
For example, we can create a table called authors
to store each author of a document:
_: Table = Table(
self.authorTable,
self.metadata,
Column("id", Integer),
Column("document_id", String),
Column("author", String),
PrimaryKeyConstraint("id"),
ForeignKeyConstraint(
columns=["document_id"],
refcolumns=["documents.id"],
),
)
And then access only the authors from the DataFrame with this method:
def getAuthors(df: DataFrame, idIncrement: int = 0) -> DataFrame:
authorsDF: DataFrame = df[["id", "authors_parsed"]]
authorsDF = authorsDF.explode(column="authors_parsed", ignore_index=True)
authorsDF["author"] = authorsDF["authors_parsed"].apply(
lambda x: ", ".join(x),
)
authorsDF = authorsDF.drop(columns="authors_parsed")
authorsDF.index += idIncrement
authorsDF = authorsDF.reset_index()
authorsDF = authorsDF.rename(columns={"id": "document_id", "index": "id"})
return authorsDF
Then we can use the DB.toSQL()
method to write it to the database.
The final database schema is as follows (generated with SchemaCrawler):
As seen here, there is additional complexity in storing this data. The versions table undergoes similar transformations as well.
If you are interested in how the versions
table is created and to leverage this tool, please visit the GitHub project page.
Thanks for taking the time to read this post. I hope to be posting more in the future.
Top comments (0)