Abstract
In this short article, we'll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset. We'll see that SingleStoreDB can manage this dataset with ease. SingleStoreDB has supported a range of vector functions for some time, and these functions are ideally suited for modern applications using GPT technology.
The notebook file used in this article is available on GitHub.
Introduction
In several previous articles, we have used some of the vector capabilities built into SingleStoreDB:
- Quick tip: SingleStoreDB's EUCLIDEAN_DISTANCE and JSON_ARRAY_PACK functions
- Using SingleStore, Spark and Alternating Least Squares (ALS) to build a Movie Recommender System
There is an OpenAI notebook available on GitHub under an MIT License that tests several Vector Database systems. The tests can be run using local clients or in the cloud. In this article, we'll use SingleStore in the cloud.
Fill out the Notebook
Let's now create a new notebook. We'll adhere to the flow and structure of the OpenAI notebook and use some small code sections directly from the notebook, where required.
Setup
First, some libraries:
!pip install openai --quiet
!pip install tabulate --quiet
Next, some imports:
import getpass
import numpy as np
import openai
import os
import pandas as pd
import shutil
import wget
import zipfile
from ast import literal_eval
from openai import OpenAI
from tabulate import tabulate
and then the embedding model:
EMBEDDING_MODEL = "text-embedding-ada-002"
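The search function later in this article uses an OpenAI client to embed queries, so we'll create one now. This step is our own addition; the prompt wording is illustrative, and any method of supplying the API key works:
# Prompt for the OpenAI API key rather than hard-coding it (our own addition)
openai_api_key = getpass.getpass("OpenAI API Key: ")
client = OpenAI(api_key = openai_api_key)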
Load Data
We'll now obtain the Wikipedia dataset:
embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"
# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
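Alternatively, to avoid repeating the large download on later runs, we can guard it. This small variation is our own addition:
# Skip the ~700 MB download if the archive is already present (our own addition)
if not os.path.exists("vector_database_wikipedia_articles_embedded.zip"):
    wget.download(embeddings_url)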
Next, we'll unpack the archive:
zip_file_name = "vector_database_wikipedia_articles_embedded.zip"
csv_file_name = "vector_database_wikipedia_articles_embedded.csv"
directory_name = "extract"
with zipfile.ZipFile(zip_file_name, "r") as zip_ref:
    zip_ref.extractall(directory_name)
Next, we'll load the file into a pandas DataFrame:
article_df = pd.read_csv(directory_name + "/" + csv_file_name)
and we'll take a look at the first few lines, as follows:
article_df.head()
The next operation can take a while:
# Apply literal_eval and cast to float32 for both columns in one step
# (on pandas 2.1+, DataFrame.map can be used instead of the deprecated applymap)
article_df[["title_vector", "content_vector"]] = article_df[["title_vector", "content_vector"]].applymap(
    lambda x: np.array(literal_eval(x), dtype = np.float32)
)
# Set vector_id to be a string
article_df["vector_id"] = article_df["vector_id"].apply(str)
Next, we'll look at the DataFrame info:
article_df.info(show_counts = True)
The result should be as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              25000 non-null  int64
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
Create Table
We'll create a database and a table. The two VECTOR(1536) columns match the 1,536 dimensions produced by the text-embedding-ada-002 model:
CREATE DATABASE IF NOT EXISTS openai_demo;
USE openai_demo;
DROP TABLE IF EXISTS wikipedia;
CREATE TABLE IF NOT EXISTS wikipedia (
    id INT PRIMARY KEY,
    url VARCHAR(255),
    title VARCHAR(100),
    text TEXT,
    title_vector VECTOR(1536),
    content_vector VECTOR(1536),
    vector_id INT
);
Populate Table
Let's now create a connection to SingleStoreDB:
from sqlalchemy import create_engine

# connection_url is provided automatically in SingleStore notebooks;
# elsewhere, build a SQLAlchemy URL for your workspace
db_connection = create_engine(connection_url)
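A quick connectivity test confirms the engine works. This check is our own addition; SELECT @@version is standard MySQL-compatible syntax that SingleStoreDB supports:
from sqlalchemy import text

# Print the server version to confirm the connection works (our own addition)
with db_connection.connect() as conn:
    print(conn.execute(text("SELECT @@version")).scalar())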
We can populate our database table, as follows:
article_df.to_sql(
    "wikipedia",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
Loading the data should take a short time. We can use other data loading methods, such as pipelines, for larger datasets.
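We can also verify the load with a quick row count. This check is our own addition:
# Confirm that all 25,000 rows arrived (our own addition)
print(pd.read_sql("SELECT COUNT(*) AS row_count FROM wikipedia", db_connection))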
Search Data
We'll now define a Python function that will allow us to use either of the two vector columns in the database:
def search_wikipedia(
    query: str,
    column1: str,
    column2: str,
    num_rows: int = 10
) -> tuple:
    """Searches Wikipedia for the given query and returns the top `num_rows` results.

    Args:
        query: The query to search for.
        column1: The name of the column in the Wikipedia database to return for each result.
        column2: The name of the vector column in the Wikipedia database to score each result against.
        num_rows: The number of results to return.

    Returns:
        Two lists: the top `num_rows` values and their corresponding scores.
    """

    # Get the embedding of the query
    query_embedding_response = client.embeddings.create(
        input = query,
        model = EMBEDDING_MODEL
    )
    query_embedding = query_embedding_response.data[0].embedding

    # Create the SQL statement
    stmt = """
        SELECT {column1}, (%s <*> {column2}) AS score
        FROM wikipedia
        ORDER BY score DESC
        LIMIT %s
    """.format(column1 = column1, column2 = column2)

    results = pd.read_sql(
        stmt,
        db_connection,
        params = (str(query_embedding), num_rows)
    )

    # Separate the results into two lists
    values = results.iloc[:, 0].tolist()
    scores = results.iloc[:, 1].tolist()

    # Return the results
    return values, scores
Note that the <*> operator in the SQL statement computes the dot product of two vectors. Since OpenAI's embeddings are normalised to unit length, the dot product is equivalent to cosine similarity.
We can test SingleStoreDB using the two examples in the OpenAI notebook. First, we'll use title and title_vector:
values1, scores1 = search_wikipedia(
query = "modern art in Europe",
column1 = "title",
column2 = "title_vector",
num_rows = 5
)
We'll format the results using the following:
# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data1 = list(zip(values1, scores1))
# Add a rank column to the table data
table_data1 = [(i + 1,) + data for i, data in enumerate(table_data1)]
# Create the table
table1 = tabulate(table_data1, headers = ["Rank", "Title", "Score"])
# Print the table
print(table1)
The output should be similar to the following:
  Rank  Title                    Score
------  --------------------  --------
     1  Museum of Modern Art  0.87496
     2  Western Europe        0.867415
     3  Renaissance art       0.864026
     4  Pop art               0.860158
     5  Northern Europe       0.854661
Next, we'll use text and content_vector:
values2, scores2 = search_wikipedia(
query = "Famous battles in Scottish history",
column1 = "text",
column2 = "content_vector",
num_rows = 5
)
We'll format the results using the following:
# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data2 = list(zip([value[:50] for value in values2], scores2))
# Add a rank column to the table data
table_data2 = [(i + 1,) + data for i, data in enumerate(table_data2)]
# Create the table
table2 = tabulate(table_data2, headers = ["Rank", "Text", "Score"])
# Print the table
print(table2)
The output should be similar to the following:
  Rank  Text                                                   Score
------  --------------------------------------------------  --------
     1  The Battle of Bannockburn, fought on 23 and 24 Jun   0.869272
     2  The Wars of Scottish Independence were a series of   0.861397
     3  Events                                                0.852443
        January 1 – Charles II crowned King of
     4  The First War of Scottish Independence lasted from   0.849582
     5  Robert I of Scotland (11 July 1274 – 7 June 1329)    0.846126
Summary
In this short article, we've seen that SingleStoreDB can store vectors with ease, alongside other data types such as numeric and text in the same table. With its powerful SQL and multi-model support, SingleStoreDB provides a one-stop solution for modern applications, bringing both technical and business benefits through a single product.
If you are interested in further reading, check out these SingleStore blog posts: