Abstract
SingleStore is a high-performance distributed SQL database that, through SingleStore Kai, is compatible with MongoDB's vector search capabilities. By supporting MongoDB's $vectorSearch operator, SingleStore Kai provides efficient querying and indexing of vector data, such as the embeddings used in ML and AI applications. Users can combine SingleStore's optimised data handling with Kai's vector search to perform similarity searches across large datasets. In this article, we'll see an example of how to use SingleStore Kai with vector data.
The notebook file used in this article is available on GitHub.
Introduction
In a previous article, we discussed SingleStore Kai's support for the $euclideanDistance extension. Support is now also available for MongoDB's $vectorSearch operator.
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: Iris Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: iris-demo
- Size: S-00
- Settings: SingleStore Kai selected
From Deployments > Firewall, we'll temporarily allow access from anywhere.
Import the notebook
We'll download the notebook from GitHub.
From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.
Run the notebook
After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.
In the database, we'll store the Iris flower data set. We'll first download the Iris CSV file into a pandas DataFrame and then reduce it to two columns, a vector column holding the four measurements and the species label, as follows:
pandas_df["vector"] = pandas_df.apply(
    lambda row: [
        row["sepal_length"],
        row["sepal_width"],
        row["petal_length"],
        row["petal_width"]
    ], axis = 1
)
new_df = pandas_df[["vector", "species"]]
new_df.head()
Example output:
vector species
0 [5.1, 3.5, 1.4, 0.2] Iris-setosa
1 [4.9, 3.0, 1.4, 0.2] Iris-setosa
2 [4.7, 3.2, 1.3, 0.2] Iris-setosa
3 [4.6, 3.1, 1.5, 0.2] Iris-setosa
4 [5.0, 3.6, 1.4, 0.2] Iris-setosa
Next, we'll transform the data into a list of dictionaries, one per row:
records = new_df.to_dict(orient = "records")
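For readers less familiar with to_dict(orient = "records"), the result is a list with one dictionary per row, which is the shape that insert_many accepts later on. A minimal sketch that builds the same structure using only the standard library is shown below. The inline CSV text, its header names and the variable names are assumptions for illustration; the two rows are taken from the example output above.

```python
import csv
import io

# Two sample rows laid out like the Iris CSV file (illustrative only).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One dictionary per row: the four measurements collapse into "vector".
sample_records = [
    {
        "vector": [float(row[name]) for name in feature_columns],
        "species": row["species"],
    }
    for row in csv.DictReader(io.StringIO(csv_text))
]

print(sample_records[0])
```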
and get the number of dimensions for the vector column:
dimensions = len(new_df.at[0, "vector"])
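Taking the length of the first vector assumes that every row holds the same number of values. A small sanity check along the following lines could confirm that before the VECTOR column is created. The helper name vector_dimensions is hypothetical and the sample records are illustrative, not part of the original notebook.

```python
def vector_dimensions(records, field="vector"):
    """Return the common vector length, or raise if the lengths differ."""
    lengths = {len(record[field]) for record in records}
    if len(lengths) != 1:
        raise ValueError(f"Expected one vector length, found: {sorted(lengths)}")
    return lengths.pop()


# Illustrative records in the same shape as the list built from the DataFrame.
sample_records = [
    {"vector": [5.1, 3.5, 1.4, 0.2], "species": "Iris-setosa"},
    {"vector": [4.9, 3.0, 1.4, 0.2], "species": "Iris-setosa"},
]

sample_dimensions = vector_dimensions(sample_records)
print(sample_dimensions)
```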
We'll now create a client using connection_url_kai, as follows:
client = pymongo.MongoClient(connection_url_kai)
db = client["iris_db"]
collection = db["iris"]
which avoids the need to provide the long connection string.
We'll now create a collection:
db.create_collection("iris",
    columns = [{
        "id": "vector", "type": f"VECTOR({dimensions}) NOT NULL"
    }],
)
This uses the SingleStore VECTOR type with the number of dimensions previously determined.
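To make the f-string explicit, the sketch below shows what the column definition expands to for the four-dimensional Iris vectors. The helper name vector_column is hypothetical; it simply reproduces the dictionary passed in the columns option and adds a basic check on the dimension count.

```python
def vector_column(name, dimensions):
    """Build a column definition in the shape used by the columns option."""
    if not isinstance(dimensions, int) or dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    return {"id": name, "type": f"VECTOR({dimensions}) NOT NULL"}


# For the Iris data, each vector holds four measurements.
column_spec = vector_column("vector", 4)
print(column_spec)
```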
Next, we'll insert the data:
result = collection.insert_many(records)
and we'll retrieve a few rows to confirm the data have been stored:
cursor = collection.find(projection = {"_id": 0}).limit(5)
table = []
for document in cursor:
    species = document["species"]
    vector = [round(value, 2) for value in document["vector"]]
    table.append([vector, species])
print(tabulate(table, headers = ["vector", "species"]))
Example output:
vector species
-------------------- ---------------
[6.4, 3.2, 4.5, 1.5] Iris-versicolor
[6.0, 3.0, 4.8, 1.8] Iris-virginica
[6.7, 3.1, 5.6, 2.4] Iris-virginica
[7.2, 3.2, 6.0, 1.8] Iris-virginica
[4.6, 3.4, 1.4, 0.3] Iris-setosa
Next, we'll create a vector index:
db.command({
    "createIndexes": "iris",
    "indexes": [{
        "key": {"vector": "vector"},
        "name": "vector_index",
        "kaiIndexOptions": {
            "index_type": "AUTO",
            "metric_type": "EUCLIDEAN_DISTANCE",
            "dimensions": dimensions
        }
    }],
})
AUTO uses IVF_PQFS. Other indexing options are also available.
Finally, let's use some fictitious data values to make a prediction:
query_vector = [5.2, 3.6, 1.5, 0.3]
and query the data to find the closest matches:
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "vector",
            "queryVector": query_vector,
            "limit": 5
        }
    }, {
        "$project": {
            "_id": 0,
            "species": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]
cursor = collection.aggregate(pipeline)
table = []
for document in cursor:
    species = document["species"]
    score = document["score"]
    table.append([score, species])
print(tabulate(table, headers = ["score", "species"]))
Example output:
score species
-------- -----------
0.141421 Iris-setosa
0.173205 Iris-setosa
0.173205 Iris-setosa
0.173205 Iris-setosa
0.2 Iris-setosa
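Because the index was created with EUCLIDEAN_DISTANCE, the scores above appear to be plain Euclidean distances, so smaller values mean closer matches. One way to sanity-check a score is to compute the distances directly. The sketch below is an illustration rather than part of the original notebook: it uses math.dist from the standard library and the five Iris rows shown in the first example output, then ranks them by distance from the query vector.

```python
import math

query_vector = [5.2, 3.6, 1.5, 0.3]

# The five rows shown in the DataFrame output earlier (illustrative subset).
sample_rows = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([4.7, 3.2, 1.3, 0.2], "Iris-setosa"),
    ([4.6, 3.1, 1.5, 0.2], "Iris-setosa"),
    ([5.0, 3.6, 1.4, 0.2], "Iris-setosa"),
]

# Exact (brute-force) search: compute every distance, then sort ascending.
ranked = sorted(
    (math.dist(query_vector, vector), vector, species)
    for vector, species in sample_rows
)

for distance, vector, species in ranked:
    print(round(distance, 6), vector, species)
```

The closest of these five rows is [5.1, 3.5, 1.4, 0.2], at a distance of about 0.2, which is consistent with the 0.2 score in the output above.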
Summary
In this article, we've seen how to use $vectorSearch with SingleStore Kai. The results match those we obtained by running similar queries in a previous article. For larger datasets, the results may differ due to the vector indexing.
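Index types such as IVF_PQFS perform an approximate search, which is why indexed results can differ from an exact scan on larger datasets. One common way to quantify the difference is recall at k: the fraction of the exact top-k results that the indexed search also returns. The sketch below is a minimal illustration; the function name recall_at_k and the document identifiers are hypothetical.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search also found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical identifiers returned by an exact scan and by an indexed search.
exact_top5 = [1, 2, 3, 4, 5]
approx_top5 = [1, 2, 3, 4, 9]

recall = recall_at_k(exact_top5, approx_top5)
print(recall)
```

A recall of 1.0 means the indexed search returned exactly the same set of documents as the exact scan, which is the situation observed with the small Iris data set in this article.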