Abstract
The new SingleStore VECTOR data type was introduced in early 2024. It provides a number of benefits over using the BLOB type when working with vector data. Now, numpy arrays can be directly stored in a table with a VECTOR column. In this article, we'll see how.
The notebook file used in this article is available on GitHub.
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: Iris Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: iris-demo
- Size: S-00
Create a new notebook
From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.
We'll call the notebook numpy_vector_demo, select a Blank notebook template from the available options, and save it in the Personal location.
Fill out the notebook
First, we'll import some libraries:
import pandas as pd
import numpy as np
Next, we'll load the iris dataset:
url = "https://gist.githubusercontent.com/VeryFatBoy/9af771d443f5ec4dd6eec8d69a062638/raw/c03ef25a97f23a48ee408ac02114195b663a2364/iris.csv"
df = pd.read_csv(url)
and print the first few rows:
df.head()
Example output:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
We'll now create a copy of the original DataFrame:
str_df = df.copy()
and then combine the first four columns into a new vector column in string format, dropping those four columns afterwards, as follows:
str_df["vector"] = str_df.apply(lambda row: f"[{row['sepal_length']}, {row['sepal_width']}, {row['petal_length']}, {row['petal_width']}]", axis = 1)
str_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)
We now have two columns in our DataFrame.
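As a quick sanity check (a minimal sketch, not part of the original steps), we can inspect one of the newly built values; each entry in the vector column should now be a plain Python string such as "[5.1, 3.5, 1.4, 0.2]":
# Inspect the first row of the string-formatted DataFrame.
# Each vector is stored as a plain string, e.g. "[5.1, 3.5, 1.4, 0.2]".
print(str_df.head())
print(type(str_df["vector"].iloc[0]))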
We can also represent the data using numpy arrays instead of strings.
First, we'll create another copy of our original DataFrame:
num_df = df.copy()
and then combine the first four columns into a new vector column of 32-bit floating-point numpy arrays, dropping those four columns afterwards, as follows:
num_df["vector"] = num_df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].apply(lambda row: np.array(row).astype(np.float32), axis = 1)
num_df.drop(columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"], inplace = True)
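As an optional check (a small sketch, not in the original notebook), we can confirm that each value in the new column is a 4-element float32 numpy array:
# Each vector should be a numpy array of dtype float32 with shape (4,).
first = num_df["vector"].iloc[0]
print(type(first), first.dtype, first.shape)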
We'll now create our database and two tables, as follows:
DROP DATABASE IF EXISTS iris_db;
CREATE DATABASE IF NOT EXISTS iris_db;
USE iris_db;
DROP TABLE IF EXISTS iris_str;
CREATE TABLE IF NOT EXISTS iris_str (
species VARCHAR(20),
vector VECTOR(4) NOT NULL
);
DROP TABLE IF EXISTS iris_num;
CREATE TABLE IF NOT EXISTS iris_num (
species VARCHAR(20),
vector VECTOR(4) NOT NULL
);
One table will store the vectors written in string format, and the other will store the vectors written from numpy arrays.
We'll now create the connection to SingleStore:
from sqlalchemy import create_engine

# In SingleStore Cloud notebooks, connection_url is provided automatically
# for the selected workspace and database.
db_connection = create_engine(connection_url)
and write the first DataFrame using the string format:
str_df.to_sql(
"iris_str",
con = db_connection,
if_exists = "append",
index = False,
chunksize = 1000
)
and then the second DataFrame using the numpy format:
num_df.to_sql(
"iris_num",
con = db_connection,
if_exists = "append",
index = False,
chunksize = 1000
)
Using "append"
ensures that the table structures we previously defined are preserved.
Both methods enable us to write the original dataset into SingleStore. With support for writing vector data in multiple formats, we can choose the best approach for our use case.
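As a final sanity check (a minimal sketch, assuming the same db_connection), we can read a few rows back from each table; note that the exact on-read representation of the vector column may depend on the client and driver settings:
# Read back a handful of rows from each table to confirm the writes succeeded.
print(pd.read_sql("SELECT species, vector FROM iris_str LIMIT 5", con = db_connection))
print(pd.read_sql("SELECT species, vector FROM iris_num LIMIT 5", con = db_connection))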
Summary
Besides the existing support for writing vectors as strings, numpy arrays can now be directly written to the VECTOR data type in SingleStore.