Akmal Chaudhri for SingleStore


Quick tip: Using OpenAI Embeddings Search with SingleStoreDB

Abstract

In this short article, we'll test the ability of SingleStoreDB to store and query the OpenAI Wikipedia Vector Database dataset. We'll see that SingleStoreDB can manage this dataset with ease. SingleStoreDB has supported a range of vector functions for some time, and these functions are ideally suited for modern applications using GPT technology.

The notebook file used in this article is available on GitHub.

Introduction

In several previous articles, we have used some of the vector capabilities built into SingleStoreDB.

There is an OpenAI notebook available on GitHub under an MIT License that tests several vector database systems. The tests can be run using local clients or in the cloud. In this article, we'll use SingleStore in the cloud.

Fill out the Notebook

Let's now create a new notebook. We'll adhere to the flow and structure of the OpenAI notebook and use some small code sections directly from the notebook, where required.

Setup

First, some libraries:

!pip install openai --quiet
!pip install tabulate --quiet

Next, some imports:

import getpass
import numpy as np
import openai
import os
import pandas as pd
import shutil
import wget
import zipfile

from ast import literal_eval
from openai import OpenAI
from tabulate import tabulate

and then the embedding model. We'll also create an OpenAI client here, since the search function defined later needs one (it reads the API key from the OPENAI_API_KEY environment variable):

EMBEDDING_MODEL = "text-embedding-ada-002"

client = OpenAI()

Load Data

We'll now obtain the Wikipedia dataset:

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

and unpack it:

zip_file_name = "vector_database_wikipedia_articles_embedded.zip"
csv_file_name = "vector_database_wikipedia_articles_embedded.csv"
directory_name = "extract"

with zipfile.ZipFile(zip_file_name, "r") as zip_ref:
    zip_ref.extractall(directory_name)

Next, we'll load the file into a pandas DataFrame:

article_df = pd.read_csv(directory_name + "/" + csv_file_name)

and we'll take a look at the first few lines, as follows:

article_df.head()

The next operation can take a while:

# Apply literal_eval and cast to float32 for both columns in one step
article_df[["title_vector", "content_vector"]] = article_df[["title_vector", "content_vector"]].applymap(
    lambda x: np.array(literal_eval(x), dtype = np.float32)
)

# Set vector_id to be a string
article_df["vector_id"] = article_df["vector_id"].apply(str)
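The conversion is needed because the CSV file stores each embedding as its string representation rather than as numbers. Here is a minimal, self-contained sketch of what literal_eval does to one cell (the sample vector is made up):

```python
import numpy as np
from ast import literal_eval

# A CSV cell holds the embedding as a string, not a list of floats
cell = "[0.25, -0.5, 0.125]"

# literal_eval safely parses the string into a Python list,
# which NumPy then converts to a float32 array
vec = np.array(literal_eval(cell), dtype = np.float32)

print(vec.dtype, vec.shape)  # float32 (3,)
```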

Next, we'll look at the DataFrame info:

article_df.info(show_counts = True)

The result should be as follows:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
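Notice that title_vector and content_vector report dtype object: after the conversion above, each cell holds a whole NumPy array, and pandas stores arbitrary Python objects per cell under the object dtype. A small illustration (the column name and values are made up):

```python
import numpy as np
import pandas as pd

# One cell containing an entire float32 array
df = pd.DataFrame({"v": [np.array([0.1, 0.2], dtype = np.float32)]})

print(df["v"].dtype)          # object: the column holds Python objects
print(df["v"].iloc[0].dtype)  # float32: the array inside the cell
```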

Create Table

We'll create a database and table:

CREATE DATABASE IF NOT EXISTS openai_demo;

USE openai_demo;

DROP TABLE IF EXISTS wikipedia;
CREATE TABLE IF NOT EXISTS wikipedia (
    id INT PRIMARY KEY,
    url VARCHAR(255),
    title VARCHAR(100),
    text TEXT,
    title_vector VECTOR(1536),
    content_vector VECTOR(1536),
    vector_id INT
);

Populate Table

Let's now create a connection to SingleStoreDB:

from sqlalchemy import create_engine

# connection_url is available in SingleStore Cloud notebooks;
# otherwise, set it to a SQLAlchemy URL for your workspace
db_connection = create_engine(connection_url)

We can populate our database table, as follows:

article_df.to_sql(
    "wikipedia",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
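The chunksize parameter makes to_sql send the rows in batches rather than as one giant INSERT. The same call pattern can be sketched against an in-memory SQLite database (the table and data here are made up; the notebook itself targets SingleStoreDB):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"id": range(5), "title": list("abcde")})

# Same call shape as above: append rows in batches of 2
df.to_sql("demo", con = engine, if_exists = "append", index = False, chunksize = 2)

n = pd.read_sql("SELECT COUNT(*) AS n FROM demo", engine)["n"][0]
print(n)  # 5
```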

Loading the data should take a short time. We can use other data loading methods, such as pipelines, for larger datasets.

Search Data

We'll now define a Python function that will allow us to use either of the two vector columns in the database:

def search_wikipedia(
    query: str,
    column1: str,
    column2: str,
    num_rows: int = 10
) -> list:
    """Searches Wikipedia for the given query and returns the top `num_rows` results.

    Args:
        query: The query to search for.
        column1: The name of the column in the Wikipedia database to return for each result.
        column2: The name of the column in the Wikipedia database to use as the score for each result.
        num_rows: The number of results to return.

    Returns:
        A list of the top `num_rows` results.
    """

    # Get the embedding of the query
    query_embedding_response = client.embeddings.create(
        input = query,
        model = EMBEDDING_MODEL
    )
    query_embedding = query_embedding_response.data[0].embedding

    # Create the SQL statement
    stmt = """
        SELECT {column1}, (%s <*> {column2}) AS score
        FROM wikipedia
        ORDER BY score DESC
        LIMIT %s
    """.format(column1 = column1, column2 = column2)

    results = pd.read_sql(
        stmt,
        db_connection,
        params = (str(query_embedding), num_rows)
    )

    # Separate the results into two lists
    values = results.iloc[:, 0].tolist()
    scores = results.iloc[:, 1].tolist()

    # Return the results
    return values, scores
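In the SQL statement above, <*> is SingleStoreDB's infix dot-product operator (equivalent to its DOT_PRODUCT function). Since text-embedding-ada-002 vectors are unit-normalized, the dot product equals cosine similarity, which is why ordering by score DESC ranks the closest matches first. A small NumPy illustration with made-up unit vectors:

```python
import numpy as np

# Two made-up unit-length vectors standing in for embeddings
a = np.array([0.6, 0.8], dtype = np.float32)
b = np.array([0.8, 0.6], dtype = np.float32)

dot = float(a @ b)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit vectors, dot product and cosine similarity coincide
print(round(dot, 4), round(cos, 4))  # 0.96 0.96
```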

We can test SingleStoreDB using the two examples in the OpenAI notebook. First, we'll use title and title_vector:

values1, scores1 = search_wikipedia(
    query = "modern art in Europe",
    column1 = "title",
    column2 = "title_vector",
    num_rows = 5
)

We'll format the results using the following:

# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data1 = list(zip(values1, scores1))

# Add a rank column to the table data
table_data1 = [(i + 1,) + data for i, data in enumerate(table_data1)]

# Create the table
table1 = tabulate(table_data1, headers = ["Rank", "Title", "Score"])

# Print the table
print(table1)

The output should be similar to the following:

  Rank  Title                    Score
------  --------------------  --------
     1  Museum of Modern Art  0.87496
     2  Western Europe        0.867415
     3  Renaissance art       0.864026
     4  Pop art               0.860158
     5  Northern Europe       0.854661

Next, we'll use text and content_vector:

values2, scores2 = search_wikipedia(
    query = "Famous battles in Scottish history",
    column1 = "text",
    column2 = "content_vector",
    num_rows = 5
)

We'll format the results using the following:

# Combine the values and scores lists into a list of tuples
# Each tuple contains a value and its corresponding score
table_data2 = list(zip([value[:50] for value in values2], scores2))

# Add a rank column to the table data
table_data2 = [(i + 1,) + data for i, data in enumerate(table_data2)]

# Create the table
table2 = tabulate(table_data2, headers = ["Rank", "Text", "Score"])

# Print the table
print(table2)

The output should be similar to the following:

  Rank  Text                                                   Score
------  --------------------------------------------------  --------
     1  The Battle of Bannockburn, fought on 23 and 24 Jun  0.869272
     2  The Wars of Scottish Independence were a series of  0.861397
     3  Events                                              0.852443
         January 1 – Charles II crowned King of
     4  The First War of Scottish Independence lasted from  0.849582
     5  Robert I of Scotland (11 July 1274 – 7 June 1329)   0.846126

Summary

In this short article, we've seen that SingleStoreDB can store vectors with ease and that we can also store other data types in the same table, such as numeric and text. With its powerful SQL and multi-model support, SingleStoreDB provides a one-stop solution for modern applications bringing both technical and business benefits through a single product.

If you are interested in further reading, check out the SingleStore blog for related posts.
