How similar are the strings “I care about strong ACID guarantees” and “I like transactional databases”? While there are a number of ways we could compare these strings (syntactically or grammatically, for instance), one powerful thing AI models give us is the ability to compare them semantically, using something called embeddings. Given a model, such as OpenAI’s text-embedding-ada-002, I can tell you that those two strings have a similarity of 0.784, and that they are more similar than “I care about strong ACID guarantees” and “I like MongoDB” 😛. With embeddings, we can do a whole suite of powerful things:1
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analyzed)
- Classification (where text strings are classified by their most similar label)
This article will look at working with raw OpenAI embeddings.
What is an embedding?
An embedding is ultimately a list of numbers that describe a piece of text, for a given model. In the case of OpenAI’s model, it’s always a 1,536-element-long array of numbers. Furthermore, for OpenAI, the numbers are all between -1 and 1, and if you treat the array as a vector in 1,536-dimensional space, it has a magnitude of 1 (i.e. it’s “normalized to length 1” in linear algebra lingo).
On a conceptual level, you can think of each number in the array as capturing some aspect of the text. Two arrays are considered similar to the degree that they have similar values in each element in the array. You don’t have to know what any of the individual values correspond to—that’s both the beauty and the mystery of embeddings—you just need to compare the resulting arrays. We’ll look at how to compute this similarity below.
Depending on what model you use, you can get wildly different arrays, so it only makes sense to compare arrays that come from the same model. It also means that different models may disagree about what is similar; you could imagine one model being more sensitive to whether the strings rhyme, for instance. You could fine-tune a model for your specific use case, but I’d recommend starting with a general-purpose one, for similar reasons as to why you’d generally pick ChatGPT over fine-tuned text-generation models.
It’s beyond the scope of this post, but it’s worth mentioning that while we’re only looking at text embeddings here, there are also models that turn images and audio into embeddings, with similar implications.
How do I get an embedding?
There are a few models to turn text into an embedding. To use a hosted model behind an API, I’d recommend OpenAI, and that’s what we’ll be using in this article. For open-source options, you can check out all-MiniLM-L6-v2 or all-mpnet-base-v2.
Assuming you have an API key in your environment variables, you can get an embedding via a simple fetch:
export async function fetchEmbedding(text: string): Promise<number[]> {
  const result = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    },
    body: JSON.stringify({
      model: "text-embedding-ada-002",
      input: [text],
    }),
  });
  if (!result.ok) {
    throw new Error(`Embedding request failed: ${result.status}`);
  }
  const jsonResults = await result.json();
  // The API returns { data: [{ embedding, index }, ...] }.
  return jsonResults.data[0].embedding;
}
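As a quick sanity check on the claims from earlier, here’s a sketch using the function above (ada-002 returns 1,536 numbers normalized to length 1):

const embedding = await fetchEmbedding("I like transactional databases");
console.log(embedding.length); // 1536
const magnitude = Math.sqrt(embedding.reduce((sum, x) => sum + x * x, 0));
console.log(magnitude); // ~1, since OpenAI normalizes its embeddings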
For efficiency, I’d recommend fetching multiple embeddings at once in a batch.
export async function fetchEmbeddingBatch(texts: string[]): Promise<number[][]> {
  const result = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    },
    body: JSON.stringify({
      model: "text-embedding-ada-002",
      // The API accepts an array of strings directly.
      input: texts,
    }),
  });
  if (!result.ok) {
    throw new Error(`Embedding request failed: ${result.status}`);
  }
  const jsonResults = await result.json();
  const allEmbeddings = jsonResults.data as {
    embedding: number[];
    index: number;
  }[];
  // Sort ascending by index so the output order matches the input order.
  allEmbeddings.sort((a, b) => a.index - b.index);
  return allEmbeddings.map(({ embedding }) => embedding);
}
Where should I store it?
Once you have an embedding vector, you’ll likely want to do one of two things with it:
- Use it to search for similar strings (i.e. search for similar embeddings).
- Store it to be searched against in the future.
If you plan to store thousands of vectors, I’d recommend using a dedicated vector database like Pinecone. This allows you to quickly find nearby vectors for a given input, without having to compare against every vector every time. Stay tuned for a future post on using Pinecone alongside Convex.
If you don’t have many vectors, however, you can just store them directly in a normal database. In my case, if I want to suggest Stack posts similar to a given post or search, I only need to compare against fewer than 100 vectors, so I can just fetch them all and compare them in a matter of milliseconds using the Convex database.
How should I store an embedding?
If you’re storing your embeddings in Pinecone, stay tuned for a dedicated post on it, but the short answer is you configure a Pinecone “Index” and store some metadata along with the vector, so when you get results from Pinecone you can easily re-associate them with your application data. For instance, you can store the document ID for a row that you want to associate with the vector.
If you’re storing the embedding in Convex, I’d advise storing it as a binary blob rather than as a JavaScript array of numbers, since Convex advises against storing arrays longer than 1,024 elements. We can achieve this by converting the array into a Float32Array, which is easy in JavaScript (as a bonus, it also halves the storage: 1,536 four-byte floats instead of 1,536 eight-byte doubles):
const numberList = await fetchEmbedding(inputText); // number[]
const floatArray = Float32Array.from(numberList); // Float32Array
const floatBytes = floatArray.buffer; // ArrayBuffer
// Save floatBytes to the DB
// Later, after you read the bytes back out:
const arrayAgain = new Float32Array(bytesFromDB); // Float32Array
You can represent the embedding as a field in a table in your schema:
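// In convex/schema.ts, inside defineSchema({ ... }):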
vectors: defineTable({
float32Buffer: v.bytes(),
textId: v.id("texts"),
}),
In this case, I store the vector alongside an ID of a document in the “texts” table.
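To tie the schema back to the code above, here’s a minimal sketch of a mutation that saves a vector. The name addVector and its arguments are hypothetical, and the embedding bytes are assumed to have been computed beforehand (e.g. via fetchEmbedding plus the Float32Array conversion above):

import { mutation } from "./_generated/server";

// Hypothetical mutation: stores pre-computed embedding bytes for a text.
export const addVector = mutation(
  async ({ db }, { float32Buffer, textId }) => {
    return await db.insert("vectors", { float32Buffer, textId });
  }
);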
How to compare embeddings in JavaScript
If you’re looking to compare two embeddings from OpenAI without using a vector database, it’s very simple. There are a few ways of comparing vectors, including Euclidean distance, dot product, and cosine similarity. Thankfully, because OpenAI normalizes all of its vectors to length 1, they will all give the same rankings! A simple dot product gives you a similarity score ranging from -1 (totally unrelated) to 1 (incredibly similar). There are optimized libraries to do this, but for my purposes this simple function suffices:
/**
* Compares two vectors by doing a dot product.
*
* Assuming both vectors are normalized to length 1, it will be in [-1, 1].
* @returns [-1, 1] based on similarity. (1 is the same, -1 is the opposite)
*/
export function compare(vectorA: Float32Array, vectorB: Float32Array) {
return vectorA.reduce((sum, val, idx) => sum + val * vectorB[idx], 0);
}
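For instance, combining compare with fetchEmbeddingBatch from earlier reproduces the comparison from the intro (a sketch; your exact score may vary slightly):

const [a, b] = await fetchEmbeddingBatch([
  "I care about strong ACID guarantees",
  "I like transactional databases",
]);
console.log(compare(Float32Array.from(a), Float32Array.from(b))); // ~0.784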
Example
In this example, let’s make a function (a Convex query in this case) that returns all of the vectors and their similarity scores, ordered by similarity to some query vector, assuming the vectors table we defined above and the compare function we just defined.
import { query } from "./_generated/server";

export const compareTo = query(async ({ db }, { vectorId }) => {
  const target = await db.get(vectorId);
  if (!target) throw new Error("Unknown vector: " + vectorId);
  const targetArray = new Float32Array(target.float32Buffer);
  const vectors = await db.query("vectors").collect();
  const scores = vectors
    // Don't compare the target vector against itself.
    .filter((vector) => !vector._id.equals(vectorId))
    .map((vector) => {
      const score = compare(
        targetArray,
        new Float32Array(vector.float32Buffer)
      );
      return { score, textId: vector.textId, vectorId: vector._id };
    });
  // Highest similarity first.
  return scores.sort((a, b) => b.score - a.score);
});
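On the client, you could then subscribe to this query with Convex’s useQuery hook. The exact import and function naming depend on your Convex version, so treat this as a hypothetical sketch (it assumes the query lives in convex/vectors.ts):

import { useQuery } from "../convex/_generated/react";

// Inside a React component; vectorId comes from your app's state.
const scores = useQuery("vectors:compareTo", { vectorId });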
Summary
In this post, we looked at embeddings, why they’re useful, and how we can store and use them in Convex. I’ll be making more posts soon on working with embeddings, including chunking long input into multiple embeddings and using Pinecone alongside Convex. Let us know in our Discord what you think!
1. Copied from OpenAI’s guide.