Welcome to part three of my series on learnings from building a vector database. In part one, we covered some of the basics about vectors. Part two was a deep dive into different similarities, on and off the unit sphere. Here, we’ll explore the use of vectors with Apache Cassandra and DataStax Astra DB.
Similarity search and approximate nearest neighbor
The key query one wants to run against a vector database is a "similarity search": given a query vector V, retrieve the entries whose vectors have the highest similarity to V. The next problem is how to make vector search efficient when there are potentially many millions of vectors in the database. Luckily, there are several advanced algorithms to do that within a DB engine, usually involving so-called Approximate Nearest Neighbour (ANN) search in one way or another. In practice, this means trading a (very small) likelihood of missing some matches for a (substantial) gain in performance.
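To make that trade-off concrete, here is a toy, exact (non-approximate) top-k search in Python: it scores every stored vector against the query, which is exactly the linear scan that ANN indexes are designed to avoid. The cosine measure and the tiny data set are illustrative assumptions, not anything Cassandra does internally.

```python
import math

def cosine_similarity(u, v):
    # Raw cosine of the angle between u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_top_k(query, entries, k):
    # Score every stored vector, then keep the best k.
    # This is O(N) per query -- the cost an ANN index sidesteps.
    scored = sorted(
        entries.items(),
        key=lambda kv: cosine_similarity(query, kv[1]),
        reverse=True,
    )
    return [entry_id for entry_id, _ in scored[:k]]

entries = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [-1.0, 0.2],
}
print(exact_top_k([1.0, 0.05], entries, k=2))
```

An ANN index answers the same question approximately, but without touching every row.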
I won't spend many words here on the topic of how to equip Cassandra with state-of-the-art ANN search: if you are curious, check out this excellent article that tells the story of how we did it! I'll just mention that, by maintaining a vector-specialized index alongside the table, Cassandra / Astra DB can offer flexible, performant and massively scalable vector search. Crucially, this index (a certain type of SAI index, in Cassandra parlance) needs a specific choice of measure right from the start.
I'll now take a closer look at how the interaction with Cassandra – or equivalently Astra DB – works. You will see vector-search queries expressed in the Cassandra Query Language (CQL). CQL is a powerful way to interact with the database, and vector-related workloads are no exception.
This section assumes CQL is used throughout to interact with the database (the reason is that CQL is permissive enough to let you violate the querying best practices as outlined below).
Now, I personally think CQL is great, but I appreciate that not everyone might have the time, the motivation, or the need to learn how to use it.
For this reason, there are several higher-level abstractions that effectively "hide" the CQL foundations and allow one to focus on the vector-related application itself. I'll just mention, for Python users, CassIO, which in turn powers LangChain's and LlamaIndex's plugins for Cassandra. For an easier-to-use and cross-language experience, take a look at the REST-based Data API available in Astra DB.
Index time versus query time
The two basic operations when using a vector-enabled database table are: inserting entries, and running ANN searches. Actually, there is another important ingredient that comes before any other: creating the table, along with the required SAI index for vector search. Here is how the typical usage might unfold:
- Creation of table and index: this is where you choose a measure for the table (e.g. Euclidean).
- Writing entries to the table (i.e. rows with their vector): all vectors must have the dimensionality specified when creating the table.
- Querying with ANN search: "Give me the k entries whose vector is the most similar to a query vector V".
- Further writing and querying, in any order.
The simple sequence above stresses a key fact: you commit to a measure the moment you create the table. Unless you drop the index and re-create it, any ANN search you run will use that measure. In other words, the measure used for computing similarities is fixed at index time.
Here is the CQL code for a minimal creation of a vector table with the related index:
CREATE TABLE my_v_table
(id TEXT PRIMARY KEY, my_vector VECTOR<FLOAT, 3>);
CREATE CUSTOM INDEX my_v_index ON my_v_table(my_vector)
USING 'StorageAttachedIndex'
WITH OPTIONS = {'similarity_function': 'EUCLIDEAN'};
For the sake of completeness, here is the CQL to insert a row in the table:
INSERT INTO my_v_table
(id, my_vector)
VALUES
('mill_falcon', [8, 3.5, -4.2]);
Finally, this is how an ANN search is expressed:
SELECT id, my_vector
FROM my_v_table
ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
LIMIT 5;
As you see, no measure is specified at query time. The CQL for vector search admits other clauses and options I won't cover here (feel free to visit the docs page), but there is an important thing you can add. Observe:
SELECT
id,
my_vector,
similarity_euclidean(my_vector, [5.0, -1.0, 6.5]) as sim
FROM my_v_table
ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
LIMIT 5;
This last statement can be phrased as:
Give me the five most-similar rows to a query vector V and, for each of them, tell me also the euclidean similarity between the row and V itself.
The new bit is that one asks for an additional sim column (the alias is strongly suggested here, to avoid unwieldy column names) containing the numeric value of the similarity between the row and the provided vector. This similarity is not stored anywhere on the DB (and how could it be, since its value depends on the vector provided in the query itself?).
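To build intuition for the numbers appearing in the sim column, here is a Python sketch that computes a Euclidean similarity client-side. The 1 / (1 + d²) mapping from raw distance to a bounded score is an assumption on my part (a convention used by some engines); consult the Cassandra / Astra DB docs for the exact formula the database returns.

```python
def euclidean_similarity(u, v):
    # Map the Euclidean distance between u and v into a score in (0, 1].
    # The 1 / (1 + d^2) mapping is assumed here for illustration;
    # check your engine's docs for the formula it actually uses.
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1.0 / (1.0 + d2)

# The row and query vectors from the CQL examples above.
row_vector = [8.0, 3.5, -4.2]
query = [5.0, -1.0, 6.5]
sim = euclidean_similarity(row_vector, query)
print(sim)  # identical vectors would score 1.0; far-apart ones approach 0.0
```

Whatever the exact mapping, the point stands: the score is computed on the fly from the row's vector and the vector you pass in the query.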
Now pay attention: nothing prevents one from using two different vectors in the query, or even from asking for a similarity computed with a different measure than the one chosen earlier for the index (and hence used for searching)!
// Warning: WEIRD QUERY
// (check the vector values and the chosen measure).
// Why should you do this?
SELECT
id,
my_vector,
similarity_dot_product(my_vector, [-7, 0, 11]) as weird_sim
FROM my_v_table
ORDER BY my_vector ANN OF [5.0, -1.0, 6.5]
LIMIT 5;
In words: give me the five rows closest to V, according to the Euclidean similarity, and compute their Dot-product similarity to this other vector W.
Pretty odd, no? The point is, this is something CQL does not actively prevent, but that does not mean you should do it:
- Using two different similarities at index-time and query-time is probably never a sensible choice.
- Using two different vectors for the ANN part and the sim column (i.e. W != V) hardly makes sense; in particular, the returned rows would no longer be sorted from highest to lowest similarity.
In short, the measure for the sim column is specified at query time. Moreover, the query vector and the one for the sim column are passed as two separate parameters.
The practical advice is then this: do not pass two different vectors, and do not mix similarities between index and query.
As usual, the confusion from mixing similarities is greatly reduced if working on the sphere (the ordering of results, at least, would not "look weird").
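A tiny Python sketch shows why using W != V spoils the ordering: rows sorted by Euclidean distance to V can come back with sim values (computed against W) in any order, here even perfectly reversed. The three two-dimensional rows are made up for illustration.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

rows = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [-1.0, 0.0]}
V = [1.0, 0.1]   # vector used for the ANN ordering
W = [-1.0, 0.0]  # different vector used for the sim column

# Order the rows as a Euclidean ANN search on V would (closest first).
ordered = sorted(rows, key=lambda r: euclidean_distance(rows[r], V))
# Compute the "sim" column against W instead of V.
sims = [dot_product(rows[r], W) for r in ordered]
print(ordered, sims)  # the sims are not in decreasing order
```

The rows come back closest-to-farthest with respect to V, yet the sim column, being computed on W, increases down the result set: exactly the "weird" behavior described above.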
Now, there might be very specific use cases that could knowingly take advantage of this kind of CQL freedom. But, frankly, if you are into that kind of thing, you already know the topic covered in this section (while you're at it, why don't you drop a comment below on your use case? I'd be curious to hear about it).
Note: when using the Data API, as opposed to plain CQL, you retain neither the freedom to ask for a similarity other than the one configured for the table index, nor the freedom to use W != V in vector searches. And, as was argued in this section, this is not a bad thing at all!
Caption: Bad (or at least questionable) practices with vector searches in CQL. Left: SELECT similarity_euclidean(my_vector, V) as sim ... ORDER BY my_vector ANN OF V on a table created with the Cosine similarity. The returned rows will be r1 and r2, in that order, but since (Euclidean!) δ1 > δ2, the sim column will come back in increasing order. Right: SELECT similarity_euclidean(my_vector, W) as sim ... ORDER BY my_vector ANN OF V on a table created with Euclidean. The closest-to-farthest sorting of the rows refers to V, but the similarity is calculated on W: one gets back rows r1 and r2 in that order, but the sim column again lists increasing values.
Null vectors and cosine
A close look at how the similarities are computed will reveal a property specific to the cosine choice: for the null vector, i.e. a vector with all components equal to zero, the definition makes no sense at all!
Beneath the "technical" problem of a division by zero (|v| = 0) arising in the similarity formula lies a deeper mathematical reason: a segment of zero length has no defined direction to speak of.
What does this imply? In practice, if you plan to use the cosine similarity, you must ensure that only non-null vectors are inserted. But then again, if you choose this similarity, you might as well rescale each incoming vector to unit length and live in the comfort of the sphere... which is something that can be done on all vectors except the null vector! And so we come full circle.
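The normalization escape hatch, and its one exception, can be sketched in a few lines of Python (the helper function is mine for illustration, not part of any driver):

```python
import math

def normalize(v):
    # Rescale a vector to unit length: possible for every vector
    # except the null vector, whose norm is zero and which has
    # no direction to preserve.
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("the null vector has no direction to normalize")
    return [a / norm for a in v]

print(normalize([3.0, 4.0]))  # a well-defined unit vector
try:
    normalize([0.0, 0.0])
except ValueError as err:
    print("cannot normalize:", err)
```

Running such a check (or normalization) client-side before every insert is a cheap way to keep a cosine-indexed table free of problematic rows.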
Note that the null vector poses no problem whatsoever for the Euclidean distance, nor for the Dot-product (though the latter, as remarked earlier, probably does not make much sense off the sphere).
So, by now you must be curious as to how Cassandra / Astra DB handles this situation. Here are the various cases:
- Running a search such as SELECT ... ANN OF [0, 0, ...] raises a database error such as: Operation failed - received 0 responses and 2 failures: UNKNOWN from ...
- Inserting a row, e.g. INSERT INTO ... VALUES (..., [0, 0, ...], ...), raises the same error from the DB: Operation failed - received 0 responses and 2 failures: UNKNOWN from ...
- Using the null vector for a similarity column computed during the query, such as SELECT ... similarity_cosine(my_vector, [0, 0, ...]) ..., returns a sim column of all NaN (not-a-number) values. This remains true also when running queries through the Python Cassandra driver.
- The most interesting case is when the table already contains some null-vector rows before the vector SAI index is created. In that case, no errors are raised, but the rows will be silently left out of the index and never reached by ANN searches. This can be tricky, and should be kept in mind.
Note: when using the Data API, the underlying behavior for Cosine-based collections is the same as outlined above – except, the last case is not really possible as the index is created as soon as the collection is available.
The first three installments of this mini-series cover most of what you'll need to know your way around the mathematics, and the practical aspects, of similarity computations between vectors. However, it turns out that a few obstacles may get in your way if you want to perform something as mundane as, say, switching between vector stores in an existing GenAI application. In the next, and last, article, we’ll look at a complete example of such a migration, paying close attention to the hidden surprises along the way. See you next time!