
Jianyang Gao

Extended RaBitQ: an Optimized Scalar Quantization Method

The cover image was generated by DALL-E.

Approximate nearest neighbor (ANN) search in high-dimensional spaces, a.k.a. vector search, is an increasingly important area due to its surging applications in vector databases, large language models (LLMs), and retrieval-augmented generation (RAG). However, to support various applications with low latency, a system often needs to host millions or even billions of vectors in RAM, leading to high costs, for instance, when a service is deployed in the cloud. Quantization, a family of techniques for vector compression, thus plays an increasingly critical role in reducing space consumption and the cost of services.

In our previous post, we illustrated the key insights behind our research, the RaBitQ algorithm, which works as an optimized approach to binary quantization. It achieves a significant improvement in accuracy in practice and provides an asymptotically optimal theoretical error bound. However, the original RaBitQ paper only supports binary quantization (1-bit quantization). It remained unclear how it could utilize more bits (e.g., 2, 3, or 4 bits per dimension) to achieve higher accuracy. In this post, we introduce our extended version of RaBitQ, which presents a new strategy for minimizing the error when more bits are used. This allows the RaBitQ method to support quantization at an arbitrary compression rate. In particular, it provides an optimized approach to scalar quantization that has the potential to seamlessly replace existing ones in a system:

  • Its accuracy is dominantly better than that of the state-of-the-art variant of scalar quantization at the same compression rate.
  • Its computation during querying is exactly the same as that of scalar quantization.

The paper and code were released in September 2024. To better understand this post, we suggest first reading our last post.

Codebook: From 1 Bit per Dimension to Multiple Bits per Dimension

Recall that in the RaBitQ method, we construct a quantization codebook by taking the vertices of a cube nested in the unit sphere, which is illustrated as follows.

1-bit RaBitQ

Let $D$ be the dimensionality of a vector. Each vertex of the cube corresponds to a bi-valued vector, i.e., all of its coordinates are either $-\frac{1}{\sqrt{D}}$ or $+\frac{1}{\sqrt{D}}$. Thus, the codebook contains $2^D$ vectors in total, and each vector in the codebook can be represented as a $D$-bit binary string.
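As a minimal illustration, the following Python sketch builds this 1-bit codeword for a vector by taking the sign of each coordinate. It covers only the codebook itself and omits the random rotation and the unbiased estimator that the full RaBitQ method relies on.

```python
import numpy as np

def one_bit_codeword(x: np.ndarray):
    """Map a vector to the nearest vertex of the cube nested in the unit sphere.

    Each coordinate of the codeword is +1/sqrt(D) or -1/sqrt(D), depending on
    the sign of the corresponding coordinate of x, so the codeword is a unit
    vector and can be stored as a D-bit binary string.
    """
    D = x.shape[0]
    bits = x > 0                                       # D-bit representation
    codeword = np.where(bits, 1.0, -1.0) / np.sqrt(D)  # unit-norm codeword
    return bits, codeword

# Example: quantize a random unit vector in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
x /= np.linalg.norm(x)
bits, codeword = one_bit_codeword(x)
print(bits.astype(int))            # the stored binary string
print(np.linalg.norm(codeword))    # 1.0: the codeword lies on the unit sphere
```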

However, when more bits are used per dimension, we cannot easily find a codebook that is nested in the unit sphere: when each dimension has more than 1 bit, the codebook corresponds to vectors on a grid. In the following figure, we plot the codebook of 2-bit quantization in 2-dimensional space as an example, where each empty blue point represents a vector in the codebook.

2-bit Quantization

Recall that, as discussed in our previous post, RaBitQ achieves accurate distance estimation because it designs an accurate estimator. This estimator depends on the property that the codebook consists of unit vectors. However, in the current example, this property does not hold.

Finding the Nearest Vector in a Normalized Codebook

To resolve this issue, a natural idea is to map the codebook onto the unit sphere by normalization, which is illustrated as follows. The green circle visualizes the 2-dimensional unit sphere and the red points represent the normalized vectors in the codebook.

Normalized Codebook

Performing this operation resolves the aforementioned issue with the estimator. However, it introduces a new challenge: for a data vector, we can no longer easily find its nearest vector in the codebook.

  • When the codebook is formed of vectors on the grid (the blue empty points), we can easily perform rounding (i.e., scalar quantization) on each dimension to find the nearest vector.
  • When the codebook is normalized (the red solid points), the result of rounding is not necessarily the vector that best approximates the data vector.

In the following figure, we provide such an example. The purple triangle represents the data vector X, which we want to quantize. In the codebook, vector A is the nearest to X, which means it is also the result of performing rounding (scalar quantization) on X. However, once the codebook is normalized, vector B is the one that best approximates X: the red point corresponding to B is closer to X.

Find the Nearest Vector
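To make this concrete, here is a small Python sketch that reproduces the phenomenon by brute force on a tiny 2-dimensional codebook. The symmetric 2-bit grid used below is an assumption for illustration, not the exact codebook layout of the paper.

```python
import numpy as np
from itertools import product

grid = np.array([-1.5, -0.5, 0.5, 1.5])   # assumed 2-bit grid per dimension

def round_to_grid(x):
    """Plain rounding (scalar quantization): nearest grid value per coordinate."""
    return grid[np.argmin(np.abs(x[:, None] - grid[None, :]), axis=1)]

def best_after_normalization(x):
    """Brute force over all 4^2 = 16 codewords: pick the one whose normalized
    version has the highest cosine similarity with x."""
    candidates = [np.array(c) for c in product(grid, repeat=len(x))]
    return max(candidates, key=lambda c: float(c @ x) / np.linalg.norm(c))

x = np.array([0.99, 0.14])                 # a data vector close to the x-axis
print("rounding:         ", round_to_grid(x))
print("best (normalized):", best_after_normalization(x))
```

For this particular vector, the two codewords differ: plain rounding and the normalized codebook disagree, which is exactly the gap between A and B illustrated in the figure.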

So the question is how to find the vector that best approximates a data vector after the codebook has been normalized. We find that although the optimal vector might not be found by directly performing rounding on a data vector, it can be found if we first rescale the vector and then perform rounding. Let us continue with the example. In the figure below, we plot a vector tX as another purple triangle, where the rescaling factor t is a positive number.

Rescale

We can see from the figure that, when the data vector X is rescaled to the location of tX, its nearest vector (which can be found by rounding) is B: after rescaling the data vector and performing rounding, we successfully find the vector in the codebook that best approximates it.

But is this a special case, or does it hold for any possible vector? Through rigorous mathematical proof (see Lemma 3.1 in our paper), we show that for any data vector, there is always a rescaling factor such that, by performing rounding (scalar quantization) on the rescaled vector, we can find the optimal vector in the codebook.

Based on this lemma, to find the optimal vector, we can try different rescaling factors. For every rescaling factor, we compute the nearest vector by rounding and the similarity/error it produces. Finally, by comparing the errors, we find the optimal rescaling factor and the vector that best approximates our data vector. We refer readers to Section 3.2 of our paper for the detailed strategy of choosing which factors to try. It is worth highlighting that this algorithm is guaranteed to find exactly the optimal vector in the codebook.
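The following Python sketch illustrates the idea under simplifying assumptions: it tries a hand-picked list of rescaling factors rather than the exact set derived in Section 3.2 of the paper, and it uses an assumed symmetric grid, so it approximates (rather than guarantees) the optimal codeword.

```python
import numpy as np

def quantize_with_rescaling(x, grid, t_candidates):
    """For each rescaling factor t, round t*x onto the grid and keep the
    codeword whose normalized version is most similar to x (cosine similarity)."""
    x = x / np.linalg.norm(x)
    best_code, best_sim = None, -np.inf
    for t in t_candidates:
        # Nearest grid value per coordinate of the rescaled vector t*x.
        code = grid[np.argmin(np.abs(t * x[:, None] - grid[None, :]), axis=1)]
        norm = np.linalg.norm(code)
        if norm == 0.0:                 # guard against a degenerate all-zero codeword
            continue
        sim = float(code @ x) / norm    # similarity after normalizing the codeword
        if sim > best_sim:
            best_code, best_sim = code, sim
    return best_code, best_sim

rng = np.random.default_rng(42)
x = rng.normal(size=16)
grid = np.array([-1.5, -0.5, 0.5, 1.5])   # assumed 2-bit grid per dimension
code, sim = quantize_with_rescaling(x, grid, t_candidates=np.linspace(0.1, 3.0, 64))
print(code, sim)
```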

Another Equivalent Way of Understanding

Besides rescaling the data vector, we note that the algorithm can also be equivalently understood from another perspective: rescaling the codebook. Let us still consider quantizing the vector X.

Raw Codebooks

In the figure above, neither the data vector nor the codebook is rescaled. In this case, the nearest vector to X in the codebook is A, and it produces a relatively large error.

Rescaling Codebooks

When we rescale the codebook to the location indicated in the figure above, the nearest vector to X in the codebook becomes B, and the error (the distance between X and the rescaled B) is much smaller. Thus, our algorithm can also be equivalently explained in the following way: it performs scalar quantization by trying different parameters on a per-vector basis, computes the quantization error produced under each parameter, and selects the parameter and codebook vector that minimize the error. Mathematically, rescaling a data vector by a factor of t is equivalent to rescaling the codebook by a factor of 1/t. It is also worth noting that the original RaBitQ, which quantizes every dimension to 1 bit, can be regarded as a special case of the extended technique.
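This equivalence is easy to verify numerically. In the Python sketch below (again with an assumed 2-bit grid), rounding tX onto the grid yields, up to the factor t, the same codeword as rounding X onto the grid rescaled by 1/t.

```python
import numpy as np

grid = np.array([-1.5, -0.5, 0.5, 1.5])   # assumed 2-bit grid per dimension
round_to = lambda v, g: g[np.argmin(np.abs(v[:, None] - g[None, :]), axis=1)]

rng = np.random.default_rng(7)
x = rng.normal(size=8)
x /= np.linalg.norm(x)
t = 1.7

a = round_to(t * x, grid)          # view 1: rescale the data vector by t
b = round_to(x, grid / t) * t      # view 2: rescale the codebook by 1/t
print(np.allclose(a, b))           # True: the two views select the same codeword
```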

Summary

Based on the aforementioned technique, we resolve the key question: how to find the best approximation of a vector when multiple bits are available in every dimension. Intuitively, our strategy is to perform rounding (i.e., scalar quantization) by trying different parameters on a per-vector basis, compute the quantization error produced under each parameter, and select the optimal parameter and vector. Based on this strategy and the estimator discussed in our previous post, our method achieves the following results:

  • Theoretically, we prove that the extended version of RaBitQ provides an asymptotically optimal error bound in terms of the trade-off between space and accuracy. To our knowledge, RaBitQ is also the first practical algorithm that achieves this optimality.
  • Empirically, on all the tested datasets, when quantizing a vector to 2 bits per dimension, RaBitQ is more accurate than the state-of-the-art variant of scalar quantization by orders of magnitude. At 3 bits and above, RaBitQ still brings significant and consistent improvements.

In addition, we note that the unique error bound of RaBitQ not only promises stable accuracy across different datasets, but also opens a broader range of opportunities for optimizing vector search. For example, based on the error bound, we can check whether a candidate is unlikely to become the nearest neighbor: if the lower bound of its estimated distance is larger than the distance of the current nearest neighbor, it is unlikely to be the answer. We refer readers to our paper and code repository for more experimental results and techniques.
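As a rough illustration of that pruning idea (the function name and the exact form of the bound below are ours for illustration, not the paper's API), the check could look like this:

```python
def can_skip(estimated_distance: float, error_bound: float, best_distance_so_far: float) -> bool:
    """The estimator comes with an error bound, so subtracting it gives a lower
    bound on the candidate's true distance. If even this lower bound exceeds the
    distance of the current nearest neighbor, the candidate is unlikely to be the
    answer and its exact distance need not be computed."""
    lower_bound = estimated_distance - error_bound
    return lower_bound > best_distance_so_far
```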

Since its release earlier this year, RaBitQ has been adopted in several real-world systems. For example, TensorChord adopts RaBitQ in their cost-effective vector search systems and has published a detailed blog explaining how to optimize the algorithm step by step in Rust. Additionally, Elastic integrated our RaBitQ algorithm into a feature they call "BBQ". Their empirical evaluation highlights the breakthrough performance achieved by our RaBitQ algorithm. In September 2024, we released an updated version of RaBitQ with an extension that enhances scalar quantization. We look forward to seeing this extension further drive performance improvements in production systems.
