Today, I'm introducing a new experimental multilingual embedding model for flexible visual document retrieval. mcdse-2b-v1 (🤗) builds upon MrLight/dse-qwen2-2b-mrl-v1 and is trained using the DSE approach.
This model allows you to embed page/slide screenshots and query them using natural language. Whether it's tables, graphs, charts, schemas, images, or text, mcdse-2b-v1 encodes everything into a single embedding vector, eliminating the need for traditional OCR, document layout analysis, reading order detection, chunking, table/formula extraction...
- Strong metrics on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German.
- Matryoshka Representation Learning: embeddings can efficiently scale from 1536 down to 256 dimensions. You can shrink them 6x and still retain 95% of the embedding quality.
- Exceptional at binarization: 768-d binary vectors retain 99% of the retrieval quality of the base 1536-d float vectors. With binary vectors, you can encode 100 million multilingual pages in just 10GB.
- Fast vLLM inference: run inference on vLLM and efficiently serve embeddings at scale, production ready. Check the Deployment section to learn more.
My benchmarks aren't flawless, so I encourage you to test the model on your own data. This is an early version with plenty of room for improvement; even so, the results highlight a strong multilingual retriever that adapts remarkably well to various memory/speed requirements.
Training
mcdse-2b is trained from MrLight/dse-qwen2-2b-mrl-v1 using low-rank adapters (LoRA) on a multilingual corpus of documents. I trained it on 8x RTX 3090 GPUs using the DSE approach with the following parameters:
- Epochs = 1
- Warmup ratio = 0.1
- Learning rate = 1e-5
- Optimizer = adamw_torch
- Schedule = linear
- Total batch size = 16
- LoRA
  - Alpha = 64
  - R = 16
  - Dropout = 0.1
  - DoRA = True
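For reference, here's a minimal sketch of what this adapter configuration might look like with Hugging Face PEFT; the target modules and model wiring are my assumptions, not the actual training script:

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

# Sketch of the LoRA setup described above; target_modules are assumed,
# the real run used the DSE/Tevatron training code.
base = Qwen2VLForConditionalGeneration.from_pretrained("MrLight/dse-qwen2-2b-mrl-v1")

lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.1,
    use_dora=True,  # DoRA = True
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="FEATURE_EXTRACTION",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```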
Dataset
The dataset comprises 24K PDF documents automatically scraped from the public internet. Random pages were extracted from each document, converted into compressed JPEG images, and filtered to remove blank pages and duplicates. The resulting page screenshots are unique and span a wide range of topics.
I used gemini-1.5-flash-002 to generate queries based on each image. Gemini was instructed to come up with three types of queries:
- A broad topical query: summarizing the overall theme of the document.
- A specific detailed question: capturing subtle nuances within the content.
- A visual query: focusing on visual elements such as charts, graphs, images, or signatures.
The entire training and evaluation datasets were generated for just €2 (thanks, Gemini Flash!)
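For illustration only, generating such queries with the Gemini API could look roughly like the snippet below; the prompt wording and file names are made up, not the actual pipeline:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")
model = genai.GenerativeModel("gemini-1.5-flash-002")

page = Image.open("page_screenshot.jpg")

# Hypothetical prompt: ask for the three query types described above.
prompt = (
    "Look at this document page and write three search queries a user might type: "
    "1) a broad topical query, 2) a specific detailed question, "
    "3) a query about a visual element (chart, graph, image, signature)."
)

response = model.generate_content([prompt, page])
print(response.text)
```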
Each image is then classified by its text density on a scale from 0 to 2. I used the omoured YOLOv10n model, fine-tuned on DocLayNet, to detect regions such as figures versus text. Based on the proportion of the page covered by each region type, I heuristically calculate the text density (a sketch of this heuristic follows the scale below). I plan to use this classification to improve the model's performance on text-dense documents.
- 0 = only visuals
- 1 = a mix of visuals and text
- 2 = only text
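Here is a rough sketch of how such a text-density heuristic could be computed from layout detections; the detector path, class grouping, and thresholds are my assumptions, not the exact rule used for the dataset:

```python
from ultralytics import YOLO
from PIL import Image

# Hypothetical sketch: derive a 0/1/2 text-density label from DocLayNet-style detections.
model = YOLO("yolov10n_doclaynet.pt")  # placeholder path to the fine-tuned detector
page = Image.open("page_screenshot.jpg")

result = model(page)[0]
areas = {"text": 0.0, "visual": 0.0}
for box, cls_id in zip(result.boxes.xywh, result.boxes.cls):
    w, h = float(box[2]), float(box[3])
    name = result.names[int(cls_id)].lower()
    key = "visual" if name in {"picture", "table", "formula"} else "text"
    areas[key] += w * h

total = sum(areas.values()) or 1.0
text_ratio = areas["text"] / total
density = 0 if text_ratio < 0.2 else (2 if text_ratio > 0.8 else 1)  # assumed thresholds
```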
The eval and train datasets are not yet published. I'm very willing to open-source them, but I'm still unsure how to do it properly without violating any licenses (if any apply). If you know how to help me, please reach out!
Train Runs
The model was sequentially trained for each language in the following order:
1) French: 6k samples
2) Spanish: 6k samples
3) Italian: 6k samples
4) German: 6k samples
This order was determined by the base model's retrieval performance in these languages, the first being the best performing. My intuition is that, given the small dataset, starting with the stronger languages could help balance overall improvements across the model.
Before reaching this final checkpoint, I conducted multiple runs to test various strategies and validate some of my intuitions.
- Language order: I swapped the order of the last two languages and found that training German last improved its performance on evaluations by 1.7%, while maintaining similar scores across the other languages.
- Model initialization: I initialized the model with 10k mmarco pairs for each language. This resulted in worse performance across all languages, particularly with lower-dimensional embeddings. For example, French NDCG@5 using 512-dimensional embeddings dropped by 2% when trained with mmarco.
- Different image resize algorithm: I developed a custom resize function (`custom_resize`) that strictly preserves the image's aspect ratio while scaling it down to fit within `min_pixels` and `max_pixels`. All evaluations used the standard resize function from qwen_vl_utils. Models trained with the custom resize function outperformed the standard method, with an average +1.7% NDCG@5 improvement (1536 dimensions). It would be interesting to explore training a ColQwen model with this `custom_resize` function.

| Resize function | Avg | English | Italian | Spanish | French | German |
|---|---|---|---|---|---|---|
| qwen2_vl_utils | 80.8 | 80.2 | 80.5 | 79.6 | 81 | 82.6 |
| custom_resize | 82.2 | 80.8 | 81.2 | 80.7 | 84.5 | 83.8 |
| | +1.7% | +0.7% | +0.9% | +1.4% | +4.0% | +1.4% |
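The actual implementation lives in the mcdse repo, but a minimal sketch of an aspect-ratio-preserving resize bounded by a pixel budget could look like this (the scaling rule is my assumption):

```python
import math
from PIL import Image

def custom_resize(image: Image.Image, min_pixels: int, max_pixels: int) -> Image.Image:
    """Hypothetical sketch: scale uniformly so width*height falls inside
    [min_pixels, max_pixels] while keeping the original aspect ratio."""
    w, h = image.size
    pixels = w * h
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)
    else:
        return image
    return image.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.LANCZOS)
```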
Evaluations
Due to the limited availability of public datasets for multilingual document image retrieval, the model was evaluated on a custom-built dataset, specifically designed to benchmark its performance across languages.
This evaluation dataset was created with the same methodologies and pipelines as the training dataset. However, the document topics are generally different, and no images are shared between the training and evaluation datasets, to avoid evaluation contamination. NDCG scores were calculated by running 100 unique queries against a 1K-document index for each language.
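Assuming each query has exactly one relevant page (which is how the generation pipeline pairs queries with pages), NDCG@5 reduces to a simple rank-based formula; a small sketch:

```python
import numpy as np

def ndcg_at_5(retrieved_ids: list[int], relevant_id: int) -> float:
    # With a single relevant page per query, the ideal DCG is 1,
    # so NDCG@5 = 1/log2(rank + 1) if the page is in the top 5, else 0.
    top5 = retrieved_ids[:5]
    if relevant_id not in top5:
        return 0.0
    rank = top5.index(relevant_id) + 1
    return 1.0 / float(np.log2(rank + 1))
```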
Matryoshka Representation Learning
This model is trained with Matryoshka Representation Learning (MRL) on the following dimensions: 1536, 1024, 768, 512, 384, 256. The loss function used during training is calibrated to track performance across all these dimensions, leading the model to frontload the most important identifying information. This effectively allows you to shrink the embedding dimensions according to your scale and budget.
The table below reports the average NDCG@5 for every dimension. Interestingly, the model shows improvements in English, even though this language wasn't included in the training set. At 256 dimensions the model performs 6% better, and overall it improves by 4% on average across all dimensions. Evaluations were conducted using FAISS with IndexFlatL2.
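As a rough illustration (not the actual evaluation code), shrinking an MRL embedding is just slicing off the leading components, re-normalizing, and indexing with FAISS:

```python
import faiss
import numpy as np

def truncate(embeddings: np.ndarray, dim: int) -> np.ndarray:
    # MRL embeddings can be cut to their first `dim` components and re-normalized.
    cut = embeddings[:, :dim].astype(np.float32)
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Placeholder arrays; in practice these are 1536-d vectors from the model.
doc_embeddings = np.random.rand(1000, 1536).astype(np.float32)
query_embeddings = np.random.rand(100, 1536).astype(np.float32)

dim = 512
index = faiss.IndexFlatL2(dim)
index.add(truncate(doc_embeddings, dim))
distances, ids = index.search(truncate(query_embeddings, dim), 5)  # top-5 for NDCG@5
```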
NDCG@5 (float)
| | Average | English | Italian | Spanish | French | German |
|---|---|---|---|---|---|---|
| 1536 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 79.5 | 79.2 | 80.2 | 77.9 | 80.6 | 79.6 |
| mcdse-2b-v1 | 82.2 | 80.8 | 81.2 | 80.7 | 84.5 | 83.8 |
| | +3.28% | +1.98% | +1.23% | +3.47% | +4.62% | +5.01% |
| 1024 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 78.3 | 78.8 | 78.5 | 76.5 | 80 | 77.5 |
| mcdse-2b-v1 | 81.7 | 80 | 80.2 | 80.1 | 84 | 84.3 |
| | +4.23% | +1.75% | +2.12% | +4.49% | +4.76% | +8.07% |
| 768 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 77.8 | 78.4 | 78.3 | 75.6 | 80.8 | 75.9 |
| mcdse-2b-v1 | 81.1 | 79.6 | 79.9 | 79.2 | 83.3 | 83.3 |
| | +4.02% | +1.51% | +2.00% | +4.55% | +3.00% | +8.88% |
| 512 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 76.2 | 77.6 | 75.9 | 73.1 | 79.2 | 75.2 |
| mcdse-2b-v1 | 79.3 | 78.5 | 79.1 | 75.8 | 81.4 | 81.7 |
| | +3.91% | +1.15% | +4.05% | +3.56% | +2.70% | +7.96% |
| 384 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 75.7 | 76.2 | 75.5 | 74.6 | 78.4 | 74 |
| mcdse-2b-v1 | 78.8 | 77.5 | 78.5 | 76.1 | 80.4 | 81.4 |
| | +3.86% | +1.68% | +3.82% | +1.97% | +2.49% | +9.09% |
| 256 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 73.5 | 74.5 | 73.6 | 70.6 | 74.8 | 73.8 |
| mcdse-2b-v1 | 78.1 | 78.5 | 77.6 | 76.2 | 80.1 | 77.9 |
| | +5.89% | +5.10% | +5.15% | +7.35% | +6.62% | +5.26% |
Binary Embeddings
mcdse-2b-v1 clearly performs better under binarization, especially at lower dimensions. At 256 dimensions it is 23% better, with an average improvement of 13% overall. Evaluations were conducted using FAISS with IndexBinaryFlat. But why use binary embeddings at all?
| | NDCG@5 | Memory needed for 100M embeddings |
|---|---|---|
| dse-qwen2-2b-mrl-v1 (float16) | 79.5 | 286 GB |
| mcdse-2b-v1 (binary) | 80.6 | 18 GB |
This table shows that mcdse-2b-v1's 1536-dimensional binary embeddings are 1% better than the base model's 1536-dimensional float vectors while reducing memory consumption by 16x. Beyond these savings, binary embeddings can also be searched roughly 40x faster with Hamming distance, since comparing two binary vectors takes only 2 CPU instructions (xor, popcnt).
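For context, a minimal sketch of binarizing and searching these vectors with FAISS; thresholding at zero is the usual convention, but the exact evaluation code may differ:

```python
import faiss
import numpy as np

# Placeholder float embeddings; in practice these come from mcdse-2b-v1.
doc_embeddings = np.random.randn(1000, 1536).astype(np.float32)
query_embeddings = np.random.randn(100, 1536).astype(np.float32)

def binarize(x: np.ndarray) -> np.ndarray:
    # 1 bit per dimension: positive -> 1, else 0, packed into uint8 bytes.
    return np.packbits((x > 0).astype(np.uint8), axis=1)

index = faiss.IndexBinaryFlat(1536)  # dimension given in bits
index.add(binarize(doc_embeddings))
hamming_distances, ids = index.search(binarize(query_embeddings), 5)
```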
NDCG@5 (binary)
| | Average | English | Italian | Spanish | French | German |
|---|---|---|---|---|---|---|
| 1536 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 75.0 | 75.8 | 75.4 | 72.4 | 78.1 | 73.2 |
| mcdse-2b-v1 | 80.6 | 79.5 | 76.9 | 81.9 | 83.7 | 80.8 |
| | +6.93% | +4.65% | +1.95% | +11.60% | +6.69% | +9.41% |
| 1024 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 72.2 | 74.8 | 71 | 70.8 | 74.6 | 69.6 |
| mcdse-2b-v1 | 79.3 | 78.4 | 75.4 | 80.8 | 82.6 | 79.5 |
| | +9.05% | +4.59% | +5.84% | +12.38% | +9.69% | +12.45% |
| 768 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 70.1 | 71.7 | 69.3 | 69.8 | 73.7 | 65.9 |
| mcdse-2b-v1 | 78.8 | 77.1 | 75.4 | 80 | 83 | 78.5 |
| | +11.07% | +7.00% | +8.09% | +12.75% | +11.20% | +16.05% |
| 512 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 66.5 | 70 | 65.4 | 63.7 | 70.2 | 63 |
| mcdse-2b-v1 | 76.6 | 74.8 | 74.2 | 77.7 | 80.9 | 75.3 |
| | +13.21% | +6.42% | +11.86% | +18.02% | +13.23% | +16.33% |
| 384 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 61.1 | 62.7 | 58.5 | 58.6 | 65.1 | 60.8 |
| mcdse-2b-v1 | 74.3 | 74.5 | 71.4 | 77.2 | 75.2 | 73 |
| | +17.67% | +15.84% | +18.07% | +24.09% | +13.43% | +16.71% |
| 256 dimensions | | | | | | |
| dse-qwen2-2b-mrl-v1 | 54.3 | 59 | 56.5 | 53.6 | 53 | 49.6 |
| mcdse-2b-v1 | 70.9 | 72.6 | 66.4 | 73.5 | 72.6 | 69.2 |
| | +23.31% | +18.73% | +14.91% | +27.07% | +27.00% | +28.32% |
ShiftProject
The vidore/shiftproject_test dataset is part of the ViDoRe Benchmark. It contains French queries and documents sourced from the Shift Project about the environment. Queries were generated with Claude 3 Sonnet using the same prompt (translated into French) that was used to generate queries for the scraped documents in vidore/colpali_train_set.
| | ShiftProject (NDCG@5) |
|---|---|
| dse-qwen2-2b-mrl-v1 | 80.8 |
| mcdse-2b-v1 | 78.6 |
| | -2.80% |
This is the NDCG@5 on the ShiftProject dataset, with 1536 float dimensions and evaluated using at most 960 image patches.
I expected mcdse-2b-v1 to score higher than the base model; instead, it's about 3% worse.
The base model was trained on the ColPali train set, so I thought it might have been over-optimized for "Claude-3 Sonnet like" queries. To investigate this, I regenerated the ShiftProject dataset queries using gemini-1.5-flash-002 and my prompts.
| | ShiftProject_Gemini (NDCG@5) |
|---|---|
| dse-qwen2-2b-mrl-v1 | 67 |
| mcdse-2b-v1 | 70.8 |
| | +5.37% |
The scores change wildly, but in this case mcdse-2b-v1 is 5% better. These results tend to suggest two possible causes:
1) The base model is over-optimized for "Claude-3 Sonnet like" queries
2) My model is over-optimized for "gemini-1.5-flash-002 like" queries
In both scenarios, I believe mcdse-2b-v1 has mitigated these overoptimizations by understanding broader query distributions.
My Gemini-generated queries come in two formats: questions and queries. The colpali_train_set queries are questions only. I also tested both models on just the Gemini queries and just the Gemini questions.
| | ShiftProject_GeminiQuestions (NDCG@5) | ShiftProject_GeminiQueries (NDCG@5) |
|---|---|---|
| dse-qwen2-2b-mrl-v1 | 74.8 | 58.6 |
| mcdse-2b-v1 | 69.5 | 63.5 |
| | -7.63% | +7.72% |
The base model is 7% better on Gemini questions and 7% worse on Gemini queries. The two models' average scores across questions and queries are nearly identical (66.7 vs 66.5). This suggests that my model has mitigated the previously mentioned over-optimizations and is generally better at understanding a wider variety of queries. Training on more multilingual data will probably increase this average and eventually improve performance on ShiftProject.
Cohere Embed v3 Image
I conducted some preliminary (and rushed) tests using the recently announced Cohere embed-multilingual-v3.0 multimodal embeddings on a smaller version of the English dataset. That model achieved an NDCG@5 score of 71, while mcdse-2b-v1 scored around 84. I'm working on more comprehensive evaluations for this model.
Deployment
With HuggingFace Transformers, you can expect to encode ~3 images/s on an RTX 3090 (35 TFLOPS) with a batch size of 32. A more common inference-side GPU like the RTX 4000 Ada delivers roughly the same throughput.
vLLM
vLLM officially supports Qwen2VL for generation only, so I have added a new model class, `Qwen2VLForEmbeddingGeneration`, to support embedding tasks. Running inference on vLLM should be ~5x faster than HuggingFace Transformers.
Download the new model class:

```bash
git clone https://github.com/marplex/mcdse && cd mcdse
```
Download mcdse-2b-v1 for local inference:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="marco/mcdse-2b-v1", local_dir="/path/to/model/mcdse-2b-v1")
```
Edit `config.json` to replace `Qwen2VLForConditionalGeneration` with `Qwen2VLForEmbeddingGeneration`:

```bash
sed -i -e 's/Qwen2VLForConditionalGeneration/Qwen2VLForEmbeddingGeneration/g' /path/to/model/mcdse-2b-v1/config.json
```
Check `vllm/main.py` for local inference:

```python
# vllm/main.py
from qwen2_vl_dse import Qwen2VLForEmbeddingGeneration, get_query_prompt, get_document_prompt
from vllm import ModelRegistry, LLM
from PIL import Image

ModelRegistry.register_model("Qwen2VLForEmbeddingGeneration", Qwen2VLForEmbeddingGeneration)

llm = LLM(
    model="/path/to/model/mcdse-2b-v1",
    limit_mm_per_prompt={
        "image": 1
    }
)

# Encode queries
query_prompt, image = get_query_prompt("Quali erano le passività totali al 31 dicembre 2017?")
outputs = llm.encode({"prompt": query_prompt, "multi_modal_data": {"image": [image]}})
outputs[0].outputs.embedding  # 1536-dimensional embedding

# Encode documents
dummy_document_image = Image.new('RGB', (256, 256))
document_prompt, image = get_document_prompt(dummy_document_image)
outputs = llm.encode({"prompt": document_prompt, "multi_modal_data": {"image": [image]}})
outputs[0].outputs.embedding  # 1536-dimensional embedding
```
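Once you have query and page embeddings, ranking is just a similarity computation. A minimal sketch, assuming cosine similarity is the intended scoring function (as in DSE) and using placeholder arrays in place of the real encode outputs:

```python
import numpy as np

# Hypothetical ranking step. In practice, query_vec would be
# outputs[0].outputs.embedding for the query, and doc_vecs would stack the
# embeddings of your encoded pages.
query_vec = np.random.rand(1536).astype(np.float32)       # placeholder
doc_vecs = np.random.rand(1000, 1536).astype(np.float32)  # placeholder

query_vec /= np.linalg.norm(query_vec)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

scores = doc_vecs @ query_vec    # cosine similarity, higher is more relevant
top5 = np.argsort(-scores)[:5]   # best-matching pages first
```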
Conclusion
This is my first time training a model; it was challenging but incredibly fun. I don't think I could have ever done this without the amazing work of the HuggingFace team and contributors. I also want to thank Manuel Faysse, Tony Wu, and the entire ViDoRe team for their work on ColPali, and Xueguang Ma for all his work on the Tevatron codebase and for training a very strong base model. I was also inspired by Benjamin Clavié and his impressive model announcements.
I hope this model proves useful for your retrieval and RAG pipelines. As mentioned in the beginning, my benchmarks are far from perfect, and results in real-world scenarios may vary. I encourage you to test it on your own use cases. Overall, a significant advantage of visual retrieval is that you can scrap your complex indexing pipeline by simply embedding the page. This is the future!