Multimodal AI has taken significant leaps in recent years, and Mistral AI's Pixtral Large is no exception. This new Vision-Language Model (VLM) aims to redefine benchmarks in multimodal understanding and reasoning. In this post, I'll dive into Pixtral Large's capabilities, compare its performance against its predecessor Pixtral 12B and against GPT-4V, and share my benchmarking experiments to help you make an informed decision when choosing your next VLM.
What is Pixtral Large?
Pixtral Large is Mistral AI’s latest multimodal innovation. Building on the foundation of Pixtral 12B, it introduces enhanced reasoning and comprehension capabilities. Whether tackling complex math problems on datasets like MathVista, document comprehension from DocVQA, or visual-question answering with VQAv2, Pixtral Large consistently sets itself apart with superior performance.
At its core, Pixtral Large pairs a 123-billion-parameter multimodal decoder with a 1-billion-parameter vision encoder, making it a true powerhouse. It supports up to 30 high-resolution images within a 128K context window, allowing it to handle complex, large-scale reasoning tasks effortlessly. Because the decoder is built on Mistral Large 2, the model retains top-tier text understanding alongside its multimodal strengths.
Technical Specifications
Although the exact architecture of Pixtral Large remains undisclosed, it likely follows Pixtral 12B's design, in which a vision encoder projects image patches into the same embedding space as the text tokens so that a single multimodal transformer decoder can attend over both. This setup enables multi-image inference and high-quality cross-modal reasoning, excelling at tasks that require deep integration of visual and textual data.
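To make that idea concrete, here is a toy sketch (in no way Mistral's actual implementation; every module name and dimension below is illustrative) of how patch embeddings can be projected into a decoder's token-embedding space and processed as one sequence:

```python
import torch
import torch.nn as nn

class ToyMultimodalDecoder(nn.Module):
    """Illustrative only: image patches are projected into the same
    embedding space as text tokens, then decoded as one sequence.
    A real VLM decoder would use causal masking; omitted for brevity."""
    def __init__(self, vocab_size=32000, d_model=512, d_vision=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(d_vision, d_model)  # aligns the two modalities
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, patch_embeds):
        text = self.text_embed(text_ids)             # (B, T, d_model)
        vision = self.vision_proj(patch_embeds)      # (B, P, d_model)
        sequence = torch.cat([vision, text], dim=1)  # one shared sequence
        hidden = self.backbone(sequence)
        return self.lm_head(hidden[:, vision.size(1):])  # logits over text positions
```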
Here are some standout specs of Pixtral Large:
- Parameters: 123 billion (multimodal decoder) + 1 billion (vision encoder)
- Context Window: 128K tokens
- Image Support: Up to 30 high-resolution images
- Applications: Math reasoning, document comprehension, chart understanding, and more
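If you want to poke at the model yourself, here is a minimal invocation sketch using Mistral's official Python SDK. I'm assuming the mistralai v1 client and the pixtral-large-latest model id; check the current docs for exact names:

```python
import os
from mistralai import Mistral  # pip install mistralai

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# One text prompt plus an image URL; the API accepts mixed content parts.
response = client.chat.complete(
    model="pixtral-large-latest",  # model id assumed; verify in the docs
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the chart in this image."},
            {"type": "image_url", "image_url": "https://example.com/chart.png"},
        ],
    }],
)
print(response.choices[0].message.content)
```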
Pixtral Large vs. Pixtral 12B
The shift from Pixtral 12B to Pixtral Large represents a nuanced tradeoff:
- Pixtral 12B: Balanced capabilities across tasks, excelling in label-based and rationale-based evaluations.
- Pixtral Large: Falls behind in label-based tasks but shines in rationale-based performance, indicating superior reasoning and explanation capabilities.
This evolution demonstrates Pixtral Large’s focus on tasks requiring deeper comprehension and reasoning, making it a strong contender for specialized use cases.
Benchmarking Results
Datasets Used
To test Pixtral Large, I benchmarked it against its predecessor and GPT-4V using two datasets:
- ArxivQA: Research paper-based QA tasks with GPT-4V inferences for comparison.
- Flickr30k: A classic image captioning dataset enhanced with GPT-4o-generated captions.
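For reproducibility, here is roughly how such samples can be pulled with the Hugging Face datasets library. The dataset ids below are community mirrors and an assumption on my part; verify them before relying on this snippet:

```python
from datasets import load_dataset

# Dataset ids are assumed community mirrors; double-check before use.
arxivqa = load_dataset("MMInstruction/ArxivQA", split="train")
flickr = load_dataset("nlphuji/flickr30k", split="test")

# Take 1,000 random samples for a manageable benchmark run.
arxivqa_sample = arxivqa.shuffle(seed=42).select(range(1000))
print(arxivqa_sample[0].keys())
```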
Evaluation Metrics
I used cosine similarity to measure semantic alignment between generated outputs and reference data. Metrics included win rate, average similarity, and top-1, top-5, and top-10 scores.
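Concretely, the scoring loop looked roughly like the sketch below. The embedding model and the exact win-rate convention (model A "wins" a sample when its output sits closer to the reference) are my own choices for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# The embedding model is my choice here, not part of the original setup.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_score(candidate: str, reference: str) -> float:
    """Semantic similarity between a generated output and its reference."""
    emb = embedder.encode([candidate, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def win_rate(outputs_a, outputs_b, references) -> float:
    """Fraction of samples where model A lands closer to the reference than model B."""
    wins = sum(
        cosine_score(a, ref) > cosine_score(b, ref)
        for a, b, ref in zip(outputs_a, outputs_b, references)
    )
    return wins / len(references)
```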
ArxivQA Results
From 1,000 randomly selected images, Pixtral Large demonstrated a stronger ability to reason through scientific and mathematical content. While it struggled with label-based evaluations compared to Pixtral 12B, it outperformed in rationale-based tasks. This indicates a shift toward deeper reasoning capabilities, ideal for complex QA scenarios.
Flickr30k Results
For the Flickr30k Captioning Benchmark, Pixtral Large produced slight improvements over Pixtral 12B when evaluated against human-generated captions. However, neither model achieved a strong win rate against the human references on this task.
Interestingly, when compared to GPT-4V captions, Pixtral Large performed well, though it fell slightly behind Pixtral 12B in top-ranked matches. These results highlight Pixtral Large’s potential but also suggest areas for improvement in precision and caption generation.
Using Pixtral Large on Tune Studio
Due to the model's size and resource requirements, I used Tune Studio for benchmarking. With its user-friendly interface and efficient inference scripts, I was able to process 500 images per hour, completing the job for under $20. This makes Tune Studio a valuable tool for researchers and developers working on large-scale AI projects.
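The driver script itself can stay very simple: API calls are I/O-bound, so a small thread pool is enough to sustain that throughput. The query_image helper below is a hypothetical placeholder to wire up to whichever endpoint you use:

```python
import concurrent.futures

def query_image(image_url: str) -> str:
    # Hypothetical placeholder: swap in the actual call to your hosted
    # Pixtral Large endpoint (e.g. the mistralai client shown earlier,
    # or Tune Studio's own API).
    raise NotImplementedError

def run_batch(image_urls, max_workers=8):
    """Fan requests out over a thread pool; a handful of parallel workers
    is enough to keep a ~500 images/hour pace."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_image, image_urls))
```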
Conclusion
Pixtral Large represents a significant step forward in multimodal AI, offering enhanced reasoning and cross-modal comprehension. While it may not surpass Pixtral 12B in every aspect, its focus on rationale-based tasks makes it a compelling choice for applications requiring deeper understanding.
For developers, researchers, and enterprises looking for cutting-edge VLMs, Pixtral Large offers a mix of power and precision that’s hard to beat.
What do you think about Pixtral Large? Is it the next big thing in VLMs, or do you see potential in other models like GPT-4V? Let me know your thoughts in the comments below! 🚀