
Yeshwanth Reddy

📊 Exploring Vision Language Models (VLMs) for Structured Data Extraction

Over the past few weeks, I've been studying the effectiveness of Vision Language Models (VLMs) for structured data extraction from documents. While many benchmarks exist, none focus exclusively on structured data extraction, which led me to develop a framework that does just that.

πŸ” Models Tested:

  • Open Source: Qwen2, MiniCPM, Bunny
  • Closed Source: GPT-4o mini, Gemini 1.5 Flash, Claude 3.5

These models were chosen because they either offer the most affordable APIs in their respective families or can run on a consumer GPU (less than 24 GB of VRAM).
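
To make the setup concrete, here is a minimal sketch of prompting one of the closed-source models (GPT-4o mini via the OpenAI Python SDK) for receipt fields. The prompt and field names are illustrative rather than the exact ones used in the study, and it assumes an `OPENAI_API_KEY` environment variable and a local `receipt.jpg`:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_receipt_fields(image_path: str) -> str:
    """Ask GPT-4o mini to return SROIE-style receipt fields as JSON."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the company, date, address, and total from this receipt. "
                         "Reply with a single JSON object using exactly those keys."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# print(extract_receipt_fields("receipt.jpg"))
```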

📚 Datasets:

  • SROIE and CORD receipt datasets – standardized, complex, and ideal for benchmarking VLMs on structured data extraction.
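
For scoring, the idea is to compare each model's JSON output against the dataset's ground-truth annotations field by field. The framework in the repo may use a different matching scheme; the sketch below only shows the simplest exact-match version, with made-up values:

```python
def field_accuracy(predictions: list[dict], ground_truths: list[dict]) -> float:
    """Fraction of annotated fields whose predicted value matches exactly
    (case-insensitive, whitespace-trimmed)."""
    correct = total = 0
    for pred, truth in zip(predictions, ground_truths):
        for key, expected in truth.items():
            total += 1
            predicted = str(pred.get(key, "")).strip().lower()
            if predicted == str(expected).strip().lower():
                correct += 1
    return correct / total if total else 0.0


# Hypothetical SROIE-style example: 3 of the 4 fields match.
pred = [{"company": "ACME LTD", "date": "2021-03-01", "address": "1 Main St", "total": "12.00"}]
gold = [{"company": "ACME LTD", "date": "2021-03-01", "address": "1 Main St", "total": "12.50"}]
print(field_accuracy(pred, gold))  # 0.75
```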

💡 Key Takeaways:

  • Qwen2 is the top-performing open-source model, while Claude 3.5 leads among closed-source models, though both are also the most expensive in their categories.
  • Both model types perform similarly on the simpler SROIE dataset, but closed-source models clearly outperform on the more complex CORD dataset.
  • Higher accuracy at a higher price is usually the better deal for customers: the cost of handling errors from a cheaper, less accurate model can exceed the API savings.
  • Open-source VLMs have room for fine-tuning, potentially closing the gap with closed-source models.
  • Discovered a free Google API that offers a decent number of calls per day, an exciting find!
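
On that last point: one way to try the free tier is with an API key from Google AI Studio and the `google-generativeai` package. This is just an illustrative sketch (the model name, prompt, and file name are assumptions, and the free tier's rate limits apply):

```python
import PIL.Image
import google.generativeai as genai

# API key created in Google AI Studio; the free tier is rate-limited.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")
receipt = PIL.Image.open("receipt.jpg")  # hypothetical local receipt image

response = model.generate_content([
    "Extract the company, date, address, and total from this receipt "
    "and reply with a single JSON object using exactly those keys.",
    receipt,
])
print(response.text)
```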

🔗 Check out the full study here: https://nanonets.com/blog/vision-language-model-vlm-for-data-extraction/

🔗 Source code available here: github.com/nanonets/hands-on-vision-language-models/

I'm currently on the lookout for new benchmarks that focus on structured data extraction from documents. If you know of any relevant datasets, I'd love to benchmark the models on them and update the study in my free time.
Feedback is much appreciated!
