
Yeshwanth Reddy

📊 Exploring Vision Language Models (VLMs) for Structured Data Extraction

Over the past few weeks, I've been studying the effectiveness of Vision Language Models (VLMs) for structured data extraction from documents. While many benchmarks exist, none focus exclusively on structured data extraction, which led me to develop a framework that does just that.

πŸ” Models Tested:

  • Open Source: Qwen2, MiniCPM, Bunny
  • Closed Source: GPT-4o mini, Gemini 1.5 Flash, Claude 3.5

These models were chosen because they either offer the most affordable APIs in their respective families or can run on a consumer GPU (less than 24 GB of VRAM).
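
To make the setup concrete, here is a minimal sketch of prompting one of the closed-source models (GPT-4o mini via the OpenAI Python SDK) for receipt fields. The prompt and field names are illustrative rather than the exact ones used in the study, and it assumes an `OPENAI_API_KEY` environment variable and a local `receipt.jpg`:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_receipt_fields(image_path: str) -> str:
    """Ask GPT-4o mini to return SROIE-style receipt fields as JSON."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the company, date, address, and total from this receipt. "
                         "Reply with a single JSON object using exactly those keys."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# print(extract_receipt_fields("receipt.jpg"))
```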

📚 Datasets:

  • SROIE and CORD receipt datasets – standardized, complex, and ideal for benchmarking VLMs on structured data extraction.
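
For scoring, the idea is to compare each model's JSON output against the dataset's ground-truth annotations field by field. The framework in the repo may use a different matching scheme; the sketch below only shows the simplest exact-match version, with made-up values:

```python
def field_accuracy(predictions: list[dict], ground_truths: list[dict]) -> float:
    """Fraction of annotated fields whose predicted value matches exactly
    (case-insensitive, whitespace-trimmed)."""
    correct = total = 0
    for pred, truth in zip(predictions, ground_truths):
        for key, expected in truth.items():
            total += 1
            predicted = str(pred.get(key, "")).strip().lower()
            if predicted == str(expected).strip().lower():
                correct += 1
    return correct / total if total else 0.0


# Hypothetical SROIE-style example: 3 of the 4 fields match.
pred = [{"company": "ACME LTD", "date": "2021-03-01", "address": "1 Main St", "total": "12.00"}]
gold = [{"company": "ACME LTD", "date": "2021-03-01", "address": "1 Main St", "total": "12.50"}]
print(field_accuracy(pred, gold))  # 0.75
```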

💡 Key Takeaways:

  • Qwen2 is the top-performing open-source model, while Claude 3.5 leads among closed-source models, though both are also the most expensive in their categories.
  • Both model types perform similarly on the simpler SROIE dataset, but closed-source models clearly outperform on the more complex CORD dataset.
  • Higher accuracy at a higher price is usually the better deal for customers: the cost of handling errors from a cheaper, less accurate model can exceed the API savings.
  • Open-source VLMs have room for fine-tuning, potentially closing the gap with closed-source models.
  • Discovered a free Google API that offers a decent number of calls per day, an exciting find!
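
On that last point: one way to try the free tier is with an API key from Google AI Studio and the `google-generativeai` package. This is just an illustrative sketch (the model name, prompt, and file name are assumptions, and the free tier's rate limits apply):

```python
import PIL.Image
import google.generativeai as genai

# API key created in Google AI Studio; the free tier is rate-limited.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")
receipt = PIL.Image.open("receipt.jpg")  # hypothetical local receipt image

response = model.generate_content([
    "Extract the company, date, address, and total from this receipt "
    "and reply with a single JSON object using exactly those keys.",
    receipt,
])
print(response.text)
```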

🔗 Check out the full study here: https://nanonets.com/blog/vision-language-model-vlm-for-data-extraction/

🔗 Source code available here: github.com/nanonets/hands-on-vision-language-models/

I'm currently on the lookout for new benchmarks that focus on structured data extraction from documents. If you know of any relevant datasets, I'd love to benchmark the models on them and update the study in my free time.
Feedback is much appreciated!
