Dan Shalev for FalkorDB


NoLiMA: GPT-4o achieves 99.3% accuracy in short contexts (<1K tokens), but performance degrades to 69.7% at 32K tokens

Recent advancements in large language models (LLMs) have pushed context window limits to 128K–1M tokens, yet benchmarks like NoLiMA: Long-Context Evaluation Beyond Literal Matching reveal critical gaps in associative reasoning over extended sequences.

NoLiMA demonstrates that while models like GPT-4o achieve 99.3% accuracy in short contexts (<1K tokens), performance degrades to 69.7% at 32K tokens. The benchmark’s two-hop associative tasks (e.g., linking “Saxony” to “Semper Opera House” to “Yuki”) reveal that models fail to preserve transitive relationships across 16K+ token windows.

The NoLiMA benchmark highlights a fundamental truth: scaling context windows alone cannot overcome attention mechanisms' inability to model latent relationships. Property graphs provide the missing structural layer, offering explicit relationship encoding and metadata-aware retrieval.
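To make "explicit relationship encoding" concrete, here is a minimal sketch of a property graph over the article's Saxony / Semper Opera House / Yuki example. The node labels, edge types, and plain-dict storage are invented for illustration; a real system would use a graph database such as FalkorDB queried via Cypher rather than Python dicts.

```python
# Illustrative property graph: nodes carry metadata, edges carry
# explicit, typed relationships. All names here are hypothetical.
nodes = {
    "Yuki": {"label": "Character"},
    "Semper Opera House": {"label": "Landmark"},
    "Saxony": {"label": "Region"},
}
edges = [
    ("Yuki", "LIVES_NEXT_TO", "Semper Opera House"),
    ("Semper Opera House", "LOCATED_IN", "Saxony"),
]

def neighbors(node):
    """Yield nodes reachable from `node` via one explicit edge."""
    for src, _rel, dst in edges:
        if src == node:
            yield dst

def two_hop(start):
    """Resolve a transitive (two-hop) relationship by traversal,
    instead of hoping attention recovers it from raw text."""
    return {second for first in neighbors(start) for second in neighbors(first)}
```

Because the Yuki → Semper Opera House → Saxony chain is stored as two explicit edges, `two_hop("Yuki")` returns `Saxony` by traversal regardless of how far apart those facts would sit in a 32K-token context.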

For AI architects, integrating graph-native storage with LLMs isn’t optional—it’s imperative for building systems capable of robust, multi-hop reasoning at scale.
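One way such an integration can look, sketched under stated assumptions: retrieve an entity's multi-hop neighborhood from the graph, then hand the model a compact list of facts instead of the full long context. The edge data, `multi_hop_facts` helper, and prompt template below are all hypothetical; a production pipeline would query a graph store and call an LLM API at the end.

```python
# Hypothetical graph-native retrieval feeding an LLM prompt.
# Edges reuse the article's two-hop example; everything else is invented.
edges = [
    ("Yuki", "LIVES_NEXT_TO", "Semper Opera House"),
    ("Semper Opera House", "LOCATED_IN", "Saxony"),
]

def multi_hop_facts(entity, depth=2):
    """Collect edges reachable from `entity` within `depth` hops,
    rendered as short natural-language facts."""
    frontier, facts = {entity}, []
    for _ in range(depth):
        nxt = set()
        for src, rel, dst in edges:
            if src in frontier:
                facts.append(f"{src} {rel.replace('_', ' ').lower()} {dst}")
                nxt.add(dst)
        frontier = nxt
    return facts

def build_prompt(question, entity):
    """Prepend explicitly retrieved relationships so the model reasons
    over a few structured facts, not a 32K-token haystack."""
    context = "\n".join(multi_hop_facts(entity))
    return f"Facts:\n{context}\n\nQuestion: {question}"
```

The design point is that the transitive chain is resolved by graph traversal before the model ever sees the prompt, so the LLM only needs single-step reading comprehension.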


Top comments (1)

Dan Shalev

Models like GPT-4o may still have decent base scores, but their effective context length remains limited when dealing with associative reasoning without literal cues.
