Dan Shalev for FalkorDB


NoLiMA: GPT-4o achieves 99.3% accuracy in short contexts (<1K tokens), but performance degrades to 69.7% at 32K tokens

Recent advancements in large language models (LLMs) have pushed context window limits to 128K–1M tokens, yet benchmarks like NoLiMA: Long-Context Evaluation Beyond Literal Matching reveal critical gaps in associative reasoning over extended sequences.

NoLiMA demonstrates that while models like GPT-4o achieve 99.3% accuracy in short contexts (<1K tokens), performance degrades to 69.7% at 32K tokens. The benchmark’s two-hop associative tasks (e.g., linking “Saxony” to “Semper Opera House” to “Yuki”) reveal that models fail to preserve transitive relationships across 16K+ token windows.

The NoLiMA benchmark highlights a fundamental truth: scaling context windows alone cannot overcome attention mechanisms' inability to model latent relationships. Property graphs provide the missing structural layer, offering explicit relationship encoding and metadata-aware retrieval.
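To make "explicit relationship encoding" concrete, here is a minimal sketch of a property graph over the article's Saxony / Semper Opera House / Yuki example. The node labels, edge types, and plain-dict storage are invented for illustration; a real system would use a graph database such as FalkorDB queried via Cypher rather than Python dicts.

```python
# Illustrative property graph: nodes carry metadata, edges carry
# explicit, typed relationships. All names here are hypothetical.
nodes = {
    "Yuki": {"label": "Character"},
    "Semper Opera House": {"label": "Landmark"},
    "Saxony": {"label": "Region"},
}
edges = [
    ("Yuki", "LIVES_NEXT_TO", "Semper Opera House"),
    ("Semper Opera House", "LOCATED_IN", "Saxony"),
]

def neighbors(node):
    """Yield nodes reachable from `node` via one explicit edge."""
    for src, _rel, dst in edges:
        if src == node:
            yield dst

def two_hop(start):
    """Resolve a transitive (two-hop) relationship by traversal,
    instead of hoping attention recovers it from raw text."""
    return {second for first in neighbors(start) for second in neighbors(first)}
```

Because the Yuki → Semper Opera House → Saxony chain is stored as two explicit edges, `two_hop("Yuki")` returns `Saxony` by traversal regardless of how far apart those facts would sit in a 32K-token context.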

For AI architects, integrating graph-native storage with LLMs isn’t optional—it’s imperative for building systems capable of robust, multi-hop reasoning at scale.
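One way such an integration can look, sketched under stated assumptions: retrieve an entity's multi-hop neighborhood from the graph, then hand the model a compact list of facts instead of the full long context. The edge data, `multi_hop_facts` helper, and prompt template below are all hypothetical; a production pipeline would query a graph store and call an LLM API at the end.

```python
# Hypothetical graph-native retrieval feeding an LLM prompt.
# Edges reuse the article's two-hop example; everything else is invented.
edges = [
    ("Yuki", "LIVES_NEXT_TO", "Semper Opera House"),
    ("Semper Opera House", "LOCATED_IN", "Saxony"),
]

def multi_hop_facts(entity, depth=2):
    """Collect edges reachable from `entity` within `depth` hops,
    rendered as short natural-language facts."""
    frontier, facts = {entity}, []
    for _ in range(depth):
        nxt = set()
        for src, rel, dst in edges:
            if src in frontier:
                facts.append(f"{src} {rel.replace('_', ' ').lower()} {dst}")
                nxt.add(dst)
        frontier = nxt
    return facts

def build_prompt(question, entity):
    """Prepend explicitly retrieved relationships so the model reasons
    over a few structured facts, not a 32K-token haystack."""
    context = "\n".join(multi_hop_facts(entity))
    return f"Facts:\n{context}\n\nQuestion: {question}"
```

The design point is that the transitive chain is resolved by graph traversal before the model ever sees the prompt, so the LLM only needs single-step reading comprehension.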


Top comments (1)

Dan Shalev

Models like GPT-4o may still have decent base scores, but their effective context length remains limited when dealing with associative reasoning without literal cues.
