Benchmarks - DEV Community

Pneumetron

Jul 17

SynthDocBench: A New Benchmark for Long-Context Visual Document Understanding Reveals VLM Weaknesses

#visionlanguagemodels #vlm #documentunderstanding #benchmarks

4 min read

Pneumetron

Jul 13

UniClawBench: A New Benchmark for Proactive AI Agents in Real-World Scenarios

#aiagents #benchmarks #llms #mllms

3 min read

Jul 11

AI News Roundup: Grok 4.5 Hits Tesla, Perplexity's Orchestrator Beats Opus, and Meta Undercuts Pricing

#news #industry #benchmarks #launches

2 min read

Muhammed Rasin O M

Jul 10

Half the answer keys in text-to-SQL benchmarks are wrong. So I generated the database from the answer key.

#evaluation #dataagents #benchmarks #syntheticdata

7 min read

Peremptory

Jul 10

Grok 4.5 Was Trained on Your Coding Sessions Before xAI Owned Them

#modelrelease #xai #developertools #benchmarks

3 min read

Levash0v

Jul 7

Agent Leaderboards Measure Score. We Added Price.

#ai #agents #llm #benchmarks

5 min read

Peremptory

Jul 7

Xiaomi Launched a Frontier Model Anonymously. Developers Loved It.

#chineseai #modelrelease #developertools #benchmarks

3 min read

Breach Protocol

Jul 2

Turn the camera away, and the AI's world freezes

#worldmodels #videogeneration #robotics #benchmarks

3 min read

Breach Protocol

Jul 1

Reliable, and still wrong

#evaluation #llmasjudge #benchmarks

3 min read

Breach Protocol

Jul 1

Put AI agents in charge of a Civilization game and they reach for the nukes

#agents #alignment #safety #benchmarks

3 min read

Peremptory

Jun 12

Claude Fable 5 Scores 95% on SWE-bench, Then Hands Off to Opus 4.8

#anthropic #claude #benchmarks #safety

3 min read

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

10 min read

Andrew Kew

Jul 10

OpenAI just found ~30% of SWE-Bench Pro is broken — and retracted their own recommendation

#ai #openai #benchmarks #llm

2 min read

Rob

Jun 2

An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.

#performance #benchmarks #machinelearning #gpu

5 min read

Milliseconds.dev

Jun 2

NLTK vs Compiled Regex: Tokenizing 100 MB of Text in .NET

#dotnet #csharp #performance #benchmarks

3 min read
