DEV Community

# benchmarks

Posts

Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks

3 min read
The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.

3 min read
Proposal: A Real Benchmark for Long-Term AI Memory Systems

3 min read
The $500 GPU That Outperforms Claude Sonnet on Coding Benchmarks

4 min read
Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.

9 min read
LLM Evaluation: Metrics and Testing Strategies

6 min read
Why Small LLMs Fail at Tool Calling: The Shocking Discovery from Our Llama 3B Benchmark

11 min read
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally

5 min read
Windsurf's Arena Mode Lets You Blind-Test AI Models. I Tried It.

5 min read
Critical Flaws in Long-Term Memory Benchmarks: Addressing Unreliable and Uninterpretable Results

15 min read
We Gave LLMs 150 Tools: Here's What Broke.

9 min read
GenHTTP vs ASP.NET Minimal APIs: The C# Benchmark Showdown Nobody Expected

6 min read
The Myth of the 'Fast WordPress Site': Real 2026 Benchmarks That Nobody Shows

3 min read
How They Broke the Top AI Agent Benchmarks, and What That Says About the Stack I'm Using

8 min read
Drogon: The C++ Framework That Tops HTTP/2 Benchmarks (And Where It Struggles)

6 min read