DEV Community
#benchmarks
Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks
James AI
Apr 15
#ai #llm #claude #benchmarks
3 min read
The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.
Penfield
Apr 11
#ai #aimemory #benchmarks #yc
3 min read
Proposal: A Real Benchmark for Long-Term AI Memory Systems
Penfield
Apr 10
#ai #aimemory #benchmarks
3 min read
The $500 GPU That Outperforms Claude Sonnet on Coding Benchmarks
Pooya Golchian
Apr 7
#ai #llm #benchmarks #nvidia
4 min read
Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.
Penfield
Apr 7
#ai #aimemory #benchmarks
9 min read
LLM Evaluation: Metrics and Testing Strategies
Matt Frank
Apr 6
#llmevaluation #aitesting #benchmarks
6 min read
Why Small LLMs Fail at Tool Calling: The Shocking Discovery from Our Llama 3B Benchmark
Anak Wannaphaschaiyong
Apr 3
#ai #llm #agents #benchmarks
11 min read
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally
Penfield
Apr 4
#ai #mcp #machinelearning #benchmarks
5 min read
Windsurf's Arena Mode Lets You Blind-Test AI Models. I Tried It.
Alan West
Mar 29
#windsurf #aimodels #devtools #benchmarks
1 reaction
5 min read
Critical Flaws in Long-Term Memory Benchmarks: Addressing Unreliable and Uninterpretable Results
Valeria Solovyova
Mar 27
#ai #benchmarks #memory #reliability
15 min read
We Gave LLMs 150 Tools: Here's What Broke.
Craig Tracey
Mar 26
#discuss #mcp #benchmarks #ai
1 reaction
1 comment
9 min read
GenHTTP vs ASP.NET Minimal APIs: The C# Benchmark Showdown Nobody Expected
Benny
Mar 27
#csharp #dotnet #performance #benchmarks
5 reactions
6 min read
The Myth of the 'Fast WordPress Site': Real 2026 Benchmarks That Nobody Shows
Gabriel Lima Ferreira
Mar 23
#wordpress #nextjs #performance #benchmarks
3 min read
How They Broke the Top AI Agent Benchmarks, and What That Says About the Stack I'm Using
Juan Torchia
Apr 12
#typescript #llm #aiagents #benchmarks
8 min read
Drogon: The C++ Framework That Tops HTTP/2 Benchmarks (And Where It Struggles)
Benny
Mar 17
#webdev #performance #benchmarks #cpp
6 min read