DEV Community

# benchmarks

Posts

I benchmarked 10 LLMs on slopsquatting — up to 87% installed fake packages

9 min read
DeepSeek V4 Released: Open-Source 1.6T MoE, 1M Context, Apache 2.0 — and It's Already on the API

6 min read
GPT-5.5 Released: First Fully Retrained Base Model Since GPT-4.5, 1M Context, $5/$30 Pricing

6 min read
GPT-5.5 Is Out — What the Numbers Actually Say

4 min read
How to Choose the Right AI Model for the Right Job

13 min read
How I took LongMemEval oracle from 62% to 82.8% without touching the retriever

3 min read
What Is Agent Evaluation? How EClaw Arena Benchmarks AI Agents Across 12 Dimensions

3 min read
Sonnet 4.6 vs Haiku 4.5 vs Opus 4.6: I Tested 3 Claude Models on 10 Real Tasks

3 min read
The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.

3 min read
Proposal: A Real Benchmark for Long-Term AI Memory Systems

3 min read
When Generic Benchmarks Fail: Building a Sales-Domain Evaluation Bench from Scratch

7 min read
I accidentally made the fastest event system in the world

11 min read
The $500 GPU That Outperforms Claude Sonnet on Coding Benchmarks

4 min read
Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.

9 min read
LLM Evaluation: Metrics and Testing Strategies

6 min read