DEV Community

# benchmark

Posts

I Prompted 5 Frontier LLMs to “Report Uncertainty” Here’s What Happened to Their Statistical Validity Scores

Comments
2 min read
Opus 4.7 First Look: I Tested the Day-Old Model Against 3 Other Claudes on 10 Real Tasks

Comments 1
5 min read
Writing an HTTP Load Tester That Doesn't Lie About p99

Comments
8 min read
I Tested OpenAI, Anthropic, and Cohere for Bulk Content Generation. Here's What the Data Actually Shows.

Comments
7 min read
Micro-benchmarking TypeScript Without Lying to Yourself

1
Comments
8 min read
I Benchmarked 8 Ollama Cloud AI Models. The 397B One Lost to a 1.6s Model.

Comments
3 min read
I benchmarked GPT-4o, Claude 3.5, and Gemini 1.5 for security — the results

Comments
2 min read
NexusQuant vs KVTC vs TurboQuant vs CommVQ — honest comparison

Comments
4 min read
🚀 8x Faster Than ONNX Runtime: Zero-Allocation AI Inference in Pure C#

Comments
3 min read
ARC-AGI V3 Explained: The New AI Benchmark That Breaks Every Agent

Comments
3 min read
GPT-5.1 scored 26%. Gemini 3 Flash scored 74%. Same prompt, same tools.

Comments
8 min read
AI Gateways Are Not I/O-Bound Proxies I Benchmarked 5 of Them to Prove It

2
Comments
9 min read
I Tried Speculative Decoding on RTX 4060 8GB — Every Config Was Slower Than Baseline

1
Comments
8 min read
FTS vs Hybrid Memory Search: A Real-World Benchmark

1
Comments
4 min read
How Mano-P Achieves #1 on OSWorld: Architecture, Benchmarks, and Edge Deployment

Comments
4 min read