DEV Community

# inference

Posts

The KV cache, why LLM inference is memory-bound, not compute-bound

4 min read
Etched hits $5B and $1B in orders: why inference chips matter

4 min read
Two labs race to make AI write whole paragraphs at once instead of word by word

3 min read
KV Cache Is Eating Your VRAM — Here's How to Estimate It Before You Run Out

6 min read
I Benchmarked Speculative Decoding — a = 3.5 Wasn't Enough

7 min read
96% of cuBLAS, no `unsafe`: what cuTile Rust proves

8 min read
Extract Structured JSON from Messy Text with Telnyx AI Inference

2 min read
Running LLMs on an iGPU: VRAM Limits of Intel Arc and Radeon 780M

3 min read
Lossless, But Not Free: When Speculative Decoding Actually Pays Off (and When It Doesn't)

6 min read
How to Build a Secure Homelab for LLM Inference

4 min read
Google's DiffusionGemma Generates Text Sideways

3 min read
Sipp: a local-first runtime for Hybrid AI Applications

11 min read
Speculative decoding: when and why it actually speeds up inference

9 min read
Can You Tell When an LLM API Swaps in a Cheaper Model?

3 min read
ReFlect: Training-Free Error Recovery for Long-Horizon LLM Reasoning

4 min read