Evals

Dishant Sethi

May 27

How to Evaluate LLM Outputs: Building Evals That Actually Catch Regressions

#evals #ai #llmops #agents

9 min read

David Aronchick

May 5

The Loop Is Only as Good as the Metric

#ai #evals #machinelearning #data

7 min read

aasawari sahasrabuddhe

Apr 23

Why Most AI Teams Are Flying Blind: And What to Do About It

#ai #evals #genai #womenintech

13 min read

Frank Brsrk

Apr 22

Wait, you guys run evals?

#ai #evals #llm

1 min read

Scarlett Attensil

May 14

If You Can Survive a Toddler, You Can Ship LLMs in Production

#ai #evals #llm

5 min read

Manouk Draisma for LangWatch

Mar 24

From zero evals to a working multimodal evaluation in 30 minutes using LangWatch Skills

#ai #agents #evals #claudecode

7 min read

Manouk Draisma

Mar 23

Your coding agent already knows how to test your AI agent (we just turned it into a Skill)

#agents #agentskills #evals #simulations

4 min read

Russell Jones

Mar 30

Build an eval harness for 184 AI agent prompts with promptfoo

#promptfoo #evals #aiagents #llm

8 min read

Raphael Porto

Mar 27

Self-improving Coding Agents

#agents #harness #ai #evals

5 min read

Scarlett Attensil for LaunchDarkly

Mar 26

Evaluate LLM code generation with LLM-as-judge evaluators

#ai #evals #llm #agents

12 min read

DEV Community