DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Something every senior engineer learns the expensive way:

1
Comments
1 min read
A hard-earned rule from incident retrospectives:

1
Comments
1 min read
One insight that changed how I design systems:

Comments
1 min read
Zero-Downtime Argo CD Migrations: The Ultimate Guide to ApplicationSet Refactoring

Comments
3 min read
I built an AI tool for incident investigation (looking for honest feedback)

1
Comments
2 min read
Determinism Series: Siliconizing Decision-Making (Index)

1
Comments
4 min read
Aurora vs Traditional Incident Management Tools: An Honest Comparison

Comments
3 min read
Why Your Monitoring Is Failing in Microservices (And What Actually Works)

1
Comments
3 min read
Your AI Agent Is Not Failing. Your System Design Is.

12
Comments 8
1 min read
On-Call Management Kit

Comments
4 min read
Capacity Planning Toolkit

Comments
3 min read
SLI/SLO Framework

Comments
4 min read
Chaos Engineering Toolkit

Comments
4 min read
Platform Developer Portal

Comments
3 min read
Postmortem Framework

Comments
4 min read