DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Most AI dev tools assume you have a repo. Ops engineers have a broken node and a 3am page.

4 min read
Agent Handoff Contracts: The Missing Piece in Production Agent Systems

3 min read
A provider latency spike stalled our whole build queue

4 min read
How I Built FRIDAY - An Autonomous Incident Investigation Agent That Reduced MTTR by 65%

8 min read
What Is Multi-Agent SRE? A Practical Introduction

3 min read
5-Minute Post-Deploy Postmortem with SignalPilot

3 min read
The Future of SRE: What the Next 5 Years Look Like

3 min read
Why Setting Up Observability Takes Forever (And What To Do About It)

4 min read
Great Stack to Doesn't Work #10 — Season Finale: "When PagerDuty Calls at 3 AM"

12 min read
Stop breaking production: a migration path to unified platforms 🛠️

1 min read
Building a Career in SRE: From Junior to Staff

2 min read
The Human-in-the-Loop SRE: Designing Automation Escalation Policies for AI-Assisted Operations

15 min read
CPU and DB were bored, yet every site timed out: a slow-read bot that starved Apache's workers

5 min read
I'm building a read-only context engine for Kubernetes and AI agents

6 min read
The Post-Mortem That Taught My System How to Fix Itself Using Hindsight

7 min read