DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

The 10 Commandments of Working in Production

7 min read
The Prometheus label that blew our monitoring bill out 6x

4 min read
API Rate Limiting: Patterns That Scale

2 min read
How to Optimize MongoDB on Bare Metal Servers: SRE Playbook

5 min read
Kiln Crisis Management: Controlling Irregular Raw Meal in CCR Using Python

3 min read
Rename a Kubernetes PVC Without Losing Your Data: PersistentVolume Rebinding

4 min read
Performance Tuning: The Day the Server Got “Tired” and Started Acting Funny

3 min read
AI Agents Mapped My Legacy Production Environment in One Hour.

4 min read
Remetric: find waste in self-hosted Prometheus, Grafana, and Loki

6 min read
The Degradation Ladder: How Systems Fail Before They Fail

5 min read
Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

6 min read
How We Killed Our Worst Alert (And What We Learned)

2 min read
The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure

11 min read
Why Backup Success Does Not Mean Database Recoverability

2 min read
Game day on our build cluster: killing an AZ to test LLM flake detection

4 min read