DEV Community

# reliability

General discussions on building and maintaining reliable software systems.

Posts

The Hidden Cost of Downtime: How SRE Error Budgets Protect National Economic Infrastructure

11 min read
Eleven silent-failure modes across 36 agent platforms, and the structural feature they share

5 min read
Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

14 min read
How we survived 218 network transitions with zero data loss: ALEF's self-healing architecture

2 min read
Grafana 'No Data' after migration: 7 reconcilers we had to kill first

8 min read
The Silent Outage: Monitoring What You Can't See

2 min read
The silent sequential skip: a failure class every AI pipeline should name

5 min read
Energy Grid Observability: What the Power Sector Can Learn from Google SRE

12 min read
The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

14 min read
Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

13 min read
What SSL Error Means and How to Fix It

8 min read
How to Fix Slow DNS Lookup: A Complete Troubleshooting Guide

10 min read
Slack's Worst Day: When a Better Cache Manager Made Everything Worse

12 min read
Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

11 min read
Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

12 min read