Last week, on December 11th, 2024, OpenAI faced an SRE nightmare: a major platform outage lasting four hours that affected ChatGPT and Sora (OpenAI's video generation model). A faulty service deployment brought their largest Kubernetes clusters to their knees, and on-call engineers were locked out of the clusters, unable even to run kubectl.
A rollout gone bad
The root cause was a bad rollout strategy for a new telemetry service deployment that collected Kubernetes control plane metrics. The telemetry service overwhelmed the API servers by sending a high volume of resource-intensive API calls, the cost of which scaled with the size of the cluster.
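OpenAI's postmortem doesn't spell out exactly which API calls the telemetry service made, but the classic example of a call whose cost grows with cluster size is an unpaginated LIST across every namespace. The Go sketch below is my own illustration using client-go, not OpenAI's code; the namespace scope and page size are assumptions. It contrasts a full LIST with a paginated version that bounds the size of each response.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the collector runs in-cluster; the exact calls OpenAI's telemetry
	// service made are not public, so this only illustrates the general pattern
	// of a LIST whose cost grows with the number of objects in the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Expensive: one unpaginated LIST of every pod in the cluster. The API
	// server has to load and serialize the full result set in one response,
	// so its CPU and memory cost scale with cluster size.
	all, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("pods (unpaginated):", len(all.Items))

	// Cheaper: page through the same data 500 objects at a time using
	// Limit/Continue, which bounds the size of each individual response.
	total := 0
	opts := metav1.ListOptions{Limit: 500}
	for {
		page, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
	fmt.Println("pods (paginated):", total)
}
```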
Service discovery breaks down
The worst part was that the issue was not caught until the rollout had gone fleet-wide and reached their largest clusters running mission-critical workloads. DNS caching masked the impact at first by answering queries with stale cached records, but that only delayed the failure and made it worse, because the rollout kept spreading while everything still looked healthy.
Once the DNS caches expired over the following 20 minutes, a sudden surge of real-time DNS queries hit the cluster DNS servers (likely CoreDNS) running on their control plane, which was already under stress from the telemetry service's resource-intensive API operations. As a result, DNS-based service discovery for the cluster became unresponsive, and application pods could no longer perform real-time DNS resolution.
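To make concrete what "DNS-based service discovery became unresponsive" means for an application pod: in-cluster clients typically resolve Service names through the cluster DNS. The sketch below is a hypothetical example, not OpenAI's code; the service name and timeout are made up. Once the local cache has expired and the cluster DNS can't answer, a lookup like this simply times out, even though the pods behind the Service may be perfectly healthy.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical in-cluster Service name; real names depend on the workload.
	const svc = "payments.default.svc.cluster.local"

	// Inside a pod this resolves through the cluster DNS (typically CoreDNS).
	// When the cluster DNS is overloaded and local caches have expired, the
	// lookup blocks until the timeout and then fails, even though the pods
	// behind the Service may be perfectly healthy.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	ips, err := net.DefaultResolver.LookupIPAddr(ctx, svc)
	if err != nil {
		fmt.Println("service discovery failed:", err)
		return
	}
	fmt.Println("resolved", svc, "->", ips)
}
```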
No way to get into the cluster
On-call engineers were not able to roll back the telemetry service because they were locked out of the Kubernetes control plane, which was unreachable under the extreme load. I've experienced this exact issue before, and anyone who's faced this situation knows just how challenging it can be to recover an unresponsive API server.
Ultimately, they recovered the API servers and brought the clusters back up by reducing the load on them in several ways, such as blocking network access to the Kubernetes admin APIs and scaling up the Kubernetes API servers. Once they regained access, they rolled back the faulty telemetry service deployment.
Lesson re-learned
In response to this major outage, OpenAI has laid out the following action items to prevent such large-scale outages from happening again.
Firstly, phased rollouts will be improved going forward by continuously monitoring the health of both the workloads and the Kubernetes control plane while a change rolls out (a minimal sketch of this idea follows the action items).
Secondly, fault injection testing will be conducted to ensure that the Kubernetes data plane running production workloads can keep functioning without the control plane for longer periods of time.
Thirdly, the dependency on Kubernetes DNS for service discovery will be removed, and the Kubernetes data plane will be decoupled from the control plane so that the control plane no longer plays any major role in serving production workloads.
Finally, break-glass mechanisms will be implemented so that on-call engineers can access the Kubernetes API servers under any circumstances.
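OpenAI hasn't published how the improved phased rollout will be implemented, so the following is only a minimal sketch of the idea behind the first action item: deploy to a small slice of the fleet, keep checking control-plane health (here just the API server's /readyz endpoint) during a bake period, and halt the rollout the moment a cluster looks unhealthy. The phase names, kubeconfig paths, timings, and the deployTo helper are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// deployTo is a hypothetical stand-in for whatever actually ships the new
// service to a cluster (Argo, Helm, an internal deploy tool, ...).
func deployTo(cluster string) {
	fmt.Println("deploying telemetry service to", cluster)
}

// controlPlaneHealthy asks the target cluster's API server, via its /readyz
// endpoint, whether it considers itself ready. A production gate would also
// watch workload health, API latency, and error rates, as the action item says.
func controlPlaneHealthy(ctx context.Context, kubeconfig string) bool {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return false
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false
	}
	body, err := client.Discovery().RESTClient().Get().AbsPath("/readyz").DoRaw(ctx)
	return err == nil && string(body) == "ok"
}

func main() {
	// Hypothetical rollout phases, smallest blast radius first; each name maps
	// to a kubeconfig file for that cluster.
	phases := [][]string{
		{"staging"},
		{"small-prod-1", "small-prod-2"},
		{"large-prod-1", "large-prod-2"},
	}

	for _, clusters := range phases {
		for _, c := range clusters {
			deployTo(c)
		}
		// Bake time: keep checking control-plane health before expanding to the
		// next phase, and halt the rollout as soon as any cluster looks unhealthy.
		deadline := time.Now().Add(30 * time.Minute)
		for time.Now().Before(deadline) {
			for _, c := range clusters {
				ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
				healthy := controlPlaneHealthy(ctx, c+".kubeconfig")
				cancel()
				if !healthy {
					fmt.Println("control plane unhealthy in", c, "- halting rollout")
					return
				}
			}
			time.Sleep(time.Minute)
		}
	}
	fmt.Println("rollout completed across all phases")
}
```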
In today's fast-paced, AI-driven world, where you can ship features as fast as you can think them up, platform reliability is crucial. This incident underscores that delivering features reliably is no easy feat, and a poorly planned rollout directly impacts the product, its consumers, and investors.
And remember, when in doubt, it's always DNS. After all, if it's not DNS, it's probably just DNS pretending to be something else! - Puru Tuladhar