Software at Scale
Software at Scale 20 - Naphat Sanguansin: ex Server Platform SRE, Dropbox
Naphat Sanguansin was formerly the TL of the Server Platform SRE and Application Services teams at Dropbox, where he led efforts to improve Dropbox’s availability SLA and set a long-term vision for server development.
This episode is more conversational than regular episodes since I was on the same team as Naphat and we worked on a few initiatives together. We share the story behind the reliability of a large monolith with hundreds of weekly contributors, and the eventual decision to “componentize” the monolith for both reliability and developer productivity that we’ve written about officially here. This episode serves as a useful contrast to the recent Running in Production episode, where we talk more broadly about the initial serving stack and how that served Dropbox.
Highlights
1:00 - Why work on reliability?
4:30 - Monoliths vs. Microservices in 2021. The perennial discussion (and false dichotomy)
6:30 - Tackling infrastructural ambiguity
12:00 - Overcoming the fear from legacy systems
22:00 - Bucking the traditional red/green (or whatever color) deployments in emergencies. Pushing the entire site at once so that hot-fixes can be checked in quickly. How to think about deployments from first principles. And the benefits of Envoy.
31:00 - What happens when you forget to jitter your distributed system
34:00 - If the monolith was reliable, why move away from the monolith?
41:00 - The approach that other large monoliths like Facebook, Slack, and Shopify have taken (publicly) is to push many times a day. Why not do that at Dropbox?
52:00 - Why zero-cost migrations are important at larger companies.
56:00 - Setting the right organizational incentives so that teams don’t over-correct for reliability or product velocity.