(NOTE: This was originally published in October 2022 🎃👻)
Halloween is around the corner. Buckle up for a spooky engineering ghost story.
A few years ago, I worked as a software engineer at a large company building a video streaming service. Our first customer was a major professional sports league that would be using our service to broadcast a livestream of their games once a week to millions of viewers, an opportunity that was both exciting and terrifying!
When we signed on to the project, our service didn't actually exist yet. But the league's broadcast schedule certainly did. The launch date was rock solid, and the service had to be able to handle all of the traffic sent to us.
Is this where the scary part of the story begins? Nope! We had a fantastic engineering team and an architecture design we believed in. The schedule was tight, but we were confident we'd be able to hit our launch date. We put our heads down and got to work.
A few weeks before the first broadcast, we were feeling pretty good. The service was built, sans some finishing touches. The team was in the home stretch of load testing to make sure the service would hold up to the traffic at game time, and everything was business as usual.
But then…
We got our first realistic sample data set from our customer and integrated it into our load tests. It did not go smoothly. Based on our budget and our estimates for how much data we would need to store, we had configured a maximum read and write capacity for DynamoDB. But during the load test, we found that we were dramatically exceeding that capacity and running into DynamoDB throttles. Our service failed. Hard.
Be afraid. Be very afraid.
Uh oh. It's only a few weeks until our first broadcast, and we have a major problem. In our architecture design, there was data we needed to store for each individual viewer watching the broadcast to keep track of where they were in the stream. We had decided to store this data in DynamoDB. After investigating the traffic that the broadcaster was sending us, we discovered the payload for each viewer might be up to 10x larger than our estimates. This required 10x the IOPS on DynamoDB, and 10x the cost!
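To see why bigger payloads translate directly into bigger bills, here's the kind of napkin math involved. DynamoDB bills writes in 1 KB increments, so a 10x larger item can mean 10x the write capacity units (WCUs). The traffic numbers below are made up for illustration, not our real figures:

```python
import math

def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    # DynamoDB charges 1 WCU per standard write of up to 1 KB;
    # larger items consume ceil(size / 1 KB) WCUs per write.
    return math.ceil(item_size_bytes / 1024) * writes_per_second

# Hypothetical traffic: 100k viewers, one position update per viewer per second.
planned = write_capacity_units(1_024, 100_000)   # ~1 KB payloads we budgeted for
actual = write_capacity_units(10_240, 100_000)   # ~10x larger payloads in the sample data
print(planned, actual)  # 100000 1000000 -> 10x the provisioned WCUs
```

Since provisioned WCUs are billed per unit, that 10x jump in capacity flows straight through to the monthly bill.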
Our workload was very write-heavy. Some napkin math based on the observed 10x increase in data made it clear that storing it in DynamoDB would put us far over budget. This data was ephemeral, so we decided we could move it out of DynamoDB and into a cache. We did some quick research on our options and decided to move forward with a managed Redis solution.
Managed Redis services have some nice benefits in that you aren't explicitly responsible for provisioning and operating the individual nodes in your cache cluster. But you *are* explicitly responsible for determining how many nodes you need in your cache cluster, and how big they need to be.
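That sizing exercise is mostly arithmetic plus guesswork about headroom. A minimal sketch of the back-of-the-envelope calculation, with hypothetical figures (dataset size, usable RAM, and headroom are all illustrative assumptions, not our actual numbers):

```python
import math

def nodes_needed(dataset_gib: float, node_ram_gib: float,
                 memory_headroom: float = 0.25,
                 replicas_per_shard: int = 1) -> int:
    # With a managed Redis service you still pick the node size and count.
    # Leave headroom for replication buffers, fragmentation, and spikes.
    usable = node_ram_gib * (1 - memory_headroom)
    primaries = math.ceil(dataset_gib / usable)
    # Each primary shard gets N replicas for availability.
    return primaries * (1 + replicas_per_shard)

# Hypothetical: a 500 GiB working set on nodes with 13 GiB of RAM each,
# one replica per shard.
print(nodes_needed(500, 13))  # 104 nodes
```

The catch is that every variable here (peak working set, headroom, replica count) has to be validated under realistic load, which is exactly the testing treadmill described next.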
The next step was to write code to simulate the load that we would put on the Redis cluster, and run it... over and over again. We tested different sizes of nodes. We tested different cluster sizes. We tested different replication configurations. We tested. A lot.
All of this synthetic load testing to size a caching cluster was not work that we had accounted for in our engineering plans. Experimenting with different sizes (and types) of cache nodes, monitoring them to ensure they weren't overloaded during the test runs… These tasks were expensive and time consuming, and largely ancillary to the actual business logic of the service we were trying to build. None of them were especially unique to us. But we still had to allocate precious engineering resources to them.
After a week, we had nailed down the sizing and configuration for our cluster, still racing against the clock. After another week, we had completed the work to migrate that part of our code off of DynamoDB and onto the Redis cluster.
And the service was up and running again.
It's alive! It's aliiive!
We did it! The first broadcast went smoothly. As with any major software project, after observing it in action in the real world, we learned some lessons and found some things to improve, but viewers had a good experience. We rolled out some of those improvements during the subsequent weeks, and before we knew it, the season was well underway. Victory!
Until…
About a month into the season, we got our AWS bill. To say that it caused us a fright would be an understatement. The bill was… HUGE! What the heck happened?!
## It's coming from inside the house!
Because of our architecture, we knew that the biggest chunk of our bill was going to come from DynamoDB. But we had done a reasonable job of estimating that cost based on our DDB capacity limits. So why was the AWS bill so high?
It turns out that the culprit was our Redis clusters. In retrospect, it was predictable, but we had been so busy just trying to make sure things were operational in time to meet our deadlines that we hadn't had time to do the math.
To meet the demands of our peak traffic during the games, we had been forced to create clusters with 90 nodes in them, in every region we were broadcasting from. Plus, we needed each node to have enough RAM to store all the data we were pumping into them, which required very large instance types.
## Is this place haunted?
Very large instance types that provided the amount of RAM we needed also came with high numbers of vCPUs. Redis executes commands on a single thread, meaning it can only take advantage of one vCPU on each node in the cluster, leaving the remaining vCPUs almost 100% idle.
So there we were, paying for boatloads of big 16-vCPU instances, and we were guaranteed that each one of them would never use more than about 6% of the CPU it had available. Believe it or not, this wasn't even the worst of it.
The peak traffic we would experience during the sports broadcasts dwarfed the traffic we were handling during any other window of time. So not only were we forced to pay for horsepower that we weren't even fully utilizing during the games, but we were paying for these Redis clusters 24 hours a day, seven days a week, even though they were effectively at 0% utilization outside of the 3-hour window each week when we were broadcasting the sporting events.
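Some quick math makes the waste concrete, combining the 16-vCPU figure above with the 3-hour weekly broadcast window (a rough sketch; real utilization wasn't literally zero off-peak):

```python
# Rough utilization math for a cluster that only matters 3 hours a week.
broadcast_hours_per_week = 3
hours_per_week = 24 * 7           # 168

time_utilization = broadcast_hours_per_week / hours_per_week
cpu_utilization = 1 / 16          # Redis pins one of 16 vCPUs per node

# Effective share of the billed vCPU-hours actually doing work, in-season:
effective = time_utilization * cpu_utilization
print(f"{effective:.4%}")  # roughly a tenth of a percent
```

In other words, even during the season, the overwhelming majority of the compute being billed was sitting idle.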
And then the season ended and we had no more sports broadcasts for 6 months. So now those clusters were sitting at approximately 0% utilization 24/7.
Okay, fine. Problem identified. All we had to do was fix it and get our cloud bill under control!
## A horde of zombie… engineers!
Well, it turns out that fixing our spend on our Redis clusters was much easier said than done. The managed Redis service didn't have any easy, safe way to scale the clusters up and down. And because Redis clients handle key sharding on the client side, they have to be aware of the list of available servers at any given time, meaning that scaling the cluster in or out carries a high risk of impacting cache hit rate during the transition, and thus would need to be managed very carefully.
These were solvable problems. Throw enough engineers at something, and anything is possible, right? They could update all of the code so that it writes to two different clusters during a scaling event and have reads fail over from the new cluster to the old one for cache misses during the transition. Then, they could scale down by adding a second, smaller Redis cluster alongside the giant one needed for peak traffic. They could definitely handle the work of meticulously monitoring the behavior of the new code while the new cluster was brought online, and they could decide when it's safe to begin the teardown of the old cluster. Oh, and they could kick that off and meticulously monitor it to make sure that goes smoothly.
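The dual-write, fallback-read pattern described above can be sketched like this. Plain dicts stand in for the two Redis clusters so the example is self-contained; real code would use two Redis client instances and handle TTLs, errors, and the eventual teardown:

```python
class MigratingCache:
    """Cache wrapper used during a cluster scaling event (illustrative sketch)."""

    def __init__(self, old_cluster: dict, new_cluster: dict):
        self.old = old_cluster
        self.new = new_cluster

    def set(self, key, value):
        # During the transition, write to both clusters so the new one
        # warms up without losing data that only exists in the old one.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Read from the new cluster first; fall back to the old one on a
        # miss so the hit rate doesn't crater mid-migration.
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

old, new = {"viewer:1": "00:42:17"}, {}
cache = MigratingCache(old, new)
print(cache.get("viewer:1"))       # falls back to the old cluster
cache.set("viewer:2", "00:01:03")  # lands in both clusters
```

Once the new cluster's hit rate catches up, reads and writes can be cut over entirely and the old cluster torn down. Simple in outline, but every step needs careful monitoring, which is exactly the operational burden at issue here.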
So sure, our team was capable of doing that twice a week: once when we needed to scale up in preparation for the sports broadcast, and again when we needed to scale down to save costs after the event.
But that would be a ton of work. Now we were forced to do some math on how much we were paying those engineers vs. how much we were paying for the overprovisioned Redis clusters.
And then there's the opportunity cost: none of this cluster scaling nonsense had any unique business value for us, and we had a limited number of engineers available to work on delivering features actually unique to our business and providing customer-facing value to our users.
I bet you can guess where we landed. Yep. We never reached a point where we felt we could justify the engineering cost it would take to solve this problem when there were so many more valuable customer projects our engineers could be working on: projects that would actually move the business forward and win us new customers.
So we just kept paying. For something we weren't using.
At a certain point, if our business was struggling, we might have been forced to allocate the engineering resources to solving this problem in order to reduce our spending and balance the budget. But this would have been a sign that we were in trouble.
And I don't know how you feel about the cloud services your team spends money on, but I consider it pretty scary that a cloud service can make it so complicated to get a fair bill (one where you pay for what you actually use, rather than a ton of money for resources sitting idle) that you will only make time for it if you've gotten into a desperate situation.
It's a great business model for the cloud service provider. Not a great business model for the customer.
It doesn't have to be this way.
## Momento Cache: All treat, no tricks!
The horrific tale you've just read was a large part of the inspiration for us to build Momento's serverless caching product. One of the best things about serverless cloud services is the fair pricing model: pay for what you use and nothing more. Why should we settle for less with caching?
With Momento, you get a dead-simple pricing policy based strictly on how many bytes you send to and receive from your cache. We don't think you should have to pay more if those bytes are all transferred within a 3-hour window than if they are evenly distributed over the course of a week or a month. As far as we're concerned, you should be able to read and write your cache when you need it. That's it. Plain and simple.
Of course, serverless doesn't stop there. We manage all of the tricky stuff on the backend for you. If your traffic increases and your cache needs more capacity, that's on us. If your traffic decreases, you shouldn't have to pay the same amount of money for your low-traffic window as you did for your high-traffic window. And you most certainly shouldn't have to pay for 15 idle CPU cores on a bunch of nodes in a caching cluster just because you needed more RAM.
So: stop letting cloud services trick you into paying for caching capacity that you aren't using, and see what a treat it is to work with Momento today! You can create a cache, for free, in less than five minutes. If it takes more than five minutes, let us know, and we'll send you some Halloween candy.
Visit our getting started guide to try it out, and check out our pricing page to see how we make sure you get what you pay for.
Happy Halloween! 👻