In the 1980s, traffic in downtown Boston was nearly unbearable, so city planners came up with a plan to reroute the highway through a tunnel below downtown Boston. The project was nicknamed "the Big Dig". Construction started in 1991 and when it finished in 2007, the final price tag was around 15 billion dollars, about twice the originally expected cost. Source.
This is a story about a software project that ran over budget due to organizational mishaps, no oversight and no awareness of cost.
Spoiler alert: No actual query analysis is shown in the article. I am not a SQL expert; more of an observer with a passion for writing and with 20/20 hindsight.
Before we dive deeper into the money-bleeding monster, here's a bit of background.
Cost of a project
Total cost of ownership (TCO) is the total cost of owning software over its entire lifecycle, including the initial building price and ongoing charges, such as maintenance, human capital investments, resource allocation, and opportunity costs. Source.
A software product can make financial sense if the value it provides to the organization is greater than the TCO over its lifespan.
The cost of this particular software product, whose main part was an Athena query, was around $15,000 per month in its operational phase (excluding the cost of building it). At one point, it was more expensive than ~40 RDS databases. Its value to the business is hard to determine because it provided data for dashboards.
The organization
The organization and its people built this product during a difficult period, with COVID, financial insecurity and layoffs looming over everyone and everything. I'm sure they did their best under the circumstances and this article is in no way meant to criticize.
Most software development in this organization is done with 2-pizza teams working in 2-week sprints to deliver peer-reviewed code that runs in the cloud configured by infrastructure-as-code. Decent operational monitoring and alerting exists and it works.
Our money-bleeding monster was an outlier; it was built by one developer. Read on to understand how serverless can be expensive.
1. Cost awareness, or lack of it
From the initial idea for this product, through its design and creation, all the way to delivery, no one sat down to calculate how much it would cost. Architecture, design and development were done by a single developer to fulfill a business need. But it is not the fault of that developer that the TCO calculation was not done.
Business stakeholders and product owners usually think about TCO and whether a software product makes financial sense. A developer could have calculated, with a fair amount of certainty, the operational costs of the product. AWS Athena pricing is dead simple: $5 per terabyte of data scanned. Fourth-grade math at best.
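That back-of-the-envelope calculation could look something like the sketch below. It assumes the flat $5 per terabyte price quoted above. The function name, the 2 TB scan size and the once-a-day schedule are made up for illustration and are not figures from the actual project.

```python
# Rough estimate of what a recurring Athena query costs per month,
# assuming a flat price per terabyte of data scanned.
PRICE_PER_TB_USD = 5.0  # figure quoted in this article; verify current pricing


def monthly_query_cost(tb_scanned_per_run: float, runs_per_month: int,
                       price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated monthly cost in USD of a recurring query."""
    if tb_scanned_per_run < 0 or runs_per_month < 0:
        raise ValueError("inputs must be non-negative")
    return tb_scanned_per_run * runs_per_month * price_per_tb


# Hypothetical example: a query that scans 2 TB and runs once a day for 30 days.
estimate = monthly_query_cost(tb_scanned_per_run=2, runs_per_month=30)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Because the amount of data in PROD keeps growing, it is worth re-running an estimate like this with the scan size you expect a year from now, not only with today's.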
So what failed then? There was no one to ask the question: "How much will this query cost?"
2. No systematic cost reporting
AWS infrastructure used by the organization is managed through code and it is uniformly tagged. Some of those tags include a project name, the environment where the project is running (DEV, PROD) and owner/cost center.
It would be trivially easy to create a monthly cost report using AWS Cost Explorer or any similar tool. The report could break down cost per project, environment and cost center and it could be sent to owners for review.
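To give an idea of how little logic such a report needs, here is a minimal sketch. It assumes billing line items have already been exported as records carrying the tags mentioned above. The field names, project names and amounts are all hypothetical, and the part that fetches real data from a billing tool is deliberately left out.

```python
from collections import defaultdict

# Hypothetical billing line items. In practice these might come from a cost
# export or a cost-reporting tool; the field names here are assumptions.
line_items = [
    {"project": "sales-dashboard", "environment": "PROD", "cost_center": "analytics", "cost_usd": 15000.0},
    {"project": "sales-dashboard", "environment": "DEV", "cost_center": "analytics", "cost_usd": 120.0},
    {"project": "checkout-api", "environment": "PROD", "cost_center": "payments", "cost_usd": 900.0},
    {"project": "checkout-api", "environment": "PROD", "cost_center": "payments", "cost_usd": 350.0},
]


def cost_report(items, group_by=("project", "environment", "cost_center")):
    """Sum cost per unique combination of the given tag keys."""
    totals = defaultdict(float)
    for item in items:
        # Resources missing a tag are grouped under "untagged" instead of dropped.
        key = tuple(item.get(tag, "untagged") for tag in group_by)
        totals[key] += item["cost_usd"]
    # Most expensive groups first, so outliers land at the top of the report.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


for key, total in cost_report(line_items):
    print(f"{' / '.join(key):45s} ${total:>10,.2f}")
```

Sorting by cost puts an outlier on the first line of the report, where an owner reviewing it is most likely to notice it.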
So what failed then? There is always something more pressing to work on. No one cares about these cost reports so the platform team never prioritized them.
3. Managed service misuse
AWS Athena is a serverless analytics service with capability to query structured data from AWS S3 and other data sources using SQL-like syntax.
In our case, data was uploaded to an AWS S3 bucket from Kafka. Kafka is the backbone for all microservices and a large chunk of business data flows through it. All that data ends up in an S3 bucket and is partitioned to support WHERE clauses in queries. Having a WHERE clause tells Athena to scan only the data partition that matches it. Partitioning schema is based on YEAR/MONTH/DAY pattern, so an example query can look like:
SELECT style_id FROM schema.SALES
WHERE month=11 AND day=20 AND color='blue'
This would return the style IDs of all blue clothing items that were sold on the 20th of November. So far so good.
The real query did not use any WHERE clauses. This might be acceptable in the DEV environment, though personally I'd still use LIMIT 10. It is not acceptable in the PROD environment, especially when the volume of data will only ever go up. The ever-growing amount of data in PROD meant that every time the query ran:
- it scanned more and more data
- it scanned all the data, even data it did not need
Athena is truly serverless and it will happily scan, scale and charge you for what it scanned. "Scan everything? On it, boss!" "Scan some more? Don't mind if I do."
Using a WHERE clause on the partition columns can drastically reduce the amount of data scanned, which directly determines the cost incurred. It will also shorten the query execution time.
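The effect of partition filters on the bill can be illustrated with a toy model. The sketch below assumes a hypothetical table with one partition per day of 2023, each holding 0.25 TB, and the $5 per terabyte price quoted in this article. It only models partition pruning: filters on non-partition columns such as color are ignored, and so are file formats and compression, which also influence how much data a real query scans.

```python
from datetime import date, timedelta

PRICE_PER_TB_USD = 5.0  # price quoted in this article

# Hypothetical table: one partition per day of 2023, each holding 0.25 TB.
partitions = {}
current = date(2023, 1, 1)
while current.year == 2023:
    partitions[(current.year, current.month, current.day)] = 0.25
    current += timedelta(days=1)


def tb_scanned(partitions, year=None, month=None, day=None):
    """Sum the size of the partitions that match the given filters.

    A filter left as None matches every value, which mimics a query
    without the corresponding WHERE condition.
    """
    total = 0.0
    for (p_year, p_month, p_day), size_tb in partitions.items():
        if year is not None and p_year != year:
            continue
        if month is not None and p_month != month:
            continue
        if day is not None and p_day != day:
            continue
        total += size_tb
    return total


full_scan = tb_scanned(partitions)  # no WHERE clause at all
one_day = tb_scanned(partitions, year=2023, month=11, day=20)

print(f"Full scan: {full_scan} TB, ${full_scan * PRICE_PER_TB_USD:,.2f} per run")
print(f"One day:   {one_day} TB, ${one_day * PRICE_PER_TB_USD:,.2f} per run")
```

In this model the unfiltered query gets more expensive with every day of data that is added, while the cost of the filtered query stays flat.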
So what failed then? There was no one to ask the question: "Can this query be improved?"
4. One person "team"
Architecture, design and development on this product was done by a single developer. He had no help, no one to discuss ideas with, no one to review his code.
Every developer has the responsibility to write good software. Every human being has the right to make mistakes.
So what failed then? The organization should not have allowed this. The developer was not set up for success from day one. One person is not a team.
Learnings
I hope it goes without saying but: do the opposite of what the points above illustrate. Also, here are some learnings we acquired over time.
Do the math
The operations phase is an important phase of every product or service. That is when it generates value for the organization. This phase is hopefully also the longest. Because of these two facts, understanding the operational costs of the product is very important. Calculate them early.
Own it
And I really mean OWN IT. All of it. A product's lifecycle is only in its infancy when you ship the code; it doesn't end there. All the logs, metrics, alerts, bills etc. that the product creates must be owned by someone.
Inform and be informed
If I tell you that on average, an ice cream costs $0.40 where I live, that's data. But if I add that the ice cream truck passes my house every day, that's information (and temptation!) that you can use to buy cheap ice cream every day.
1TB of data scanned with Athena costs $5 and that's a fact. Knowing that your query will scan close to 100TB each time it runs is valuable information. Be informed and inform business stakeholders too.
Friends or foes
Managed services can be great friends. With all the heavy lifting they do and a tendency to reduce prices over time, one would be hard-pressed to not use them.
You can make sure they stay your friends by owning your product and creating appropriate billing alerts. Even hardcore serverless teams experience runaway costs - but they catch them early!
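The logic behind a simple billing alert fits in a few lines. The sketch below is only an illustration of the idea and not a replacement for the budgeting and alerting tools your cloud provider offers. It assumes spend accrues roughly evenly through the month, and the function names, budget and spend figures are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """Return True when the projected month-end spend exceeds the budget."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget


# Hypothetical example: $5,000 spent by June 10th against a $1,000 budget.
today = date(2024, 6, 10)
if should_alert(spend_to_date=5000.0, today=today, monthly_budget=1000.0):
    projection = projected_month_end_spend(5000.0, today)
    print(f"Projected spend ${projection:,.2f} exceeds the monthly budget")
```

Alerting on the projection instead of the final bill is what lets a team catch a runaway query after a few days and not at the end of the month.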
In case you're wondering, the query was improved and it now costs a fraction of what it cost before. And if you're thinking that the money that was wasted on a poor query could have been used to hire more people, I agree with you. We humans learn and evolve our whole lives, but so do organizations.
Learning from our own mistakes is the most difficult way to learn, but those lessons tend to stick the longest. I can only hope that this organization has enough knowledge management capacity to prevent scenarios like this one in the future. Otherwise, the myth that serverless is expensive will live on. High cost comes from sloppy code, bad development practices and organizational processes that allow them to happen.