Abstract
Carefully planning all stages of development can help us manage expectations around architectural decisions, roadmaps, costs, and technical debt, among other aspects.
Opening and keeping a communication channel between the engineering and product teams is an important part of the process and should be taken seriously.
The concepts of load testing and stress testing provide valuable information about the performance of your application and how you can act to mitigate degradation of the service you provide to users.
Opinions are my own and not those of my employer.
The concept
Performance has become an important piece of development as it provides historical and real-time insight into how an application (ex: a micro-service) reacts to different demands. With this data, we can also tailor our infrastructure to be optimised for those demands.
Performance development means thinking about performance at every step of development, and it should be treated as a daily practice.
As developers, we already take advantage of many tools that help us comply with code standards (ex: ESLint or Commitlint), so why not use tools that enable us to comply with performance targets as well?
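As a sketch of what that could look like, here's a hypothetical k6 script where thresholds act like a "lint rule" for performance - the endpoint and numbers are made up, and the run fails in CI if a threshold is crossed:

```ts
// load-check.ts - run with `k6 run load-check.ts`
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 10,          // 10 virtual users...
  duration: '30s',  // ...for 30 seconds
  // Like a lint rule: the run exits non-zero if these thresholds are crossed.
  thresholds: {
    http_req_failed: ['rate<0.01'],   // less than 1% failed requests
    http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
  },
};

export default function () {
  http.get('https://staging.example.com/health'); // hypothetical endpoint
  sleep(1);
}
```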
So how can we track performance?
Tracking performance
Performance can be impacted by two factors, the codebase and the provisioned infrastructure. Monitoring/Observability is implied in both.
The codebase can suffer from poor architectural decisions that affect its performance in many ways: maintainability, memory and CPU consumption, among others.
The infrastructure can also have a huge impact on performance, mainly if it's under-dimensioned, which can crash the application for no apparent reason, or if poor decisions lead to needlessly scaling the infrastructure vertically.
Both factors are strongly connected in the early stages, as you need to provision infrastructure to deploy the codebase.
Monitoring can impact the codebase significantly, but with a well-thought-out observable system implementation it definitely provides valuable insights into the performance of both the codebase and the infrastructure.
If you are interested in Monitoring, please read my Modern Monitoring post for a more in-depth analysis.
Monitoring incurs additional costs. Logs can drastically affect performance (ex: throughput of a micro-service). Infrastructure monitoring also incurs costs with your cloud provider.
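As a small sketch of controlling that overhead in a Node micro-service, verbose logs can be kept behind a log level - here with the pino logger, values illustrative:

```ts
// logger.ts - keep expensive debug logging out of the hot path (sketch)
import pino from 'pino';

// In production, run with LOG_LEVEL=warn to skip info/debug entries entirely.
const logger = pino({ level: process.env.LOG_LEVEL ?? 'info' });

export function handleRequest(payload: unknown) {
  // Guard expensive serialisation so it only runs when debug is enabled.
  if (logger.isLevelEnabled('debug')) {
    logger.debug({ payload }, 'incoming payload');
  }
  // ... actual request handling ...
  logger.info('request handled');
}
```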
If you rely on a cloud provider to manage and deploy, consider starting with a small compute instance, or a low-impact image when using container orchestrators such as Kubernetes. The same applies to on-premises solutions.
For serverless solutions the scenario is slightly different, since the infrastructure is mainly managed by the provider itself. But apart from that difference, the rest should also apply.
As new features are developed, chances are your application will get more "greedy" and demanding, or perhaps user demand itself has changed. Running load tests against these new scenarios will surface new issues.
Now we can act upon that!
We can review/refactor the codebase and/or apply changes to the provisioned infrastructure.
Development process
As mentioned above, starting a new project often involves setting up code analysis tooling, and that gets defined somewhere or by someone, normally by a company or team policy.
With this common ground in mind, we also need "policies" for what we are targeting in terms of business value. That somewhere or someone is the stakeholders/product team. These "policies" can be defined as SLAs, but we will discuss this in more detail later on.
This means we need numbers to track performance against, and a process for measuring it.
Continuing with the micro-service example, what's the expected maximum number of simultaneous requests hitting it? And how can we guarantee that our codebase is up to that challenge and our infrastructure is well provisioned to withstand that demand? The answer is load testing.
The stakeholders'/product's expectations define the threshold we work with.
With this threshold in mind, a well-implemented load testing step ensures that no modification to the codebase or infrastructure ever makes us fall below it. And if it comes to that, we can immediately take action and make the changes needed to get back above it.
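As a sketch, suppose stakeholders/product expect at most 200 concurrent users with a p99 latency under one second; a k6 scenario could codify exactly that threshold (all numbers and endpoints hypothetical):

```ts
// peak-load.ts - the product-defined threshold, codified as a test
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 200 }, // ramp up to the agreed peak of 200 users
    { duration: '5m', target: 200 }, // hold at the expected peak
    { duration: '1m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(99)<1000'], // the agreed latency objective
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/orders'); // hypothetical
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```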
Don't run load tests blindly! Your consumers are your own; act accordingly.
If stakeholders/product fail to provide solid and reliable load scenarios, then architecture/development is crippled. The same goes the other way around: if the designed and implemented architecture is not optimised and maintainable, then you will most likely fail to deliver your service within the stipulated SLAs.
Load testing is effective at identifying real markers such as latency, 3rd-party response timeouts, or any other particular dependency your application has.
How can we identify all of these?
In order to gather, digest, correlate and analyse all these markers we need an observable system, as mentioned above.
Monitoring & Observability
Observability describes how accurately we can observe the external outputs of a system and infer from them a measure of its internal state.
A system should have health checks, metrics, log entries and end-to-end tracing in order to be considered observable.
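As a minimal sketch of the first of those pillars, health checks in an Express (TypeScript) micro-service could look like this - the endpoints and port are illustrative:

```ts
// health.ts - minimal liveness/readiness checks (sketch)
import express from 'express';

const app = express();

// Liveness: the process is up and able to respond.
app.get('/health/live', (_req, res) => {
  res.status(200).json({ status: 'ok', uptime: process.uptime() });
});

// Readiness: dependencies (database, caches, ...) are reachable.
app.get('/health/ready', async (_req, res) => {
  const dependenciesOk = true; // replace with real checks (DB ping, etc.)
  res.status(dependenciesOk ? 200 : 503).json({ ready: dependenciesOk });
});

app.listen(3000, () => console.log('listening on :3000'));
```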
Observability is not Monitoring!
Making performance development possible requires a good monitoring & observability system, capable of retrieving valuable information from your codebase and infrastructure.
Every action depends on the accuracy and reliability of all the data types a system provides. Choose your service wisely and, above all, make sure it provides most of the tools you need; spreading across different services makes data harder to maintain and keep in sync.
Load Testing
As we have been discussing, load testing evaluates the performance of an application, allowing us to detect issues with the current infrastructure and codebase.
But how can we add this step to our development process? There are two options:
- Run against the production environment;
- Provision an environment that mirrors production.
Please read Load testing micro-services by João Tiago for a deep dive on a real example of the first approach.
The approach you choose depends on how much value it adds to the application. I chose the second approach because it lays the foundation for stress testing, which is described in the next section.
If your application depends on 3rd-party APIs, you may need to contact your liaison to understand what impact running the tests will have on their servers. Possible solutions are for them to provision a sandbox or for you to mock their API.
There's a chance the 3rd party will start blocking requests from your load tests if you don't contact them first for clarification on their limitations, if any.
If you mock their API, pay special attention to mocking their limitations as well, such as maximum timeouts, throttling, etc.
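As a sketch, a mock that reproduces a 3rd party's limitations as well as its responses could look like this - the endpoint, rate limit and latency are all hypothetical:

```ts
// third-party-mock.ts - mock a 3rd-party API *including* its limitations
import express from 'express';

const app = express();

const WINDOW_MS = 1000;    // hypothetical: the provider allows 50 req/s
const MAX_PER_WINDOW = 50;
let windowStart = Date.now();
let count = 0;

app.get('/v1/quote', (_req, res) => {
  const now = Date.now();
  if (now - windowStart > WINDOW_MS) {
    windowStart = now; // new window, reset the counter
    count = 0;
  }
  // Reproduce the provider's throttling: answer 429 above the limit.
  if (++count > MAX_PER_WINDOW) {
    return res.status(429).json({ error: 'rate limit exceeded' });
  }
  // Reproduce the provider's typical latency (hypothetical 200-400ms).
  const latency = 200 + Math.random() * 200;
  setTimeout(() => res.json({ price: 42.0 }), latency);
});

app.listen(4000);
```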
The provisioned infrastructure MUST be identical to production, with the difference that it will be an ephemeral environment.
Consider using the CI/CD pipeline workflow to trigger the load testing step before the application is deployed to production, although it can also run as a parallel step that won't block the deployment. The key takeaway is that it's a really important step to keep tabs on the performance of your application, and it should be part of the pipeline workflow.
One important thing to take into account is where you're making the requests to your application from. If you're targeting people all over the world and you have a multi-region deployment, consider adding these geographic constraints as well, since they influence the latency your consumers might experience.
Here's a simple example: after running a load test we detect that the provisioned machine type is short on CPU and oversized on memory for its usage, so it's easy to figure out that we may need to change to a type with more CPU and less memory.
The next step, stress testing, complements load testing and has its own unique purpose.
Stress Testing
A particularly interesting step of performance development is stress testing, which simulates scenarios of abnormal peaks until the application crashes, providing awareness and predictability of the application's limits.
By knowing the hard limitations of the infrastructure, we can start thinking about ways to mitigate the loss of QoS - Quality Of Service (ex: scalability).
The purpose of any stress testing is to crash something - always!
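As a sketch with k6, that usually means ramping the load well past the expected peak until the service breaks - the stages below are hypothetical:

```ts
// stress.ts - keep increasing the load until something gives in
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 200 },  // the expected peak from load testing
    { duration: '2m', target: 400 },  // 2x the peak
    { duration: '2m', target: 800 },  // 4x
    { duration: '2m', target: 1600 }, // keep going until the app crashes
  ],
};

export default function () {
  http.get('https://staging.example.com/api/orders'); // hypothetical endpoint
  sleep(1);
}
```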
First, we need to understand the importance of SLI, SLO and SLA.
- SLI (Service Level Indicators) "...are well-defined metrics that describe the behaviour of the system";
- SLO (Service Level Objectives) "...are specific targets for those SLI";
- SLA (Service Level Agreement) "...lists SLO that define the performance guarantees to customers and the consequences if they are not met".
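For example (values illustrative): an SLI could be the p99 request latency, the matching SLO could be "p99 latency below 300ms over a 30-day window", and the SLA the contract stating that customers get service credits if that SLO is breached.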
Stress testing is extremely important to profile your infrastructure and identify signs of degradation in your application.
Through monitoring & observability, we can define, for example, when our infrastructure needs to scale up.
Here are some of the most common and interesting ways of scaling your application:
- Proactive cyclic is defined by a periodic scaling that occurs at a fixed interval (daily, weekly, monthly, quarterly);
- Proactive event-based scales just when we expect a big surge of traffic due to a scheduled business event (new product launch, marketing campaigns);
- Auto-scaling based on demand requires a monitoring service, so the system can trigger appropriate actions such as scaling up or down based on metrics (utilisation of the servers or network I/O, for instance); see the sketch after this list.
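Here's a sketch of the feedback loop behind that last option - the metric source and the scaling call are hypothetical stand-ins for your monitoring service and infrastructure API:

```ts
// autoscaler.ts - the feedback loop behind demand-based auto-scaling (sketch)

// Stand-in for a real monitoring service query (hypothetical).
async function getAverageCpuUtilisation(): Promise<number> {
  return Math.random(); // replace with a real metrics query (0..1)
}

// Stand-in for a real infrastructure API call (hypothetical).
async function setReplicaCount(n: number): Promise<void> {
  console.log(`scaling to ${n} replicas`);
}

const MIN = 2;
const MAX = 20;
let replicas = MIN;

async function reconcile() {
  const cpu = await getAverageCpuUtilisation();
  if (cpu > 0.8 && replicas < MAX) {
    await setReplicaCount(++replicas); // scale up under pressure
  } else if (cpu < 0.3 && replicas > MIN) {
    await setReplicaCount(--replicas); // scale down when idle
  }
}

setInterval(reconcile, 60_000); // evaluate once a minute
```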
Another "side effect" of running this step is that we can see similarities with a (D)DoS - Distributed Denial of Service.
There is a distorted mirror between the havoc a DoS attack wreaks at the infrastructure layer and what stress testing tries to accomplish, as both are meant to damage the availability of your service to the point of collapse.
This is a perfect step to validate all the mechanisms in place to stop a (D)DoS attack - for example, request throttling.
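As a sketch of one such mechanism, request throttling in an Express service could use the express-rate-limit middleware - the limits are hypothetical:

```ts
// throttle.ts - request throttling as one (D)DoS mitigation (sketch)
import express from 'express';
import rateLimit from 'express-rate-limit';

const app = express();

app.use(
  rateLimit({
    windowMs: 60 * 1000, // 1-minute window (hypothetical limit)
    max: 300,            // each IP may make 300 requests per window
    standardHeaders: true,
  })
);

app.get('/', (_req, res) => res.send('ok'));
app.listen(3000);
```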
This is all great, but there are a lot of nuances here. When running a stress test you don't want those mechanisms to kick in, because the load is meant to simulate a legitimate scenario of abnormal peak usage. This can be tricky to configure or provision.
Costs
Everything discussed so far comes with a price, and that price is often a steep one at the end of the month.
The key takeaway from all of this is that there will be a tradeoff between costs and the confidence developers have when deploying.
Conclusions
Although a bit extensive, this post was not meant to go in-depth on each step. There are many approaches to each one and, honestly, things may be different in your situation.
I hope you can identify with all of the above. 😊
Please share positive feedback about this topic. 🙏
Thank you! ❤️