Introduction
On a cold and dark evening in December 2022, a good friend of mine called me and said: "Nicolas, I am creating a product that is going to scale massively and revolutionize the market, and I need your help". If I had a dollar for every time I heard that sentence, I would be financing trips to Mars by now.
Nevertheless, I met with the friend and his technical lead. After long hours of discussions (and daydreaming), the business model was summarized as follows: "The product is a maintenance management platform designed to help companies and vehicle owners efficiently manage their vehicles. The product aims to automate the entire maintenance procedure and provide preventive and predictive solutions by connecting vehicles to IoT devices, which allows the monitoring of maintenance parameters in real time."
I agreed to help them for many reasons, some of which include:
- They actually know what they are doing.
- The technical lead is exceptionally intelligent.
- I trust they will make it.
My job, evidently, was to architect and implement the infrastructure, deployment, and maintenance of the application.
Requirements and Challenges
At the time of discussion, they had just finished an MVP that was poorly deployed on AWS. In fact, both my friend and the technical lead had very little experience with anything related to infrastructure and DevOps. In addition, they had little money to pay my usual fees and therefore did not want to be a big burden on me. So at first, they suggested that I put together a very basic infrastructure and deployment strategy that they could use temporarily, until they raised more money.
The first thought I had was: "Those noobs don't even know what they are talking about". From my experience consulting with more than two dozen companies (from small startups to extremely large multinationals), once you start working with a bad infrastructure, chances are you will keep building on top of it until working on it becomes a living hell, and then possibly go out of business due to bad tech. I was definitely not going to be part of this scenario.
Therefore, my answer was: "No, I will do it properly". So after countless back-and-forth discussions, below is a summary of the challenges to think about while architecting the solution:
- There must be at least two environments: Develop and Production.
- The developers must be able to operate the infrastructure without having to become DevOps Engineers.
- Proper observability must be employed to quickly identify and solve issues when they happen (because they will happen).
- The cost must be as optimized as possible.
- And finally, I set a requirement, for my sake primarily: The solution must be robust enough to minimize the number of headaches I have to suffer from in the future.
Understanding the Application
Before actually coming up with a solution, a good approach is to first understand the different components of the application. So, as a first step, the technical lead was kind enough to walk me through each component and show me how to run the application locally.
For simplicity, both the backend (NodeJS) and frontend (ReactJS) applications live in a single monorepo managed through NX. The application stores its data in a PostgreSQL database. Surprisingly, the application was very well documented, a phenomenon I have rarely seen in my life. Therefore, understanding the behavior and the build steps of the application wasn't so difficult.
In about three hours, I was able to containerize, deploy, and run all the containerized application components on a single Linux machine. Amazing! First step complete.
Infrastructure Requirements
Now that the application is containerized, and all the steps documented, it is time to architect the infrastructure. Whenever I am architecting a solution, regardless of its complexity and cost, I always make sure to achieve the following characteristics:
Security: One of the most integral parts of any application is security. Robust software resists cyber attacks such as SQL injection, password attacks, and cross-site scripting (XSS). Integrating security mechanisms into the code is a mandatory practice to ensure the safety of the system in general, especially the data layer.
Availability: Refers to the probability that a system is running as required, when required, during the time it is supposed to be running. A good practice for achieving availability is to replicate the system and application as much as possible (containers, machines, databases, and so on).
Scalability: The on-demand provisioning of resources offered by the cloud allows its users to quickly scale resources in and out based on the varying load. This is especially important for optimizing cost while still serving traffic consistently.
System Observability: One of the most important mechanisms required to achieve a robust application is system visibility:
- Logging: Aggregating the application logs and displaying them in an organized fashion allows the developers to test, debug, and enhance the application.
- Tracing: Tracing requests is another important practice; it allows you to follow every request flowing in and out of the system and to rapidly find and fix errors and bottlenecks.
- Monitoring: It is essential to have accurate and reliable monitoring of every aspect of the system. Key metrics to monitor include, but are not limited to, CPU utilization, memory utilization, disk read/write operations, and disk space.
Infrastructure Solution
In light of all the above, and after twisting my imagination for a little bit, I came up with the architecture depicted in the diagram below (which does not display all the components used):
Networking
The infrastructure is created in the region of Ireland (eu-west-1). The following network components are created (a sketch of how they could be expressed in code follows the list):
- Virtual Private Cloud (VPC): To isolate the resources in a private network.
- Internet Gateway: To provide internet connectivity to the resources in the public subnets.
- NAT Gateway: To provide outbound connectivity to private resources.
- Public Subnets: In each availability zone.
- Private Subnets: In each availability zone.
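The article doesn't say which IaC tool was used to create these resources, but since AWS CDK comes up in the comments below, here is a minimal, hypothetical CDK (TypeScript, aws-cdk-lib v2) sketch of the network layer; the construct IDs and the single-NAT choice are my assumptions:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// Inside the constructor of a Stack deployed to eu-west-1.
// CDK derives the Internet Gateway, NAT Gateway, route tables,
// and per-AZ subnets from this single construct.
const vpc = new ec2.Vpc(this, 'AppVpc', {
  maxAzs: 2,       // one public and one private subnet per AZ
  natGateways: 1,  // a single NAT Gateway keeps the cost down
  subnetConfiguration: [
    { name: 'public', subnetType: ec2.SubnetType.PUBLIC },
    { name: 'private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  ],
});
```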
VPN
A VPN instance with a free license is deployed to provide secure connectivity for the developers and system administrators to the private resources in the VPC.
AWS EKS
An AWS EKS cluster is created to orchestrate the backend service of each environment. The cluster is composed of one node group of two nodes, one in each Availability Zone.
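Continuing the hypothetical CDK sketch, a cluster with a two-node managed node group spread across the private subnets might look like this (the Kubernetes version and instance type are assumptions, not the author's actual choices):

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as eks from 'aws-cdk-lib/aws-eks';

// Reuses the `vpc` from the networking sketch above.
const cluster = new eks.Cluster(this, 'AppCluster', {
  version: eks.KubernetesVersion.V1_24, // assumed version
  vpc,
  defaultCapacity: 0, // we define our own node group below
});

cluster.addNodegroupCapacity('default-pool', {
  minSize: 2,
  maxSize: 2,
  instanceTypes: [new ec2.InstanceType('t3.medium')], // assumed size
  subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
});
```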
Application Load Balancer
An Application Load Balancer (Layer 7) is created to expose the endpoints and provide the routing rules needed to bring traffic from the internet into the application. The load balancer is configured to serve traffic on ports 80 and 443.
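In the same hypothetical CDK sketch, the two listeners could be wired up roughly as follows; `albCertificate` is assumed to be the eu-west-1 certificate described in the ACM section, and target registration is omitted:

```typescript
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

const alb = new elbv2.ApplicationLoadBalancer(this, 'AppAlb', {
  vpc,
  internetFacing: true, // placed in the public subnets
});

// Port 80 simply redirects to HTTPS.
alb.addRedirect({ sourcePort: 80, targetPort: 443 });

// Port 443 terminates TLS and forwards to the backend target group.
const listener = alb.addListener('Https', {
  port: 443,
  certificates: [albCertificate], // assumed; see the ACM section
});
listener.addTargets('Backend', { port: 80 }); // targets registered elsewhere
```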
AWS RDS PostgreSQL
An AWS RDS PostgreSQL database is created to hold and persist the application’s data. Both the develop and production environments are hosted on the same instance but are separated logically.
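A sketch of that single shared instance in CDK might look like the following; note that the author mentions Aurora in the comments, in which case `rds.DatabaseCluster` with an Aurora PostgreSQL engine would replace this, and the version and size here are assumptions:

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const db = new rds.DatabaseInstance(this, 'AppDb', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_14, // assumed version
  }),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.SMALL),
  credentials: rds.Credentials.fromGeneratedSecret('app'),
  // The develop and production databases live on this one instance,
  // separated logically (e.g., separate databases or schemas).
});
```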
Clients VM
A private virtual machine with client applications installed (e.g., kubectl, a PostgreSQL client) for interacting with different parts of the infrastructure.
AWS ECR
Two ECR repositories are created for the backend service, one for each environment.
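In CDK these two repositories are a few lines; the naming scheme and lifecycle rule below are assumptions:

```typescript
import * as ecr from 'aws-cdk-lib/aws-ecr';

for (const env of ['develop', 'production']) {
  new ecr.Repository(this, `BackendRepo-${env}`, {
    repositoryName: `backend-${env}`,        // hypothetical naming
    lifecycleRules: [{ maxImageCount: 20 }], // keep only recent images
  });
}
```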
S3 Bucket
An AWS S3 bucket is created to host the frontend application for each environment.
AWS CloudFront
An AWS CloudFront distribution is created to cache the frontend application hosted on AWS S3 for each environment.
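Here is how the bucket and distribution for one environment could fit together in the hypothetical CDK sketch; the domain name is a placeholder, and `cdnCertificate` is the us-east-1 certificate from the ACM section below:

```typescript
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

// The bucket stays private; CloudFront reaches it through an
// Origin Access Identity that S3Origin creates automatically.
const siteBucket = new s3.Bucket(this, 'FrontendBucket', {
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
});

new cloudfront.Distribution(this, 'FrontendCdn', {
  defaultBehavior: {
    origin: new origins.S3Origin(siteBucket),
    viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
  },
  defaultRootObject: 'index.html',
  domainNames: ['app.example.com'], // hypothetical domain
  certificate: cdnCertificate,      // must be issued in us-east-1
});
```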
ACM
Public ACM certificates are required for the domains. One certificate must be created in the eu-west-1 region to be used by the load balancer, and another in us-east-1 to be used by CloudFront.
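In the CDK sketch, the cross-region certificate for CloudFront can be issued with `DnsValidatedCertificate` (assuming the domain is hosted in Route 53; the domain names are placeholders):

```typescript
import * as acm from 'aws-cdk-lib/aws-certificatemanager';
import * as route53 from 'aws-cdk-lib/aws-route53';

const hostedZone = route53.HostedZone.fromLookup(this, 'Zone', {
  domainName: 'example.com', // hypothetical zone
});

// CloudFront only accepts certificates issued in us-east-1,
// no matter where the rest of the stack lives.
const cdnCertificate = new acm.DnsValidatedCertificate(this, 'CdnCert', {
  domainName: 'app.example.com', // hypothetical domain
  hostedZone,
  region: 'us-east-1',
});
```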
CloudWatch
The infrastructure metrics and application logs are configured to be displayed in CloudWatch.
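As an illustration of the monitoring side, a basic CloudWatch alarm on node CPU could look like this in the same sketch (the threshold and periods are illustrative, not the author's actual settings):

```typescript
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Fires after three consecutive 5-minute periods (the default
// period) above 80% average CPU across the EC2 fleet.
new cloudwatch.Alarm(this, 'NodeCpuHigh', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'CPUUtilization',
    statistic: 'Average',
  }),
  threshold: 80,
  evaluationPeriods: 3,
});
```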
Application Deployment
Now that the infrastructure was successfully architected and created, I proceeded to deploy the containerized backend services and ensured their proper connectivity to the databases. Afterward, the frontend application was built and deployed on S3.
Continuous Delivery Pipelines
The last step before sharing the good news with the team was to automate the build and delivery steps of all the services. Evidently, none of the developers should have to perform the tedious and time-wasting tasks of building and deploying the application every time there is a change. As a matter of fact, knowing the pace at which the developers work, I expect they push code to develop 276 million times per day.
Therefore, I used AWS CodeBuild and AWS CodePipeline to automate the steps of building and deploying the services. The diagram below depicts all the steps required to continuously deliver the frontend and backend applications:
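Since the diagram is not reproduced here, a rough CDK outline of one backend pipeline follows; the repository details and the buildspec contents (build the image, push to ECR, roll the EKS deployment) are assumptions:

```typescript
import * as codebuild from 'aws-cdk-lib/aws-codebuild';
import * as codepipeline from 'aws-cdk-lib/aws-codepipeline';
import * as actions from 'aws-cdk-lib/aws-codepipeline-actions';

const connectionArn = 'arn:aws:codestar-connections:...'; // placeholder
const sourceOutput = new codepipeline.Artifact();

// buildspec.yml is assumed to build the Docker image, push it to
// ECR, and update the EKS deployment (e.g., `kubectl set image`).
const buildProject = new codebuild.PipelineProject(this, 'BackendBuild', {
  buildSpec: codebuild.BuildSpec.fromSourceFilename('buildspec.yml'),
});

new codepipeline.Pipeline(this, 'BackendPipeline', {
  stages: [
    {
      stageName: 'Source',
      actions: [
        new actions.CodeStarConnectionsSourceAction({
          actionName: 'Source',
          connectionArn,
          owner: 'acme',    // hypothetical
          repo: 'monorepo', // hypothetical
          branch: 'develop',
          output: sourceOutput,
        }),
      ],
    },
    {
      stageName: 'BuildAndDeploy',
      actions: [
        new actions.CodeBuildAction({
          actionName: 'BuildAndDeploy',
          project: buildProject,
          input: sourceOutput,
        }),
      ],
    },
  ],
});
```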
Conclusion
Once everything was done, I met with my friend and the technical lead for a handover. They were very pleased with the outcome, stating that the infrastructure is amazing, but overkill and much more than they need right now.
But in reality, it is not overkill. As a matter of fact, the product and the team are growing very rapidly. This solution is a skeleton that can be quickly and easily modified and scaled as needed:
- Backend services replicas can be easily modified.
- The EKS nodes can be easily scaled vertically and horizontally.
- The frontend application is on S3, which is automatically scalable.
- The database can be easily scaled vertically and horizontally.
After delivering the solution in mid-December 2022:
- The developers are happy because of the robustness and ease of use of the infrastructure.
- My friend is happy because his application is live, and is costing him less than $500 per month.
- I am happy because they never called me with a complaint.
Everybody is happy :)))) The end!!
Top comments (28)
Using EKS on an app with only 2 components and 2 people working on it definitely seems like overkill to me. Any reason why you didn't opt to run it on something like Beanstalk or Heroku?
Definitely, for many reasons actually:
ECS seems better
ECS still doesn't support ConfigMaps (well, the equivalent) which quickly becomes a nuisance.
@nick Well, you have many choices: AWS Secrets Manager, or use an env file from S3: docs.aws.amazon.com/AmazonECS/late...
I'm surprised the client VM (+ VPN) is still a thing, with Session Manager allowing RDP as well. A much more secure and simpler approach would be to have proper IAM controls around Session Manager with some "client VM"-like AMIs.
How do you provide direct access to private resources for developers?
Via the AWS console. You can have a Linux or Windows VM (whatever they prefer) with desktop clients like Workbench/pgAdmin/kubectl. The dev EKS cluster doesn't really need a private endpoint unless it's a specific requirement, and provided you haven't networked the dev and prod VPCs. kubectl is also easy to have locally, as is using Docker for MySQL/PostgreSQL, thus guiding developers to develop everything as code.
Give Fleet Manager/Session Manager a go. The "client VMs" don't need to be in the public subnet either; see how you could use it for your use case. It does make life easier.
I bet if you used Google Cloud (Cloud Run for the service, something like Firestore for the DB), you could get crazy low bills, like sub-$50/m, and have just as much power.
GCP has the best free tier, if cost is a major factor.
Each time I add CodeBuild and CodePipeline into the mix, my deploy times go through the roof. I always think I may be doing something wrong. I come from a RoR background and my Capistrano deploys take all of 30 seconds. Can you share how long it takes from the time code is committed in the repo to the server with new code being up and running? With CodePipeline and ECS this always takes about 10 minutes for me, even with dead simple apps, which IMO is unacceptable.
Hi Augusto,
Do you suffer from a delay before the pipeline starts? Or do the build and deployment take 10 minutes from the moment the pipeline starts executing?
The delay is on the pipeline/CodeBuild side. Basically, it takes a LOONG time for CodePipeline to spin up the Fargate ECS containers with the new code version into the cluster, mark them as available, and then remove the previous code versions.
I realize you're not using Fargate but EKS. However, I'd like to know, in your opinion, what an "acceptable" deploy time for containerized deploys would be, compared to more traditional deploy methods.
@augustosamame I think I know what the problem is: if the ECS tasks are part of a Target Group, navigate to that Target Group and modify the deregistration delay to 10 seconds or an even smaller number.
This will reduce the time.
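For readers managing this in CDK rather than the console, the same tweak is a single property on the target group (a hypothetical sketch):

```typescript
import { Duration } from 'aws-cdk-lib';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// Draining connections for only 10 seconds instead of the default
// 300 speeds up replacement of old tasks during a deploy.
new elbv2.ApplicationTargetGroup(this, 'BackendTg', {
  vpc, // your existing VPC
  port: 80,
  deregistrationDelay: Duration.seconds(10),
});
```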
I like this post a lot!!
Would be happy to see a tutorial of all that.
It would be gold for any learner or tech lead, like your friend, who wants to create a reliable infrastructure for his team and product.
Thank you for your kind words @amitkad. I already have a free Udemy course (an introductory course). Please check it out and leave feedback: udemy.com/course/intro-fullstack-d...
I hope to create more tutorials in the future.
Great write-up. You said they have a monorepo using Nx for the build system. Do you mind sharing some details of how you structured the IaC with the AWS CDK for the monorepo? I have a project that uses a monorepo and Nx as well, and I have been struggling to figure out a good approach for separating my CDK code from my application code. My current approach is having an infra/ directory at the root of the repository with multiple nested folders for my different infra to keep my builds and deployments separated. E.g., infra/frontend and infra/backend.
Hi @christiankaseburg,
I don't think there is one solution that fits all. I started out similar to your method, by including a folder in the root directory containing subfolders and files.
But eventually, you would want to separate this from developers to avoid unwanted changes or errors.
In my opinion, a quick solution would be to either create a separate repository for it, or store it in some file storage such as S3.
Have the pipeline download it before executing commands.
I hope this answers your question.
Interesting read. If I read correctly, scale and possible future complexity are the main drivers.
Did you use serverless RDS? And Fargate EKS? Possibly reduced cost and complexity.
Could ECS be used, or do they need an exit strategy off AWS?
Could Scaleway (same tech, Postgres and k8s) be used to reduce costs?
(Just wondering)
Hi Dave! How are you?
Yeap, you're right, the main driver behind this infra is to minimize headaches in the future.
Yes, I used Aurora. No, I did not use Fargate EKS, but it's definitely worth looking into.
I avoided ECS for many reasons, especially lock-in and lack of portability. Once the application is stable on Kubernetes, I can easily redeploy it anywhere (such as on-premise).
Do you have a cost breakdown of this solution? $500/month seems reasonable, but looking at the picture, it looks quite expensive. I am wondering how much of the cost comes from each component. Are there usage charges like egress bandwidth, etc.?
Did you try Amplify Hosting?
Unfortunately, Amplify will not meet all the requirements stated in the article.
What did you end up using for observability? Essentially the tracing part.
Also, did you author any IaC for infrastructure orchestration?
Hi Adeel,
CloudWatch for metrics. Tracing was not implemented yet, as it needs development effort. But in my opinion, either use X-Ray or a custom-built mechanism that integrates with CloudWatch Logs.