Image by Dan Nelson from Pixabay
I’m assuming you already know about AWS Client VPN, but if not, it is a fully managed remote-access VPN service that supports both certificate-based and SAML-based authentication and lets you securely access resources in both AWS and your on-premises network.
Let’s set the context
For the last few years, I have been working on a project that hosts a single-tenant SaaS application on AWS in a single region. Many of our services are hosted in a private network, so we use AWS Client VPN to interact with them.
A few months ago, we decided to refactor our IaC (Terraform) code so that we could deploy our SaaS application in multiple regions, based on each client’s request, to fulfil data-residency and compliance requirements. Because the application is single-tenant, each client has their own set of resources such as a KMS key, S3 buckets, a database instance, etc. Along with these resources, there’s also a central/shared database cluster that contains generic data. This shared cluster resides in our primary region, unlike the client databases, which are scattered across different regions. We wanted to make sure that the client database clusters could talk to the central/shared database cluster, so we decided to set up VPC Peering between the regions, which was quick and straightforward.
Note: The above diagram is for illustration only and is a very simplified representation of the actual architecture.
Now, you may be thinking that this sounds like a pretty common design and there doesn’t seem to be any problem. You’re absolutely correct but this is not the full picture yet. Let’s talk about the problem now.
The Problem
Earlier, I mentioned that we use AWS Client VPN to connect to internal resources. That was all smooth until we introduced support for multi-region deployments. We did not want a separate Client VPN endpoint in each region, so we decided to modify the configuration of our existing Client VPN so it could reach resources deployed across various regions. This is where the problem starts.
This was a new challenge for me, so I turned to the AWS documentation to find out how to accomplish it. After going through the AWS Client VPN docs carefully, I updated the Client VPN route table and authorization rules. Part one complete. Next, I had to update the VPC route tables and NACLs, along with the security groups of the internal resources, to accept connections from the CIDR of the VPC hosting the Client VPN.
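To make the VPC-side part more concrete, here is a minimal Terraform sketch of what those changes might look like. This is not the actual code from our project: the region, variable names, CIDR value and port 5432 are all hypothetical placeholders, and NACL rules are left out for brevity.

```hcl
# Hypothetical sketch: every name, CIDR, region and port below is a placeholder.

provider "aws" {
  alias  = "secondary"
  region = "eu-west-1" # assumed non-primary region
}

variable "vpn_vpc_cidr" {
  description = "CIDR of the primary-region VPC that hosts the Client VPN endpoint"
  type        = string
  default     = "10.0.0.0/16"
}

variable "peering_connection_id" {
  description = "ID of the existing cross-region VPC peering connection"
  type        = string
}

variable "secondary_route_table_id" {
  description = "Route table used by the private subnets in the secondary region"
  type        = string
}

variable "secondary_db_security_group_id" {
  description = "Security group attached to the internal resource (e.g. a database)"
  type        = string
}

# Return path: traffic destined for the VPN VPC goes back over the peering connection.
resource "aws_route" "to_vpn_vpc" {
  provider                  = aws.secondary
  route_table_id            = var.secondary_route_table_id
  destination_cidr_block    = var.vpn_vpc_cidr
  vpc_peering_connection_id = var.peering_connection_id
}

# Allow inbound connections from the VPC CIDR that hosts the Client VPN.
resource "aws_security_group_rule" "allow_vpn_vpc" {
  provider          = aws.secondary
  type              = "ingress"
  security_group_id = var.secondary_db_security_group_id
  cidr_blocks       = [var.vpn_vpc_cidr]
  protocol          = "tcp"
  from_port         = 5432 # assumed PostgreSQL port, adjust to your service
  to_port           = 5432
}
```

The security group rule opens access to the VPC CIDR rather than the VPN client CIDR because, as described above, the internal resources need to accept connections from the VPC that hosts the Client VPN.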
Boom!! It’s all working smoothly and I get super excited.
But wait, it looks like I got excited a little too early: I had made these changes via the console and not via Terraform, and this is where I hit the problem. I replicated all the console changes in Terraform, reverted the changes in the console, and ran terraform apply.
Image generated using DALL-E-3
In an ideal scenario, this should have worked because all the settings were properly configured, but in reality I wasn’t able to connect to any of the internal resources in regions other than the primary.
Time To Troubleshoot
Sad and shocked, I start troubleshooting the config changes in both Terraform and the AWS console. Everything looks exactly the same as what I did earlier via the console. After troubleshooting for around an hour, I go back to the AWS Client VPN docs, read the steps two or three times and compare them with my changes to see whether I have messed anything up, but it all looks absolutely fine. By this time, I’m starting to get a bit annoyed and exhausted, so I go for a walk to grab my favourite Flat White ☕️ and refresh my mind.
Photo by Nathan Dumlao on Unsplash
After a short break, I decide to check whether the bastion host in the primary region can connect to internal resources hosted in another region, and it can. This confirms that my VPC peering and NACL rules are correct and that something is wrong with the AWS Client VPN configuration, so I switch back to the Client VPN console and start staring at the authorization rules and route table config, but I can’t spot anything wrong with either of them. Destination CIDR, Group ID, Access Type and Target Subnet all look properly configured, so I give up after an hour and decide to end my day 😴.
Image generated using DALL-E-3
The Solution
The next morning, I start fresh, and after an hour of troubleshooting with a colleague who is a networking expert, I manage to fix the issue. The problem wasn’t with the configuration. All the values were correct.
While troubleshooting with my networking colleague, we tried multiple changes, but nothing worked, and I reverted all the manual changes. By this time, I had already invested a lot of time and still could not figure out the problem, so I was about to raise an AWS tech support ticket rather than spend more time on it myself. That is when a thought suddenly crossed my mind, and I decided to give it a try.
The previous day, when I was making the config changes via the console for the first time, I remembered adding the new entries under the Route Table tab before adding the Authorization Rule for AWS Client VPN. I thought maybe it was the order in which these two settings were applied. It was a wild guess, but I decided to act on it. So I delete the routes and authorization rules for the non-primary region and add them back in the same order as the day before: routes first, then authorization rules.
Guess what happens? Bingo, it works!! After hours of troubleshooting, I can finally connect to internal resources in the non-primary region via Client VPN. Before celebrating the victory, I had to add an explicit dependency between the Authorization Rule and Route Table resources in Terraform and test it one final time.
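For anyone who wants to see what that explicit dependency could look like, here is a minimal sketch. It is an illustration under assumptions, not our production code: the variable names and the default CIDR are hypothetical, and the ordering (route first, then authorization rule) simply mirrors what worked for me in the console.

```hcl
# Hypothetical sketch: variable names and the CIDR default are placeholders.

variable "client_vpn_endpoint_id" {
  description = "ID of the existing Client VPN endpoint in the primary region"
  type        = string
}

variable "associated_subnet_id" {
  description = "Primary-region subnet already associated with the endpoint"
  type        = string
}

variable "secondary_vpc_cidr" {
  description = "CIDR of the peered VPC in the non-primary region"
  type        = string
  default     = "10.1.0.0/16"
}

variable "access_group_id" {
  description = "ID of the identity provider group allowed to reach the peered VPC"
  type        = string
}

# Step 1: Client VPN route towards the peered VPC, via the associated subnet.
resource "aws_ec2_client_vpn_route" "secondary" {
  client_vpn_endpoint_id = var.client_vpn_endpoint_id
  destination_cidr_block = var.secondary_vpc_cidr
  target_vpc_subnet_id   = var.associated_subnet_id
  description            = "Route to the peered VPC in the non-primary region"
}

# Step 2: authorization rule for the same CIDR, created only after the route exists.
resource "aws_ec2_client_vpn_authorization_rule" "secondary" {
  client_vpn_endpoint_id = var.client_vpn_endpoint_id
  target_network_cidr    = var.secondary_vpc_cidr
  access_group_id        = var.access_group_id
  description            = "Allow the group to reach the peered VPC"

  # Explicit ordering: there is no attribute reference between these two
  # resources, so without depends_on Terraform may create them in parallel.
  depends_on = [aws_ec2_client_vpn_route.secondary]
}
```

Since both resources only reference the endpoint ID and the CIDR, Terraform has no implicit way to infer an order between them, which is why depends_on is needed here.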
Image generated using Microsoft Bing
My Thoughts
Even though I was relieved and happy to have figured out the root cause after hours of troubleshooting, I was also a bit disappointed once I realised the reason behind the issue. I feel there should be a validation in place that prevents or warns users when they add an Authorization Rule whose CIDR block is not already present in the Route Table.
I wonder if this is by design or a bug that hasn’t been flagged yet. What do you guys think?
Understanding the basics
Q: What is the difference between AWS Client VPN and AWS VPN Gateway?
A: AWS Client VPN is a fully managed service used to connect to internal resources either on AWS or on-premises, whereas a VPN Gateway is used to set up connectivity between on-premises networks and the AWS cloud to build hybrid infrastructure.
Q: What are the limitations of AWS Client VPN?
A: Here are some of the limitations:
- As of writing this article, Client VPN only supports IPv4 traffic.
- The client CIDR cannot be changed after the Client VPN is created.
- The Client VPN client CIDR cannot overlap with the VPC CIDR block.
- Client VPN is not FIPS compliant.
The above list is not exhaustive. For all the limitations and best practices, refer to the AWS documentation.
Q: What is the difference between AWS Client VPN and OpenVPN?
A: AWS Client VPN is a fully managed VPN service for interacting with services hosted in a private network, either on-premises or in the AWS cloud. In contrast, with OpenVPN you need to install it on an EC2 instance, configure it, make it highly available and patch it periodically yourself. To avoid that administration overhead, you can use AWS Client VPN. AWS Client VPN supports both certificate and federated authentication, so it integrates with your existing identity and access management infrastructure.