Original Source: https://skildops.com/blog/demystifying-an-interesting-relation-between-ecr-and-s3
Preface
Did you know that ECR images are stored in S3? I discovered this after pods running on our EKS cluster started failing to pull images from ECR with a 403 Forbidden error.
You may be wondering how these two things are even related. Glad you asked.
I had created an S3 gateway endpoint so that resources in the VPC could reach S3 without going through the expensive Internet/NAT gateway route. Rather than attach an overly permissive policy to the endpoint, I restricted its use to principals from a particular account with the aws:PrincipalAccount condition key.
Gateway Endpoint Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalAccount": "NUMERIC_AWS_ACCOUNT_ID"
        }
      }
    }
  ]
}
Discovery Phase
A few hours after the S3 gateway endpoint is deployed, the development team triggers a deployment pipeline to test new backend changes, finds the pipeline broken, and reaches out to the DevOps team.
The task gets assigned to me and I start the investigation. Looking at the Kubernetes events, I notice an ImagePullBackOff error with the reason 403 Forbidden. Initially, I'm a bit confused, because no infrastructure change deployed in the past few hours has modified the permissions of either the IAM role associated with the EKS pod or the resource policy of the ECR registry. Still, to be 100% sure, my first course of action is to verify the permissions of both the ECR resource policy and the IAM role attached to the EKS pod. As expected, the permissions are intact.
As my next step, I look at the recent changes made to the infrastructure, and the only change pushed in the last few hours is the S3 gateway endpoint. At first, it does not make sense to me why or how an S3 gateway endpoint would be linked to an ECR image pull issue.
After a few minutes, it strikes me that maybe ECR uses S3 to store images and the restrictive gateway endpoint policy is the cause, so I temporarily switch to the default, fully permissive endpoint policy to test the hypothesis. It works: the pod can now pull the image from ECR.
Even though changing the policy resolves the issue, we cannot call it a victory, because the default policy does not follow the least-privilege principle. Hence, we decide to dig deeper and uncover the relation between the two services, so that we can write a restrictive policy that still allows our pods and other services to download container images from ECR.
Troubleshooting Phase
Continuing the investigation, I first revert the endpoint policy to the original restrictive one.
We have a bastion host running within the VPC, so I SSH into it and pull the ECR image, hoping to get more information about the error that will help me solve the mystery.
Brilliant! I get a detailed error that explains it all.
Error:
Error pulling image configuration: download failed after attempts=1: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>User: arn:aws:sts::026506400584:assumed-role/storage-access-role-prod-eu-west-1/s8-fe-dub-prod-2-12021.dub2.amazon.com is not authorized to perform: s3:GetObject on resource: \"arn:aws:s3:::prod-eu-west-1-starport-layer-bucket/bf8750-096813982275-babc1bd1-8c7f-a85b-5fdf-78508ffe501c/a07f425b-261f-4cae-a9ba-1993d84f9d58\" because no VPC endpoint policy allows the s3:GetObject action</Message><RequestId>B740Z04JBRFNX95Q</RequestId><HostId>kjSd+f8pT2rKPt6HVOV6H8+7rKubuD34xeEnwVpmVbqJ2Df3ibnrLOsxmpkxb+YMmoSrCJSW25GhD2t2BHDm6NiM1ypxTqXv</HostId></Error>"
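The error makes it clear that an AWS-managed IAM role is trying to download image artifacts from an S3 bucket managed by AWS. Because that role lives in an AWS-owned account rather than ours, the request fails the aws:PrincipalAccount condition in the endpoint policy.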
This brings me one step closer to finding the right solution.
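To make the reading of that error concrete, here is a minimal Python sketch that pulls the interesting fields out of the XML body. The `summarize_denial` helper and its regular expressions are my own illustration, not part of any AWS tooling, and the body is a trimmed copy of the response above (the RequestId and HostId elements are dropped).

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the 403 response body shown above (RequestId/HostId omitted).
ERROR_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Error><Code>AccessDenied</Code><Message>User: "
    "arn:aws:sts::026506400584:assumed-role/storage-access-role-prod-eu-west-1/"
    "s8-fe-dub-prod-2-12021.dub2.amazon.com is not authorized to perform: "
    's3:GetObject on resource: "arn:aws:s3:::prod-eu-west-1-starport-layer-bucket/'
    "bf8750-096813982275-babc1bd1-8c7f-a85b-5fdf-78508ffe501c/"
    'a07f425b-261f-4cae-a9ba-1993d84f9d58" because no VPC endpoint policy '
    "allows the s3:GetObject action</Message></Error>"
)


def summarize_denial(body: str) -> dict:
    """Extract caller account, action and bucket from an S3 AccessDenied body.

    Hypothetical helper: the regexes assume the message wording seen above.
    """
    root = ET.fromstring(body.encode("utf-8"))
    message = root.findtext("Message", default="")

    account = re.search(r"User: arn:aws:sts::(\d{12}):", message)
    action = re.search(r"not authorized to perform: (\S+)", message)
    bucket = re.search(r'on resource: "arn:aws:s3:::([^/"]+)', message)

    return {
        "code": root.findtext("Code"),
        "caller_account": account.group(1) if account else None,
        "action": action.group(1) if action else None,
        "bucket": bucket.group(1) if bucket else None,
    }


if __name__ == "__main__":
    print(summarize_denial(ERROR_BODY))
```

The caller account and the bucket name are the two values that matter here: neither belongs to us, and the bucket name follows a per-region pattern.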
Solution Phase
After a bit of research, I come across AWS documentation that covers this exact problem and its solution. It is time to update the endpoint policy so that image layers can be fetched from this particular bucket through the gateway endpoint.
Updated Gateway Endpoint Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalAccount": "NUMERIC_AWS_ACCOUNT_ID"
        }
      }
    },
    {
      "Sid": "AccessToAWSEcrBucket",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::prod-{AWS_REGION_ID}-starport-layer-bucket/*"
    }
  ]
}
Note: If you plan to use the above policy, make sure to replace the placeholders NUMERIC_AWS_ACCOUNT_ID and {AWS_REGION_ID} with the actual values.
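As a rough illustration of why the second statement is needed, the Python sketch below fills in the placeholders and runs a toy evaluation of the policy. The names `build_endpoint_policy` and `is_allowed` are hypothetical, the account ID 111122223333 is a stand-in for your own, and the evaluator is a heavy simplification of how AWS evaluates policies: it only handles Allow statements with a single Action and Resource string and a StringEquals condition on aws:PrincipalAccount.

```python
from fnmatch import fnmatchcase
import json


def build_endpoint_policy(account_id: str, region: str) -> dict:
    """Return the updated endpoint policy with the placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:PrincipalAccount": account_id}
                },
            },
            {
                "Sid": "AccessToAWSEcrBucket",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::prod-{region}-starport-layer-bucket/*",
            },
        ],
    }


def is_allowed(policy: dict, action: str, resource: str, principal_account: str) -> bool:
    """Toy evaluator (assumption: not the real AWS evaluation logic)."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        if not fnmatchcase(action, stmt["Action"]):
            continue
        if not fnmatchcase(resource, stmt["Resource"]):
            continue
        required = (
            stmt.get("Condition", {})
            .get("StringEquals", {})
            .get("aws:PrincipalAccount")
        )
        if required is not None and required != principal_account:
            continue
        return True
    return False


if __name__ == "__main__":
    policy = build_endpoint_policy("111122223333", "eu-west-1")
    layer = "arn:aws:s3:::prod-eu-west-1-starport-layer-bucket/some-layer"

    # Same policy with only the account-restricted statement, as deployed originally.
    original = {"Version": policy["Version"], "Statement": policy["Statement"][:1]}

    # The caller is the AWS-owned account seen in the error message.
    print(is_allowed(original, "s3:GetObject", layer, "026506400584"))
    print(is_allowed(policy, "s3:GetObject", layer, "026506400584"))
    print(json.dumps(policy, indent=2))
```

The design point is that the second statement grants only s3:GetObject on one AWS-owned bucket, so the account restriction on everything else stays in place.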
Understanding the basics
Q: What is the difference between S3 gateway endpoint and interface endpoint?
A: In both cases, your traffic stays on the Amazon network, so resources within the VPC can reach S3 without an Internet/NAT gateway. The noticeable differences: an interface endpoint creates ENIs in your VPC and uses private IP addresses to reach the S3 API, whereas a gateway endpoint adds a route to your route table that sends traffic destined for S3's public IP ranges to the endpoint. Moreover, gateway endpoints are free, while interface endpoints incur additional cost. To learn more about the differences between the two endpoint types, please see the AWS documentation.
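The route-table behaviour of a gateway endpoint can be sketched as a longest-prefix match. The Python below is only an illustration: the route targets are made-up labels and 198.51.100.0/24 is a documentation-only placeholder range, not a real S3 range. In a real VPC the S3 ranges come from the AWS-managed prefix list for S3 in your region.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, target label).
ROUTE_TABLE = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-gateway"),
    ("198.51.100.0/24", "s3-gateway-endpoint"),  # placeholder for the S3 prefix list
]


def pick_target(dest_ip: str, routes=ROUTE_TABLE) -> str:
    """Return the target of the most specific route that contains dest_ip."""
    ip = ipaddress.ip_address(dest_ip)
    candidates = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            candidates.append((network.prefixlen, target))
    if not candidates:
        raise LookupError(f"no route for {dest_ip}")
    # The longest prefix (most specific route) wins.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    for ip in ("198.51.100.25", "203.0.113.9", "10.0.4.7"):
        print(ip, "->", pick_target(ip))
```

Because the endpoint route is more specific than the default route, S3-bound traffic goes to the endpoint while everything else still leaves through the NAT gateway.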
Q: What is the difference between VPC Endpoint and PrivateLink?
A: A VPC endpoint lets resources in your VPC connect to AWS services privately. PrivateLink is the technology behind interface endpoints, and it also lets you expose a service hosted privately in your account so that other VPCs and AWS accounts can reach it.
Q: How do I know if VPC endpoint is used?
A: There are multiple ways to check whether traffic is flowing through a VPC endpoint, and they depend on the type of endpoint you are using. You can use network utilities such as ping and traceroute to identify the IP and path, or you can use VPC flow logs to validate the traffic flow.