Overview
I designed and implemented a highly available, fault-tolerant auto-scaling infrastructure using Terraform. The architecture spans three Availability Zones and includes both public and private subnets, demonstrating real-world security and scalability practices.
Real World Use Case
In today's dynamic digital landscape, applications need to adapt to changing demands while maintaining security and cost efficiency. This infrastructure addresses several critical business needs:
- Handling Traffic Spikes
  - Manages unpredictable traffic from marketing campaigns
  - Adapts to product launches and seasonal events
  - Maintains performance during peak loads
- Cost Management
  - Dynamically scales resources based on demand
  - Prevents over-provisioning during low-traffic periods
  - Optimizes cloud spending without sacrificing performance
- High Availability and Reliability
  - Distributes load across multiple instances
  - Ensures seamless failover capabilities
  - Minimizes downtime through redundancy
- Enhanced User Experience
  - Maintains consistent performance during traffic surges
  - Ensures responsive application behavior
  - Supports customer satisfaction and retention
- Future-Proof Scalability
  - Provides a foundation for business growth
  - Adapts to increasing traffic demands
  - Requires no significant architectural changes
- Security-First Design
  - Places web servers in private subnets
  - Restricts direct internet access
  - Implements defense-in-depth through the ALB
Infrastructure Overview
Auto Scaling Flow
Creating an AWS Auto Scaling Architecture
CPU Usage High (>70%) → Auto Scaling Group → Uses Launch Template → Creates New EC2 in Private Subnet
Traffic Flow
- Internet traffic → ALB (public subnet) → Target Group → EC2 instances (private subnets)
- Health checks ensure traffic only routes to healthy instances
- Auto scaling maintains service availability based on demand
Implementation Process
Before deploying our infrastructure, we need to set up our AWS credentials and understand our module structure.
AWS Account Setup
First, configure your AWS credentials:
# Set your AWS profile
export AWS_PROFILE=my-profile
# Verify your account access
aws sts get-caller-identity
# Note: You'll need the account ID for the monitoring configuration
Terraform Workflow:
Our implementation follows this systematic approach:
- Write each module
- Validate module configuration
- Plan and review changes
- Repeat until completion
- Apply final configuration
Module Validation
# Initialize Terraform working directory
terraform init
# Validate module syntax and configuration
terraform validate
# Review planned changes
terraform plan
Root Module Configuration
Our root module orchestrates all components:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Configure the AWS Provider
provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile # profile name
}

module "vpc" {
  source       = "./modules/vpc"
  project_name = var.project_name
}

module "launch_template" {
  source       = "./modules/launch-template"
  project_name = var.project_name
  vpc_id       = module.vpc.vpc_id
}

module "alb" {
  source                 = "./modules/alb"
  project_name           = var.project_name
  vpc_id                 = module.vpc.vpc_id
  public_subnet_ids      = module.vpc.public_subnet_ids
  alb_security_groups_id = module.launch_template.alb_security_groups_id
}

module "asg" {
  source                  = "./modules/asg"
  project_name            = var.project_name
  private_subnet_ids      = module.vpc.private_subnet_ids
  launch_template_id      = module.launch_template.launch_template_id
  launch_template_version = module.launch_template.launch_template_version
  lb_target_group_arn     = module.alb.lb_target_group_arn
}

module "monitoring" {
  source                     = "./modules/monitoring"
  project_name               = var.project_name
  autoscaling_group_name     = module.asg.autoscaling_group_name
  autoscaling_policy_dwn_arn = module.asg.autoscaling_policy_dwn_arn
  account-id                 = var.account-id
  autoscaling_policy_up_arn  = module.asg.autoscaling_policy_up_arn
  notification_email         = var.notification_email
}
Key Features:
- Modular Design
  - Each component is a separate module
  - Clear dependency chain
  - Easy to maintain and update
- Resource Flow
  - VPC provides the network foundation
  - Launch template defines instance configuration
  - ALB handles traffic distribution
  - ASG manages scaling
  - Monitoring provides oversight
- Output Configuration
output "alb_dns_name" {
  value = module.alb.alb_dns_name
}
This output provides the ALB DNS name for accessing our application.
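The root module also references several input variables (var.aws_region, var.aws_profile, var.project_name, var.account-id, var.notification_email) whose definitions aren't shown in this post. A minimal variables.tf sketch that would satisfy those references (the us-east-1 default is just an assumption matching the AZs used later):
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "aws_profile" {
  type        = string
  description = "Named AWS CLI profile to use"
}

variable "project_name" {
  type = string
}

variable "account-id" {
  type        = string
  description = "AWS account ID used by the monitoring module's SNS topic policy"
}

variable "notification_email" {
  type        = string
  description = "Email address subscribed to scaling notifications"
}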
VPC Module: Network Foundation
The first major component is creating a VPC module with public and private subnets.
Module Variables
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for VPC"
  default     = "10.0.0.0/16"
}

variable "azs" {
  type        = list(string)
  description = "Availability zones"
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
Base VPC Configuration
# Create vpc
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
}
The VPC is configured with:
- CIDR block 10.0.0.0/16
- DNS hostname support enabled
- DNS resolution enabled
Public Network Layer
# Create public subnets
resource "aws_subnet" "public" {
  count                   = length(var.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index + 1}.0/24"
  availability_zone       = var.azs[count.index]
  map_public_ip_on_launch = true
}
# internet gateway for public subnets
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# route table for public subnets
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

# associate public subnets with the public route table
resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
This creates:
- Three public subnets across different AZs (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
- Auto-assign public IP enabled on these subnets
- An Internet Gateway attached for direct internet connectivity
- A public route table with a route to 0.0.0.0/0 via the IGW
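A quick aside: the per-AZ CIDR blocks above are built by string interpolation ("10.0.${count.index + 1}.0/24"). An equivalent option, shown here only as an alternative rather than what this module uses, is Terraform's built-in cidrsubnet() function, which derives each /24 from the VPC's /16:
locals {
  # cidrsubnet(prefix, newbits, netnum) carves a smaller block out of a larger one.
  # Adding 8 new bits to the /16 yields a /24; netnum selects which /24.
  # cidrsubnet("10.0.0.0/16", 8, 1)  => "10.0.1.0/24"
  # cidrsubnet("10.0.0.0/16", 8, 10) => "10.0.10.0/24"
  public_subnet_cidrs  = [for i in range(length(var.azs)) : cidrsubnet(var.vpc_cidr, 8, i + 1)]
  private_subnet_cidrs = [for i in range(length(var.azs)) : cidrsubnet(var.vpc_cidr, 8, i + 10)]
}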
Private Network Layer
# create private subnets
resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = var.azs[count.index]

  tags = {
    Name = "${var.project_name}-private-${count.index + 1}"
  }
}
# Elastic IP for the NAT gateway
resource "aws_eip" "nat" {
  domain = "vpc"
}

# NAT gateway for private subnets' internet access
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
  depends_on    = [aws_internet_gateway.main]
}

# route table for private subnets
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

# associate private subnets with the private route table
resource "aws_route_table_association" "private" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
This implements:
- Three private subnets (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
- NAT Gateway in the first public subnet
- Private route table routing through NAT Gateway
- No public IP assignment
The architecture ensures reliability through:
- Multi-AZ deployment for redundancy
- Strategically placed NAT Gateway
- Paired public/private subnets in each AZ
- Optimized AWS data transfer paths
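One piece the post doesn't show is the VPC module's outputs. Since the root module reads module.vpc.vpc_id, module.vpc.public_subnet_ids, and module.vpc.private_subnet_ids, an outputs.tf along these lines is assumed:
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}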
Launch Templates and Security Groups
This module defines our instance configurations and security boundaries. I use a security-first approach with separate security groups for the ALB and EC2 instances.
Security Group Configuration
- EC2 Security Group (Private Subnet)
# EC2 security group: restrict inbound traffic to the ALB only
resource "aws_security_group" "ec2" {
  name   = "${var.project_name}-ec2-sg"
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

# Allow inbound HTTP from the ALB only
resource "aws_security_group_rule" "allow_alb" {
  type                     = "ingress"
  security_group_id        = aws_security_group.ec2.id
  source_security_group_id = aws_security_group.alb.id
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
}

# Allow all outbound traffic (egress leaves via the NAT gateway)
resource "aws_security_group_rule" "allow_ec2_egress" {
  type              = "egress"
  security_group_id = aws_security_group.ec2.id
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
}
Security measures:
- Restricts inbound access to ALB traffic only
- Allows outbound internet access via NAT Gateway
- Ensures instances remain private and secure
- ALB Security Group (Public Subnet)
# ALB security group
resource "aws_security_group" "alb" {
  name   = "${var.project_name}-alb-sg"
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group_rule" "allow_alb_ingress" {
  type              = "ingress"
  security_group_id = aws_security_group.alb.id
  cidr_blocks       = ["0.0.0.0/0"]
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
}

resource "aws_security_group_rule" "allow_alb_egress" {
  type              = "egress"
  security_group_id = aws_security_group.alb.id
  cidr_blocks       = ["0.0.0.0/0"]
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
}
Features:
- Accepts HTTP traffic from internet
- Routes traffic to EC2 instances
- Acts as public entry point
Launch Template Setup
The launch template defines the EC2 instance configuration:
resource "aws_launch_template" "main" {
  name_prefix   = "${var.project_name}-"
  image_id      = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"

  network_interfaces {
    associate_public_ip_address = false
    security_groups             = [aws_security_group.ec2.id]
  }

  tag_specifications {
    resource_type = "instance"
  }

  monitoring {
    enabled = true
  }

  user_data = filebase64("${abspath(path.root)}/scripts/ec2.sh")

  iam_instance_profile {
    name = aws_iam_instance_profile.ec2_profile.name
  }

  lifecycle {
    create_before_destroy = true
  }
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
Key Configurations:
- t2.micro instance type for cost efficiency
- Latest Ubuntu 22.04 AMI for consistency
- No public IPs, for security
- Detailed monitoring enabled
- User data script for instance setup (a sketch follows below)
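The contents of scripts/ec2.sh aren't shown in the post (and, as the troubleshooting section later reveals, a missing nginx install in that script was the root cause of the 502 errors). Purely as an illustration, a user data definition that would satisfy the ALB health checks could be inlined in the launch template like this, instead of the file-based filebase64(...) call above:
# Hypothetical inline alternative to the file-based user_data above
# (the actual scripts/ec2.sh is not shown in the post)
user_data = base64encode(<<-EOF
  #!/bin/bash
  # Install and start nginx so the ALB health check on "/" returns 200
  apt-get update -y
  apt-get install -y nginx
  echo "Hello from $(hostname -f)" > /var/www/html/index.html
  systemctl enable --now nginx
EOF
)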
IAM Role Configuration
Implements least-privilege access:
resource "aws_iam_role" "ec2_role" {
  name = "${var.project_name}-ec2-role"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      },
    ]
  })
}

resource "aws_iam_instance_profile" "ec2_profile" {
  name = "${var.project_name}-ec2-profile"
  role = aws_iam_role.ec2_role.name
}

resource "aws_iam_role_policy" "ec2_custom_policy" {
  name = "${var.project_name}-ec2-custom-policy"
  role = aws_iam_role.ec2_role.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "elasticloadbalancing:Describe*",
          "elasticloadbalancing:DeregisterTargets",
          "elasticloadbalancing:RegisterTargets"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:DescribeInstances",
          "ec2:DescribeTags"
        ]
        Resource = "*"
      }
    ]
  })
}
Permissions granted:
- Load balancer registration
- Instance metadata access
- Basic EC2 operations
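One detail worth noting: the ALB's security group is created inside this launch-template module, which is why the root module passes module.launch_template.alb_security_groups_id into the ALB module. The module's outputs aren't shown in the post; a minimal sketch consistent with the root module's references:
output "alb_security_groups_id" {
  value = aws_security_group.alb.id
}

output "launch_template_id" {
  value = aws_launch_template.main.id
}

output "launch_template_version" {
  value = aws_launch_template.main.latest_version
}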
Load Balancer and Auto Scaling Groups
This section configures our application's load balancing and auto scaling capabilities.
Application Load Balancer Setup
First, I created the Application Load Balancer in our public subnets:
resource "aws_lb" "main" {
  name                       = "${var.project_name}-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [var.alb_security_groups_id]
  subnets                    = var.public_subnet_ids
  enable_deletion_protection = false
}
Key Features:
- Internet-facing, deployed in the public subnets for public access
- Uses the ALB security group that allows HTTP traffic
Target Group Configuration
# create target group
resource "aws_lb_target_group" "main" {
  name        = "${var.project_name}-alb-tg"
  target_type = "instance"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 30
    matcher             = "200"
    path                = "/"
    port                = "traffic-port"
    timeout             = 5
    unhealthy_threshold = 2
  }
}
Created target group with health checks:
- Path: "/" for root endpoint testing
- 30-second check intervals
- Healthy threshold: 2 successful checks
- Unhealthy threshold: 2 failed checks
# create alb listener
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.main.arn
  }
}
This configures an HTTP listener on port 80 that forwards traffic to the target group.
Auto Scaling Group Implementation
The ASG manages our EC2 instances:
resource "aws_autoscaling_group" "main" {
  name                      = "${var.project_name}-asg"
  max_size                  = 9
  min_size                  = 3
  desired_capacity          = 3
  health_check_type         = "ELB"
  health_check_grace_period = 300
  vpc_zone_identifier       = var.private_subnet_ids
  target_group_arns         = [var.lb_target_group_arn]

  launch_template {
    id      = var.launch_template_id
    version = var.launch_template_version
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-asg-instance"
    propagate_at_launch = true
  }
}
Configuration Details:
- Placed EC2 instances in private subnets for enhanced security
- Capacity settings:
  - Minimum: 3 instances
  - Maximum: 9 instances
  - Initial: 3 instances
- Distribution:
  - Spread across multiple availability zones
  - Integrated with ALB target group for load distribution
- Scaling Policies:
resource "aws_autoscaling_policy" "scaleUp" {
  name                   = "${var.project_name}-asg-up"
  scaling_adjustment     = 1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.main.name
}

resource "aws_autoscaling_policy" "scaleDown" {
  name                   = "${var.project_name}-asg-down"
  scaling_adjustment     = -1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.main.name
}
- Scales up by 1 instance when triggered
- Scales down by 1 instance when triggered
- 5-minute cooldown between scaling actions
- Paired with CloudWatch alarms on CPU metrics; the ASG name and policy ARNs are exported to the monitoring module as sketched below
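The root module passes module.asg.autoscaling_group_name, module.asg.autoscaling_policy_up_arn, and module.asg.autoscaling_policy_dwn_arn into the monitoring module, so the ASG module is assumed to export them. A minimal outputs sketch consistent with those references:
output "autoscaling_group_name" {
  value = aws_autoscaling_group.main.name
}

output "autoscaling_policy_up_arn" {
  value = aws_autoscaling_policy.scaleUp.arn
}

output "autoscaling_policy_dwn_arn" {
  value = aws_autoscaling_policy.scaleDown.arn
}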
Monitoring and Alerts
The monitoring strategy combines CloudWatch alarms with SNS notifications to provide comprehensive oversight of the auto-scaling infrastructure.
CloudWatch CPU Alarms
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.project_name}-highCpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 70

  dimensions = {
    AutoScalingGroupName = var.autoscaling_group_name
  }

  alarm_description = "Scale out if CPU > 70%"
  alarm_actions     = [var.autoscaling_policy_up_arn]
}

resource "aws_cloudwatch_metric_alarm" "low_cpu" {
  alarm_name          = "${var.project_name}-lowCpu-alarm"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 1
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 20

  dimensions = {
    AutoScalingGroupName = var.autoscaling_group_name
  }

  alarm_description = "Scale in if CPU < 20%"
  alarm_actions     = [var.autoscaling_policy_dwn_arn]
}
Configuration Details:
- Scale-Out Alarm (High CPU)
  - Triggers when CPU > 70%
  - Evaluates every 120 seconds
  - Initiates immediate scaling
- Scale-In Alarm (Low CPU)
  - Triggers when CPU < 20%
  - Same evaluation period
  - Reduces capacity when load decreases
SNS Notification System
- SNS Topic Setup
resource "aws_sns_topic" "main" {
  name = "${var.project_name}-sns-topic"
}

resource "aws_sns_topic_policy" "policy" {
  arn    = aws_sns_topic.main.arn
  policy = data.aws_iam_policy_document.sns_topic_policy.json
}

data "aws_iam_policy_document" "sns_topic_policy" {
  policy_id = "__default_policy_ID"

  statement {
    actions = [
      "SNS:Subscribe",
      "SNS:SetTopicAttributes",
      "SNS:RemovePermission",
      "SNS:Receive",
      "SNS:Publish",
      "SNS:ListSubscriptionsByTopic",
      "SNS:GetTopicAttributes",
      "SNS:DeleteTopic",
      "SNS:AddPermission",
    ]

    condition {
      test     = "StringEquals"
      variable = "AWS:SourceOwner"
      values = [
        var.account-id,
      ]
    }

    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["*"]
    }

    resources = [aws_sns_topic.main.arn]
    sid       = "__default_statement_ID"
  }
}

resource "aws_autoscaling_notification" "asg_notifications" {
  group_names = [var.autoscaling_group_name]
  topic_arn   = aws_sns_topic.main.arn

  notifications = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR"
  ]
}
- Email Alert Configuration
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.main.arn
  protocol  = "email"
  endpoint  = var.notification_email
}
Monitoring Coverage:
- Auto Scaling Events
  - Instance launches
  - Instance terminations
  - Scaling errors
  - Operation failures
- Real-Time Notifications
  - Scale-out actions
  - Scale-in actions
  - Infrastructure health alerts
  - Error conditions
Implementation Challenges
Resolving the 502 Bad Gateway
When attempting to access the Application Load Balancer's DNS name, I encountered a 502 Bad Gateway error, indicating a communication breakdown between the load balancer and the backend instances. This led me through a comprehensive troubleshooting process that demonstrates the complexity of debugging distributed systems.
Initial Investigation
After confirming that the EC2 instances were running in their designated private subnets, I discovered that while the instances were operational, the ALB target group showed all targets as unhealthy. This suggested a deeper application-level issue rather than an infrastructure problem. A thorough review of security group configurations and the launch template confirmed that the basic networking and instance setup were correct.
The Private Subnet Challenge
The architecture's security-first design presented an interesting troubleshooting challenge. With EC2 instances deliberately placed in private subnets and no public IP addresses or SSH access configured, traditional debugging approaches weren't viable. The solution required:
- Implementing AWS Systems Manager Session Manager access by:
  - Adding the appropriate IAM permissions to the EC2 role via Terraform
  - Creating VPC endpoints for Systems Manager connectivity
resource "aws_iam_role_policy_attachment" "ssm_policy" {
  role       = aws_iam_role.ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# updated custom policy with the SSM permissions added
resource "aws_iam_role_policy" "ec2_custom_policy" {
  name = "${var.project_name}-ec2-custom-policy"
  role = aws_iam_role.ec2_role.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "elasticloadbalancing:Describe*",
          "elasticloadbalancing:DeregisterTargets",
          "elasticloadbalancing:RegisterTargets",
          "ssm:UpdateInstanceInformation",
          "ssm:ListInstanceAssociations",
          "ssm:DescribeDocument",
          "ssm:GetDocument",
          "ssm:GetParameter"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:DescribeInstances",
          "ec2:DescribeTags"
        ]
        Resource = "*"
      }
    ]
  })
}
Establishing secure instance access through the AWS console:
- Go to VPC → Endpoints and click "Create Endpoint"
- Under Services, I added these 4 endpoints one by one:
  - com.amazonaws.us-east-1.ssm
  - com.amazonaws.us-east-1.ssmmessages
  - com.amazonaws.us-east-1.ec2
  - com.amazonaws.us-east-1.ec2messages
- I selected the VPC of the Auto Scaling group and left "Enable DNS name" unchecked since we are working with private subnets
- Under Subnets, I selected the three private subnets, one per AZ, and left "Designate IP addresses" unchecked for the same reason
- I gave the endpoints the Full Access policy since we were debugging
The same endpoints could also be managed in Terraform; a sketch follows below.
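As an illustration only (this isn't part of the original deployment), the four interface endpoints could be declared like this, assuming a hypothetical aws_security_group.vpc_endpoints that allows HTTPS (443) from the VPC CIDR and reusing the VPC module's outputs:
resource "aws_vpc_endpoint" "ssm" {
  for_each = toset(["ssm", "ssmmessages", "ec2", "ec2messages"])

  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id] # hypothetical SG allowing 443 from the VPC CIDR
  private_dns_enabled = false                                 # matching the console setting described above
}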
Root Cause Analysis
After gaining instance access through Session Manager, I followed a systematic troubleshooting approach to verify network connectivity and instance configuration:
First, I checked if the NAT Gateway was functioning correctly by testing internet connectivity:
ping google.com
This test confirmed the instance could reach the internet through the NAT Gateway in the public subnet, validating the network architecture was working as designed.
Next, I verified I was on the correct instance by retrieving its metadata:
curl http://169.254.169.254/latest/meta-data/instance-id
This step is crucial when working with auto-scaled instances to ensure I was troubleshooting the right target.
From there I ran:
sudo systemctl status nginx
The output showed that nginx wasn't installed on the instances. This explained the failed health checks: the web servers simply weren't there to respond to the ALB's requests. The resolution involved:
- Manually installing nginx
- Creating the index.html file using sudo tee
- Enabling and restarting the nginx service
Resolution and Lessons Learned
This troubleshooting experience reinforced several crucial aspects of AWS infrastructure management:
- The importance of validating user data scripts before deployment
- The value of AWS Systems Manager for secure instance management
- The critical relationship between ALB health checks and application state
- The effectiveness of defense-in-depth security practices
- The benefit of testing configurations with spot instances before production deployment