Luqman Bello for AWS Community Builders

Posted on Nov 1, 2024

Building Production-Grade ECS Anywhere Infrastructure with Custom Capacity Providers

#tutorial #devops #aws #containers

Hey there, fellow infrastructure engineers! If you're reading this, you're probably looking to level up your container game with ECS Anywhere. Maybe you've got a hybrid infrastructure setup, or perhaps you're dealing with edge locations that need the same love as your cloud workloads. Whatever brought you here, I've got you covered.

In this guide, we'll dive deep into building a production-ready ECS Anywhere infrastructure with custom capacity providers. But don't worry – we'll keep it real and focus on practical, battle-tested approaches that you can actually use.

What We'll Cover

Why ECS Anywhere (And Why Should You Care?)
The Architecture That Actually Works
Setting Things Up (The Right Way)
Custom Capacity Providers (The Secret Sauce)
Making It Production-Ready
When Things Go Wrong (And They Will)

Why ECS Anywhere?

Let's be honest – not everything belongs in the cloud. Whether you're dealing with regulatory requirements, existing infrastructure investments, or edge computing needs, sometimes you need to run containers outside AWS. That's where ECS Anywhere comes in.

Here's what makes it interesting:

☁️ Cloud Benefits + On-Prem Control = ECS Anywhere

But here's what they don't tell you in the basic tutorials: ECS capacity providers aren't supported for external instances, so out of the box nothing scales that capacity for you. That's why we're building a custom one.

The Architecture That Actually Works

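Before we dive into the code, let's talk architecture. Here's what we're building: an ECS cluster in AWS that acts as the control plane, external instances that register with it through Systems Manager, and a custom capacity provider that watches cluster metrics and scales those instances.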

Why this setup? Because it:

Keeps your ops team sane with unified management
Handles real-world scaling scenarios
Doesn't fall apart under pressure

Setting Things Up

First things first. Here's what you'll need:

# Don't just copy-paste this - make sure you understand each part
aws --version  # Needs v2.13.0+

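# Create your cluster. External instances are targeted with the EXTERNAL
# launch type when you run tasks or create services; EXTERNAL is not a
# capacity provider, so it doesn't go in --capacity-providers.
aws ecs create-cluster \
    --cluster-name prod-hybrid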

🚨 Pro Tip: Always use a test environment first. I learned this the hard way when I accidentally scaled down production instances. Not fun.

The Secret Sauce: Custom Capacity Provider

This is where things get interesting. Here's a custom capacity provider that actually works in production:

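// Shapes assumed for this sketch; adapt the fields to your setup
interface CapacityProviderConfig {
  clusterName: string;
}

interface ClusterMetrics {
  cpuUtilization: number;    // percent, 0-100
  memoryUtilization: number; // percent, 0-100
  pendingTasks: number;      // count of tasks not yet running
}

abstract class CustomCapacityProvider {
  constructor(protected readonly config: CapacityProviderConfig) {
    // Trust me, you want these logs
    this.setupLogging();
  }

  // Environment-specific pieces: implement these in a subclass
  protected abstract setupLogging(): void;
  protected abstract getMetrics(): Promise<ClusterMetrics>;
  protected abstract scaleCluster(metrics: ClusterMetrics): Promise<void>;
  protected abstract handleScalingError(error: unknown): void;

  async evaluateCapacity(): Promise<void> {
    try {
      const metrics = await this.getMetrics();

      // Don't just check CPU - that's a rookie mistake
      if (this.needsScaling(metrics)) {
        await this.scaleCluster(metrics);
      }
    } catch (error) {
      // You'll thank me for this error handling later
      this.handleScalingError(error);
    }
  }

  private needsScaling(metrics: ClusterMetrics): boolean {
    // Real-world scaling logic that won't wake you up at 3 AM
    return metrics.cpuUtilization > 70 ||
           metrics.memoryUtilization > 80 ||
           metrics.pendingTasks > 0;
  }
}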

Here's what makes this implementation special:

It handles edge cases (literally, if you're running at the edge)
It won't flap like a fish out of water during traffic spikes
It logs what you actually need to debug issues
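
The snippet above only shows the scale-out check, so here is one way the "won't flap" part could be done. This is a minimal sketch, not the exact production logic: it assumes a cooldown window plus a gap between the scale-out and scale-in thresholds. The names `decideScaling`, `HysteresisConfig`, and `ScaleDecision`, the scale-in thresholds, and the five-minute cooldown are all made up for illustration.

```typescript
// Hypothetical names and values throughout; only the 70/80 scale-out
// thresholds come from needsScaling above.
type ScaleDecision = "scale-out" | "scale-in" | "hold";

interface Metrics {
  cpuUtilization: number;    // percent
  memoryUtilization: number; // percent
  pendingTasks: number;
}

interface HysteresisConfig {
  scaleOutCpu: number;
  scaleOutMemory: number;
  scaleInCpu: number;
  scaleInMemory: number;
  cooldownMs: number;
}

const defaultConfig: HysteresisConfig = {
  scaleOutCpu: 70,
  scaleOutMemory: 80,
  scaleInCpu: 40,             // assumed: well below the scale-out threshold
  scaleInMemory: 50,          // assumed
  cooldownMs: 5 * 60 * 1000,  // assumed: 5 minutes between actions
};

function decideScaling(
  metrics: Metrics,
  lastActionAt: number | null, // epoch ms of the last scaling action, if any
  now: number,                 // epoch ms
  config: HysteresisConfig = defaultConfig
): ScaleDecision {
  // Inside the cooldown window we never act, whatever the metrics say
  if (lastActionAt !== null && now - lastActionAt < config.cooldownMs) {
    return "hold";
  }
  if (
    metrics.cpuUtilization > config.scaleOutCpu ||
    metrics.memoryUtilization > config.scaleOutMemory ||
    metrics.pendingTasks > 0
  ) {
    return "scale-out";
  }
  // Scale in only when CPU and memory are both comfortably low
  if (
    metrics.cpuUtilization < config.scaleInCpu &&
    metrics.memoryUtilization < config.scaleInMemory
  ) {
    return "scale-in";
  }
  // The band between the two threshold sets is deliberately a no-op
  return "hold";
}
```

The gap between the two threshold sets is what stops a cluster hovering around 70% CPU from scaling out and back in on every evaluation.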

Making It Production-Ready

Now, let's talk about what it takes to make this production-ready. Here are some battle-tested patterns:

Monitoring That Actually Helps

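// Assumes the AWS SDK for JavaScript v2, which is what .promise() implies
import { CloudWatch } from 'aws-sdk';

const cloudwatch = new CloudWatch();

class MetricsPublisher {
  // Placeholder counter; wire this to wherever you detect failed placements
  private failedTaskCount = 0;

  private getFailedTaskCount(): number {
    return this.failedTaskCount;
  }

  async publishMetrics(): Promise<void> {
    await cloudwatch.putMetricData({
      Namespace: 'ECS/CustomCapacityProvider',
      MetricData: [
        {
          // These are the metrics you'll actually look at
          MetricName: 'FailedTaskAllocation',
          Value: this.getFailedTaskCount(),
          Unit: 'Count'
        },
        // Add more metrics that matter
      ]
    }).promise();
  }
}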

💡 Real Talk: Don't just monitor everything. Monitor what matters. My team once spent hours chasing a "problem" that turned out to be a noisy metric.
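
One thing that can make a metric easier to trust is a dimension that says where it came from, so a noisy reading can be traced to a single cluster. Here's a rough sketch of building such a datum without touching the SDK. The helper `buildDatum` and the `ClusterName` dimension are my own illustrative choices; the field names follow the `PutMetricData` request shape used above.

```typescript
// Hypothetical helper; field names mirror the PutMetricData request shape
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count";
  Timestamp: Date;
  Dimensions: { Name: string; Value: string }[];
}

function buildDatum(
  metricName: string,
  value: number,
  clusterName: string,
  timestamp: Date = new Date()
): MetricDatum {
  // Refuse NaN/Infinity up front instead of publishing a misleading point
  if (!isFinite(value)) {
    throw new RangeError(`Metric ${metricName} has a non-finite value`);
  }
  return {
    MetricName: metricName,
    Value: value,
    Unit: "Count",
    Timestamp: timestamp,
    // Assumed dimension name; pick whatever your dashboards group by
    Dimensions: [{ Name: "ClusterName", Value: clusterName }],
  };
}

// Example: one datum for the prod-hybrid cluster from earlier
const failedAllocations = buildDatum("FailedTaskAllocation", 3, "prod-hybrid");
```

The resulting object can go straight into the `MetricData` array of the publisher.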

Security That Makes Sense

The tag condition in this policy saved us during a security audit:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterContainerInstance",
        "ecs:DeregisterContainerInstance"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Environment": "Production"
        }
      }
    }
  ]
}

When Things Go Wrong

Because they will. Here's your survival guide:

Common Issues I've Hit (So You Don't Have To)

The "Missing Instance" Problem

# First, check if SSM Agent is actually running
sudo systemctl status amazon-ssm-agent

# If it's not, here's the fix
sudo systemctl restart amazon-ssm-agent

The "Scaling Won't Stop" Issue

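class ScalingManager {
  private failureCount = 0; // consecutive failed scaling attempts

  private wait(minutes: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, minutes * 60 * 1000));
  }

  private async applyBackoff(): Promise<void> {
    // This backoff strategy saved our bacon during a traffic spike:
    // 2 minutes per consecutive failure, capped at 30 minutes
    const backoffMinutes = Math.min(
      this.failureCount * 2,
      30
    );
    await this.wait(backoffMinutes);
  }
}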
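
To make the backoff above concrete, here's the same arithmetic pulled out into a pure function so you can see the schedule it produces. The function name `backoffMinutes` and its parameters are just for illustration; the 2-minute step and 30-minute cap are the values from the class.

```typescript
// Linear backoff with a cap: stepMinutes per consecutive failure
function backoffMinutes(
  failureCount: number,
  stepMinutes: number = 2,
  capMinutes: number = 30
): number {
  // Guard against negative counts so the wait is never below zero
  const failures = Math.max(failureCount, 0);
  return Math.min(failures * stepMinutes, capMinutes);
}

// Wait times after 1, 2, 5, 15, and 20 consecutive failures
const schedule = [1, 2, 5, 15, 20].map((n) => backoffMinutes(n));
// schedule is [2, 4, 10, 30, 30]
```

Note that this is a linear backoff, not an exponential one: the cap kicks in at 15 consecutive failures, and a failure count of zero means no wait at all.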

Real-World Debugging

Here's a debugging flow that's saved me countless hours:

Check the ECS agent logs
Verify Systems Manager connectivity
Look for capacity provider events
Check your custom metrics

# The holy grail of debugging commands
aws ecs describe-container-instances \
    --cluster prod-hybrid \
    --container-instances "$INSTANCE_ID"
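
If you run that command a lot, a tiny triage helper over its JSON output can save some squinting. This is a sketch under assumptions: the interface covers only a handful of the fields returned under `containerInstances[]`, and the `triage` function and its messages are my own invention, not anything ECS provides.

```typescript
// Subset of the fields returned under containerInstances[]
interface ContainerInstanceSummary {
  containerInstanceArn: string;
  status: string;          // e.g. "ACTIVE" or "DRAINING"
  agentConnected: boolean;
  runningTasksCount: number;
  pendingTasksCount: number;
}

// Hypothetical helper: turns one instance record into human-readable findings
function triage(instance: ContainerInstanceSummary): string[] {
  const findings: string[] = [];
  if (!instance.agentConnected) {
    findings.push(
      "ECS agent is disconnected: check the agent logs and Systems Manager connectivity"
    );
  }
  if (instance.status !== "ACTIVE") {
    findings.push(
      `Instance status is ${instance.status}: new tasks will not be placed here`
    );
  }
  if (instance.pendingTasksCount > 0) {
    findings.push(
      `${instance.pendingTasksCount} task(s) pending: check image pulls and agent logs`
    );
  }
  return findings;
}

// Example record with made-up values, shaped like the CLI's JSON output
const sample: ContainerInstanceSummary = {
  containerInstanceArn: "arn:aws:ecs:us-east-1:123456789012:container-instance/prod-hybrid/example",
  status: "DRAINING",
  agentConnected: false,
  runningTasksCount: 0,
  pendingTasksCount: 1,
};
const findings = triage(sample);
```

An empty result means none of these three checks fired, which points you toward the capacity provider events and custom metrics instead.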

Lessons Learned

After running this in production for a while, here are some key takeaways:

Start Small: Don't try to boil the ocean. Get a basic setup working and iterate.
Monitor Wisely: Focus on actionable metrics. Nobody wants another noisy dashboard.
Automate Recovery: Because nobody wants to SSH into servers at 3 AM.
Document Everything: Your future self will thank you.

Wrapping Up

Building a production-grade ECS Anywhere infrastructure isn't just about following AWS documentation. It's about understanding your workloads, planning for failure, and building systems that can be maintained by humans.

Remember:

Test thoroughly (seriously)
Start with a simple capacity provider
Add complexity only when needed
Keep those logs meaningful

What's Next?

If you're looking to take this further, consider:

Implementing cross-region failover
Adding custom metrics for your specific use case
Building automated testing for your capacity provider

Got questions? Hit me up in the comments. I'd love to hear about your ECS Anywhere adventures!

P.S. If you found this helpful, I'd love to hear about your implementation stories. Drop a comment below!
