DEV Community

Building Production-Grade ECS Anywhere Infrastructure with Custom Capacity Providers

Hey there, fellow infrastructure engineers! If you're reading this, you're probably looking to level up your container game with ECS Anywhere. Maybe you've got a hybrid infrastructure setup, or perhaps you're dealing with edge locations that need the same love as your cloud workloads. Whatever brought you here, I've got you covered.

In this guide, we'll dive deep into building a production-ready ECS Anywhere infrastructure with custom capacity providers. But don't worry – we'll keep it real and focus on practical, battle-tested approaches that you can actually use.

What We'll Cover

  1. Why ECS Anywhere (And Why Should You Care?)
  2. The Architecture That Actually Works
  3. Setting Things Up (The Right Way)
  4. Custom Capacity Providers (The Secret Sauce)
  5. Making It Production-Ready
  6. When Things Go Wrong (And They Will)

Why ECS Anywhere?

Let's be honest – not everything belongs in the cloud. Whether you're dealing with regulatory requirements, existing infrastructure investments, or edge computing needs, sometimes you need to run containers outside AWS. That's where ECS Anywhere comes in.

Here's what makes it interesting:

☁️ Cloud Benefits + On-Prem Control = ECS Anywhere

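But here's what they don't tell you in the basic tutorials: at the time of writing, ECS capacity providers aren't supported for external instances, so there is no built-in capacity provider to lean on for production workloads. That's why we're building our own capacity-management layer, which this guide calls a custom capacity provider.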

The Architecture That Actually Works

Before we dive into the code, let's talk architecture. Here's what we're building:

[Architecture diagram]

Why this setup? Because it:

  • Keeps your ops team sane with unified management
  • Handles real-world scaling scenarios
  • Doesn't fall apart under pressure

Setting Things Up

First things first. Here's what you'll need:

# Don't just copy-paste this - make sure you understand each part
aws --version  # Needs v2.13.0+

# Create your cluster. External (ECS Anywhere) instances register into a
# regular cluster; EXTERNAL is a launch type you pick when running tasks
# or creating services, not a value for --capacity-providers
aws ecs create-cluster \
    --cluster-name prod-hybrid

🚨 Pro Tip: Always use a test environment first. I learned this the hard way when I accidentally scaled down production instances. Not fun.

The Secret Sauce: Custom Capacity Provider

This is where things get interesting. Here's the core of a custom capacity provider: a control loop that reads cluster metrics and decides when to scale.

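// Shapes assumed by this snippet; adapt them to your own metrics source
interface CapacityProviderConfig {
  clusterName: string;
}

interface ClusterMetrics {
  cpuUtilization: number;    // percent
  memoryUtilization: number; // percent
  pendingTasks: number;
}

abstract class CustomCapacityProvider {
  constructor(protected readonly config: CapacityProviderConfig) {
    // Trust me, you want these logs
    this.setupLogging();
  }

  // Environment-specific pieces: implement these for your own hosts
  protected abstract getMetrics(): Promise<ClusterMetrics>;
  protected abstract scaleCluster(metrics: ClusterMetrics): Promise<void>;

  async evaluateCapacity(): Promise<void> {
    try {
      const metrics = await this.getMetrics();

      // Don't just check CPU - that's a rookie mistake
      if (this.needsScaling(metrics)) {
        await this.scaleCluster(metrics);
      }
    } catch (error) {
      // You'll thank me for this error handling later
      this.handleScalingError(error);
    }
  }

  private needsScaling(metrics: ClusterMetrics): boolean {
    // Real-world scaling logic that won't wake you up at 3 AM
    return metrics.cpuUtilization > 70 ||
           metrics.memoryUtilization > 80 ||
           metrics.pendingTasks > 0;
  }

  private setupLogging(): void {
    console.log(`Capacity provider watching cluster ${this.config.clusterName}`);
  }

  private handleScalingError(error: unknown): void {
    console.error('Scaling evaluation failed:', error);
  }
}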

Here's what makes this implementation special:

  • It handles edge cases (literally, if you're running at the edge)
  • It won't flap like a fish out of water during traffic spikes, provided you pair the thresholds with a cooldown between scaling actions
  • It logs what you actually need to debug issues
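
If you want the "won't flap" part spelled out, here's a minimal sketch of one way to do it, assuming a cooldown window plus separate scale-out and scale-in thresholds (hysteresis). The `ScalingPolicy` type, the `decideScaling` function, and every threshold value are hypothetical names for illustration, not part of ECS or any AWS SDK.

```typescript
// Hypothetical types for this sketch; not part of any AWS SDK
interface ClusterMetrics {
  cpuUtilization: number;    // percent, 0-100
  memoryUtilization: number; // percent, 0-100
  pendingTasks: number;
}

type ScalingDecision = "scale-out" | "scale-in" | "hold";

interface ScalingPolicy {
  scaleOutCpu: number;    // scale out above this CPU percent
  scaleInCpu: number;     // scale in only below this CPU percent
  scaleOutMemory: number; // scale out above this memory percent
  scaleInMemory: number;  // scale in only below this memory percent
  cooldownMs: number;     // minimum gap between two scaling actions
}

function decideScaling(
  metrics: ClusterMetrics,
  policy: ScalingPolicy,
  lastActionAtMs: number,
  nowMs: number
): ScalingDecision {
  // Inside the cooldown window we never act, which damps oscillation
  if (nowMs - lastActionAtMs < policy.cooldownMs) {
    return "hold";
  }

  const overloaded =
    metrics.cpuUtilization > policy.scaleOutCpu ||
    metrics.memoryUtilization > policy.scaleOutMemory ||
    metrics.pendingTasks > 0;
  if (overloaded) {
    return "scale-out";
  }

  // Scale in only when well below the scale-out thresholds (the hysteresis gap)
  const idle =
    metrics.cpuUtilization < policy.scaleInCpu &&
    metrics.memoryUtilization < policy.scaleInMemory;
  return idle ? "scale-in" : "hold";
}
```

The gap between the scale-out and scale-in thresholds is what keeps a cluster hovering around 70% CPU from bouncing between the two actions. Tune the numbers to your own workloads.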

Making It Production-Ready

Now, let's talk about what it takes to make this production-ready. Here are some battle-tested patterns:

Monitoring That Actually Helps

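import { CloudWatch } from 'aws-sdk'; // AWS SDK for JavaScript v2 (the .promise() style)

const cloudwatch = new CloudWatch();

class MetricsPublisher {
  private failedTaskCount = 0;

  // Call this whenever a task could not be placed
  recordFailedTask(): void {
    this.failedTaskCount += 1;
  }

  private getFailedTaskCount(): number {
    return this.failedTaskCount;
  }

  async publishMetrics(): Promise<void> {
    await cloudwatch.putMetricData({
      Namespace: 'ECS/CustomCapacityProvider',
      MetricData: [
        {
          // These are the metrics you'll actually look at
          MetricName: 'FailedTaskAllocation',
          Value: this.getFailedTaskCount(),
          Unit: 'Count'
        },
        // Add more metrics that matter
      ]
    }).promise();
  }
}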

💡 Real Talk: Don't just monitor everything. Monitor what matters. My team once spent hours chasing a "problem" that turned out to be a noisy metric.
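
One way to keep a noisy metric from sending you on a chase is to react only to sustained breaches, similar in spirit to the "datapoints to alarm" setting on CloudWatch alarms. Here's a rough sketch of that idea; `isSustainedBreach` is a hypothetical helper of my own, not an AWS API.

```typescript
// Returns true only when the most recent `required` samples all exceed the threshold.
// `samples` is ordered oldest to newest.
function isSustainedBreach(
  samples: number[],
  threshold: number,
  required: number
): boolean {
  if (required <= 0 || samples.length < required) {
    return false; // not enough data to call it a real problem
  }
  return samples.slice(-required).every((value) => value > threshold);
}
```

A single spike then stays a data point on a graph, and only a run of consecutive breaches becomes something worth paging about.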

Security That Makes Sense

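{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterContainerInstance",
        "ecs:DeregisterContainerInstance"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Environment": "Production"
        }
      }
    }
  ]
}

The tag condition is the part that saved us during a security audit. Two things to watch: external instances register through the same ecs:RegisterContainerInstance and ecs:DeregisterContainerInstance actions as EC2 container instances, and IAM policy JSON doesn't allow comments, so keep your notes outside the document.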

When Things Go Wrong

Because they will. Here's your survival guide:

Common Issues I've Hit (So You Don't Have To)

  1. The "Missing Instance" Problem
# First, check if SSM Agent is actually running
sudo systemctl status amazon-ssm-agent

# If it's not, here's the fix
sudo systemctl restart amazon-ssm-agent
  2. The "Scaling Won't Stop" Issue
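class ScalingManager {
  private failureCount = 0; // bump this on each failed scaling attempt

  private async applyBackoff(): Promise<void> {
    // This backoff strategy saved our bacon during a traffic spike:
    // 2 minutes per consecutive failure, capped at 30 minutes
    const backoffMinutes = Math.min(
      this.failureCount * 2,
      30
    );
    await this.wait(backoffMinutes);
  }

  // Resolves after the given number of minutes
  private wait(minutes: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, minutes * 60 * 1000));
  }
}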
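
The backoff above grows linearly. If you'd rather back off faster and keep multiple controllers from retrying in lockstep, a common alternative is capped exponential backoff with jitter. This is a different technique from the one above, sketched under my own assumptions: `backoffDelayMs`, its default base of one minute, and the injectable `random` parameter are all hypothetical.

```typescript
// Capped exponential backoff with "full jitter".
// Returns a delay in milliseconds for the given number of consecutive failures.
function backoffDelayMs(
  failureCount: number,
  baseMs: number = 60 * 1000,      // assumed base delay: 1 minute
  capMs: number = 30 * 60 * 1000,  // same 30-minute cap as the linear version
  random: () => number = Math.random
): number {
  if (failureCount <= 0) {
    return 0; // nothing has failed yet, so no delay
  }
  // Ceiling doubles with each failure: base, 2x base, 4x base, ... up to the cap
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, failureCount - 1));
  // Full jitter: pick a delay uniformly between 0 and the ceiling
  return Math.floor(random() * ceiling);
}
```

Passing `random` in as a parameter is only there to make the function easy to test; in real use you'd leave it at the default.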

Real-World Debugging

Here's a debugging flow that's saved me countless hours:

  1. Check the ECS agent logs
  2. Verify Systems Manager connectivity
  3. Look for capacity provider events
  4. Check your custom metrics
# The holy grail of debugging commands
aws ecs describe-container-instances \
    --cluster prod-hybrid \
    --container-instances $INSTANCE_ID
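
If you run that command a lot, it can help to script the first pass over its JSON output. Here's a minimal sketch, assuming you've already parsed the output of `describe-container-instances` with `--output json`. The `triage` function and the trimmed-down interfaces are hypothetical names of mine; the field names (`agentConnected`, `status`, `pendingTasksCount`, `failures`) are a subset of what the API returns.

```typescript
// Subset of the fields returned by `aws ecs describe-container-instances`
interface ContainerInstance {
  containerInstanceArn: string;
  status: string;            // e.g. "ACTIVE" or "DRAINING"
  agentConnected: boolean;
  pendingTasksCount: number;
}

interface DescribeOutput {
  containerInstances: ContainerInstance[];
  failures: { arn?: string; reason?: string }[];
}

// Turns the raw response into a list of human-readable findings
function triage(output: DescribeOutput): string[] {
  const findings: string[] = [];

  // Instances the API could not look up at all
  for (const failure of output.failures) {
    findings.push(
      `${failure.arn ?? "unknown"}: lookup failed (${failure.reason ?? "no reason given"})`
    );
  }

  for (const instance of output.containerInstances) {
    const id = instance.containerInstanceArn;
    if (!instance.agentConnected) {
      findings.push(`${id}: ECS agent disconnected`);
    }
    if (instance.status !== "ACTIVE") {
      findings.push(`${id}: status is ${instance.status}`);
    }
    if (instance.pendingTasksCount > 0) {
      findings.push(`${id}: ${instance.pendingTasksCount} task(s) still pending`);
    }
  }
  return findings;
}
```

An empty result doesn't prove everything is healthy, but a disconnected agent or a non-ACTIVE status tells you which of the steps above to start with.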

Lessons Learned

After running this in production for a while, here are some key takeaways:

  1. Start Small: Don't try to boil the ocean. Get a basic setup working and iterate.
  2. Monitor Wisely: Focus on actionable metrics. Nobody wants another noisy dashboard.
  3. Automate Recovery: Because nobody wants to SSH into servers at 3 AM.
  4. Document Everything: Your future self will thank you.

Wrapping Up

Building a production-grade ECS Anywhere infrastructure isn't just about following AWS documentation. It's about understanding your workloads, planning for failure, and building systems that can be maintained by humans.

Remember:

  • Test thoroughly (seriously)
  • Start with a simple capacity provider
  • Add complexity only when needed
  • Keep those logs meaningful

What's Next?

If you're looking to take this further, consider:

  • Implementing cross-region failover
  • Adding custom metrics for your specific use case
  • Building automated testing for your capacity provider

Got questions? Hit me up in the comments. I'd love to hear about your ECS Anywhere adventures!


P.S. If you found this helpful, I'd love to hear about your implementation stories. Drop a comment below!
