Running self-hosted GitHub Actions runners in the cloud gives you a lot of control, but costs can spiral if the setup isn't optimized. In this post, I’ll share how we achieved 30% cost savings on AWS and how you can apply similar strategies on GCP, with a touch of technical fun and actionable advice. Let's dive in!
Why Self-Hosted Runners?
GitHub Actions' default hosted runners are convenient but can be expensive for large workloads, especially for compute-intensive tasks like integration tests or builds. Self-hosted runners, deployed on AWS or GCP, offer:
- Cost Control: Pay only for what you use.
- Custom Environments: Tailored to specific workflows.
- Scalability: Dynamically scale based on workload.
However, running self-hosted runners at scale comes with its own challenges: idle resources, inefficient configurations, and escalating network costs.
Architecture Overview
Here’s a high-level architecture for both AWS and GCP self-hosted runners:
Challenges in Cost Management
- Idle Resources: Pre-provisioned runners waiting for jobs lead to unnecessary costs.
- Networking Overheads: High outbound traffic, especially for Docker pulls.
- Instance Type Selection: Choosing cost-effective and performant instance types.
- Preemption Risks: Spot instances (AWS) or preemptible VMs (GCP) can fail mid-job.
Optimization Strategies
1. Dynamic Scaling
Both AWS and GCP allow scaling instances based on demand.
AWS
- Use Auto Scaling Groups (ASGs) with Lambda functions triggered by `workflow_job` webhooks.
- Leverage tools like `philips-labs/terraform-aws-github-runner` to simplify management.
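As a rough illustration of the webhook-driven approach, here is a minimal sketch of the logic a Lambda handler might run when a `workflow_job` event arrives. It verifies GitHub's `X-Hub-Signature-256` header and turns a `queued` event into a scale-up request. The function names and the `self-hosted` label filter are assumptions for illustration, and the actual call that changes ASG capacity is left out.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header ('sha256=<hex digest>')."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the digest
    return hmac.compare_digest(expected, signature_header)


def capacity_delta(body: bytes, required_label: str = "self-hosted") -> int:
    """Return 1 if the event is a newly queued job for our runners, else 0.

    Hypothetical helper: a real handler would pass this delta on to
    whatever adjusts the Auto Scaling Group.
    """
    event = json.loads(body)
    job = event.get("workflow_job", {})
    if event.get("action") == "queued" and required_label in job.get("labels", []):
        return 1
    return 0
```

Scaling down is deliberately not driven from this function. Letting idle runners terminate themselves after a timeout tends to be simpler than reacting to every `completed` event.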
GCP
- Use Managed Instance Groups (MIGs) with custom autoscaler policies based on job queue size or CPU load.
- Cloud Functions or Cloud Run can handle scaling triggers.
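A queue-based policy can be as simple as the sketch below, which a Cloud Function or Cloud Run service could evaluate before resizing the MIG. It assumes each runner handles one job at a time. The function name and the default bounds are placeholders, not values from our setup.

```python
def target_runner_count(queued_jobs: int, running_jobs: int,
                        min_runners: int = 0, max_runners: int = 20) -> int:
    """One runner per queued or running job, clamped to [min_runners, max_runners]."""
    wanted = queued_jobs + running_jobs
    return max(min_runners, min(max_runners, wanted))
```

Keeping `min_runners` at zero avoids paying for idle capacity, at the price of a cold-start delay on the first job. A small non-zero floor during working hours is a common compromise.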
2. Spot Instances (AWS) / Preemptible VMs (GCP)
These offer significant cost savings but require careful handling of preemptions.
AWS Spot Instances
- Mix instance types in Spot Pools for better availability:
  - `m5`, `m6i`, `m7i` (Intel)
  - `m5a`, `m6a` (AMD)
GCP Preemptible VMs
- Use diverse instance types:
  - `e2-standard`, `n2-highmem`, and the AMD-based `t2d-standard`
- Jobs must checkpoint regularly to handle interruptions gracefully.
Pro Tip: Always have fallback capacity with on-demand instances or higher-priority pools for critical workloads.
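To make "handling preemptions" concrete on the AWS side: a Spot instance can poll the instance metadata path `/latest/meta-data/spot/instance-action`, which returns a small JSON document once an interruption is scheduled. The sketch below only parses that document and reports how much time is left. Fetching it from the metadata service is omitted, and the function name is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(instance_action_body: str,
                               now: datetime) -> Optional[float]:
    """Return seconds until the scheduled interruption, or None if none is pending.

    `instance_action_body` is the response body of the spot/instance-action
    metadata path; pass an empty string when the path returned 404.
    """
    if not instance_action_body:
        return None
    notice = json.loads(instance_action_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()
```

A runner wrapper could use the remaining time to stop accepting new jobs and flush a checkpoint before the instance goes away.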
3. Caching and Artifact Management
Networking Optimization
- AWS: Implement S3-based caching with tools like `actions/cache`.
- GCP: Use Cloud Storage or Artifact Registry for similar functionality.
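Whichever bucket holds the cache, the key scheme matters more than the storage backend. A minimal sketch, assuming you key dependency caches on the lockfile contents (the same idea as the `hashFiles` expression commonly used in `actions/cache` keys); the function name and key layout are illustrative only:

```python
import hashlib


def cache_key(prefix: str, runner_os: str, lockfile_bytes: bytes) -> str:
    """Build a cache key that changes only when the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{runner_os}-{digest}"
```

The resulting string can double as the object name in S3 or Cloud Storage, so an unchanged lockfile always maps to the same cached archive.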
Docker Pulls
- Reduce Docker pull costs by:
  - Setting up a pull-through cache in GCP Artifact Registry or AWS ECR.
  - Using VPC endpoints (AWS) or Private Google Access (GCP) to minimize outbound traffic.
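To judge whether a pull-through cache is worth setting up, a back-of-the-envelope estimate of the traffic that still leaves your network is usually enough. The sketch below is exactly that. The parameter names are assumptions, and the per-GB price is something you would look up for your own region and traffic path.

```python
def monthly_pull_transfer_gb(pulls_per_month: int, image_size_gb: float,
                             cache_hit_rate: float) -> float:
    """GB per month that still has to be fetched from the upstream registry."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return pulls_per_month * image_size_gb * (1.0 - cache_hit_rate)


def monthly_pull_cost(pulls_per_month: int, image_size_gb: float,
                      cache_hit_rate: float, price_per_gb: float) -> float:
    """Transfer cost for uncached pulls; price_per_gb comes from your own bill."""
    return monthly_pull_transfer_gb(
        pulls_per_month, image_size_gb, cache_hit_rate) * price_per_gb
```

For example, 10,000 pulls of a 1.5 GB image move 15,000 GB with no cache and roughly 1,500 GB at a 90% hit rate.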
4. Cost Monitoring and Analysis
Both cloud providers offer tools to analyze costs:
- AWS: Cost Explorer + CloudWatch for EC2 usage.
- GCP: Billing Reports + Cloud Monitoring (formerly Stackdriver).
Key Metrics to Watch:
- Idle instance time
- Spot/preemptible interruption rates
- Network egress traffic
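The first two metrics are easy to derive yourself if you log a small record per runner instance. A minimal sketch, assuming a hypothetical `RunnerRecord` that you would populate from your own instance and job logs:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunnerRecord:
    uptime_seconds: float   # how long the instance was billed
    busy_seconds: float     # how long it was actually running jobs
    interrupted: bool       # True if reclaimed by the provider


def idle_fraction(records: List[RunnerRecord]) -> float:
    """Share of billed time during which runners were not executing jobs."""
    total = sum(r.uptime_seconds for r in records)
    if total == 0:
        return 0.0
    busy = sum(r.busy_seconds for r in records)
    return (total - busy) / total


def interruption_rate(records: List[RunnerRecord]) -> float:
    """Share of instances that ended because of a spot/preemptible reclaim."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.interrupted) / len(records)
```

Tracking both per instance type makes it easier to see which pools are cheap on paper but expensive once interruptions and idle time are counted.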
Case Study: AWS Optimization Outcomes
- Idle Runners Reduced: Adjusted runner pools based on org activity.
- Spot Pools Optimized: Added AMD-based `m6a` instances, reducing costs by 30%.
- Networking Costs: Introduced Docker pull-through caching with S3.
Case Study: GCP Adaptation
- Dynamic Scaling: Managed Instance Groups with preemptible VMs.
- Networking: Switched to private Google Access for egress traffic.
- Preemptible Instances: `n2-highmem` provided a balance of cost and performance.
Results
Cost reduction metrics
| Cloud Provider | Baseline Cost | Optimized Cost | Savings (%) |
|---|---|---|---|
| AWS | $10,000 | $7,000 | 30% |
| GCP | $9,500 | $6,500 | 31% |
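The savings column is just the relative difference between baseline and optimized cost. If you want to reproduce it for your own numbers, a small helper like this (the name is arbitrary) is enough:

```python
def savings_percent(baseline: float, optimized: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline - optimized) * 100.0 / baseline


print(round(savings_percent(10_000, 7_000), 1))  # AWS row: 30.0
print(round(savings_percent(9_500, 6_500), 1))   # GCP row: 31.6
```

The GCP figure works out to about 31.6%, which the table shows as 31%.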
User Experience Improvements
- Reduced job interruptions.
- Faster job execution due to optimized runner configurations.
Future Opportunities
- IPv6 and NAT Gateway Optimization: Both AWS and GCP support IPv6 to reduce NAT costs.
- Machine Learning for Scaling Decisions: Use historical data to predict demand spikes.
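Before reaching for a full ML model, a simple baseline is worth having. The sketch below is not machine learning, just an exponential moving average over past job counts per time window, and it is not something we run in production. It gives you a forecast to beat and a number you could use to pre-warm runners.

```python
from typing import Sequence


def ema_forecast(history: Sequence[float], alpha: float = 0.5) -> float:
    """Forecast the next window's job count with an exponential moving average.

    Higher alpha reacts faster to recent spikes; lower alpha smooths more.
    """
    if not history:
        raise ValueError("history must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    estimate = float(history[0])
    for observed in history[1:]:
        estimate = alpha * observed + (1.0 - alpha) * estimate
    return estimate
```

For example, `ema_forecast([0, 10, 20])` with the default `alpha` moves from 0 to 5.0 to 12.5, lagging behind the ramp, which is the kind of behavior a real predictive model would need to improve on.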
Conclusion
Optimizing self-hosted GitHub Actions runners on AWS and GCP can save significant costs while improving performance. By dynamically scaling resources, leveraging spot/preemptible instances, and optimizing network usage, you can achieve a highly efficient setup tailored to your workloads.
Feel free to experiment with these strategies and share your results. Happy optimizing! 🚀