CDK CloudWatch Auto Alarms

Abstract

  • For observability, Amazon CloudWatch is one option for collecting and tracking metrics and for raising alerts when a metric crosses a configured threshold. It is especially useful when you don't want to run external monitoring and observability tools such as Datadog or Prometheus, or pay extra for data transfer out.

  • What we need is an automated way to set up CloudWatch alarms for EC2 instances and to customise the metrics and alerts. In particular, when new EC2 instances are created by autoscaling or on demand, the automation must install the CloudWatch agent on them and set up alarms for metrics such as CPU utilization, disk I/O, and memory usage.

  • In this blog post, I demonstrate how to automate the setup and configuration of CloudWatch alarms for Amazon EC2 and how to send alert notifications to a Slack channel.

🚀 Solution overview

The CloudWatch Auto Alarms and Install CloudWatch Agent AWS Lambda functions quickly and automatically create a standard set of CloudWatch alarms for new Amazon EC2 instances (for an existing instance, restarting it generates the Running state event that triggers them). This saves the time spent installing and configuring the CloudWatch agent, deploying alarms, and setting up metric alerts, and it reduces the skills needed to create and manage alarms.

This blog post shows the default configuration and the alarms created for an Amazon EC2 instance running the Amazon Linux AMI (the Lambda function also supports other operating systems such as Ubuntu, Red Hat, SUSE, and Windows):

  • CPU Utilization

  • Disk Space Used

  • Memory Used

CloudWatch agent predefined metric sets - Advanced

CPU: cpu_usage_idle, cpu_usage_iowait, cpu_usage_user, cpu_usage_system

Disk: disk_used_percent, disk_inodes_free

Diskio: diskio_io_time, diskio_write_bytes, diskio_read_bytes, diskio_writes, diskio_reads

Mem: mem_used_percent

Netstat: netstat_tcp_established, netstat_tcp_time_wait

Swap: swap_used_percent
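
To make the metric set above concrete, here is a minimal sketch of how it could be turned into a CloudWatch agent configuration document. It is not the configuration shipped with this project. The names advancedMetricSet, toMeasurement, and buildAgentConfig are made up for illustration, and the 60-second interval is an assumption. The agent documentation lists measurement values without the plugin prefix (for example used_percent under disk), so the sketch strips the prefix.

```typescript
// Metric names exactly as listed in the post, grouped by agent plugin section.
const advancedMetricSet: Record<string, string[]> = {
  cpu: ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
  disk: ["disk_used_percent", "disk_inodes_free"],
  diskio: ["diskio_io_time", "diskio_write_bytes", "diskio_read_bytes", "diskio_writes", "diskio_reads"],
  mem: ["mem_used_percent"],
  netstat: ["netstat_tcp_established", "netstat_tcp_time_wait"],
  swap: ["swap_used_percent"],
};

// Drop the "<plugin>_" prefix, e.g. "disk_used_percent" becomes "used_percent".
function toMeasurement(plugin: string, metric: string): string {
  const prefix = plugin + "_";
  return metric.indexOf(prefix) === 0 ? metric.slice(prefix.length) : metric;
}

// Build the agent config object. intervalSeconds is an assumed collection interval.
function buildAgentConfig(intervalSeconds: number) {
  const metricsCollected: Record<string, { measurement: string[] }> = {};
  Object.keys(advancedMetricSet).forEach((plugin) => {
    metricsCollected[plugin] = {
      measurement: advancedMetricSet[plugin].map((m) => toMeasurement(plugin, m)),
    };
  });
  return {
    agent: { metrics_collection_interval: intervalSeconds },
    metrics: {
      // Adds the instance ID as a dimension so alarms can target one instance.
      append_dimensions: { InstanceId: "${aws:InstanceId}" },
      metrics_collected: metricsCollected,
    },
  };
}

// JSON text that could be stored in SSM Parameter Store for the agent to fetch.
const agentConfigJson = JSON.stringify(buildAgentConfig(60), null, 2);
console.log(agentConfigJson);
```

Keeping the metric list as data makes it easy to compare against the predefined set and to trim it for instances that need fewer metrics.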

The created alarms notify an Amazon SNS topic. AWS Chatbot, which is associated with the Slack channel, subscribes to that topic and sends the alert messages directly to Slack.

🚀 Flow overview

  • Prerequisites: the EC2 instances use AMI versions that automatically install the SSM agent at startup.
  • The flow performs the following steps:
    1. For any EC2 instance launched or restarted, the EventBridge rules install-cw-agent-install-cw-agent and cw-auto-alarm catch the event for the new Running state and trigger their targets, which are Lambda functions.
    2. The Lambda function install-cw-agent-install-cw-agent does the following:
      1. Gets the instance tags and checks for the tag key Create_Auto_Alarms (taken from the ALARM_TAG environment variable of the Lambda). If the tag is present it proceeds; otherwise it ignores the instance.
      2. Runs the SSM document AWS-ConfigureAWSPackage to install the CloudWatch agent on the target instance, then runs the SSM document AWS-RunShellScript to load the CloudWatch agent config from SSM Parameter Store and start the CloudWatch agent service.
    3. The Lambda function cw-auto-alarm creates CloudWatch alarms based on the EC2 instance tags, named in the format AutoAlarm-<InstanceID>-<cw-namespace>-<MetricName>-<ComparisonOperator>-<Period>-<EvaluationPeriods>-<Statistic>-<CloudWatchAutoAlarms>. These alarms send alerts to the SNS topic defined in the DEFAULT_ALARM_SNS_TOPIC_ARN environment variable.
    4. When the SNS topic receives a message, it forwards it to the AWS Chatbot webhook, and the chatbot then sends an alert message to the registered Slack channel.
    5. When an instance is terminated, the EventBridge rule cw-auto-alarm catches the event and triggers the Lambda function to delete the alarms belonging to that instance.
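
The decision logic in the steps above can be sketched as plain TypeScript functions. This is an illustration, not the code of the two Lambda functions. It folds both of them into one decision for brevity and assumes the same tag gates both. The names decideAction, buildAgentCommands, and buildAlarmName, the suffix field, and the SSM parameter name passed in are all hypothetical. The agent control script path is the Linux one.

```typescript
// Only the event fields used below; the real EventBridge event carries more.
interface Ec2StateChangeEvent {
  source: string;
  "detail-type": string;
  detail: { "instance-id": string; state: string };
}

// Tag key from the post (the Lambda reads it from its ALARM_TAG environment variable).
const ALARM_TAG = "Create_Auto_Alarms";

type Action = "install-agent-and-create-alarms" | "delete-alarms" | "ignore";

// Steps 1, 2.1 and 5: pick an action from the instance state and its tags.
// Assumption: both Lambdas are folded into a single decision here.
function decideAction(event: Ec2StateChangeEvent, tags: Record<string, string>): Action {
  const isStateChange =
    event.source === "aws.ec2" &&
    event["detail-type"] === "EC2 Instance State-change Notification";
  if (!isStateChange) return "ignore";
  if (event.detail.state === "terminated") return "delete-alarms";
  if (event.detail.state === "running" && tags[ALARM_TAG] !== undefined) {
    return "install-agent-and-create-alarms";
  }
  return "ignore";
}

// The SSM SendCommand fields used in this sketch.
interface SendCommandRequest {
  DocumentName: string;
  InstanceIds: string[];
  Parameters: Record<string, string[]>;
}

// Step 2.2: one request installs the agent, the next loads its config and starts it.
// configParameterName is a hypothetical SSM Parameter Store name.
function buildAgentCommands(instanceId: string, configParameterName: string): SendCommandRequest[] {
  const ctl = "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl";
  return [
    {
      DocumentName: "AWS-ConfigureAWSPackage",
      InstanceIds: [instanceId],
      Parameters: { action: ["Install"], name: ["AmazonCloudWatchAgent"] },
    },
    {
      DocumentName: "AWS-RunShellScript",
      InstanceIds: [instanceId],
      Parameters: { commands: [`${ctl} -a fetch-config -m ec2 -s -c ssm:${configParameterName}`] },
    },
  ];
}

// Step 3: join the alarm name parts in the order given by the format in the post.
function buildAlarmName(parts: {
  instanceId: string; namespace: string; metricName: string; comparisonOperator: string;
  period: string; evaluationPeriods: string; statistic: string; suffix: string;
}): string {
  return [
    "AutoAlarm", parts.instanceId, parts.namespace, parts.metricName, parts.comparisonOperator,
    parts.period, parts.evaluationPeriods, parts.statistic, parts.suffix,
  ].join("-");
}
```

Keeping these pieces as pure functions means they can be unit tested without touching AWS. The real handlers would pass the requests to the SSM and CloudWatch clients.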

🚀 Deploying the solution

  • For infrastructure as code, this blog post uses CDK TypeScript.
  • Prerequisites:

    • Add the AWS Chatbot app to the Slack channel.
    • Provide the Slack workspace ID and Slack channel ID to the CDK code.
  • Deploy the CDK stacks by running cdk deploy --all

🚀 Test alarms

  • The cdk deploy --all command above also creates an EC2 instance, but the instance may reach the Running state before the EventBridge rule is in place to catch the event. To be sure, restart the EC2 instance.
  • Create one more instance through the test-ec2 stack to test alarm creation for a newly launched instance.
  • An EC2 instance with the proper tags gets the corresponding alarms created.

  • Now connect to an EC2 instance using SSM and run the cpu-dump.py and test-mem-alert.py test scripts. The alerts will then appear.

    • In-alarm threshold

  • Slack alert

🚀 Cleanup

  • Destroy all the stacks in this project by running cdk destroy --all
  • The CloudWatch log groups created by the Lambda functions are not part of the project stacks, so they are not deleted. Although the log groups have a retention period, you might want to delete them to clean up completely.
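
For that last cleanup step, here is a small sketch that prints the AWS CLI commands for removing the leftover log groups. It assumes the functions use the default Lambda log group naming (/aws/lambda/<function-name>) and that the deployed function names match the ones in the flow overview. Check the actual names in your account first, since CDK may generate different ones. The helpers logGroupName and cleanupCommands are made up for illustration.

```typescript
// Lambda function names as they appear in the flow overview (assumed to be the deployed names).
const functionNames = ["install-cw-agent-install-cw-agent", "cw-auto-alarm"];

// Lambda writes to /aws/lambda/<function-name> unless a custom log group is configured.
function logGroupName(functionName: string): string {
  return `/aws/lambda/${functionName}`;
}

// One "aws logs delete-log-group" command per function.
function cleanupCommands(names: string[]): string[] {
  return names.map((n) => `aws logs delete-log-group --log-group-name ${logGroupName(n)}`);
}

// Print the commands so they can be reviewed before being run by hand.
cleanupCommands(functionNames).forEach((cmd) => console.log(cmd));
```

Printing the commands instead of running them leaves room to confirm each log group name before anything is deleted.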

🚀 Conclusion

  • In this post, I used serverless services such as Lambda, EventBridge, Systems Manager, and SNS to automate the creation of CloudWatch alarms and alerts for Amazon EC2 instances in an AWS account.
  • Through the SSM agent of Systems Manager, the Lambda function remotely installs the CloudWatch agent on the EC2 instances to collect system logs and metrics, and CloudWatch alarms are then created based on the EC2 tags.
  • The solution is deployed using AWS CDK TypeScript. For production, I encourage creating a CDK pipeline so that the IaC is deployed entirely through CodePipeline.
