DEV Community

Hulk Pham

AWS Monitoring - Part 1: AWS CloudWatch

TL;DR

Introduction to Monitoring

  • Monitoring collects and analyzes data about operational health and usage of resources, helping to answer questions about system performance and issues
  • Metrics are individual data points created by resources, which become statistics when collected and analyzed over time

Benefits of Monitoring

  • Enables proactive response to operational issues, improves performance and reliability, and helps recognize security threats
  • Facilitates data-driven decisions and creates cost-effective solutions by optimizing resource usage

Amazon CloudWatch

  • CloudWatch is a centralized monitoring and observability service that collects resource data and provides actionable insights
  • It offers features like anomaly detection, alarms, log analysis, and automated actions
  • CloudWatch Logs allows for centralized storage and analysis of log files from various AWS services and applications
  • CloudWatch alarms can be set up to automatically initiate actions based on sustained state changes of metrics, helping prevent and troubleshoot issues

I. Monitoring Introduction

1. Purpose of monitoring

When operating a website like the employee directory application on AWS, you might have questions like the following:

  • How many people are visiting my site day to day?
  • How can I track the number of visitors over time?
  • How will I know if the website is having performance or availability issues?
  • What happens if my Amazon Elastic Compute Cloud (Amazon EC2) instance runs out of capacity?
  • Will I be alerted if my website goes down?

You need a way to collect and analyze data about the operational health and usage of your resources. The act of collecting, analyzing, and using data to make decisions or answer questions about your IT resources and systems is called monitoring.

Monitoring provides a near real-time pulse on your system and helps answer the previous questions. You can use the data you collect to watch for operational issues caused by events like overuse of resources, application flaws, resource misconfiguration, or security-related events. Think of the data collected through monitoring as outputs of the system, or metrics.

2. Use metrics to solve problems

The AWS resources that host your solutions create various forms of data that you might be interested in collecting. Each individual data point that a resource creates is a metric. Metrics that are collected and analyzed over time become statistics, such as average CPU utilization over time showing a spike.
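
To make the step from data points to statistics concrete, here is a minimal Python sketch. The sample values and the 5-minute period are made up for illustration, and the code does not call any AWS API. It only mirrors the idea that individual data points roll up into per-period averages, where a spike becomes easy to see.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical CPU utilization data points: (seconds since start, percent).
data_points = [
    (0, 20.0), (60, 22.0), (120, 21.0), (180, 19.0), (240, 23.0),
    (300, 85.0), (360, 90.0), (420, 88.0), (480, 92.0), (540, 95.0),
]

def average_per_period(points, period_seconds=300):
    """Group data points into fixed periods and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        period_start = timestamp - (timestamp % period_seconds)
        buckets[period_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}

averages = average_per_period(data_points)
for start, avg in averages.items():
    print(f"period starting at {start:>3}s: average CPU {avg:.1f}%")
```

The second period averages far higher than the first, which is the kind of spike you would want to investigate.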

One way to evaluate the health of an EC2 instance is through CPU utilization. Generally speaking, high CPU utilization on an EC2 instance can mean a flood of requests, or it can reflect a process that has encountered an error and is consuming too much of the CPU. When analyzing CPU utilization, look for utilization that exceeds a specific threshold for an unusual length of time. Use that abnormal event as a cue to resolve the issue, either manually or automatically, through actions like scaling the instance.

CPU utilization is one example of a metric. Other examples of metrics that EC2 instances have are network utilization, disk performance, memory utilization, and the logs created by the applications running on top of Amazon EC2.

3. Types of metrics

Different resources in AWS create different types of metrics. The following are examples of metrics associated with different resources.

  1. Amazon Simple Storage Service (Amazon S3) metrics
     ◦ Size of objects stored in a bucket
     ◦ Number of objects stored in a bucket
     ◦ Number of HTTP requests made to a bucket

  2. Amazon Relational Database Service (Amazon RDS) metrics
     ◦ Database connections
     ◦ CPU utilization of an instance
     ◦ Disk space consumption

  3. Amazon EC2 metrics
     ◦ CPU utilization
     ◦ Network utilization
     ◦ Disk performance
     ◦ Status checks

4. Monitoring benefits

Monitoring gives you visibility into your resources, but the question now is, "Why is that important?" This section describes some of the benefits of monitoring.

Respond proactively

Respond to operational issues proactively before your end users are aware of them. Waiting for end users to let you know when your application is experiencing an outage is a bad practice. Through monitoring, you can keep tabs on metrics like error response rate and request latency. Over time, the metrics help signal when an outage is going to occur. You can automatically or manually perform actions to prevent the outage from happening and fix the problem before your end users are aware of it.

Improve performance and reliability

Monitoring can improve the performance and reliability of your resources. Monitoring the various resources that comprise your application provides you with a full picture of how your solution behaves as a system. Monitoring, if done well, can illuminate bottlenecks and inefficient architectures. This helps you drive performance and improve reliability.

Recognize security threats and events

By monitoring, you can recognize security threats and events. When you monitor resources, events, and systems over time, you create what is called a baseline. A baseline defines normal activity. Using a baseline, you can spot anomalies like unusual traffic spikes or unusual IP addresses accessing your resources. When an anomaly occurs, an alert can be sent out or an action can be taken to investigate the event.
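
As a rough illustration of comparing new values against a baseline, the following Python sketch flags values that fall far from the mean of past observations. The traffic numbers and the three-standard-deviation tolerance are assumptions chosen for the example, and this is not how CloudWatch anomaly detection works internally.

```python
from statistics import mean, stdev

# Hypothetical requests-per-minute observed during normal operation.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]

def is_anomaly(value, history, tolerance=3.0):
    """Flag a value more than `tolerance` standard deviations from the baseline mean."""
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) > tolerance * spread

print(is_anomaly(102, baseline))  # a value close to normal traffic
print(is_anomaly(450, baseline))  # a value far above normal traffic
```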

Make data-driven decisions

Monitoring helps you make data-driven decisions for your business. Monitoring keeps an eye on IT operational health and drives business decisions. For example, suppose you launched a new feature for your cat photo app and now you want to know if it’s being used. You can collect application-level metrics and view the number of users who use the new feature. With your findings, you can decide whether to invest more time into improving the new feature.

Create cost-effective solutions

Through monitoring, you can create more cost-effective solutions. You can view resources that are underused and rightsize your resources to your usage. This helps you optimize cost and make sure you aren’t spending more money than necessary.

II. Amazon CloudWatch

1. Visibility using CloudWatch

AWS resources create data that you can monitor through metrics, logs, network traffic, events, and more. This data comes from components that are distributed in nature. This can lead to difficulty in collecting the data you need if you don’t have a centralized place to review it all. AWS has taken care of centralizing the data collection for you with a service called CloudWatch.

CloudWatch is a monitoring and observability service that collects your resource data and provides actionable insights into your applications. With CloudWatch, you can respond to system-wide performance changes, optimize resource usage, and get a unified view of operational health.

You can use CloudWatch to do the following:

  • Detect anomalous behavior in your environments.
  • Set alarms to alert you when something is not right.
  • Visualize logs and metrics with the AWS Management Console.
  • Take automated actions like scaling.
  • Troubleshoot issues.
  • Discover insights to keep your applications healthy.

2. How CloudWatch works

With CloudWatch, all you need to get started is an AWS account. It is a managed service that you can use for monitoring without managing the underlying infrastructure.

Collect

Collect metrics and logs from your resources, applications, and services that run on AWS or on-premises servers.

Monitor

Visualize applications and infrastructure with dashboards. Troubleshoot with correlated logs and metrics, and set alerts.

Act

Automate responses to operational changes with CloudWatch Events and Auto Scaling.

Analyze

Analyze metrics at up to 1-second resolution, with extended data retention (15 months) and real-time analysis using CloudWatch metric math.

The employee directory application is built with various AWS services working together as building blocks. Monitoring the individual services independently can be challenging. Fortunately, CloudWatch acts as a centralized place where metrics are gathered and analyzed.

Many AWS services automatically send metrics to CloudWatch for free at a rate of 1 data point per metric per 5-minute interval. This is called basic monitoring, and it gives you visibility into your systems without any extra cost. For many applications, basic monitoring is adequate.

For applications running on EC2 instances, you can get more granularity by posting metrics every minute instead of every 5 minutes using a feature called detailed monitoring. Detailed monitoring incurs a fee. For more information, see the Amazon CloudWatch Pricing page.
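
The difference in granularity is simple arithmetic. The short sketch below, with a helper name invented for this example, counts how many data points one metric produces per day at each interval.

```python
def data_points_per_day(interval_minutes):
    """Number of data points one metric produces in 24 hours."""
    return (24 * 60) // interval_minutes

basic = data_points_per_day(5)     # basic monitoring: one point every 5 minutes
detailed = data_points_per_day(1)  # detailed monitoring: one point every minute
print(basic, detailed)
```

Detailed monitoring gives you five times as many data points per metric, which shortens the time before a problem shows up in your graphs.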

3. CloudWatch concepts

Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor and the data points as representing the values of that variable over time. Every metric data point must be associated with a timestamp.

Metric

Metrics are data about the performance of your systems.

For example, the CPU usage of a particular EC2 instance is one metric provided by Amazon EC2.

Timestamp

Each metric data point must be associated with a timestamp. If you do not provide a timestamp, CloudWatch creates one for you based on the time the data point was received.

AWS services that send data to CloudWatch attach dimensions to each metric. A dimension is a name and value pair that is part of the metric’s identity. You can use dimensions to filter the results that CloudWatch returns. For example, many Amazon EC2 metrics publish InstanceId as a dimension name and the actual instance ID as the value for that dimension.
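
To show what filtering by a dimension means, here is a small Python sketch that works on plain dictionaries. The instance IDs and values are placeholders, and the structure is a simplified stand-in for illustration, not the exact shape CloudWatch returns.

```python
# Hypothetical data points; each carries the metric name, its dimensions, and a value.
data_points = [
    {"MetricName": "CPUUtilization", "Dimensions": {"InstanceId": "i-0aaa1111"}, "Value": 35.0},
    {"MetricName": "CPUUtilization", "Dimensions": {"InstanceId": "i-0bbb2222"}, "Value": 81.0},
    {"MetricName": "CPUUtilization", "Dimensions": {"InstanceId": "i-0aaa1111"}, "Value": 40.0},
]

def filter_by_dimension(points, name, value):
    """Keep only the data points whose dimension `name` equals `value`."""
    return [p for p in points if p["Dimensions"].get(name) == value]

selected = filter_by_dimension(data_points, "InstanceId", "i-0aaa1111")
print([p["Value"] for p in selected])
```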

By default, many AWS services provide metrics at no charge for resources such as EC2 instances, Amazon Elastic Block Store (Amazon EBS) volumes, and Amazon RDS database (DB) instances. For a charge, you can activate features such as detailed monitoring or publishing your own application metrics on resources such as your EC2 instances.

4. Custom metrics

Suppose you have an application, and you want to record the number of page views your website gets. How would you record this metric with CloudWatch? First, it's an application-level metric. That means it’s not something the EC2 instance would post to CloudWatch by default. This is where custom metrics come in. With custom metrics, you can publish your own metrics to CloudWatch.

If you want to gain more granular visibility, you can use high-resolution custom metrics, which make it possible for you to collect custom metrics down to a 1-second resolution. This means you can send 1 data point per second per custom metric.

Some examples of custom metrics include the following:

  • Webpage load times
  • Request error rates
  • Number of processes or threads on your instance
  • Amount of work performed by your application
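
As a sketch of how the page view example might look in code, the following Python function builds a request shaped like the keyword arguments of boto3's put_metric_data call. The namespace, metric name, and dimension are hypothetical names chosen for this example. The code only builds the request, so it runs without AWS credentials.

```python
from datetime import datetime, timezone

def build_page_view_metric(page, views, high_resolution=False):
    """Build keyword arguments shaped like those of boto3's put_metric_data."""
    return {
        "Namespace": "EmployeeDirectory",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "PageViews",  # hypothetical metric name
                "Dimensions": [{"Name": "Page", "Value": page}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(views),
                "Unit": "Count",
                # 1 = high resolution (1-second), 60 = standard resolution.
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }

params = build_page_view_metric("/home", 42, high_resolution=True)
print(params["MetricData"][0]["StorageResolution"])

# With AWS credentials and boto3 installed, the request could be sent with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**params)
```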

5. CloudWatch dashboards

Once you provision your AWS resources and they are sending metrics to CloudWatch, you can visualize and review that data using CloudWatch dashboards. Dashboards are customizable home pages you can configure for data visualization for one or more metrics through widgets, such as a graph or text.

You can build many custom dashboards, each one focusing on a distinct view of your environment. You can even pull data from different AWS Regions into a single dashboard to create a global view of your architecture. The following screenshot shows an example of a dashboard with metrics from Amazon EC2 and Amazon EBS.

[Screenshot: CloudWatch dashboard with metrics from Amazon EC2 and Amazon EBS]

CloudWatch aggregates statistics according to the period of time that you specify when creating your graph or requesting your metrics. You can also choose whether your metric widgets display live data. Live data is data published within the last minute that has not been fully aggregated.

You are not bound to using CloudWatch exclusively for all your visualization needs. You can use external or custom tools to ingest and analyze CloudWatch metrics using the GetMetricData API.
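
If you want to pull metrics into your own tool, a request to GetMetricData might be assembled as in the sketch below. It builds keyword arguments shaped like those of boto3's get_metric_data call. The instance ID is a placeholder and the helper name is invented for this example. Nothing is sent to AWS.

```python
from datetime import datetime, timedelta, timezone

def build_cpu_query(instance_id, hours=1, period_seconds=300):
    """Build keyword arguments shaped like those of boto3's get_metric_data."""
    end = datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [
            {
                "Id": "cpu",  # arbitrary label for this query
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": period_seconds,  # aggregate into 5-minute periods
                    "Stat": "Average",
                },
                "ReturnData": True,
            }
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }

query = build_cpu_query("i-0123456789abcdef0")  # placeholder instance ID
print(query["MetricDataQueries"][0]["MetricStat"]["Stat"])

# With AWS credentials and boto3 installed, the query could be run with:
#   import boto3
#   boto3.client("cloudwatch").get_metric_data(**query)
```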

As far as security is concerned, with AWS Identity and Access Management (IAM) policies, you control who has access to view or manage your CloudWatch dashboards.

6. Amazon CloudWatch Logs

CloudWatch Logs is a centralized place for logs to be stored and analyzed. With this service, you can monitor, store, and access your log files from applications running on EC2 instances, AWS Lambda functions, and other sources.

With CloudWatch Logs, you can query and filter your log data. For example, suppose you're looking into an application logic error for your application. You know that when this error occurs, it will log the stack trace. Because you know it logs the error, you query your logs in CloudWatch Logs to find the stack trace. You can also set up metric filters on logs, which turn log data into numerical CloudWatch metrics that you can graph and use on your dashboards.
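
The idea behind a metric filter, turning matching log lines into a number, can be sketched in a few lines of Python. The log lines are invented, and the plain substring match is a simplified stand-in for the filter pattern syntax that CloudWatch Logs actually uses.

```python
# Hypothetical application log lines.
log_lines = [
    "2024-01-01T10:00:01Z INFO  GET /employees 200",
    "2024-01-01T10:00:02Z ERROR Traceback (most recent call last):",
    "2024-01-01T10:00:03Z INFO  GET /employees/7 200",
    "2024-01-01T10:00:09Z ERROR Traceback (most recent call last):",
]

def count_matches(lines, term):
    """Return how many log lines contain `term`."""
    return sum(1 for line in lines if term in line)

error_count = count_matches(log_lines, "ERROR")
print(error_count)
```

The resulting count is a plain number, so it can be graphed or compared against a threshold like any other metric.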

Some services, like Lambda, are set up to send log data to CloudWatch Logs with minimal effort. With Lambda, all you need to do is give the Lambda function the correct IAM permissions to post logs to CloudWatch Logs. Other services require more configuration. For example, to send your application logs from an EC2 instance into CloudWatch Logs, you need to install and configure the CloudWatch Logs agent on the EC2 instance. With the CloudWatch Logs agent, EC2 instances can automatically send log data to CloudWatch Logs.

CloudWatch Logs terminology

Log data sent to CloudWatch Logs can come from different sources, so it's important to understand how it's organized.

Log event

A log event is a record of activity recorded by the application or resource being monitored. It has a timestamp and an event message.

Log stream

Log events are grouped into log streams, which are sequences of log events that all belong to the same resource being monitored.

For example, logs for an EC2 instance are grouped together into a log stream that you can filter or query for insights.

Log group

A log group is composed of log streams that all share the same retention and permissions settings.

For example, suppose you have multiple EC2 instances hosting your application and you send application log data to CloudWatch Logs. You can group the log streams from each instance into one log group.
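
The relationship between log events, log streams, and log groups can be modeled with a few small classes. This is only a mental model written in Python, with hypothetical group and stream names. It is not the CloudWatch Logs API.

```python
from dataclasses import dataclass, field

@dataclass
class LogEvent:
    timestamp: int  # milliseconds since the epoch
    message: str

@dataclass
class LogStream:
    name: str  # for example, one stream per EC2 instance
    events: list = field(default_factory=list)

@dataclass
class LogGroup:
    name: str
    retention_days: int  # shared by every stream in the group
    streams: dict = field(default_factory=dict)

    def put_event(self, stream_name, event):
        """Append an event to a stream, creating the stream on first use."""
        stream = self.streams.setdefault(stream_name, LogStream(stream_name))
        stream.events.append(event)

group = LogGroup(name="/employee-directory/app", retention_days=30)  # hypothetical names
group.put_event("i-0aaa1111", LogEvent(1700000000000, "GET /employees 200"))
group.put_event("i-0bbb2222", LogEvent(1700000001000, "GET /employees 500"))
group.put_event("i-0aaa1111", LogEvent(1700000002000, "GET /employees/7 200"))

print(sorted(group.streams))
print(len(group.streams["i-0aaa1111"].events))
```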

7. CloudWatch alarms

You can create CloudWatch alarms to automatically initiate actions based on sustained state changes of your metrics. You configure when alarms are invoked and the action that is performed.

First, you must decide which metric you want to set up an alarm for, and then you define the threshold that will invoke the alarm. Next, you define the threshold's time period. For example, suppose you want to set up an alarm for an EC2 instance to invoke when the CPU utilization goes over a threshold of 80 percent. You also must specify the time period the CPU utilization is over the threshold.

You don’t want to invoke an alarm based on short, temporary spikes in the CPU. You only want to invoke an alarm if the CPU is elevated for a sustained amount of time. For example, if CPU utilization exceeds 80 percent for 5 minutes or longer, there might be a resource issue. To set up an alarm you need to choose the metric, threshold, and time period.

An alarm can be invoked when it transitions from one state to another. After an alarm is invoked, it can initiate an action. Actions can be an Amazon EC2 action, an automatic scaling action, or a notification sent to Amazon Simple Notification Service (Amazon SNS).

States of an alarm:

  • OK: The metric is within the defined threshold. Everything appears to be operating normally.
  • ALARM: The metric is outside the defined threshold. This might be an operational issue.
  • INSUFFICIENT_DATA: The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.
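
To see how the metric, threshold, and time period combine into one of these three states, consider the following simplified Python sketch. It assumes every recent period must breach the threshold before the alarm fires and treats a short history as insufficient data. Real CloudWatch alarms offer more options than this, so read it as an approximation.

```python
def evaluate_alarm(values, threshold, evaluation_periods):
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent periods.

    `values` holds one aggregated value per period, oldest first.
    """
    if len(values) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = values[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"

short_spike = evaluate_alarm([40, 95, 45], threshold=80, evaluation_periods=3)
sustained = evaluate_alarm([85, 90, 88], threshold=80, evaluation_periods=3)
too_little = evaluate_alarm([85], threshold=80, evaluation_periods=3)
print(short_spike, sustained, too_little)
```

A single high value does not move the alarm out of OK, which matches the goal of ignoring short, temporary spikes.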

8. Prevent and troubleshoot issues with CloudWatch alarms

CloudWatch Logs uses metric filters to turn the log data into metrics that you can graph or set an alarm on. The following steps show the order to follow when setting up an alarm, with an example using our employee directory application.

  • Set up a metric filter

    For the employee directory application, suppose you set up a metric filter for HTTP 500 error response codes.

  • Define an alarm

    Then, you define when the alarm state should be invoked based on the threshold. With this example, the alarm state is invoked if HTTP 500 error responses are sustained for a specified period of time.

  • Define an action

    Next, you define an action that you want to take place when the alarm is invoked. Here, it makes sense to send an email or text alert to you so you can start troubleshooting the website. Hopefully, you can fix it before it becomes a bigger issue.

    After the alarm is set up, you know that if the error happens again, you will be notified promptly.
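
The alarm from the steps above could be described in code roughly as follows. The function builds keyword arguments shaped like those of boto3's put_metric_alarm call. The alarm name, namespace, metric name, and topic ARN are all hypothetical, and the 5-minute window is an assumption made for this sketch.

```python
def build_http_500_alarm(sns_topic_arn):
    """Build keyword arguments shaped like those of boto3's put_metric_alarm."""
    return {
        "AlarmName": "employee-directory-http-500",  # hypothetical name
        "Namespace": "EmployeeDirectory",            # hypothetical namespace
        "MetricName": "Http500Count",                # hypothetical metric from the metric filter
        "Statistic": "Sum",
        "Period": 60,                # seconds per evaluation period
        "EvaluationPeriods": 5,      # the condition must hold for 5 periods in a row
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no matching log lines means no errors
        "AlarmActions": [sns_topic_arn],     # notify this SNS topic on ALARM
    }

alarm = build_http_500_alarm("arn:aws:sns:us-east-1:123456789012:ops-alerts")  # placeholder ARN
print(alarm["Period"] * alarm["EvaluationPeriods"])

# With AWS credentials and boto3 installed, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

The printed value is the length of the sustained window in seconds, here 300, or 5 minutes.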

You can set up different alarms for different reasons to help you prevent or troubleshoot operational issues. In the scenario just described, the alarm invokes an Amazon SNS notification that goes to a person who looks into the issue manually.

Another option is to have alarms invoke actions that automatically remediate technical issues. For example, you can set up an alarm to reboot an EC2 instance or scale services up or down. You can even set up an alarm to invoke an Amazon SNS notification that invokes a Lambda function. The Lambda function then calls any AWS API to manage your resources and troubleshoot operational issues. By using AWS services together like this, you can respond to events more quickly.
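
A Lambda function on the receiving end of such an SNS notification might start out like the sketch below. The alarm name and the trimmed-down sample event are hypothetical, and the remediation step is left as a comment because the right action depends on your application.

```python
import json

def handler(event, context):
    """Read CloudWatch alarm notifications delivered through Amazon SNS."""
    decisions = []
    for record in event["Records"]:
        # SNS delivers the alarm details as a JSON string in the Message field.
        alarm = json.loads(record["Sns"]["Message"])
        if alarm["NewStateValue"] == "ALARM":
            # A real function would call an AWS API here, for example to reboot an instance.
            decisions.append(f"remediate: {alarm['AlarmName']}")
        else:
            decisions.append(f"ignore: {alarm['AlarmName']}")
    return decisions

# Hypothetical, trimmed-down event for local testing.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({
            "AlarmName": "employee-directory-high-cpu",
            "NewStateValue": "ALARM",
        })}}
    ]
}
result = handler(sample_event, None)
print(result)
```

Testing the handler locally with a sample event like this one lets you check the parsing logic before you connect it to a real alarm.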