I have spent a lot of time over the last few years working on projects in cloud environments. Most of my adventures involve AWS, but I have spent time with all the major clouds and have tried out many of the components they provide. There are so many tools available that in many cases it is just a matter of piecing together the best components to solve whatever your current problem is.
When you have deployed a new app or site in the cloud, the focus usually shifts to monitoring and maintenance. One of the core things to keep an eye on is whether your site is reachable by your users at all times.
On AWS there are many ways you can do this, but one approach I like is to set up Health Checks in the Route 53 area. One misconception about these checks is that you have to use Route 53 for your host DNS, or have everything in Route 53, to use the Health Check feature. This is not the case. You can use this feature to monitor ANY endpoint (even ones you don't control) via HTTP, HTTPS, or TCP and specify the host by hostname or IP address.
This post details how you can use some key serverless components from AWS, like Amazon EventBridge, AWS Lambda, and Amazon Simple Notification Service (SNS), to set up a system that monitors your site (which can be running anywhere) and sends emails, text messages, Slack messages, and more when the reachability status of your site changes.
Serverless Application Model (SAM)
I'm a big fan of using an Infrastructure as Code (IaC) approach for any project. My go-to tools for this are the Serverless Application Model (SAM) and its associated CLI (SAM CLI). For more official use cases and for cross-platform apps I typically use Terraform.
The setup for this project is done using SAM, and the associated sample code found in this GitHub repository uses SAM. Of course, all of the components described here can also be set up using the AWS console, the AWS CLI, or many other approaches.
The SAM template for this project can be found in the GitHub repository, but here is the template.yaml file.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  route53-health-check-sam
  SAM Template for route53-health-check-sam
  - Route53 Health Check
  - Health Check Cloudwatch Alarm
  - Lambda Function to run when alarm state changes
  - SNS Topic to send updates on health check changes to
Globals:
  Function:
    Timeout: 3
    MemorySize: 128
    Tracing: Active
    LoggingConfig:
      LogFormat: JSON
  Api:
    TracingEnabled: true
Parameters:
  Hostname:
    Type: String
    Description: Hostname to monitor
    Default: www.amazon.com
  SlackWebhookURL:
    Type: String
    Description: URL to publish slack messages to when health check changes state
    Default: https://hooks.slack.com/triggers/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
Resources:
  HealthCheckStateChangedFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: site_health_check/
      Handler: app.lambda_handler
      Runtime: python3.12
      Architectures:
        - x86_64
      Policies:
        - SNSPublishMessagePolicy:
            TopicName: !GetAtt Route53HealthCheckSNSTopic.TopicName
      Environment:
        Variables:
          SNS_TOPIC_ARN: !Ref Route53HealthCheckSNSTopic
          SLACK_WEBHOOK_URL: !Ref SlackWebhookURL
      Events:
        Trigger:
          Type: EventBridgeRule
          Properties:
            Pattern:
              source:
                - aws.cloudwatch
              detail-type:
                - CloudWatch Alarm State Change
              detail:
                alarmName:
                  - wildcard: "*-HealthCheckAlarm"
  Route53HealthCheckSNSTopic:
    Type: "AWS::SNS::Topic"
    Properties:
      DisplayName: "Route53 Health Check SNS Topic"
      Subscription:
        - Endpoint: healthcheck_status@example.com
          Protocol: email
      TopicName: "Route53HealthCheckSNSTopic"
  Route53HealthCheck:
    Type: 'AWS::Route53::HealthCheck'
    Properties:
      HealthCheckConfig:
        Port: 443
        Type: HTTPS
        ResourcePath: '/'
        FullyQualifiedDomainName: !Ref Hostname
        RequestInterval: 30
        FailureThreshold: 3
  Route53HealthCheckAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Health Check Alarm
      AlarmName: !Join
        - ''
        - - !Ref Hostname
          - '-HealthCheckAlarm'
      Namespace: AWS/Route53
      MetricName: HealthCheckStatus
      Dimensions:
        - Name: HealthCheckId
          Value: !Ref Route53HealthCheck
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 1
      Period: 30
      Statistic: Minimum
      Threshold: 1.0
      TreatMissingData: breaching
Route 53 Health Checks
The core component of this solution is the AWS service called Route 53 Health Checks. Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.
Route 53 has health checkers in locations around the world. When you create a health check that monitors an endpoint, health checkers start to send requests to the endpoint that you specify to determine whether the endpoint is healthy. You can choose which locations you want Route 53 to use, and you can specify the interval between checks: every 10 seconds or every 30 seconds.
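If you would rather create the health check from code than from the SAM template, here is a rough sketch of how the request could be assembled for the boto3 Route 53 create_health_check call. The function names build_health_check_request and create_health_check are my own, the defaults simply mirror the values in the template above, and route53_client is assumed to be boto3.client("route53"). Treat it as a starting point, not as the code from the repository.

```python
import uuid

VALID_INTERVALS = (10, 30)  # Route 53 offers only these two request intervals


def build_health_check_request(target, *, is_ip=False, check_type="HTTPS",
                               port=443, path="/", interval=30,
                               failure_threshold=3):
    """Build keyword arguments for the boto3 Route 53 create_health_check call.

    Hypothetical helper: the defaults mirror the SAM template in this post.
    """
    if interval not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {VALID_INTERVALS}")
    config = {
        "Type": check_type,
        "Port": port,
        "RequestInterval": interval,
        "FailureThreshold": failure_threshold,
    }
    # The endpoint can be identified by hostname or by IP address.
    config["IPAddress" if is_ip else "FullyQualifiedDomainName"] = target
    # A resource path only makes sense for HTTP(S) checks, not plain TCP.
    if check_type != "TCP":
        config["ResourcePath"] = path
    # CallerReference must be unique per request; a random UUID is one option.
    return {"CallerReference": str(uuid.uuid4()), "HealthCheckConfig": config}


def create_health_check(route53_client, target, **options):
    """Create the check and return its ID (route53_client is assumed to be a boto3 client)."""
    request = build_health_check_request(target, **options)
    response = route53_client.create_health_check(**request)
    return response["HealthCheck"]["Id"]
```

Keeping the request builder separate from the AWS call means the configuration can be checked locally without credentials.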
CloudWatch Alarms for Route 53 Health Checks
You can use Amazon CloudWatch alarms to monitor the status of your Route 53 health checks. When setting up the alarm you specify that it should be based on the metric associated with the Route 53 health check. When the status of the metric changes from UP to DOWN (metric goes from 1 to 0) or DOWN to UP (metric goes from 0 to 1), the state of the CloudWatch alarm changes as well.
One of the most important components in AWS for building event-driven architectures is Amazon EventBridge. It is a serverless event bus that helps you receive, filter, transform, route, and deliver events. Most changes that happen in your AWS account are automatically sent to this bus, including CloudWatch alarm state changes. We'll discuss this below.
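To make the alarm settings from the template a bit more concrete (Statistic: Minimum, ComparisonOperator: LessThanThreshold, Threshold: 1.0, TreatMissingData: breaching), here is a small local simulation. It is a simplified model I wrote for illustration, not how CloudWatch evaluates alarms internally, and the function names are hypothetical.

```python
def evaluate_period(datapoints, threshold=1.0):
    """Return 'ALARM' or 'OK' for one evaluation period.

    datapoints holds the HealthCheckStatus samples seen in the period
    (1 = healthy, 0 = unhealthy). Simplified model, for illustration only.
    """
    # TreatMissingData: breaching -> a period without data counts as breaching.
    if not datapoints:
        return "ALARM"
    # Statistic: Minimum combined with ComparisonOperator: LessThanThreshold.
    return "ALARM" if min(datapoints) < threshold else "OK"


def alarm_transitions(periods, initial_state="OK"):
    """Yield (old_state, new_state) each time the simulated alarm changes state."""
    state = initial_state
    for datapoints in periods:
        new_state = evaluate_period(datapoints)
        if new_state != state:
            yield state, new_state
            state = new_state


if __name__ == "__main__":
    # Healthy, two unhealthy periods, recovery, then a period with no data.
    for old, new in alarm_transitions([[1], [0], [0], [1], []]):
        print(f"{old} -> {new}")
```

Each yielded pair corresponds to one state change, which is the moment the rest of this solution reacts to.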
Amazon EventBridge
Whenever the state of a CloudWatch alarm changes, AWS automatically sends an event to the default EventBridge event bus in your account. To take advantage of this and trigger actions on events like this, you need to create an EventBridge rule that matches the events you are interested in and defines what to do when a matching event is seen.
The rule for this project matches the state change of the CloudWatch alarm tied to the Route 53 health check. We have set up the alarm name to be the hostname of what we are checking with a suffix of "-HealthCheckAlarm". The rule, defined in the Events section of HealthCheckStateChangedFunction in the SAM template, uses a wildcard pattern to match any event whose alarm name ends with this suffix.
When you create a rule in EventBridge you can specify a list of targets, or actions, to execute when a match happens. The target for this rule is an AWS Lambda function, which takes care of notifying the people we have set up via SNS and Slack.
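To get a feel for which events the rule lets through, here is a small local approximation of the pattern from the template. It is only a sketch: EventBridge does the real matching inside the service, SAMPLE_EVENT is a heavily trimmed event with made-up values, and matches_health_check_rule is a hypothetical helper that uses fnmatch to stand in for the wildcard filter.

```python
from fnmatch import fnmatchcase

# Trimmed example of a CloudWatch alarm state change event (values are made up).
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "www.amazon.com-HealthCheckAlarm",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

ALARM_NAME_PATTERN = "*-HealthCheckAlarm"  # same wildcard as in the SAM template


def matches_health_check_rule(event):
    """Approximate the rule's pattern locally; not the service's own matcher."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.cloudwatch"
        and event.get("detail-type") == "CloudWatch Alarm State Change"
        and fnmatchcase(detail.get("alarmName", ""), ALARM_NAME_PATTERN)
    )


if __name__ == "__main__":
    print(matches_health_check_rule(SAMPLE_EVENT))
```

An alarm whose name does not end in "-HealthCheckAlarm", or an event from another source, would not reach the Lambda function.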
Simple Notification Service (SNS)
There are many use cases for having to notify someone or some other service on changes in your system. On AWS, the SNS component is usually the best approach to use.
SNS is a managed service that provides message delivery from publishers to subscribers (also known as producers and consumers). Publishers communicate asynchronously with subscribers by sending messages to a topic, which is a logical access point and communication channel. Clients can subscribe to the SNS topic and receive published messages using a supported endpoint type, such as Amazon Kinesis Data Firehose, Amazon SQS, AWS Lambda, HTTP, email, mobile push notifications, and mobile text messages (SMS).
When you set up a new SNS topic and register endpoints (subscribers), each endpoint typically has to confirm that it wants to be on the notification list. With email notifications, for example, you will receive an email asking you to confirm that you are expecting the notifications and want to receive them.
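The SAM template registers the email endpoint for you, but endpoints can also be added later from code. Here is a minimal sketch using the boto3 SNS subscribe call. The helper name subscribe_endpoint is hypothetical, and sns_client is assumed to be boto3.client("sns"), passed in so the function can be tried with a stub.

```python
# Protocol values accepted by the SNS Subscribe API.
SUPPORTED_PROTOCOLS = {
    "email", "email-json", "sms", "sqs", "lambda",
    "http", "https", "application", "firehose",
}


def subscribe_endpoint(sns_client, topic_arn, protocol, endpoint):
    """Subscribe an endpoint to a topic and return the subscription ARN.

    Hypothetical helper; sns_client is assumed to be boto3.client("sns").
    """
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol=protocol,
        Endpoint=endpoint,
        # Ask for the ARN even while the subscription is pending confirmation.
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```

For an email endpoint the subscription stays pending until the recipient confirms it from the confirmation email.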
AWS Lambda Function
When the CloudWatch alarm state changes, a Lambda function (called for example - route53-health-check-sam-HealthCheckStateChangedFu) is executed. The information about the site status and which site it was is sent in the payload of the Lambda function invocation.
Below is an example of the Lambda handler code. It uses the highly recommended Powertools for AWS Lambda library to ensure best practices around tracing, logging, metrics, and more. The function is given an SNS topic and a Slack webhook URL at creation time (via environment variables) to send notifications to. It parses the event it receives to determine which hostname had its status changed and what the new status is, and then sends out notifications.
One nice part about publishing to SNS from a Lambda function is that you can set the notification subject and body to whatever you want. When a CloudWatch alarm publishes to SNS directly, the subject and body are defined for you.
import os
import traceback

from aws_lambda_powertools import Logger, Metrics, Tracer

# Powertools utilities (assumed setup, not shown in the original excerpt).
# The service name and metrics namespace can be supplied through the
# POWERTOOLS_* environment variables or as constructor arguments.
logger = Logger()
tracer = Tracer()
metrics = Metrics()

ALARM_SUFFIX = '-HealthCheckAlarm'

# send_slack_message() and publish_to_sns() are helper functions that are
# not shown in this excerpt.


@tracer.capture_lambda_handler
@logger.inject_lambda_context(log_event=True)
@metrics.log_metrics(capture_cold_start_metric=True)
def lambda_handler(event, context):
    SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
    SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']
    logger.info(event)
    logger.info(f"SNS_TOPIC_ARN={SNS_TOPIC_ARN}")
    logger.info(f"SLACK_WEBHOOK_URL={SLACK_WEBHOOK_URL}")
    try:
        if event['detail-type'] != "CloudWatch Alarm State Change":
            logger.error("This is not an event we care about - not sure why we got here")
            return
        event_detail = event['detail']
        # The alarm name is "<hostname>-HealthCheckAlarm"; strip the suffix
        # to recover the hostname being monitored.
        site_that_changed = event_detail['alarmName'].removesuffix(ALARM_SUFFIX)
        logger.info(f"site_that_changed={site_that_changed}")
        new_status = event_detail['state']['value']
        logger.info(f"new_status={new_status}")
        # Any alarm state other than OK is reported as DOWN.
        if new_status != 'OK':
            icon = ":x:"
            text_status = "DOWN"
        else:
            icon = ":white_check_mark:"
            text_status = "UP"
        message_to_show = f"{icon} {site_that_changed} ( https://{site_that_changed} ) is now {text_status}"
        slack_data = {
            "text": message_to_show
        }
        # An empty webhook URL skips the Slack notification.
        if SLACK_WEBHOOK_URL:
            logger.info(f"slack_data={slack_data}")
            slack_response = send_slack_message(slack_data, SLACK_WEBHOOK_URL)
            logger.info(f"slack_response={slack_response}")
        sns_subject = f"{site_that_changed} is now {text_status}"
        sns_msg = f"https://{site_that_changed} is now {text_status}"
        sns_response = publish_to_sns(sns_subject, sns_msg, SNS_TOPIC_ARN)
        logger.info(f"sns_response={sns_response}")
    except Exception:
        # Log the traceback rather than failing the invocation.
        traceback.print_exc()
        logger.info(f"traceback={traceback.format_exc()}")
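The handler above calls publish_to_sns, which is not shown in this post. Here is a minimal sketch of what such a helper could look like, assuming boto3 is available (as it is in the Lambda Python runtime). This is my illustration rather than the code from the repository. The optional sns_client parameter is an addition so the function can be exercised with a stub, and the subject is trimmed because SNS documents that subjects must be less than 100 characters long.

```python
MAX_SUBJECT_LENGTH = 99  # SNS subjects must be shorter than 100 characters


def publish_to_sns(subject, message, topic_arn, sns_client=None):
    """Publish a message with a custom subject and return the SNS message ID.

    Sketch only; sns_client is a hypothetical parameter that defaults to
    boto3.client("sns").
    """
    if sns_client is None:
        import boto3  # imported lazily so the sketch can run with a stub client

        sns_client = boto3.client("sns")
    response = sns_client.publish(
        TopicArn=topic_arn,
        Subject=subject[:MAX_SUBJECT_LENGTH],
        Message=message,
    )
    return response["MessageId"]
```

Because the subject and message are plain strings built in the handler, this is where you control exactly what shows up in the email.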
Notifications on Route53 Health Check status
In the example I have put together you will get email notifications from SNS at the email address you defined in the SAM project as below (healthcheck_status@example.com is the code default). Of course you will need a real email address that you have access to in order to confirm the SNS subscription. You can also set up SMS text messages and more with SNS.
You will also receive Slack messages using the Slack webhook URL you define (the example below uses a FAKE webhook URL of https://hooks.slack.com/triggers/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). You will need to get a valid webhook URL, or you can set this to '' to skip the Slack update and just use SNS.
Setting up a Slack application or bot is out of scope for this article, but there is a good tutorial here: How to create a webhook URL for a Slack Channel?
The hostname to monitor (www.amazon.com is set up as the default) is defined in the parameters of the SAM template as below.
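The Lambda handler relies on a send_slack_message helper that is not shown in this post. Here is one way it could be written using only the Python standard library. It is a sketch under my own assumptions: build_slack_request and the timeout parameter are hypothetical, and the 2 second timeout is chosen only to stay under the 3 second function timeout in the template.

```python
import json
import urllib.request


def build_slack_request(slack_data, webhook_url):
    """Build the HTTP POST request that carries the JSON payload."""
    body = json.dumps(slack_data).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(slack_data, webhook_url, timeout=2):
    """POST the payload to the webhook and return the HTTP status code.

    Returns None without sending anything when webhook_url is empty.
    """
    if not webhook_url:
        return None
    request = build_slack_request(slack_data, webhook_url)
    # urlopen raises an exception for error responses; the caller handles it.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The payload is the same {"text": ...} dictionary the handler builds, so no extra dependencies need to be packaged with the function.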
Route53HealthCheckSNSTopic:
  Type: "AWS::SNS::Topic"
  Properties:
    DisplayName: "Route53 Health Check SNS Topic"
    Subscription:
      - Endpoint: healthcheck_status@example.com
        Protocol: email
    TopicName: "Route53HealthCheckSNSTopic"

Parameters:
  Hostname:
    Type: String
    Description: Hostname to monitor
    Default: www.amazon.com
  SlackWebhookURL:
    Type: String
    Description: URL to publish slack messages to when health check changes state
    Default: https://hooks.slack.com/triggers/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
With this setup, each state change of the alarm produces messages such as "www.amazon.com is now DOWN" and "www.amazon.com is now UP" in both email and Slack.
Conclusion
I hope you learned something about using various components in AWS to set up a serverless site monitoring solution. Please clone the GitHub repository and try it out for yourself. There are likely many improvements that could be made.
Please let me know if you have any questions or concerns.
Cleanup
If you did clone the repo and set this solution up for yourself, please remember to clean up the resources to avoid any ongoing costs. Run
sam delete
to delete the underlying CloudFormation stack and the resources it provisioned in AWS.
For more articles from me please visit my blog at Darryl's World of Cloud or find me on X, LinkedIn, Medium, Dev.to, or the AWS Community.
For tons of great serverless content and discussions please join the Believe In Serverless community we have put together at this link: Believe In Serverless Community