Fifteen years ago, when we sent a request to a website, it was most likely served by a single server. Now, your request is likely being served by a dozen microservices. With this setup, identifying bottlenecks means looking through each microservice to see which one is taking the most time.
For this purpose, we have AWS X-Ray. With X-Ray, we can trace the journey of a request as it goes through the various applications in our system. We can track how long it took to write to the DB, to call the User system, to call the Orders system, and so on. With this, we can identify bottlenecks.
Best of all, with AWS X-Ray, we don't have to change much in our codebase. For serverless apps, just turn on "tracing" when you deploy and you should be good to go.
1 | Setup
Let's create a sample Serverless application by running the serverless command, and call the project "xray-system-demo".
Open the newly created Serverless project in your text editor. Back in your terminal, let's set up the Python virtual environment and install the dependencies required for this project:
python3 -m venv venv
source venv/bin/activate
pip install aws-xray-sdk
pip install requests
pip install boto3
pip freeze > requirements.txt
serverless plugin install -n serverless-python-requirements
2 | Set up our primary service, MAIN.
2.1 Update your serverless.yml
Add tracing and iamRoleStatements to the provider section in your serverless.yml, as shown below.
The tracing section tells the Serverless Framework to enable tracing for both API Gateway and Lambda. The additional iamRoleStatements grant our function permission to push trace data to X-Ray. We will see this in action in a bit.
Also, add the package section to reduce our deployment package size. To learn more techniques for reducing package size, refer to this blog post.
provider:
  tracing:
    apiGateway: true
    lambda: true
  iamRoleStatements:
    - Effect: "Allow" # xray permissions (required)
      Action:
        - "xray:PutTraceSegments"
        - "xray:PutTelemetryRecords"
      Resource:
        - "*"

package:
  exclude:
    - venv/**
    - node_modules/**

custom:
  pythonRequirements:
    pythonBin: python3
2.2 Deploy your function
Now, let's deploy our function!
serverless deploy --region us-west-2
Once this command is done, you should see output similar to the image below. Visit the URL specified in the output.
Right now, it just prints out a JSON message, the equivalent of a "hello world" exercise. However, under the hood, Lambda was configured to send performance data to X-Ray. Let's go to the X-Ray console and see what it sent:
In the X-Ray console, click Service map and select "Last 5 minutes". This should display your very first service map. It may take up to 30 seconds from the time of your request for the service map to be generated.
You should see 4 circles: 1 for the client (you), 1 for API Gateway, and 2 for Lambda. We also see the request took 593ms, 572ms of which came from Lambda (the remainder is from API Gateway). There are 2 circles for Lambda because one represents the Lambda service itself and the other represents the specific Lambda function.
Click on one of the circles and you will see the distribution of how long it took your app to send back a response to each request.
In the next section, we will dive deep on what is happening under the hood.
3 | Add a second service, DEV2, that MAIN communicates with via an API call
While useful for illustrating X-Ray, the previous example is completely unrealistic. Most applications have some sort of external dependency such as a database, a caching layer, a backend system, and so on. In this example, we will set up external dependencies for our application.
Understanding Segments and Subsegments
An important concept in X-Ray is something called patching. We add the following code snippet at the top of the handler code. With this snippet in your app, X-Ray will make sure a "subsegment" is created whenever your application calls a supported library or service. A subsegment breaks down the work your segment does to fulfill its part of the request.
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

patch_all()

def hello(event, context):
    ...
To further illustrate the point, say we have an ecommerce website and we are analyzing the "Complete Order" functionality. In the diagram below, we see that there are 4 different "segments" involved in completing the order: Storefront, Payments, Orders, and Notifications. Each segment performs its own series of steps to do its part in completing the order, and each step is called a subsegment. In the case of the Storefront segment, calling and waiting for the Payments, Orders, and Notifications segments to return a response is also a subsegment.
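To make segments and subsegments concrete in code, here is a minimal sketch of how you could record one piece of work as its own subsegment with the X-Ray SDK. The function and subsegment names are made up for illustration, and patch_all(), which we discuss next, creates these subsegments automatically for supported libraries:

from aws_xray_sdk.core import xray_recorder

def charge_customer(order):
    # Everything inside this block is recorded as a "payments-call"
    # subsegment under the current segment (e.g. Storefront).
    with xray_recorder.in_subsegment("payments-call") as subsegment:
        subsegment.put_annotation("order_id", order["id"])
        # ... call the Payments service here ...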
Understanding Patching
The important thing to note with X-Ray patching is that it only supports a handful of the most commonly used libraries in Python. The "requests" library happens to be one of them. As a supported library, we just add the code snippet above (the one with patch_all()) and X-Ray makes sure that a subsegment is created for it.
In this example, we are going to create a second service called DEV2 that our current application MAIN calls via HTTP GET (maybe MAIN needs to get some values from DEV2 to display the homepage). Since we are using the "requests" library for this, we no longer have to write any code to notify X-Ray that calling DEV2 is a subsegment. X-Ray automatically does that for us.
Having the call to DEV2 as a subsegment is convenient because it allows us to isolate how many seconds calling DEV2 added to fulfilling the request. Perhaps DEV2 took 5 seconds to respond to us and is the bottleneck.
If this isn't completely clear yet, let's do the hands-on exercise; I will also provide an explanation at the end of the section.
3.1 Setting up a second backend system
Imagine our main application has to get data first from another backend service in order to fulfill its duties.
In order to avoid rewriting code, let's deploy the same setup we have now but under another stage (say, dev2).
serverless deploy --region us-west-2 --stage dev2
Now we have a second application deployed. Let's call this second application dev2.
From the display above, take note of the URL. We will use that in the next part.
3.2 Make the MAIN app communicate with DEV2
In the main application, let's add code so we can send HTTP requests to dev2. Add this code in a new file named "http_gateway.py":
import requests

class HttpGateway:
    @classmethod
    def send_get_request(cls, url, headers=None):
        payload = {}
        response = requests.request("GET", url, headers=headers, data=payload)
        return {
            "code": response.status_code,
            "reason": response.reason,
            "body": response.json()
        }
Then, let's update our handler code. Notice that I call HttpGateway.send_get_request() with the URL of the DEV2 service created in the previous step.
import json

from aws_xray_sdk.core import patch_all
from http_gateway import HttpGateway

# As described above, patch supported libraries (including "requests")
# so the call to DEV2 shows up as a subsegment.
patch_all()

def hello(event, context):
    body = {
        "message": "Go Serverless v2.0! Your function executed successfully!",
        "input": event,
    }
    response = {"statusCode": 200, "body": json.dumps(body)}
    HttpGateway.send_get_request("https://7hjkoozf88.execute-api.us-west-2.amazonaws.com/dev2/")
    return response
Now, let's deploy our main application:
serverless deploy --region us-west-2
Once done, visit the URL of the main application as we did in Section 2. Wait 30 seconds, go back to the X-Ray service map, and you should see it updated:
4 | DynamoDB
4.1 Modify the serverless.yml to add DynamoDB
In the resources section, we add the CloudFormation code to create a simple DynamoDB table. We also update the provider section to:
- Add permission for our app to access DynamoDB
- Add an environment variable for the table name
provider:
  ...
  iamRoleStatements:
    ...
    - Effect: "Allow"
      Action: "dynamodb:*"
      Resource: "*"
  environment:
    DYNAMODB_REQUEST_TABLE: ${self:provider.stage}-generic-request-table

resources:
  Resources:
    GenericRequestTable:
      Type: 'AWS::DynamoDB::Table'
      Properties:
        AttributeDefinitions:
          - AttributeName: request_id
            AttributeType: S
        KeySchema:
          - AttributeName: request_id
            KeyType: HASH
        BillingMode: PAY_PER_REQUEST
        TableName: ${self:provider.environment.DYNAMODB_REQUEST_TABLE}
4.2 Modify app code to communicate with DynamoDB
To communicate with DynamoDB, we create this small snippet that creates an item in DynamoDB. Add it in a new file named "dynamodb_gateway.py":
import boto3

class DynamodbGateway:
    @classmethod
    def create_item(cls, table_name, payload):
        # boto3 is patched by patch_all(), so this PutItem call will
        # show up as a DynamoDB subsegment in the trace.
        client = boto3.resource('dynamodb')
        table = client.Table(table_name)
        result = table.put_item(
            Item=payload
        )
        print(result)
        return result
Now, let's update our handler function to call DynamoDB and create an item:
import os

from dynamodb_gateway import DynamodbGateway

def hello(event, context):
    ...
    table_name = os.getenv("DYNAMODB_REQUEST_TABLE")
    request_id = event['requestContext']["requestId"]
    DynamodbGateway.create_item(table_name, {"request_id": request_id, "master": 1})
Now, let's deploy our main application:
serverless deploy --region us-west-2
5 | Analyzing X-Ray Traces
Go to the Traces tab and click one of the requests:
Now, go to the "Timeline" tab in the lower part of the screen. Scroll down a bit more for a view similar to this:
Explaining Sampling
Earlier, we saw the service map. It is a visual representation of the journey our request goes through and all the systems it calls to fulfill the request and send a response back to you. To construct this map, Lambda takes the 1st request in any given second plus 5% of additional requests within that second. It then sends trace data about these requests to the X-Ray service, which in turn analyzes it and produces this map.
To make it explicit, Lambda does not send all requests to X-Ray for analysis, just a sample of them.
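To get a feel for how much trace data this produces, here is a quick back-of-the-envelope calculation based on the default rule described above (the request rate of 100 per second is just an example):

# Traces sampled per second under the default rule:
# the first request each second, plus 5% of the remaining requests.
def sampled_per_second(requests_per_second, reservoir=1, rate=0.05):
    extra = max(requests_per_second - reservoir, 0)
    return reservoir + rate * extra

print(sampled_per_second(100))  # ~5.95, so roughly 6 traces out of 100 requests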
Explaining Traces
Aside from the service map, we also get to examine each sampled request sent by Lambda to X-Ray in depth, and this is what we saw above: we chose one sampled request and looked at its journey in a bit more detail.
Here, we see 7 segments:
- MAIN - APIGW
- MAIN - Lambda Service
- MAIN - Lambda Function
- DEV2 - APIGW
- DEV2 - Lambda Service
- DEV2 - Lambda Function
- DynamoDB
If you examine segment 3 closely, you will see that one of its subsegments is the call to DEV2. The line with the blue rectangle indicates it took 1.2s for DEV2 to return. DEV2 is in the same AWS account as the MAIN app and is also configured with X-Ray. Because of this, we are able to look under the hood at the journey of the request as it goes through DEV2. This part of the request's journey shows up as 3 more segments (segments 4, 5, and 6).
Hence, we see the call to DEV2 both as a subsegment of MAIN (segment 3) and as its own segments (segments 4, 5, and 6).
We can also say the same for MAIN's call to DynamoDB. It is both a subsegment for MAIN and its own segment.
Representing it in the diagram we used earlier:
But what if the system we are calling is not within the same AWS account or is not configured with X-Ray? In that case, it only shows up as a subsegment of MAIN and not as its own segment.
6 | Next Steps
This next section is more like homework. You can try integrating other AWS services or external resources into your application and see what the service map looks like.
In this repository, I tried the following exercises:
- MAIN sends a task to SQS and DEV2 consumes the task - the boto3 library is patched, so this should be no issue and easy to do (see the sketch after this list)!
- MAIN searches through an ElasticSearch index - this is a bit more advanced, as the "elasticsearch" Python library is not included in the list of libraries that X-Ray automatically patches. I utilized custom subsegments for that. Perhaps in a follow-up post, I'll show you how to do it.
- MAIN calls an external API that I do not own - the requests library is patched, so no issue and easy to do!
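As a hint for the first exercise, here is a minimal sketch of what the producer side could look like. The TASK_QUEUE_URL environment variable and the payload are made up for illustration; because boto3 is patched, the send_message call should show up as an SQS subsegment automatically.

import json
import os

import boto3

def enqueue_task(payload):
    # Hypothetical queue URL passed in via an environment variable.
    queue_url = os.environ["TASK_QUEUE_URL"]
    sqs = boto3.client("sqs")
    # Traced as an SQS subsegment because patch_all() patches boto3.
    return sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))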
7 | Where does the optimization come in?
Now that you have integrated all there is to integrate and have an accurate service map of your entire application, you can identify the bottlenecks. Which requests take more than 5 seconds to return, and which systems may be the bottleneck?
From this insight, you can set your team's priorities on what system to optimize first.
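If you prefer to find these slow requests programmatically instead of eyeballing the console, here is a hedged sketch using boto3's X-Ray client; the 5-second threshold and the one-hour window are just example values:

from datetime import datetime, timedelta

import boto3

# Sketch: list sampled traces from the last hour whose response time
# exceeded 5 seconds, using an X-Ray filter expression.
xray = boto3.client("xray")
end = datetime.utcnow()
start = end - timedelta(hours=1)

paginator = xray.get_paginator("get_trace_summaries")
for page in paginator.paginate(
    StartTime=start,
    EndTime=end,
    FilterExpression="responsetime > 5",
):
    for summary in page["TraceSummaries"]:
        print(summary["Id"], summary.get("ResponseTime"))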
That's all folks!
Photo by Adi Goldstein on Unsplash
Special thanks to my guest editor, Audrick.