DEV Community

Stephen Collins

How to deploy BERT on AWS

Post Series

  1. How to fine tune BERT for real time sentiment analysis
  2. How to run BERT on AWS
  3. How to deploy BERT on AWS (this post)

Table of Contents

  1. Introduction
  2. Overview of Data Mining Cluster Architecture
  3. Setting up AWS resources with AWS CDK
    1. Getting Started with AWS CDK
    2. Creating an SQS Queue
    3. Creating an SNS Topic
    4. Importing an S3 Bucket
    5. Creating a Python Lambda Function
    6. Creating an EventBridge rule
    7. Importing a DynamoDB Table
    8. Importing an ECR repository
    9. Creating and initializing an EC2 instance
  4. Conclusion

Introduction

In the first post of this three-part series, we discussed how to take 1.6 million sentiment-labeled tweets and fine-tune BERT to output a sentiment score for an input social media post. The fine-tuning produced persistent configuration data files that we can load in our data mining cluster, which runs 24/7, once a minute, and saves its results to our database (in this case, DynamoDB).

This article picks up where the previous posts left off, and assumes you have generated configuration data files for the Hugging Face pre-trained BERT model for sequence classification. If not, we highly encourage you to check out the first and second posts and then come back!

Overview of Data Mining Cluster Architecture

From a high level, the cluster is architected like a pipeline. We have an EventBridge rule triggering a Lambda function, which in turn writes raw data retrieved from a social media API to an S3 bucket. Writing the raw social media data to S3 creates a PUT event on the bucket. This PUT event is passed to an SNS topic, which in turn pushes the event notification to an SQS queue.

We have an independent EC2 instance running a docker container 24/7 that instantiates and keeps alive an SQS listener subscribed to the same queue that the SNS topic populates. Once the SQS listener receives a message, the process in the docker container (which we call the "modelWorker" process) takes the raw social media data and processes it with a pre-instantiated, fine-tuned BERT model instance to generate the resulting social media sentiment whenever a mention of bitcoin, ethereum, polkadot, dogecoin and/or chainlink is found.

After the docker container inside the EC2 instance is finished processing the raw social media data, it writes the resulting sentiment score and mention count for each coin to a dedicated DynamoDB Table.
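
To make the S3 to SNS to SQS hand-off concrete, here is a minimal TypeScript sketch of how a listener could unwrap one queue message. It assumes the subscription does not use raw message delivery, so the SQS body is an SNS envelope whose Message field holds the S3 event as a JSON string. The function and interface names are hypothetical, and our actual modelWorker process (covered in part 2) is not shown here.

```typescript
// Minimal shapes for the fields this sketch reads; real events carry many more.
interface S3EventRecord {
  s3: { bucket: { name: string }; object: { key: string } }
}

interface SnsEnvelope {
  Type: string
  Message: string // the S3 event, serialized as a JSON string
}

interface S3ObjectRef {
  bucket: string
  key: string
}

// Unwrap SQS body -> SNS envelope -> S3 event and list the objects that were written.
function extractS3Objects(sqsBody: string): S3ObjectRef[] {
  const envelope = JSON.parse(sqsBody) as SnsEnvelope
  const s3Event = JSON.parse(envelope.Message) as { Records?: S3EventRecord[] }
  // Messages without "Records" (such as S3's initial test event) yield an empty list.
  const records = s3Event.Records || []
  return records.map((record) => ({
    bucket: record.s3.bucket.name,
    // S3 URL-encodes object keys in notifications, with "+" standing for a space.
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  }))
}

// Hypothetical message body, shaped like what the SQS listener would receive.
const sampleBody = JSON.stringify({
  Type: "Notification",
  Message: JSON.stringify({
    Records: [
      {
        s3: {
          bucket: { name: "test-mining-cluster-bucket" },
          object: { key: "social_media_data_2022-09-01T12_00_00.json" },
        },
      },
    ],
  }),
})

const sampleObjects = extractS3Objects(sampleBody)
```

The listener would then fetch each listed object from S3 and pass its contents to the model.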

With this overview out of the way, now we'll go more in depth on how to manage creating and using the AWS resources for our data mining cluster.

Setting up AWS Resources with AWS CDK

After researching how best to manage the creation and updating of this data mining cluster, we decided to use the AWS CDK to create and import the cluster's resources and to manage the interactions (such as permissions) between them. The goal was a cluster efficient enough, in both cost and time, to keep up with the desired data processing interval: querying social media APIs once a minute, 24/7.

We'll provide both AWS CDK code examples as well as our input on how we came up with the deployment code for the orchestration of the resources of our cluster.

Getting Started with AWS CDK

Before we can dive into the aws-cdk libraries we used for orchestrating this cluster, we need to set up an AWS CDK app. We used TypeScript, as we are familiar with it and our web app is written with TypeScript and React.

To set up an AWS CDK application, we need to install a couple of things. You need TypeScript and npm on your local system. We recommend installing nvm to manage the active Node.js version you are using (npm is a package manager for Node.js, and we'll use it to install both TypeScript and the AWS CDK CLI tool).

Once you have Node.js 10.13.0 or newer on your system (we suggest the current Node.js LTS version), install TypeScript with:

npm install -g typescript

You also need an AWS account, which you can sign up for free here. We are using macOS, so we installed the AWS CLI with Homebrew:

brew install awscli

Here are additional AWS CLI installation instructions for other platforms.

After we have the AWS CLI installed, let's configure our local machine with the credentials of our new AWS account. We can do this by running:

aws configure

or set the credentials manually by creating a file ~/.aws/config with the following contents, substituting your own AWS access key ID and AWS secret access key:

[default]
aws_access_key_id=YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key=YOUR_AWS_SECRET_ACCESS_KEY

For more info, here's how to create AWS access key ID and secret access key.

Now that the AWS CLI is installed on our system, let's install the AWS CDK Toolkit:

npm install -g aws-cdk

and test that the AWS CDK Toolkit was installed correctly:

cdk --version

Now we need to bootstrap our AWS environment, which creates a dedicated S3 bucket (along with a few supporting resources) that AWS CloudFormation uses for our CDK application:

cdk bootstrap aws://YOUR-AWS-ACCOUNT-NUMBER/REGION

Where YOUR-AWS-ACCOUNT-NUMBER is your account's number. A quick way to find this number is with this command:

aws sts get-caller-identity

Use the value from the key "Account". AWS account IDs are twelve digits long.

For the REGION value of the cdk bootstrap command: S3 buckets are region-specific, so you will want to bootstrap in the AWS region closest (and available) to you.
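
As a small illustration of the argument format, the sketch below assembles the aws://ACCOUNT/REGION string and rejects obviously malformed input before it reaches the CLI. The helper name is hypothetical and the region check is only a loose shape test, not a lookup against the real list of AWS regions.

```typescript
// Build the environment argument for `cdk bootstrap`, rejecting malformed input early.
function bootstrapEnvironment(accountId: string, region: string): string {
  // AWS account IDs are exactly twelve digits.
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error("Expected a 12-digit AWS account ID, got: " + accountId)
  }
  // Loose shape check only (e.g. "us-east-1"); it does not prove the region exists.
  if (!/^[a-z]{2}(-[a-z]+)+-\d+$/.test(region)) {
    throw new Error("Unexpected region format: " + region)
  }
  return "aws://" + accountId + "/" + region
}

// Placeholder account ID, for illustration only.
const exampleEnvironment = bootstrapEnvironment("123456789012", "us-east-1")
```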

We're on the home stretch before we can dive into provisioning the resources for the data mining cluster itself. Assuming the cdk bootstrap command was successful, all that's left is to use the built-in AWS CDK project generator for TypeScript.

Let's make a directory for our CDK application:

mkdir my-cdk-app
cd my-cdk-app

then finally at the root of the new directory (my-cdk-app):

cdk init app --language typescript

Although not explicitly required, we highly recommend you put your CDK app under version control. Our favorite VCS by a long shot is Git! If you already have Git on your system, the AWS CDK project generator should initialize a Git repository for you, with the root of the new repository being my-cdk-app.

As the very last step, at the root of my-cdk-app we need to install the @aws-cdk/aws-lambda-python-alpha package to support Python AWS Lambda functions from our TypeScript CDK application. Since this is an experimental package, you'll need to pay attention to its updates, since it's possible AWS will introduce non-backward-compatible changes in future releases. We certainly are watching it! To install it from the root of my-cdk-app:

npm install @aws-cdk/aws-lambda-python-alpha@2.39.0-alpha.0

Without going down a rabbit hole: this package version, 2.39.0-alpha.0, is one of the more recently released versions supported by the AWS CDK project generator we are using (the aws-cdk-lib version pre-installed by the generator is currently 2.40.0, which is compatible with 2.39.0-alpha.0 as of this post's publication date).

Finally, with the prerequisites of our AWS CDK application to build this data mining cluster out of the way, we can dive into the mining cluster itself.

Note that all of the CDK-related code snippets in the following sections are written within the constructor of the class managing our CDK application's single CloudFormation stack. This class is called MyCdkAppStack and is located at lib/my-cdk-app-stack.ts:

import {
  Stack,
  StackProps,
  aws_s3,
  aws_lambda,
  aws_events,
  aws_ec2,
  aws_ecr,
  aws_sns,
  aws_sqs,
  aws_sns_subscriptions,
  aws_iam,
  aws_dynamodb,
  Duration,
} from "aws-cdk-lib"
import { PythonFunction } from "@aws-cdk/aws-lambda-python-alpha"
import { Construct } from "constructs"
import * as path from "path"
import { LambdaFunction } from "aws-cdk-lib/aws-events-targets"
import {
  SnsDestination,
} from "aws-cdk-lib/aws-s3-notifications"
import { readFileSync } from "fs"

export class MyCdkAppStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // **ALL OF OUR CDK CODE SNIPPETS WILL GO INSIDE THIS CONSTRUCTOR**
  }
}

Creating an SQS Queue

First, we need to create an SQS queue. The default SQS queue type is "standard", which offers at-least-once delivery and best-effort ordering rather than the strict ordering of a FIFO (First In First Out) queue. For our purposes, strict ordering is not needed: as we will see when we discuss the DynamoDB table, the messages in the queue already carry the exact timestamps that we use to build the composite primary keys for our main DynamoDB table, so the order of message processing is rectified during the DynamoDB PUT operation that adds the document to the table. This holds even if our EC2 instance becomes overwhelmed by the amount of data in a given message and falls behind the 1 minute interval (which we can fix simply by adding another independent EC2 instance also subscribed to the SQS queue). Additionally, standard SQS queues are cheaper than FIFO queues!
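
The following sketch illustrates why processing order does not matter for us. It uses a plain object as an in-memory stand-in for a table with a composite key, and it assumes a "YYYY-MM" partition key and ISO 8601 UTC timestamps as the sort key; those formats and all names here are assumptions for illustration, not our production schema.

```typescript
interface SentimentItem {
  dateMonth: string // partition key (hypothetical "YYYY-MM" format)
  timestamp: string // sort key, ISO 8601 UTC
  score: number
}

type Table = { [compositeKey: string]: SentimentItem }

// Writing by composite key is idempotent: a redelivered message overwrites its own item.
function putItem(table: Table, item: SentimentItem): void {
  table[item.dateMonth + "#" + item.timestamp] = item
}

// Read one partition back in ascending sort-key order.
function queryMonth(table: Table, dateMonth: string): SentimentItem[] {
  return Object.keys(table)
    .map((key) => table[key])
    .filter((item) => item.dateMonth === dateMonth)
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0))
}

// Simulated deliveries: out of order, with one duplicate.
const deliveries: SentimentItem[] = [
  { dateMonth: "2022-09", timestamp: "2022-09-01T12:02:00Z", score: 0.4 },
  { dateMonth: "2022-09", timestamp: "2022-09-01T12:00:00Z", score: 0.7 },
  { dateMonth: "2022-09", timestamp: "2022-09-01T12:02:00Z", score: 0.4 },
  { dateMonth: "2022-09", timestamp: "2022-09-01T12:01:00Z", score: 0.1 },
]

const table: Table = {}
deliveries.forEach((item) => putItem(table, item))
const ordered = queryMonth(table, "2022-09").map((item) => item.timestamp)
```

Because the key is derived from the data itself, a late or repeated message lands in the same slot it would have occupied anyway.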

We just need to create the new queue (in the Stack class constructor) like so:

// create our SQS Queue
const queue = new aws_sqs.Queue(this, "MiningQueue", {
  queueName: "MiningQueue",
})

Creating an SNS Topic

After creating the SQS queue for our cluster, next up is creating the SNS topic that will populate the SQS queue. The AWS CDK really shines for things like this; we only have to add two lines of code to create the SNS topic and subscribe the SQS queue to it:

// create our SNS Topic
const topic = new aws_sns.Topic(this, "MiningTopic")

// Push a message to our SQS queue from a notification broadcast by our SNS Topic
topic.addSubscription(new aws_sns_subscriptions.SqsSubscription(queue))

Importing an S3 Bucket

Next, let's import an S3 bucket for our mining cluster. This assumes the S3 bucket already exists (the AWS documentation has a guide on how to create an S3 bucket through the S3 console). Because an imported bucket is not managed by the stack, it cannot be deleted if this CDK application is destroyed, which is exactly the guarantee we want for the bucket we are treating as our "data lake". We want to guarantee the preservation of this data lake for a couple of reasons:

  • allows us to re-fine tune our BERT model
  • allows us to train new models with existing contextually similar text data (to offer additional model output, independent of our current model running in production) without concern of the state of this particular mining cluster
  • lets us use the same data lake to offer multiple model output concurrently from the same raw social media input data (this will be further explained in the conclusion)

For the code to add to our CDK application, we first need to import the bucket. After importing the S3 bucket, we tell it to send object create events (PUT events, that is, when we add an object to the S3 bucket) to our SNS topic, like so:

// import S3 bucket
const miningBucket = aws_s3.Bucket.fromBucketAttributes(
  this,
  "MiningBucket",
  {
    bucketName: "test-mining-cluster-bucket",
  }
)
// Send S3 PUT events to our SNS Topic
miningBucket.addEventNotification(
  aws_s3.EventType.OBJECT_CREATED_PUT,
  new SnsDestination(topic)
)

Creating a Python Lambda Function

This Python Lambda function is responsible for fetching raw social media data once per minute. Without digging too far into this particular Lambda function, suffice it to say that it has a few relevant lines of code:

import json
import os

import boto3
from decouple import config

# keeping this lambda function simple: baking in a few environment variables with a Lambda package local .env
AWS_DEFAULT_REGION = config('AWS_DEFAULT_REGION')
AWS_S3_MINING_BUCKET = config('AWS_S3_MINING_BUCKET')

# explicitly ensuring the correct environment AWS region before invoking boto3
os.environ['AWS_DEFAULT_REGION'] = AWS_DEFAULT_REGION
# using boto3, the AWS SDK for Python:
aws_client = boto3.client('s3')

#...
# *fetch social media raw data*
#...

# "output_data" is the raw social media post data retrieved from the social media API we are using initially
serialized_output_data = json.dumps(output_data)
aws_client.put_object(
     Body=serialized_output_data,
     Bucket=AWS_S3_MINING_BUCKET,
     # ":" characters in S3 object keys can require special handling, so replacing with "_"
     Key=f'social_media_data_{output_data["timestamp"].replace(":", "_")}.json')

How you write this Python Lambda function to get raw social media data is completely up to you. The only critical part is that the JSON file you create and add to the S3 bucket (once a minute) has a structure exactly like this:

{
  "posts": [
    {
      "title": "SOME POST TITLE",
      "content": "SOME RAW POST TEXT"
    },
    {
      "title": "SOME POST TITLE",
      "content": "SOME RAW POST TEXT"
    },
    {
      "title": "SOME POST TITLE",
      "content": "SOME RAW POST TEXT"
    }
  ],
  "timestamp": "SOME_UTC_TIMESTAMP"
}

In addition, for now we simply concatenate the title and the content of each post before passing the text to our model worker process, so posts without text content (for example, posts that contain only an image) are still handled through their title. We explain this model worker process in part 2.
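
As a rough TypeScript sketch of that contract, the type guard below checks that a parsed file has the posts and timestamp fields shown above, and two small helpers mirror the title and content concatenation and the object key naming. The single-space separator and all function names are assumptions made for illustration; the real Lambda and model worker are written in Python.

```typescript
interface SocialPost {
  title: string
  content: string
}

interface SocialPayload {
  posts: SocialPost[]
  timestamp: string
}

function isSocialPost(value: unknown): value is SocialPost {
  if (typeof value !== "object" || value === null) return false
  const candidate = value as { title?: unknown; content?: unknown }
  return typeof candidate.title === "string" && typeof candidate.content === "string"
}

// Runtime check that a parsed JSON value has the structure the pipeline expects.
function isSocialPayload(value: unknown): value is SocialPayload {
  if (typeof value !== "object" || value === null) return false
  const candidate = value as { posts?: unknown; timestamp?: unknown }
  return (
    typeof candidate.timestamp === "string" &&
    Array.isArray(candidate.posts) &&
    candidate.posts.every(isSocialPost)
  )
}

// Assumed separator: a single space between title and content.
function toModelInput(post: SocialPost): string {
  return (post.title + " " + post.content).trim()
}

// Mirrors the key naming in the Python snippet: every ":" becomes "_".
function objectKeyFor(timestamp: string): string {
  return "social_media_data_" + timestamp.replace(/:/g, "_") + ".json"
}

const parsed: unknown = JSON.parse(
  '{"posts":[{"title":"BTC up","content":""}],"timestamp":"2022-09-01T12:00:00Z"}'
)
const modelInputs = isSocialPayload(parsed) ? parsed.posts.map(toModelInput) : []
```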

Now the CDK TypeScript code (that again goes in our Stack class' constructor) to create our Python Lambda function and also give it permissions to read and write to our imported S3 bucket:

// create lambda to write to s3 bucket
const socialFetchLambda = new PythonFunction(this, "SocialFetch", {
  runtime: aws_lambda.Runtime.PYTHON_3_7,
  memorySize: 512,
  timeout: Duration.seconds(30),
  index: "app.py",
  handler: "handler",
  entry: path.join(__dirname, "socialFetch"),
})
// give this Python Lambda Function permission to read and write our imported S3 bucket
miningBucket.grantReadWrite(socialFetchLambda)

One important thing to note is that creating the Python Lambda function as we have here requires the Python Lambda code to live in a dedicated directory, located at my-cdk-app/lib/socialFetch, with an __init__.py file at the socialFetch directory root (to create a Python package) and an app.py file that defines a handler function at the top level of the app.py module.

We're not going into the details of socialFetch right now because they are a bit outside the scope of how to deploy an AWS CDK application.

Creating an EventBridge rule

Next, we have an EventBridge rule that calls the only Lambda function of this CDK application. This is the "starting point" of our cluster. This cluster of resources, which is a data processing pipeline, begins and completes within 1 minute, every minute (with around 40 seconds of margin of safety, as we will explain with creating the EC2 instance).
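
EventBridge cron expressions have six fields (minutes, hours, day-of-month, month, day-of-week, year), and one of the two day fields must be "?". The sketch below is a hypothetical helper, not the CDK's own implementation, that renders a set of named cron fields into that format so you can see what an every-minute schedule looks like on the EventBridge side.

```typescript
interface CronFields {
  minute?: string
  hour?: string
  day?: string
  month?: string
  weekDay?: string
  year?: string
}

// Render named fields as an EventBridge cron expression.
function toEventBridgeCron(fields: CronFields): string {
  if (fields.day !== undefined && fields.weekDay !== undefined) {
    throw new Error("Specify either day or weekDay, not both")
  }
  const minute = fields.minute !== undefined ? fields.minute : "*"
  const hour = fields.hour !== undefined ? fields.hour : "*"
  const month = fields.month !== undefined ? fields.month : "*"
  const year = fields.year !== undefined ? fields.year : "*"
  // Exactly one of the two day fields is rendered as "?".
  const day = fields.day !== undefined ? fields.day : fields.weekDay !== undefined ? "?" : "*"
  const weekDay = fields.weekDay !== undefined ? fields.weekDay : "?"
  return "cron(" + [minute, hour, day, month, weekDay, year].join(" ") + ")"
}

// The every-minute schedule used by this cluster.
const everyMinute = toEventBridgeCron({ minute: "*", hour: "*", day: "*", month: "*", year: "*" })
```

Here everyMinute evaluates to "cron(* * * ? * *)". A fixed-rate schedule such as aws_events.Schedule.rate(Duration.minutes(1)) is another way to express a once-a-minute trigger.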

For the AWS CDK code, we create the rule, and add the previously created Python Lambda Function as the sole target of the rule:

// create event bridge rule to invoke the lambda function 1x every minute 24/7
const rule = new aws_events.Rule(this, "ScheduleRule", {
  // this configuration is equivalent to the Unix cron expression "* * * * *" (every minute)
  schedule: aws_events.Schedule.cron({
    minute: "*",
    hour: "*",
    day: "*",
    month: "*",
    year: "*",
  }),
})

// add the rule target, our Python Lambda Function
rule.addTarget(new LambdaFunction(socialFetchLambda, { retryAttempts: 1 }))

Importing a DynamoDB Table

Can't preserve data without a table, most of the time. Our long-term persistence choice is DynamoDB. DynamoDB is a managed NoSQL document-oriented database. However, it can be difficult to use without understanding your data access patterns.

Unlike with a NoSQL database like MongoDB, you have to know ahead of time the PK (partition key) to query with. MongoDB allows you to query any field with a vast array of flexible query conditions.

DynamoDB is much stricter about both the structure of the PK (and the SK, the sort key) and their types, and there's a more limited number of query conditions (things that in DynamoDB parlance are called KeyConditionExpressions and FilterExpressions, which are very different from each other). So using more traditional UUIDs (Universally Unique IDentifiers) as keys with DynamoDB is a non-starter. Why bother with DynamoDB? Generally, two words: cost and performance. In our experience, DynamoDB is very cheap to run 24/7 compared with a MongoDB server that you pay for regularly; your costs with DynamoDB are more related to how much it's being accessed.

Performance: DynamoDB is known for single-digit millisecond response times, and there's a cool "custom tailored Redis" option for in-memory caching called DAX, which is a good option for read-intensive workloads and does not require you to write any custom code, unlike Redis caching.

So for the sake of month-to-month cost as well as performance, we went with DynamoDB. This has also made it easy to incorporate into our CDK app via importing, since all of our resources are in-house from AWS:

// import DynamoDB Table
const sentimentDataTable = aws_dynamodb.Table.fromTableAttributes(
  this,
  "sentimentData",
  {
    tableArn: `arn:aws:dynamodb:${Stack.of(this).region}:${
      Stack.of(this).account
    }:table/sentimentData`,
  }
)

As an aside, to create a new DynamoDB table (and you will have to be careful about the removal policy, so as not to inadvertently delete it), this example works:

// updated import from "aws-cdk-lib"
import {
  aws_dynamodb,
  RemovalPolicy,
} from "aws-cdk-lib"

const createdTable = new aws_dynamodb.Table(
  this,
  "SentimentDataTableV1",
  {
    tableName: "SentimentDataTableV1",
    // The most important thing is that the S3 data lake is unaffected, and with correct storing of model configuration,
    // all of this machine learning processed data stored in this table can be perfectly regenerated (if backups weren't enabled manually later)
    // if this table is unintentionally deleted in the worst-case scenario.
    removalPolicy: RemovalPolicy.DESTROY,
    partitionKey: {
      name: "dateMonth",
      type: aws_dynamodb.AttributeType.STRING,
    },
    sortKey: { name: "timestamp", type: aws_dynamodb.AttributeType.STRING },
  }
)

const readScaling = createdTable.autoScaleReadCapacity({
  minCapacity: 10,
  // set pretty high for this application, do not want to risk dropping reads
  maxCapacity: 40000,
})

readScaling.scaleOnUtilization({
  targetUtilizationPercent: 70,
})

const writeScaling = createdTable.autoScaleWriteCapacity({
  minCapacity: 10,
  maxCapacity: 40000,
})

writeScaling.scaleOnUtilization({
  targetUtilizationPercent: 70,
})

NOTE: this is just an example of how to create a new DynamoDB table. We are importing ours.
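
To show what "knowing your partition key ahead of time" means in practice, here is a sketch that builds the input for a DynamoDB Query against the example schema above (dateMonth as partition key, timestamp as sort key). It only constructs the request object; the "YYYY-MM" format for dateMonth and the helper names are assumptions, and "timestamp" is aliased through ExpressionAttributeNames because it is a DynamoDB reserved word.

```typescript
interface QueryInput {
  TableName: string
  KeyConditionExpression: string
  ExpressionAttributeNames: { [placeholder: string]: string }
  ExpressionAttributeValues: { [placeholder: string]: { S: string } }
}

// Hypothetical partition key format: the "YYYY-MM" prefix of an ISO 8601 timestamp.
function monthKey(isoTimestamp: string): string {
  return isoTimestamp.slice(0, 7)
}

// Build a Query for all items between two timestamps within one month partition.
function buildRangeQuery(tableName: string, startIso: string, endIso: string): QueryInput {
  if (monthKey(startIso) !== monthKey(endIso)) {
    throw new Error("A single Query can only target one partition key value")
  }
  return {
    TableName: tableName,
    // Equality on the partition key, a range condition on the sort key.
    KeyConditionExpression: "dateMonth = :month AND #ts BETWEEN :start AND :end",
    ExpressionAttributeNames: { "#ts": "timestamp" },
    ExpressionAttributeValues: {
      ":month": { S: monthKey(startIso) },
      ":start": { S: startIso },
      ":end": { S: endIso },
    },
  }
}

const lastHourQuery = buildRangeQuery(
  "SentimentDataTableV1",
  "2022-09-01T11:00:00Z",
  "2022-09-01T12:00:00Z"
)
```

A range that crosses a month boundary would need one Query per partition, which is the kind of access pattern worth deciding on before choosing the key structure.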

Importing an ECR repository

The docker container that will run inside our EC2 instance needs to be started from a pre-existing docker image that lives in a separate ECR (Elastic Container Registry) repository. We created this repository in part 2 of this blog post series, on how to run BERT in an EC2 instance. Since this ECR repository already exists, we'll just import it into our stack:

// import ECR repository
const ecrRepo = aws_ecr.Repository.fromRepositoryAttributes(
  this,
  "model-worker",
  {
    repositoryArn: `arn:aws:ecr:${Stack.of(this).region}:${
      Stack.of(this).account
    }:repository/model-worker`,
    repositoryName: "model-worker",
  }
)

Creating and initializing an EC2 instance

This is the last set of resources to provision for our mining cluster, heavily inspired by this post on how to create EC2 instances for a CDK application. There are a couple of resources we need to create in order for our EC2 instance to 1) be able to access the internet, 2) allow us to connect to it, and 3) have a sufficient level of permissions to interact with the rest of the resources of the mining cluster.

First, we need to create a VPC and a Security Group to allow us to SSH into it. SSH (Secure Shell) allows us access to the machine from a terminal (given the correct pair of public-private SSH keys) to debug what's going on.

// Create the VPC
const vpc = new aws_ec2.Vpc(this, "MiningVPC", {
  cidr: "10.0.0.0/16",
  natGateways: 0,
  subnetConfiguration: [
    { name: "public", cidrMask: 24, subnetType: aws_ec2.SubnetType.PUBLIC },
  ],
})

// Create the Security Group
const ec2SG = new aws_ec2.SecurityGroup(this, "ec2-sg", {
  vpc,
  allowAllOutbound: true,
})

// Add rule to allow SSH access
ec2SG.addIngressRule(
  aws_ec2.Peer.anyIpv4(),
  aws_ec2.Port.tcp(22),
  "allow SSH access from anywhere"
)

Next, we need to create an IAM role to give to the EC2 instance. This is what will allow our docker container to connect successfully to the resources we need it to be able to interact with, by using the permissions of the underlying machine. This post about using IAM roles to give permissions to EC2 instances is a good resource for learning more about how this works.

We initially create the IAM role with 2 managed policies. "Managed" here means these are pre-made policies offered by default by AWS. The AmazonS3ReadOnlyAccess policy allows read-only access to any S3 bucket in the same AWS account. The AmazonSQSFullAccess policy allows full access to any SQS queue in the given AWS account. Admittedly, this is a broad policy. However, for the sake of simplicity, and given that this SQS queue never contains sensitive or confidential data (frankly, it only contains public data offered by the social media APIs themselves), we decided the tradeoff was worth it. Afterwards, we grant the cluster-specific permissions our EC2 instance needs.

// Create the IAM role for our EC2 instance
const ec2Role = new aws_iam.Role(this, "ec2-mining-cluster-role", {
  assumedBy: new aws_iam.ServicePrincipal("ec2.amazonaws.com"),
  managedPolicies: [
    aws_iam.ManagedPolicy.fromAwsManagedPolicyName(
      "AmazonS3ReadOnlyAccess"
    ),
    aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonSQSFullAccess"),
  ],
})

// grant cluster-specific permissions to this IAM role for access to S3, ECR and DynamoDB
miningBucket.grantRead(ec2Role)
ecrRepo.grantPull(ec2Role)
sentimentDataTable.grantReadWriteData(ec2Role)
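
If you later want something narrower than AmazonSQSFullAccess, the sketch below shows roughly what a consumer-only policy document scoped to a single queue could look like. It is an illustration rather than the policy we deploy: the helper name is hypothetical and the account ID is a placeholder.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny"
  Action: string[]
  Resource: string[]
}

interface PolicyDocument {
  Version: "2012-10-17"
  Statement: PolicyStatement[]
}

// Build an IAM policy document that only allows consuming messages from one queue.
function queueConsumerPolicy(region: string, accountId: string, queueName: string): PolicyDocument {
  const queueArn = "arn:aws:sqs:" + region + ":" + accountId + ":" + queueName
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        // The actions a polling listener needs: look up the queue, read, then delete.
        Action: [
          "sqs:GetQueueUrl",
          "sqs:GetQueueAttributes",
          "sqs:ReceiveMessage",
          "sqs:DeleteMessage",
        ],
        Resource: [queueArn],
      },
    ],
  }
}

// Placeholder account ID, for illustration only.
const miningQueuePolicy = queueConsumerPolicy("us-east-1", "123456789012", "MiningQueue")
```

Within the CDK, calling queue.grantConsumeMessages(ec2Role) achieves a similar queue-scoped grant without writing the policy by hand.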

Now with the IAM role set up sufficiently for our EC2 instance, let's actually create the EC2 instance. We are using the t2.large EC2 instance type. Because it is a burstable instance type, we found it cost-effective, with a fair margin of safety for handling spikes in the volume of data delivered by the social media APIs on a minute-by-minute basis.

// create the EC2 instance
const ec2Instance = new aws_ec2.Instance(this, "ec2-instance", {
  vpc,
  vpcSubnets: {
    subnetType: aws_ec2.SubnetType.PUBLIC,
  },
  role: ec2Role,
  securityGroup: ec2SG,
  instanceType: aws_ec2.InstanceType.of(
    aws_ec2.InstanceClass.T2,
    aws_ec2.InstanceSize.LARGE
  ),
  machineImage: new aws_ec2.AmazonLinuxImage({
    generation: aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2,
  }),
  // You need to create an SSH key for connecting to this EC2 instance
  keyName: "YOUR-SSH-KEY-NAME",
})

Check out this info for how to create an SSH key for connecting to your EC2 instances.

There's some custom setup we need our EC2 instance to run when it starts up. This is chiefly installing docker, correctly pulling the "model-worker" docker image, and starting a docker container from that docker image in a background process.

We treat the ec2_secrets.sh script as if it were a .env file - not managed by VCS (Git in our case).

The ec2_secrets.sh script:

AWS_REGION=YOUR_AWS_REGION
AWS_ACCOUNT_ID=YOUR_AWS_ACCOUNT_ID
AWS_DYNAMODB_SENTIMENT_DATA_TABLE=sentimentData
AWS_MINING_QUEUE=MiningQueue
AWS_MINING_ECR_IMAGE=model-worker

The ec2_setup.sh script that installs docker and ultimately starts the model-worker docker container in a background process:

#!/usr/bin/env bash

function ec2_setup {
  sudo yum install docker -y && \
  sleep 1 && \
  sudo systemctl start docker && \
  sleep 1 && \
  aws configure set default.region $AWS_REGION && \
  sleep 1 && \
  aws ecr get-login-password --region $AWS_REGION | sudo docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com && \
  sleep 1 && \
  sudo docker pull $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$AWS_MINING_ECR_IMAGE:latest && \
  sleep 1 && \
  sudo docker run \
  -e AWS_REGION=$AWS_REGION \
  -e QUEUE_NAME=$AWS_MINING_QUEUE \
  -e SENTIMENT_DATA_TABLE=$AWS_DYNAMODB_SENTIMENT_DATA_TABLE \
  -d $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$AWS_MINING_ECR_IMAGE:latest
}

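# record both standard output and standard error in the log file
# (the file redirect must come before 2>&1, otherwise standard error is not captured)
ec2_setup >> ec2_setup_error.txt 2>&1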

Finally, we'll add the userdata to our EC2 instance:

// local bash scripts to be executed on EC2 instance startup
const secretsScript = readFileSync(
  "./lib/modelWorker/ec2_secrets.sh",
  "utf8"
)
const userDataScript = readFileSync(
  "./lib/modelWorker/ec2_setup.sh",
  "utf8"
)

// add these bash scripts to the EC2 instance's user data to be executed on startup.
ec2Instance.addUserData(secretsScript + "\n" + userDataScript)

And that's it! That is the entire data mining cluster CDK application code for creating and managing the resources we need.

All that's left to do is make sure we implemented this stack correctly.

We need to run:

cdk synth 

from the root of the Git repo. This command translates, or "synthesizes", our TypeScript into a template (printed as YAML) that CloudFormation can directly consume.

If that command succeeds, we are good to actually deploy the CDK app with the following command:

cdk deploy

When you are finished testing out this data mining cluster, be sure to clean up and tear down your used AWS resources simply with this command, or else you'll continue to be charged for the AWS resources you are using:

cdk destroy

Conclusion

In this post we have explained how we deploy our data mining cluster for social media sentiment analysis. While not hands-off like a typical CI/CD process, we like the granularity of control we have through AWS CDK, and recommend using AWS CDK for non-client-facing applications that can't be managed by a toolchain like AWS Amplify. Hopefully, this post has answered how we deploy our data mining cluster and helps you work with AWS CDK for your own applications.

Like this post? Share on social media and connect with us on Twitter!

Since you've made it this far, here's all the code for the CDK application in one snippet:

import {
  Stack,
  StackProps,
  aws_s3,
  aws_lambda,
  aws_events,
  aws_ec2,
  aws_ecr,
  aws_sns,
  aws_sqs,
  aws_sns_subscriptions,
  aws_iam,
  aws_dynamodb,
  Duration,
} from "aws-cdk-lib"
import { PythonFunction } from "@aws-cdk/aws-lambda-python-alpha"
import { Construct } from "constructs"
import * as path from "path"
import { LambdaFunction } from "aws-cdk-lib/aws-events-targets"
import {
  SnsDestination,
} from "aws-cdk-lib/aws-s3-notifications"
import { readFileSync } from "fs"

export class MyCdkAppStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // create our SQS Queue
    const queue = new aws_sqs.Queue(this, "MiningQueue", {
      queueName: "MiningQueue",
    })

    // create our SNS Topic
    const topic = new aws_sns.Topic(this, "MiningTopic")

    // Push a message to our SQS queue from a notification broadcast by our SNS Topic
    topic.addSubscription(new aws_sns_subscriptions.SqsSubscription(queue))

    // import S3 bucket
    const miningBucket = aws_s3.Bucket.fromBucketAttributes(
      this,
      "MiningBucket",
      {
        bucketName: "test-mining-cluster-bucket",
      }
    )

    // Send S3 PUT events to our SNS Topic
    miningBucket.addEventNotification(
      aws_s3.EventType.OBJECT_CREATED_PUT,
      new SnsDestination(topic)
    )

    // create lambda to write to s3 bucket
    const socialFetchLambda = new PythonFunction(this, "SocialFetch", {
      runtime: aws_lambda.Runtime.PYTHON_3_7,
      memorySize: 512,
      timeout: Duration.seconds(30),
      index: "app.py",
      handler: "handler",
      entry: path.join(__dirname, "socialFetch"),
    })

    // give this Python Lambda Function permission to read and write our imported S3 bucket
    miningBucket.grantReadWrite(socialFetchLambda)

    // create event bridge rule to invoke the lambda function 1x every minute 24/7
    const rule = new aws_events.Rule(this, "ScheduleRule", {
      // this configuration is equivalent to the Unix cron expression "* * * * *" (every minute)
      schedule: aws_events.Schedule.cron({
        minute: "*",
        hour: "*",
        day: "*",
        month: "*",
        year: "*",
      }),
    })

    // add the rule target, our Python Lambda Function
    rule.addTarget(new LambdaFunction(socialFetchLambda, { retryAttempts: 1 }))

    // import DynamoDB Table
    const sentimentDataTable = aws_dynamodb.Table.fromTableAttributes(
      this,
      "sentimentData",
      {
        tableArn: `arn:aws:dynamodb:${Stack.of(this).region}:${
          Stack.of(this).account
        }:table/sentimentData`,
      }
    )

    // import ECR repository
    const ecrRepo = aws_ecr.Repository.fromRepositoryAttributes(
      this,
      "model-worker",
      {
        repositoryArn: `arn:aws:ecr:${Stack.of(this).region}:${
          Stack.of(this).account
        }:repository/model-worker`,
        repositoryName: "model-worker",
      }
    )

    // Create the VPC
    const vpc = new aws_ec2.Vpc(this, "MiningVPC", {
      cidr: "10.0.0.0/16",
      natGateways: 0,
      subnetConfiguration: [
        { name: "public", cidrMask: 24, subnetType: aws_ec2.SubnetType.PUBLIC },
      ],
    })

    // Create the Security Group
    const ec2SG = new aws_ec2.SecurityGroup(this, "ec2-sg", {
      vpc,
      allowAllOutbound: true,
    })

    // Add rule to allow SSH access
    ec2SG.addIngressRule(
      aws_ec2.Peer.anyIpv4(),
      aws_ec2.Port.tcp(22),
      "allow SSH access from anywhere"
    )

    // Create the IAM role for our EC2 instance
    const ec2Role = new aws_iam.Role(this, "ec2-mining-cluster-role", {
      assumedBy: new aws_iam.ServicePrincipal("ec2.amazonaws.com"),
      managedPolicies: [
        aws_iam.ManagedPolicy.fromAwsManagedPolicyName(
          "AmazonS3ReadOnlyAccess"
        ),
        aws_iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonSQSFullAccess"),
      ],
    })

    // grant cluster-specific permissions to this IAM role for access to S3, ECR and DynamoDB
    miningBucket.grantRead(ec2Role)
    ecrRepo.grantPull(ec2Role)
    sentimentDataTable.grantReadWriteData(ec2Role)

    // create the EC2 instance
    const ec2Instance = new aws_ec2.Instance(this, "ec2-instance", {
      vpc,
      vpcSubnets: {
        subnetType: aws_ec2.SubnetType.PUBLIC,
      },
      role: ec2Role,
      securityGroup: ec2SG,
      instanceType: aws_ec2.InstanceType.of(
        aws_ec2.InstanceClass.T2,
        aws_ec2.InstanceSize.LARGE
      ),
      machineImage: new aws_ec2.AmazonLinuxImage({
        generation: aws_ec2.AmazonLinuxGeneration.AMAZON_LINUX_2,
      }),
      // You need to create an SSH key for connecting to this EC2 instance
      keyName: "YOUR-SSH-KEY-NAME",
    })

    // local bash scripts to be executed on EC2 instance startup
    const secretsScript = readFileSync(
      "./lib/modelWorker/ec2_secrets.sh",
      "utf8"
    )
    const userDataScript = readFileSync(
      "./lib/modelWorker/ec2_setup.sh",
      "utf8"
    )

    // add these bash scripts to the EC2 instance's user data to be executed on startup.
    ec2Instance.addUserData(secretsScript + "\n" + userDataScript)
  }
}