In my last article I loaded some items into a freshly-provisioned table using a Custom Resource. I explained that this approach doesn't scale because I'm essentially mapping my batchWriteItem statements into a CloudFormation template, and there are limits to the size of that template. The next thing I wanted to try was writing my own Lambda handler to do the same work at a greater clip.
I wasn't very far into that journey when I got a great comment that really put me on the right track. I ended up learning a lot more about DynamoDB and the way these components fit together.
tl;dr
You can skip ahead to the code.
Faker Lambda
The first step is to move my code that generates fake friends into a Lambda function. I'm not yet completely sure what I want my project structure to be, but I like the idea of running the whole thing with a single npm install. I've seen nested packages and I know I could support something like that with lerna or heck even a bash script, but I'm going to defer on that for now. I've tried some other experiments where I build with webpack in order to remove unneeded dependencies and that seems to work okay, but only because I have already taken the time to learn webpack.
So without further discussion around naming conventions, and with 100% certainty that I'll be disavowing the ones I've used here inside of six months, here's what I'm doing:
I'm going to take the private methods that were in my stack class and add them to my Lambda. I'm not using the es6 class syntactic sugar here because I don't see any benefit of doing so. These can just be a couple of functions that my handler can use. I'm copying in that interface as well so I can take full advantage of TypeScript.
import { commerce, name, random } from 'faker';

// The shape of a single fake item.
interface IFriend {
  id: string;
  firstName: string;
  lastName: string;
  shoeSize: number;
  favoriteColor: string;
}

// Generate one fake friend.
const generateItem = (): IFriend => {
  return {
    id: random.uuid(),
    firstName: name.firstName(),
    lastName: name.lastName(),
    shoeSize: random.number({ max: 25, min: 1, precision: 0.1 }),
    favoriteColor: commerce.color(),
  };
};

// Build a batch of PutRequests sized for batchWrite (25 items max per call).
const generateBatch = (batchSize = 25): { PutRequest: { Item: IFriend } }[] => {
  return new Array(batchSize).fill(undefined).map(() => {
    return { PutRequest: { Item: generateItem() } };
  });
};
This changed slightly. I no longer have to specify the DynamoDB attribute types, because the DocumentClient in the JavaScript SDK marshals native JavaScript values for me, whereas the low-level DynamoDB API wants explicit type descriptors. AWS is an amazing platform for development, but there are lots of little differences like this that can trip you up. Best to expect them.
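To illustrate the difference (this comparison is mine, not from the article), the same item looks like this in each style:

// Low-level DynamoDB API: every attribute carries an explicit type descriptor.
const lowLevelItem = {
  id: { S: '3f8c...' },
  firstName: { S: 'Ada' },
  shoeSize: { N: '9.5' }, // numbers are sent as strings
};

// DocumentClient: plain JavaScript values, marshalled automatically.
const documentClientItem = {
  id: '3f8c...',
  firstName: 'Ada',
  shoeSize: 9.5,
};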
My Lambda also needs a handler.
import { CloudFormationCustomResourceEventCommon } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();

export const handler = async (event: CloudFormationCustomResourceEventCommon): Promise<void> => {
  const { ReadWriteCapacity, DesiredCount, TableName } = event.ResourceProperties;
  for (let i = 0; i < DesiredCount / ReadWriteCapacity; i++) {
    // Fire off ReadWriteCapacity / 25 batchWrite calls without awaiting them individually.
    const batch = new Array(ReadWriteCapacity / 25)
      .fill(undefined)
      .map(() => db.batchWrite({ RequestItems: { [TableName]: generateBatch() } }).promise());
    try {
      await Promise.all(batch);
      console.log(`Batch ${i} complete. ${ReadWriteCapacity} items written.`);
    } catch (e) {
      console.error('Batch write failed! ', e);
    }
  }
};
I'm still using batchWrite, and here I'm nesting Promise.all inside a loop to run several batches in parallel. Let me break this down a little just in case you aren't familiar with modern JavaScript.
First, I'm doing a standard for loop. The intent of this structure is that I'll run the maximum number of writes - in this case 40,000 (I'll come back to that in a moment) - per iteration. Since I want a million fake friends, there will be 25 iterations. The iterations run sequentially: each one finishes before the next begins.
Inside my loop, I'm using batchWrite, which will take care of 25 writes at a time, so I'm going to load up an array with 1,600 (40,000 / 25) promises. I do this by initializing an array with 1,600 elements, then using the map method to replace each element with db.batchWrite({ RequestItems: { [TableName]: generateBatch() } }).promise(). This generates the batch inline and calls batchWrite, but doesn't wait for that work to finish.
Next I let the entire iteration complete with await Promise.all(batch); and log success before moving on to the next iteration.
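If that fill/map/Promise.all pattern is new to you, here's a tiny standalone illustration of the same idea; the asyncWork function is a toy of my own, standing in for db.batchWrite(...).promise():

// A toy async task standing in for db.batchWrite(...).promise().
const asyncWork = (n: number): Promise<number> =>
  new Promise((resolve) => setTimeout(() => resolve(n), 100));

const run = async (): Promise<void> => {
  // Start five tasks immediately; map returns an array of pending promises.
  const pending = new Array(5).fill(undefined).map((_, i) => asyncWork(i));
  // Wait for all of them to settle before moving on.
  const results = await Promise.all(pending);
  console.log(results); // [0, 1, 2, 3, 4]
};

run();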
DynamoDB WCU
To get this to work (and again thanks to @rehanvdm for putting me on the right track), I need to understand DynamoDB WCU - write capacity units. Very briefly, any write to DynamoDB consumes one or more WCU. One WCU lets me write one item of up to 1 KB per second; a larger item consumes more units. In effect, WCU is my throughput limit for writes. By default, CDK will provision a table with 5 WCU. The maximum I can provision for a single table without a limit increase is 40,000 WCU.
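Just to make the arithmetic concrete, here's how those numbers drive the loop in the handler (the values are the same ones declared in the stack below):

const ReadWriteCapacity = 40000; // provisioned WCU (and RCU) on the table
const DesiredCount = 1000000;    // total fake friends to create

const iterations = DesiredCount / ReadWriteCapacity; // 25 sequential loop iterations
const callsPerIteration = ReadWriteCapacity / 25;    // 1,600 parallel batchWrite calls
const itemsPerIteration = callsPerIteration * 25;    // 40,000 items written per iteration

console.log({ iterations, callsPerIteration, itemsPerIteration });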
WARNING: while the vast majority of what I'm doing here can be done at the free tier, Dynamo bills by WCU (and also read capacity and storage). If you run this yourself, be sure to cdk destroy as soon as you're satisfied with your million items. The cost to leave this table sitting around unused for a month is approximately $20,000 US.
So hopefully by now the math makes sense. I simplified things a little by setting read and write capacity to the same value (ReadWriteCapacity), but you don't need to do that at all. I'm passing this value, as well as the table name and the total I'm going for, from my CDK stack. Note that CloudFormation uses PascalCase and so must I. I originally named this variable readWriteCapacity and CDK automatically capitalized the first letter. I figured consistency would be less confusing, even though that's not a convention I'm used to.
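To make that concrete, this is roughly the shape of the event my handler ends up receiving. The payload below is my approximation, not a captured event: CloudFormation adds fields like ServiceToken, and numeric property values may arrive as strings after passing through CloudFormation.

// An approximation of event.ResourceProperties as the handler sees it.
const exampleResourceProperties = {
  ServiceToken: 'arn:aws:lambda:...', // added by CloudFormation, not by my stack
  ReadWriteCapacity: '40000',         // numbers may be stringified in transit
  DesiredCount: '1000000',
  TableName: 'friends',
};

Even if the numbers do arrive as strings, the division in the handler still works because JavaScript coerces the operands to numbers.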
Lambda Test
All this is still pretty new to me so I want to write a unit test for my Lambda function. I'm leaning heavily on jest mocks because I want to be able to run the test without provisioning any AWS infrastructure. Lambda functions are fairly easy to write unit tests for because, thanks to typings, I can get a good idea of what different events should look like. I can call the handler directly with any event. I'm even mocking the console here to keep my output clean.
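Since the test itself isn't shown here, this is a minimal sketch of what such a test might look like, assuming the handler lives in src/lambda/init-db.ts; the mock shapes and the small DesiredCount are my own choices, not the author's actual test.

import { handler } from '../src/lambda/init-db';

// Variables referenced inside the jest.mock factory must start with "mock".
const mockPromise = jest.fn().mockResolvedValue({});
const mockBatchWrite = jest.fn(() => ({ promise: mockPromise }));

jest.mock('aws-sdk', () => ({
  DynamoDB: {
    DocumentClient: jest.fn(() => ({ batchWrite: mockBatchWrite })),
  },
}));

describe('init-db handler', () => {
  it('writes DesiredCount items in parallel batches', async () => {
    // Keep the test output clean by silencing the handler's logging.
    jest.spyOn(console, 'log').mockImplementation(() => undefined);

    await handler({
      ResourceProperties: { ReadWriteCapacity: 50, DesiredCount: 100, TableName: 'friends' },
    } as any);

    // 100 items / 25 items per batchWrite call = 4 calls expected.
    expect(mockBatchWrite).toHaveBeenCalledTimes(4);
  });
});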
CDK
My stack is under 50 lines so I'll go through the entire thing, even though some of it is just copied from my last article.
export class CdkDynamoLambdaLoaderStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const TableName = 'friends';
    const ReadWriteCapacity = 40000;
    const DesiredCount = 1000000;

    // Create a table
    const friendsTable = new Table(this, 'FriendsTable', {
      partitionKey: { name: 'id', type: AttributeType.STRING },
      readCapacity: ReadWriteCapacity,
      removalPolicy: RemovalPolicy.DESTROY,
      tableName: TableName,
      writeCapacity: ReadWriteCapacity,
    });
As I mentioned before, I'm using a common value for read and write capacity. I'll get to the reason for that in a moment. These constants could be passed in via StackProps, but I'm just declaring them inside my constructor for demonstration purposes.
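If I did want to pass them in, a sketch of that might look something like this; the props interface name and shape are mine, not part of the actual stack.

// A hypothetical props interface for passing the constants in from the app.
interface LoaderStackProps extends StackProps {
  tableName: string;
  readWriteCapacity: number;
  desiredCount: number;
}

export class CdkDynamoLambdaLoaderStack extends Stack {
  constructor(scope: Construct, id: string, props: LoaderStackProps) {
    super(scope, id, props);
    const { tableName, readWriteCapacity, desiredCount } = props;
    // ...the table, Lambda, and custom resource below would use these values...
  }
}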
    const initDBLambda = new Function(this, 'initDBFunction', {
      code: new AssetCode(`${__dirname}/lambda`),
      handler: 'init-db.handler',
      memorySize: 3000,
      runtime: Runtime.NODEJS_12_X,
      timeout: Duration.minutes(15),
    });
My Lambda declaration points to my source and indicates the method being called. I've set max timeout and memory, though I probably don't need them. This kind of function won't be called that many times so the cost is negligible.
    friendsTable.grant(initDBLambda, 'dynamodb:BatchWriteItem');
When I was working entirely with CloudFormation custom resources, I could rely upon CDK to automatically create the minimum roles I needed. Now that I'm calling DynamoDB inside a Lambda, I need to grant permission explicitly, otherwise my Lambda would not be able to write to Dynamo. I could use friendsTable.grantFullAccess or friendsTable.grantWriteData, but instead I'm opting to give the minimum permission required to do the work. My Lambda will be able to use batchWrite, but it would not be able to put a single item.
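For comparison, the broader grants mentioned above are one-liners too; either of these is shown only for illustration and isn't used in this stack:

    // Broader alternatives to the scoped grant above (not used here).
    friendsTable.grantWriteData(initDBLambda);  // all write actions, including PutItem
    friendsTable.grantFullAccess(initDBLambda); // full read/write access to the table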
    const initDbProvider = new Provider(this, 'initDBProvider', {
      onEventHandler: initDBLambda,
    });

    new CustomResource(this, 'initDBResource', {
      provider: initDbProvider,
      properties: {
        ReadWriteCapacity,
        DesiredCount,
        TableName,
      },
    });
  }
}
With this last bit, I'm gluing my Lambda into CloudFormation so it gets called after my table is provisioned, with the specified properties. Under the hood, the Provider does this with another Lambda function that handles the lifecycle and invocation of mine.
Build and Deploy
So now I can build and deploy my new stack with the same setup I used last time around, with one little hitch: I need the faker library for my Lambda to run. There are a few different ways to package that up, one of them being nested package.json files and build processes, but I don't want to do that. I could use webpack to bundle faker into my Lambda, but that solution introduces complexity I don't need when I just want to add a single module. Instead I'll just copy the module with a utility called copyfiles and my build script: "build": "tsc && copyfiles \"node_modules/faker/**/*\" build/src/lambda". This is not a scalable solution, but it works when I just have a single module. My build output now looks like this:
My tests wind up in my build output. That doesn't really cause any problems, but it does muddy the waters a little. I could exclude them with (again) a webpack build, but that feels like another article. It's also possible to do some tricks with tsconfig.json.
I should be good to go now, so I check my account credentials and npm run build && cdk deploy. I can log into the AWS console, check the progress of my stack being created, and look at my Lambda logs in CloudWatch. I'll see the output of each iteration. Looks like they take just over 2 seconds.
I can also look at my successful Lambda invocation.
Seems a million items were created in less than a minute. Since I set my RCU (read capacity units) to 40,000 as well, I can head to the DynamoDB dashboard and get a count. If I had left RCU at the default, the count would time out and need to be restarted over and over, and I'd end up paying more for the whole operation.
Next Steps
cdk destroy! This table is expensive to keep around. I've already used up most of the credits I earned at re:Invent, but it was well worth it.
This was a useful exercise for understanding the power and scaling of Lambda and DynamoDB, and how easy it is to connect them with an IAM role via CDK. I still think a stack that uploads a CSV file for deterministic data sets would be worthwhile. I might try something like this with Aurora as well, since there's a serverless option.
Cover image: Venus at the Forge of Vulcan. Jan Brueghel and Hendrick van Balen, circa 1600. Kaiser Friedrich Museum, Berlin.
Top comments (1)
Haha this is great! Couldn't figure out how to run batchWrites in parallel, looked into Data Pipeline, Batch Jobs, Step Functions when everything you need is just multiple parallel Promises within one Lambda. So cool!