Overview
This post is about how to distribute data uniformly among Kinesis shards. Before moving forward, go through the Kinesis documentation to get a basic understanding of Kinesis Data Streams. I used Lambda as a consumer for the Kinesis data stream, so I will discuss things in Lambda's context.
Read this article about Kinesis stream and AWS Lambda.
Problem
A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams.
To put a record into a Kinesis stream, a partition key must be provided with each record. An MD5 hash function maps each partition key to a 128-bit integer value, and the record is then assigned to a shard based on the hash key ranges of the shards.
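The mapping can be sketched like this (assuming, as the Kinesis client libraries do, that the MD5 digest is read as a big-endian unsigned integer):

```python
import hashlib

def partition_key_to_hash(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer the way Kinesis does:
    MD5-hash the UTF-8 bytes and interpret the digest as an unsigned integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# The result always falls in [0, 2**128), the space that the
# stream's shards split among their hash key ranges.
```

Whichever shard's hash key range contains this value receives the record.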
The problem arises when the partition keys are not random and the MD5 hash function maps records into one shard's hash key range more often than the others.
In the above image, we can see that shard #2 has more records than the other shards because of the non-random partition key assignment.
Non-uniform distribution of messages causes Lambda's iterator age to increase, and there is a chance of losing data. Increasing the retention period mitigates the data loss, but it also increases the operational cost of Kinesis.
Iterator age is the time between when the last record in a batch was written to the stream and when Lambda reads that record.
Iterator age depends on other parameters, such as Lambda function execution duration, shard count, and batch size.
For more information, see AWS Lambda CloudWatch Metrics.
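To keep an eye on this, we can query Lambda's IteratorAge metric (published in the AWS/Lambda CloudWatch namespace, in milliseconds). A minimal sketch with boto3; the function name my-consumer is a placeholder:

```python
import datetime

def iterator_age_query(function_name: str, minutes: int = 60) -> dict:
    """Build the get_metric_statistics parameters for Lambda's IteratorAge
    metric over the last `minutes` minutes."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,                # one datapoint per minute
        "Statistics": ["Maximum"],   # worst-case lag per period
    }

# usage (requires AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# stats = cloudwatch.get_metric_statistics(**iterator_age_query("my-consumer"))
```

A steadily growing maximum IteratorAge is the signal that records risk aging out of the stream before Lambda processes them.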
Kinesis can retain data for up to 7 days; the default retention period is 24 hours.
Solution
There are two ways to distribute data among shards uniformly:
Assign each record a unique partition key (for example, a UUID); this gives a better chance of uniform distribution.
Assign an ExplicitHashKey to each record. The partition key is still required by the API, but the explicit hash key overrides its hash, so we can make sure each record lands in the shard of our choice. In Python, we can use boto3's list_shards method to get shard information. It returns information in this format:
{
    "Shards": [
        {
            "ShardId": "shardId-000000000000",
            "HashKeyRange": {
                "StartingHashKey": "0",
                "EndingHashKey": "113427455640312821154458202477256070484"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "496103516369698317099491240188243335841720557371394"
            }
        },
        {
            "ShardId": "shardId-000000000001",
            "HashKeyRange": {
                "StartingHashKey": "113427455640312821154458202477256070485",
                "EndingHashKey": "226854911280625642308916404954512140969"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "49610351636992132529573630114381723961608490082063351826"
            }
        },
        {
            "ShardId": "shardId-000000000002",
            "HashKeyRange": {
                "StartingHashKey": "226854911280625642308916404954512140970",
                "EndingHashKey": "340282366920938463463374607431768211455"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "49610351637014433274772160737523259679881138443569332258"
            }
        }
    ]
}
As we can see, each shard has its own hash key range, so using that information we can generate a key that lies between a shard's StartingHashKey and EndingHashKey. With this approach, we can implement a round-robin algorithm that puts records into each shard one by one, achieving uniform data distribution among the Kinesis data stream's shards.
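The round-robin idea can be sketched as follows. This is a minimal sketch, not a production implementation: the helper picks one explicit hash key per shard (here, the midpoint of each range), and the producer cycles through those keys. The stream name and the "ignored" partition key are placeholders.

```python
import itertools

def shard_hash_keys(shards: list) -> list:
    """Pick one ExplicitHashKey per shard: the midpoint of its hash key range."""
    keys = []
    for shard in shards:
        start = int(shard["HashKeyRange"]["StartingHashKey"])
        end = int(shard["HashKeyRange"]["EndingHashKey"])
        keys.append(str((start + end) // 2))  # ExplicitHashKey must be a string
    return keys

def put_records_round_robin(kinesis, stream_name: str, records: list) -> None:
    """Write records one by one, cycling through the shards' hash keys
    so that each shard receives an equal share of the traffic."""
    shards = kinesis.list_shards(StreamName=stream_name)["Shards"]
    for record, hash_key in zip(records, itertools.cycle(shard_hash_keys(shards))):
        kinesis.put_record(
            StreamName=stream_name,
            Data=record,
            PartitionKey="ignored",     # still required by the API
            ExplicitHashKey=hash_key,   # overrides the partition key's hash
        )

# usage (requires AWS credentials):
# import boto3
# kinesis = boto3.client("kinesis")
# put_records_round_robin(kinesis, "my-stream", [b"event-1", b"event-2", b"event-3"])
```

Note that if shards are split or merged, the hash key ranges change, so the shard list should be refreshed periodically rather than cached forever.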
Final Thoughts
The approach discussed in this story is one way to achieve uniform data distribution among Kinesis shards. Please share your feedback about anything that can be improved or that I missed. Thank you!
Resources
https://docs.aws.amazon.com/streams/latest/dev/introduction.html