Overview
This post is about how to distribute data uniformly among Kinesis shards. Before moving forward, go through the Kinesis documentation to get a basic understanding of Kinesis Data Streams. I used Lambda as a consumer for the Kinesis data stream, so I will discuss things in Lambda's context.
Read this article about Kinesis stream and AWS Lambda.
Problem
A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams.
To put a record into a Kinesis stream, a partition key must be provided with each record. An MD5 hash function maps each partition key to a 128-bit integer value, and the record is then assigned to a shard based on the hash key ranges of the shards.
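The mapping can be sketched like this (assuming, as the Kinesis client libraries do, that the MD5 digest is read as a big-endian unsigned integer):

```python
import hashlib

def partition_key_to_hash(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer the way Kinesis does:
    MD5-hash the UTF-8 bytes and interpret the digest as an unsigned integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# The result always falls in [0, 2**128), the space that the
# stream's shards split among their hash key ranges.
```

Whichever shard's hash key range contains this value receives the record.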
The problem arises when the partition keys are not random and the MD5 hash function maps records into one shard's hash key range more often than the others.
In the above image, we can see that shard #2 has more records than the other shards because of the non-random partition key assignment.
Non-uniform distribution of messages causes Lambda's iterator age to increase, and there is a chance of losing data. Increasing the retention period mitigates the data loss, but it also increases the operational cost of Kinesis.
Iterator age is the time between when the last record in a batch was written to the stream and when Lambda reads that record.
Iterator age depends on other parameters, such as Lambda function execution duration, shard count, and batch size.
For more information, see AWS Lambda CloudWatch Metrics.
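To keep an eye on this, we can query Lambda's IteratorAge metric (published in the AWS/Lambda CloudWatch namespace, in milliseconds). A minimal sketch with boto3; the function name my-consumer is a placeholder:

```python
import datetime

def iterator_age_query(function_name: str, minutes: int = 60) -> dict:
    """Build the get_metric_statistics parameters for Lambda's IteratorAge
    metric over the last `minutes` minutes."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,                # one datapoint per minute
        "Statistics": ["Maximum"],   # worst-case lag per period
    }

# usage (requires AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# stats = cloudwatch.get_metric_statistics(**iterator_age_query("my-consumer"))
```

A steadily growing maximum IteratorAge is the signal that records risk aging out of the stream before Lambda processes them.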
Kinesis can retain data for up to 7 days; the default retention period is 24 hours.
Solution
There are two ways to distribute data among shards uniformly:
Assign each record a unique partition key (for example, a UUID); this gives a better chance of uniform distribution.
Assign an ExplicitHashKey to each record. The partition key is still required by the API, but the explicit hash key overrides its hash, so we can make sure each record lands in the shard of our choice. In Python, we can use boto3's list_shards method to get shard information. It returns information in this format:
{
    "Shards": [
        {
            "ShardId": "shardId-000000000000",
            "HashKeyRange": {
                "StartingHashKey": "0",
                "EndingHashKey": "113427455640312821154458202477256070484"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "496103516369698317099491240188243335841720557371394"
            }
        },
        {
            "ShardId": "shardId-000000000001",
            "HashKeyRange": {
                "StartingHashKey": "113427455640312821154458202477256070485",
                "EndingHashKey": "226854911280625642308916404954512140969"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "49610351636992132529573630114381723961608490082063351826"
            }
        },
        {
            "ShardId": "shardId-000000000002",
            "HashKeyRange": {
                "StartingHashKey": "226854911280625642308916404954512140970",
                "EndingHashKey": "340282366920938463463374607431768211455"
            },
            "SequenceNumberRange": {
                "StartingSequenceNumber": "49610351637014433274772160737523259679881138443569332258"
            }
        }
    ]
}
As we can see, each shard has its own hash key range, so using that information we can generate a key that lies between a shard's StartingHashKey and EndingHashKey. With this approach, we can implement a round-robin algorithm that puts records into each shard one by one, achieving uniform data distribution among the Kinesis data stream's shards.
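The round-robin idea can be sketched as follows. This is a minimal sketch, not a production implementation: the helper picks one explicit hash key per shard (here, the midpoint of each range), and the producer cycles through those keys. The stream name and the "ignored" partition key are placeholders.

```python
import itertools

def shard_hash_keys(shards: list) -> list:
    """Pick one ExplicitHashKey per shard: the midpoint of its hash key range."""
    keys = []
    for shard in shards:
        start = int(shard["HashKeyRange"]["StartingHashKey"])
        end = int(shard["HashKeyRange"]["EndingHashKey"])
        keys.append(str((start + end) // 2))  # ExplicitHashKey must be a string
    return keys

def put_records_round_robin(kinesis, stream_name: str, records: list) -> None:
    """Write records one by one, cycling through the shards' hash keys
    so that each shard receives an equal share of the traffic."""
    shards = kinesis.list_shards(StreamName=stream_name)["Shards"]
    for record, hash_key in zip(records, itertools.cycle(shard_hash_keys(shards))):
        kinesis.put_record(
            StreamName=stream_name,
            Data=record,
            PartitionKey="ignored",     # still required by the API
            ExplicitHashKey=hash_key,   # overrides the partition key's hash
        )

# usage (requires AWS credentials):
# import boto3
# kinesis = boto3.client("kinesis")
# put_records_round_robin(kinesis, "my-stream", [b"event-1", b"event-2", b"event-3"])
```

Note that if shards are split or merged, the hash key ranges change, so the shard list should be refreshed periodically rather than cached forever.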
Final Thoughts
The approach discussed in this story is one way to achieve uniform data distribution among Kinesis shards. Please share your feedback about anything that can be improved or that I missed. Thank you!
Resources
https://docs.aws.amazon.com/streams/latest/dev/introduction.html