Sakshi Agarwal

Efficiently Deleting Multiple Lakhs of Records from DynamoDB Using Lambda: A Comprehensive Guide

DynamoDB, a highly scalable NoSQL database service offered by AWS, is designed for high throughput and low latency. However, deleting a massive number of records at once can be a performance-intensive operation. This blog post will guide you through the process of efficiently deleting multiple lakhs of records from DynamoDB using a Lambda function, incorporating best practices and addressing common challenges.

Key Considerations
Before diving into the code, it's essential to consider the following:

Partition Key: DynamoDB uses a partition key to distribute data across partitions. Ensure that the partition key values are evenly distributed to avoid hot partitions.

Batch Write Operations: DynamoDB supports batch write operations, which let you delete up to 25 items in a single request and can significantly improve performance (see the sketch after this list).

Throttling: Be mindful of DynamoDB's throttling limits. Exceeding these limits can result in errors and delays.

Consistency: If consistency is crucial, consider using strongly consistent reads. However, they consume more read capacity and may impact performance.

Filter Expression: Use a filter expression to selectively delete records based on specific criteria, reducing the number of items returned and deleted (note that the scan still reads every item it examines).
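
The batch write and throttling points above map onto DynamoDB's BatchWriteItem API, which accepts at most 25 put/delete requests per call and can return unprocessed items when the table is throttled. Here is a minimal sketch of that pattern with the low-level boto3 client; the table name, key shape, and backoff delay are placeholders for illustration:

import time
import boto3

client = boto3.client('dynamodb')

def batch_delete(keys):
    # keys are in the low-level attribute-value format, e.g. {'partition_key': {'S': 'abc'}}
    if not keys:
        return
    # BatchWriteItem accepts at most 25 delete/put requests per call
    request_items = {
        'your_table_name': [{'DeleteRequest': {'Key': key}} for key in keys[:25]]
    }
    while request_items:
        response = client.batch_write_item(RequestItems=request_items)
        # Throttled or otherwise unprocessed requests are returned for retry
        request_items = response.get('UnprocessedItems') or None
        if request_items:
            time.sleep(1)  # back off before resending the leftovers

The batch_writer helper used in the Lambda function below wraps this same pattern, including buffering to 25 items and resending unprocessed requests, which is why the example code does not deal with UnprocessedItems directly.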

Lambda Function Code

Here's a Python Lambda function that demonstrates how to efficiently delete multiple lakhs of records from DynamoDB using batch write operations, filter expressions, and last evaluated key handling:

import boto3
import time

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your_table_name')

def delete_records(records):
    # batch_writer groups the deletes into BatchWriteItem calls of up to 25 items
    # and automatically retries any unprocessed items
    with table.batch_writer() as batch:
        for record in records:
            batch.delete_item(Key={'partition_key': record['partition_key'], 'sort_key': record['sort_key']})

def handler(event, context):
    filter_expression = "attribute_name = :value"
    expression_attribute_values = {':value': "your_filter_value"}

    response = table.scan(
        FilterExpression=filter_expression,
        ExpressionAttributeValues=expression_attribute_values
    )

    records = response['Items']
    last_evaluated_key = response.get('LastEvaluatedKey')

    # Keep scanning until every matching item has been collected
    while last_evaluated_key:
        response = table.scan(
            FilterExpression=filter_expression,
            ExpressionAttributeValues=expression_attribute_values,
            ExclusiveStartKey=last_evaluated_key
        )

        records.extend(response['Items'])
        last_evaluated_key = response.get('LastEvaluatedKey')

    # Divide the records into batches of 25 (the BatchWriteItem limit)
    batch_size = 25
    batches = [records[i:i+batch_size] for i in range(0, len(records), batch_size)]

    for batch in batches:
        delete_records(batch)
        # Pause briefly between batches to stay under the table's write throughput
        time.sleep(1)

    return {'statusCode': 200, 'body': 'Records deleted successfully'}


Explanation:

Filter Expression: The filter_expression and expression_attribute_values parameters allow you to selectively delete records based on specific criteria, reducing the number of items returned and deleted (a condition-builder variant is sketched after this list).
Last Evaluated Key: The last_evaluated_key lets the function handle large datasets by retrieving items page by page and resuming each scan from where the previous one stopped.
Batch Processing: The records are divided into batches of 25, and each batch is passed to delete_records, which uses batch_writer to issue the deletes and retry any unprocessed items.
Looping: The scan loop continues until there is no LastEvaluatedKey left, so every matching record is collected before the deletion phase begins.
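
As a side note, the same filter can be expressed with boto3's condition builder instead of a raw expression string; a minimal sketch, assuming the same placeholder attribute name and value:

from boto3.dynamodb.conditions import Attr

response = table.scan(
    # Equivalent to FilterExpression="attribute_name = :value" with ExpressionAttributeValues
    FilterExpression=Attr('attribute_name').eq('your_filter_value')
)

Keep in mind that a filter expression is applied after items are read, so the scan still consumes read capacity for every item it examines, not just the ones that match.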

Additional Considerations

Error Handling: Implement proper error handling to catch exceptions and log failures for troubleshooting (a minimal sketch follows this list).
Performance Optimization: If you're dealing with an extremely large number of records, consider querying a global secondary index or revisiting your partitioning strategy so that each run scans fewer items.
Parallel Processing: For even faster deletion, explore using multiple Lambda functions or a distributed system to process the records in parallel.
Cost Optimization: If cost is a concern, carefully evaluate the number of Lambda invocations and the amount of data transferred to optimize your costs.
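
As a rough illustration of the error-handling point above, the scan-and-delete work can be wrapped so that ClientError exceptions (including throttling errors) are logged before the invocation fails. The collect_matching_records and chunked helpers below are hypothetical stand-ins for the scan loop and batching logic shown earlier:

import logging
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    try:
        records = collect_matching_records()  # hypothetical helper wrapping the paginated scan
        for batch in chunked(records, 25):    # hypothetical helper yielding 25-item slices
            delete_records(batch)
    except ClientError as error:
        # Log the AWS error code (e.g. ProvisionedThroughputExceededException) for troubleshooting
        logger.error('Deletion failed: %s', error.response['Error']['Code'])
        raise
    return {'statusCode': 200, 'body': 'Records deleted successfully'}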

By following these guidelines and customizing the code to your specific use case, you can effectively delete multiple lakhs of records from DynamoDB using a Lambda function, ensuring efficient, scalable, and cost-effective operations.
