Data Masking of AWS Lambda Function Logs

Matthieu Lienart

What Are the Problems?
When logging events, service API call responses, etc. from Lambda Functions into CloudWatch Logs, you might end up writing sensitive information such as PII. This potentially exposes sensitive data to people who should not have access to it, e.g. developers and cloud platform administrators. It also makes it very hard to comply with data protection regulations like the right to be forgotten: how do you erase a customer's information across all your logs?

Existing Approaches
CloudWatch Logs Native Data Masking
Natively, Amazon CloudWatch Logs allows data masking by using managed or custom data identifiers and data protection policies. The data identifiers are pattern-matching rules or machine learning models which detect sensitive data. Data protection policies are JSON documents describing the operations to perform on the identified sensitive data. The operation can be set to just “audit” or to “de-identify” the data, in which case only principals authorized to perform the logs:Unmask action are able to see the data in clear text.
Note that only new data written to CloudWatch Logs will be masked according to the defined policies. Someone with access to the logs would still be able to see the sensitive data written before enabling data masking.
Although this approach prevents unauthorized personnel from accessing sensitive data, it does not help you comply with the right to be forgotten.
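As an illustration, here is a minimal sketch of attaching such a policy to a log group with boto3. The log group name is a placeholder and the policy only masks email addresses; check the CloudWatch Logs documentation for the exact policy schema (a policy needs both an Audit and a Deidentify statement):

import json
import boto3

# A minimal data protection policy masking email addresses
policy = {
    "Name": "mask-pii-policy",
    "Version": "2021-06-01",
    "Statement": [
        {
            "Sid": "audit",
            "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
            "Operation": {"Audit": {"FindingsDestination": {}}},
        },
        {
            "Sid": "deidentify",
            "DataIdentifier": ["arn:aws:dataprotection::aws:data-identifier/EmailAddress"],
            "Operation": {"Deidentify": {"MaskConfig": {}}},
        },
    ],
}

boto3.client("logs").put_data_protection_policy(
    logGroupIdentifier="/aws/lambda/my-function",  # placeholder log group
    policyDocument=json.dumps(policy),
)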

AWS Lambda Powertools Data Masking
AWS Lambda Powertools is “a developer toolkit to implement Serverless best practices and increase developer velocity”, originally developed in Python and since made available for Java, TypeScript and .NET. As of now, only the Python version offers a utility for data masking.
Two approaches are proposed: one uses a KMS key to encrypt/decrypt the sensitive information inside the log, the other simply erases the sensitive information before the log is written. To implement the first approach and also comply with regulations like the right to be forgotten, you would need one encryption key per customer and a way to encrypt each customer’s information with their own key. Should a customer exercise their right to be forgotten, you simply delete their encryption key, making their data forever unrecoverable.
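Here is a minimal sketch of the encryption approach, assuming the Powertools AWS Encryption SDK provider (which requires the aws-encryption-sdk extra) and a hypothetical per-customer KMS key ARN; check the Powertools documentation for the exact API of your version:

from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import (
    AWSEncryptionSDKProvider,
)

# Hypothetical per-customer KMS key ARN (placeholder account and key)
CUSTOMER_KEY_ARN = "arn:aws:kms:eu-west-1:111122223333:key/customer-42"

data_masker = DataMasking(provider=AWSEncryptionSDKProvider(keys=[CUSTOMER_KEY_ARN]))

data = {"customers": [{"phone_number": "+1 555 0100", "plan": "pro"}]}

# Encrypt the sensitive fields before logging; decrypt later only if authorized
encrypted = data_masker.encrypt(data, fields=["customers[*].phone_number"])
decrypted = data_masker.decrypt(encrypted, fields=["customers[*].phone_number"])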
Although those approaches can address both problems, you must know exactly what to encrypt or erase. For example, to erase the phone numbers in a list of customers you would need to do something like this:

data_masker.erase(data, fields=["customers[*].phone_number"])

But what if you are unsure at the start of a project about the data structure and the content? What if the data schema changes? What if you forgot a field in a nested JSON structure?

Erasing All PII by Default
Do you really need sensitive information like PII in application logs?
Probably not.
In that case, the AWS Lambda Powertools data erasing approach seems like the simplest one. But again, it only works as long as you know the data structure and it doesn’t change. As a security/compliance officer, how can I make sure the developers don’t forget to erase sensitive information?
So I wanted to improve on the AWS Lambda Powertools approach to erase sensitive information wherever it appears in the logs…
This is what I came up with based on the AWS Lambda Powertools data masking utility.

1- Create a Function to Erase Sensitive Data
I created a Python decorator which calls the data_masker.erase() function on the message, erasing all the fields passed as a parameter, before calling the logging method the decorator is applied to.

import json
from warnings import catch_warnings
from functools import wraps, partial
from decimal import Decimal
from typing import Any
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider import BaseProvider

def is_valid_json_string(json_string: str) -> bool:
    # Only strings containing a JSON object (i.e. a dict) can be masked
    if isinstance(json_string, str):
        try:
            result = json.loads(json_string)
            return isinstance(result, dict)
        except json.JSONDecodeError:
            return False
    return False

def log_masking_decorator(masked_fields: list[str]):
    def decorator(func):
        @wraps(func)
        def wrapper(self, msg, *args, **kwargs):
            # erase() only supports dicts and JSON object strings
            if is_valid_json_string(msg) or isinstance(msg, dict):
                # Silence warnings about fields missing from this message
                with catch_warnings(action="ignore"):
                    msg = self.data_masker.erase(msg, fields=masked_fields)
            return func(self, msg, *args, **kwargs)
        return wrapper
    return decorator

Code explanations:

  • The data_masker.erase() function only works on dictionaries and strings containing a JSON object, so we need to verify the type of the message before erasing the data.
  • The AWS Lambda Powertools data masker raises a warning if you instruct it to mask a field which it can’t find. With this approach, where I want to globally define a list of fields to mask everywhere, this would flood CloudWatch Logs with warnings, which I don’t want. So I suppress the warnings around the erase() call (note that the catch_warnings(action="ignore") syntax requires Python 3.11 or later). A short before/after example follows this list.
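To illustrate what erase() does in isolation, here is a small example; the "*****" masking string is, as I understand it, the Powertools default, so verify against your version:

data_masker = DataMasking(raise_on_missing_field=False)
data = {"customer": {"name": "Jane Doe", "phoneNumber": "+1 555 0100", "orderId": 42}}
masked = data_masker.erase(data, fields=["$..name", "$..phoneNumber"])
# masked -> {"customer": {"name": "*****", "phoneNumber": "*****", "orderId": 42}}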

2- Apply the Function to All Logging Methods
A class decorator is created to apply a decorator function, passed as an argument, to all the logging methods (e.g. info, error, exception) of the logger class:

def decorate_log_methods(decorator):
    def decorate(cls):
        # Wrap every logging method of the class with the given decorator
        for attr in dir(cls):
            if callable(getattr(cls, attr)) and attr in [
                "info",
                "error",
                "warning",
                "exception",
                "debug",
                "critical",
            ]:
                setattr(cls, attr, decorator(getattr(cls, attr)))
        return cls
    return decorate
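To see the mechanics in isolation, here is a toy example (not part of the actual logger) applying decorate_log_methods with a trivial decorator:

def tag_decorator(func):
    def wrapper(self, msg, *args, **kwargs):
        return func(self, f"[masked] {msg}", *args, **kwargs)
    return wrapper

@decorate_log_methods(tag_decorator)
class DemoLogger:
    def info(self, msg):
        print(f"INFO: {msg}")

DemoLogger().info("hello")  # prints: INFO: [masked] hello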

3- Create a Custom Logger Class
Finally, a custom logger class is created, with the class decorator from the previous step applied to it. The class decorator takes as an argument the logging-method decorator from step 1, which in turn takes as an argument the list of JSON keys containing PII which should be erased.

def decimal_serializer(obj: Any) -> Any:
    # Convert Decimal values (not natively JSON serializable) to strings
    if isinstance(obj, Decimal):
        obj = str(obj)
    return obj

@decorate_log_methods(
    log_masking_decorator(
        masked_fields=[
            "$.[*].phoneNumber",
            "$..[*].phoneNumber",
            "$.[*].name",
            "$..[*].name",
        ]
    )
)
class CustomLogger(Logger):
    def __init__(self):
        super().__init__()
        self.datamasking_provider = BaseProvider(
            json_serializer=partial(json.dumps, default=decimal_serializer),
            json_deserializer=json.loads,
        )
        self.data_masker = DataMasking(
            provider=self.datamasking_provider, raise_on_missing_field=False
        )

Code explanations:

  • I use a custom JSON serializer here to convert Python Decimal values into strings, to avoid serialization errors. A short demonstration follows this list.
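Why this is needed: json.dumps() raises a TypeError on Decimal values, which boto3 returns for DynamoDB numbers, for example. Using the decimal_serializer defined above:

from decimal import Decimal
from functools import partial
import json

payload = {"total": Decimal("19.99")}
# json.dumps(payload) would raise: TypeError: Object of type Decimal is not JSON serializable
serialize = partial(json.dumps, default=decimal_serializer)
print(serialize(payload))  # {"total": "19.99"}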

4- Usage
By instantiating the Python logger in the Lambda Function as a CustomLogger() instead of the default AWS Lambda Powertools Logger(), all values of the JSON keys listed in the class decorator argument will be erased by default.

from aws_lambda_powertools.utilities.typing import LambdaContext

from log_helpers import CustomLogger

logger = CustomLogger()

@logger.inject_lambda_context(log_event=True)
def lambda_handler(event: dict, context: LambdaContext):
    # Any PII fields listed in the CustomLogger class decorator are erased
    response = boto3_client.whatever_service_api()  # illustrative API call
    logger.info(response)

Code explanations:

  • With log_event=True, the inject_lambda_context decorator calls logger.info() with the incoming event. Since the logger here is our custom logger, all PII listed in our CustomLogger class decorator will be erased from the Lambda event logs. This achieves the goal of enforcing the erasure of all the listed PII without the developer having to explicitly list each field to erase on every logging call. An illustrative before/after is shown after this list.
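For example, logging a response containing the listed keys would come out masked roughly along these lines (illustrative only; the exact log record format depends on your Powertools configuration):

response = {"customers": [{"name": "Jane Doe", "phoneNumber": "+1 555 0100", "plan": "pro"}]}
logger.info(response)
# The "message" field of the emitted log record would look roughly like:
# {"customers": [{"name": "*****", "phoneNumber": "*****", "plan": "pro"}]}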

Would I Use That in Production?
No.
Parsing the entire JSON structure of every log message will increase the latency of your Lambda function's responses, which is not something you want. As the AWS Lambda Powertools documentation says, logging the Lambda handler's event should only be done in non-production environments. And you should know the data your Lambda Function is handling, and thus erase the specific sensitive fields where necessary, for efficiency.
I still find it an interesting approach which could be useful in some cases. Test environments should not have production data, but hey, we have all seen those cases out there…
It was nevertheless an interesting exercise to try.

Note: The banner image was generated using the Amazon Nova Canvas image generation model.
