TL;DR
In this post, we dive into how Generative AI (GenAI) can supercharge data cleansing, specifically for cleaning up email addresses and birthdays. We tackle the usual headaches of using regular expressions and show how GenAI can step in to make things easier and faster. You’ll find code snippets for generating fake data, adding some errors, and using Amazon Bedrock for smart error-fixing. The results? GenAI nailed all the birthday fixes, but some emails tripped up due to special characters.
Introduction: A Power-Up for RegEx Implementation
GenAI has been discussed in many different fields, and for a long time I have been thinking about how to apply it to data preparation and integration. So far, my approach has been to write regular expressions that first identify invalid values in a table, and then, in a second step, to build a rule set using Spark, Python, or SQL to correct the inaccurate values.
Both steps are tedious due to edge cases and separate development streams. Still, the work was valuable, as we see significant improvements in data analysis and the creation of data products. Ultimately, the harmonization of data sources will always create opportunities in every organization.
Due to this effort and the value of data cleansing, I want to share my first steps in leveraging GenAI to reduce the implementation effort. This is possible because a prompt describes a generic task, like "fix this string according to ... if it contains errors, mistakes, and typos," and the GenAI model in the background will decide on its own how to fix the string. By using such an approach, we combine regex development and error correction in one step. One caveat: the response behavior can vary from run to run.
Creating a Sample: Cleaning Email Addresses and Birthdays
To showcase my experiences so far, we will create sample data that mimics an internal system. It will contain emails and birthdays of our customers. Unfortunately, our input system does not validate the data that customers enter. Therefore, customers occasionally send incorrect data to our CRM.
Data Producer: Creating a Ground Truth for Evaluation Purposes
Our system is mocked by a Python script. It first creates valid, expected inputs with 100% accuracy. Afterwards, we introduce typical errors and typos into the entries of our dataset, each with a 50% chance. Let's start with emails.
# src/dataproducer.py
import random
from datetime import datetime, timedelta

import pandas as pd

# Methods of the DataProducer class; its constructor stores rows, first_names, and last_names.
def generate_random_email(self):
    first_name = random.choice(self.first_names)
    last_name = random.choice(self.last_names)
    domains = ['example.com', 'test.com', 'sample.com', 'demo.com']
    email = f"{first_name}.{last_name}@{random.choice(domains)}"
    return email.lower()

def introduce_typo_email(self, email: str):
    # With a 50% chance, corrupt the email with one random character operation.
    if random.random() < 0.5:
        email_list = list(email)
        index = random.randint(0, len(email) - 1)
        action = random.choice(['replace', 'add', 'remove', 'case_change'])
        if action == 'replace':
            email_list[index] = random.choice('abcdefghijklmnopqrstuvwxyz')
        elif action == 'add':
            email_list.insert(index, random.choice('abcdefghijklmnopqrstuvwxyz'))
        elif action == 'remove' and len(email) > 1:
            email_list.pop(index)
        elif action == 'case_change':
            # Flip the case of a single character.
            email_list[index] = email_list[index].upper() if email_list[index].islower() else email_list[index].lower()
        return ''.join(email_list)
    return email
The function generate_random_email() creates random emails based on a predefined set of German first and last names and four different email domains. The function introduce_typo_email() introduces an error with a 50% chance by replacing, adding, or removing a letter from the email, or by flipping a letter's case.
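For a quick, hypothetical sanity check (the import path and constructor arguments simply mirror the DataProducer usage shown later in this post), you could run something like this; the exact output varies per run:

import dataproducer as dp  # assuming src/ is on the Python path

producer = dp.DataProducer(rows=1, first_names=['Max'], last_names=['Müller'])
email = producer.generate_random_email()               # e.g. 'max.müller@test.com'
print(email, '->', producer.introduce_typo_email(email=email))
# e.g. max.müller@test.com -> max.müleler@test.com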
The second data field will be the birthday of our customers.
def generate_correct_birthday(self):
    start_date = datetime(1900, 1, 1)
    end_date = datetime(2023, 12, 31)
    random_date = start_date + timedelta(days=random.randint(0, (end_date - start_date).days))
    return random_date.strftime("%d.%m.%Y")

def introduce_typo_birthday(self, birthday):
    if random.random() < 0.5:
        # Strip the leading zero of the day, e.g. '06.06.2006' -> '6.06.2006'.
        birthday = birthday.lstrip('0')
    if random.random() < 0.5:
        # Shorten a four-digit year, e.g. '1985' -> '85'. This runs before the
        # separator swap below, while '.' can still split the date parts.
        year = birthday.split('.')[-1]
        if year.startswith(("19", "20")) and random.choice([True, False]):
            year = year[2:]
        birthday = '.'.join(birthday.split('.')[:-1] + [year])
    if random.random() < 0.5:
        # Replace the '.' separator with a space, '/', or '-'.
        birthday = birthday.replace('.', random.choice([' ', '/', '-']))
    return birthday
The function generate_correct_birthday() creates birthdays in the format DD.MM.YYYY, with each date between 01.01.1900 and 31.12.2023. The typos we introduce include stripping the leading zero, shortening a four-digit year by dropping its leading "19" or "20", and replacing the "." separator with a space, "/", or "-".
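The same kind of hypothetical spot check works for birthdays; the outputs listed below are illustrative, since each typo is applied with an independent 50% chance:

birthday = producer.generate_correct_birthday()        # e.g. '06.06.2006'
print(producer.introduce_typo_birthday(birthday=birthday))
# Possible outputs: '6.06.2006', '06.06.06', '06/06/2006', or '06.06.2006' (unchanged)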
Finally, we write the dataset to a CSV file. For this blog post, we limit the number of rows to 100.
def create_pandas_dataframe(self):
    emails_correct = [self.generate_random_email() for _ in range(self.rows)]
    emails_typo = [self.introduce_typo_email(email=email) for email in emails_correct]
    birthdays_correct = [self.generate_correct_birthday() for _ in range(self.rows)]
    birthdays_typo = [self.introduce_typo_birthday(birthday=birthday) for birthday in birthdays_correct]
    data = {
        'EmailCorrect': emails_correct,
        'EmailTypo': emails_typo,
        'BirthdayCorrect': birthdays_correct,
        'BirthdayTypo': birthdays_typo,
    }
    return pd.DataFrame(data)

def write_as_csv(self, dataframe: pd.DataFrame):
    dataframe.to_csv(path_or_buf='data/out.csv', index=False)
We execute the snippets above from the file src/datacleaner.py:
# src/datacleaner.py
# snippet
import logging

import dataproducer as dp  # assuming src/ is on the Python path

def create_data():
    logging.info("Creating DataFrame")
    data_producer = dp.DataProducer(
        rows=100,
        first_names=['Max', 'Sophie', 'Leon', 'Marie', 'Paul', 'Emma', 'Lukas', 'Hannah', 'Tim', 'Anna'],
        last_names=['Müller', 'Schmidt', 'Schneider', 'Fischer', 'Weber', 'Meyer', 'Wagner', 'Becker', 'Hoffmann', 'Schulz'],
    )
    df = data_producer.create_pandas_dataframe()
    data_producer.write_as_csv(dataframe=df)
    logging.info("Wrote down DataFrame as csv")

if __name__ == "__main__":
    create_data()
GenAI-based Data Cleansing: Invoking Claude Haiku for Dynamic RegEx Evaluation at Runtime
To implement dynamic GenAI-based RegEx at runtime, we need to handle four things. First, we create a prompt that holds the relevant information. Second, we prepare the data so our GenAI model can process it. Third, we invoke the Amazon Bedrock endpoint. Finally, the result from the endpoint needs to be shaped and passed back into our DataFrame.
Prompting, Retrieving, and Running the Implementation
I followed this website to create the prompt. The following snippet shows how to create dynamic prompts for different data fields. I marked the instructions and the data with XML tags as suggested on the website.
def prompt_builder(field_name: str, pattern: str, data: list[str]):
    prompt = f"""<instructions>You are in charge of ensuring data quality.
Check the following data fields. They represent {field_name} from our CRM of our German customers.
Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: {pattern}.
Return only a comma-separated list and keep the order of the input data.</instructions> <data>{data}</data>"""
    return prompt
The idea is a role-based approach: the task is formulated clearly and directly, and the prompt is precise about the expected output format and, in our case, also the order.
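To make this concrete, here is what a rendered prompt looks like for the birthday field (the two sample values are illustrative):

prompt = prompt_builder(
    field_name='Birthday',
    pattern='DD.MM.YYYY',
    data=['6/06/2006', '23.03.85'],
)
# <instructions>You are in charge of ensuring data quality.
# Check the following data fields. They represent Birthday from our CRM of our German customers.
# Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: DD.MM.YYYY.
# Return only a comma-separated list and keep the order of the input data.</instructions> <data>['6/06/2006', '23.03.85']</data>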
The comma-separated list is a preparation for integrating the results into our existing DataFrame. To do this, I implemented the following code:
def process_output(bedrock_response: dict, json_parsable: bool = False):
    output = bedrock_response['body'].read()
    output = output.decode('utf-8')
    output = json.loads(output)['content'][0]['text']
    if json_parsable:
        # Return the parsed structure directly instead of splitting the raw text.
        return json.loads(output)
    return [elem.strip() for elem in output.split(",")]
It's important to mention that the result is returned as natural language text. That's why I need to process the output a little to extract the elements: splitting on commas and stripping the leading whitespace in front of each element (e.g., " 05.12.1988").
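As a hypothetical illustration, here is what the split-and-strip step does to a raw completion:

raw_text = '06.06.2006, 23.03.1985,  05.12.1988'
print([elem.strip() for elem in raw_text.split(',')])
# ['06.06.2006', '23.03.1985', '05.12.1988']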
Now that we have declared the input and output handling, we just need to invoke the endpoint:
def invoke_endpoint(bedrock_client: boto3.client, prompt: str, top_k: int, max_tokens: int, top_p: float, temperature: float):
    body = json.dumps(
        {
            'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}],
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': max_tokens,
            'temperature': temperature,
            'top_p': top_p,
            'top_k': top_k,
        }
    )
    modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    accept = "*/*"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body,
        modelId=modelId,
        accept=accept,
        contentType=contentType,
    )
    return response
Et voilà, the final script:
# src/datacleaner.py
# snippet
import json
import logging

import boto3
import numpy as np
import pandas as pd
from botocore.config import Config

def invoke_endpoint(bedrock_client: boto3.client, prompt: str, top_k: int, max_tokens: int, top_p: float, temperature: float):
    body = json.dumps(
        {
            'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}],
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': max_tokens,
            'temperature': temperature,
            'top_p': top_p,
            'top_k': top_k,
        }
    )
    modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    accept = "*/*"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body,
        modelId=modelId,
        accept=accept,
        contentType=contentType,
    )
    return response

def prompt_builder(field_name: str, pattern: str, data: list[str]):
    prompt = f"""<instructions>You are in charge of ensuring data quality.
Check the following data fields. They represent {field_name} from our CRM of our German customers.
Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: {pattern}.
Return only a comma-separated list and keep the order of the input data.</instructions> <data>{data}</data>"""
    return prompt

def process_output(bedrock_response: dict, json_parsable: bool = False):
    output = bedrock_response['body'].read()
    output = output.decode('utf-8')
    output = json.loads(output)['content'][0]['text']
    if json_parsable:
        # Return the parsed structure directly instead of splitting the raw text.
        return json.loads(output)
    return [elem.strip() for elem in output.split(",")]
def gen_ai_processing(create_new_data: bool, aws_region: str, data_path: str):
    # Parameter named create_new_data so it does not shadow the create_data() function.
    config = Config(
        region_name=aws_region,
    )
    logging.info("Creating or reading data ...")
    if create_new_data:
        create_data()
        logging.info("Created data ...")
    df = pd.read_csv(filepath_or_buffer=data_path)
    logging.info("Read data ...")
    bedrock_client = boto3.client("bedrock-runtime", config=config)
    logging.info("Running Bedrock inference for data cleansing")
    for entry in [
        {
            "FieldName": "Email",
            "Pattern": "first_name.last_name@company.com",
            "ColumnTypo": "EmailTypo",
            "ColumnCorrect": "EmailCorrect"
        },
        {
            "FieldName": "Birthday",
            "Pattern": "DD.MM.YYYY",
            "ColumnTypo": "BirthdayTypo",
            "ColumnCorrect": "BirthdayCorrect"
        }
    ]:
        prompt = prompt_builder(
            field_name=entry["FieldName"],
            pattern=entry["Pattern"],
            data=df[entry["ColumnTypo"]].tolist()
        )
        logging.info(f"Running Bedrock inference for {entry['FieldName']}")
        llm_res = invoke_endpoint(
            bedrock_client=bedrock_client,
            prompt=prompt,
            temperature=0.2,
            top_k=100,
            top_p=0.2,
            max_tokens=1024
        )
        processed_output = process_output(bedrock_response=llm_res, json_parsable=False)
        df[f"{entry['FieldName']}Bedrock"] = processed_output
        comparison_column = np.where(df[f"{entry['FieldName']}Bedrock"] == df[entry["ColumnCorrect"]], True, False)
        df[f"CompareBedrockOrg{entry['FieldName']}"] = comparison_column
    df.to_csv(path_or_buf='data/bedrock_out.csv', index=False)

if __name__ == "__main__":
    gen_ai_processing(
        create_new_data=False,
        aws_region="eu-central-1",
        data_path="data/out.csv",
    )
Analyzing the Results
Now it's time to analyze our dynamic RegEx implementation with GenAI. Since we know the ground truth, we can simply compare Bedrock's results against it. Initially, we have 100 entries each for birthdays and emails; the script introduced typos into 45 emails and 31 birthdays.
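If you want to reproduce these counts, a minimal tally over the output CSV could look like this (the column names are those defined above; the file path assumes the script ran with its defaults):

import pandas as pd

df = pd.read_csv('data/bedrock_out.csv')
print('Emails with typos:   ', (df['EmailTypo'] != df['EmailCorrect']).sum())
print('Birthdays with typos:', (df['BirthdayTypo'] != df['BirthdayCorrect']).sum())
print('Emails matching:     ', df['CompareBedrockOrgEmail'].sum())
print('Birthdays matching:  ', df['CompareBedrockOrgBirthday'].sum())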
With Bedrock, all birthdays could be fixed:
INFO:root: BirthdayCorrect BirthdayTypo BirthdayBedrock CompareBedrockOrgBirthday
0 06.06.2006 6/06/2006 06.06.2006 True
5 22.12.1957 22 12 1957 22.12.1957 True
6 23.03.1985 23.03.85 23.03.1985 True
11 22.07.2010 22.07.2010 22.07.2010 True
14 28.03.1970 28/03/1970 28.03.1970 True
Five emails, however, could not be fixed:
INFO:root: EmailCorrect EmailTypo EmailBedrock CompareBedrockOrgEmail
0 tim.weber@sample.com ti.weber@sample.com ti.weber@sample.com False
5 marie.müller@test.com marie.müller@test.com marie.mueller@test.com False
6 sophie.hoffmann@demo.com sophie.offmann@demo.com sophie.offmann@demo.com False
11 hannah.müller@test.com hannah.müller@testf.com hannah.mueller@test.com False
14 paul.müller@example.com payul.müller@example.com paul.mueller@example.com False
However, if you look closer, three of the five are arguably correct fixes: Bedrock transliterated "müller" to "mueller", since email addresses cannot contain German special characters. Only the two entries with dropped letters ("Tim" shortened to "ti" and "Hoffmann" to "offmann") remained uncorrected.
Final Words
I was very impressed by this approach. However, I am fully aware of some limitations of this blog post. The dataset is small; in larger settings, chunking will be crucial, as Bedrock has strict limits for batch approaches, and a simple batching loop (sketched below) helps here. Also, the examples are relatively simple, and more complex data fields might show lower success rates. Reproducibility looked very good: across five rounds I consistently got the same results, but this, too, needs to be evaluated on larger sets.
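A minimal chunking sketch, assuming a batch size of 50 (the size is my assumption, not a documented Bedrock limit), could look like this:

# Hypothetical helper for splitting a column into prompt-sized batches.
def chunked(values: list[str], size: int = 50):
    for i in range(0, len(values), size):
        yield values[i:i + size]

# Usage idea: build one prompt per batch and concatenate the processed outputs,
# e.g. for batch in chunked(df['EmailTypo'].tolist()): ...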
I hope you gained some insight into GenAI-based RegEx implementation for data cleansing. Feel free to reach out if you want to discuss; I'd be happy to! :-)
Happy coding!