A major part of any data pipeline is cleaning the data. Depending on the project, that can mean many things, but in most cases it means normalizing the data and bringing it into a format the project expects. For example, it could be extracting date and time components from a timestamp column into separate columns, converting the case of a string column, or label encoding categorical values.
We write code for this and then design a pipeline to automate it using tools such as Apache Airflow. We have been doing this for years. But as we move more and more of our processing and pipelines to the cloud, we can achieve the same data-cleaning results with tools provided by our cloud infrastructure providers. And usually, such tools don’t require any coding. Being a developer myself, I can see how this can seem threatening to my job. But I don’t want to get into that argument, because I think it’s moot.
Anyway, in this post, we’ll take a look at AWS Glue DataBrew and how we can clean our datasets with it. For this POC, I’m using S3 to store both the input and the processed output (mostly because that’s the easiest option I see) and using an NYC OpenData dataset. To be more specific, I’m using the Motor Vehicle Collisions – Crashes dataset.
What Is AWS Glue DataBrew?
We already know that AWS Glue is a tool for designing extract, transform, and load (ETL) pipelines. It provides a lot of features for creating and running ETL jobs. DataBrew takes it one step further by also providing features to clean and transform the data, readying it for further processing or for feeding machine learning models.
DataBrew provides over 250 transformations to get started with. These include filtering data, converting data into standard formats, fixing data quality issues, extracting data from columns using regex, and much more. The best part is that these transformations are built into DataBrew, which means we don’t need to code any of them.
The advantage is that data analysts can design the transformation pipelines themselves without having to wait for developers to code the transformations. You can apply these transformations at any stage of the pipeline, especially if you already have a Glue ETL job.
Another advantage here is that all of this is serverless. This means you only pay for what you use, and you don’t pay for anything when the pipeline isn’t running. Also, it can scale automatically depending on the amount of data coming in.
So, let’s get started with DataBrew. We’ll first create a new project and a recipe. Next, we’ll download the sample dataset and upload it to DataBrew. Then we’ll apply a few transformations and create a job to run those transformations on the entire dataset. Finally, we’ll see what the transformed data looks like.
Getting Started With AWS Glue DataBrew
Creating a project
First, head over to the AWS Glue DataBrew console and create a new project. Next, provide a project name and a recipe name, as you can see from the screenshot below.
For this POC, we can leave all the configurations at their defaults. After providing a name for the project and the recipe, scroll down to the Select a dataset section to upload the sample CSV dataset. If you haven’t downloaded the sample dataset yet, you can get it from the NYC OpenData website here.
In the Select a dataset section, select the New dataset option on the right of the screen. Provide a name for the dataset. Then select the File upload option under Connect to a new dataset. You can see this configuration in the screenshot below.
In the Enter S3 destination option, select an S3 bucket for hosting the source data and the transformed data. If you don’t have a bucket for this yet, you can create one right from this screen.
Finally, scroll down to the bottom of the page and click the Create project button to create the project. This will take some time, because the file has to be uploaded to the S3 bucket first, a bunch of resources have to be provisioned, and the file has to be sampled for the preview. During this time, you’ll see a progress indicator like this:
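By the way, if you’d rather script this setup than click through the console, the same source resources can be created with the AWS SDK. Here’s a minimal boto3 sketch, not part of the console walkthrough above; the region, bucket, file, and dataset names are placeholders you’d replace with your own:

```python
import boto3

REGION = "us-east-1"                       # assumption: use your own region
BUCKET = "my-databrew-poc-bucket"          # assumption: an S3 bucket you own
KEY = "input/motor-vehicle-collisions-crashes.csv"

# Upload the downloaded CSV so DataBrew can use it as a source.
s3 = boto3.client("s3", region_name=REGION)
s3.upload_file("Motor_Vehicle_Collisions_-_Crashes.csv", BUCKET, KEY)

# Register the file as a DataBrew dataset.
databrew = boto3.client("databrew", region_name=REGION)
databrew.create_dataset(
    Name="motor-vehicle-collisions",       # assumption: any dataset name works
    Format="CSV",
    Input={"S3InputDefinition": {"Bucket": BUCKET, "Key": KEY}},
)
```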
Applying transformations
Once the data is prepared, you will see it presented in tabular form with a set of options at the top of the table. Here, we can select individual rows, select transformations, preview the results, and then apply the transformations. We can also manipulate the columns, for example by duplicating or splitting them. These count as transformations too, but I wanted to call them out explicitly because they are applied so commonly.
We’ll start applying transformations with the first column, CRASH DATE. In most analytics use cases, we want to aggregate something by year. We usually have timestamp columns from which we extract the year component in queries, but those queries become simpler and more performant if we have a dedicated column for the year value. So we’ll extract the year from the date column and save it as a new column called CRASH YEAR.
For this, select the column, click the three dots at the top-right corner of the column, select the Extract menu, and then select the Date-time values option. This will open a new configuration pane on the right side of the page, as shown in the screenshot below:
Here, the Create column options, Source column, and Extract option sections should already be filled in. In the Date-time unit to extract section, select Year from the dropdown menu, and enter CRASH YEAR as the Destination column field value. This will create a new column named CRASH YEAR containing the year component extracted from the date column. Click the Preview button at the bottom to preview the changes. This should add a new column to the dataset, as shown in the screenshot below.
Click the Apply button to apply the transformation.
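For reference, here’s roughly what this step does, expressed as a small pandas sketch. This is only an illustration of the logic, not what DataBrew runs under the hood, and the file name is whatever you saved the download as:

```python
import pandas as pd

# Load a sample of the downloaded CSV (file name is a placeholder).
df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv", nrows=10_000)

# Parse the date column and extract the year into a dedicated CRASH YEAR column,
# mirroring the Extract > Date-time values step configured above.
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"], errors="coerce")
df["CRASH YEAR"] = df["CRASH DATE"].dt.year
```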
Next, we’ll format the CRASH TIME column as HH:MM. It’s already in a standard format, but for the sake of experimentation we’ll do this anyway. So let’s repeat the same process. Click the three dots at the top-right corner of the column, select the Format menu, and then click the Date-time formats menu. Then scroll down to the end of the sub-menu and select More options. This will open a configuration pane similar to the one we saw while extracting the year component earlier. Use the following screenshot as a reference to configure this transformation.
Click the Preview changes option to see the difference after applying the transformation. It should be similar to the screenshot below. Finally, click the Apply button to apply the transformation. It is important to note that these changes are not applied immediately after hitting Apply. The Apply button only saves the transformation to the recipe, so we can go back and change it anytime we want; the transformations actually change the data only when we run the job.
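Again, for illustration only, the rough pandas equivalent of this formatting step could look like this (assuming the times are in an H:MM or HH:MM form, as in this dataset):

```python
import pandas as pd

df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv", nrows=10_000)

# Normalize CRASH TIME to a zero-padded HH:MM string,
# mirroring the Date-time format step configured above.
df["CRASH TIME"] = (
    pd.to_datetime(df["CRASH TIME"], format="%H:%M", errors="coerce")
      .dt.strftime("%H:%M")
)
```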
Now, we’ll apply the final transformation to another column. For machine learning use cases, we need to convert text columns into numbers, because most models can’t work with raw text. When a column holds categorical data, we can map each category to a number. In our dataset, we have one such column – CONTRIBUTING FACTOR VEHICLE 1. This process of converting textual, categorical data into numbers is called categorical mapping. There’s another way of doing this, called one-hot encoding.
Glue DataBrew provides both options. Click the three dots at the top-right corner of the column to open the context menu and scroll to the end; you’ll see both the Categorical mapping and One-hot encode column options. For this example, we’ll go with categorical mapping, so select that option to open the configuration panel. Match the configuration shown in the screenshot below.
Click the Preview changes button to preview the changes, which should look something like this:
Click the Apply button again to apply this transformation. We should see a new column added with the mapped values. Now, for an actual project, this transformation isn’t going to help much for machine learning, because the numerical values will make the model treat the column as some sort of ranking rather than categorical data; one-hot encoding is the better choice there. But we’ll stick with categorical mapping for this example.
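For comparison, here’s how both options look in a small pandas sketch (again just an illustration: pd.factorize for categorical mapping, pd.get_dummies for one-hot encoding; the derived column names are my own):

```python
import pandas as pd

df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv", nrows=10_000)

# Categorical mapping: assign an integer code to each distinct factor value.
codes, categories = pd.factorize(df["CONTRIBUTING FACTOR VEHICLE 1"])
df["CONTRIBUTING FACTOR VEHICLE 1_mapped"] = codes

# One-hot encoding: one 0/1 column per category, which avoids implying an order.
one_hot = pd.get_dummies(df["CONTRIBUTING FACTOR VEHICLE 1"], prefix="FACTOR_1")
df = pd.concat([df, one_hot], axis=1)
```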
After this, click the Recipe button at the top-right corner of the table. You should now see all the transformations we applied to the dataset, similar to the screenshot below.
From here, we can easily edit or remove any of these recipe steps, and we can even add a new transformation from this pane. This comes in very handy when we want to review all the transformations being applied to a dataset in one place.
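If you want to inspect the recipe outside the console, the boto3 DataBrew client can fetch its steps as well. A sketch, assuming the recipe name you chose when creating the project (the name below is a placeholder):

```python
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")

# Fetch the working version of the recipe and list its steps.
recipe = databrew.describe_recipe(Name="my-databrew-recipe")  # placeholder name
for step in recipe["Steps"]:
    print(step["Action"]["Operation"], step["Action"].get("Parameters", {}))
```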
Running the job
We are now finally ready to run the job to apply all these transformations to the entire dataset. You should see the Run job button at the top right corner of the page. Click that button and take a break now. This will take a few minutes to complete.
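If you’d rather trigger the run from code, say from an Airflow task, the boto3 DataBrew client exposes a start_job_run call. A minimal sketch, assuming a recipe job has already been created for this recipe (the job name below is a placeholder):

```python
import time
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")
JOB_NAME = "my-databrew-job"  # placeholder: a recipe job created for this recipe

# Start the job and poll until it reaches a terminal state.
run_id = databrew.start_job_run(Name=JOB_NAME)["RunId"]
while True:
    state = databrew.describe_job_run(Name=JOB_NAME, RunId=run_id)["State"]
    print("Job state:", state)
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)
```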
In the meantime, we can look at the data lineage to visualize and understand the complete pipeline. Click the LINEAGE button that’s to the right of the Run job button. Here, you’ll see the flow of data from the S3 bucket, through the transformation recipes, and back to the S3 bucket. It should look similar to the one shown in the screenshot below.
You can see that we have one dataset created from source files in the S3 bucket. From that dataset, we have one project whose recipe contains six steps. Those steps are applied to the dataset by a job, and finally, the transformed data is saved back to the S3 bucket.
That’s it
And that’s pretty much it. This is, and I repeat, not a typical production pipeline. Usually, data streams into the system through tools such as Kinesis Data Streams, the transformations are applied on the fly with micro-batching, and the results are stored in a data warehouse. But if the use case is batch processing, S3 could well be the source.
Nonetheless, this exercise should help you get started with DataBrew and explore the tool’s capabilities for more advanced transformations. And if DataBrew doesn’t provide a transformation you need, you can always fall back to Lambda functions for custom logic.
AWS has great tools for creating data pipelines with transformations, data lakes and warehouses, machine learning models, and more. You can check here for more resources, getting-started guides, and POCs on AWS services. And if you want me to explore a particular tool or service for you, do let me know in the comments.
And if you like what you see here, or on my Medium blog and personal blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and Github.