Rajat Venkatesh

Posted on • Originally published at tokern.io

Secure Data Lake with AWS Lake Formation Tutorial

AWS Lake Formation helps you build a secure data lake on top of data stored in AWS S3.
This blog will help you get started by describing the steps to set up a basic data lake with S3, Glue, Lake Formation,
and Athena in AWS. The tutorial uses the
New York City Taxi and Limousine Commission (TLC) Trip Record Data
as the data set.

[Diagram: AWS Lake Formation overview. Image by Jerry Hargrove]

At the end of the tutorial, you will have a data lake set up with Lake Formation.

Prerequisites

  • Set up AWS Lake Formation using the instructions in the Getting Started Tutorial.
  • After that tutorial, the following users and roles are set up:
    • Administrator is the administrator IAM user
    • NycWorkflowRole is the IAM role for workflows
    • lakeadmin is the data lake administrator
  • Upgrade to the Lake Formation permissions model as described in the Upgrade Tutorial.
  • Set up AWS Athena.
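
If you want to confirm the last prerequisite programmatically, here is a minimal boto3 sketch (assuming credentials for the Administrator user are configured and the region matches your setup) that lists the registered data lake administrators:

    import boto3

    # Assumes Administrator credentials are configured, e.g. via `aws configure`.
    lf = boto3.client("lakeformation", region_name="us-east-2")

    # Fetch the current Lake Formation settings and check that
    # lakeadmin appears among the data lake administrators.
    settings = lf.get_data_lake_settings()["DataLakeSettings"]
    admins = [a["DataLakePrincipalIdentifier"] for a in settings["DataLakeAdmins"]]
    print(admins)  # expect the ARN of the lakeadmin IAM user in this list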

Import Taxi Data into Lake Formation

Allow NycWorkflowRole to read NYC Data

  1. Go to IAM > Policies > Create Policy.
  2. Follow the wizard and create a policy named S3NycBucketRead with the following definition:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::nyc-tlc",
            "arn:aws:s3:::nyc-tlc/*"
          ]
        }
      ]
    }
  3. Attach the S3NycBucketRead policy to NycWorkflowRole. A boto3 equivalent of this step is sketched below.
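
If you prefer to script it, a minimal boto3 sketch that creates and attaches the same policy (assuming credentials with IAM permissions are configured):

    import json
    import boto3

    iam = boto3.client("iam")

    # The policy document is the same JSON shown above.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": ["arn:aws:s3:::nyc-tlc", "arn:aws:s3:::nyc-tlc/*"],
            }
        ],
    }

    # Create the managed policy; the returned ARN includes your account id.
    resp = iam.create_policy(
        PolicyName="S3NycBucketRead",
        PolicyDocument=json.dumps(policy_document),
    )

    # Attach it to the workflow role from the Getting Started tutorial.
    iam.attach_role_policy(
        RoleName="NycWorkflowRole",
        PolicyArn=resp["Policy"]["Arn"],
    )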

Create a database and grant permissions

  1. Head to AWS Lake Formation > Databases > Create Database.
  2. Fill in the form with the details below and click Create Database:
    • Name: taxidata
  3. Click on Databases in the left navigation pane.
  4. Select taxidata.
  5. Click on Actions > Grant.
  6. Fill in the form with the following details and then click Grant:
    • IAM Users and Roles: NycWorkflowRole
    • Database Permissions: Create Table, Alter
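
The database creation and grant can also be scripted with boto3; a minimal sketch (the account id in the role ARN is a placeholder, substitute your own):

    import boto3

    glue = boto3.client("glue", region_name="us-east-2")
    lf = boto3.client("lakeformation", region_name="us-east-2")

    # Create the taxidata database in the Glue Data Catalog.
    glue.create_database(DatabaseInput={"Name": "taxidata"})

    # Grant CREATE_TABLE and ALTER on the database to the workflow role.
    # 123456789012 is a placeholder; replace it with your AWS account id.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/NycWorkflowRole"
        },
        Resource={"Database": {"Name": "taxidata"}},
        Permissions=["CREATE_TABLE", "ALTER"],
    )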

Create a Crawler to register the data in the Glue Data Catalog

A Glue Crawler will read the files in the nyc-tlc bucket and automatically create tables in the taxidata database.

  1. Head to AWS Glue > Crawlers > Add Crawler
  2. Fill in the following details in the wizard and click Finish at the end.
    • Crawler Name: TaxiCrawler
    • Tags: project: lake_formation_example
    • Crawler Source Type: Data Stores
    • Choose a Data Store: S3
    • Crawl Data In: Specified path in another account
    • Include Path: s3://nyc-tlc/trip data/
    • Exclude Patterns: fhv_tripdata_2015*, fhv_tripdata_2016*, fhv_tripdata_2017*, fhv_tripdata_2018*, yellow*, green*
    • Add Another Data Store: Yes
    • Choose a Data Store: S3
    • Crawl Data In: Specified path in another account
    • Include Path: s3://nyc-tlc/misc/
    • Exclude Patterns: *foil*, shared*, uber*, *.html, *.zip, FOIL_*
    • Add Another Data Store: No
    • IAM Role: Choose an existing role, NycWorkflowRole
    • Frequency: Run on demand
    • Choose an Existing Database: taxidata
    • Prefix Added to Tables: csv_
  3. In the crawlers list, select TaxiCrawler and click the Run Crawler button. The crawl should take less than a minute and will register the CSV files you have not excluded in the taxi S3 bucket.
  4. Go to Databases > Tables and review the new tables.
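
If you would rather define the crawler in code, here is a minimal boto3 sketch using the same targets, exclusions, and table prefix as the wizard settings above:

    import boto3

    glue = boto3.client("glue", region_name="us-east-2")

    # Define the crawler with the two S3 targets and exclusion
    # patterns from the console wizard above.
    glue.create_crawler(
        Name="TaxiCrawler",
        Role="NycWorkflowRole",
        DatabaseName="taxidata",
        TablePrefix="csv_",
        Tags={"project": "lake_formation_example"},
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://nyc-tlc/trip data/",
                    "Exclusions": [
                        "fhv_tripdata_2015*", "fhv_tripdata_2016*",
                        "fhv_tripdata_2017*", "fhv_tripdata_2018*",
                        "yellow*", "green*",
                    ],
                },
                {
                    "Path": "s3://nyc-tlc/misc/",
                    "Exclusions": [
                        "*foil*", "shared*", "uber*",
                        "*.html", "*.zip", "FOIL_*",
                    ],
                },
            ]
        },
    )

    # Run the crawler on demand; poll its status with get_crawler.
    glue.start_crawler(Name="TaxiCrawler")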

Verify Data in AWS Athena

Fire up athenacli as the lakeadmin user and run a couple of queries to verify that the data looks good.

us-east-2:default> select * from taxidata.csv_trip_data limit 10;
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+
| dispatching_base_num | pickup_datetime     | dropoff_datetime | pulocationid | dolocationid | sr_flag | hvfhs_license_num |
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+
| B00013               | 2015-01-01 00:30:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 01:22:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 01:23:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 01:44:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 02:00:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 02:00:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 02:00:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 02:50:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 04:45:00 |                  | <null>       | <null>       | <null>  | <null>            |
| B00013               | 2015-01-01 06:30:00 |                  | <null>       | <null>       | <null>  | <null>            |
+----------------------+---------------------+------------------+--------------+--------------+---------+-------------------+
10 rows in set
Time: 4.926s
us-east-2:default> select * from taxidata.csv_misc limit 10;
+------------+-----------------+---------------------------+---------------+
| locationid | borough         | zone                      | service_zone  |
+------------+-----------------+---------------------------+---------------+
| 1          | "EWR"           | "Newark Airport"          | "EWR"         |
| 2          | "Queens"        | "Jamaica Bay"             | "Boro Zone"   |
| 3          | "Bronx"         | "Allerton/Pelham Gardens" | "Boro Zone"   |
| 4          | "Manhattan"     | "Alphabet City"           | "Yellow Zone" |
| 5          | "Staten Island" | "Arden Heights"           | "Boro Zone"   |
| 6          | "Staten Island" | "Arrochar/Fort Wadsworth" | "Boro Zone"   |
| 7          | "Queens"        | "Astoria"                 | "Boro Zone"   |
| 8          | "Queens"        | "Astoria Park"            | "Boro Zone"   |
| 9          | "Queens"        | "Auburndale"              | "Boro Zone"   |
| 10         | "Queens"        | "Baisley Park"            | "Boro Zone"   |
+------------+-----------------+---------------------------+---------------+
10 rows in set
Time: 3.116s
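
The same verification can be done through the Athena API; a minimal boto3 sketch (s3://my-athena-results/ is a hypothetical results bucket, substitute your own query results location):

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-2")

    # Submit the query; s3://my-athena-results/ is a placeholder bucket
    # where Athena will write the query results.
    qid = athena.start_query_execution(
        QueryString="SELECT * FROM taxidata.csv_trip_data LIMIT 10",
        QueryExecutionContext={"Database": "taxidata"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
        print(len(rows) - 1, "rows")  # the first row holds the column headers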

Summary

This tutorial showed how to set up a basic data lake containing NYC Taxi Trip data using AWS Lake Formation.
The data lake is now ready to provide secure access to data engineers, analysts, and data scientists.
We would love to hear whether this tutorial was helpful to you. Get in touch using the chat widget.

Top comments (1)

Dendi Handian

Unfortunately, the Lake Formation APIs for creating blueprint workflows aren't available in Boto3 yet.