How to query CSV file stored in AWS S3 using Athena?

#aws #athena #s3 #python

CSV (Comma Separated Values) can consider as a database, and it stores data in a table format. As far as I know, CSV has been used everywhere; from reports to machine learning to store the dataset and else.

Believe it or not, most companies still use CSV as a 'db.' Then, when we talk about data and you're a passionate Data Analyst, we must analyze it using SQL (in BI tools) so, if you're working in AWS environment do follow this steps.

1- Prepare a CSV File

Skip if you already have CSV file on your local.

As we will use Python, do install Python on your local
Install faker library to get dummy data
Prepare the CSV with 'first_name', 'last_name', 'age' as headers
Now, open the folder and dummy-data.csv should created inside it

import csv
from faker import Faker

fake = Faker()
number_of_records = 8000

with open('dummy-data.csv', mode='w+') as file:
    file_writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    file_writer.writerow(['first_name', 'last_name', 'age'])

    for _ in range(number_of_records):
        file_writer.writerow([fake.first_name(), fake.last_name(), fake.numerify("@#")])

2- Create a bucket in Amazon S3

Skip if you already have a bucket and use the existing one

Go to AWS Console -> search S3 -> click Create Bucket
Fill up the necessary fields -> Create bucket
New bucket is created. Now you can decide whether to create a new folder inside the bucket. For me, I want to have a separate folder for dummy data; hence I created dummy_data folder
Click Create folder
Fill up the necessary fields -> Create folder
New folder is created in the bucket
Now, upload the .csv file by drag to the page
You can refresh or go back to the bucket folder to see the imported CSV file
Please don't forget to copy the S3 URI of the folder to create Athena table

3- Create a table in AWS Athena

Note: If you want to create another database, can use below query
CREATE DATABASE myDataBase

Create table in Athena; you need to define the structure of the table so that Athena table able to know the data structure in CSV file.
1. AWS Management Console
2. AWS CLI
For this post, I am using AWS Management Console. Like the S3 service, you can open your AWS Console and search Athena.
Populate the table using below query

CREATE EXTERNAL TABLE dummy_person (
  first_name string,
  last_name string,
  age string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION '[S3 URI]'
TBLPROPERTIES ("skip.header.line.count"="1");

Table is created.

Run SQL queries to check the data has been populated

select * from dummy_person

At this moment, you can already see the data

Next, you can connect Athena to your BI tools and start to develop the dashboard 😊

DEV Community

How to query CSV file stored in AWS S3 using Athena?

1- Prepare a CSV File

2- Create a bucket in Amazon S3

3- Create a table in AWS Athena

Top comments (0)

Read next

Deploying Docs-as-Code on AWS: Building Dynamic Documentation Sites in MkDocs and Docusaurus

2. Understanding Django’s Architecture: The MTV Pattern.

How to Set Up Selenium as a Linux Daemon with systemd

AWS IoT Core Simplified - Part 4: Rules