Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
It's easy to get started with, and most queries return results in seconds.
In this article I'll show you how to get started with Amazon Athena and what's worth keeping in mind when using the tool.
Create S3 buckets
For this we will create one S3 bucket and add a JSON file to it containing some sample data, which will be used as the query source. We will also create a second bucket where our query results will be stored.
So start by creating an S3 bucket for the data, then create a new file and upload it into the bucket.
Note that your bucket name must be globally unique and must not contain spaces or uppercase letters.
When you have uploaded the file into the bucket, select it and click "Copy S3 URI"; you will need this URI in a later step.
You can copy the content below to the file if you want:
{"id": "aefb16a1-873e-4768-8b41-fed2169d6edc","isActive": true,"balance": "$3,945.61","age": 35,"registered": "2021-07-26T07:51:30 -02:00"}
{"id": "03ae00c4-0c0c-492f-8042-6769aa0ab1dd","isActive": false,"balance": "$32,405.72","age": 77,"registered": "2017-07-13T06:42:59 -02:00"}
{"id": "5c304a34-b9db-4bb5-b939-b399b99a01f7","isActive": true,"balance": "$11,117.00","age": 36,"registered": "2020-02-22T02:31:18 -01:00"}
{"id": "4a2a09c4-ce22-4397-8ac6-0b7175a67015","isActive": false,"balance": "$7,461.80","age": 23,"registered": "2022-01-09T08:18:47 -01:00"}
{"id": "aeee5d8c-8fdc-4c2a-bf69-996e2c167e93","isActive": false,"balance": "$37,321.59","age": 49,"registered": "2015-07-11T12:24:18 -02:00"}
{"id": "f005b41b-f8c2-4e44-8b91-01d3386d6523","isActive": false,"balance": "$10,240.07","age": 20,"registered": "2021-07-25T08:22:22 -02:00"}
{"id": "fd211f23-7890-49bf-8c5b-9a73f474fc88","isActive": true,"balance": "$36,729.80","age": 47,"registered": "2022-09-01T01:43:36 -02:00"}
{"id": "dd2e7598-5f87-432f-ae62-9734f955fb11","isActive": false,"balance": "$9,976.13","age": 80,"registered": "2014-11-10T11:05:54 -01:00"}
{"id": "d436143a-4796-4ee1-9435-379e90fcbe09","isActive": false,"balance": "$27,959.28","age": 28,"registered": "2017-09-01T09:14:55 -02:00"}
{"id": "c7c5cf84-4d4b-41d3-a421-0a3b9043d622","isActive": true,"balance": "$37,448.82","age": 54,"registered": "2020-02-06T10:08:52 -01:00"}
{"id": "e030033f-c266-41d5-a80e-1bcc899f96a2","isActive": false,"balance": "$10,377.82","age": 37,"registered": "2018-07-22T12:11:56 -02:00"}
{"id": "77a63add-51b0-4204-a6c5-36140742e221","isActive": false,"balance": "$5,119.16","age": 68,"registered": "2016-11-20T04:44:33 -01:00"}
{"id": "76ca5a7c-1aeb-4da3-b755-30ecf90fb678","isActive": true,"balance": "$39,777.44","age": 50,"registered": "2020-12-13T09:49:02 -01:00"}
{"id": "1a64b33e-ea3a-4d4c-95e7-dc8bc413e916","isActive": true,"balance": "$36,599.49","age": 23,"registered": "2022-04-11T03:38:24 -02:00"}
{"id": "d5c03b11-e5e9-4b28-ab01-9570152d06d7","isActive": true,"balance": "$8,772.33","age": 41,"registered": "2022-02-19T04:48:50 -01:00"}
{"id": "7e496708-2747-45b7-91d3-569c0864c775","isActive": true,"balance": "$36,996.72","age": 80,"registered": "2016-03-23T08:38:24 -01:00"}
{"id": "f71f8116-a1b1-44be-8541-c199832bd5a5","isActive": true,"balance": "$38,512.93","age": 30,"registered": "2022-09-29T04:50:38 -02:00"}
{"id": "956618ef-81ba-4eca-944f-7c56e26c2446","isActive": false,"balance": "$12,605.06","age": 22,"registered": "2016-09-15T09:55:42 -02:00"}
{"id": "1a6be308-162b-40bd-ab8c-2d50c5ea8d51","isActive": true,"balance": "$30,926.10","age": 30,"registered": "2020-03-17T10:12:31 -01:00"}
{"id": "fbda07e6-cb32-4512-a288-22d6d586ea5e","isActive": false,"balance": "$7,245.87","age": 35,"registered": "2019-04-21T01:45:16 -02:00"}
Now create another bucket that will contain the query results. In this bucket, create a folder called "athena-results" (or whatever you prefer).
You should now have the following:
- One S3 bucket containing the data.json file
- One S3 bucket with a folder in it that will store the Amazon Athena query results
- The S3 URI of the data file, copied from the first bucket
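If you prefer to script this part instead of clicking through the console, a minimal boto3 sketch could look like the one below. The bucket names, the region and the file name data.json are just assumptions for this example; replace them with your own (remember that bucket names must be globally unique).

import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # assumed region, adjust to yours

# Placeholder bucket names; must be globally unique, lowercase and without spaces
data_bucket = "my-athena-tutorial-data"
results_bucket = "my-athena-tutorial-results"

for bucket in (data_bucket, results_bucket):
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )

# Upload the sample data file from above to the data bucket
s3.upload_file("data.json", data_bucket, "data.json")

# An "empty folder" in S3 is just a key that ends with a slash
s3.put_object(Bucket=results_bucket, Key="athena-results/")

print(f"Data file: s3://{data_bucket}/data.json")
print(f"Query results folder: s3://{results_bucket}/athena-results/")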
Set up Amazon Athena and AWS Glue
Now let's head over to Amazon Athena in AWS. In the main view, click "Explore the query editor" to open the query editor.
Now click on "Edit settings" (highlighted in the image above). On the Location of query result option, click "Browse S3" and then select your results bucket and then select the results folder. Then click "Choose" and then "Save".
Now head back to the "Editor" tab. In the navigation pane, choose "Data sources". On the Data sources page, choose "Create data source". Next choose "S3 - AWS Glue Data Catalog".
Note that this option incurs a small extra cost, but it will crawl your data and automatically create a table with columns based on your JSON. The other option is to create the table manually, which is free. Click "Next" at the bottom right.
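For reference, the manual route boils down to a single CREATE EXTERNAL TABLE statement. Below is a rough sketch of what it could look like for this JSON data, submitted through boto3 (you could also paste the DDL straight into the query editor). The database, table and bucket names are assumptions for this example.

import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # assumed region, adjust to yours

# Columns mirror the fields in the sample JSON; database, table and bucket names are assumptions
create_table_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS default.aws_athena_query_data_tutorial (
  id         string,
  isActive   boolean,
  balance    string,
  age        int,
  registered string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-athena-tutorial-data/'
"""

athena.start_query_execution(
    QueryString=create_table_ddl,
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-tutorial-results/athena-results/"
    },
)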
In the next step we want to select the "Create a crawler in AWS Glue" option. Now click "Create in AWS Glue".
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
You will now be redirected to AWS Glue in a new tab. Click "Crawlers" in the left navigation pane and then click "Create crawler".
- Add a name and click "Next"
- For "Is your data already mapped to Glue tables?", select "Not yet"
- Under "Data sources", click "Add a data source"
- Paste the S3 URI you copied earlier, but remove the file name at the end so the path points at the bucket (for example s3://your-bucket/ instead of s3://your-bucket/data.json)
- Click "Add an S3 source"
- Go through the rest of the steps in the wizard
- When you have completed all the steps, select your crawler and click "Run crawler" (this can take a couple of minutes). A scripted alternative using boto3 is sketched below the list.
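For completeness, here is roughly what the same crawler could look like when created with boto3 instead of the console. The crawler name, IAM role and bucket name are assumptions for this example, and the role needs permission to read the data bucket.

import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # assumed region, adjust to yours

# Placeholder names; the IAM role must already exist and be allowed to read the data bucket
glue.create_crawler(
    Name="athena-tutorial-crawler",
    Role="AWSGlueServiceRole-athena-tutorial",
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://my-athena-tutorial-data/"}]},
)

glue.start_crawler(Name="athena-tutorial-crawler")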
Query data in Amazon Athena
Now head back to Amazon Athena and click "Explore the query editor".
Under "Tables" you should now see the table that was created by the crawler.
If you expand the table you should see all the columns it has; these are based on the fields in the JSON file.
Now let's run a query:
SELECT * FROM "default"."aws_athena_query_data_tutorial" where age > 50
This returns only the records where age is greater than 50.
If you check your results bucket in S3, you will also see that a .csv file is automatically generated for each query.
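You can also run the same query from code instead of the console. A boto3 sketch could look like this: it starts the query, waits for it to finish and prints where the result CSV ended up (bucket names and region are the assumed ones from the earlier examples).

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # assumed region, adjust to yours

# Start the same query as above
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "default"."aws_athena_query_data_tutorial" WHERE age > 50',
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-tutorial-results/athena-results/"
    },
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} rows")  # the first row contains the column headers
    print(f"Result CSV: s3://my-athena-tutorial-results/athena-results/{query_id}.csv")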
Cleanup
To clean up what we just created you need to:
- S3: remove the buckets you created (data and results)
- AWS Glue: delete the Crawler and the database
- Athena: delete the table
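If you scripted the setup, the cleanup can be scripted as well. A rough boto3 sketch, using the assumed names from the earlier examples:

import boto3

s3 = boto3.resource("s3", region_name="eu-west-1")   # assumed region, adjust to yours
glue = boto3.client("glue", region_name="eu-west-1")

# Empty and delete both buckets (placeholder names from the earlier examples)
for name in ("my-athena-tutorial-data", "my-athena-tutorial-results"):
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()
    bucket.delete()

# Remove the crawler and the table it created in the Glue Data Catalog
glue.delete_crawler(Name="athena-tutorial-crawler")
glue.delete_table(DatabaseName="default", Name="aws_athena_query_data_tutorial")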
Summary
That's it! I hope you liked this and can see how powerful this tool can be when you need to query data in S3. Everything is serverless, which means you don't need to manage any servers.
Any comments, questions or discussions are always welcome!