In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:
- https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
General Recommendations
Always use WHERE on partition field
This mainly in order to speed up time and cost.
For example:
Avoid:
select * from table1 where cast(col_1 as integer) = cast('201912' as integer) - 1
and prefer:
select * from table1 where col_particion = '201911' //speedup ~85%, savings ~95%
Avoid using ORDER BY without LIMIT
It is extremely important to understand that the ORDER BY function must be done in a single node, since it is a slow and time-consuming function. Ideally it should be avoided, however, if within the use case you are implementing you must use it, I always recommend placing a LIMIT.
For example:
Avoid:
select * from table1 order by date;
and prefer:
select * from table1 order by date limit 1000; //speedup ~98%, avoid 'Query exhausted resources at this scale factor'
Select only the columns to retrieve the final result
This recommendation is very simple. In practical effect it is to avoid the SELECT * FROM.
For example:
Avoid:
create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select * from table_1 A
left join (select * from table_2) B on A.col_1 = B.col_1
Instead use:
create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select col_1,col_2 from table_1 A
left join (select col_1,col_3 from table_2) B on A.col_1 = B.col_1
Schedule data aggregation for small files
The numbers speak for themselves:
Query | Number of files | Run time |
---|---|---|
SELECT COUNT(*) FROM lineitem | 5000 | 8.4 seg |
SELECT COUNT(*) FROM lineitem | 1 | 2.31 seg |
Speedup | 72% faster |
Prefer the use of regular expressions over 'LIKE'
Query | Run time |
---|---|
SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%' | 20.56 seg |
SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...') | 15.87 seg |
Speedup | 17% faster |
Note: The expression would be
regexp_like(text_column, 'wake|some|express|hello')
When using group by for multiple fields. order them from highest to lowest cardinality
This will avoid memory errors and reduce the time to deliver results.
For instance:
Avoid:
select * from people group by column_genre,department;
Instead use:
select * from people group by department,column_genre;
In case of using Crawlers to automatically obtain the structure of the data stored in S3, respect the data types supported for the source engine. Likewise, do not forget to run the crawler after a data update that may generate changes in the structure, this in order to update the structure in the glue catalog.
Use MSCK REPAIR TABLE only if the folders are created with the structure 'field1 = / field2 = /.../ fieldN = ' and only after creating the table, since it is' msck repair table 'is expensive operation and it is preferable to use' alter table add partition 'or glue api to add partitions.
and do you have any good practice that you recommend? Comment it in the comment box.
I hope this blog is useful for you. Greetings!
Top comments (2)
Nice work! The hardest thing I found when first working with JSON was flattening the rows before processing it; a trick needed with Hive which I'd never seen before. Such a tricky tool to learn, but super-powerful once you have it.
Oh yeah, I think that in itself is also part of the difficulty of working with JSON with the SQL standard. In this case, as Athena is under Presto, although there are functions to work under that nomenclature, it is a bit complex if we talk about custom structures. Very good tip!