At the end of the first video for week 3 of DataTalksClub's Data Engineering Zoomcamp, the instructor introduces clustering. In his example, he partitions the table by pickup date and then clusters it by VendorID.
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
My version is:
CREATE OR REPLACE TABLE `data-engineering-2024-411821.ny_taxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorId AS
SELECT * FROM `data-engineering-2024-411821.ny_taxi.external_yellow_tripdata`;
My external_yellow_tripdata table consisted of just one file from Google Cloud Storage. The statement created a new table called yellow_tripdata_partitioned_clustered that was both partitioned and clustered, which I could confirm in the DETAILS tab under Table info.
Then I compared the tables "yellow_tripdata_partitioned" and "yellow_tripdata_partitioned_clustered" by running the same query against each:
SELECT count(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) = '2021-01-15'
AND VendorID = 1;

SELECT count(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime) = '2021-01-15'
AND VendorID = 1;
I didn't save many resources. The non-clustered table processed 11.36 MB in 576 msec, while the clustered table processed 10.04 MB in 418 msec.
I thought I would do better if I searched a range of days, so I replaced the single-date filter with BETWEEN '2021-01-15' AND '2021-01-31'. Here the savings were again 10.04 MB vs. 11.36 MB, but the clustered query took longer, 1817 msec vs. 1199 msec.
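For reference, the range version of the query looked roughly like this (a sketch reconstructed from the substitution described above, using the same table names):

```sql
-- Range query: filters on a span of partition dates instead of a single day
SELECT count(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2021-01-15' AND '2021-01-31'
AND VendorID = 1;
```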
The instructor's results were similar. He got a reduction in data processed from 1.1 GB to 843.5 MB, but a slight increase in processing time from 0.7 sec to 0.8 sec.
In the next video, DE Zoomcamp 3.1.2 - Partitioning and Clustering, the instructor goes over partitioning and clustering in more detail. Most partitioning is done based on a time unit, and daily time units are the most popular. If you have a huge amount of data, you can partition by hour, but there is a limit of 4,000 partitions per table, so you may need a partition expiration strategy. Monthly and yearly partitions are used for smaller amounts of data.
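A partition expiration strategy can be set declaratively in BigQuery. As a sketch (the 90-day window is an arbitrary example, and the table names are the ones from above):

```sql
-- Set a 90-day partition expiration when creating the table
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
PARTITION BY DATE(tpep_pickup_datetime)
OPTIONS (partition_expiration_days = 90) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Or change it on an existing table
ALTER TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
SET OPTIONS (partition_expiration_days = 90);
```

With this option set, BigQuery automatically deletes each partition once it is older than the expiration window, which keeps the table under the partition limit.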
Note: I looked this up in the Google Cloud documentation, because the video is two years old. Tables are still limited to 4,000 partitions.
Clustering improves filter queries and aggregate queries on the columns that you cluster by. You can cluster by up to four columns, and the order of the columns is important, because it specifies the sort order of the data. However, if you have less than 1 GB of data, clustering doesn't really help; in fact, it could add costs due to metadata reads and maintenance.
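Multi-column clustering looks like this (a sketch; the second clustering column, payment_type, is my illustrative choice, not from the course):

```sql
-- Order matters: rows are sorted by VendorID first, then payment_type.
-- Queries filtering on VendorID, or on VendorID and payment_type, benefit most.
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_clustered_multi`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID, payment_type AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```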
When should you use partitioning vs. clustering? Partitioning gives you an estimate of the cost before you run a query; clustering does not. So if you have cost constraints, partition your table, and you can hold off on a query if the estimated cost is too high. Partitioning also allows management of individual partitions, for example, you can delete or move them. Clustering does not.
On the other hand, clustering is better when you need more granularity than partitioning can provide, or if you use filter and aggregate queries on multiple columns. Partitioning can only be used on one column.
You would want to choose clustering over partitioning when partitioning results in a small amount of data per partition (< 1 GB), when partitioning would exceed the 4,000-partition limit, or when mutating operations modify the majority of partitions frequently (e.g. every few minutes).
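In those cases, you can create a table that is clustered but not partitioned. A sketch, again using the table names from above:

```sql
-- Clustered-only table: no partition count limit and no tiny-partition overhead
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_clustered`
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```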
As data is added to a clustered table, you might get a situation where the value ranges of the new data overlap the ranges in the old data, which degrades query performance. BigQuery takes care of this problem by automatically re-clustering the data to restore the sort property of the table. This is especially beneficial because it happens in the background, automatically and at no cost to the user, as data is added to the table.