I originally posted this on Starburst's blog, as part of a series I've been publishing on Iceberg.
TL;DR
Apache Iceberg is an open source table format that brings database functionality to object storage such as Amazon S3, Azure's ADLS, Google Cloud Storage, and MinIO. This allows an organization to take advantage of low-cost, high-performing cloud storage while providing data warehouse features and a familiar experience to its end users, without being locked into a single vendor.
What is Apache Iceberg?
Apache Iceberg is a table format, originally created by Netflix, that provides database-style functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses in an open architecture, avoiding vendor and technology lock-in.
The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Hive tables to Iceberg or have already started. They are excited that a truly open source table format has been created, with many engines, both open source and proprietary, jumping on board.
Advantages of Apache Iceberg
One of the best things about Iceberg is its broad adoption by many different engines. In the diagram below, you can see that many different technologies can work on the same set of data, as long as they use the open source Iceberg API. The amount of work each engine has put into supporting Iceberg is a great indicator of the popularity and usefulness of this exciting technology.
With more and more technologies jumping on board, Iceberg isn't a passing fad. It has been growing in popularity not only because of how useful it is, but also because it is a truly open source table format: many companies have contributed to and improved the specification, making it a true community-based effort.
Here is a list of the many features Iceberg provides:
- Choose your engine: As you can see from the diagram above, there are many engines that support Iceberg. This offers the ultimate flexibility to own your own data and choose the engine that fits your use cases.
- Avoid Data Lock-in: The data that Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in.
- Avoid Vendor Lock-out: Iceberg metadata is always available to all engines, so consistency is guaranteed even with multiple writers.
- DML (modifying table data): Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to satisfy use cases and compliance requirements such as GDPR (see the sketch after this list).
- Schema evolution: Much like a database, Iceberg supports full schema evolution, including columns and even partitions.
- Performance: Since Iceberg stores a table's state in a snapshot, the engine simply reads the metadata in that snapshot and then retrieves only the required data from storage, saving valuable time and reducing cloud object store retrieval costs.
- Database feel: Partitioning can be performed on any column, and end users query Iceberg tables just like they would a database.
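To make the DML and schema evolution points concrete, here is a minimal sketch in Trino SQL; the customer_data table and its columns are hypothetical:

-- Remove a customer's rows, e.g. for a GDPR erasure request
delete from customer_data where customer_id = 12345;

-- Correct existing rows in place
update customer_data set region = 'EMEA' where region = 'EU';

-- Evolve the schema without rewriting existing data files
alter table customer_data add column loyalty_tier varchar;
alter table customer_data rename column region to sales_region;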
Iceberg Architecture
Iceberg is a layer of metadata over your object storage. It provides a transaction log per table, very similar to a traditional database. This log keeps track of the current state of the table, including any modifications. It also keeps a current “snapshot” of the files that belong to the table, along with statistics about them, to greatly reduce the amount of data that needs to be read during queries, improving performance.
Snapshots
Every time a modification to an Iceberg table is performed (insert, update, delete, etc.), a new snapshot of the table is created. When an Iceberg client (let’s say Trino) wants to query a table, the latest snapshot is read and the files that “belong” to that snapshot are read. Because the table retains a set of snapshots over time, this enables a very powerful feature called time travel: any previous state of the table can be queried with the proper syntax.
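For example, recent versions of Trino expose time travel through FOR VERSION AS OF and FOR TIMESTAMP AS OF clauses; the table name, snapshot ID, and timestamp below are placeholders:

-- Query the table as of a specific snapshot ID
select * from customer_iceberg for version as of 2223082798683567304;

-- Query the table as it existed at a point in time
select * from customer_iceberg for timestamp as of timestamp '2022-10-05 14:00:00 UTC';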
Under the covers, Iceberg uses a set of Avro-based files to keep track of this metadata. A Hive-compatible metastore is used to “point” to the latest metadata file, which holds the current state of the table. Any engine that wants to interact with the table first gets the latest “pointer” from the metastore, then works through the Iceberg metadata files from there.
Here is a very basic diagram of the different files that are created during a CTAS (create table as select):
Metadata File Pointer – This is an entry in a Hive-compatible metastore (AWS Glue, for example) that points to the current metadata file. This is the starting point for any query against an Iceberg table.
Metadata File – A JSON file that contains the latest version of the table. Any change made to the table creates a new metadata file. The contents of this file are simply a list of manifest list files along with some high-level metadata.
Manifest List – A list of the manifest files that make up a snapshot. It also includes metadata such as partition bounds, so files that are not needed for a query can be skipped.
Manifest File – Lists a set of data files and metadata about them. This is the final step in query planning: these files determine exactly which data files need to be read, saving valuable query time.
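You rarely need to read these files by hand; Trino exposes them through hidden metadata tables instead. Here is a quick sketch using the customer_iceberg sample table shown next:

-- One row per snapshot of the table
select snapshot_id, committed_at, operation from "customer_iceberg$snapshots";

-- One row per manifest file in the current snapshot
select path, added_rows_count from "customer_iceberg$manifests";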
Here is a sample table named customer_iceberg that was created on S3:
customer_iceberg-a0ae01bc83cb44c5ad068dc3289aa1b9/
data/
20221005_142356_18493_dnvqc-43a7f422-d402-41d8-aab3-38d88f9a8810.orc
20221005_142356_18493_dnvqc-548f81e0-b9c3-4015-99a7-d0f19416e39c.orc
metadata/
00000-8364ea6c-5e89-4b17-a4ea-4187725b8de6.metadata.json
54d59fe-8368-4f5e-810d-4331dd3ee243-m0.avro
snap-2223082798683567304-1-88c32199-6151-4fc7-97d9-ed7d9172d268.avro
Table directory – This is the name of the table with a unique UUID appended in order to support table renames.
Data directory – This holds the ORC, Parquet, or Avro data files and can contain subdirectories depending on whether the table is partitioned.
Metadata directory – This directory holds the metadata, manifest list, and manifest files covered above.
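To see what actually landed in the data directory, you can query the hidden $files metadata table, which lists every data file along with per-file statistics; a quick sketch against the sample table above:

select file_path, file_format, record_count, file_size_in_bytes
from "customer_iceberg$files";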
Again, this might be too nitty-gritty for the average user, but the point is that a tremendous amount of thought and work has been put into Iceberg to ensure it can handle many different types of analytical queries along with real-time ingestion. It was built to fill the gap between low-cost cloud object stores and demanding processing engines such as Trino and Spark.
Partitioning
Using partitions in Iceberg is just like using them in a database. Most data you ingest into your data lake has a timestamp, and partitioning by that column is very easy:
Example – partition by month from a timestamp column and by a region column:
create table orders_iceberg (order_id bigint, create_date timestamp(6), region varchar)
with (partitioning = ARRAY['month(create_date)', 'region'])
Querying with a standard where clause against the partitioned column results in partition pruning and much better performance:
select * from orders_iceberg
where cast(create_date as date) between date '1993-06-01' and date '1993-11-30';
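To confirm how the table is laid out, the hidden $partitions metadata table shows per-partition row counts and file counts; a sketch, assuming the orders_iceberg table above:

select * from "orders_iceberg$partitions";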
Trino's Iceberg Support
Trino has full support for Iceberg, with its key features listed below:
- Create Table
- Modify Table (update/delete/merge)
- Add/Drop/Modify table column
- Rename table
- Rollback to previous snapshot
- View support (includes AWS Glue)
- Time travel
- Maintenance (Optimize/Expire Snapshots; see the example after this list)
- Alter table partition
- Metadata queries
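For example, the maintenance and rollback entries map to a handful of statements; a sketch that assumes an Iceberg catalog named iceberg, a schema named sales, and example retention and snapshot ID values:

-- Compact small files into larger ones
alter table orders_iceberg execute optimize;

-- Remove snapshots older than seven days
alter table orders_iceberg execute expire_snapshots(retention_threshold => '7d');

-- Roll the table back to a previous snapshot
call iceberg.system.rollback_to_snapshot('sales', 'orders_iceberg', 2223082798683567304);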
Using Iceberg in Trino is very easy. There is a dedicated connector page in the Trino documentation.
If you're new to Trino, Starburst Galaxy's free tier is the easiest and fastest way to test out the power of Trino and Iceberg.