Introduction
In the world of big data, storage solutions are the backbone of any architecture. As an expert solutions architect, I have relied on various storage solutions to ensure data integrity and availability. Recently, I discovered a critical flaw in MinIO's tiering feature, introduced in RELEASE.2022-11-10T18-20-21Z, that poses a significant risk to data integrity.
This feature is inspired by AWS S3 lifecycle transitions.
MinIO is very simple to use, which is why it is frequently chosen for testing purposes to simulate S3 in CI/CD pipelines. In these cases, data loss is not an issue. On the other hand, MinIO is also highly appreciated in on-premises environments for providing an S3 API that is very similar to AWS S3 and much simpler to set up than the alternative Ceph + Rados Gateway. In such cases, MinIO is often a critical part of the enterprise data lake and is operated with very high resilience and durability targets. Data loss is not tolerable.
Note: This is an ongoing investigation, and I have not received any insights from the MinIO team. However, I experienced the issue on a production cluster and was able to reproduce it in a test setup. I tested two different versions of MinIO, and the issue was present in both.
About tiering
Important Facts About the Tiering Feature:
- Once tiering is configured, all requests must go to the unique hot tier.
- Backend tier(s) cannot be used, even for read-replica purposes. Under the hood, MinIO uses UUIDs to name transitioned objects.
- The hot tier holds the metadata of all objects (from all tiers). As such, there are no lookups to cold tiers on LIST requests. This is particularly powerful for on-premises setups, where one can have a small SSD cluster as the front tier and a large HDD cluster as the cold tier, achieving great performance.
- The transition strategy is dictated by AWS algorithms, and transitions are made in daily batches starting at 00:00 UTC.
- The transition delay cannot be shorter than one day.
More details about the philosophy behind MinIO's tiering feature can be found in GitHub issue 18821.
The Issue at Hand
MinIO's tiering feature was designed to optimize storage by transitioning objects between different storage classes. However, enabling this feature can lead to severe data loss. The root of the problem lies in the MinIO scanner's inability to repair the metadata of transitioned objects. This flaw is akin to the lack of an anti-entropy algorithm (hello, Cassandra friends!), where data consistency cannot be repaired once compromised.
The Consequences
Without the ability to repair metadata, every outage or drive replacement risks losing quorum. Over time, this inevitably leads to data loss. For organizations relying on MinIO for critical data storage, this flaw could have catastrophic consequences.
Technical Details
To provide a comprehensive understanding, I will delve into the technical specifics of the issue as documented on GitHub (Issue #20559).
MinIO uses Erasure Coding to ensure fault tolerance. Without tiering, replacing faulty drives is a routine operation. Once a faulty drive is replaced, MinIO immediately detects that it is empty and initiates its healing process to rebuild the data.
Recently, I had to replace a faulty server in an on-premises cluster using tiering. To my surprise, many users complained about LIST consistency issues, which prompted this analysis.
The smallest test setup involves 5 virtual machines (VMs). The hot tier needs to effectively use erasure coding, requiring a minimum cluster size of 4 nodes. The cold tier only needs to exist to configure the transition, so I used a single VM, although it could have been a path to an AWS S3 bucket.
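For readers who want to rebuild a similar setup, the transition could be configured roughly as follows. This is a sketch, not the exact commands used for this test: the tier name COLDTIER, the endpoint, the credentials, and the remote bucket name are hypothetical placeholders, and the flags follow recent `mc` releases as I understand them (older releases exposed tier management under `mc admin tier`), so check `mc ilm tier add --help` on your version. The helper prints the commands by default so they can be reviewed before anything is applied.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the commands that attach a cold tier and a transition rule.
# Everything except the alias "local" and the bucket "buc" is a hypothetical placeholder.

# run: echo the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

configure_tiering() {
  local alias_name="$1" bucket="$2" tier="$3" days="$4"
  # Register the remote MinIO cluster as a tier (assumed flags; verify with --help).
  run mc ilm tier add minio "$alias_name" "$tier" \
    --endpoint "${COLD_ENDPOINT:-http://cold.example:9000}" \
    --access-key "${COLD_ACCESS_KEY:-CHANGEME}" \
    --secret-key "${COLD_SECRET_KEY:-CHANGEME}" \
    --bucket "${COLD_BUCKET:-coldbucket}"
  # Transition objects to that tier after the given number of days.
  run mc ilm rule add "$alias_name/$bucket" \
    --transition-days "$days" --transition-tier "$tier"
}
```

Calling `configure_tiering local buc COLDTIER 1` prints both commands; setting `DRY_RUN=0` executes them instead. One day is used here because it is the shortest transition delay the feature accepts.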
These two terminal captures show the version used and the files stored in the test bucket named "buc". As can be seen, all objects from the directory "adir" transitioned to the cold tier. We can still find the metadata of all these transitioned objects by searching for the "xl.meta" files directly in the MinIO drives.
There are 21 objects in this bucket "buc", split across 2 directories (adir: 11 cold objects + bdir: 10 hot objects), but the scanner sees:
- 11 grey objects: these correspond to the metadata files of the transitioned objects, 1 metadata file per object.
- 14 green objects: 10 of these are certainly the recent objects in the "STANDARD" hot tier. The other 4 objects make no sense.
Sidenote: MinIO inlines the data of small objects. To avoid wasting IOPS on small files, if the data size is smaller than 256 KiB, MinIO does not create a data file and the data is inlined in the metadata file "xl.meta".
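To make the threshold concrete, here is a small sketch that classifies a file size against the 256 KiB limit quoted in the sidenote. The function name is hypothetical and the threshold is taken from this article rather than read from MinIO, so treat it as an illustration of the rule, not of MinIO's internals.

```shell
# Sketch: would an object of this size be inlined in xl.meta?
# INLINE_LIMIT_BYTES mirrors the 256 KiB figure quoted in the article (assumption).
INLINE_LIMIT_BYTES=$((256 * 1024))

storage_layout() {
  local size_bytes="$1"
  if [ "$size_bytes" -lt "$INLINE_LIMIT_BYTES" ]; then
    echo "inline"      # data would live inside xl.meta
  else
    echo "datafile"    # data would live in a separate data file
  fi
}
```

The test files in this article are 11 to 20 KiB, so `storage_layout $((20 * 1024))` reports `inline` for the largest of them.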
In the next step, the file named "testfile_241028_idx20" is directly removed from the MinIO drive of the node u20-1 to simulate a faulty drive. In a non-tiered setup, the healing would be immediate upon the next read of this object. In this example, I forced a heal and read this object, but as can be seen, the "xl.meta" was never healed on u20-1. At this stage, there is still no data loss.
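When repeating this kind of experiment, it helps to count how many drives still hold the "xl.meta" of an object. The sketch below assumes the on-disk layout seen in this test (`<drive>/<bucket>/<object>/xl.meta`, as in `/data/1/buc/adir/...`); the function name is hypothetical, and it only sees drives mounted on the host where it runs, so on a multi-node cluster it would have to be run on each node.

```shell
# Sketch: count how many drive roots still hold xl.meta for one object.
# Usage: count_meta_copies <bucket/object/path> <drive_root>...
count_meta_copies() {
  local object_path="$1"; shift
  local found=0 root
  for root in "$@"; do
    # A copy counts only if the metadata file itself is present.
    if [ -f "$root/$object_path/xl.meta" ]; then
      found=$((found + 1))
    fi
  done
  echo "$found"
}
```

For example, `count_meta_copies buc/adir/testfile_241028_idx20 /data/1` would report 0 or 1 on a node with a single drive.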
In the next step, I performed the same action on the node u20-2 and observed a catastrophic failure: data was lost.
At this stage, I am still able to list the object despite using the MinIO setting "list_quorum=strict". However, I am no longer able to read the data.
After performing the same action on u20-3, the data is completely gone, and I am not even able to list the lost object. If you do not use a delta table or some other catalog, you might never notice that data was lost.
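Without a full catalog, even a plain text manifest of expected object keys can reveal this kind of silent loss. The following is a minimal sketch under that assumption: the function name and the manifest file are hypothetical, and both inputs are expected to contain one object key per line.

```shell
# Sketch: print object keys that the manifest expects but the listing lacks.
# Both inputs are plain text files with one object key per line.
missing_objects() {
  local manifest="$1" listing="$2"
  local a b
  a="$(mktemp)"; b="$(mktemp)"
  sort -u "$manifest" > "$a"
  sort -u "$listing" > "$b"
  comm -23 "$a" "$b"   # keep lines that appear only in the manifest
  rm -f "$a" "$b"
}
```

The listing file could be produced from `mc ls -r local/buc/` by keeping the last column of each line, which assumes object keys without spaces. An empty result means every expected key is still listed; it says nothing about whether the data is still readable.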
Once you reach this point, there is no solution to heal the cluster. You are forced to use a backup and rewrite the data to the hot tier to ensure consistent metadata across all nodes.
This situation is even more frustrating because the data remains intact on the cold tier! However, since MinIO uses internal names and directory structures on the cold tier, it is impossible to heal the data using the cold tier, as can be seen below.
Shell snippet
rm testfile_*idx*
for i in {11..20}; do dd if=/dev/urandom bs=1K count=$i of=testfile_$(date '+%y%m%d')_idx$i; done
mc cp testfile_*idx* local/buc/adir/
mc ls -r local/buc/
mc admin info local | tail -12
date
f='testfile_241028_idx20'
mc ls local/buc/adir/$f
find /data/1/buc -type f | grep $f
ll /data/1/buc/adir/$f/xl.meta
mc admin heal -r --scan=deep -a --force local/buc
for i in {1..10}; do printf '%s ' "$(mc ls -r local/buc/ | wc -l)"; done; echo ""
mc ls local/buc/adir/$f; date; sudo systemctl stop minio; date ; sudo rm -f $(find /data/1/buc -type f | grep $f); ll /data/1/buc/adir/$f/xl.meta; sudo systemctl start minio ; date ; ll /data/1/buc/adir/$f/xl.meta; mc ls local/buc/adir/$f;
for i in {1..10}; do printf '%s ' "$(mc ls -r local/buc/ | wc -l)"; done; echo ""
mc admin info local | tail -2
ll /data/1/buc/adir/$f/xl.meta
mc stat local/buc/adir/$f
mc admin heal -r --scan=deep -a --force local/buc
ll /data/1/buc/adir/$f/xl.meta
mc ls local/buc/adir/$f
mc cp local/buc/adir/$f /tmp/
mc stat local/buc/adir/$f
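The repeated-listing loops in the snippet above print a row of counts to be compared by eye. A small helper can flag a mismatch automatically; this is only a sketch with a hypothetical function name, and it compares the counts as strings, so they should be passed unquoted from `wc -l` to drop any padding.

```shell
# Sketch: report whether repeated listing counts all agree.
# Arguments are the counts, e.g. collected from repeated "mc ls -r ... | wc -l" runs.
listing_counts_consistent() {
  local first="$1" count
  for count in "$@"; do
    if [ "$count" != "$first" ]; then
      echo "inconsistent"
      return 1
    fi
  done
  echo "consistent"
}
```

For instance, `listing_counts_consistent 21 21 20 21` prints `inconsistent` and returns a non-zero status, which makes it easy to use in a test script.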
Speculations
In this section, I will allow myself to go beyond the facts presented earlier.
Firstly, bugs of this nature should not occur in a production-grade storage solution. Their existence suggests that the developers did not adequately test the product.
Secondly, from an architectural perspective, I believe the feature is flawed by design. In MinIO, the anti-entropy mechanism is the "scanner" which runs continuously to perform a full scan of all objects on a regular basis. Additionally, there is a read-repair mechanism triggered when some shards are unavailable during read queries. This system works well when tiering is not used. However, when tiering is enabled, the hot tier only holds metadata of transitioned objects. Consequently, the scanner perceives these as "grey objects" or unrecoverable objects. This is the first problem with tiering: you can no longer monitor your hot tier for data loss, as transitioned objects have the same status as lost objects.
Additionally, to scale, LIST queries remain on the front tier and do not trigger lookups to the backend until the user requests a READ of the object itself. Any fix for the healing issue should therefore avoid lookups on the cold tier, since such lookups would completely undermine the purpose of the tiering feature...
Technically, the fact that the tiering feature uses scheduled batches rather than streaming complicates matters: you must wait (or manipulate the clock) until 00:00 UTC the next day for the transition to occur. Regardless, resilience and durability tests must be conducted!
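When planning such a test, it is handy to know how long the wait will be. The sketch below computes the number of seconds until the next 00:00 UTC from a Unix timestamp, assuming the batch really starts at 00:00 UTC as described above; the function name is hypothetical.

```shell
# Sketch: seconds remaining until the next 00:00 UTC, given a Unix timestamp.
# With no argument, the current time is used.
seconds_until_utc_midnight() {
  local now="${1:-$(date -u +%s)}"
  # Unix time counts 86400 seconds per day, so the remainder is the time elapsed today.
  echo $(( 86400 - now % 86400 ))
}
```

Called exactly at midnight, the function returns 86400, that is, the wait until the following batch.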
Recommendations
While it's crucial to highlight this flaw, it's equally important to offer solutions or workarounds. Here are some recommendations for MinIO users:
- Disable Tiering: Until a fix is released, consider disabling the tiering feature to prevent potential data loss.
- Regular Backups: Ensure that regular backups are in place to mitigate the risk of data loss.
- Monitor Updates: Keep an eye on MinIO's updates and patches for a resolution to this issue.
Conclusion
Raising awareness about this critical flaw is essential for the community. While MinIO has been a reliable storage solution, this issue underscores the importance of thorough testing and validation of new features. I urge MinIO to address this flaw promptly to restore user confidence.