
Kevin Cox

Originally published at kevincox.ca

Attempting to use bcachefs

I was revamping my home server and needed to decide what filesystem to use. My ol’ reliable ext4 was not going to cut it as I was planning on going multi-disk with some level of redundancy, so I considered a few options.

Hardware RAID

LVM or hardware RAID is the baseline option. I quickly ruled this out as it is quite inflexible. I can go with 2x data replication (RAID1) but have limited ability to add disks or to store some data with less durability (for data that is easy to replace). For example, when adding disks to a 2x replicated logical volume you need to add them in pairs. Also, when replacing a disk you need to replace it with a disk of the same size (or do some complicated remapping). If I did go this route I probably would have ditched replication, but even then striping is limited if you are incrementally adding disks, leaving performance on the table.
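For illustration, the pair-at-a-time constraint with LVM's RAID1 support looks roughly like this (volume group and device names are hypothetical):

# Create a 2-way mirrored logical volume across two disks
# (assumes /dev/sda and /dev/sdb are already PVs in vg0).
lvcreate --type raid1 -m 1 -L 1T -n data vg0 /dev/sda /dev/sdb
# Growing it later means adding capacity in matched pairs:
vgextend vg0 /dev/sdc /dev/sdd
lvextend -L +1T vg0/data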

ZFS

ZFS is the giant in this field, heralded as an improvement on hardware RAID. However, in my opinion it is only a small step above. You group disks into “vdevs” and then you can create a filesystem on top of any number of vdevs. The vdev system provides very flexible storage configurations: you can stripe, replicate, erasure code or some combination of the above, such as striping over replicated pairs of disks. But the key limitation of ZFS is that these RAID-type configurations are still fairly static. If you have 2 disks mirrored you can’t add a 3rd disk and still have all data stored twice. If you have 3 disks with 1 parity disk you can’t add a fourth. You are also fairly limited in how you configure your redundancy. Really it is similar to LVM, except the striping on top is more dynamic. There are also some benefits to the end-to-end integration compared to a single-disk filesystem on top of an independent RAID system, but it feels more like minor optimizations than a fully integrated system. It also doesn’t allow mixed redundancy in the same storage pool: if you want a particular dataset to be unreplicated you need to dedicate fixed space upfront to a separate vdev and put it in a separate zpool.
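A sketch of the vdev model (pool and device names hypothetical):

# A pool striped over two mirrored pairs.
zpool create tank mirror sda sdb mirror sdc sdd
# Capacity is added a whole vdev at a time, e.g. another mirror pair:
zpool add tank mirror sde sdf
# A single extra disk can't join the existing mirrors' redundancy scheme.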

I think ZFS fits really well in more professional use cases where you are likely to purchase and configure the whole server at once and each different use case can use a different server or at least different drives. However, for a hobbyist use case this lack of flexibility is a more severe limitation.

BTRFS

BTRFS is the next obvious contender. It solves the incremental addition problem very well. If I have a 2x replicated (RAID1) system with 2 drives and add a third drive I can immediately get that space. As long as the drive doesn’t become more than half of your storage pool it will fit right in. (Rebalancing may be required depending on previous space usage and the fraction of added storage.) It feels like an integrated system making an intelligent placement decision, instead of just following simple RAID-style rules. However, BTRFS still lacks some flexibility on the filesystem side. Each BTRFS filesystem can only have a single replication profile. So you need to decide upfront what redundancy and performance tradeoffs you want to make for your whole system, or you need to manage multiple filesystems and deal with the mostly-fixed partitioning of the storage between them.
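Roughly, growing a two-disk RAID1 filesystem with a third disk looks like this (mount point and device names hypothetical):

# Add the new device to the mounted filesystem.
btrfs device add /dev/sdc /mnt
# Optionally rebalance so existing chunks spread across all three disks:
btrfs balance start --full-balance /mnt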

Off Topic: It seems like BTRFS could support this. Migrating between storage “profiles” is done online, so there must be some support for multiple active profiles. It also seems that you can cancel a migration and start a different one so there is support for more than 2 profiles on the same filesystem. Maybe all that is missing is a way to configure what profile writes should use?

BTRFS also has other issues that I would prefer to avoid. It has a reputation for corrupting itself, which is hard to shake. It also has a very simple view of disks, basically treating all devices as equivalent. This makes it very limiting if you mix fast and slow disks. It isn’t too bad for striped data (RAID0) as there is only one copy to read from, though even then options to move hot data to faster storage would be useful. Replicated read performance really suffers, however, as each read is effectively a coin flip between the fast and slow disk rather than always going to the faster one.

bcachefs

bcachefs is the filesystem I ended up using. It is a newer player, a derivative of the in-kernel bcache block device caching system but not yet included in the upstream kernel. Getting it set up on NixOS was fairly easy as it is supported in nixpkgs and there are instructions in the wiki. It took a few tries to get it installed and booted, but that was just due to my own mistakes.
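For reference, formatting and mounting a multi-device bcachefs filesystem looks roughly like this (device names hypothetical):

# Format two disks into one filesystem with 2x replication.
bcachefs format --replicas=2 /dev/sda /dev/sdb
# Multi-device filesystems are mounted by joining the devices with ':'.
mount -t bcachefs /dev/sda:/dev/sdb /mnt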

bcachefs provides a really nice abstract interface for storing files. You can basically throw a bunch of disks at it and it will use them as best it can. You can incrementally add disks and it will add them to your storage pool as you do. But unlike the other filesystems you can also configure durability much more granularly: you can set durability parameters for any subtree, including individual files. You can also do this at any point in time, not just when creating the data as with separate filesystems or logical volumes. For example, I set my root directory to 2x replication (RAID1) for metadata and data, but after installation I ran bcachefs setattr --data_replicas 1 /srv/mirrors/ on a handful of directories containing easy-to-recreate data and am now only using half of the physical storage for them. There are a huge number of attributes that can be controlled at the file or directory level, such as checksums, compression, hash function, and more.
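The commands were along these lines (the root-level default is my reconstruction of the install-time setting; the mirrors command is the one from above):

# Default for the whole tree: two copies of data and metadata.
bcachefs setattr --data_replicas 2 --metadata_replicas 2 /
# Easy-to-recreate data only needs one copy.
bcachefs setattr --data_replicas 1 /srv/mirrors/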

bcachefs also provides good performance. It tracks device latency to automatically issue read requests to the device that is expected to be faster. It provides control of which devices files reside on with the foreground_target (where writes go initially), background_target (where cold data is moved) and promote_target (where files are copied to when read) options. While this isn’t a detailed placement policy (for example, I wanted an option to always keep one copy of a dataset on an SSD and couldn’t specify that), it is flexible enough to get good results for just about any scenario.
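A sketch of a tiered setup using these options at format time (labels and device names hypothetical):

# SSD takes new writes and caches hot reads; HDDs hold cold data.
bcachefs format \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --label=hdd.hdd1 /dev/sda \
    --label=hdd.hdd2 /dev/sdb \
    --foreground_target=ssd \
    --background_target=hdd \
    --promote_target=ssd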

Snapshot Deletion Issue

My first issue with bcachefs was a failure of bcachefs subvolume destroy $snapshot. This is supposed to delete a snapshot, but it did nothing (successfully). I could remove the snapshot with rm -r --no-preserve-root $snapshot, but this took quite a while to recursively delete all children first, rather than the expected near-instant removal. This would be an issue if I were regularly creating snapshots for backups.
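The workflow that hit this, roughly (the snapshot path is hypothetical):

# Take a snapshot, then try to delete it.
bcachefs subvolume snapshot / /snap/2023-09-01
bcachefs subvolume destroy /snap/2023-09-01   # exits successfully, snapshot still there
rm -r --no-preserve-root /snap/2023-09-01     # works, but slowly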

Data Corruption

Then it really started going downhill. Right at the top of the bcachefs.org homepage it proudly proclaims:

The COW filesystem for Linux that won’t eat your data.

Unfortunately my experience did not agree with this claim. Within a day of use some bcachefs task got stuck in an infinite loop and froze the system. I had to force a shutdown, and then the system wouldn’t boot up again. I plugged in a monitor and found the following error:

snapshot deletion did not run correctly:
  duplicate keys in btree inodes at 0:4097 snapshots 4294967291, 4294967295 (equiv 4294967291)
, fixing
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): inconsistency detected - emergency read only
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): check_inode(): error fsck_errors_not_fixed
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): check_inodes(): error fsck_errors_not_fixed
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): Error in recovery: error in recovery (fsck_errors_not_fixed)
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): Error in filesystem: fsck_errors_not_fixed
mount: mounting /dev/sda:/dev/sdb on /mnt-root/ failed: Unknown error 2107

OK, I guess bcachefs doesn’t recover from unclean shutdowns by default. I booted up a live CD and started an fsck. It had some disk activity at the start but then seemed to just sit there, using 100% of a CPU core. But after about an hour it did finish.

I rebooted and quickly noticed a bunch of empty files in my Minecraft server folder. (A hard day’s work lost ☹️) But the system was up and running again. I was hesitant at this point as the hung task seemed to occur significantly after the Minecraft changes should have been flushed to disk but decided to let it keep running.

I added fsck to the mount options to avoid failing to boot again. It didn’t add much time to clean boots (a couple of minutes) and I wanted to automatically recover from unclean shutdowns. Unexpectedly, it kept finding similar errors. It seems like the fsck didn’t fully fix the issue, just enough to mount.
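Enabling this is just a mount option; the fstab entry looked something like this (reconstructed, hypothetical):

/dev/sda:/dev/sdb  /  bcachefs  fsck  0  0

Each run would then log something like: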

bcachefs (8a3a3970-9854-4f6b-b059-297990901660): starting fsck
snapshot deletion did not run correctly:
  duplicate keys in btree inodes at 0:15488 snapshots 4294967291, 4294967295 (equiv 4294967291)
, fixing
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): check_inode(): error need_snapshot_cleanup
bcachefs (8a3a3970-9854-4f6b-b059-297990901660): going read-write

I don’t know if it was this error or other bugs in the filesystem, but problems kept occurring. Files would become unreadable, only to be empty after the next fsck; critical system files would go missing, requiring a reinstall. Bad things kept happening, and it was clear that I needed to get off of bcachefs.

Jumping Ship

I decided to switch to BTRFS. While it did have a reputation for losing data in the past, it seemed to be mostly stable these days. I picked it over ZFS because I planned to incrementally add drives to this system as needed, and that should be easy to do with BTRFS. I decided to ditch redundancy since paying that overhead for all of my data wasn’t worth it to me (and I have backups, so worst case I lose some recent data and have some downtime). Although I suppose that since I wasn’t doing redundancy anyway, I could have used ZFS for similar results.

Shrink

The first step in migrating was freeing a drive to start the BTRFS filesystem on. I first ran bcachefs setattr --metadata_replicas=1 --data_replicas=1 / followed by bcachefs device evacuate to move the data off of one disk. This took some time to copy the data that only lived on that disk, but finished in a reasonable time frame (a couple of hours) with high disk activity the whole time.
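The sequence, roughly (the device name is hypothetical):

# Drop everything to a single copy, then drain one device.
bcachefs setattr --metadata_replicas=1 --data_replicas=1 /
bcachefs device evacuate /dev/sdb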

By some miracle the remaining 4TB disk had 30GiB to spare, barely fitting all of the data onto a single disk. I very nearly needed to clean up some data or spill onto a different disk, so that was a bit of a relief.

I then ran bcachefs device remove to remove that device from the filesystem. (In retrospect, bcachefs device remove copies the data off first unless you pass some scary-sounding flags, so the explicit evacuate was unnecessary.) After reading a small amount from disk it started burning 100% of a single CPU core. I left it running overnight and it was still burning CPU in the morning, with a bunch of kernel warnings about a hung task. I rebooted the system and tried again, and this time it finished quickly.
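That step is a single command (device name hypothetical):

# Detach the drained device from the filesystem.
bcachefs device remove /dev/sdb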

Copy

Next I created a BTRFS filesystem on the now-unused drive, then copied the data to the new system using rsync -av --info=progress2 --no-inc-recursive /bcachefs/ /btrfs/. This step was thankfully uneventful.
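The full sequence, roughly (label, device, and mount points hypothetical; the rsync command is the one above):

# Create the new filesystem on the freed disk, mount it, copy everything.
mkfs.btrfs -L tank /dev/sdb
mount /dev/sdb /btrfs
rsync -av --info=progress2 --no-inc-recursive /bcachefs/ /btrfs/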

Grow

Then I added the second disk to the BTRFS filesystem and ran a rebalance to spread the data evenly. I then re-installed my OS and booted into the BTRFS system.
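In BTRFS terms, roughly (device name hypothetical):

# Add the freed disk, then rebalance to spread data evenly.
btrfs device add /dev/sda /btrfs
btrfs balance start --full-balance /btrfs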

Live from BTRFS

I can’t say that I love being on a filesystem with a reputation for losing data. But I’ve been using it on my mail server for well over a year without any issues. I’m hoping that the kinks have been hammered out by now. After a horrible week dealing with filesystem corruption I have had a wonderful week not worrying about my home server at all. I can only hope that lasts.

Summary

I absolutely love the design of bcachefs. It does a great job of abstracting storage and allowing the best option for each use case without the need to pre-plan everything up-front. It really does feel like you can simply add all your disks, then say “please store this data twice” and it does the right thing. Unless you are over-optimizing with performance tweaking, just adding all your storage and picking the desired replica count will serve you well. You can add storage of different sizes and performance profiles as you need them and adjust the redundancy configuration on the fly. I was lured in by great marketing about focusing on reliability, but despite sticking to the features marked as stable (I was planning on taking advantage of erasure coding one day but held off as it wasn’t stable yet) it still ate my data. I am usually conservative: I still use ext4 for all my desktop systems, since I don’t need snapshots and it has never burned me. However, for a server with more use cases and more storage the allure won me over, and I regret it.

I hope bcachefs can live up to its stability promises one day; its other features are unique and valuable. I dare say that per-file or per-directory configuration is the way that all filesystems should work. But once bitten, twice shy: I can comfortably say that I won’t be trying bcachefs again for at least another decade.
