Brian Misachi
Durable writes

When writing data to a file, one of the things you have to deal with is ensuring the data is reliably persisted to a permanent storage device. Storage devices like magnetic disks (HDD), SSDs, persistent memory (NVRAM) etc. offer some form of durable storage, such that once data is stored it is "guaranteed" to exist after an outage (e.g. power loss). The storage device can be local to your system or on a network somewhere in a remote location.

When updating a file using the `write` system call, one easy mistake is to assume the data is immediately "saved" to permanent storage. Is this assumption correct? Well...it depends :). I will attempt to answer the question by describing what happens when a user writes to a file.

Abstractions everywhere...

Data passes through multiple caching layers, from language libraries down to the kernel, when being written. This is mostly for performance: these caches temporarily store data before sending it on to the storage device.
Performing I/O is expensive and can easily take up to hundreds of milliseconds or even seconds (on some slower disks). This can degrade performance for user applications, as the CPU has to wait for data transfer to and from disk storage. This leads to wasted CPU cycles as the CPU sits idle waiting for I/O to complete. Caching is an optimization that keeps data in memory, closer to the CPU, thus reducing latency and ensuring the CPU is kept busy doing meaningful work.

There are various levels of caching that the system provides as shown in the figure below

(Figure: the levels of caching between a user application and the storage device)

As depicted in the diagram above, caching enables fast reads (reduced I/O to retrieve data from slow storage) and also improves write performance, since the user does not need to wait for the actual writes to the storage device to complete. Writes to disk will be taken care of by the kernel in background writeback threads (bdflush et al.). Something interesting is that even writing the data to the storage device might not guarantee durability. How? Some storage devices have an internal writeback cache where data is stored for a short period before being flushed to the permanent medium. The heuristic used to flush this cache differs between disk vendors and is opaque to the kernel: even if the kernel issues a flush to the device, it cannot guarantee that the data has reached permanent storage.
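To make the buffered path above concrete, here is a minimal sketch in C. The function name and file path are my own illustration, not from the post; the point is that a successful `write()` only means the data reached the kernel's page cache:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: a plain (buffered) write. write() returns once the data has
 * been copied into the kernel page cache -- NOT when it reaches disk. */
ssize_t buffered_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The kernel's writeback threads will flush this to the device
     * later; a power loss in between can lose the data even though
     * write() reported success. */
    ssize_t n = write(fd, data, len);

    close(fd); /* close() does NOT flush the page cache to disk */
    return n;
}
```

Note that `close()` gives no durability guarantee either: the data can still be sitting in the page cache (and in the disk's own writeback cache) after the file is closed.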

With this, a scenario can occur where a user issues a write call and the system experiences an outage (power loss) before the data reaches the storage device. At that point the data could still be sitting in one of the cache layers, so there is a real possibility it is lost. This means that after the system recovers, some of the data the user thought was "stored to durable storage" might not be available.

What now?

This does not mean there is nothing we can do to reduce the chances of data loss. The scenario above mostly describes buffered writes. One way to reduce the possibility of data loss is using unbuffered writes -- that is, bypassing the filesystem cache and kernel page cache and interfacing "directly" with the underlying storage device. I use "directly" loosely here, since you still need to go through the filesystem when opening and writing to the file. This can be done by opening the file with the O_SYNC and O_DIRECT flags. Truly direct access would presumably mean opening the block device itself, bypassing the filesystem entirely (I am not sure about this).
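A sketch of the unbuffered path, with my own function name and an assumed 4096-byte alignment (the actual alignment requirement varies by filesystem and device, and some filesystems reject O_DIRECT outright):

```c
#define _GNU_SOURCE /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write bypassing the page cache with O_SYNC | O_DIRECT.
 * O_DIRECT requires the buffer, file offset and length to be aligned,
 * commonly to the logical block size (512 or 4096 bytes). */
ssize_t direct_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC | O_DIRECT, 0644);
    if (fd < 0)
        return -1; /* e.g. the filesystem does not support O_DIRECT */

    size_t aligned_len = 4096;
    void *buf;
    if (posix_memalign(&buf, 4096, aligned_len) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, aligned_len);
    memcpy(buf, data, len < aligned_len ? len : aligned_len);

    /* The whole aligned block is written; O_SYNC makes write() block
     * until the device reports the data as stored. */
    ssize_t n = write(fd, buf, aligned_len);

    free(buf);
    close(fd);
    return n;
}
```

The alignment bookkeeping is exactly why databases that use O_DIRECT manage their own aligned buffer pools instead of writing arbitrary byte ranges.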

The type of filesystem in use also matters -- ext4 behavior can differ from ZFS, for example, both in handling the flags above and in the general handling of writes (the O_DIRECT flag is not supported by some filesystems).

Calling fsync or fdatasync ensures data is flushed at least to the storage device. The difference is that fsync flushes both data and metadata to the underlying storage device -- the data being what the user wants written to the file, the metadata being details about the file held in its inode (modification time, file size etc.). fdatasync skips metadata that is not needed to read the data back (such as timestamps), so it generally performs better than fsync, with the caveat that missing or stale metadata can cause failures during recovery.
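Combining the pieces so far, a write that the application can reasonably treat as durable looks like this sketch (again, names and path are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: write followed by an explicit flush to the device. */
int durable_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() flushes file data AND all metadata; fdatasync() skips
     * metadata not required to read the data back (e.g. timestamps),
     * often saving an extra journal/inode write. */
    if (fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Checking the return value of fsync/fdatasync matters: if the flush fails, the data may not be durable even though the earlier write() succeeded.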

However, calls to fsync and fdatasync should be used sparingly for performance reasons, since the application has to wait for the data to be flushed out to disk. One way to optimize this is to batch writes and flush at intervals, combining multiple write/sync calls into fewer syscalls.
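The batching idea can be sketched as follows -- many logical writes, one flush. The record format and batch size here are arbitrary illustrations:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: append a batch of records, then pay the fsync() cost once
 * for the whole batch instead of once per record. */
int batched_write(const char *path, const char **records, int count)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* Buffer many logical writes in the page cache... */
    for (int i = 0; i < count; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
    }

    /* ...then flush once for the whole batch. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The trade-off is the usual one: a crash between flushes loses the unflushed tail of the batch, so the interval bounds how much data you are willing to lose.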

So, back to the question above: Is it correct to assume data is written to the storage device after a write call? The answer still remains...it depends.
