Brian Misachi
Durable writes

When writing data to a file, one of the things you have to deal with is ensuring the data is reliably persisted to a permanent storage device. Storage devices like magnetic disks (HDD), SSDs, persistent memory (NVRAM) etc. offer some form of durable storage, such that once data is stored it is "guaranteed" to exist after an outage (e.g. power loss). The storage device can be local to your system or on a network somewhere in a remote location.

When updating a file using the `write` system call, one easy mistake is to assume the data is immediately "saved" to permanent storage. Is this assumption correct? Well...it depends :). I will attempt to answer the question by describing what happens when a user writes to a file.

Abstractions everywhere...

Data passes through multiple caching layers, from language libraries down to the kernel, when being written. This is mostly for performance: these caches temporarily store data before sending it on to the storage device.
Performing I/O is expensive and can easily take up to hundreds of milliseconds or even seconds (on some slower disks). This can degrade performance for user applications, as the CPU has to wait for data transfer to and from disk storage. This leads to wasted CPU cycles as the CPU sits idle waiting for I/O to complete. Caching is an optimization that keeps data in memory, closer to the CPU, thus reducing latency and ensuring the CPU is kept busy doing meaningful work.

There are various levels of caching that the system provides as shown in the figure below

(Figure: the levels of caching between a user application and the storage device)

As depicted in the diagram above, caching enables fast reads (reduced I/O to retrieve data from slow storage) and also improves write performance, since the user does not need to wait for the actual writes to the storage device to complete. Writes to disk will be taken care of by the kernel in background writeback threads (bdflush et al.). Something interesting is that even writing the data to the storage device might not guarantee durability. How? Some storage devices have an internal writeback cache where data is stored for a short period before being flushed to the permanent medium. The heuristic used to flush this cache differs between disk vendors and is opaque to the kernel: even if the kernel issues a flush to the device, it cannot guarantee that the data has reached permanent storage.
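To make the buffered path above concrete, here is a minimal sketch in C. The function name and file path are my own illustration, not from the post; the point is that a successful `write()` only means the data reached the kernel's page cache:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: a plain (buffered) write. write() returns once the data has
 * been copied into the kernel page cache -- NOT when it reaches disk. */
ssize_t buffered_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The kernel's writeback threads will flush this to the device
     * later; a power loss in between can lose the data even though
     * write() reported success. */
    ssize_t n = write(fd, data, len);

    close(fd); /* close() does NOT flush the page cache to disk */
    return n;
}
```

Note that `close()` gives no durability guarantee either: the data can still be sitting in the page cache (and in the disk's own writeback cache) after the file is closed.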

With this, a scenario can occur where a user issues a write call and the system experiences an outage (power loss) before the data reaches the storage device. At that point the data could still be sitting in one of the cache layers, so there is a real possibility it is lost. This means that after the system recovers, some of the data the user thought was "stored to durable storage" might not be available.

What now?

This does not mean there is nothing we can do to reduce the chances of data loss. The scenario above mostly describes buffered writes. One way to reduce the possibility of data loss is using unbuffered writes -- that is, bypassing the filesystem cache and kernel page cache and interfacing "directly" with the underlying storage device. I use "directly" loosely here, since you still need to go through the filesystem when opening and writing to the file. This can be done by opening the file with the O_SYNC and O_DIRECT flags. Truly direct access would presumably mean opening the block device itself, bypassing the filesystem entirely (I am not sure about this).
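A sketch of the unbuffered path, with my own function name and an assumed 4096-byte alignment (the actual alignment requirement varies by filesystem and device, and some filesystems reject O_DIRECT outright):

```c
#define _GNU_SOURCE /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write bypassing the page cache with O_SYNC | O_DIRECT.
 * O_DIRECT requires the buffer, file offset and length to be aligned,
 * commonly to the logical block size (512 or 4096 bytes). */
ssize_t direct_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC | O_DIRECT, 0644);
    if (fd < 0)
        return -1; /* e.g. the filesystem does not support O_DIRECT */

    size_t aligned_len = 4096;
    void *buf;
    if (posix_memalign(&buf, 4096, aligned_len) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, aligned_len);
    memcpy(buf, data, len < aligned_len ? len : aligned_len);

    /* The whole aligned block is written; O_SYNC makes write() block
     * until the device reports the data as stored. */
    ssize_t n = write(fd, buf, aligned_len);

    free(buf);
    close(fd);
    return n;
}
```

The alignment bookkeeping is exactly why databases that use O_DIRECT manage their own aligned buffer pools instead of writing arbitrary byte ranges.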

The type of filesystem in use also matters -- ext4 behavior can differ from ZFS, for example, both in handling the flags above and in the general handling of writes (the O_DIRECT flag is not supported by some filesystems).

Calling fsync or fdatasync ensures data is flushed at least to the storage device. The difference is that fsync flushes both data and metadata to the underlying storage device -- the data being what the user wants written to the file, the metadata being details about the file held in its inode (modification time, file size etc.). fdatasync skips metadata that is not needed to read the data back (such as timestamps), so it generally performs better than fsync, with the caveat that missing or stale metadata can cause failures during recovery.
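Combining the pieces so far, a write that the application can reasonably treat as durable looks like this sketch (again, names and path are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch: write followed by an explicit flush to the device. */
int durable_write(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fsync() flushes file data AND all metadata; fdatasync() skips
     * metadata not required to read the data back (e.g. timestamps),
     * often saving an extra journal/inode write. */
    if (fdatasync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Checking the return value of fsync/fdatasync matters: if the flush fails, the data may not be durable even though the earlier write() succeeded.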

However, calls to fsync and fdatasync should be used sparingly for performance reasons, since the application has to wait for the data to be flushed out to disk. One way to optimize this is to batch writes and flush at intervals, combining multiple write/sync calls into fewer syscalls.
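The batching idea can be sketched as follows -- many logical writes, one flush. The record format and batch size here are arbitrary illustrations:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: append a batch of records, then pay the fsync() cost once
 * for the whole batch instead of once per record. */
int batched_write(const char *path, const char **records, int count)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* Buffer many logical writes in the page cache... */
    for (int i = 0; i < count; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
    }

    /* ...then flush once for the whole batch. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The trade-off is the usual one: a crash between flushes loses the unflushed tail of the batch, so the interval bounds how much data you are willing to lose.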

So, back to the question above: Is it correct to assume data is written to the storage device after a write call? The answer still remains...it depends.
