This blogpost is about an investigation I performed recently into the fsync() call. If you are interested in the details of how the operating system deals with the fsync() call, this is a blogpost for you.
PS: I am using kernel version 3.10.
What is fsync()?
The fsync() call is a system call that a linux executable can use to guarantee that previously written data is persisted on the block device.
Wait! Wasn't there something with fsync and databases?
Yes. There was an issue that some people refer to as fsyncgate 2018. This article is not about that. The fsyncgate 2018 issue was the finding that the linux kernel might not expose write errors to an application that calls fsync(), so that these errors can go undetected.
What does fsync() do?
The first thing to understand is that fsync() is a general linux system call, but because it deals with the filesystem, it has, and requires, a filesystem specific implementation. The common linux filesystem, in use on RedHat compatible linux distributions, is XFS.
The second thing to understand is that by default linux lets processes and threads write to a file and perform other modifications to the filesystem, which in reality are writes/changes to memory, called the page cache, for which the kernel performs the actual writing to disk at a later point in time, generally using kernel write-back worker threads. Only if a file is explicitly opened with the O_DIRECT flag will a write go to the file directly.
The function and reason for fsync() is to provide a way to request that the operating system actually writes any pending changes to a file that are still in memory to disk, and guarantees persistence.
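To make this concrete, here is a minimal example of how an application typically uses fsync(); the file name is just for illustration and error handling is kept to the bare minimum:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* open (or create) a file; the write below initially only modifies the page cache */
    int fd = open("/tmp/fsync-example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    const char *data = "some data that must survive a crash\n";
    if (write(fd, data, strlen(data)) == -1) { perror("write"); exit(1); }

    /* ask the kernel to push the dirty pages (and metadata) of this file to disk */
    if (fsync(fd) == -1) { perror("fsync"); exit(1); }

    close(fd);
    return 0;
}
```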
And now the details
As you might expect, fsync() is not really a single action that magically provides the persistence guarantee. A number of actions must happen, which are detailed below.
When a "user land" process executes the fsync() call with an inode number (file), the process switches to system mode and traverses a number of layers in the linux kernel to the actual XFS implementation of the fsync() call, which is called xfs_file_fsync()
. Please mind this is the source code of the standard kernel, not the RedHat kernel, which turns out to be an important detail.
This function is the basis of the XFS fsync() implementation, and essentially performs the following functions, each of which is described in its own section below:
- filemap_write_and_wait_range(): write the dirty pages of the file to disk and wait for those writes to finish
- xfs_ilock() / xfs_iunlock() with xfs_ipincount(): under the inode lock, determine whether the log needs to be flushed, and up to which log sequence number (lsn)
- _xfs_log_force_lsn(): write the XFS journal/log to disk up to that lsn
- xfs_blkdev_issue_flush(): tell the block device to flush its own cache
This is slightly simplified, for all the details please read the source code (I am not kidding, elixir makes it reasonably easy!).
filemap_write_and_wait_range
Step 1, the filemap_write_and_wait_range() call, is the step that most people think is all that fsync() does: find any ranges of dirty pages (pages written in memory that are not yet persisted to disk) of the given file, and write these to disk.
In kernel 3.10, in the combination that my test machine uses, this step (writing the dirty pages of the given inode/file) is executed by the process itself. With a version 4 kernel on Alma linux 8, the process gathers the blocks, queues these, and requests a kernel background worker thread to perform the writing.
The way filemap_write_and_wait_range() works is that it gathers the dirty pages in a function below it called __filemap_fdatawrite_range(). It creates a batch out of a contiguous range of dirty pages, and repeats this for the next batch of dirty pages until it has scanned the entire file. Each batch becomes an individual IO request, which is then submitted.
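To illustrate the batching idea (this is a hypothetical userspace sketch, not the kernel implementation), here is a loop that scans a dirty-page bitmap and turns every contiguous run of dirty pages into one request:

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 16

/* hypothetical stand-in for submitting one IO request covering pages [start, end) */
static void submit_io_request(int start, int end)
{
    printf("submit IO for pages %d..%d\n", start, end - 1);
}

int main(void)
{
    /* hypothetical dirty-page bitmap of a file: 1 = dirty, 0 = clean */
    bool dirty[NR_PAGES] = { 0,1,1,1,0,0,1,0,1,1,0,0,0,1,1,1 };

    int i = 0;
    while (i < NR_PAGES) {
        if (!dirty[i]) { i++; continue; }
        int start = i;                   /* begin a new contiguous batch */
        while (i < NR_PAGES && dirty[i])
            i++;
        submit_io_request(start, i);     /* one IO request per contiguous run */
    }
    return 0;
}
```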
The second step is in the function __filemap_fdatawait_range(), whose name is reasonably self explanatory: it waits for the submitted write IO requests to finish.
Once __filemap_fdatawait_range() is finished, all the dirty pages for the given inode/file have been written to disk. But the work for fsync() is not finished yet.
xfs_ilock / xfs_iunlock
The xfs_ilock() and xfs_iunlock() functions lock and unlock the inode/file. There are two locks (XFS_IOLOCK and XFS_ILOCK), which can each be taken in shared and exclusive mode. The interesting bit here is that the locking is done to make sure the pin count, which is the number of outstanding items in the journal/log and is read using the function xfs_ipincount() (which is a macro, so there is no function source code), is consistent.
The thing to notice here is the if statement inside the block guarded by xfs_ipincount():
if (!datasync ||
    (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
        lsn = ip->i_itemp->ili_last_lsn;
The way to read this is that the if chooses to read the lsn variable if:
- datasync is not true (data sync is a flag to indicate this function was called as 'fdatasync', which means it requests to ONLY perform the data part above of fsync) -or-
- any fields need logging (meaning a structure change happened since last log, so this overrides datasync)
If either of the two conditions is true, the lsn variable is filled with the log sequence number taken from the last lsn field of the inode log item of the inode (ip).
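Putting the pieces of this section together, the surrounding logic in xfs_file_fsync() looks roughly like the sketch below. This is paraphrased and heavily simplified from the 3.10 source on elixir (not a verbatim copy, and not compilable on its own), and it follows the upstream ordering; the RedHat kernel reorders this, as described in the 'A twist' section below:

```c
/* paraphrased and simplified from fs/xfs/xfs_file.c, kernel 3.10 (not verbatim) */
xfs_lsn_t lsn = 0;

/* step 1: write the dirty pages of the file and wait for the writes to finish */
error = filemap_write_and_wait_range(inode->i_mapping, start, end);

/* step 2: under the inode lock, decide whether (and up to which lsn) to force the log */
xfs_ilock(ip, XFS_ILOCK_SHARED);
if (xfs_ipincount(ip)) {
        if (!datasync ||
            (ip->i_itemp->ili_fields & ~XFS_ILOG_TIMESTAMP))
                lsn = ip->i_itemp->ili_last_lsn;
}
xfs_iunlock(ip, XFS_ILOCK_SHARED);

/* step 3: only force (write) the log if an lsn was picked up above */
if (lsn)
        error = _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
```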
_xfs_log_force_lsn
The _xfs_log_force_lsn() function writes the journal of the XFS filesystem to disk. A filesystem journal is a property of journalling filesystems, which save structure/metadata changes to an inode/file in a separate area on disk, to be able to replay the changes to an inode/file in case of a crash.
Before journalling filesystems, crashes could lead to the filesystem structure (inode to file mappings) getting corrupted, which required running fsck (filesystem check), which typically happens during server startup after a crash, to reconstruct the filesystem in case of inconsistencies. With larger filesystems, this can take significant amounts of time. (Inconsistencies found during fsck are generally saved in the lost+found directory.)
The writing of the journal only happens if the lsn variable that could optionally be obtained in the previous step is not 0 (if (lsn)).
The function _xfs_log_force_lsn() has some interesting remarks in its comments, which explain the following:
* Synchronous forces are implemented with a signal variable. All callers
* to force a given lsn to disk will wait on a the sv attached to the
* specific in-core log. When given in-core log finally completes its
* write to disk, that thread will wake up all threads waiting on the
* sv.
This means that if an fsync() has written the dirty pages from the OS page cache, and obtained the log sequence number, and then entered _xfs_log_force(), it can find the in-core log.
The _xfs_log_force() function is a master function that handles the different parts of flushing XFS metadata to disk for a given in-core log.
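As a userspace analogy (this is not XFS code), the "signal variable" pattern from the comment above can be sketched with a pthreads condition variable: every caller that needs the log forced waits on it, and the thread that completes the write to disk wakes all of them at once.

```c
#include <pthread.h>
#include <stdbool.h>

/* userspace analogy of an in-core log with a "signal variable" (sv) attached */
struct iclog_analogy {
        pthread_mutex_t lock;
        pthread_cond_t  write_done;   /* the "sv" from the XFS comment */
        bool            on_disk;
};

/* callers that force the log to a given lsn end up waiting here */
void wait_for_iclog_write(struct iclog_analogy *ic)
{
        pthread_mutex_lock(&ic->lock);
        while (!ic->on_disk)
                pthread_cond_wait(&ic->write_done, &ic->lock);
        pthread_mutex_unlock(&ic->lock);
}

/* the IO completion path marks the in-core log as written and wakes all waiters */
void iclog_write_completed(struct iclog_analogy *ic)
{
        pthread_mutex_lock(&ic->lock);
        ic->on_disk = true;
        pthread_cond_broadcast(&ic->write_done);
        pthread_mutex_unlock(&ic->lock);
}
```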
_xfs_log_force_lsn() > xlog_cil_force_lsn()
The first thing that gets done is executing the xlog_cil_force_lsn() function. This function takes the lsn, looks for the operating system work queue that is bound to the committed item list (cil) of this in-core log, and checks if it's still busy. If so, it waits for it off CPU. (The committed item list/cil is a doubly linked list of items in the in-core log.)
Once it's done, it checks the committed item list for items with an lsn lower than the required lsn. If it finds one or more committed item list items, it requests a flush (write) of the committed item list to disk via a linux workqueue.
The last thing in the xlog_cil_force_lsn() function is waiting for the committed item list flushes to finish, using the linux workqueue. This prevents multiple processes or threads from flushing the cil at the same time: a later caller waits for the previous flush to finish.
_xfs_log_force_lsn() > xlog_state_switch_iclogs()
The next thing that is done is finding the correct buffer in the in-core log, waiting for pending IOs on it to finish, and moving the pointer to the next in-core log buffer using the function xlog_state_switch_iclogs(), which marks the current buffer for writing (synchronisation). By moving the pointer to the next buffer, new in-core log items will be created in the next buffer, and the previous in-core log buffer can be prepared for writing.
_xfs_log_force_lsn() > xlog_state_release_iclog()
After the log buffer is switched, the function xlog_state_release_iclog() performs the preparation of the previous buffer for writing, and calls xlog_sync() to actually perform the write.
Inside xlog_sync() the actual write happens. However, the write is submitted asynchronously. This is why, when the calls return after submitting the write in xlog_sync(), the loop in _xfs_log_force_lsn() has a final if block that essentially performs xlog_wait() to wait for the log write to finish.
xfs_blkdev_issue_flush()
At two places in xfs_file_fsync() there is another call, depending on specific flags being set and specific situations. This call is xfs_blkdev_issue_flush().
This function essentially sends a SCSI/device command to the disk telling it to flush its cache. The basic idea is that in the case of a crash or lost power, the disk controller or disk cache itself might still hold the written data and could lose it. This call tells the device to flush that data to the actual persistent medium.
This is also the reason that some executables that perform O_DIRECT writes still perform fsync(), even though they don't cache anything at the operating system layer: this way it is made sure that the written data is ultimately stored on a persistent device.
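As an illustration (a minimal sketch; the file name and sizes are arbitrary), this is what such an O_DIRECT writer might look like: the write bypasses the page cache, but fsync() is still called so that metadata is logged and the device cache is flushed:

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/odirect-example.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    /* O_DIRECT requires an aligned buffer and an aligned write size */
    size_t size = 4096;
    void *buf;
    if (posix_memalign(&buf, 4096, size) != 0) { fprintf(stderr, "posix_memalign failed\n"); exit(1); }
    memset(buf, 'x', size);

    /* this write bypasses the page cache and goes to the device directly */
    if (write(fd, buf, size) == -1) { perror("write"); exit(1); }

    /* still call fsync(): it persists metadata and asks the device to flush its cache */
    if (fsync(fd) == -1) { perror("fsync"); exit(1); }

    free(buf);
    close(fd);
    return 0;
}
```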
A twist
I have been reading the source code of the linux kernel and XFS, which gives an understanding of how this works. I carefully checked the kernel version running on CentOS 7, which is kernel version 3.10, and used the same version in the source code browser at https://elixir.bootlin.com.
I also used the linux kernel ftrace facility to look at what is actually going on. In my opinion, it's important to always validate whatever you have been learning and studying.
What I found was that the RedHat 3.10 kernel does not follow the function order shown in the source code on elixir. The page on elixir says:
- filemap_write_and_wait_range()
- xfs_ilock()
- (xfs_ipincount())
- xfs_iunlock()
- _xfs_log_force_lsn
But what I found when using ftrace was:
- filemap_write_and_wait_range()
- xfs_ilock()
- (xfs_ipincount())
- _xfs_log_force_lsn
- xfs_iunlock()
This turns out to be a change in the RedHat kernel. Because this order is the order in newer kernel versions, I assume this is a fix that was backported to the 3.10 RedHat kernel.
Conclusion
There is a lot going on if you make the effort to read the source code and read up on all the concepts that it implements.
One thing that I have not mentioned is that another property of fsync() is that it takes a file descriptor, but it doesn't return or expose how much work it has actually done. The return code is zero for successful execution and -1 for an error.
The reason for the rather complex way of handling the logging goes much further than I described for the sake of trying to keep it as simple as possible. The logging is done in a completely asynchronous way for maximal performance, which makes it more complex.
The first part of fsync(), writing the dirty pages, is quite obvious. The second part, writing the in-core log (buffer), is less obvious, and contains a lot of optimisations to be as performant and flexible as possible for concurrent writes to the log. Hopefully this gives you an understanding of some of the complexities that are going on here, and awareness of the implementation.