Soft Updates drastically improves meta-data performance, mainly file creation and
deletion, through the use of a memory cache. We recommend using Soft Updates on all of
your file systems. There are two downsides to Soft Updates that you should be aware of:
First, Soft Updates guarantees filesystem consistency in the case of a crash, but could
very easily be several seconds (even a minute!) behind in updating the physical disk. If
your system crashes, you may lose more work than you would otherwise. Second, Soft Updates delays
the freeing of filesystem blocks. If you have a filesystem (such as the root filesystem)
which is almost full, performing a major update, such as make
installworld, can cause the filesystem to run out of space and the update to
fail.
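Soft Updates can be enabled when a filesystem is created with newfs -U, or toggled later with tunefs(8) while the filesystem is unmounted (or mounted read-only; for the root filesystem this means single-user mode). For example, assuming /dev/ada0s1f is the partition holding /usr:

    # umount /usr
    # tunefs -n enable /dev/ada0s1f
    # mount /usr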
There are two traditional approaches to writing a file system's meta-data back to disk.
(Meta-data updates are updates to non-content data like inodes or directories.)
Historically, the default behavior was to write out meta-data updates synchronously.
If a directory had been changed, the system waited until the change was actually written
to disk. The file data buffers (file contents) were passed through the buffer cache and
written to disk later on asynchronously. The advantage of this implementation is that
it operates safely. If there is a failure during an update, the meta-data are always in a
consistent state. A file is either created completely or not at all. If the data blocks
of a file did not find their way out of the buffer cache onto the disk by the time of the
crash, fsck(8) is able to
recognize this and repair the filesystem by setting the file length to 0. Additionally,
the implementation is clear and simple. The disadvantage is that meta-data changes are
slow. An rm -r, for instance, touches all the files in a
directory sequentially, but each directory change (deletion of a file) will be written
synchronously to the disk. This includes updates to the directory itself, to the inode
table, and possibly to indirect blocks allocated by the file. Similar considerations
apply for unpacking large hierarchies (tar -x).
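To get a feel for this cost, you can time such operations on a filesystem using the traditional synchronous behavior and compare them against the same tree on a Soft Updates filesystem; the paths below are only examples:

    # time rm -r /mnt/ports-copy
    # time tar -xf /tmp/ports.tar -C /mnt

With synchronous meta-data updates, both runs are dominated by the many small, individually completed writes rather than by the amount of data moved.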
The second approach is asynchronous meta-data updates. This is the default for
Linux/ext2fs, and is selected with mount -o async for *BSD UFS. All meta-data
updates are simply passed through the buffer cache as well; that is, they will be
intermixed with the updates of the file content data. The advantage of this
implementation is there is no need to wait until each meta-data update has been written
to disk, so all operations which cause huge amounts of meta-data updates work much faster
than in the synchronous case. Also, the implementation is still clear and simple, so
there is a low risk for bugs creeping into the code. The disadvantage is that there is no
guarantee at all for a consistent state of the filesystem. If there is a failure during
an operation that updated large amounts of meta-data (like a power failure, or someone
pressing the reset button), the filesystem will be left in an unpredictable state. There
is no opportunity to examine the state of the filesystem when the system comes up again;
the data blocks of a file could already have been written to the disk while the updates
of the inode table or the associated directory were not. It is actually impossible to
implement a fsck which is able to clean up the resulting chaos
(because the necessary information is not available on the disk). If the filesystem has
been damaged beyond repair, the only choice is to use newfs(8) on it and
restore it from backup.
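On FreeBSD, asynchronous updates have to be requested explicitly at mount time (device and mount point below are placeholders), whereas ext2fs behaves this way unless told otherwise:

    # mount -o async /dev/ada0s1e /mnt

Should such a filesystem be lost after a crash, re-creating and restoring it looks roughly like this, assuming a dump(8) backup exists at /backup/usr.dump:

    # newfs /dev/ada0s1e
    # mount /dev/ada0s1e /mnt
    # cd /mnt && restore -rf /backup/usr.dump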
The usual solution for this problem was to implement dirty region logging, which is also referred to as journaling, although that term is not
used consistently and is occasionally applied to other forms of transaction logging as
well. Meta-data updates are still written synchronously, but only into a small region of
the disk. Later on they will be moved to their proper location. Because the logging area
is a small, contiguous region on the disk, there are no long distances for the disk heads
to move, even during heavy operations, so these operations are quicker than synchronous
updates. Additionally the complexity of the implementation is fairly limited, so the risk
of bugs being present is low. A disadvantage is that all meta-data are written twice
(once into the logging region and once to the proper location) so for normal work, a
performance “pessimization” might result. On the other hand, in case of a
crash, all pending meta-data operations can be quickly either rolled-back or completed
from the logging area after the system comes up again, resulting in a fast filesystem
startup.
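This kind of logging is not part of Soft Updates, but FreeBSD can provide it for UFS through the block-level gjournal(8) GEOM class; a rough sketch, with /dev/ada1s1d standing in for a spare partition that will hold both data and journal:

    # gjournal load
    # gjournal label /dev/ada1s1d
    # newfs -J /dev/ada1s1d.journal
    # mount -o async /dev/ada1s1d.journal /mnt

Because the journal already orders the meta-data writes, such a filesystem is normally mounted async.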
Kirk McKusick, the developer of Berkeley FFS, solved this problem with Soft Updates:
all pending meta-data updates are kept in memory and written out to disk in a sorted
sequence (“ordered meta-data updates”). This has the effect that, in case of
heavy meta-data operations, later updates to an item “catch” the earlier ones
if those are still in memory and have not yet been written to disk. So all
operations on, say, a directory are generally performed in memory before the update is
written to disk (the data blocks are sorted according to their position so that they will
not be on the disk ahead of their meta-data). If the system crashes, this causes an
implicit “log rewind”: all operations which did not find their way to the
disk appear as if they had never happened. A consistent filesystem state is maintained
that appears to be the one from 30 to 60 seconds earlier. The algorithm used guarantees
that all resources in use are marked as such in their appropriate bitmaps: blocks and
inodes. After a crash, the only resource allocation error that occurs is that resources
are marked as “used” which are actually “free”. fsck(8) recognizes
this situation, and frees the resources that are no longer used. It is safe to ignore the
dirty state of the filesystem after a crash by forcibly mounting it with mount -f. In order to free resources that may be unused, fsck(8) needs to be
run at a later time. This is the idea behind the background fsck: at system startup time, only a snapshot of the filesystem is recorded.
The fsck can be run later on. All file systems can then be
mounted “dirty”, so the system startup proceeds in multiuser mode. Then,
background fscks will be scheduled for all file systems where
this is required, to free resources that may be unused. (File systems that do not use
Soft Updates still need the usual foreground fsck though.)
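On FreeBSD the background check is controlled through rc.conf(5), for example:

    background_fsck="YES"
    background_fsck_delay="60"

A dirty Soft Updates filesystem can also be handled by hand in the same spirit (device and mount point are placeholders): forcibly mount it, then run the check against the live filesystem later:

    # mount -f /dev/ada0s1f /usr
    # fsck -B /usr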
The advantage is that meta-data operations are nearly as fast as asynchronous updates
(i.e. faster than with logging,
which has to write the meta-data twice). The disadvantages are the complexity of the code
(implying a higher risk for bugs in an area that is highly sensitive regarding loss of
user data), and a higher memory consumption. Additionally there are some idiosyncrasies
one has to get used to. After a crash, the state of the filesystem appears to be somewhat
“older”. In situations where the standard synchronous approach would have
caused some zero-length files to remain after the fsck, these
files do not exist at all with a Soft Updates filesystem because neither the meta-data
nor the file contents have ever been written to disk. Disk space is not released until
the updates have been written to disk, which may take place some time after running rm. This may cause problems when installing large amounts of data on
a filesystem that does not have enough free space to hold all the files twice.
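The delayed release is easy to observe with df(1); the file name below is only an example:

    # df -k /usr
    # rm /usr/obj/old-build.tar
    # df -k /usr
    # sleep 60; df -k /usr

The second df often still reports the old amount of free space; once the pending meta-data updates have been flushed, the space reappears in the last one.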