ZFS Failure Modes
As a combined file system and volume manager, ZFS can exhibit many
different failure modes. This chapter begins by outlining the various failure modes, then discusses
how to identify them on a running system, and concludes by discussing
how to repair the problems. ZFS can encounter three basic types of errors:
Missing devices
Damaged devices
Corrupted data
Note that a single pool can experience all three types of errors, so a
complete repair procedure involves finding and correcting one error, proceeding to the next error,
and so on.
Missing Devices in a ZFS Storage Pool
If a device is completely removed from the system, ZFS detects that the
device cannot be opened and places it in the FAULTED state. Depending on
the data replication level of the pool, this might or might not result
in the entire pool becoming unavailable. If one disk in a mirrored or
RAID-Z device is removed, the pool continues to be accessible. If all
components of a mirror are removed, if more than one device in a
single-parity RAID-Z device is removed, or if a single-disk, top-level device
is removed, the pool becomes FAULTED. No data is accessible until the device
is reattached.
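The availability rules above can be sketched as a small Python model. This is an illustration only, not ZFS code: the `Vdev` class, its fields, and both function names are hypothetical, and the model assumes that a RAID-Z device tolerates as many missing devices as its parity level (one, matching the text above, for single-parity RAID-Z).

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    """Hypothetical model of a top-level virtual device (not a ZFS API)."""
    kind: str          # "disk", "mirror", or "raidz"
    devices: int       # number of component devices
    missing: int = 0   # how many components cannot be opened
    parity: int = 1    # assumed RAID-Z parity level; ignored for other kinds


def vdev_is_accessible(vdev: Vdev) -> bool:
    """Return True if the vdev can still serve data despite missing devices."""
    if vdev.kind == "disk":
        # A single-disk top-level device has no redundancy at all.
        return vdev.missing == 0
    if vdev.kind == "mirror":
        # A mirror survives as long as at least one component remains.
        return vdev.missing < vdev.devices
    if vdev.kind == "raidz":
        # Assumption: RAID-Z tolerates up to `parity` missing devices.
        return vdev.missing <= vdev.parity
    raise ValueError(f"unknown vdev kind: {vdev.kind}")


def pool_is_accessible(vdevs: list[Vdev]) -> bool:
    """A pool is accessible only if every top-level vdev is accessible."""
    return all(vdev_is_accessible(v) for v in vdevs)
```

For instance, `pool_is_accessible([Vdev("mirror", 2, missing=1)])` evaluates to `True`, while a pool that also contains `Vdev("disk", 1, missing=1)` evaluates to `False`, because losing any one top-level device makes the whole pool unavailable.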
Damaged Devices in a ZFS Storage Pool
The term “damaged” covers a wide variety of possible errors. Examples include the
following errors:
Transient I/O errors due to a bad disk or controller
On-disk data corruption due to cosmic rays
Driver bugs resulting in data being transferred to or from the wrong location
Another user accidentally overwriting portions of the physical device
In some cases, these errors are transient, such as a random I/O
error while the controller is having problems. In other cases, the damage is
permanent, such as on-disk corruption. Even so, permanent damage does not
necessarily indicate that the error is likely to occur again. For example, if
an administrator accidentally overwrites part of a disk, no type of hardware failure
has occurred, and the device need not be replaced. Identifying exactly what went
wrong with a device is not an easy task and is covered
in more detail in a later section.
Corrupted ZFS Data
Data corruption occurs when one or more device errors (indicating one or more
missing or damaged devices) affect a top-level virtual device. For example, one half of a mirror
can experience thousands of device errors without ever causing data corruption. If an
error is encountered on the other side of the mirror in the exact
same location, the result is corrupted data.
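The mirror example above can be illustrated with a minimal Python sketch. It is a simplified model, not how ZFS tracks errors internally: each side of a two-way mirror is represented as a set of block numbers that returned device errors, and the function name `corrupted_blocks` is hypothetical.

```python
def corrupted_blocks(side_a_errors: set[int], side_b_errors: set[int]) -> set[int]:
    """Return the blocks that are unreadable on BOTH sides of a two-way mirror.

    A block with an error on only one side can still be read from the
    other side, so only the intersection counts as corrupted data.
    """
    return side_a_errors & side_b_errors


# One side has thousands of device errors; the other has a single error
# at a different location, so every block still has one good copy.
side_a = set(range(1000, 5000))
no_overlap = corrupted_blocks(side_a, {7})

# An error at the exact same location on both sides leaves no good copy.
overlap = corrupted_blocks(side_a, {7, 1234})
```

Here `no_overlap` is the empty set and `overlap` is `{1234}`: the number of device errors matters less than whether they coincide at the same location.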
Data corruption is always permanent and requires special consideration during repair. Even if
the underlying devices are repaired or replaced, the original data is lost forever.
Most often this scenario requires restoring data from backups. Data errors are recorded
as they are encountered, and can be controlled through routine disk scrubbing as
explained in the following section. When a corrupted block is removed, the next
scrubbing pass recognizes that the corruption is no longer present and removes any
trace of the error from the system.
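The interaction between scrubbing and the recorded data errors can be sketched as follows. This is a toy model under stated assumptions, not the ZFS implementation: a pool is represented as a dictionary that maps a hypothetical block identifier to whether its checksum verifies, and `scrub` is an illustrative function name.

```python
def scrub(blocks: dict[str, bool]) -> set[str]:
    """Return the set of data errors found by one scrubbing pass.

    `blocks` maps a block identifier to True if the block's checksum
    verifies and False if the block is corrupted. Blocks that no longer
    exist are simply not visited, so they cannot appear in the result.
    """
    return {block_id for block_id, ok in blocks.items() if not ok}


pool = {"file1:0": True, "file1:1": False, "file2:0": True}
errors_before = scrub(pool)   # the corrupted block is reported

# Removing the corrupted block (for example, by deleting the affected
# file and restoring it from backup) means the next pass finds nothing.
del pool["file1:1"]
errors_after = scrub(pool)
```

In this model `errors_before` contains only `"file1:1"` and `errors_after` is empty, mirroring the statement that the next scrubbing pass no longer reports the error once the corrupted block is gone.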