Identifying Problems in ZFS
The following sections describe how to identify problems in your ZFS file systems
or storage pools.
You can use the following features to identify problems with your ZFS configuration:
- Detailed ZFS storage pool information, displayed by the zpool status command
- Pool and device failures, reported through ZFS/FMA diagnostic messages
- Previous ZFS commands that modified pool state information, displayed by the zpool history command
Most ZFS troubleshooting is centered around the zpool status command. This command analyzes the
various failures in the system and identifies the most severe problem, presenting you
with a suggested action and a link to a knowledge article for more
information. Note that the command identifies only a single problem with the pool,
though multiple problems can exist. For example, data corruption errors generally imply that
one of the devices has failed, but replacing the failed device does not by itself
repair the corrupted data.
In addition, a ZFS diagnostic engine is provided to diagnose and report pool
failures and device failures. Checksum, I/O, device, and pool errors associated with pool
or device failures are also reported. ZFS failures reported by the fault manager daemon (fmd) are
displayed on the console and recorded in the system messages file. In most
cases, the fmd message directs you to the zpool status command for further recovery instructions.
The basic recovery process is as follows:
1. If appropriate, use the zpool history command to identify the previous ZFS commands that led up to the error scenario. For example:
   # zpool history
   History for 'tank':
   2007-04-25.10:19:42 zpool create tank mirror c0t8d0 c0t9d0 c0t10d0
   2007-04-25.10:19:45 zfs create tank/erick
   2007-04-25.10:19:55 zfs set checksum=off tank/erick
   Notice in this output that checksums are disabled for the tank/erick file system. This configuration is not recommended.
2. Identify the errors through the fmd messages that are displayed on the system console or in the /var/adm/messages file.
3. Find further repair instructions in the output of the zpool status -x command.
4. Repair the failures. For example:
   - Replace the faulted or missing device and bring it online.
   - Restore the faulted configuration or corrupted data from a backup.
5. Verify the recovery by using the zpool status -x command.
6. Back up your restored configuration, if applicable.
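The history review in the process above can be partly automated. The following Python sketch scans captured zpool history output for zfs set commands that turn checksums off, as in the tank/erick example. The function name find_checksum_disabled is hypothetical, and the parsing assumes the one-line "timestamp command" layout shown in the sample output.

```python
import re

# Matches records such as:
#   2007-04-25.10:19:55 zfs set checksum=off tank/erick
# Assumption: each history record is "<timestamp> <command...>" on one line.
_CHECKSUM_OFF = re.compile(
    r"^(?P<when>\S+)\s+zfs\s+set\s+checksum=off\s+(?P<dataset>\S+)\s*$"
)


def find_checksum_disabled(history_text):
    """Return (timestamp, dataset) pairs for 'zfs set checksum=off' records."""
    hits = []
    for line in history_text.splitlines():
        match = _CHECKSUM_OFF.match(line.strip())
        if match:
            hits.append((match.group("when"), match.group("dataset")))
    return hits


# Sample text copied from the zpool history example in this section.
sample = """\
History for 'tank':
2007-04-25.10:19:42 zpool create tank mirror c0t8d0 c0t9d0 c0t10d0
2007-04-25.10:19:45 zfs create tank/erick
2007-04-25.10:19:55 zfs set checksum=off tank/erick
"""

for when, dataset in find_checksum_disabled(sample):
    print(f"{when}: checksums disabled on {dataset}")
```

The sketch only reports when checksums were turned off. It does not track a later zfs set checksum=on for the same file system, so treat each hit as a prompt to check the current property value.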
This chapter describes how to interpret zpool status output in order to diagnose the
type of failure and directs you to one of the following sections on
how to repair the problem. Although most of the work is performed automatically
by the command, it is important to understand exactly which problems are being
identified.
Determining if Problems Exist in a ZFS Storage Pool
The easiest way to determine if any known problems exist on the
system is to use the zpool status -x command. This command describes only the pools that are exhibiting
problems. If no unhealthy pools exist on the system, the command displays
a simple message, as follows:
# zpool status -x
all pools are healthy
Without the -x flag, the command displays the complete status for all pools
(or the requested pool, if specified on the command line), even if the
pools are otherwise healthy.
For more information about command-line options to the zpool status command, see Querying ZFS Storage Pool Status.
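A monitoring script can rely on this behavior. The following Python sketch treats the pools as healthy only when zpool status -x prints the message shown above. The function names is_healthy and check_pools are hypothetical, and check_pools assumes that the zpool command is on the PATH and that the caller is allowed to run it.

```python
import subprocess

# The message that 'zpool status -x' prints when no pool has problems,
# as shown in the example above.
HEALTHY_MESSAGE = "all pools are healthy"


def is_healthy(status_x_output):
    """Return True when the output is exactly the healthy message."""
    return status_x_output.strip() == HEALTHY_MESSAGE


def check_pools():
    """Run 'zpool status -x' and return (healthy, raw_output).

    Assumption: the zpool binary is on PATH. This function is defined
    but not called here, so the sketch runs on systems without ZFS.
    """
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return is_healthy(result.stdout), result.stdout


# Exercise the parser with captured text instead of a live system.
print(is_healthy("all pools are healthy\n"))
print(is_healthy("  pool: tank\n state: DEGRADED\n"))
```

Comparing against the exact message is deliberately strict: any other output, including an empty one, is reported as unhealthy so that it gets a closer look.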
Reviewing zpool status Output
The complete zpool status output looks similar to the following:
# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  OFFLINE      0     0     0

errors: No known data errors
This output is divided into several sections:
Overall Pool Status Information
This header section in the zpool status output contains the following fields, some of
which are only displayed for pools exhibiting problems:
- pool
The name of the pool.
- state
The current health of the pool. This information refers only to the ability of the pool to provide the necessary replication level. Pools that are ONLINE might still have failing devices or data corruption.
- status
A description of what is wrong with the pool. This field is omitted if no problems are found.
- action
A recommended action for repairing the errors. This field is an abbreviated form directing the user to one of the following sections. This field is omitted if no problems are found.
- see
A reference to a knowledge article containing detailed repair information. Online articles are updated more often than this guide can be updated, and should always be referenced for the most up-to-date repair procedures. This field is omitted if no problems are found.
- scrub
Identifies the current status of a scrub operation: the date and time that the last scrub completed, a scrub in progress, or that no scrubbing was requested.
- errors
Identifies known data errors or the absence of known data errors.
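The header fields listed above lend themselves to simple parsing. The following Python sketch collects them into a dictionary, joining wrapped lines into one value. The function name parse_status_header is hypothetical, and the sketch assumes that each field starts a line as "name: text" and that lines without such a prefix continue the previous field.

```python
import re

# Field names described in this section, plus "config", which is
# recognized only so that the device table can be skipped.
FIELDS = ("pool", "state", "status", "action", "see", "scrub", "config", "errors")
_FIELD_LINE = re.compile(r"^\s*(%s):\s*(.*)$" % "|".join(FIELDS))


def parse_status_header(text):
    """Map each header field to its (possibly multi-line) value."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = _FIELD_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and current != "config" and line.strip():
            # Continuation of a wrapped status or action message.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    fields.pop("config", None)  # the device table is not a header field
    return fields


# Sample text in the layout of the zpool status example above.
sample = """\
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0

errors: No known data errors
"""

header = parse_status_header(sample)
for name in ("pool", "state", "scrub", "errors"):
    print(f"{name}: {header[name]}")
```

Because status, action, and see are omitted for healthy pools, callers should use header.get("status") rather than indexing those keys directly.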
Configuration Information
The config field in the zpool status output describes the configuration layout of the
devices comprising the pool, as well as their state and any errors generated
from the devices. The state can be one of the following: ONLINE, FAULTED,
DEGRADED, UNAVAILABLE (shown as UNAVAIL in the output), or OFFLINE. If the state is anything but ONLINE, the fault
tolerance of the pool has been compromised.
The second section of the configuration output displays error statistics. These errors are
divided into three categories:
- READ – An I/O error occurred while issuing a read request.
- WRITE – An I/O error occurred while issuing a write request.
- CKSUM – A checksum error occurred. The device returned corrupted data as the result of a read request.
These errors can be used to determine if the damage is permanent.
A small number of I/O errors might indicate a temporary outage, while a
large number might indicate a permanent problem with the device. These errors do not
necessarily correspond to data corruption as interpreted by applications. If the device is
in a redundant configuration, the disk devices might show uncorrectable errors, while no
errors appear at the mirror or RAID-Z device level. In that case,
ZFS successfully retrieved the good data and attempted to heal
the damaged data from existing replicas.
For more information about interpreting these errors to determine device failure, see Determining the Type of Device Failure.
Finally, additional auxiliary information is displayed in the last column of the zpool status
output. This information expands on the state field, aiding in diagnosis of failure
modes. If a device is FAULTED, this field indicates whether the device is inaccessible
or whether the data on the device is corrupted. If the device is
undergoing resilvering, this field displays the current progress.
For more information about monitoring resilvering progress, see Viewing Resilvering Status.
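The configuration table described above can be read programmatically to find devices that need attention. The following Python sketch extracts the state, the three error counters, and any auxiliary text for each row. The function name parse_config is hypothetical, and the sketch assumes that the counters are plain integers and that the table ends at the first line that does not fit the row layout.

```python
def parse_config(text):
    """Return one dict per row of the config table in zpool status output.

    Assumptions: the table starts at the 'NAME STATE READ WRITE CKSUM'
    heading, every row is 'name state read write cksum [note...]', and
    a blank line or any other non-matching line ends the table.
    """
    devices = []
    in_table = False
    for line in text.splitlines():
        parts = line.split()
        if parts[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        if len(parts) < 5 or not all(p.isdigit() for p in parts[2:5]):
            break
        devices.append({
            "name": parts[0],
            "state": parts[1],
            "read": int(parts[2]),
            "write": int(parts[3]),
            "cksum": int(parts[4]),
            "note": " ".join(parts[5:]),  # auxiliary information, if any
        })
    return devices


# Sample rows in the layout used by the examples in this guide.
sample = """\
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     1
          mirror    DEGRADED     0     0     1
            c1t0d0  ONLINE       0     0     2
            c1t1d0  UNAVAIL      0     0     0  corrupted data

errors: No known data errors
"""

devices = parse_config(sample)
for dev in devices:
    if dev["state"] != "ONLINE" or dev["read"] or dev["write"] or dev["cksum"]:
        print(dev["name"], dev["state"], dev["read"], dev["write"], dev["cksum"], dev["note"])
```

The sketch ignores the indentation that shows which devices belong to which mirror or RAID-Z group, so it lists rows in order without rebuilding the hierarchy.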
Scrubbing Status
The third section of the zpool status output describes the current status of any
explicit scrubs. This information is distinct from whether any errors are detected
on the system, though this information can be used to determine the accuracy
of the data corruption error reporting. If the last scrub ended recently, any
existing data corruption has most likely already been discovered.
For more information about data scrubbing and how to interpret this information, see
Checking ZFS Data Integrity.
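To judge how recent the last scrub was, a script can extract the completion time from the scrub field. The following Python sketch handles the two forms that appear in this guide, "none requested" and "resilver completed with 1 errors on Fri Mar 17 15:42:18 2006", and passes anything else through unparsed. The function name parse_scrub_field is hypothetical, and the pattern for completed operations is an assumption based on that one sample line.

```python
import re
from datetime import datetime

# Assumption: a finished scrub or resilver is reported as
#   "<scrub|resilver> completed with N errors on <weekday month day time year>"
_COMPLETED = re.compile(
    r"^(?P<kind>scrub|resilver) completed with (?P<errors>\d+) errors on (?P<when>.+)$"
)


def parse_scrub_field(value):
    """Classify the text of the scrub field from zpool status output."""
    value = value.strip()
    if value == "none requested":
        return {"status": "none"}
    match = _COMPLETED.match(value)
    if match:
        # English day and month names are assumed (C or English locale).
        finished = datetime.strptime(match.group("when"), "%a %b %d %H:%M:%S %Y")
        return {
            "status": "completed",
            "kind": match.group("kind"),
            "errors": int(match.group("errors")),
            "finished": finished,
        }
    # In-progress or unrecognized text is returned unparsed.
    return {"status": "other", "raw": value}


result = parse_scrub_field(
    "resilver completed with 1 errors on Fri Mar 17 15:42:18 2006"
)
age = datetime.now() - result["finished"]
print(f"{result['kind']} finished {age.days} days ago with {result['errors']} errors")
```

A large age suggests that the list of known data errors might be incomplete, because corruption introduced since the last scrub is only found when the affected data is read.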
Data Corruption Errors
The zpool status command also shows whether any known errors are associated with the
pool. These errors might have been found during disk scrubbing or during normal
operation. ZFS maintains a persistent log of all data errors associated with the
pool. This log is rotated whenever a complete scrub of the system finishes.
Data corruption errors are always fatal. Their presence indicates that at least one
application experienced an I/O error due to corrupt data within the pool. Device
errors within a redundant pool do not result in data corruption and are
not recorded as part of this log. By default, only the number
of errors found is displayed. To display a complete list of errors and their specifics,
use the zpool status -v command. For example:
# zpool status -v
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 1 errors on Fri Mar 17 15:42:18 2006
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     1
          mirror    DEGRADED     0     0     1
            c1t0d0  ONLINE       0     0     2
            c1t1d0  UNAVAIL      0     0     0  corrupted data

errors: The following persistent errors have been detected:

          DATASET  OBJECT  RANGE
          5        0       lvl=4294967295 blkid=0
A similar message is also displayed by fmd on the system console and recorded in
the /var/adm/messages file. These messages can also be tracked by using the fmdump
command.
For more information about interpreting data corruption errors, see Identifying the Type of Data Corruption.
System Reporting of ZFS Error Messages
In addition to persistently keeping track of errors within the pool, ZFS also
displays syslog messages when events of interest occur. The following scenarios generate events
to notify the administrator:
- Device state transition – If a device becomes FAULTED, ZFS logs a message indicating that the fault tolerance of the pool might be compromised. A similar message is sent if the device is later brought online, restoring the pool to health.
- Data corruption – If any data corruption is detected, ZFS logs a message describing when and where the corruption was detected. This message is logged only the first time the corruption is detected. Subsequent accesses do not generate a message.
- Pool failures and device failures – If a pool failure or device failure occurs, the fault manager daemon reports these errors through syslog messages as well as the fmdump command.
If ZFS detects a device error and automatically recovers from it, no notification
occurs. Such errors do not constitute a failure in the pool redundancy or
data integrity. Moreover, such errors are typically the result of a driver problem
accompanied by its own set of error messages.