- Identify the failed disk to be replaced by examining the /var/adm/messages file
and the metastat command output.
- Locate any state database replicas that might have been placed on the failed
disk.
Use the metadb command to find the replicas.
The metadb command might report errors for the state database replicas that are
located on the failed disk. In this example, c0t1d0 is the problem device.
# metadb
        flags           first blk       block count
     a m     u          16              1034            /dev/dsk/c0t0d0s4
     a       u          1050            1034            /dev/dsk/c0t0d0s4
     a       u          2084            1034            /dev/dsk/c0t0d0s4
     W   pc  luo        16              1034            /dev/dsk/c0t1d0s4
     W   pc  luo        1050            1034            /dev/dsk/c0t1d0s4
     W   pc  luo        2084            1034            /dev/dsk/c0t1d0s4
The output shows three state database replicas on slice 4 of each local disk,
c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 replicas indicates
that the device has write errors. The three replicas on the c0t0d0s4 slice are
still good.
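To pick out the failing slices from a longer metadb listing, the following sketch prints every slice whose flags include W. The function name slices_with_write_errors is hypothetical (it is not a Solaris Volume Manager command), and it assumes the whitespace-separated column layout shown above, with the device path as the last field. On a live system you might run it as: metadb | slices_with_write_errors

```shell
#!/bin/sh
# Sketch: list the replica slices whose metadb flags include W (write errors).
# Assumes the whitespace-separated layout shown above: flag tokens first, then
# the first blk and block count columns, then the device path as the last field.
slices_with_write_errors() {
    awk '
        # Only replica lines end in a device path; this skips the header.
        $NF ~ /^\/dev\// {
            # Flag tokens are every field before the last three columns.
            for (i = 1; i <= NF - 3; i++) {
                if ($i ~ /W/) { print $NF; break }
            }
        }
    ' | sort -u
}

# Example: the metadb output shown above, saved as text.
slices_with_write_errors <<'EOF'
        flags           first blk       block count
     a m     u          16              1034            /dev/dsk/c0t0d0s4
     a       u          1050            1034            /dev/dsk/c0t0d0s4
     a       u          2084            1034            /dev/dsk/c0t0d0s4
     W   pc  luo        16              1034            /dev/dsk/c0t1d0s4
     W   pc  luo        1050            1034            /dev/dsk/c0t1d0s4
     W   pc  luo        2084            1034            /dev/dsk/c0t1d0s4
EOF
```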
- Record the name of the slice where the state database replicas reside and the number
of replicas on it. Then, delete those state database replicas.
To find the number of replicas, count how many times the slice appears in the
metadb command output. In this example, the three state database replicas on
c0t1d0s4 are deleted.
# metadb -d c0t1d0s4
Caution - If, after deleting the bad state database replicas, you are left with three
or fewer replicas, add more state database replicas before continuing. Doing so helps
to ensure that configuration information remains intact.
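The counting described in this step, together with the caution, can be sketched in shell as follows. The function names replica_count and replicas_remaining are hypothetical helpers; they assume saved metadb output on standard input and slice names given without the /dev/dsk/ prefix, as in the metadb -d example. With the example output, three replicas remain after the deletion, so the caution applies.

```shell
#!/bin/sh
# Sketch: count the replicas on a slice and how many would remain after
# deleting them. Input is saved metadb output on stdin; slice names are given
# without the /dev/dsk/ prefix, as in "metadb -d c0t1d0s4".
replica_count() {
    awk -v dev="/dev/dsk/$1" '$NF == dev { n++ } END { print n + 0 }'
}

replicas_remaining() {
    awk -v dev="/dev/dsk/$1" '$NF ~ /^\/dev\// && $NF != dev { n++ } END { print n + 0 }'
}

# Example data: the metadb output shown earlier in this procedure.
metadb_output='     a m     u          16              1034            /dev/dsk/c0t0d0s4
     a       u          1050            1034            /dev/dsk/c0t0d0s4
     a       u          2084            1034            /dev/dsk/c0t0d0s4
     W   pc  luo        16              1034            /dev/dsk/c0t1d0s4
     W   pc  luo        1050            1034            /dev/dsk/c0t1d0s4
     W   pc  luo        2084            1034            /dev/dsk/c0t1d0s4'

slice=c0t1d0s4
count=$(printf '%s\n' "$metadb_output" | replica_count "$slice")
left=$(printf '%s\n' "$metadb_output" | replicas_remaining "$slice")
echo "$slice holds $count replicas; $left would remain"
# Apply the caution: three or fewer remaining replicas means add more.
if [ "$left" -le 3 ]; then
    echo "Only $left replicas would remain: add more before continuing"
fi
```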
- Locate and delete any hot spares on the failed disk.
Use the metastat command to find hot spares. In this example, hot spare
pool hsp000 included c0t1d0s6, which is deleted from the pool.
# metahs -d hsp000 c0t1d0s6
hsp000: Hotspare is deleted
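Because the deleted hot spares must be added back later in this procedure, it can help to record each pool and slice as you delete it. The sketch below assumes a hypothetical record file (HS_RECORD, defaulting to /tmp/deleted-hot-spares) and a hypothetical DRY_RUN variable that prints the commands instead of running them; only metahs -d and metahs -a are real commands, and both function names are illustrative.

```shell
#!/bin/sh
# Sketch: delete a hot spare from a pool and record the "pool slice" pair so
# the same hot spares can be added back after the disk is replaced.
# DRY_RUN, HS_RECORD, and both function names are hypothetical.
HS_RECORD=${HS_RECORD:-/tmp/deleted-hot-spares}

delete_hot_spare() {
    if [ -n "$DRY_RUN" ]; then
        echo "metahs -d $1 $2"
    else
        metahs -d "$1" "$2" || return 1
    fi
    # Record the pair only after the deletion succeeded (or was previewed).
    printf '%s %s\n' "$1" "$2" >> "$HS_RECORD"
}

# Later, add every recorded hot spare back to its original pool.
restore_hot_spares() {
    while read -r pool slice; do
        [ -n "$pool" ] || continue
        if [ -n "$DRY_RUN" ]; then
            echo "metahs -a $pool $slice"
        else
            metahs -a "$pool" "$slice" </dev/null || return 1
        fi
    done < "$HS_RECORD"
}
```

For the example in this step, delete_hot_spare hsp000 c0t1d0s6 removes the hot spare and records it; restore_hot_spares would then be run when you reach the step that replaces hot spares.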
- Replace the failed disk.
This step might entail using the cfgadm command, the luxadm command, or other commands,
as appropriate for your hardware and environment. When performing this step, follow
your hardware's documented procedures so that the Solaris state of the disk is
changed correctly.
- Repartition the new disk.
Use the format command or the fmthard command to partition the disk with
the same slice information as the failed disk. If you have the prtvtoc
output from the failed disk, you can label the replacement disk with the
fmthard -s /tmp/failed-disk-prtvtoc-output command, giving the raw device of
the replacement disk as the final argument.
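A minimal sketch of the fmthard approach follows. It assumes the VTOC of the failed disk was saved with prtvtoc while the disk was still readable, and that slice 2 spans the whole disk, which is the usual convention for VTOC-labeled disks. The function name relabel_from_vtoc and the DRY_RUN variable are hypothetical; setting DRY_RUN prints the fmthard command for review instead of running it.

```shell
#!/bin/sh
# Sketch: write a saved VTOC to the replacement disk with fmthard -s.
# Assumes the prtvtoc output was captured while the failed disk was readable:
#   prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/failed-disk-prtvtoc-output
# and that slice 2 is the whole-disk slice, as is conventional with VTOC labels.
# DRY_RUN and relabel_from_vtoc are hypothetical names.
relabel_from_vtoc() {
    vtoc_file=$1
    disk=$2  # for example c0t1d0
    if [ ! -s "$vtoc_file" ]; then
        echo "VTOC file is missing or empty: $vtoc_file" >&2
        return 1
    fi
    if [ -n "$DRY_RUN" ]; then
        echo "fmthard -s $vtoc_file /dev/rdsk/${disk}s2"
    else
        fmthard -s "$vtoc_file" "/dev/rdsk/${disk}s2"
    fi
}
```

For the example disk, relabel_from_vtoc /tmp/failed-disk-prtvtoc-output c0t1d0 would write the saved slice table to c0t1d0. Checking for an empty file first guards against relabeling from a prtvtoc run that failed.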
- If you deleted state database replicas, add the same number back to the
appropriate slice.
In this example, /dev/dsk/c0t1d0s4 is used.
# metadb -a -c 3 c0t1d0s4
- If any slices on the disk are components of RAID-5 volumes or are
components of RAID-0 volumes that are in turn submirrors of RAID-1 volumes, run
the metareplace -e command for each slice.
In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.
# metareplace -e d10 c0t1d0s4
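When several slices or volumes are affected, the metareplace -e calls can be driven from a list. This sketch reads hypothetical "volume slice" pairs from standard input, one pair per line; enable_replaced_slices and DRY_RUN are illustrative names, and setting DRY_RUN prints the commands for review before you run them for real.

```shell
#!/bin/sh
# Sketch: run metareplace -e once per affected "volume slice" pair, read from
# stdin one pair per line. DRY_RUN and enable_replaced_slices are hypothetical.
enable_replaced_slices() {
    while read -r volume slice; do
        [ -n "$volume" ] || continue  # skip blank lines
        if [ -n "$DRY_RUN" ]; then
            echo "metareplace -e $volume $slice"
        else
            # </dev/null keeps metareplace from consuming the pair list.
            metareplace -e "$volume" "$slice" </dev/null || return 1
        fi
    done
}

# Example: preview the command for the pair used above.
printf '%s\n' 'd10 c0t1d0s4' | DRY_RUN=1 enable_replaced_slices
```

Stopping at the first failure leaves the remaining pairs untouched so that you can check the volume state with metastat before continuing.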
- If any soft partitions are built directly on slices on the replaced disk,
run the metarecover -m -p command on each slice that contains soft partitions. This command
regenerates the extent headers on disk.
In this example, the soft partition markings on /dev/dsk/c0t1d0s4 must be
regenerated. The slice is scanned, and the markings are reapplied based on the
information in the state database replicas.
# metarecover c0t1d0s4 -m -p
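If several slices on the replaced disk contain soft partitions, a minimal loop might look like the following. The function name recover_soft_partitions and the DRY_RUN variable are hypothetical, and the argument order follows the metarecover example above.

```shell
#!/bin/sh
# Sketch: regenerate soft partition extent headers on each listed slice with
# metarecover -m -p. DRY_RUN and recover_soft_partitions are hypothetical.
recover_soft_partitions() {
    for component in "$@"; do
        if [ -n "$DRY_RUN" ]; then
            echo "metarecover $component -m -p"
        else
            # Stop at the first failure so the remaining slices can be checked.
            metarecover "$component" -m -p || return 1
        fi
    done
}

# Example: preview the command for the slice used above.
DRY_RUN=1 recover_soft_partitions c0t1d0s4
```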
- If any soft partitions on the disk are components of RAID-5 volumes or
are components of RAID-0 volumes that are submirrors of RAID-1 volumes, run the
metareplace -e command for each slice.
In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.
# metareplace -e d10 c0t1d0s4
- If any RAID-0 volumes have soft partitions built on them, run the metarecover
command for each RAID-0 volume.
In this example, RAID-0 volume d17 has soft partitions built on it.
# metarecover d17 -m -p
- Replace hot spares that were deleted, and add them to the appropriate hot
spare pool or pools.
In this example, hot spare pool hsp000 included c0t1d0s6. This slice is added back to
the hot spare pool.
# metahs -a hsp000 c0t1d0s6
hsp000: Hotspare is added
- If soft partitions or nonredundant volumes were affected by the failure, restore data
from backups. If only redundant volumes were affected, then validate your data.
Check the user and application data on all volumes. You might have to
run an application-level consistency checker, or use some other method to check the
data.