Overview of Replacing and Enabling Components in RAID-1 and RAID-5 Volumes
Solaris Volume Manager can replace and enable components within RAID-1 (mirror) and RAID-5
volumes.
In Solaris Volume Manager terminology, replacing a component is a way to substitute
an available component on the system for a selected component in a submirror
or RAID-5 volume. You can think of this process as logical replacement, as
opposed to physically replacing the component. For more information, see Replacing a Component With Another Available Component.
Enabling a component means to “activate” or substitute a component with itself (that
is, the component name is the same). For more information, see Enabling a Component.
Note - When recovering from disk errors, scan /var/adm/messages to see what kind of errors
occurred. If the errors are transitory and the disks themselves do not have
problems, try enabling the failed components. You can also use the format command
to test a disk.
Enabling a Component
You can enable a component when any of the following conditions exist:
- Solaris Volume Manager cannot access the physical drive. This problem might occur, for example, because of a power loss or a loose drive cable. In this case, Solaris Volume Manager puts the components in the “Maintenance” state. You need to make sure that the drive is accessible (restore power, reattach cables, and so on), and then enable the components in the volumes.
- You suspect that a physical drive is having transitory problems that are not disk-related. You might be able to fix a component in the “Maintenance” state by simply enabling it. If enabling the component does not fix the problem, then you need to do one of the following: physically replace the disk drive and then enable the component, or replace the component with another available component on the system.
When you physically replace a disk, partition the new disk the same way as the disk it replaces so that each used component has adequate space.
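The partitioning and enabling steps might look like the following dry-run sketch. It only prints the commands and does not run them. The volume name d0 and the disk names are hypothetical. Copying the label with prtvtoc and fmthard is one common approach, and it assumes that another disk with the desired layout (for example, the disk that holds the other submirror) is available as a reference.

```shell
#!/bin/sh
# Dry-run sketch: prints commands rather than running them.
# d0, c1t3d0, and c1t4d0 are hypothetical names.

# Print the command that copies the partition table (VTOC) from a
# reference disk with the desired layout to the replacement disk.
#   $1 = reference disk, $2 = replacement disk (slice 2 = whole disk)
copy_label_cmd() {
    echo "prtvtoc /dev/rdsk/${1}s2 | fmthard -s - /dev/rdsk/${2}s2"
}

# Print the command that enables a component in place.
#   $1 = volume, $2 = component slice
enable_cmd() {
    echo "metareplace -e $1 $2"
}

copy_label_cmd c1t3d0 c1t4d0
enable_cmd d0 c1t4d0s0
```

Printing the commands first makes it possible to review device names before anything is written to a disk label.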
Note - Always check for state database replicas and hot spares on the disk that
is being replaced. Any state database replica in an erred state should be
deleted before you replace the disk. Then, after you enable the component, recreate
the state database replicas using the same size. You should treat hot spares
in the same manner.
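As a rough illustration of the replica handling described in this note, the following dry-run sketch prints the metadb commands for deleting replicas on a slice and recreating them later. The slice name c1t4d0s7, the replica count, and the length of 8192 blocks are assumptions for the example. The actual count and size on a given system would need to be read from the metadb -i output before the replicas are deleted.

```shell
#!/bin/sh
# Dry-run sketch: prints metadb commands rather than running them.
# The slice name c1t4d0s7 and the numbers below are hypothetical.

# Print the command that deletes all replicas on a slice.
#   $1 = slice
delete_replicas_cmd() {
    echo "metadb -d $1"
}

# Print the command that recreates replicas on a slice.
#   $1 = slice, $2 = number of replicas, $3 = replica length in blocks
add_replicas_cmd() {
    echo "metadb -a -c $2 -l $3 $1"
}

delete_replicas_cmd c1t4d0s7       # before the disk is replaced
add_replicas_cmd c1t4d0s7 2 8192   # afterward, same count and size
```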
Replacing a Component With Another Available Component
You use the metareplace command when you replace or swap an existing component
with a different component that is available and not in use on the
system.
You can use this command when any of the following conditions exist:
- A disk drive has problems, and you do not have a replacement drive. However, you do have available components elsewhere on the system.
You might want to use this strategy when a replacement is absolutely necessary, but you do not want to shut down the system.
- You see soft errors on the physical disks.
Physical disks might report soft errors even though Solaris Volume Manager shows the mirror/submirror or RAID-5 volume in the “Okay” state. Replacing the component in question with another available component enables you to perform preventative maintenance and potentially prevent hard errors from occurring.
- You want to do performance tuning.
One way that you can evaluate components is by using the performance monitoring feature available from the Enhanced Storage tool within the Solaris Management Console. For example, you might see that a particular component in a RAID-5 volume is experiencing a high load average, even though it is in the “Okay” state. To balance the load on the volume, you can replace that component with a component from a disk that is less utilized. You can perform this type of replacement online without interrupting service to the volume.
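The difference between enabling a component and replacing it comes down to whether the new component name is the same as the old one. The following dry-run sketch prints the corresponding metareplace command for each case. The function name repair_cmd and the names d0, c1t4d0s0, and c2t4d0s0 are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the metareplace command rather than running it.
# d0, c1t4d0s0, and c2t4d0s0 are hypothetical names.

# Print the command for repairing one component of a volume.
#   $1 = volume, $2 = current component, $3 = new component (optional)
# With no new component, or the same name, the component is enabled in
# place; otherwise it is replaced by the other available component.
repair_cmd() {
    volume=$1
    old=$2
    new=${3:-$2}
    if [ "$new" = "$old" ]; then
        echo "metareplace -e $volume $old"
    else
        echo "metareplace $volume $old $new"
    fi
}

repair_cmd d0 c1t4d0s0 c2t4d0s0   # replace with another component
repair_cmd d0 c1t4d0s0            # enable in place
```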
Maintenance and Last Erred States
When a component in a RAID-1 or RAID-5 volume experiences errors, Solaris
Volume Manager puts the component in the “Maintenance” state. No further reads or writes
are performed to a component in the “Maintenance” state.
Sometimes a component goes into a “Last Erred” state. For a RAID-1 volume,
this usually occurs with a one-sided mirror: the volume experiences errors, but there
are no redundant components to read from. For a RAID-5 volume, this occurs
after one component goes into the “Maintenance” state and another component fails. The second
component to fail goes into the “Last Erred” state.
When either a RAID-1 volume or a RAID-5 volume has a component
in the “Last Erred” state, I/O is still attempted to the component marked
“Last Erred.” This I/O attempt occurs because a “Last Erred” component contains the last
good copy of data from Solaris Volume Manager's point of view. With a
component in the “Last Erred” state, the volume behaves like a normal device
(disk) and returns I/O errors to an application. Usually, at this point, some
data has been lost.
Subsequent errors on other components in the same volume are handled differently,
depending on the type of volume.
- RAID-1 Volume
A RAID-1 volume might be able to tolerate many components in the “Maintenance” state and still be read from and written to. If components are in the “Maintenance” state, no data has been lost. You can safely replace or enable the components in any order. If a component is in the “Last Erred” state, you cannot replace it until you first replace the components in the “Maintenance” state. Replacing or enabling a component in the “Last Erred” state usually means that some data has been lost. Be sure to validate the data on the mirror after you repair it.
- RAID-5 Volume
A RAID-5 volume can tolerate a single component in the “Maintenance” state. You can safely replace a single component in the “Maintenance” state without losing data. If an error on another component occurs, it is put into the “Last Erred” state. At this point, the RAID-5 volume is a read-only device. You need to perform some type of error recovery so that the state of the RAID-5 volume is stable and the possibility of data loss is reduced. If a RAID-5 volume reaches a “Last Erred” state, there is a good chance it has lost data. Be sure to validate the data on the RAID-5 volume after you repair it.
Always replace components in the “Maintenance” state first, followed by those in the
“Last Erred” state. After a component is replaced and resynchronized, use the metastat
command to verify its state. Then, validate the data.
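The ordering rule can be sketched as a small filter that lists components in the “Maintenance” state first, followed by those in the “Last Erred” state. The input format here is a hypothetical simplification (one component name and state per line, separated by a tab) and is not raw metastat output. The component names are also made up for the example.

```shell
#!/bin/sh
# Sketch: order components for repair, "Maintenance" before "Last Erred".
# Input is a hypothetical simplified format, one "component<TAB>state"
# pair per line; it is not raw metastat output.
repair_order() {
    states=$(cat)
    for wanted in "Maintenance" "Last Erred"; do
        printf '%s\n' "$states" |
            awk -F'\t' -v s="$wanted" '$2 == s { print $1 }'
    done
}

printf 'c1t1d0s0\tLast Erred\nc1t2d0s0\tOkay\nc1t3d0s0\tMaintenance\n' |
    repair_order
```

Components in the “Okay” state are left out of the list because they need no repair.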
Background Information for Replacing and Enabling Components in RAID-1 and RAID-5 Volumes
When you replace components in a RAID-1 volume or a RAID-5 volume, follow
these guidelines:
- Always replace components in the “Maintenance” state first, followed by those components in the “Last Erred” state.
- After a component is replaced and resynchronized, use the metastat command to verify the state of the volume. Then, validate the data. Replacing or enabling a component in the “Last Erred” state usually means that some data has been lost. Be sure to validate the data on the volume after you repair it. For a UFS file system, run the fsck command to validate the “metadata” (the structure of the file system). Then, check the actual user data. (In practice, users will have to examine their files.) A database or other application must have its own way of validating its internal data structure.
- Always check for state database replicas and hot spares when you replace components. Any state database replica in an erred state should be deleted before you replace the physical disk. The state database replica should be added back before you enable the component. The same procedure applies to hot spares.
- During component replacement for a RAID-5 volume, data is recovered in one of two ways: either from a hot spare currently in use or, when no hot spare is in use, from the RAID-5 parity.
- When you replace a component for a RAID-1 volume, Solaris Volume Manager automatically starts resynchronizing the new component with the rest of the volume. When the resynchronization completes, the replaced component becomes readable and writable. If the failed component has been replaced with data from a hot spare, the hot spare is placed in the “Available” state and made available for other hot spare replacements.
- The new component must be large enough to replace the old component.
- As a precaution, back up all data before you replace “Last Erred” devices.
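Two of these guidelines lend themselves to a short sketch: the size requirement for the new component, and the verification step after a repair. The example below compares block counts that the caller supplies and then prints, as a dry run, the commands for checking a repaired volume that holds a UFS file system. The function names, the volume name d0, and the block counts are hypothetical.

```shell
#!/bin/sh
# Sketch of two checks from the guidelines. Sizes are plain block
# counts supplied by the caller; d0 is a hypothetical volume name.

# Succeed if the new component is at least as large as the old one.
#   $1 = old size in blocks, $2 = new size in blocks
large_enough() {
    [ "$2" -ge "$1" ]
}

# Print (dry run) the commands that verify a repaired volume holding UFS.
#   $1 = volume
verify_plan() {
    echo "metastat $1"
    echo "fsck /dev/md/rdsk/$1"
}

if large_enough 4194304 4194304; then
    verify_plan d0
fi
```

The fsck step covers only the file system structure. User data and application data still need to be checked separately.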