As a quick thought experiment, the next time you are in your
data center, look around, and imagine for a moment that it is gone.
And not just the computers. Imagine that the entire building no
longer exists. Next, imagine that your job is to get as much of the
work that was being done in the data center going in some fashion,
somewhere, as soon as possible. What would you do?
By thinking about this, you have taken the first step of
disaster recovery. Disaster recovery is the ability to recover from
an event impacting the functioning of your organization's data
center as quickly and completely as possible. The type of disaster
may vary, but the end goal is always the same.
The steps involved in disaster recovery are numerous and
wide-ranging. Here is a high-level overview of the process, along
with key points to keep in mind.
A backup site is vital, but it is still useless without a
disaster recovery plan. A disaster recovery plan dictates every
facet of the disaster recovery process, including but not limited
to:
-
What events denote possible disasters
-
What people in the organization have the authority to declare a
disaster and thereby put the plan into effect
-
The sequence of events necessary to prepare the backup site once
a disaster has been declared
-
The roles and responsibilities of all key personnel with respect
to carrying out the plan
-
An inventory of the necessary hardware and software required to
restore production
-
A schedule listing the personnel to staff the backup site,
including a rotation schedule to support ongoing operations without
burning out the disaster team members
-
The sequence of events necessary to move operations from the
backup site to the restored/new data center
Disaster recovery plans often fill multiple loose-leaf binders.
This level of detail is vital because in the event of an emergency,
the plan may well be the only thing left from your previous data
center (other than the last off-site backups, of course) to help
you rebuild and restore operations.
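One way to keep a plan this detailed honest is to track its required sections as data and check for gaps. The following is a minimal sketch, not a prescribed format: the section identifiers paraphrase the list above, and the `RecoveryPlan` class and its sample contents are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field

# Section identifiers paraphrase the plan contents listed above.
# The names themselves are hypothetical, chosen for this sketch.
REQUIRED_SECTIONS = (
    "disaster_events",
    "declaration_authority",
    "backup_site_preparation",
    "roles_and_responsibilities",
    "hardware_software_inventory",
    "staffing_schedule",
    "return_to_data_center",
)


@dataclass
class RecoveryPlan:
    # Maps a section identifier to its text (or a pointer to the binder page).
    sections: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return required sections that are absent or empty, in list order."""
        return [
            name
            for name in REQUIRED_SECTIONS
            if not self.sections.get(name, "").strip()
        ]


# Sample (invented) plan with only two sections filled in.
plan = RecoveryPlan(
    sections={
        "disaster_events": "Fire, flood, extended power loss",
        "declaration_authority": "IT director or designated deputy",
    }
)

for name in plan.missing_sections():
    print("Missing section:", name)
```

A check like this only confirms that each section exists; it says nothing about whether the content is accurate, which is what periodic testing is for.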
|
Tip |
|
While disaster recovery plans should be readily available at
your workplace, copies should also be stored off-site. This way, a
disaster that destroys your workplace will not take every copy of
the disaster recovery plan with it. A good place to store a copy is
your off-site backup storage location. If it does not violate your
organization's security policies, copies may also be kept in key
team members' homes, ready for instant use.
|
Such an important document deserves serious thought (and
possibly professional assistance to create).
Once the plan is created, it must be tested periodically. Testing a
disaster recovery plan entails going through the actual steps of the
plan: going to the backup site and setting up the temporary data
center, running applications remotely, and resuming normal operations
after the "disaster" is over. Most tests do not attempt to perform
100% of the tasks in the plan; instead, a representative system and
application are selected to be relocated to the backup site, put
into production for a period of time, and returned to normal
operation at the end of the test.
|
Note |
|
Although it is an overused phrase, a disaster recovery plan must
be a living document; as the data center changes, the plan must be
updated to reflect those changes. In many ways, an out-of-date
disaster recovery plan can be worse than no plan at all, so make it
a point to have regular (quarterly, for example) reviews and
updates of the plan.
|
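A regular review schedule is easy to check mechanically. The sketch below assumes a 90-day interval as a stand-in for "quarterly"; the function name `review_overdue` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def review_overdue(last_review: date, today: date, interval_days: int = 90) -> bool:
    """Return True if more than interval_days have passed since the last review.

    The 90-day default approximates a quarterly review; adjust it to match
    whatever schedule your organization actually adopts.
    """
    return today - last_review > timedelta(days=interval_days)


# Sample (invented) dates.
last = date(2024, 1, 15)
print(review_overdue(last, date(2024, 3, 1)))
print(review_overdue(last, date(2024, 6, 1)))
```

A check of this kind could run from a scheduled job and notify whoever owns the plan when a review is due.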
One of the most important aspects of disaster recovery is to
have a location from which the recovery can take place. This
location is known as a backup site. In the
event of a disaster, a backup site is where your data center will
be recreated, and where you will operate from, for the length of
the disaster.
There are three different types of backup sites:
-
Cold backup sites
-
Warm backup sites
-
Hot backup sites
Obviously, these terms do not refer to the temperature of the
backup site. Instead, they refer to the effort required to begin
operations at the backup site in the event of a disaster.
A cold backup site is little more than an appropriately
configured space in a building. Everything required to restore
service to your users must be procured and delivered to the site
before the process of recovery can begin. As you can imagine, the
delay going from a cold backup site to full operation can be
substantial.
Cold backup sites are the least expensive sites.
A warm backup site is already stocked with hardware representing
a reasonable facsimile of that found in your data center. To
restore service, the last backups from your off-site storage
facility must be delivered, and bare metal restoration completed,
before the real work of recovery can begin.
Hot backup sites have a virtual mirror image of your current
data center, with all systems configured and waiting only for the
last backups of your user data from your off-site storage facility.
As you can imagine, a hot backup site can often be brought up to
full production in no more than a few hours.
A hot backup site is the most expensive approach to disaster
recovery.
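The trade-off among the three site types can be sketched as a simple selection: pick the least expensive type that can be in production within the outage your organization can tolerate. The cost ranking follows the text (cold is cheapest, hot is most expensive, with warm assumed to fall between them). The hour figures are illustrative assumptions only; the text gives no numbers beyond "a few hours" for a hot site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SiteType:
    name: str
    relative_cost: int        # 1 = least expensive, 3 = most expensive
    hours_to_production: int  # ASSUMED figure for illustration, not from the text


# Hypothetical numbers; replace them with estimates from your own plan tests.
SITE_TYPES = (
    SiteType("cold", relative_cost=1, hours_to_production=336),
    SiteType("warm", relative_cost=2, hours_to_production=72),
    SiteType("hot", relative_cost=3, hours_to_production=4),
)


def cheapest_site(max_outage_hours: int) -> Optional[SiteType]:
    """Return the least expensive site type that meets the outage limit.

    Returns None when no site type is fast enough.
    """
    candidates = [s for s in SITE_TYPES if s.hours_to_production <= max_outage_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


choice = cheapest_site(max_outage_hours=96)
print(choice.name if choice else "no site type is fast enough")
```

With the assumed figures, a tolerance of 96 hours rules out a cold site but does not justify the cost of a hot one.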
Backup sites can come from three different sources:
-
Companies specializing in providing disaster recovery
services
-
Other locations owned and operated by your organization
-
A mutual agreement with another organization to share data
center facilities in the event of a disaster
Each approach has its good and bad points. For example,
contracting with a disaster recovery firm often gives you access to
professionals skilled in guiding organizations through the process
of creating, testing, and implementing a disaster recovery plan. As
you might imagine, these services do not come without cost.
Using space in another facility owned and operated by your
organization can be essentially a zero-cost option, but stocking
the backup site and maintaining its readiness is still an expensive
proposition.
Crafting an agreement to share data centers with another
organization can be extremely inexpensive, but long-term operations
under such conditions are usually not possible, as the host's data
center must still maintain its normal production, making the
situation strained at best.
In the end, the selection of a backup site is a compromise
between cost and your organization's need for the continuation of
production.
Your disaster recovery plan must include methods of procuring
the necessary hardware and software for operations at the backup
site. A professionally managed backup site may already have
everything you need (or you may need to arrange the procurement and
delivery of specialized materials the site does not have
available); on the other hand, a cold backup site means that a
reliable source for every single item must be identified.
Organizations often work with manufacturers to craft agreements for
the speedy delivery of hardware and/or software in the event of a
disaster.
When a disaster is declared, it is necessary to notify your
off-site storage facility for two reasons:
|
Tip |
|
In the event of a disaster, the last backups you have from your
old data center are vitally important. Consider having copies made
before anything else is done, with the originals going back
off-site as soon as possible.
|
A data center is not of much use if it is totally disconnected
from the rest of the organization that it serves. Depending on the
disaster recovery plan and the nature of the disaster itself, your
user community might be located miles away from the backup site. In
these cases, good connectivity is vital to restoring
production.
Another kind of connectivity to keep in mind is telephone
connectivity. You must ensure that there are sufficient
telephone lines available to handle all verbal communication with
your users. What might have been a simple shout over a cubicle wall
may now entail a long-distance telephone conversation, so plan on
more telephone connectivity than might at first appear
necessary.
The problem of staffing a backup site is multi-dimensional. One
aspect of the problem is determining the staffing required to run
the backup data center for as long as necessary. While a skeleton
crew may be able to keep things going for a short period of time,
as the disaster drags on, more people will be required to maintain
the effort needed to run under the extraordinary circumstances
surrounding a disaster.
Staffing also means ensuring that personnel have sufficient time off
to unwind and possibly travel back to their homes. If the disaster
was wide-ranging enough to affect people's homes and families,
additional time must be allotted to allow them to manage their own
disaster recovery. Temporary lodging near the backup site is
necessary, along with the transportation required to get people to
and from the backup site and their lodgings.
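A rotation that guarantees time off can be drafted with a simple round-robin assignment. This is a minimal sketch under stated assumptions: one team covers the backup site per day, and the team names are hypothetical. A real schedule would also need to account for shift lengths, travel time, and individual circumstances.

```python
from itertools import cycle


def rotation(teams: list[str], days: int) -> list[tuple[int, str]]:
    """Assign one team per day in round-robin order.

    With N teams, each team works one day and then has N - 1 days off.
    """
    order = cycle(teams)
    return [(day, next(order)) for day in range(1, days + 1)]


# Hypothetical team names for illustration.
for day, team in rotation(["Team A", "Team B", "Team C"], days=5):
    print(f"Day {day}: {team}")
```

Adding teams to the list lengthens the rest period for everyone, which is one way to express the point that a longer disaster calls for more people.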
Often a disaster recovery plan includes on-site representative
staff from all parts of the organization's user community. This
depends on the ability of your organization to operate with a
remote data center. If user representatives must work at the backup
site, similar accommodations must be made available for them, as
well.
Eventually, all disasters end. The disaster recovery plan must
address this phase as well. The new data center must be outfitted
with all the necessary hardware and software; while this phase
often does not have the time-critical nature of the preparations
made when the disaster was initially declared, backup sites cost
money every day they are in use, so economic concerns dictate that
the switchover take place as quickly as possible.
The last backups from the backup site must be made and delivered
to the new data center. After they are restored onto the new
hardware, production can be switched over to the new data
center.
At this point the backup data center can be decommissioned, with
the disposition of all temporary hardware dictated by the final
section of the plan. Finally, a review of the plan's effectiveness
is held, with any changes recommended by the reviewing committee
integrated into an updated version of the plan.