Contents
Overview
Why Plan for a Disaster?
Planning for a Disaster
Test your Disaster Recovery Procedure
Other Considerations
Minimizing the Chances for a Disaster
Overview
The purpose of this chapter is to help you understand what we feel is the most critical job of
a system administrator: disaster recovery.
We included this chapter at the beginning of our guidebook for two reasons:
What Is a Disaster?
The goal of disaster recovery is to restore the system so that the company can continue
doing business. A disaster is anything that results in the corruption or loss of the R/3
System.
Examples include:
- Database corruption.
  For example, when test data is accidentally loaded into the production system. This happens more often than people realize.
A disaster recovery plan takes time to prepare, test, and refine. The plan could fill many volumes. This chapter helps you start
thinking about and planning for disaster recovery.
Why Plan for a Disaster?
- A system administrator should expect and plan for the worst, and then hope for the best.
- During a disaster recovery, nothing should be done for the first time. Unpleasant surprises could be fatal to the recovery process.
- How much lost revenue and cost will be incurred for each hour that the system is down?
- How long can the system be down before the company goes out of business?
- How long will it take before the R/3 System is available for use?
If you plan properly, you will be under less stress, because you will know that the system can be
recovered and how long the recovery will take.
If the recovery downtime is unacceptable, management should invest in:
Planning for a Disaster
This chapter is not a disaster recovery how-to. It is designed only to get you thinking
about and working on disaster recovery.
Creating a Plan
Creating a disaster recovery plan is a major project because:
- It can take over a year and considerable effort to develop, test, and document.
If you do not know how to plan for a disaster recovery, get the assistance of an expert. A
bad plan (that will fail) is worse than no plan, because it provides a false sense of security.
What Are the Business Requirements for Disaster Recovery?
Who will provide the requirements?
- Senior management needs to provide global (or strategic) requirements and guidelines.
Example
What: The system cannot be offline for more than three hours.
Why: The cost (an average of $25,000 per hour) is the inability to book sales.
Example
What: In the event of disaster, such as the loss of the building containing the R/3
data center, the company can only tolerate a two-day downtime.
Why: At that point, permanent customer loss begins.
Other: There must be an alternate method of continuing business.
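The arithmetic behind requirements like these is simple enough to script. The sketch below uses the sample figures from the example above (the $25,000-per-hour rate and the three-hour limit are illustrative values, not real data) to estimate lost revenue for an outage and check it against the tolerated window:

```python
# Estimate the business cost of downtime using the chapter's sample figures.
# These numbers are illustrative examples, not real data.
COST_PER_HOUR = 25_000      # average lost revenue per hour of downtime
MAX_OFFLINE_HOURS = 3       # maximum tolerable outage per the requirement

def downtime_cost(hours_down: float) -> float:
    """Estimated lost revenue for an outage of the given duration."""
    return hours_down * COST_PER_HOUR

def within_requirement(hours_down: float) -> bool:
    """True if the outage stays inside the tolerated window."""
    return hours_down <= MAX_OFFLINE_HOURS

print(downtime_cost(3))          # cost at the three-hour limit: 75000
print(within_requirement(2.5))   # True
print(within_requirement(4))     # False: requirement violated
```

Even a toy model like this makes the cost of each additional hour concrete when presenting requirements to management.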
When Should a Disaster Recovery Procedure Begin?
Ask yourself the following questions:
The person must be aware of the effect of the disaster on the company's business and the
critical nature of the recovery.
Expected Downtime or Recovery Time
Expected Downtime
Expected downtime is only part of the business cost of disaster recovery. For each defined
scenario, it is the expected minimum time before R/3 can be productive again.
Downtime may mean that no orders can be processed and no products shipped.
Management must approve this cost, so it is important that they understand that downtime
is a potential business cost.
To help business continue, it is important to find out if there are alternate processes that can
be used while the R/3 System is being recovered.
- A downed system is more expensive during the business day, when business activity would stop, than at the end of the business day, when everyone has gone home.
- The duration of acceptable downtime depends on the company and the nature of its business.
Recovery Time
Unless you test your recovery procedure, the recovery time is only an estimate, or worse, a
guess. Different disaster scenarios have different recovery times, which are based on what
needs to be done to become operational again.
The time to recover must be matched to the business requirements. If this time is greater
than the business requirements, the mismatch needs to be communicated to the appropriate
managers or executives.
Resolving this mismatch involves:
- Reducing the recovery time to an acceptable level, for example by allocating additional resources.
- Changing the business requirements to accept the longer recovery time and accepting the consequences.
An extreme (but possible) example: A company cannot afford the cost and lost revenue for
the month it would take one person to recover the system. During that time, the competition
would take away customers, payment would be due to vendors, and bills would not be
collected. In this situation, senior management needs to allocate resources to reduce the
recovery time to an acceptable level.
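One way to make such a mismatch visible to managers is to tabulate, per scenario, the tested recovery time against the business limit. A minimal sketch (the scenario names and hours below are illustrative, not measured values):

```python
# Compare tested recovery times against business requirements per scenario.
# All figures are illustrative sample values.
scenarios = {
    "corrupt database": {"tested_recovery_hours": 6, "business_limit_hours": 3},
    "hardware failure": {"tested_recovery_hours": 20, "business_limit_hours": 24},
    "facility loss": {"tested_recovery_hours": 72, "business_limit_hours": 48},
}

def mismatches(scenarios: dict) -> list:
    """Scenarios whose tested recovery time exceeds the business limit."""
    return [name for name, s in scenarios.items()
            if s["tested_recovery_hours"] > s["business_limit_hours"]]

print(mismatches(scenarios))   # the scenarios needing escalation
```

Any scenario this flags is exactly the mismatch that must be communicated to the appropriate managers or executives.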
Recovery Group and Staffing Roles
There are four key roles in a recovery group. The number of employees performing these
roles will vary depending on your company size. In a smaller company, for example, the
recovery manager and the communication liaison could be the same person. Titles and tasks
will probably differ based on your company's needs.
We defined the following key roles:
- Recovery manager
  Manages the entire technical recovery. All recovery activities and issues should be coordinated through this person.
- Communication liaison
  Handles user phone calls and keeps top management updated with the recovery status. One person handling all phone calls allows the group doing the technical recovery to proceed without interruptions.
To reduce interruption of the recovery staff, we recommend you maintain a status board.
The status board should list key points in the recovery plan and an estimate of when the
system will be recovered and available to use.
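A status board can be as simple as a generated text summary. This sketch (the step names, states, and times are illustrative) renders the key plan steps and the estimated availability time:

```python
# A minimal recovery status board: key plan steps, their state, and an ETA.
# Step names, states, and times are illustrative placeholders.
from datetime import datetime, timedelta

status_board = [
    ("Restore database from last full backup", "done"),
    ("Apply transaction logs", "in progress"),
    ("Verify R/3 instance startup", "pending"),
]
# Recovery started at 09:00; current estimate is five more hours.
estimated_available = datetime(2024, 1, 1, 9, 0) + timedelta(hours=5)

def render(board, eta):
    """Format the board as plain text suitable for posting or a wall display."""
    lines = [f"[{state:^11}] {step}" for step, state in board]
    lines.append(f"Estimated system availability: {eta:%H:%M}")
    return "\n".join(lines)

print(render(status_board, estimated_available))
```

Updating one shared summary like this, rather than answering each caller individually, is what keeps the technical staff uninterrupted.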
- If the disaster is a major geographical event (like an earthquake), your local staff will be more concerned with their families, not the company.
You should expect and plan for these situations. Plan for staff from other geographic sites
to be flown in and participate as disaster recovery team members.
A final staffing consideration is to plan for at least one staff member to be unavailable. Without this
person, the rest of the department must still be able to perform a successful recovery. This issue
may become vital during an actual disaster.
Types of Disaster Recovery
Disaster recovery scenarios can be grouped into two types:
- Onsite
- Offsite
Onsite
Onsite recovery is disaster recovery done at your site. The infrastructure usually remains
intact. The best-case scenario is a recovery done on the original hardware; the worst-case
scenario is a recovery done on a backup system.
Offsite
Offsite recovery is disaster recovery done at a disaster recovery site. In this scenario, all
hardware and infrastructure are lost as a result of facility destruction such as a fire, a flood,
or an earthquake. The new servers must be configured from scratch.
A major consideration is that once the original facility has been rebuilt and tested, a second
restore must take place back to the customer's original facility. While this second restore can
be planned and scheduled at a convenient time to disrupt as few users as possible, its timing
is just as critical as during the disaster: while the system is being restored, it is down.
Disaster Scenarios
There are an infinite number of disaster scenarios that could occur. It would take an infinite
amount of time to plan for them, and you will never account for all of them. To make this
task manageable, you should plan for at least three and no more than five scenarios. In the
event of a disaster, you would adapt the closest scenario(s) to the actual disaster.
The disaster scenarios are made up of:
Three Common Disaster Scenarios
The following three examples are ordered from best case to worst case:
The downtimes in the examples below are only samples. Your downtimes will be different.
You must replace the sample downtimes with the downtimes applicable to your
environment.
A Corrupt Database
A Hardware Failure
A Complete Loss or Destruction of the Server Facility
A complete loss of the facility can result from the following types of disasters:
- Fire
- Earthquake
- Flood
- Hurricane
- Tornado
- Man-made disasters, such as the World Trade Center bombing
Such a disaster requires:
- Replacing the facilities
- Replacing the infrastructure
- Replacing lost hardware
- Rebuilding the server and R/3 environment (hardware, operating system, database, etc.)
- Recovering the R/3 database and related files
- Two days to rebuild the NT server (one person); 16 hours actual work time
  As the hardware is procured and the server is being rebuilt, an alternate facility is obtained and an emergency (minimal) network is constructed
- One day to integrate into the emergency network
Recovery Script
What
Why
- If the primary recovery person is unavailable, a recovery script helps the backup person complete the recovery.
Creating a Recovery Script
Creating a recovery script requires:
Recovery Process
To reduce recovery time, define a process by:
Major Steps
1. During a potential disaster, anticipate a recovery by:
- Collecting facts
- Recalling the crash kit (see the Crash Kit section below for more information).
- What are the criteria to declare a disaster, and have they been met?
Crash Kit
What
- Reinstall R/3
Why
During a disaster, everything that is needed to recover the R/3 environment is contained in
one (or a few) containers. If you have to evacuate the site, you will not have time to run
around gathering items at the last minute, hoping that you get everything you need.
In a major disaster, you may not even have that opportunity.
When
When a change is made to a component (hardware or software) on the server, replace the
outdated items in the crash kit with updated items that have been tested.
A periodic review of the crash kit should be performed to determine if items need to be
added or changed. A service contract is a perfect example of an item that requires this type
of review.
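This periodic review can be partly automated by keeping a small manifest of kit items with their last-verified and expiry dates. A sketch (the item names, dates, and 90-day threshold are illustrative assumptions):

```python
# Periodic crash-kit review: flag items that are stale or expired.
# Item names, dates, and the 90-day review window are illustrative.
from datetime import date

kit = [
    {"item": "OS installation kit",
     "last_verified": date(2023, 11, 1), "expires": None},
    {"item": "Hardware service contract",
     "last_verified": date(2023, 11, 1), "expires": date(2023, 12, 31)},
]

def needs_attention(kit, today, max_age_days=90):
    """Items unverified for too long, or past their expiry date."""
    flagged = []
    for entry in kit:
        stale = (today - entry["last_verified"]).days > max_age_days
        expired = entry["expires"] is not None and entry["expires"] < today
        if stale or expired:
            flagged.append(entry["item"])
    return flagged

print(needs_attention(kit, today=date(2024, 3, 1)))
```

A flagged service contract is exactly the kind of item the review above is meant to catch before, not during, a disaster.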
Where to Put the Crash Kit
The crash kit should be physically separated from the servers. If it is located in the server
room, and the server room is destroyed, this kit is lost.
Some crash kit storage areas include:
How
The following is an inventory list of some of the major items to put into the crash kit. You
will need to add or delete items for your specific environment. This inventory list is
organized into the following categories:
- Documentation
- Software
Documentation
An inventory of the crash kit should be taken by the person who seals the kit. If the seal is
broken, items may have been removed or changed, making the kit useless in a recovery.
The inventory list below must be signed and dated by the person checking the crash kit. The
following documentation must be included in the crash kit:
- Copies of:
  - SAP license for all instances
  - Service agreements (with phone numbers) for all servers
    Ensure that maintenance agreements are still valid and have not expired. This check should be part of a regularly scheduled task.
- A parts list
  If the server is destroyed, this list should be in sufficient detail to purchase or lease replacement hardware. Over time, if original parts are no longer available, an alternate parts list will have to be prepared. At this point, you might consider upgrading the equipment.
- Hardware layout
  You need to know which:
  - Cards go in which slots
  - Cables go where (connector-by-connector)
  Labeling cables and connectors greatly reduces confusion.
Software
- Operating system:
  - Installation kit
  - Drivers for hardware, such as a Network Interface Card (NIC) or a SCSI controller, which are not included in the installation kit
  - Service packs, updates, and patches
- Database:
  - Installation kit
  - Service packs, updates, and patches
  - Recovery scripts, to automate the database recovery
- For R/3:
  - Installation kit
  - Currently installed kernel
  - System profile files
  - tpparam file
  - saprouttab file
  - saplogon.ini
  - Other R/3 integrated programs (for example, a tax package)
- Other software for the R/3 installation:
  - Utilities
  - Backup
  - UPS control program
  - Hardware monitor
  - FTP client
  - Remote control program
  - System monitor
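The "recovery scripts" item in the database list above suggests automating the database recovery. The following is only a sketch of such a driver: the step functions are stand-ins, not real R/3 or database commands, but the pattern (run each step in order, log the result, stop at the first failure for manual review) carries over to a real script:

```python
# Sketch of an automated database-recovery driver. The steps are placeholder
# functions, not real R/3 or database commands; the structure is the point:
# run steps in order, log each result, and halt on the first failure.
def restore_full_backup():
    return True   # placeholder for the real restore command

def apply_transaction_logs():
    return True   # placeholder for transaction-log replay

def check_consistency():
    return True   # placeholder for a database consistency check

STEPS = [
    ("restore last full backup", restore_full_backup),
    ("apply transaction logs", apply_transaction_logs),
    ("run consistency check", check_consistency),
]

def run_recovery(steps):
    log = []
    for name, step in steps:
        ok = step()
        log.append((name, ok))
        if not ok:
            break     # stop at the first failure; manual intervention needed
    return log

print(run_recovery(STEPS))
```

The audit log the driver produces is what lets a backup person, working from the recovery script, see exactly how far the recovery progressed.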
Business Continuation During Recovery
Business continuation during a recovery is an alternate process to continue doing business
while recovering from a disaster. It includes:
- Cash collection
- Order processing
- Product shipping
- Bill paying
- Payroll processing
Why
How
- A manual, paper-based process
Offsite Disaster Recovery Sites
Integration with your Company's General Disaster Planning
Because there are many dependencies, the R/3 disaster recovery process must be integrated
with your company's general disaster planning. This planning includes telephone, network,
product deliveries, mail, etc.
When the R/3 System Returns
How will the transactions that were handled with the alternate process be entered into R/3
when it is operational?
Test your Disaster Recovery Procedure
Unless you test your recovery process, you do not know if you can actually recover
your system.
A test is a simulated disaster recovery that verifies you can recover the system and
exercises every task outlined in the disaster recovery plan.
- The information that is clear to the person documenting the procedure may be unclear to the person reading the procedure.
- Older hardware is no longer available.
  Here, alternate planning is needed. You may have to upgrade your hardware to be compatible with currently available equipment.
Since many factors affect recovery time, actual recovery times can only be determined by
testing. Once you have actual times (not guesses or estimates), your disaster planning
becomes more credible. If the procedure is practiced often, everyone will know what to do
when a disaster occurs, and the chaos of a disaster will be reduced.
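Once tests produce actual times, it helps to record them next to the original estimates, so the credibility of the plan can be demonstrated run by run. A sketch with sample values (the scenario names, hours, and estimates are illustrative):

```python
# Keep measured recovery times per scenario and report them alongside the
# earlier estimates. All figures below are illustrative sample values.
measured = {
    "corrupt database": [5.5, 4.8, 5.1],   # hours, one entry per test run
    "hardware failure": [19.0, 17.6],
}
estimates = {"corrupt database": 4.0, "hardware failure": 24.0}

def report(measured, estimates):
    """Average the measured runs and pair each with its original estimate."""
    out = {}
    for scenario, runs in measured.items():
        avg = sum(runs) / len(runs)
        out[scenario] = {"estimate_h": estimates[scenario],
                         "measured_avg_h": round(avg, 1)}
    return out

print(report(measured, estimates))
```

Where the measured average exceeds the estimate, the plan (or the estimate) needs revising before the next test cycle.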
How
- The disaster recovery test should be done at the same site at which you expect to recover. If you have multiple recovery sites, perform a test recovery at each site. The equipment, facilities, and configuration may be different at each site. Document all site-specific items that need to be completed for each site. You do not want to discover that you cannot recover at a site after a disaster occurs.
Who Should Participate
- Primary and backup personnel who will do the job during a real disaster recovery
  A provision should be made for some of the key personnel to be unavailable during the test. A test procedure might involve randomly picking a name and declaring that person unavailable to participate. This procedure duplicates a real situation in which a key person is seriously injured or killed.
Other Considerations
Other Upstream or Downstream Applications
For the company to function, other upstream or downstream applications also need to be
recovered along with R/3. Some of these applications may be tightly associated with R/3. These
applications should be accounted for and protected in the company-wide disaster recovery
planning.
Applications located on only one person's desktop computer must be backed up to a safe
location.
Backup Sites
Having a contract with a disaster recovery site does not guarantee that the site will be
available. In a regional disaster, such as an earthquake or flood, many other companies will
be competing for the same commercial disaster sites. In this situation, you may not have a
site to recover to, if others have booked it before you.
The emergency backup site may not have equipment of the same performance level as your
production system. Reduced performance and transaction throughput must be considered.
Examples:
- Only essential business tasks will be done while on the recovery system
Minimizing the Chances for a Disaster
There are many ways to minimize chances for a disaster. Some of these ideas seem obvious,
but it is these ideas that are often forgotten.
Minimize Human Error
Many disasters are caused by human error, such as a mistake or a tired operator. Do not
attempt dangerous tasks when you are tired. If you have to do a dangerous task, get a
second opinion before you start.
- Dangerous tasks should be scripted, with checkpoints included to verify the steps. Such tasks include:
  - Deleting the test database
    Check that the delete command specifies the test database, not the production database.
  - Moving a file
    Verify that the target file (to be overwritten) is the old file, not the new one.
  - Formatting a new drive
    Verify that the drive to be formatted is the new drive, not an existing drive with data on it.
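A checkpoint of the kind recommended above can be a few lines of code. This sketch (the system IDs PRD and TST are illustrative SAP-style names, not taken from any real configuration) refuses to proceed when the deletion target is a production database:

```python
# A scripted checkpoint for a dangerous task: refuse to delete unless the
# target is explicitly a non-production system. "PRD"/"TST" are illustrative
# SAP-style system IDs, not real configuration values.
PRODUCTION_DBS = {"PRD"}

def confirm_delete(db_name: str) -> bool:
    """True only if db_name is safe to delete (not a production system)."""
    if db_name.upper() in PRODUCTION_DBS:
        return False        # hard stop: never delete a production database
    return True

print(confirm_delete("TST"))   # True: test system, deletion allowed
print(confirm_delete("PRD"))   # False: production, deletion blocked
```

The value of the checkpoint is precisely that it works even when the operator is tired; the script, not the person, verifies the step.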
Minimize Single Points of Failure
A single point of failure is a component whose failure causes the entire system to fail.
To minimize single points of failure:
- Do not locate the backup R/3 server in the same data center as the production R/3 server. If the data center is destroyed, the backup server is also destroyed.
Cascade Failures
A cascade failure occurs when one failure triggers additional failures, which increases the
complexity of the problem. The recovery involves the coordinated fixing of many problems.
Example: A Cascade Failure
1. A power failure in the air conditioning system causes an environmental (air
conditioning) failure in the server room.
2. Without cooling, the temperature in the server room rises above the equipment's
acceptable operating temperature.
3. The overheating causes a hardware failure in the server.
4. The hardware failure causes a database corruption.
In addition, overheating can damage many things, such as:
- Network equipment
- Phone system
- Other servers
In this case, a system that monitors the air conditioning system or the temperature in the
server room could alert the appropriate employees before the temperature in the server
room becomes too hot.
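Such a monitor reduces to a threshold check. This sketch (the warning and critical thresholds are illustrative, not vendor limits) classifies readings so staff can be alerted well before the critical point:

```python
# Classify server-room temperature readings for alerting. The thresholds are
# illustrative assumptions, not actual equipment operating limits.
WARN_AT_C = 30      # alert staff here, well before hardware limits
CRITICAL_AT_C = 38  # assumed upper operating temperature

def check_temperature(celsius: float) -> str:
    """Return "ok", "warning", or "critical" for one reading."""
    if celsius >= CRITICAL_AT_C:
        return "critical"
    if celsius >= WARN_AT_C:
        return "warning"
    return "ok"

readings = [24.0, 29.5, 31.0, 39.2]
print([check_temperature(t) for t in readings])
```

The "warning" band is the whole point: it buys staff time to act before the overheating cascades into hardware failure and database corruption.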