
How to Conduct an End-to-End Disaster Recovery Exercise in Real Time


Written by Shankar Subramaniyan, CISSP, CISM, PMP, ABCP. Wednesday, 03 April 2013 02:56

Many times organizations conduct traditional disaster recovery exercises where testing is done in silos,
and the scope is limited and restricted only to host level recovery of individual systems. With growing
technology changes and globalization trends, the intricacy and interdependencies of applications have
become more complex in recent years, and major applications are spread across multiple locations and
multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery
is not going to adequately ensure the complete recovery of the application without any inconsistencies
among various interdependent subcomponents. In a widespread disaster scenario involving major
outages at the data center level, it is fairly certain that this kind of limited exercise is not going to be
sufficient to assure the realistic readiness status and overall recovery time objective (RTO) for multiple
applications. Therefore, organizations should increase the scope and complexity of disaster recovery
exercises over time and ensure that each exercise is process-oriented and focused on end-to-end
recovery. This article addresses some of the technical challenges faced in end-to-end disaster recovery
exercises which attempt a full life cycle of transactions across disaster recovery applications and their
dependencies and simulate business activities during the exercises.
Growing reliance on information technology, along with compliance and regulatory requirements, has led
many organizations to focus on business continuity and disaster recovery (DR) solutions. Availability has
become a major concern for business survival. Therefore, it is essential to take a detailed look at disaster recovery testing and the specific steps needed to ensure a disaster recovery plan performs as expected. An end-to-end disaster recovery exercise would provide a realistic readiness status
and bring out any complexities or intricacies involved in recovering multiple applications in the case of any
widespread disasters, including a data center level outage.
There are many challenges in an end-to-end disaster recovery exercise compared to a traditional disaster recovery exercise, since one needs to consider all the dependencies and take an end-to-end view to understand the full functionality of the applications.
This article illustrates some of the challenges faced in actual end-to-end disaster recovery exercises conducted for applications that interfaced with external third parties and relied heavily on middleware components and batch jobs.
Why a Disaster Recovery Exercise is Required
Disaster recovery plans represent a considerable amount of complex and interrelated technologies and tasks that are required to support an organization's disaster recovery capability. Constant changes in personnel, technologies, and application systems demand periodic plan validation to assure that the recovery plans are functional and remain so in the future. Without this validation, an organization would not be able to demonstrate that the documented set of recovery plans supports the current recovery operations needed to sustain critical business functions in a time of disaster.
The periodic disaster recovery exercise is required to validate the documented recovery procedures,
assumptions, and associated technology used in the restoration of the production environment.
Issues in Traditional Disaster Recovery Exercises
How many organizations attempt a full life cycle of transactions across disaster recovery applications and
their dependencies and simulate business activity as part of disaster recovery exercises? Many times
organizations conduct traditional disaster recovery exercises where testing is done in silos, and the scope
is limited and restricted only to host level recovery of individual systems. In most of these disaster
recovery exercises, the participating team is comprised of only the information technology team without
involving any business users. Generally the primary objectives in such exercises will be restricted to
recovery of standalone systems without involving any integration with upstream or downstream
dependencies.
Typical application validation carried out in this exercise includes login validation, form navigation, and
search validation without testing any connections to other dependent applications or any business activity.
Most of the time, traditional disaster recovery test activities are limited to travel and the restoration of hosts at the recovery site and nothing further. The major drawbacks of this type of testing are that one will not know, until an actual disaster, how the integration will work, what the main dependencies are, and what the impact of any network latency related issues may be.
With growing technology changes and globalization trends, the intricacy and interdependencies of
applications have become more complex in recent years, and major applications are spread across
multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server
(host) level recovery is not going to be adequate to fully recover the application without any
inconsistencies among various interdependent subcomponents.
This kind of limited exercise not involving end-to-end disaster recovery activities and without attempting to
simulate business activity is not going to be sufficient to reflect the preparedness to handle a real time
disaster and to assure the required overall Recovery Time Objective (RTO) for multiple applications.
Why We Need an End-to-End Disaster Recovery Exercise
A limited-scope disaster recovery exercise that does not involve end-to-end disaster recovery activities and does not attempt to simulate business activity is typically based on asset-level outage scenarios (for example, a specific server or application) rather than on widespread site-level (data center or city level) outages.
Therefore, in order to ensure effective disaster recovery preparedness, organizations should plan for an
end-to-end disaster recovery exercise including all interdependent applications in scope. This will bring out the practical issues involved in performing the business transactions in the disaster recovery
environment and verify the real effectiveness of disaster recovery procedures.
Challenges in an End-to-End Disaster Recovery Exercise
An end-to-end disaster recovery exercise focuses on complete recovery of applications and their
dependencies across various layers, including the presentation, business logic, integration, and data layers. It
takes into account the required data consistency among various interdependent subcomponents and
sees the recovery from the business process perspective.
Since an end-to-end disaster recovery exercise attempts a full life cycle of transactions across disaster
recovery applications and their dependencies, and simulates a business activity during the exercise, there
are many challenges in conducting an end-to-end recovery exercise. Typical challenges faced are:

isolating the DR environment

replacing hard-coded IP addresses and host names

connecting to dependent systems not having a corresponding disaster recovery environment

proper sequencing of applications

thorough preparation and coordination

ensuring a back-out plan and data replication during the exercise.

This article assumes a parallel exercise scenario and highlights the common technical challenges faced in
conducting an end-to-end disaster recovery parallel exercise in a warm site. In a parallel exercise, the DR
environment is brought up without interrupting or shutting down the production environment.
Isolating the DR Environment
As everyone will agree, we need to perform the disaster recovery exercise without any interruption to
production. This is very easily said, but it is the toughest challenge for the disaster recovery coordinator,
especially when it is required to do a parallel test at a warm site. Isolating the DR environment and at the
same time conducting the full life cycle testing requires a lot of planning and coordination.
The key issue with a full life cycle test is the potential interruption to production systems by unintended
access either by other applications or batch jobs. This may result in updating some transactions in the
production environment during the test since these restored systems might have the same host names or
IP addresses as the production systems. Any production interruption, such as a duplicate financial transaction for paying a vendor or a missing critical transaction caused by the disaster recovery exercise, could put your disaster recovery effort in jeopardy.
One should ensure that disaster recovery instances are not connected to the production environment at
all layers, including the database and network layers. For example, at the database layer, the tnsnames.ora file or database (DB) links should be updated to ensure that only DR instances are
speaking to each other. At the network layer, appropriate firewall rules should be implemented to block
any traffic from the disaster recovery environment to the production environment.
In an isolated DR environment, there will be challenges for desktop clients and end users to connect to the DR environment and to verify whether the production or the DR environment is being accessed. These challenges can be overcome by allowing access to the disaster recovery environment via DR-Citrix, DR host names, or direct DR-URLs as applicable. End-user client machines' local hosts files and configuration files need to be configured to point to DR host names instead of production host names during the DR exercise.
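As a concrete illustration, the pre-exercise isolation check can itself be scripted. The following minimal Python sketch assumes the DR environment lives in a known subnet and that clients use DR host name aliases; the subnet and alias names shown here are hypothetical placeholders for an organization's own values.

```python
# A minimal isolation check, assuming the DR network sits in a known subnet.
# The subnet and alias names below are hypothetical examples.
import ipaddress
import socket

DR_SUBNET = ipaddress.ip_network("10.20.0.0/16")   # assumed DR address range
DR_ALIASES = [
    "erp-db.dr.example.com",                        # hypothetical DR aliases
    "mq-broker.dr.example.com",
]

def check_isolation(aliases):
    """Return a list of aliases that do not resolve cleanly into the DR subnet."""
    failures = []
    for name in aliases:
        try:
            addr = ipaddress.ip_address(socket.gethostbyname(name))
        except socket.gaierror:
            failures.append(f"{name}: does not resolve")
            continue
        if addr not in DR_SUBNET:
            failures.append(f"{name}: resolves to {addr}, outside the DR subnet")
    return failures

if __name__ == "__main__":
    problems = check_isolation(DR_ALIASES)
    if problems:
        print("Isolation check FAILED:")
        for p in problems:
            print("  -", p)
    else:
        print("All DR aliases resolve inside the DR subnet.")
```

Running such a check before any transaction testing begins gives early warning that a client could unintentionally reach a production system.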
Replacing Hard-Coded IP Addresses and Host Names
In many organizations, a major issue in disaster recovery exercises is hard-coded IP addresses and host
names in applications, particularly in batch jobs. There is a possibility that interfaces and batch jobs might
fail or interrupt production systems during the exercise if there are any hard-coded IP addresses or host
names. Hence one needs to thoroughly analyze all the systems involved and identify any hard-coded IP addresses or host names. As a best practice, one should always reference alias names and avoid hard-coding host names or IP addresses. One of the important tasks for effective disaster recovery implementation is to convert every application to reference alias names rather than the primary host names listed in DNS or IP addresses.
However, it can be a tough job to replace the hard-coded host names or IP addresses in some applications that were developed several years ago. In such cases, it is suggested to use automated scripts as much as possible to replace the production host name or production IP address with the respective DR host name or DR IP address. These DR scripts should be documented, and care should be taken to ensure they are not overwritten during storage replication to the DR environment.
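A minimal sketch of such a replacement script is shown below. The host name mapping and configuration file paths are illustrative only; a real script would be driven by the organization's own DR runbook, and the back-out copy written alongside each file supports the back-out plan discussed later in this article.

```python
# A sketch of an automated repoint script: production host names and IP
# addresses (mapping is illustrative) are replaced with their DR equivalents
# in a known list of configuration files, keeping a back-out copy of each file.
import re
import shutil
from pathlib import Path

HOST_MAP = {  # production value -> DR value (hypothetical examples)
    "erp-db.prod.example.com": "erp-db.dr.example.com",
    "10.10.5.21": "10.20.5.21",
}
CONFIG_FILES = [Path("config/tnsnames.ora"), Path("config/batch_jobs.properties")]

def repoint_to_dr(path: Path) -> int:
    """Rewrite one configuration file in place; return the number of replacements."""
    text = path.read_text()
    changed = 0
    for prod, dr in HOST_MAP.items():
        # Match whole tokens only, so "host1" is never rewritten inside "host10".
        pattern = re.compile(rf"(?<![\w.-]){re.escape(prod)}(?![\w.-])")
        text, count = pattern.subn(dr, text)
        changed += count
    if changed:
        shutil.copy2(path, str(path) + ".pre-dr")   # back-out copy before writing
        path.write_text(text)
    return changed

if __name__ == "__main__":
    for cfg in CONFIG_FILES:
        if cfg.exists():
            print(f"{cfg}: {repoint_to_dr(cfg)} value(s) repointed to DR")
        else:
            print(f"{cfg}: not found, skipped")
```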
Connecting to Dependent Systems Not Having a Corresponding Disaster Recovery Environment
One of the key challenges in an end-to-end disaster recovery exercise is how to test the connecting
interfaces with other applications which do not have a corresponding disaster recovery environment. For
instance, as represented in figure 1 let us assume an application X, which is hosted at the disaster
recovery site and needs to interface with application Y, which is hosted at a third-party site. If application Y
has a corresponding disaster recovery system, then we can connect both disaster recovery systems
during the exercise. Otherwise, one needs to look into options of using the other available environments,
such as development, test, or pre-production systems of Y application for testing. Flowcharts, data feeds,
and architecture comparisons for production and disaster recovery would help in identifying all the
required components for the successful functioning of applications in a disaster scenario. An interface architecture comparison between the production and DR environments is shown in figure 1. In this DR interface architecture drawing, since Y, which is a vendor application, did not have any DR environment, the DR exercise was conducted by connecting from the DR environment of application X to the test environment of application Y.

Figure 1: DR Interface Architecture Drawing
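One practical way to manage this is to record, per interface, which environment of the partner system will be used during the exercise. The sketch below assumes hypothetical interface names and URLs; the point is that application X resolves each outbound interface from a single agreed mapping rather than from values scattered across its configuration.

```python
# A sketch of a per-exercise endpoint map for application X. The interface
# names and URLs are hypothetical; interfaces whose partner has no DR site
# point at an agreed substitute (here, the vendor's test environment).
INTERFACE_ENDPOINTS = {
    "vendor_Y_orders": {
        "production":  "https://y.vendor.example.com/api",
        "dr":          None,                                     # Y has no DR environment
        "dr_exercise": "https://y-test.vendor.example.com/api",  # agreed test substitute
    },
    "internal_reporting": {
        "production":  "https://reports.prod.example.com",
        "dr":          "https://reports.dr.example.com",
        "dr_exercise": "https://reports.dr.example.com",         # true DR instance exists
    },
}

def endpoint_for(interface: str, mode: str = "dr_exercise") -> str:
    """Return the endpoint application X should call for this interface and mode."""
    target = INTERFACE_ENDPOINTS[interface].get(mode)
    if target is None:
        raise RuntimeError(f"{interface}: no environment agreed for mode '{mode}'")
    return target

if __name__ == "__main__":
    print(endpoint_for("vendor_Y_orders"))     # vendor test environment
    print(endpoint_for("internal_reporting"))  # DR instance
```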


Proper Sequencing of Applications and Overall RTO
A crucial challenge in most disaster recovery
exercises is the proper identification and sequencing of upstream and downstream dependencies. When
performing a disaster recovery exercise with a full life cycle of transactions for 20 or 30 applications,
sequencing of applications becomes very critical. The sequence should be planned out properly based on
the dependency and agreed overall Recovery Time Objective (RTO) requirements for multiple
applications. Documenting all the critical interfaces for a disaster recovery scenario would help in
ensuring proper sequencing of applications. While considering the dependencies of an application, the interfaces need to be analyzed for the business requirement of the data and the frequency at which they run.
Figure 2 illustrates the resulting application dependency analysis diagram. As illustrated in this diagram, D1 is the application that needs to be brought up first in the DR environment before bringing up application X. This is because D1 provides critical input data to X, without which X cannot function appropriately. Inbound interfaces that feed data to applications are, in most cases, required to be brought up first at the DR site. In this example, the applications marked D1, D2, D3, and D4 are brought up first, after which application X is brought up as D5. Under this scenario, the RTO for application X (D5) will depend on the RTOs of the other four dependent applications (D1, D2, D3, and D4), and this overall RTO should also meet the business requirements. Applications for outbound interfaces are brought up subsequently. Applications can also be brought up in parallel rather than in sequence, as the business requirements allow.

Figure 2: Application Dependency Analysis
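Once the dependencies are documented, the bring-up sequence and the cumulative RTO can be derived mechanically. The following Python sketch uses illustrative applications and recovery times; it assumes that independent branches can be recovered in parallel, so the cumulative RTO for an application is the longest dependency chain leading up to it.

```python
# A sketch of deriving the bring-up order and cumulative RTO from a documented
# dependency map such as figure 2. Applications and recovery times are
# illustrative; branches with no dependency on each other are assumed to be
# recoverable in parallel, so the cumulative RTO is the longest chain.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# application -> applications that must be up before it (inbound feeds first)
DEPENDS_ON = {
    "D1": set(), "D2": set(), "D3": set(), "D4": set(),
    "X": {"D1", "D2", "D3", "D4"},            # X corresponds to D5 in figure 2
    "outbound_reports": {"X"},
}
RECOVERY_HOURS = {"D1": 2, "D2": 1, "D3": 3, "D4": 2, "X": 4, "outbound_reports": 1}

def bring_up_order():
    """A valid recovery sequence: every application appears after its dependencies."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())

def cumulative_rto(app, _cache={}):
    """Longest dependency chain ending at 'app', assuming parallel branch recovery."""
    if app not in _cache:
        upstream = max((cumulative_rto(dep) for dep in DEPENDS_ON[app]), default=0)
        _cache[app] = upstream + RECOVERY_HOURS[app]
    return _cache[app]

if __name__ == "__main__":
    print("Bring-up order:", " -> ".join(bring_up_order()))
    for app in DEPENDS_ON:
        print(f"Cumulative RTO for {app}: {cumulative_rto(app)} h")
```

Comparing the computed cumulative RTO for each application against the business-agreed RTO quickly shows where the recovery sequence, or the individual recovery times, need to be improved.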


Thorough Preparation and Coordination
In disaster recovery exercises, teams can tend to skip the proper sequence of exercises or overlook its importance. But on the road to an end-to-end disaster recovery exercise, it is crucial to follow a proper sequence of testing, namely walk-through, simulation, parallel, and then full-interruption exercises.
A walk-through and simulation test is required first among the various participating teams
(network/firewall, server, database, middleware, and various applications) to ensure that everyone knows
what the scenario is, who needs to do what, and what the sequence is. These tests bring out the potential risks to the production environment during the DR exercise and the coordination or sequence-related issues
in recovery procedures. Thorough preparation and coordination involving a great deal of planning,
involvement from all the participating teams, and "mini" tests to test all the subcomponents would result in
identifying most of the potential issues before they occur and in eliminating most of the human errors.

Figure 3: Performing Disaster Recovery Exercise Using Point-in-Time Copy of Data


Ensuring Back-Out Plan and Data Replication During Exercise

As always, one needs to ensure that the appropriate process is in place for a solid back-out plan (a restore point prior to the start of the test) and for how to abort the exercise in the event of an anomaly or a critical business need while performing the exercise.
One also needs to ensure that data replication is not stopped while testing if there is a continuous data replication process in place to the disaster recovery site. As shown in figure 3, if storage-array-based Storage Area Network (SAN) replication is used, then point-in-time copies of data can be used for disaster recovery exercises [4] by presenting those copies to hosts instead of attaching the SAN replica directly to the hosts. In this way, data replication does not need to be stopped during the exercise. In the figure above, there is continuous data replication from the local data center (the primary site) to the remote site even during the DR exercise, and testers work against point-in-time copies of the data. Also, as a best practice, a point-in-time copy and backup are taken at the primary site, which can help in resolving any major issues caused by data corruption at the primary site. In storage-array-based replication, there is a risk that if the data on the primary-site SAN is corrupted, the secondary-site SAN will replicate the corrupted data, in which case both become unusable. Hence one should consider this risk and design the DR replication solution accordingly.
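The flow in figure 3 can be orchestrated with a short script. The sketch below is conceptual only: the storage-cli commands are hypothetical placeholders for the storage vendor's actual snapshot and LUN-mapping interface, and the try/finally structure is what ties the exercise back to the back-out plan.

```python
# Conceptual sketch only: the "storage-cli" commands below are hypothetical
# placeholders for the storage vendor's actual snapshot and LUN-mapping tools.
# Replication to the DR site keeps running; testers work on the snapshot.
import datetime
import subprocess

def run(cmd: str) -> None:
    """Run one storage command and stop this step of the exercise if it fails."""
    print(f"[{datetime.datetime.now():%H:%M:%S}] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def dr_test_with_snapshot(replica_lun: str, test_host: str) -> None:
    snap = f"{replica_lun}_pit_{datetime.date.today():%Y%m%d}"
    # 1. Take a point-in-time copy of the replicated LUN (replication continues).
    run(f"storage-cli snapshot create --source {replica_lun} --name {snap}")
    try:
        # 2. Present the copy, not the live replica, to the DR test host.
        run(f"storage-cli lun map --lun {snap} --host {test_host}")
        input("Point-in-time copy presented. Press Enter when validation is complete...")
    finally:
        # 3. Back-out: always unmap and remove the copy, even if testing is aborted.
        run(f"storage-cli lun unmap --lun {snap} --host {test_host}")
        run(f"storage-cli snapshot delete --name {snap}")

if __name__ == "__main__":
    dr_test_with_snapshot("prod_db_replica", "dr-test-host-01")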
Simplified and Automated Recovery Procedures to Resolve Issues in Involving the Testing Team
A traditional recovery testing team consists of several groups, such as the operating system, database, middleware, networking, storage, and data center operations teams. Since multiple teams (about 8-9) are involved in testing, scheduling the test becomes more complicated. Also, sometimes during the test, if another high-priority production issue arises, testers may need to leave in the middle of testing, since in most cases they support the production environment as well. In order to reduce the complexity in scheduling and to avoid interruptions, it is recommended to reduce this dependency to a minimum and designate a DR tester who can run all the recovery steps alone and contact the respective (database, network, storage, OS) system administrators only when there is an issue. The important aspect of this arrangement is that the recovery steps should be documented in such a way that they can be understood and executed by a normal (L4) help desk level person who does not have specialized (L2/L1) administrator skills.
Besides testing, in a time of actual disaster, there is a tremendous amount of pressure and stress to get
everything back up and running and available to users. In manual processes, mistakes will be made for a
variety of reasons. Thus, it is suggested to automate the recovery process as much as possible. Having a
simplified and automated disaster recovery process would eliminate unnecessary time delays and manual errors during the recovery.
In most cases, simple scripts can help in reducing the recovery time considerably and in avoiding human
error and dependency on skilled administrators during disasters. For example, scripts can be used for
activities like mounting or unmounting the disk groups or changing the hard-coded host name in
configuration parameters to point to the DR host name, etc.
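A minimal sketch of such a single-operator recovery runbook is shown below. The steps, commands, and escalation teams are placeholders for the organization's own documented procedures; the intent is that each step is executed the same way every time, the script stops on the first failure, and it tells the operator exactly which team to contact.

```python
# A sketch of a single-operator recovery runbook. Steps, commands, and
# escalation teams below are placeholders for documented procedures; the
# operator only runs this script and, on failure, contacts the team it names.
import subprocess
import sys

RUNBOOK = [
    # (description,                             placeholder command,               escalation team)
    ("Mount DR database disk group",            "echo mount-dr-disk-group",        "Storage/OS"),
    ("Start DR database instance",              "echo start-dr-database",          "Database"),
    ("Repoint configurations to DR host names", "echo repoint-configs-to-dr",      "Application"),
    ("Start middleware and batch schedulers",   "echo start-middleware",           "Middleware"),
]

def execute_runbook() -> None:
    for number, (description, command, team) in enumerate(RUNBOOK, start=1):
        print(f"Step {number}: {description}")
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop at the first failure and tell the operator exactly whom to call.
            print(f"  FAILED (exit {result.returncode}) -- contact the {team} team")
            print(result.stderr.strip())
            sys.exit(1)
        print("  OK")
    print("All recovery steps completed.")

if __name__ == "__main__":
    execute_runbook()
```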

Conclusion
Growing changes in technology and business models demand business processes that are heavily reliant on complex and interdependent applications, and therefore an organization should attempt process-oriented, end-to-end disaster recovery exercises for testing these applications instead of traditional server-centric exercises. Even though there are many challenges in performing an end-to-end disaster recovery exercise, these challenges can be overcome by thoroughly analyzing the interdependencies and by using the appropriate sequence to bring up the dependent applications. An end-to-end disaster recovery exercise is the only way to effectively build stakeholder confidence in the recoverability of the disaster recovery environment and to understand the realistic RTO in a site-level outage where multiple applications are impacted.
Shankar Subramaniyan, CISSP, CISM, ABCP, has more than 13 years of experience as a technology
consulting and project management executive in the areas of IT Governance, Risk and Compliance
(GRC) and business continuity planning. He is a certified professional and has hands-on experience in
implementing disaster recovery solutions. He has implemented and managed Information Security
Management System (ISMS) based on industry standards such as ISO 27001 and ITIL. He has worked extensively on various compliance requirements, including SOX and PCI.
