
DR FAILOVER RUNBOOK

www.cloudhpt.com

Table of Contents

Executive Summary
Document Control
Document Change History
Contact Information
Communication Structure of DR Plan
Declaration Guidelines
Key Contacts
How to use Assured Support
How a support case works
Escalation Contacts
Incident Management
Incident Priority Table
Incident Response and Escalation Table
Change Management
Service Description
Change implementation targets Table
Service Levels, Key Performance Indicators and Service Credits
Assured SLA Guarantee
24 x 7 x 365 Staffing Guarantee
Emergency Service and Reboot Requests Response Guarantee
Monitoring
Part 2
System Level Procedures
Evoking Assured DR yourself (Example to be replaced)
Acronis
Zerto replication Failover or test
Roles and Responsibilities

1. Executive Summary
DR planning is complex and spans three key areas: technology, people, and process. From an IT
perspective, planning starts with a business impact analysis (BIA) by application/workload.
Natural tiers or stages of DR begin at phase 1 infrastructure (networking, AD, DHCP, etc.) then
extend to recovery by application tiers. Each application tier should have an established
recovery time objective (RTO) and recovery point objective (RPO) based on business risk.
DR testing is essential not only to verify adequate recovery of systems and data, but also to uncover events or conditions encountered during real disaster scenarios that were not previously accounted for. Examples include change management issues, such as the needed reconfiguration of applications or systems. Recovering systems in the right sequence is also important. To ensure that DR testing, planning, and recovery are organized and effective, organizations need a disaster recovery "run book." This document is your company's Disaster Recovery (DR) run book.
A DR run book is a working document, unique to every organization, which outlines the
necessary steps to recover from a disaster or service interruption. It provides an instruction set for
personnel in the event of a disaster, including both infrastructure and process information. Run
books, or updates to run books, are the outputs of every DR test.
However, a run book is only useful if it is up to date. If documented properly, it can take the confusion and uncertainty out of the recovery environment which, during an actual disaster, is often in a state of panic. Use the run book in the event of a disaster.

2. Document Control
Document creation and edit records should be maintained by your company's disaster
recovery coordinator (DRC) or business continuity manager (BCM). If your organization does not
have a DRC, consider creating that role to manage all future disaster recovery activities.
Document Name:        Disaster Recovery Run Book for [Company]
Version:
Date created:
Date last modified:
Last modified by:

3. Document Change History


Version   Date         Description                          Approval
V1.0      1/6/2016     Initial version                      Business Owner / DRC
V1.1      29/12/2016   End of year DR test action plan      Test Manager / DRC
                       updates to run book

Distribution List
Ensure that all key stakeholders have access to the document. Use the chart below to indicate
the stakeholders to whom this run book will be distributed. This is a critical step in your DR plan.
Role                      Name   Email   Phone
Owner
Approver
Auditor
Contributor (Technical)
Contributor (DBA)
Contributor (Network)
Contributor (Vendor)

4. Contact Information
This section lists our contacts along with those from your IT department. This is the team that
will conduct ongoing disaster recovery operations and respond in the case of a true
emergency. The specific roles listed below are examples of those that might comprise your team.
All of these roles need to be in communication when in a disaster recovery mode of operation.
For pending events, this same distribution list should be used to provide advance notice of
potential incidents. Your customer service teams should also not be overlooked, as they are the
first line of communication to your customer base. Forgetting this step will cause extra work for
your primary recovery team as they take time to explain what is going on.
Your contacts

Title                                          Names   Phone   Email
Disaster Recovery Coordinator
Chief Information Officer
Network Systems Administrator
Database Systems Administrator
Chief Security Officer
Chief Technology Officer
Business Owner
Application Development Lead (as applicable)
Data Center Manager
Customer Support Manager
Call Center Manager



BIOS Helpdesk (declare DR here first)

Helpdesk Telephone      +971 4 607 0888
Helpdesk E-mail         support@biosme.com
Helpdesk Web Address    https://support.biosme.com
Your Username           xxxxxxxxxx
Your Portal PW          Known to whom and/or stored where?

Additional Contacts

BIOS Contact (Role)             Mobile   Email
Disaster Recovery Coordinator
Customer Service
Emergency Support
Sr. System Engineer
Director Service Delivery

Data Center Access Control List (Customer DC)


Maintain an up-to-date access control list (ACL) specifying who, in both your company and your
IT service provider (if applicable), has access to your data center and the resources therein.
Also specify which individuals can introduce guests to the data center. This will be useful for
determining, in an emergency scenario, who may be designated a point person for
facilitating access to critical infrastructure.
Examples are provided in the table below. Remove, replace, and add individuals to this list as
appropriate for your organization and infrastructure.

Name   Role                           Mobile   Access Level

       Chief Technology Officer                General access; can authorize guest access
       Director of Service Delivery            General access; can authorize guest access
       Service Delivery Engineer               Server room, cage/cabinet, and NOC access; cannot authorize guest access
       Systems Engineer                        Server room access; cannot authorize guest access
       Network Engineer                        Server room access; cannot authorize guest access
       Chief Security Officer                  General access; can authorize guest access
       Chief Information Officer               General access; can authorize guest access

5. Communication Structure of DR Plan


Disaster event call tree:
During any disaster event there should be a defined call tree specifying the exact roles and
procedures for each member of your IT organization to communicate with key stakeholders
(both inside and outside of your company). When defining the call structure, limit your tree and
branches to a 1:10 ratio of caller to call recipient.
As a first step, for example, your Disaster Recovery Coordinator should call BIOS and then might
call both the company CEO and head of operations, both of whom would then inform the
appropriate contacts in their teams along with key customers, service providers, and other
stakeholders responsible for correcting the service outage and restoring data and operations.
An example call tree might appear as follows:

DR Coordinator (calls BIOS and then informs the company via):
    Head of Operations
        Director of Service Delivery
            Sr. Systems Engineer
            Network Engineer
            Systems Administrator
    CEO
        Director of Business Development
            Sales contact
            Customer Service Representative
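The 1:10 fan-out rule above can be checked mechanically. A minimal sketch: the tree maps each caller to the people they phone. The role names follow the example tree, but the exact branching encoded here is an assumption, not part of the runbook.

```python
# Call tree as caller -> list of call recipients. Branching is illustrative.
CALL_TREE = {
    "DR Coordinator": ["BIOS", "Head of Operations", "CEO"],
    "Head of Operations": ["Director of Service Delivery"],
    "Director of Service Delivery": ["Sr. Systems Engineer", "Network Engineer",
                                     "Systems Administrator"],
    "CEO": ["Director of Business Development"],
    "Director of Business Development": ["Sales contact",
                                         "Customer Service Representative"],
}

def violates_fanout(tree: dict, max_recipients: int = 10) -> list:
    """Return every caller whose branch exceeds the caller-to-recipient limit."""
    return [caller for caller, recipients in tree.items()
            if len(recipients) > max_recipients]
```

Running `violates_fanout(CALL_TREE)` on a well-formed tree returns an empty list; any caller it names should have their branch split during the next plan review.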

6. Declaration Guidelines
Examples of when DR should be considered:

Event                              Plan of Action                     Owner
Application Failure                Recycle                            Application business owner
Hardware failure                   Failover to alternative            IT Support
Power failure                      Enact total failover plan          Disaster Recovery Coordinator (DRC)
Data center failure                Enact total failover plan          Disaster Recovery Coordinator (DRC)
(fire/flood/hack/power)
Pending weather event              Review all DR plans, notify DRC,   Business Owner
(winter storm, hurricane, etc.)    put key employees on standby

Situation                            Action                                 Owner
Workaround does not exist in a       Declare application-level failover     Disaster Recovery
time frame that does not affect      and enact failover to the              Coordinator (DRC)
customer SLAs                        secondary site
Restoration procedures cannot be     Declare application-level failover     Disaster Recovery
completed in your production         and enact failover to the              Coordinator (DRC)
environment                          secondary site
A production environment no longer   Declare a data center failure and      Disaster Recovery
exists or is unable to be accessed   enact a total failover plan from       Coordinator (DRC)
                                     primary to secondary data center

If your environment is monitored by BIOS, the following events will create trouble tickets and
make us aware that you are facing issues. We will then endeavor to address them and escalate to
your contact person to ask whether you want to declare a DR.

Event Type                         Duration of Event   Corrective Action                             Event Criticality

Performance (Monitoring Status     > 2 minutes         Isolate problem device / recycle device       Critical Level
= Warning Alert Level)

Memory Usage > 80%                 > 2 minutes         Isolate physical device / virtual machine;    Critical Level
                                                       configure memory pool increase; clear
                                                       memory cache; clear memory buffer

CPU Usage > 90%                    > 3 minutes         Increase compute allocation (virtual); add    Critical Level
                                                       additional compute resources into
                                                       application pool

Memory                             > 15 minutes        Check memory queue; clear memory cache        Critical Level
                                                       of affected system; increase memory
                                                       allocation (virtual)

Storage                            > 3 minutes         Check LUN, space, and snapshots               Critical Level

Network                            > 3 minutes         Check network connectivity                    Critical Level

Line - VPN                         > 1 minute /                                                      Critical Level
                                   > 5 minutes
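The alerting rules above pair a metric condition with a minimum duration before a ticket is raised. A minimal sketch of that duration-gating, assuming one sample per minute; the thresholds come from the table, but the function and data shapes are illustrative and not part of the BIOS monitoring stack.

```python
# metric -> (threshold value, minutes the breach must persist before alerting)
THRESHOLDS = {
    "memory_pct": (80.0, 2),  # alert when memory usage > 80% for more than 2 minutes
    "cpu_pct":    (90.0, 3),  # alert when CPU usage > 90% for more than 3 minutes
}

def should_alert(metric: str, samples_per_minute: list) -> bool:
    """True if the newest consecutive samples all breach the threshold for
    strictly longer than the configured duration (one sample per minute)."""
    limit, minutes = THRESHOLDS[metric]
    if len(samples_per_minute) <= minutes:
        return False
    recent = samples_per_minute[-(minutes + 1):]  # strictly more than `minutes`
    return all(value > limit for value in recent)
```

A single spike that recovers within the window never fires; only a sustained breach opens a ticket, which matches the "Duration of Event" column.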

7. Key Contacts

GENERAL ENQUIRIES       sales@cloudhpt.com            TEL:
PORTAL URL              https://portal.cloudhpt.com
REQUEST PORTAL ACCESS   accessportal@cloudhpt.com
SALES                   sales@cloudhpt.com            TEL:
MEA SERVICE DESK        support@cloudhpt.com          TEL:
BILLING SERVICES        billing@cloudhpt.com          TEL:
WEBSITE                 www.cloudhpt.com

8. How to use Assured Support

1. How a support case works

Initial Contact
The Provider Technical Support team will work together with the Customer to identify and resolve
problems. A new case is created in our Helpdesk Management System for each Customer issue,
either by the Customer directly online, by email, or via telephone with a Helpdesk Support
Professional. All required fields in the Helpdesk Management System must be completed so that
a unique Case Number can be generated; this case number must be quoted in all further
communications to enable tracking of that issue to closure.
The customer may raise support tickets as part of the Assured service in addition to the proactive
work in our scope of services. Tickets can be raised using the following methods:


Method      Description
Online      http://support.CloudHPT.com
Email       support@CloudHPT.com
Telephone   +971 04 378 9088

For Priority 1 cases, telephone is the preferred method of call logging. The Assured desk operates
24 hours a day, 7 days a week, including holidays.
Fig 1. http://support.CloudHPT.com login page

2. Escalation Contacts

The people listed below are our Escalation Contacts for this Agreement.

Contact Name               Title                     Phone Number       Email Address          Location
Rijeesh Rathnakumar (L1)   Support Manager           +971 55 985 9694   Rijeesh@CloudHPT.com   Dubai
Gerald Vorster (L2)        Assured Desk Manager      +971 56 644 9251   gerald@CloudHPT.com    Dubai
Chris Dalala (L4)          Cloud Services Director   +971 50 459 7713   chris@CloudHPT.com     Dubai
Adam Wolf                  Technical Director        +971 50 551 6058   adam@CloudHPT.com      Dubai
Dominic Docherty           Managing Director         +971 50 624 8505   dominic@CloudHPT.com   Dubai
Escalation Contacts are used to inform all relevant stakeholders within the Customer business
hierarchy of the status of P1 Incidents. This information will be disseminated in the form of
automated emails. Escalation Contacts also possess the authority to approve Emergency
changes should they be required.

3. Incident Management

1.1. Service Description

Support is accessed through the Provider's dedicated support line; from this point, calls are
routed to the case owner or the relevant incident team to ensure the Customer reaches the
needed expertise in a timely manner.
All incidents will be recorded in the Provider Service Desk system under the Incident
Management workflow. The Provider records the name of the person reporting the Incident, the
time of the call, and any other pertinent information, along with criteria for resolution, to ensure
the workflow is initiated correctly.
It should be noted that the Customer shall report priority 1 and 2 cases via telephone only.
Priority 1 and 2 cases are the highest-priority incidents, reserved for when systems are
non-functioning or offline or when users are directly unable to work on the systems provided. As
such, the customer should inform the support desk via telephone to ensure the ticket is opened
immediately. The Provider cannot offer any Service Levels for Business Critical Incidents reported
via email.
4. Incident Priority Table
                          Business Impact
Affect                    Minor   Moderate   Major   Critical
System/Service Down       P3      P2         P1      P1
System/Service Affected   P4      P3         P2      P1
User Down/Affected        P4      P4         P3      P2
4. Incident Response and Escalation Table


Priority   Response SLA   Specialist Review   Escalation Manager   Escalation Director   Email Frequency   Target Resolution
P1         30 Minutes     2 Hours             Immediate            2 Hours               Hourly            2 Hours
P2         1 Hour         4 Hours             4 Hours              4 Hours               4 Hours           1 Day
P3         4 Hours        8 Hours             2 Days               Never                 Daily             10 Days
P4         4 Days         8 Hours             2 Days               Never                 Daily             30 Days

For an Incident, Response is the time from when the Customer first logs a request with a
Provider helpdesk professional to the time the Provider responds with a suitably qualified
person, whether via email, telephone call, or in person. For the detailed process flow, see the
current Managed Services Handbook. Support toward a resolution shall be provided from the
time of Response until the Incident has been resolved.
For an Incident, Escalation shall take place if a resolution has not been achieved within the
timeframe set out in the table above, and escalation will continue until details of the Incident
are given to the Escalation Director.
From the time of Response until resolution, updates shall be provided to the Named Contacts
and/or Escalation Contacts by email at the frequencies set out in the table above.

5. Change Management

1.2. Service Description

All Changes require a Request for Change (RFC) form to be completed and submitted to the
Provider detailing the required Change. The Provider will reject incomplete RFC forms.
Changes will follow the Change Management Process as defined in the Provider Managed
Services Handbook. It should be noted that Emergency Changes will only be carried out in the
event of a P1 scenario (either pro-active or reactive) and/or a major Security Incident where
the Provider deems appropriate.

All Normal changes are subject to the following risk assessment matrix:

                     Probability of Negative Impact Until Change is Successfully Completed
Impact on Service    Low                                   Medium              High
High                 Significant (CR3)                     Major (CR2)         Critical (CR1)
Medium               Minor (CR4)                           Significant (CR3)   Major (CR2)
Low                  Candidate for Standardization (CR5)   Minor (CR4)         Significant (CR3)

1.2.1. Change implementation targets Table

Change Type   Implementation Start Date
Normal CR1    1 Working Day from CAB Approval
Normal CR2    2 Working Days from CAB Approval
Normal CR3    3 Working Days from CAB Approval
Normal CR4    4 Working Days from CAB Approval
Normal CR5    5 Working Days from CAB Approval
Normal CR6    Projects Only
Standard      Change to be completed within 4 Working Days from logging on Provider System
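The targets above are a direct lookup from Normal change class to the working days allowed from CAB approval to implementation start; a minimal sketch. CR6 (projects only) and Standard changes follow separate tracks, so the hypothetical helper below returns None for them.

```python
from typing import Optional

# Normal change class -> working days from CAB approval to implementation start
TARGET_WORKING_DAYS = {"CR1": 1, "CR2": 2, "CR3": 3, "CR4": 4, "CR5": 5}

def implementation_start_days(change_class: str) -> Optional[int]:
    """Working days from CAB approval by which a Normal change must start.
    Returns None for classes handled out-of-band (CR6, Standard, Emergency)."""
    return TARGET_WORKING_DAYS.get(change_class)
```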

Emergency Changes are dealt with in conjunction with the Incident Management Process;
further details of this and all other change types are given in the Managed Services
Handbook.
Standard and Emergency Changes to the Service within the scope of this Contract will be
completed by the Provider at no additional cost.
Project and Normal Changes may require a separate Statement of Work to be agreed
between the Customer and Provider. Such changes may be subject to an Additional Service
Charge.
The Provider will review security, critical, and software updates and, where appropriate, log a
Security Incident. Where the Provider deems it appropriate, the Provider will then undertake any
necessary Changes via the Change Management Process; depending on the severity, and at the
Provider's discretion, a Standard Change or Emergency Change will be implemented.

6. Service Levels, Key Performance Indicators and Service Credits

Category       Service Level Target
P1 Incidents   95% of incidents responded to within 30 minutes.
P2 Incidents   95% of incidents responded to within 1 hour.
P3 Incidents   90% of incidents responded to within 4 hours.
P4 Incidents   90% of incidents responded to within 1 day.
Root Cause     90% of P1 Incidents to receive a Root Cause Analysis within 10 days.
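One way to measure attainment against the response targets above is the share of incidents answered within each priority's window. A minimal sketch; the windows and goal percentages come from the table, while the incident-data shape is an assumption.

```python
# response windows in minutes and goal percentages, transcribed from the table
TARGET_MIN = {"P1": 30, "P2": 60, "P3": 240, "P4": 1440}
GOAL_PCT   = {"P1": 95, "P2": 95, "P3": 90, "P4": 90}

def sla_met(priority: str, response_minutes: list) -> bool:
    """True if the share of on-time responses meets the target percentage."""
    on_time = sum(1 for m in response_minutes if m <= TARGET_MIN[priority])
    return 100 * on_time / len(response_minutes) >= GOAL_PCT[priority]
```

For instance, nine P4 incidents answered on time out of ten is exactly 90% and passes, while two of three P1 incidents answered within 30 minutes falls well short of 95%.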

7. Assured SLA Guarantee

The Provider takes its responsibilities to the Customer very seriously and is proud to offer one of the
most comprehensive Service Level Agreements (SLA) in the industry. We are committed to
providing the Customer with reliable services, support, management, and secure infrastructure.
Our processes and policies have undergone thorough and independent audits, earning us Cisco
Master Managed Service Provider status. This SLA includes our promises.

8. 24 x 7 x 365 Staffing Guarantee


The Provider guarantees that technical staff will be available in our operations department 100%
of the time to assist clients with issues, 24x7x365.

9. Emergency Service and Reboot Requests Response Guarantee


The Provider will respond to emergency service and reboot requests within 30 minutes, 100% of the
time. Emergency service requests are defined as requests for support on servers that are
down. This guarantee applies only to emergency tickets submitted through our client centre
with the emergency option selected, or to requests delivered over the phone to a member of the
Provider technical staff where the client clearly states the issue is an emergency and that
emergency is documented by our staff.


10. Monitoring

We use various tools to monitor your environment in real time. They provide automatic alerts the
moment any system goes offline, and alerts auto-create tickets on our helpdesk. In the event of an
alert you will be contacted to see if our assistance is needed. You can also contact our team 24x7x365.
Our managed service, Assured, is the pride of our business and operates 24x7. We
understand IT is a critical component of any business and must be on and optimized at all times.
This is why Assured is included as an integral part of our Cloud as a Service offering.


Process once DR is declared:


Part 2
9. System Level Procedures
The run book content up to this point has addressed organizational points of concern. At this
stage you should have a fully documented procedure for your company's issue
management and escalation, criteria for evaluating and declaring an emergency scenario,
and procedures for ensuring all key stakeholders and responsible parties are in communication
and are ready and able to take the necessary steps to begin disaster recovery procedures.
From this point forward, the run book shifts focus to system-level procedures to address
infrastructure and network-level configurations, restoration steps, and system-level responsibilities
while in disaster recovery mode.

Infrastructure Overview
A detailed overview of your IT environment is required in this section, including the location(s) of
all data center(s), nature of use of those facilities (e.g. colocation, tape storage, cloud hosting),
security features of your infrastructure and the hosting facilities, and procedures for access to
those facilities.


Data center
Specify the location of all facilities in which your company's data is stored. Include an address
and directions to each location.

Example:


Network Layout Topology (example)


Access to Facilities
Data centers and colocation facilities typically maintain strict entry protocols. Certain members
of your organization will typically hold the appropriate credentials to enter the facility. Detail the
members of your team (and/or your IT service provider's team) who have access to all data
facilities, along with any requirements for access. We may need them to be inside the DC while
recovering from a failover.

Order of Restoration
This section includes instructions for recovery personnel that lay out which infrastructure
components to restore and in which order. It should take into account application
dependencies, authentication, middleware, database, and third-party elements, and list
restoration items by system or application type.
Ensure that this order of restoration is understood before engaging in restore work. An example is
provided below; the rest of the table should be filled out in the exact order that restoration
procedures are to be completed.

Order of Restoration Table:

Server Name   Server Role                 Order of Restoration                OS / Patch Level   Application Loaded
Ws12_VF1      Web Server Valley Forge 1   Restore prior to db12_VF1 startup   ESX4.1             Apache
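Restoration order is a dependency problem: a server should come up only after everything it depends on is running. A minimal sketch using a topological sort; Ws12_VF1 and db12_VF1 come from the example row, while app01 is a hypothetical application-tier server added for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# server -> set of servers that must be restored before it starts
dependencies = {
    "Ws12_VF1": set(),          # web server: no upstream dependency in this sketch
    "db12_VF1": {"Ws12_VF1"},   # per the example: restore Ws12_VF1 prior to db12_VF1 startup
    "app01":    {"db12_VF1"},   # hypothetical application tier depending on the database
}

restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)  # ['Ws12_VF1', 'db12_VF1', 'app01']
```

Keeping dependencies as data means the restoration sequence can be regenerated after every DR test instead of being re-derived by hand, and a cyclic dependency raises an error before it reaches the runbook.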


System Configuration
This section should include system- and application-specific topology diagrams and an inventory
of elements that comprise your overall system. Include networking, web app middleware, database,
and storage elements, along with third-party systems that connect to and share data with
this system.
Network table (attach as xls; for example only)

Device Type     Name   Primary IP   OS Level   Gateway   Subnet Mask
Firewall
Load balancer
Switch

Server table (attach as xls; for example only)

Server Name/Priority   OS   Patch   IP Address   Subnet   Gateway   DNS   Alternate DNS   Secondary IPs   Production MAC Address

Storage table (attach as xls; for example only)

Name   LUN   Address   RAID Configuration   Host Name

Backup Configuration (highly advised if not using cloud replication or backup)


Use this section to list instructions specifying the servers, directories and files from (and to) which
backup procedures will be run. This should be the location of your last known good copy of
production data.

Server   Software   Version   Backup Cycle   Backup Source   Backup Target


10. Evoking Assured DR yourself

Stage 1: Zerto DR Failover test

Virtual machines are replicated between Production (CUD) and DR (CloudHPT). Virtual
Protection Groups (VPGs) have been created for use in a DR scenario. The steps below show
how to execute a Failover or a Failover Test.


11. Acronis

Do the following to recover files from a backup:

1. Log in to the Backup Management Console and click Recover.
2. Select Recover files.
3. Select the files and click Recover. When recovering from Cloud storage, select either to
   recover the files or to download them in an archive:
   - Recover from a local storage
   - Recover or download from the Cloud storage
4. Select the location you want to recover or download the files to.

12. Zerto replication Failover or test

Production Site: Customer XYZ (CXYZZERTO-DR)
Disaster Recovery Site: CloudHPT (hpt-rep-zvm01)

CXYZ opted to utilize Zerto Replication for DRaaS. This section details the step-by-step failover
testing of the CUD virtual server from the production site at CXYZ to the DR site at CloudHPT.
All VPGs are named after the VM inside the VPG.

1. VPG CloudHPT_Demo indicates the replication job for 1 VM.
2. Click the small pencil icon (Edit) on the VPG.
3. The wizard to edit the VPG will be presented.
4. Move through each page, verifying the settings (VMs).
5. Zerto Replication page.
6. Zerto Storage page.
7. Zerto Recovery page.
8. Zerto Network page. A DMZ network is not required in the DR site (click Edit Settings to
   change the failover IP address).
9. To run a Test Failover, make sure the slide bar is on Test, then click Failover. Zerto will
   perform the test failover in an isolated environment.
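The steps above drive the test failover through the Zerto UI; the same action can be scripted against the ZVM REST API, which can be useful for scheduled DR tests. This is a hedged sketch only: the endpoint paths mirror the public Zerto API, but the host name (taken from the DR-site ZVM in this runbook), session token handling, and payload defaults are assumptions to verify against your ZVM version before use.

```python
import json
import urllib.request

def failover_test_url(zvm: str, vpg_id: str) -> str:
    """Build the ZVM endpoint that starts a test failover for one VPG."""
    return f"{zvm}/v1/vpgs/{vpg_id}/FailoverTest"

def start_failover_test(zvm: str, session_token: str, vpg_id: str):
    """POST to the FailoverTest endpoint; returns the task identifier to poll.
    Network call against a placeholder host -- do not run outside a DR test."""
    req = urllib.request.Request(
        failover_test_url(zvm, vpg_id),
        data=json.dumps({}).encode(),  # empty body: assume default test settings
        headers={"x-zerto-session": session_token,
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Usage would be, for example, `start_failover_test("https://hpt-rep-zvm01:9669", token, vpg_id)` after authenticating and looking up the VPG identifier for CloudHPT_Demo; Zerto runs the test in the isolated network exactly as the UI slide bar does.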

13. Roles and Responsibilities

Service Delivery Responsibility Assignment Matrix

Table Key

Code   Description
R      Responsible Party: those who do the work to achieve the task
A      Accountable Party: the party ultimately answerable for the correct and thorough
       completion of the deliverable or task, and the one who delegates the work to the
       responsible party
C      Consulted Party: those whose opinions are sought, typically subject matter experts,
       and with whom there is two-way communication
I      Informed Party: those who are kept up to date on progress, often only on completion
       of the task or deliverable, and with whom there is just one-way communication


Positions that will fill these roles and responsibilities will often include your DR coordinator, network
engineer, database engineer, systems engineer, application owner, data center service
coordinator, and your service provider. Identify the responsibilities of each of these roles in a
disaster event, then map them onto a matrix of all activities associated with recovery
procedures, as in the example table provided below.

                                                                   Responsible Parties
Activity                                                           R      A      C      I
Maintain situational management of recovery events                 DRC    DRC    DRC    All
React to server outage alerts
React to file system alerts
React to host outage alerts
React to network outage alerts
Document technical landscape
Configure network for system access
Configure VPN and acceleration between your business and
service provider network (if applicable)
Maintain DNS or host file
Monitor service provider network availability (if applicable)
Diagnose service provider network errors (if applicable)
Create named users at OS level
Create domain users
Manage OS privileges
Create virtual machines
Convert physical servers to virtual servers
Install base operating system
Configure operating system
Configure OS disks
Diagnose OS errors
Start/Stop the virtual machine
Windows OS licensing (or your operating system)
Security hardening of the OS
Daily server level backup
Patch Management for Windows servers (or your operating system)
Provide a project manager
Provide a key technical contact for OS, network, and SAN
Coordinate deployment schedule
Support, management and update of Protection Software

