www.cloudhpt.com
Table of Contents
Executive Summary
Document Control
Document Change History
Contact Information
Communication Structure of DR Plan
Declaration Guidelines
Key Contacts
How to use Assured Support
How a support case works
Escalation Contacts
Incident Management
Incident Priority Table
Incident Response and Escalation Table
Change Management
1.2. Service Description
1. Executive Summary
DR planning is complex and spans three key areas: technology, people, and process. From an IT
perspective, planning starts with a business impact analysis (BIA) by application/workload.
Natural tiers or stages of DR begin with phase 1 infrastructure (networking, AD, DHCP, etc.) and then
extend to recovery by application tier. Each application tier should have an established
recovery time objective (RTO) and recovery point objective (RPO) based on business risk.
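As a minimal sketch of how tier objectives might be recorded and checked after a DR test, the Python below compares measured results against each tier's RTO and RPO. The tier names and all numeric values are illustrative assumptions, not figures from this run book.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierObjective:
    """Recovery objectives for one tier (all values here are hypothetical)."""
    name: str
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of data loss


def meets_objectives(tier: TierObjective,
                     measured_recovery: timedelta,
                     measured_data_loss: timedelta) -> bool:
    """Return True when a DR test result satisfies both the RTO and the RPO."""
    return measured_recovery <= tier.rto and measured_data_loss <= tier.rpo


# Hypothetical tiers: phase 1 infrastructure first, then an application tier.
infrastructure = TierObjective("Phase 1 infrastructure",
                               rto=timedelta(hours=1), rpo=timedelta(minutes=15))
web_tier = TierObjective("Web application tier",
                         rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Hypothetical DR test measurements.
infra_ok = meets_objectives(infrastructure, timedelta(minutes=45), timedelta(minutes=10))
web_ok = meets_objectives(web_tier, timedelta(hours=5), timedelta(minutes=30))
print(infra_ok, web_ok)
```

In this made-up test run the web tier took five hours to recover against a four-hour RTO, so it fails the check even though its data loss was within the RPO.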
DR testing is essential not only to confirm that systems and data can be recovered, but also to
uncover events or conditions that arise during real disaster scenarios and were not previously
accounted for. Examples include change management tasks such as the reconfiguration of
applications or systems. Recovering systems in the right sequence is also important. To ensure
that DR testing, planning, and recovery are organized and effective, organizations need a disaster
recovery "run book." This document is your company's Disaster Recovery (DR) run book.
A DR run book is a working document, unique to every organization, which outlines the
necessary steps to recover from a disaster or service interruption. It provides an instruction set for
personnel in the event of a disaster, including both infrastructure and process information. Run
books, or updates to run books, are the outputs of every DR test.
However, a run book is only useful if it is up-to-date. If documented properly, it can take the
confusion and uncertainty out of the recovery environment which, during an actual disaster, is
often in a state of panic. Use the run book in the event of a disaster.
2. Document Control
Document creation and edit records should be maintained by your company's disaster
recovery coordinator (DRC) or business continuity manager (BCM). If your organization does not
have a DRC, consider creating that role to manage all future disaster recovery activities.
Document Name
Version
Date created
Date last modified
Last modified by
Date
V 1.0
1/6/2016
V1.1
29/12/2016
Description
Initial version
Approval
Business Owner / DRC
Distribution List
Ensure that all key stakeholders have access to the document. Use the chart below to indicate
the stakeholders to whom this run book will be distributed. This is a critical step in your DR plan.
Role
Name
Phone
Owner
Approver
Auditor
Contributor (Technical)
Contributor (DBA)
Contributor (Network)
Contributor (Vendor)
4. Contact Information
This section lists our contacts along with those from your IT department. This is the team that
will conduct ongoing disaster recovery operations and respond in the case of a true
emergency. The specific roles listed below are examples of those that might comprise your team.
All of these roles need to be in communication when in a disaster recovery mode of operation.
For pending events, this same distribution list should be used to provide advance notice of
potential incidents. Your customer service teams should also not be overlooked, as they are the
first line of communication to your customer base. Forgetting this step will cause extra work for
your primary recovery team as they take time to explain what is going on.
Your contacts
Names
Title
Phone
Disaster Recovery
Coordinator
Chief Information Officer
Network Systems
Administrator
Database Systems
Administrator
Chief Security Officer
Chief Technology Officer
Business Owner
Application Development
Lead (as applicable)
Data Center Manager
Helpdesk Telephone
Helpdesk E-mail
support@biosme.com
https://support.biosme.com
Your Username
xxxxxxxxxx
Your Portal PW
Additional Contacts
BIOS Contact
Role
Disaster Recovery
Coordinator
Customer Service
Mobile
Emergency Support
Sr. System Engineer
Director Service Delivery
Name
Role
Mobile
Access level
Chief Technology
Officer
General access
Can authorize guest
access
Director of Service
Delivery
General access
Can authorize guest
access
Service Delivery
Engineer
Systems Engineer
Network Engineer
General Access
Can authorize guest
access
Chief Information
Officer
General Access
Can authorize guest
access
Head of
Operations
Director of Service
Delivery
Network Engineer
Systems
Administrator
CEO
Director of Business
Development
Sales contact
Customer Service
Representative
6. Declaration Guidelines
Examples of when DR should be considered
Event: Application Failure
  Plan of Action: Recycle
  Situation: Workaround does not exist in a matter of time that does not affect customer SLAs
  Action: Declare application level failover and enact failover to secondary site

Event: Hardware failure
  Plan of Action: Failover to alternative
  Situation: Restoration procedures cannot be completed in your production environment
  Action: Declare application level failover and enact failover to secondary site

Event: Power failure
  Plan of Action: Enact total failover plan
  Situation: A production environment no longer exists or is unable to be accessed
  Action: Declare a data center failure and enact a total failover plan from primary to
  secondary data center
Owner
Application business owner
IT Support
Disaster Recovery
Coordinator (DRC)
Disaster Recovery
Coordinator (DRC)
Disaster Recovery
Coordinator (DRC)
Business Owner
Owner
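The declaration guidelines above can be read as a small lookup: each event has a first plan of action, and a declaration that applies once the listed situation holds. The sketch below is one hypothetical way to encode that reading in Python. The function name, the `situation_applies` flag, and the enum are assumptions made for illustration; only the event names and action wording come from the guidelines.

```python
from enum import Enum


class Event(Enum):
    APPLICATION_FAILURE = "Application failure"
    HARDWARE_FAILURE = "Hardware failure"
    POWER_FAILURE = "Power failure"


APPLICATION_FAILOVER = ("Declare application level failover and enact "
                        "failover to secondary site")
DATA_CENTER_FAILOVER = ("Declare a data center failure and enact a total failover "
                        "plan from primary to secondary data center")

# Event -> (first plan of action, declaration once the listed situation holds).
GUIDELINES = {
    Event.APPLICATION_FAILURE: ("Recycle", APPLICATION_FAILOVER),
    Event.HARDWARE_FAILURE: ("Failover to alternative", APPLICATION_FAILOVER),
    Event.POWER_FAILURE: ("Enact total failover plan", DATA_CENTER_FAILOVER),
}


def recommended_action(event: Event, situation_applies: bool) -> str:
    """Return the declaration if the situation holds, otherwise the first plan of action.

    situation_applies is a hypothetical flag set by whoever assesses the event,
    for example "restoration cannot be completed in the production environment".
    """
    plan_of_action, declaration = GUIDELINES[event]
    return declaration if situation_applies else plan_of_action


print(recommended_action(Event.HARDWARE_FAILURE, situation_applies=True))
```

The decision to declare still rests with the owners named in the guidelines; a lookup like this only keeps the wording consistent across tickets and notifications.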
If your environment is monitored by BIOS, the following events will create trouble tickets and
make us aware that you are facing issues. We will then endeavor to address them and escalate to
your contact person to ask whether they want to declare a DR.
Event Type
Performance
Monitoring Status =
Warning Alert Level
Memory Usage > 80%
Duration of Event
> 2 minutes
> 3 minutes
Memory
> 15 minutes
Storage
> 3 minutes
Network
Line - VPN
> 3 minutes
> 1 minutes
> 5 minutes
Corrective Action
Isolate problem
device / recycle
device
- Isolate physical
device / virtual
machine
- configure memory
pool increase
- clear memory
cache
- clear memory buffer
- increase compute
allocation (virtual)
- add additional
compute resources
into application
pool
- check memory
queue
- clear memory
cache of affected
system
- increase memory
allocation (virtual)
Event criticality
Critical Level
- Check
- Check
- Check
- Check
- Check
Critical Level
Lun
Space
Snapshots
Network
connectivity
Critical Level
Critical Level
Critical Level
Critical Level
Critical Level
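The monitoring events above pair a threshold with a duration, so an alert fires only when a reading stays over its limit for long enough. The Python below is a rough sketch of that idea. Pairing "Memory Usage > 80%" with "> 15 minutes" is an assumption made for the example, as are the function name and the sample data.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold, min_duration):
    """Return True if the latest unbroken run of readings above threshold
    has lasted longer than min_duration.

    samples: list of (timestamp, value) tuples in chronological order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new run above the threshold begins
        else:
            breach_start = None  # the run is broken by a normal reading
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration


# Hypothetical one-minute memory readings (percent used).
base = datetime(2016, 1, 1, 9, 0)
steady_high = [(base + timedelta(minutes=m), 85.0) for m in range(17)]
with_dip = [(base + timedelta(minutes=m), 50.0 if m == 10 else 85.0) for m in range(17)]

alert_steady = sustained_breach(steady_high, 80.0, timedelta(minutes=15))
alert_dip = sustained_breach(with_dip, 80.0, timedelta(minutes=15))
print(alert_steady, alert_dip)
```

Resetting the run on any normal reading is a deliberate simplification; a production monitor would more likely tolerate brief dips.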
7. Key Contacts
GENERAL ENQUIRIES
sales@cloudhpt.com
TEL:
PORTAL URL
https://portal.cloudhpt.com
accessportal@cloudhpt.com
SALES
sales@cloudhpt.com
TEL:
support@cloudhpt.com
TEL:
BILLING SERVICES
billing@cloudhpt.com
TEL:
WEBSITE
www.cloudhpt.com
1.
Initial Contact
The Provider Technical Support team will work together with the Customer to identify and resolve
problem(s). A new case is created in our Helpdesk Management System for each Customer issue,
either by the Customer directly online, by email, or via telephone with a Helpdesk Support
Professional. All required fields in the Helpdesk Management System must be completed so that
a unique Case Number can be generated. This Case Number must be available for all further
communications and enables tracking of that issue to closure.
The Customer may raise support tickets as part of the Assured service in addition to the proactive
work in our scope of services. Tickets can be raised using the following methods:
METHOD       DESCRIPTION
Online       http://support.CloudHPT.com
Email        support@CloudHPT.com
Telephone
For Priority 1 cases, telephone is the preferred method of call logging. The Assured desk operates 24
hours a day, 7 days a week, including holidays.
Fig 1. http://support.CloudHPT.com login page
2.
Escalation Contacts
The people listed below are our Escalation Contacts for this Agreement.
CONTACT NAME
TITLE
PHONE NUMBER
EMAIL ADDRESS
LOCATION
+971559859694
Rijeesh@CloudHPT.com
Dubai
gerald@CloudHPT.com
Dubai
+971504597713
chris@CloudHPT.com
Dubai
+971505516058
adam@CloudHPT.com
Dubai
+971506248505
dominic@CloudHPT.com
Dubai
Support
Rijeesh
Rathnakumar(L1)
Manager
Gerald
Assured
Vorster
(L2)
Adam Wolf
Desk
Manager
Cloud Services
Director
Technical
Director
Dominic
Managing
Docherty
Director
Escalation Contacts are used to inform all relevant stakeholders within the Customer business
hierarchy of the status of P1 Incidents. This information will be distributed by
automated emails. Escalation Contacts will also have the authority to approve Emergency
changes should they be required.
3.
Incident Management
1.1.
Service Description
Support is accessed through the Provider's dedicated support line. Call routing to the case
owner or the relevant incident team can be made from this point to ensure the Customer
reaches the expertise needed in a timely manner.
All incidents will be recorded on the Provider Service Desk system under the Incident
Management workflow. The Provider records the name of the person reporting the Incident,
the time of the call, and any other pertinent information, along with criteria for resolution, to
ensure the workflow is initiated correctly.
The Customer shall report Priority 1 and 2 cases via telephone only.
Priority 1 and 2 cases are the highest priority incidents, reserved for when systems are
non-functioning or offline or when users are directly unable to work on the systems provided. As such,
the Customer should inform the support desk via telephone to ensure the ticket is opened
immediately. The Provider cannot offer any Service Levels for Business Critical Incidents reported
via email.
4. Incident Priority Table
                         BUSINESS IMPACT
AFFECT                   Minor  Moderate  Major  Critical
System/Service Down      P3     P2        P1     P1
System/Service Affected  P4     P3        P2     P1
User Down/Affected       P4     P4        P3     P2
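The priority table above works as a two-key lookup: the affect selects the row and the business impact selects the column. A minimal Python sketch of that lookup follows. The function name and dictionary layout are assumptions made for illustration; the values follow the table read row by row.

```python
# Business impact columns, in the order used by the priority table.
IMPACTS = ("Minor", "Moderate", "Major", "Critical")

# One row per affect; each tuple lines up with IMPACTS.
PRIORITY_MATRIX = {
    "System/Service Down": ("P3", "P2", "P1", "P1"),
    "System/Service Affected": ("P4", "P3", "P2", "P1"),
    "User Down/Affected": ("P4", "P4", "P3", "P2"),
}


def incident_priority(affect: str, impact: str) -> str:
    """Look up the incident priority for an affect row and a business impact column."""
    return PRIORITY_MATRIX[affect][IMPACTS.index(impact)]


print(incident_priority("System/Service Affected", "Major"))  # prints P2
```

An unknown affect raises `KeyError` and an unknown impact raises `ValueError`, which keeps mistyped values from being silently mapped to a priority.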
PRIORITY  SLA         SPECIALIST  ESCALATION  ESCALATION  FREQUENCY  TARGET
                      REVIEW      MANAGER     DIRECTOR               RESOLUTION
P1        30 Minutes  2 Hours     Immediate   2 Hours     Hourly     2 Hours
P2        1 Hour      4 Hours     4 Hours     4 Hours     4 Hours    1 Day
P3        4 Hours     8 Hours     2 Days      Never       Daily      10 Days
P4        4 Days      8 Hours     2 Days      Never       Daily      30 Days
For an Incident, Response is the time from when the Customer first logs a request for assistance
with a Provider helpdesk professional to the time that the Provider responds with a
suitably qualified employee, whether via email, telephone call, or in person. For the
detailed process flow, see the current Managed Services Handbook. Support to provide a
resolution shall be provided from the time of Response until the Incident has been
resolved.
For an Incident, Escalation shall take place if a resolution to the Incident has not been
achieved within the timeframe set out in the table above, and the Incident will continue to be
escalated until its details are given to the Escalation Director.
From the time of Response until resolution, updates shall be provided to the Named Contacts
and/or Escalation Contacts by email at such frequencies as set out in the table above.
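To make the timings concrete, the sketch below derives a response deadline and a target resolution time from the moment a ticket is logged. It uses the SLA and target resolution values from the table above and assumes, for simplicity, that they count elapsed clock time with no business-hours calendar. The function and dictionary names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Response SLA per priority, as listed in the table above.
RESPONSE_SLA = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(days=4),
}

# Target resolution per priority, as listed in the table above.
TARGET_RESOLUTION = {
    "P1": timedelta(hours=2),
    "P2": timedelta(days=1),
    "P3": timedelta(days=10),
    "P4": timedelta(days=30),
}


def incident_deadlines(priority: str, logged_at: datetime) -> Tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for a ticket logged at logged_at.

    Assumes plain elapsed time; working hours and holidays are not modelled.
    """
    return logged_at + RESPONSE_SLA[priority], logged_at + TARGET_RESOLUTION[priority]


respond_by, resolve_by = incident_deadlines("P1", datetime(2016, 12, 29, 9, 0))
print(respond_by, resolve_by)
```

For the P1 example, a ticket logged at 09:00 would need a response by 09:30 and a resolution by 11:00 on the same day.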
5.
Change Management
All Changes require a Request for Change (RFC) form to be completed and submitted to the
Provider detailing the required Change. The Provider will reject incomplete RFC forms.
Changes will follow the Change Management Process as defined in the Provider Managed
Services Handbook. It should be noted that Emergency Changes will only be carried out in the
event of a P1 scenario (either pro-active or reactive) and/or a major Security Incident where
the Provider deems appropriate.
Significant
Major
Critical
CR3
CR2
CR1
Minor
Significant
Major
CR4
CR3
CR2
High
Medium
IMPACT ON SERVICE
All Normal changes are subject to the following risk assessment matrix
Candidate
Standardization
5
Minor
Significant
CR4
CR3
Medium
High
Low
CR5
for
Low
CHANGE TYPE
Normal CR1
Normal CR2
Normal CR3
Normal CR4
Normal CR5
Normal CR6
Projects Only
Standard
Emergency Changes are dealt with in conjunction with the Incident Management Process;
further details of this and all other change types are detailed within the Managed Services
Handbook.
Standard and Emergency Changes to the Service within the scope of this Contract will be
completed by the Provider at no additional cost.
Project and Normal Changes may require a separate Statement of Work to be agreed
between the Customer and Provider. Such changes may be subject to an Additional Service
Charge.
The Provider will review security, critical and software updates and, where appropriate, log a
Security Incident. Where the Provider deems it appropriate, the Provider will then undertake any
necessary Changes via the Change Management Process. Depending on the severity, and at the
Provider's discretion, a Standard Change or Emergency Change will be implemented.
6.
CATEGORY
P1 Incidents
P2 Incidents
P3 Incidents
P4 Incidents
Root Cause
7.
We take our responsibilities to the Customer very seriously and are proud to offer one of the
most comprehensive Service Level Agreements (SLA) in the industry. We are committed to
providing the Customer with reliable services, support, management and secure infrastructure.
Our processes and policies have undergone thorough and independent audits, earning us Cisco
Master Managed Service Provider status. This SLA includes our promises.
10. Monitoring
We use various tools to monitor your environment in real time. They provide automatic alerts the
moment any system goes offline. Alerts automatically create tickets on our helpdesk. In the event
of an alert, you will be contacted to see whether our assistance is needed. You can also contact
our team 24x7x365.
Our managed service, called Assured, is the pride of our business and operates 24x7. We
understand IT is a critical component of any business and must be on and optimized at all times.
This is why Assured is included as an integral part of our Cloud as a Service offering.
Part 2
9. System Level Procedures
The run book content up to this point has addressed organizational points of concern. At this
stage in the run book you should have a fully documented procedure for your company's issue
management and escalation, criteria for evaluating and declaring an emergency scenario,
and procedures for ensuring all key stakeholders and responsible parties are in communication
and are ready and able to take the necessary steps to begin disaster recovery procedures.
From this point forward, the run book will shift focus to system level procedures to address
infrastructure and network level configurations, restoration steps, and system level responsibilities
while in disaster recovery mode.
Infrastructure Overview
A detailed overview of your IT environment is required in this section, including the location(s) of
all data center(s), nature of use of those facilities (e.g. colocation, tape storage, cloud hosting),
security features of your infrastructure and the hosting facilities, and procedures for access to
those facilities.
Data center
Specify the location of all facilities in which your company's data is stored. Include an address
and directions to each location.
Example:
Access to Facilities
Data centers and colocation facilities typically maintain strict entry protocols. Certain members
of your organization will typically hold the appropriate credentials to enter the facility. Detail the
members of your team (and/or your IT service provider's team) who have access to all data
facilities, along with any requirements for access. We may need them to be inside the DC while
recovering from a failover.
Order of Restoration
This section will include instructions for recovery personnel to follow that lay out which
infrastructure components to restore and in which order. It should take into account application
dependencies, authentication, middleware, database and third party elements and list
restoration items by system or application type.
Ensure that this order of restoration is understood before engaging in restore work. An example is
provided below. The rest of the table should be filled out in the exact order that restoration
procedures are to be completed.
Server Name: Ws12_VF1
Server Role: Web Server, Valley Forge 1
Order of Restoration: Restore prior to db12_VF1 startup
OS / Patch level: ESX4.1
Application loaded: Apache
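One way to sanity-check an order of restoration is to record, for each server, which servers must be restored before it, and let a topological sort produce a valid sequence. The sketch below does this with Python's standard `graphlib` module (Python 3.9 or later). `Ws12_VF1` and `db12_VF1` come from the example above; `dc01_VF1` is a hypothetical infrastructure server added only for illustration.

```python
from graphlib import TopologicalSorter

# Each key is restored only after every server in its set has been restored.
# "dc01_VF1" is a hypothetical phase 1 infrastructure server.
RESTORE_AFTER = {
    "Ws12_VF1": {"dc01_VF1"},
    "db12_VF1": {"Ws12_VF1"},  # the example says Ws12_VF1 is restored prior to db12_VF1 startup
}


def restoration_order(restore_after):
    """Return one valid restore sequence.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(restore_after).static_order())


order = restoration_order(RESTORE_AFTER)
print(order)  # ['dc01_VF1', 'Ws12_VF1', 'db12_VF1']
```

A `CycleError` is useful in its own right: it signals that two entries in the table each claim to need the other first, which should be resolved before a real recovery.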
System Configuration
This section should include system- and application-specific topology diagrams and an inventory
of the elements that comprise your overall system. Include networking, web app middleware,
database and storage elements, along with third party systems that connect to and share data with
this system.
Network table (attach as xls for example only)
Device type     Name   Primary IP   OS level   Gateway   Subnet Mask
Firewall
Load balancer
Switch

OS Patch   IP Address   Sub   Gateway   DNS   Alternate DNS   Secondary IPs   Production Mac Address

Name   LUN Address   RAID configuration   Host name
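Because these inventory tables are filled in by hand, a quick consistency check can catch typos before they matter during a recovery. The sketch below uses Python's standard `ipaddress` module to confirm that a recorded gateway falls inside the subnet implied by a device's IP address and subnet mask. The function name and the addresses (taken from documentation-only ranges) are assumptions for illustration.

```python
import ipaddress


def gateway_in_subnet(ip: str, subnet_mask: str, gateway: str) -> bool:
    """Check that the gateway lies in the subnet implied by the IP address and mask."""
    network = ipaddress.ip_interface(f"{ip}/{subnet_mask}").network
    return ipaddress.ip_address(gateway) in network


# Hypothetical inventory rows using documentation-only address ranges.
good_row = gateway_in_subnet("192.0.2.10", "255.255.255.0", "192.0.2.1")
bad_row = gateway_in_subnet("192.0.2.10", "255.255.255.0", "198.51.100.1")
print(good_row, bad_row)
```

A malformed address or mask raises `ValueError`, which is itself a useful signal that a table entry needs correcting.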
Backup Configuration (highly advised if not using cloud replication or backup)
Use this section to list instructions specifying the servers, directories and files from (and to) which
backup procedures will be run. This should be the location of your last known good copy of
production data.
Server
Software
Version
Backup
Cycle
Backup
Source
Backup
Target
10.
Virtual machines are being replicated between Production CUD and DR CloudHPT.
Virtual Protection Groups (VPG) have been created for use in case of a DR scenario.
The steps below show how to execute the failover and the failover test.
11.
Acronis
2.
3. Select files and click Recover. When recovering from the Cloud storage, choose either to
recover files or to download them in an archive:
4. Select the location you want to recover or download the files to:
12.
Production Site
Customer XYZ (CXYZZERTO-DR)
Disaster Recovery Site
CloudHPT (hpt-rep-zvm01)
CXYZ opted to utilize Zerto Replication for DRaaS. This document details, step by step, the
failover testing of the CUD virtual server from the Production site at CXYZ to the DR site at CloudHPT.
All VPGs are named after the VM that is inside the VPG.
1.
2.
3.
4.
5.
6.
7.
8. Zerto Network Page: the DMZ network is not required in the DR site (click Edit Settings to
change the failover IP address)
9. To do a test failover, make sure the slide bar is on Test and click Failover. Zerto will then
perform the test failover in an isolated environment.
13.
Table Key
Code
Description
Accountable Party: The party ultimately answerable for the correct and thorough
completion of the deliverable or task, and the one from whom responsible party is
delegated the work
Consulted Party: Those whose opinions are sought, typically subject matter experts;
and with whom there is two-way communication
Informed Party: Those who are kept up-to-date on progress, often only on
completion of the task or deliverable; and with whom there is just one-way
communication
Positions that will fill these roles and responsibilities will often include your DR coordinator, network
engineer, database engineer, systems engineer, application owner, data center service
coordinator, and your service provider. Identify the responsibilities of each of these roles in a
disaster event, then map them onto a matrix of all activities associated with recovery
procedures, as in the example table provided below.
Activity
R
Maintain situational management of recovery events
React to server outage alerts
React to file system alerts
React to host outage alerts
React to network outage alerts
Document technical landscape
Configure network for system access
Configure VPN and acceleration between your business and service provider network (if applicable)
Maintain DNS or host file
Monitor service provider network availability (if applicable)
Diagnose service provider network errors (if applicable)
Create named users at OS level
Create domain users
Manage OS privileges
Create virtual machines
Convert physical servers to virtual servers
Install base operating system
Configure operating system
Configure OS disks
Diagnose OS errors
Start/Stop the virtual machine
Windows OS licensing (or your operating system)
Security hardening of the OS
Daily server level backup
Patch Management for Windows servers (or your operating system)
Provide a project manager
Provide a key technical contact for OS, network, and SAN
Coordinate deployment schedule
Support, management and update of Protection Software
DRC
Responsible Parties
A
C
DRC
DRC
I
All
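Once the matrix is filled in, a short script can flag activities with gaps in their assignments. The sketch below checks that every activity has at least one Responsible party and exactly one Accountable party. The single-Accountable rule is a common RACI convention assumed here, not something this run book mandates, and the role assignments in the example are hypothetical.

```python
def raci_problems(matrix):
    """Return a list of human-readable problems found in a RACI matrix.

    matrix maps each activity to a dict with optional "R", "A", "C", "I" lists.
    """
    problems = []
    for activity, roles in matrix.items():
        if not roles.get("R"):
            problems.append(f"{activity}: no Responsible party")
        if len(roles.get("A", [])) != 1:
            problems.append(f"{activity}: expected exactly one Accountable party")
    return problems


# Hypothetical assignments for two activities from the list above.
example_matrix = {
    "Maintain situational management of recovery events": {
        "R": ["DRC"], "A": ["DRC"], "C": ["DRC"], "I": ["All"],
    },
    "React to server outage alerts": {
        "R": ["Systems engineer"], "A": [],
    },
}

problems = raci_problems(example_matrix)
print(problems)
```

Running a check like this after each DR test helps keep the matrix complete as activities and staff change.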