VMware Technology Day 2016

Zero Downtime Application Mobility with Site Recovery Manager
By Bo Bo Zin, Senior Solutions Architect, VMware Myanmar
2014 VMware Inc. All rights reserved.

Traditional Disaster Recovery


Infrastructure Challenges: Compute, Networking and Storage

[Diagram: WAN / Internet, Network, Compute and Storage layers (each shown twice) hosting Web, App and DB workloads. Callouts: "Deployment/Recovery: Manual, Complex, Error Prone" (once) and "Deployment/Recovery: Automated and Reliable" (twice).]

Traditional Disaster Recovery


Infrastructure Challenges: Site Connectivity

Complex and Expensive DCI at the WAN Edge
Recreate FW, LB Policies
Recreate L3
Recreate L2 (Re-IP/Preserve IP Space)

[Diagram: Protected and Recovery sites, each with a WAN / Internet uplink, a network fabric and VMs on subnets 10.1.1.0/24, 10.1.2.0/24 and 10.1.3.0/24, joined by a Data Center Interconnect (DCI): VPLS, Overlay Transport, L2 Extensions.]

Traditional Disaster Recovery

[Diagram: WAN / Internet, Network, Compute and Storage layers hosting Web, App and DB workloads. The Network Admin is shown with Networking and Security Policies, and the Compute/Virtualization Admin with Compute and Storage Recovery Policies. Each side has its own Recovery Plan and Scripts/APIs/Tools, with a "Recovery Management ?" callout.]

Changing IP Addresses Is Still Very Popular


Most SRM customers re-IP their VMs on failover!
SRM has robust capabilities to do this, with improvements in subnet IP address mapping introduced in SRM 5.8
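
To make the idea of subnet-level IP mapping concrete, here is a minimal Python sketch. It is not SRM's implementation or API; the function name `map_ip` is hypothetical, and the addresses are the example ones shown on the "Active-Passive Datacenter Today" slide. It assumes both subnets have the same prefix length and keeps the host part of the address unchanged.

```python
import ipaddress


def map_ip(addr: str, src_subnet: str, dst_subnet: str) -> str:
    """Translate addr from src_subnet to dst_subnet, keeping the host part.

    Hypothetical helper for illustration only; not an SRM API.
    """
    src = ipaddress.ip_network(src_subnet)
    dst = ipaddress.ip_network(dst_subnet)
    ip = ipaddress.ip_address(addr)
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must have the same prefix length")
    if ip not in src:
        raise ValueError(f"{addr} is not in {src_subnet}")
    # Offset of the host inside the source subnet, re-applied to the target.
    host_offset = int(ip) - int(src.network_address)
    return str(dst.network_address + host_offset)


print(map_ip("10.0.10.21", "10.0.10.0/24", "10.0.20.0/24"))  # 10.0.20.21
```

One rule per subnet pair replaces a per-VM list of addresses, which is the appeal of mapping at the subnet level.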

But It Has Downsides

Customizing IP addresses can add about a third to your per-VM recovery time

And More Downsides

Additional complexity introduced by synchronizing firewall and other security rules across multiple sites

Huh!

Cross-site Availability: Typical choices today

Two sites, treat as one = stretched cluster
[Diagram: one vCenter managing a stretched vSphere cluster over stretched storage; Site A (Active) and Site B (Active).]

Two sites, treat as two = Disaster Recovery
[Diagram: a vCenter, SRM and vSphere cluster at each site, with replicated storage; Site A (Active) and Site B (Passive).]

Active-Active datacenters use cases

Planned Maintenance
Planned maintenance of one site without any service downtime
Transparent to app owners and end users
Avoid lengthy approval processes
Ability to migrate applications back after maintenance is complete

Disaster Avoidance
Prevent service outages before an impending disaster (e.g. hurricane, rising flood levels)
Avoid downtime, not recover from it
Zero data loss possible if you have the time

Automated Recovery
Automated initiation of VM restart or recovery
Very low RTO for majority of unplanned failures
Allows users to focus on app health after recovery, not how to recover VMs

What you need for an Active-Active datacenter model


Stretched Storage solution
Storage clustering solution that supports distributed data mirroring
Read/write access to the same volumes from both sites
Some tie-break mechanism to avoid split-brain
Examples: EMC VPLEX, IBM SVC, NetApp MetroCluster, etc.

Stretched Network

[Diagram: Stretched Volumes Across Sites, with storage controllers and backend arrays at Site 1 and Site 2.]
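
The tie-break requirement can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not how EMC VPLEX, IBM SVC or NetApp MetroCluster implement arbitration: the `Witness` class and `may_serve_io` function are hypothetical, and the witness is assumed to be a third-site arbiter that grants the volume to the first site that claims it.

```python
from typing import Optional


class Witness:
    """Hypothetical third-site arbiter: grants the volume to one site only."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def claim(self, site: str) -> bool:
        # The first claimant becomes owner; later claims by others are refused.
        if self.owner is None:
            self.owner = site
        return self.owner == site


def may_serve_io(site: str, peer_reachable: bool,
                 witness: Optional[Witness]) -> bool:
    """Return True if `site` may keep accepting writes to the stretched volume."""
    if peer_reachable:
        return True   # mirror is healthy: both sites stay read/write
    if witness is None:
        return False  # cut off from peer and witness: suspend I/O
    return witness.claim(site)  # partitioned: only one site continues


witness = Witness()
# The inter-site link fails, but both sites can still reach the witness.
site1 = may_serve_io("site-1", peer_reachable=False, witness=witness)
site2 = may_serve_io("site-2", peer_reachable=False, witness=witness)
print(site1, site2)  # True False
```

Without an arbiter, both sites could keep writing to their own copy of the volume during a partition, which is the split-brain condition the slide warns about.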

Active-active datacenter network model


Multi-Site Single vC with Stretched Clusters

[Diagram: Web, App and DB VMs spread across two sites, each site with its own N-S Connectivity, over Stretched Storage; labeled <10ms.]

Active-Passive datacenters use cases

Unplanned Failover
Recover from unexpected site failure (full or partial)
Most common use case
Fast and accurate recovery usually critical to customers
Workflow driven
High degree of confidence if regular test failovers have been performed

Preventative Failover
Anticipate potential datacenter outages
Initiate preventative failover for smooth migration of services
Graceful shutdown of services to be migrated, zero data loss

Planned Migration
Planned datacenter maintenance
Global load balancing or distribution of service
Using test feature to minimize risk
Execute partial failovers
Automated failback enables bi-directional migrations

What you need for an Active-passive datacenter model


Replicated Storage solution
Storage or software based replication configured between sites
vCenter per site
SRM server per site
Network can be stretched or not
The concept is referred to as Active-Passive; in reality each site is active and simply acts as the passive DR location for its counterpart

[Diagram: vCenter and SRM at Site A paired with vCenter and SRM at Site B.]

Active-Passive datacenter network model


Active-Standby Application Pair, No Cross Site Connectivity, Traffic Directed to Active Site

[Diagram: Protected Site and Recovery Site, each with VC+SRM, forming an SRM pair with a link labeled "Replicated/Sync". Web, App and DB tiers appear at both sites as an Active-Standby application pair with replicated storage. Legend: active logical network (L2, L3, DFW) and placeholder logical network (L2, L3, DFW).]

Active-Passive Datacenter Today (simple view)

[Diagram: Primary Site and Recovery Site, each with a SAN and its own physical network infrastructure (10.0.10/24 and 10.0.20/24). VM addresses shown: 10.0.10.21 and 10.0.20.21. Labels:
Steps 1 & 2 (e.g. VMware SRM): Snapshot VM; Replicate VM & Storage
Step 3: Recover the VM
Step 4: Change IP Address, Reconfig Security (callout: Major RTO Impact)]

Active-Passive Datacenter with NSX Network Virtualization (simple view)

[Diagram: Primary Site and Recovery Site, each with an NSX Controller, a SAN and its own physical network infrastructure (10.0.10/24 and 10.0.20/24). The same virtual network, 10.0.30/24, is shown at both sites and the VM address is 10.0.30.21 at both. Labels:
Steps 1 & 2 (e.g. VMware SRM): Snapshot VM; Replicate VM & Storage (2a); Snapshot Network & Security (2b)
Step 3: Recover the VM
Network & Security already exists
Callout: 80% RTO]

Cake and eat it


Most common requests
Use both stretched and non-stretched storage in same design
Leverage operational benefits of SRM for stretched storage
Use SRM to drive large scale migrations where needed on stretched solutions

Can this be done?
Prior to vSphere 6.0 the answer was NO: it was one or the other
Reaction from customers was usually this.

What has changed? vMotion anywhere!

Support introduced in vSphere 6.0
Requires vCenter & ESXi 6.0 or later
Simultaneously changes compute, storage, network and vCenter
vMotion without shared storage
Increased scale
Pool resources across vCenter servers
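
The version prerequisite lends itself to a simple pre-flight check. The Python sketch below is illustrative only: the inventory names and versions are made up, and the helper functions are hypothetical rather than part of any VMware tool. It encodes only the rule stated on the slide (vCenter and ESXi 6.0 or later); other prerequisites are not modeled.

```python
from typing import Dict, List, Tuple

MINIMUM = (6, 0)  # from the slide: vCenter & ESXi 6.0 or later


def major_minor(version: str) -> Tuple[int, int]:
    """Reduce a dotted version string such as '6.0.0' to (major, minor)."""
    parts = [int(p) for p in version.split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)
    return parts[0], parts[1]


def blockers(inventory: Dict[str, str]) -> List[str]:
    """Return the components that are older than the minimum version."""
    return sorted(name for name, ver in inventory.items()
                  if major_minor(ver) < MINIMUM)


# Hypothetical inventory; names and versions are made up for illustration.
inventory = {
    "vcenter-site-a": "6.0.0",
    "vcenter-site-b": "6.0.0",
    "esxi-a1": "6.0.0",
    "esxi-b1": "5.5.0",
}
print(blockers(inventory))  # ['esxi-b1']
```

An empty list means every listed component meets the stated minimum.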

This layout seems familiar.

Cross vCenter vMotion layout
[Diagram: Site A (Active) and Site B (Active), each with its own vCenter and vSphere cluster.]

SRM layout
[Diagram: Site A (Active) and Site B (Active), each with its own vCenter, SRM and vSphere cluster, with replicated storage.]

So what can be done now?


Still supported by SRM today: Bunker Site; Dedicated Sites for Prod & DR; Bi-directional Failover
Legacy DR scenario
Expensive dedicated resources
Leverage recovery infrastructure for test, development, training
Production applications at both sites
Utilize both sites
Each site acts as recovery site for other

NEW in SRM 6.1: Active-Active data centers
Production apps at both sites with seamless mobility across sites
Zero downtime for planned events
Typically limited to a Metro distance

[Diagram: Site 1 and Site 2 for each topology, labeled Production or Recovery.]

Multi-site topologies with three or more sites are not shown here

Why customers ask for SRM integration with stretched clusters


vCenter Availability
Failure of the site where vCenter is running disrupts management of both sites

Operational Watchdogs
Availability specific alarms, alerts and events
Configuration validation on the fly

DRS and HA are not site aware
VMs may be recovered or migrated to either site, which may not be what you want!
Could result in additional East-West traffic when your network is not designed to handle it

No Orchestration or Testability
Stretched Clusters lack a repeatable, testable procedure to handle unplanned failures
HA will restart VMs based on VM restart order but doesn't give you granular control of VM dependencies or customization
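
The difference between a flat restart order and explicit VM dependencies can be sketched with a topological sort. This Python example is an illustration under assumptions, not SRM's recovery plan engine: the VM names and the dependency map are hypothetical, and it uses the standard library's `graphlib` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical three-tier application: each VM maps to the VMs it depends on.
depends_on = {
    "db-01": set(),
    "app-01": {"db-01"},
    "web-01": {"app-01"},
    "web-02": {"app-01"},
}


def recovery_batches(deps):
    """Group VMs into batches whose dependencies are all in earlier batches."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(recovery_batches(depends_on))
# [['db-01'], ['app-01'], ['web-01', 'web-02']]
```

VMs in the same batch have no dependency on each other, so they could be started in parallel, while each later batch waits for the previous one.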

SRM integration with Active-Active storage solutions


Active-Active Datacenters with SRM 6.1

Description
Live Migration (vMotion) of workloads across sites and vCenter instances for planned failover
Full orchestration of VM movement (vCenter and solution configuration, storage, and live state)
Combined with DR orchestration to enable recovery of failed VMs in the event of site failure

Benefits
Reuse same orchestrated Recovery Plan for unplanned failures and Continuous Availability
Integration with Stretched Storage enables very low RTO for unplanned failures and easier load balancing across sites
Non-disruptive test for unplanned failures

Active-Active Datacenters with SRM 6.1


[Diagram: Site 1 and Site 2, each with its own vCenter, SRM and vSphere cluster of ESXi hosts, joined by stretched networks. Stretched storage presents Volume A at Site 1 and at Site 2, each with full R/W access.]

Scenario 1: Local Host Failures In One Site


HA handles local host failures

[Diagram: Site 1 and Site 2, each with its own vCenter, SRM and vSphere cluster of ESXi hosts, joined by stretched networks. Stretched storage presents Volume A at Site 1 and at Site 2, each with full R/W access.]

Scenario 2: Disaster Avoidance At One Site

Execute SRM Planned Migration with vMotion BEFORE disaster
SRM invokes vMotion as per VM priority and dependencies in Recovery Plan

[Diagram: Site 1 and Site 2, each with its own vCenter, SRM and vSphere cluster of ESXi hosts, joined by stretched networks. Stretched storage presents Volume A at Site 1 and at Site 2, each with full R/W access.]

SRM gives you the easy button for handling planned downtime in Active-Active datacenters

Scenario 3: Faster Recovery From Unplanned Failures


In a site failure scenario, execute SRM Recovery Plan at other Active site
Can be automatically triggered by an external system
Use SRM's test capability to prepare for site-wide failures

[Diagram: Site 1 and Site 2, each with its own vCenter, SRM and vSphere cluster of ESXi hosts, joined by stretched networks. Stretched storage presents Volume A at Site 1 and at Site 2, each with full R/W access.]

SRM enables a reliable, testable and low RTO solution for handling unplanned failures in Active-Active datacenters

Key Takeaways
Roadmap

SRM is a great solution for Active-Active datacenters
SRM enhances Continuous Availability with rich orchestration
Stretched Storage enables a lower RTO for unplanned failover
SRM with vMotion enables ZERO service downtime for disaster avoidance

You don't have to trade off Testability and Repeatability when you choose an Active-Active model
SRM + Live Migration is a game changer in IT operations

Contact Your Sales Manager
Take Hands-On Lab
Get Certified with VMware
Visit VMware Booth
Join The Conversation #vForumID

Backup Slides

Deep Dive on
SRM 6.1 & NSX 6.2
Storage


Multi-VC Egress Optimized Routing (NSX 6.2)


[Diagram: Site A and Site B, each with its own vCenter Server, joined by an L3 network. Each site has an NSX Edge GW on its own uplink network (Uplink Net A, Uplink Net B) and a Control VM with Local Egress. A Universal Distributed Logical Router and Universal Logical Switch A connect VM1, VM2 and VM3. Locale IDs shown: NSX-A and NSX-B.]

Multi-Site Enhancement: Locale ID

NSX 6.2 introduces the concept of Locale ID for routes sent to the NSX Controller. This value is set to the NSX Manager UUID by default

If Local Egress is not enabled on the UDLR, the Locale ID value is ignored

When Local Egress is enabled, the NSX Controller will only send routes to ESXi hosts with a matching Locale ID

Using a site specific uplink, each site can have a local routing configuration. This allows NSX 6.2 to support up to 8 sites with local egress

Locale ID can also be set on a per UDLR, per Cluster or per Host level if the same NSX Manager is used across multiple sites
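
The Locale ID rules above amount to a filter on which routes reach which hosts. The Python sketch below models only that filter, under assumptions: the `Route` and `Host` types and the `routes_for_host` function are hypothetical and not part of any NSX API, and the next-hop addresses are made-up documentation addresses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    locale_id: str  # locale ID attached to the route


@dataclass(frozen=True)
class Host:
    name: str
    locale_id: str


def routes_for_host(host: Host, routes: List[Route],
                    local_egress: bool) -> List[Route]:
    """Select the routes that would be sent to one ESXi host."""
    if not local_egress:
        return list(routes)  # Local Egress off: the Locale ID is ignored
    return [r for r in routes if r.locale_id == host.locale_id]


# Hypothetical default routes, one per site-specific uplink.
routes = [
    Route("0.0.0.0/0", "192.0.2.1", "NSX-A"),
    Route("0.0.0.0/0", "198.51.100.1", "NSX-B"),
]
host_b = Host("esxi-b1", "NSX-B")

print([r.next_hop for r in routes_for_host(host_b, routes, local_egress=True)])
# ['198.51.100.1']
```

With Local Egress enabled, a host at one site only learns the route through its own site's uplink, which is what keeps egress traffic local.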


DR with Multi-VC Logical Network (NSX 6.2)


[Diagram: Protected and Recovery sites, each with vCenter and SRM (vC+SRM pair). A Master NSX Manager and a Secondary NSX Manager are linked by an internal API with global config replication (logical L2, DLR, DFW). A Universal Controller Cluster (recovered on failure), a Universal Distributed Logical Router with Local Egress, a Universal Logical Switch and Universal DFW are shown across the sites, with a Site A ESG and a Site B ESG. Primary VMs (SG-Prod-01) are replicated to placeholder VMs (SRM protected, SG-Prod-01); non-SRM-protected VMs are shown as Active-Active pairs.]

DR with Multi-VC: Initial Set-Up (NSX 6.2)


Locale-ID Set to Protected Site

[Diagram: Protected and Recovery sites, each with SRM and a site-local router; one N-S path is active and the other is stand-by. An NSX ESG with ECMP advertises 10.1.1.0 reachability. A Universal DLR, with a U-DLR LIF and a Universal Control VM shown per site, fronts the Universal Logical Switch for protected workloads (10.1.1.0/24) with U-DFW. The Web, App and DB application tier is shown at the protected site, with Application Tier (Recovery) at the recovery site.]

DR with Multi-VC: Planned Migration/Partial Failure (NSX 6.2)


[Diagram: Protected and Recovery sites, each with SRM and a site-local router; one N-S path is active and the other is stand-by. An NSX ESG with ECMP advertises 10.1.1.0 reachability. A Universal DLR, with a U-DLR LIF and a Universal Control VM shown per site, fronts the Universal Logical Switch for protected workloads (10.1.1.0/24) with U-DFW. The Web, App and DB application tier is shown at the protected site, with Application Tier (Recovery) at the recovery site.]

DR with Multi-VC: Complete Application Failure (NSX 6.2)


Locale-ID Set to Recovery Site

[Diagram: Protected and Recovery sites, each with SRM and a site-local router; one N-S path is active and the other is stand-by. An NSX ESG with ECMP advertises 10.1.1.0 reachability. A Universal DLR, with a U-DLR LIF and a Universal Control VM shown per site, fronts the Universal Logical Switch for protected workloads (10.1.1.0/24) with U-DFW. The Web, App and DB application tier is shown at the protected site, with Application Tier (Recovery) at the recovery site.]

[Diagram: a client reaches Datacenter A and Datacenter B over the physical network (VLANs), with an active route marked. Each datacenter has a pre-created perimeter Edge Services Gateway (ESG) running OSPF, with uplink addresses 10.114.212.198 /28 and 10.114.208.86 /28, and a transit switch to the Distributed Logical Router; the addresses 10.114.220.25 and 10.114.220.26 appear at both sites. Web, App and DB switches sit behind the router. NSX universal objects: logical switches, distributed logical routers, DFW rules. Protected VMs at the Protected Site and placeholder VMs at the Recovery Site show the same addresses: 10.114.220.2 /29, 10.114.220.10 /29 and 10.114.220.18 /29. SRM-managed storage replication is shown between the VMFS datastores at the two sites.]
