Version 2.0
Jennifer Aspesi
Oliver Shorey
Copyright © 2010, 2011 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is
subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS
PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable
software license.
For the most up-to-date regulatory document for your product line, go to the Technical Documentation and
Advisories section on EMC Powerlink.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
Preface
Chapter 7 Conclusion
Conclusion........................................................................................ 124
Better protection from storage-related failures ....................125
Protection from a larger array of possible failures...............125
Greater overall resource utilization........................................126
Glossary
Title Page
1 Application and data mobility example ..................................................... 18
2 HA infrastructure example ........................................................................... 19
3 Distributed data collaboration example ..................................................... 20
4 VPLEX offerings ............................................................................................. 22
5 Architecture highlights.................................................................................. 24
6 VPLEX cluster example ................................................................................. 34
7 VPLEX Management Console ...................................................................... 44
8 Management Console welcome screen ....................................................... 45
9 VPLEX small configuration .......................................................................... 49
10 VPLEX medium configuration ..................................................................... 50
11 VPLEX large configuration ........................................................................... 51
12 Port redundancy............................................................................................. 58
13 Director redundancy...................................................................................... 59
14 Engine redundancy ........................................................................................ 60
15 Site redundancy.............................................................................................. 61
16 High level functional sites in communication............................................. 64
17 High level Site A failure ................................................................................ 65
18 High level Inter-site link failure ................................................................... 65
19 VPLEX active and functional between two sites ....................................... 66
20 VPLEX concept diagram with failure at Site A.......................................... 67
21 Correct resolution after volume failure at Site A....................................... 68
22 VPLEX active and functional between two sites ....................................... 69
23 Inter-site link failure and cluster partition ................................................. 70
24 Correct handling of cluster partition........................................................... 71
25 VPLEX static detach rule............................................................................... 73
26 Typical detach rule setup .............................................................................. 74
27 Non-preferred site failure ............................................................................. 75
28 Volume remains active at Cluster 1............................................................. 76
29 Typical detach rule setup before link failure ............................................. 77
30 Inter-site link failure and cluster partition ................................................. 78
Title Page
1 Overview of VPLEX features and benefits .................................................. 25
2 Configurations at a glance ............................................................................. 35
3 Management server user accounts ............................................................... 42
4 Output from ls for brief VPLEX Witness status.......................................... 97
5 Output from ll command for brief VPLEX Witness component
status ..................................................................................................................98
Audience This document is part of the EMC VPLEX family documentation set,
and is intended for use by storage and system administrators.
Readers of this document are expected to be familiar with the
following topics:
◆ Storage Area Networks
◆ Storage Virtualization Technologies
◆ EMC Symmetrix and CLARiiON Products
Authors This TechBook was authored by the following individuals from the
Enterprise Storage Division, VPLEX Business Unit based at EMC
Headquarters, Hopkinton, MA.
Jennifer Aspesi has over 10 years of work experience with EMC in
Storage Area Networks (SAN), Wide Area Networks (WAN), and
Network and Storage Security technologies. Jen currently manages
the Corporate Systems Engineer team for the VPLEX Business Unit.
She earned her M.S. in Marketing and Technological Innovation from
Worcester Polytechnic Institute, Massachusetts.
Typographical conventions  EMC uses the following type style conventions in this document:
Normal Used in running (nonprocedural) text for:
• Names of interface elements (such as names of windows, dialog
boxes, buttons, fields, and menus)
• Names of resources, attributes, pools, Boolean expressions,
buttons, DQL statements, keywords, clauses, environment
variables, functions, utilities
• URLs, pathnames, filenames, directory names, computer
names, filenames, links, groups, service keys, file systems,
notifications
Bold Used in running (nonprocedural) text for:
• Names of commands, daemons, options, programs, processes,
services, applications, utilities, kernels, notifications, system
calls, man pages
This chapter provides a brief summary of the main use cases for the
EMC VPLEX family and design considerations for High Availability.
It also covers some of the key features of the VPLEX family system.
Topics include:
◆ Introduction ........................................................................................ 16
◆ VPLEX value overview ..................................................................... 17
◆ VPLEX product offerings ................................................................. 21
◆ Metro High Availability design considerations............................. 27
Introduction
The purpose of this TechBook is to introduce EMC® VPLEX™ High
Availability and the VPLEX Witness as they are typically architected
by customer storage administrators and EMC Solutions Architects.
When properly designed into a VPLEX Metro environment, VPLEX
Witness provides customers with “absolute” physical and logical
fabric and cache-coherent redundancy.
This guide is designed to provide an overview of the features and
functionality associated with the VPLEX Metro configuration and the
importance of Active/Active data resiliency for today’s advanced
host applications.
for large files, or even small files that move regularly, and
negatively impacts productivity because the other sites can sit
idle while they wait to receive the latest data from another site.
If teams decide to do their own work independently of each
other, the dataset quickly becomes inconsistent, as multiple
people are working on it at the same time and are unaware of
each other’s most recent changes. Bringing all of the changes
together in the end is time-consuming, costly, and grows more
complicated as the dataset gets larger.
VPLEX Local
VPLEX Local provides seamless, non-disruptive data mobility and
the ability to manage multiple heterogeneous arrays from a single
interface within a data center.
VPLEX Local allows increased availability, simplified management,
and improved utilization across multiple arrays.
Architecture highlights
VPLEX support is open and heterogeneous, supporting both EMC
storage and common arrays from other storage vendors, such as
HDS, HP, and IBM. VPLEX conforms to established World Wide
Name (WWN) guidelines that can be used for zoning.
VPLEX supports a wide range of operating systems in both physical
and virtual server environments, including VMware ESX and
Microsoft Hyper-V. VPLEX supports network fabrics from Brocade
and Cisco, including legacy McData SANs.
An example of the architecture is shown in Figure 5 on page 24.
Features                          Benefits
Advanced data caching             Improves I/O performance and reduces storage array contention.
Scale-out cluster architecture    Start small and grow larger with predictable service levels.
If an active ESX server were to fail (perhaps due to a site failure),
the VM can be restarted on a remaining ESX server within the cluster
at the remote site, because the datastore where it was running spans
the two locations, being configured on a VPLEX Metro distributed
volume. This would be deemed an unplanned failover and incurs a
small outage of the application: the running state of the VM is lost
when the ESX server fails, so the service is unavailable until the VM
has restarted elsewhere.
Although a planned application mobility event and an unplanned
disaster restart result in the same outcome (that is, a service
relocating elsewhere), there is a big difference: the planned mobility
job keeps the application online during the relocation, whereas the
disaster restart leaves the application offline during the relocation
while the restart is conducted.
A prerequisite for a geographical cluster to perform disaster restart
is an Active/Active underlying replication solution (VPLEX Metro
only at the time of this publication). Legacy Active/Passive solutions
in these scenarios also typically require an extra step over and above
standard application failover, since a storage failover is also
required. This is where VPLEX can assist greatly: because it is
Active/Active, in most cases no manual intervention at the storage
layer is required. The value of VPLEX Witness, combined with
following best practices for physically highly available and
redundant hardware connectivity, will truly provide customers with
“absolute” availability.
Introduction
This section provides basic information on the following:
◆ “VPLEX I/O” on page 32
◆ “High-level VPLEX I/O discussion” on page 32
◆ “Distributed coherent cache” on page 33
◆ “VPLEX family clustering architecture ” on page 33
VPLEX I/O
VPLEX is built on a lightweight protocol that maintains cache
coherency for storage I/O and the VPLEX cluster provides highly
available memory cache, processing power, front-end, and back-end
Fibre Channel interfaces.
EMC hardware powers the VPLEX cluster design so that all devices
are always available and I/O that enters the cluster from anywhere
can be serviced by any node within the cluster.
The AccessAnywhere feature in the VPLEX Metro and VPLEX Geo
products extends the cache coherency between data centers at a
distance.
                      Single-engine  Dual-engine  Quad-engine
Directors             2              4            8
Management servers    1              1            1
Single-engine VPLEX
◆ Two directors
◆ 32 Fibre Channel ports
◆ 64 GB cache
◆ I/O throughput characteristics
Dual-engine VPLEX
◆ Four directors
◆ 64 Fibre Channel ports
◆ 128 GB cache
◆ I/O throughput characteristics
Quad-engine VPLEX
◆ Eight directors
◆ 128 Fibre Channel ports
◆ 256 GB cache
◆ I/O throughput characteristics
Upgrade paths
VPLEX facilitates application and storage upgrades without a service
window through its flexibility to shift production workloads across
the VPLEX cluster.
In addition, high-availability features of the VPLEX cluster allow for
non-disruptive VPLEX hardware and software upgrades.
This flexibility means that VPLEX is always servicing I/O and never
has to be completely shut down.
Hardware upgrades
Upgrades are supported for single-engine VPLEX systems to dual- or
quad-engine systems.
Two VPLEX Local systems can be reconfigured to work as a VPLEX
Metro or VPLEX Geo.
Software upgrades
VPLEX features a robust non-disruptive upgrade (NDU) technology
to upgrade the software on VPLEX engines. Management server
software must be upgraded before running the NDU.
Due to the VPLEX distributed coherent cache, directors elsewhere in
the VPLEX installation service I/Os while the upgrade is taking
place. This alleviates the need for service windows and reduces RTO.
The NDU includes the following steps:
◆ Preparing the VPLEX system for the NDU
◆ Starting the NDU
◆ Transferring the I/O to an upgraded director
◆ Completing the NDU
Web-based GUI
VPLEX includes a Web-based graphical user interface (GUI) for
management. The EMC VPLEX Management Console Help provides
more information on using this interface.
To perform other VPLEX operations that are not available in the GUI,
refer to the CLI, which supports full functionality. The EMC VPLEX
CLI Guide provides a comprehensive list of VPLEX commands and
detailed instructions on using those commands.
The EMC VPLEX Management Console includes, but is not limited
to, the following functions:
◆ Supports storage array discovery and provisioning
◆ Local provisioning
◆ Distributed provisioning
◆ Mobility Central
◆ Online help
VPLEX CLI
VPlexcli is a command line interface (CLI) used to configure and
operate VPLEX systems. It also provides the EZ-Setup wizard process
to make installation of VPLEX easier and quicker.
The CLI is divided into command contexts. Some commands are
accessible from all contexts, and are referred to as ‘global commands’.
The remaining commands are arranged in a hierarchical context tree
that can only be executed from the appropriate location in the context
tree.
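The split between global and context-bound commands can be pictured with a
small model. The following Python sketch is purely illustrative; the context
names and commands in it are assumptions for readability, not the actual
VPlexcli implementation.

# Illustrative model of a hierarchical CLI context tree; not the actual
# VPlexcli implementation.

class Context:
    """One node in the command context tree."""
    def __init__(self, name, commands=None):
        self.name = name
        self.commands = set(commands or [])   # commands valid only in this context
        self.children = {}

    def add_child(self, child):
        self.children[child.name] = child
        return child

GLOBAL_COMMANDS = {"ls", "ll", "cd", "help", "exit"}   # accessible from all contexts

def can_run(context, command):
    """A command runs if it is global or defined in the current context."""
    return command in GLOBAL_COMMANDS or command in context.commands

root = Context("/")
witness = root.add_child(Context("cluster-witness", commands={"enable", "disable"}))
witness.add_child(Context("components"))

print(can_run(root, "ls"))         # True: global command, valid everywhere
print(can_run(root, "enable"))     # False: only valid within its own context
print(can_run(witness, "enable"))  # True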
Management console
The VPLEX Management Console provides a graphical user interface
(GUI) to manage the VPLEX cluster. The GUI can be used to
provision storage, as well as manage and monitor system
performance.
Figure 7 on page 44 shows the VPLEX Management Console window
with the cluster tree expanded to show the objects that are
manageable from the front-end, back-end, and the federated storage.
The VPLEX Management Console provides online help for all of its
available functions. You can access online help in the following ways:
◆ Click the Help icon in the upper right corner on the main screen
to open the online help system, or in a specific screen to open a
topic specific to the current task.
◆ Click the Help button on the task bar to display a list of links to
additional VPLEX documentation and other sources of
information.
For information about the VPlexcli, refer to the EMC VPLEX CLI
Guide.
System reporting
VPLEX system reporting software collects various configuration
information from each cluster and each engine. The resulting
configuration file (XML) is zipped and stored locally on the
management server or presented to the SYR system at EMC via call
home.
You can schedule a weekly job to automatically collect SYR data
(VPlexcli command scheduleSYR), or manually collect it whenever
needed (VPlexcli command syrcollect).
Director software
The director software provides:
◆ Basic Input/Output System (BIOS) — Provides low-level
hardware support to the operating system, and maintains boot
configuration.
◆ Power-On Self Test (POST) — Provides automated testing of
system hardware during power on.
◆ Linux — Provides basic operating system services to the Vplexcli
software stack running on the directors.
◆ VPLEX Power and Environmental Monitoring (ZPEM) —
Provides monitoring and reporting of system hardware status.
◆ EMC Common Object Model (ECOM) — Provides management
logic and interfaces to the internal components of the system.
◆ Log server — Collates log messages from director processes and
sends them to the SMS.
◆ GeoSynchrony (I/O Stack) — Processes I/O from hosts, performs
all cache processing, replication, and virtualization logic,
interfaces with arrays for claiming and I/O.
Configuration overview
The VPLEX configurations are based on how many engines are in the
cabinet. The basic configurations are small, medium, and large.
The configuration sizes refer to the number of engines in the VPLEX
cabinet. The remainder of this section describes each configuration
size.
Small configurations
The VPLEX-02 (small) configuration includes the following:
◆ Two directors
◆ One engine
◆ Redundant engine SPSs
◆ 8 front-end Fibre Channel ports
◆ 8 back-end Fibre Channel ports
◆ One management server
The unused space between engine 1 and the management server in
Figure 9 on page 49 is intentional.
Figure 9 VPLEX small configuration (management server, Engine 1, SPS 1)
Medium configurations
The VPLEX-04 (medium) configuration includes the following:
◆ Four directors
◆ Two engines
◆ Redundant engine SPSs
◆ 16 front-end Fibre Channel ports
◆ 16 back-end Fibre Channel ports
◆ One management server
◆ Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 10 VPLEX medium configuration (management server, Engines 1 and 2 with SPS 1 and 2, UPS A and UPS B)
Large configurations
The VPLEX-08 (large) configuration includes the following:
◆ Eight directors
◆ Four engines
◆ Redundant engine SPSs
◆ 32 front-end Fibre Channel ports
◆ 32 back-end Fibre Channel ports
◆ One management server
◆ Redundant Fibre Channel COM switches for local COM; UPS for
each Fibre Channel switch
Figure 11 shows an example of a large configuration.
Figure 11 VPLEX large configuration (management server, Engines 1 through 4 with SPS 1 through 4, UPS A and UPS B)
I/O implementation
The VPLEX cluster utilizes a write-through mode whereby all writes
are written through the cache to the back-end storage. Writes are
completed to the host only after they have been completed to the
back-end arrays, maintaining data integrity.
This section describes the VPLEX cluster caching layers, roles, and
interactions. It gives an overview of how reads and writes are
handled within the VPLEX cluster and how distributed cache
coherency works. This is important to the introduction of high
availability concepts.
Cache coherence
Cache coherence creates a consistent global view of a volume.
Distributed cache coherence is maintained using a directory. There is
one directory per user volume and each directory is split into chunks
(4096 directory entries within each). These chunks exist only if they
are populated. There is one directory entry per global cache page,
with responsibility for:
◆ Tracking page owner(s) and remembering the last writer
◆ Locking and queuing
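As a conceptual aid only, the directory described above can be pictured
roughly as follows. This Python sketch is not VPLEX's internal layout (the
class and field names are assumptions); it simply mirrors the description:
one directory per user volume, chunks of 4096 entries that exist only when
populated, and one entry per global cache page that tracks owners, the last
writer, and lock/queue state.

# Conceptual illustration only; not the actual VPLEX data structures.
from collections import defaultdict

ENTRIES_PER_CHUNK = 4096   # each directory chunk holds 4096 entries

class DirectoryEntry:
    """One entry per global cache page."""
    def __init__(self):
        self.owners = set()        # directors holding a copy of the page
        self.last_writer = None    # director that wrote the page most recently
        self.locked_by = None      # current lock holder, if any
        self.wait_queue = []       # directors queued for the lock

class VolumeDirectory:
    """One directory per user volume, split into sparsely populated chunks."""
    def __init__(self):
        self.chunks = defaultdict(dict)   # chunk index -> {slot: DirectoryEntry}

    def entry_for_page(self, page):
        chunk, slot = divmod(page, ENTRIES_PER_CHUNK)
        # A chunk is created only when a page within it is first referenced.
        return self.chunks[chunk].setdefault(slot, DirectoryEntry())

directory = VolumeDirectory()
entry = directory.entry_for_page(page=8200)   # lands in chunk 2, slot 8
entry.owners.add("director-1-1-A")
entry.last_writer = "director-1-1-A"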
Meta-directory
Directory chunks are managed by the meta-directory, which assigns
and remembers chunk ownership. These chunks can migrate using
Locality-Conscious Directory Migration (LCDM). This
meta-directory knowledge is cached across the share group for
efficiency.
If the data is not found in local cache, VPLEX searches global cache.
Global cache includes all directors that are connected to one another
within the VPLEX cluster. When the read is serviced from global
cache, a copy is also stored in the local cache of the director from
where the request originated.
If a read cannot be serviced from either local cache or global cache, it
is read directly from the back-end storage. In this case both the global
and local cache are updated to maintain cache coherency.
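The read behavior just described, together with the write-through behavior
noted at the start of this section, can be summarized in pseudocode. The
following Python sketch is a simplified illustration under assumed interfaces
(local_cache, global_cache, and backend are hypothetical objects), not the
GeoSynchrony I/O stack itself.

# Simplified illustration of the caching behavior described in this section;
# the objects and method names are assumptions for readability.

def read(page, local_cache, global_cache, backend):
    """Service a read: local cache, then global cache, then back-end storage."""
    data = local_cache.get(page)
    if data is not None:
        return data                      # local hit
    data = global_cache.get(page)        # any director connected in the cluster
    if data is not None:
        local_cache[page] = data         # keep a copy where the request originated
        return data
    data = backend.read(page)            # miss everywhere: read from the array
    global_cache[page] = data            # update both caches to maintain coherency
    local_cache[page] = data
    return data

def write(page, data, cache, backend):
    """Write-through: the host is acknowledged only after the array completes."""
    cache[page] = data
    backend.write(page, data)            # completes to the back-end array first
    return "ack-to-host"                 # ...and only then back to the host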
Overview
VPLEX clusters are capable of surviving any single hardware failure
in any subsystem within the overall storage cluster. These include
host connectivity subsystem, memory subsystem, etc. A single failure
in any subsystem will not affect the availability or integrity of the
data. Multiple failures in a single subsystem and certain
combinations of single failures in multiple subsystems may affect the
availability or integrity of data.
This availability requires that host connections be redundant and that
hosts are supplied with multipath drivers. In the event of a front-end
port failure or a director failure, hosts without redundant physical
connectivity to a VPLEX cluster and without multipathing software
installed may be susceptible to data unavailability.
Cluster
A cluster is a collection of one, two, or four engines in a physical
cabinet. A cluster serves I/O for one storage domain and is managed
as one storage cluster.
All hardware resources (CPU cycles, I/O ports, and cache memory)
are pooled:
◆ The front-end ports on all directors provide active/active access
to the virtual volumes exported by the cluster.
◆ For maximum availability, virtual volumes must be presented
through each director so that all directors but one can fail without
causing data loss or unavailability. All directors must be
connected to all storage.
Safety check
In addition to the redundancy fail-safe features, the VPLEX cluster
provides event logs and call home capability.
Note: To keep the explanation of this subject at a high level, the graphics in
the following section have been broken down into major objects (for example,
Site A, Site B, and Link). Assume that a VPLEX cluster resides within each
site, so when a site failure is shown it also causes a full VPLEX cluster
failure within that site. Also assume that the link object between the sites
represents the main inter-cluster data network connected to the VPLEX cluster
in either site, and that each site and the components within it share the same
failure domain: a site failure affects all components within that failure
domain, including the VPLEX cluster.
This detach rule can either be set within the VPLEX GUI or via
VPLEX CLI.
Each volume can be either set to Cluster 1 detaches, or Cluster 2
detaches.
If the DR1 is set to Cluster 1 detaches, then in any failure scenario the
preferred cluster for that volume would be declared as Cluster 1, but
if the DR1 detach rule is set to Cluster 2 detaches, then in any failure
scenario the preferred cluster for that volume would be declared as
Cluster 2.
Note: Some people, when looking at this, prefer to read the word detaches as
preferred, which is perfectly acceptable and can make it easier to understand.
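The static detach rule described above can be restated in a few lines of
code. This Python sketch is an illustration only (the function and value
names are assumptions); it captures the rule that the cluster named in the
detach rule is the preferred cluster and remains active, while the other
cluster suspends I/O when the clusters lose contact.

# Illustrative restatement of the static detach (bias) rule; names are assumed.

def preferred_cluster(detach_rule):
    """Map a DR1 detach rule to the cluster that wins on any failure."""
    return {"cluster-1-detaches": "cluster-1",
            "cluster-2-detaches": "cluster-2"}[detach_rule]

def volume_state_on_partition(cluster, detach_rule):
    """With static bias only, the preferred cluster stays active; the peer suspends."""
    return "active" if cluster == preferred_cluster(detach_rule) else "suspended"

print(volume_state_on_partition("cluster-1", "cluster-1-detaches"))  # active
print(volume_state_on_partition("cluster-2", "cluster-1-detaches"))  # suspended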
If there were a problem at site B, then the DR1 would become
degraded, as shown in Figure 27.
Because the bias rule was set to Cluster 1 detaches, the distributed
volume remains active at site A. This is shown in Figure 28 on
page 76.
If the link were now lost, the distributed volume would again become
degraded, as shown in Figure 30 on page 78.
To ensure that split brain does not occur after this type of failure,
the static bias rule is applied and I/O is suspended at Cluster 2 in
this case, because the rule is set to Cluster 1 detaches.
This can be observed in Figure 31 on page 79.
If site B had a total failure in this example, disruption would now
also occur at site A, as shown in Figure 33 on page 81.
As we can see, the preferred site has now failed and the bias rule has
been applied, but because the rule is “static” and cannot distinguish
between a link failure and a remote site failure, the remaining site
becomes suspended. In this case manual intervention is required to
bring the volume online at site A.
Static bias is a very powerful rule. It provides zero-RPO and
zero-RTO resolution for non-preferred cluster failures and
inter-cluster partition scenarios, and it completely avoids split
brain, although a preferred cluster failure results in a non-zero RTO
because manual intervention is required. It is worth noting that this
behavior is available without any automation, making it a valuable
fallback when the Witness is unavailable or the customer
infrastructure cannot accommodate one.
However, what if there were a “mechanism” other than standard CLI
intervention that could provide a global view of failures in the Metro
environment in the previous example? VPLEX Witness has been
designed to overcome these scenarios, since it can override the static
bias and leave what was the non-preferred site ACTIVE.
As you can see, the VPLEX Witness server is connected via the VPLEX
management IP network in a third failure or fault domain.
Depending on the scenarios to be protected against, this third fault
domain could reside on a different floor within the same building as
VPLEX Cluster 1 and Cluster 2, or in a completely geographically
dispersed data center, which could even be in a different country.
Clearly, with the example of a different floor in the same building,
you would not be protected from a total building failure, so careful
consideration should be given to choosing this third failure domain
based on the requirement.
Check the latest VPLEX ESSM (EMC Simple Support Matrix) for the
latest information, including VPLEX Witness server physical host
requirements and site qualification.
In this case the clusters adhere to the pre-configured static bias rules
and volume access at Cluster 1 will be suspended since the rule set
was configured as Cluster 2 detaches. Figure 40 on page 91 shows the
final state after this failure.
The next example shows how VPLEX Witness can assist if there is a
site failure at the preferred site. As discussed above, this type of
failure without VPLEX Witness would cause the volumes in the
remaining site to go offline. This is where VPLEX Witness greatly
improves the outcome of the event and removes the need for manual
intervention.
Figure 41 on page 92 shows a typical setup for VPLEX v5.x with a
distributed volume configured in a consistency group that has a rule
set configured for Cluster 2 detaches (that is, Cluster 2 wins).
As we know from the previous section, when a site has failed the
distributed volumes become degraded. However, unlike our previous
example, where a site failure at the preferred site combined with the
static bias rule forced the volumes into a suspended state at
Cluster 1, VPLEX Witness now observes that communication is still
possible to Cluster 1 (but not to Cluster 2). Additionally, since
Cluster 1 cannot contact Cluster 2, VPLEX Witness can make an
informed decision and instruct Cluster 1 to override the static rule set
and proceed with I/O.
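The decision just described can be summarized as follows. This Python sketch
is a conceptual model of the guidance logic, not VPLEX Witness source code;
its inputs are the observations available to the Witness Server (its own
connectivity to each cluster, and whether the clusters report being in
contact with each other).

# Conceptual model of VPLEX Witness guidance; not the actual implementation.

def witness_guidance(sees_cluster_1, sees_cluster_2, clusters_in_contact):
    """Return per-cluster guidance for a distributed volume."""
    if clusters_in_contact:
        return {"cluster-1": "proceed", "cluster-2": "proceed"}
    if sees_cluster_1 and sees_cluster_2:
        # Inter-cluster partition: the clusters fall back to the static
        # detach rule, so the preferred cluster proceeds and its peer suspends.
        return {"cluster-1": "follow-detach-rule", "cluster-2": "follow-detach-rule"}
    if sees_cluster_1 and not sees_cluster_2:
        # Cluster 2 has failed or is isolated: Cluster 1 may override the rule.
        return {"cluster-1": "proceed", "cluster-2": "suspend"}
    if sees_cluster_2 and not sees_cluster_1:
        return {"cluster-1": "suspend", "cluster-2": "proceed"}
    return {"cluster-1": "suspend", "cluster-2": "suspend"}   # no information

# Site failure at the preferred site (Cluster 2), as in this example:
print(witness_guidance(sees_cluster_1=True, sees_cluster_2=False,
                       clusters_in_contact=False))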
VPlexcli:/cluster-witness> ls
Attributes:
Name Value
------------- -------------
admin-state enabled
private-ip-address 128.221.254.3
public-ip-address 10.31.25.45
Contexts:
components
VPlexcli:/cluster-witness> ll components/
/cluster-Witness/components:
VPlexcli:/cluster-Witness> ll components/*
/cluster-Witness/components/cluster-1:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state of cluster-1 is in-contact (last
state change: 0 days, 13056 secs ago; last message
from server: 0 days, 0 secs ago.)
id 1
management-connectivity ok
operational-state in-contact
/cluster-witness/components/cluster-2:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
/cluster-Witness/components/server:
Name Value
----------------------- ------------------------------------------------------
admin-state enabled
diagnostic INFO: Current state is clusters-in-contact (last state
change: 0 days, 13056 secs ago.) (last time of
communication with cluster-2: 0 days, 0 secs ago.)
(last time of communication with cluster-1: 0 days, 0
secs ago.)
id -
management-connectivity ok
operational-state clusters-in-contact
Table 4 Output from ls for brief VPLEX Witness status
admin-state This attribute identifies whether VPLEX Witness functionality (as a whole) is enabled or disabled.
If VPLEX Witness functionality is enabled, the clusters send health observations to the VPLEX Witness
Server and the VPLEX Witness Server provides guidance to the clusters when the VPLEX Witness Server
observes inter-cluster partition and cluster failure/isolation scenarios.
If VPLEX Witness functionality is disabled, the clusters follow configured detach rule sets to allow or suspend
I/O to the distributed volumes in all consistency groups when inter-cluster partition or cluster failure/isolation
scenarios occur. When VPLEX Witness functionality is disabled, the communication of health observations
and guidance stops between the clusters and the VPLEX Witness Server. In this case, all distributed volumes
in all consistency groups leverage their pre-configured detach rule sets regardless of VPLEX Witness.
To determine the administrative state of individual components, refer to the admin-state attribute associated
with the individual component context.
This admin-state value at the top-level cluster-Witness context is one of the following:
unknown: There is partial management network connectivity between this Management Server and VPLEX
Witness components that are supposed to report their administrative state. To identify the component that is
unreachable over the management network, refer to the output of the individual component contexts.
enabled: All VPLEX Witness components are reachable over the management network and report their
administrative state as enabled.
disabled: All VPLEX Witness components are reachable over the management network and report their
administrative state as disabled.
inconsistent: All VPLEX Witness components are reachable over the management network but some
components report their administrative state as disabled while others report it as enabled. This should be an
extremely rare state, which may result from a potential but highly unlikely failure during enabling or disabling.
Please call EMC Customer Service if you see this state.
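The four possible top-level values follow a simple aggregation rule over the individual components. The
following Python function is only a restatement of that rule for clarity, not a VPLEX utility:

# Restatement of the top-level admin-state aggregation rule described above.

def top_level_admin_state(component_states):
    """component_states: the admin-state reported by each VPLEX Witness component,
    or None when a component is unreachable over the management network."""
    if any(state is None for state in component_states):
        return "unknown"            # partial management network connectivity
    if all(state == "enabled" for state in component_states):
        return "enabled"
    if all(state == "disabled" for state in component_states):
        return "disabled"
    return "inconsistent"           # rare: mixed enabled/disabled components

print(top_level_admin_state(["enabled", "enabled", "enabled"]))   # enabled
print(top_level_admin_state(["enabled", None, "enabled"]))        # unknown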
private-ip-address This read-only attribute identifies the private IP address of the VPLEX Witness Server VM (128.221.254.3)
that is used for VPLEX Witness-specific traffic.
public-ip-address This read-only attribute identifies the public IP address of the VPLEX Witness Server VM that is used as an
endpoint of the IPsec tunnel.
components This sub-context displays all the individual components of VPLEX Witness that include both VPLEX clusters
configured with VPLEX Witness functionality and the VPLEX Witness Server. Each sub-context displays
details for the corresponding individual component.
VPlexcli:/cluster-Witness> ll components/
/cluster-Witness/components:
Name ID Admin State Operational State Mgmt Connectivity
---------- -- ----------- ------------------- -----------------
cluster-1 1 enabled in-contact ok
cluster-2 2 enabled in-contact ok
server - enabled clusters-in-contact ok
Table 5 Output from ll command for brief VPLEX Witness component status
admin-state This field identifies whether the corresponding component is enabled or not. The supported values are:
enabled: VPLEX Witness functionality is enabled on this component
disabled: VPLEX Witness functionality is disabled on this component.
unknown: This component is not reachable and its administrative state cannot be determined.
diagnostic This is a diagnostic string generated by the CLI based on its analysis of the data and state information
reported by the corresponding component.
id The cluster-id for the cluster components. The VPLEX CLI ignores this field for the VPLEX Witness Server
and reports the value as a dash “-”.
management-connectivity This field displays the communication status to the VPLEX Witness component from the local CLI session
over the management network.
The possible values are:
ok: The component is reachable
failed: The component is not reachable
operational-state (server component) This field represents the operational state of the corresponding server component. The
clusters-in-contact state is the only healthy state. All other states indicate a problem.
clusters-in-contact: According to the latest data reported by each of the clusters, both clusters are in contact
with each other over the inter-cluster network.
cluster-partition: According to VPLEX Witness Server observations, the clusters partitioned from each
other over the inter-cluster network, while the VPLEX Witness Server could still talk to each of them.
cluster-unreachable: According to VPLEX Witness Server observations, one cluster has either failed or
become isolated (that is partitioned from its peer cluster and disconnected from the VPLEX Witness
Server).
unknown: VPLEX Witness Server does not know the states of one or both of the clusters and needs to
learn them before it can start making decisions. VPLEX Witness Server assumes this state upon startup.
When the server operational state is set to "cluster-partition" or "cluster-unreachable", this operational state
may not necessarily reflect the current observation of the VPLEX Witness Server. After VPLEX Witness
Server transitions to this state and provides guidance to both clusters, it stays in this state regardless of
more recent observations until it observes complete recovery of the clusters and their inter-cluster
connectivity. (This prevents split brain.)
The VPLEX Witness Server state, and the guidance that it provides to the clusters based on its state, is
sticky in the sense that if the VPLEX Witness Server observes a failure (it changes its state and provides
guidance to the clusters), it will maintain this state even if current observations change. The VPLEX Witness
Server maintains its failure state and guidance until both clusters and their connectivity fully recover. This
policy is implemented in order to avoid potential data corruption scenarios due to possible split brain.
operational-state (cluster component) This field represents the operational state of the corresponding cluster component.
in-contact: This cluster is in contact with its peer over the inter-cluster network. Rebuilds may be in
progress. Subject to other system-wide restrictions, I/O to all distributed volumes in all consistency groups
is allowed from VPLEX Witness’ perspective.
cluster-partition: This cluster is not in contact with its peer and VPLEX Witness Server declared that two
clusters partitioned. Subject to other system-wide restrictions, I/O to all distributed volumes in all
consistency groups is allowed from VPLEX Witness’ perspective.
remote-cluster-isolated-or-dead: This cluster is not in contact with its peer and the VPLEX Witness
Server declared that the remote cluster (i.e. the peer) was isolated or dead. Subject to other system-wide
restrictions, I/O to all distributed volumes in all consistency groups is allowed from VPLEX Witness’
perspective.
local-cluster-isolated: This cluster is not in contact with its peer and the VPLEX Witness Server declared
the remote cluster (i.e. the peer) as the only proceeding cluster. This cluster must suspend I/O to all
distributed volumes in all consistency groups regardless of bias.
unknown: This cluster is not in contact with its peer over the inter-cluster network and is awaiting guidance
from the VPLEX Witness Server. I/O to all distributed volumes in all consistency groups is suspended
regardless of bias.
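The sticky behavior of the server operational state can be modeled as a small
state machine. This Python sketch is a conceptual illustration only; it
captures the rule that once the Witness Server has declared cluster-partition
or cluster-unreachable and provided guidance, it keeps that state until both
clusters and their inter-cluster connectivity have fully recovered.

# Conceptual model of the sticky VPLEX Witness Server state; not actual code.

class WitnessServerState:
    FAILURE_STATES = {"cluster-partition", "cluster-unreachable"}

    def __init__(self):
        self.state = "unknown"   # state assumed at startup

    def observe(self, observed_state, fully_recovered=False):
        """observed_state is what the current observations would suggest."""
        if self.state in self.FAILURE_STATES and not fully_recovered:
            return self.state    # sticky: newer observations are ignored until full recovery
        self.state = observed_state
        return self.state

w = WitnessServerState()
w.observe("clusters-in-contact")                  # healthy
w.observe("cluster-unreachable")                  # failure declared, guidance given
print(w.observe("clusters-in-contact"))           # still cluster-unreachable (sticky)
print(w.observe("clusters-in-contact", fully_recovered=True))   # now recovers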
VPLEX Witness Cluster isolation semantics and dual failures
As discussed in the previous section, deploying a VPLEX solution
with VPLEX Witness gives continuous availability to the storage
volumes regardless of whether there is a site failure or an
inter-cluster link failure. These types of failure are deemed single
component failures, and we have shown that no single point of
failure can induce data unavailability when using VPLEX Witness.
It should be noted, however, that in rare situations more than one
fault or component outage can occur, especially when considering the
inter-cluster communication links; if two of these links failed at
once, the result would be VPLEX cluster isolation at a given site.
For instance, if we consider a typical VPLEX setup with VPLEX
Witness, we automatically have three failure domains (let’s call
them A, B, and C, where VPLEX Cluster 1 resides at A, VPLEX
Cluster 2 at B, and the VPLEX Witness server resides at C). In this
case there will be an inter-cluster link between A and B (Cluster 1
and 2), plus a management IP link between A and C as well as a
management IP link between B and C, effectively giving a
triangulated topology.
In rare situations there is a chance that if any two of these three
links fail, then one of the sites will be isolated (cut off).
Due to the nature of VPLEX Witness, these types of isolation can also
be dealt with effectively without manual intervention.
This is possible because a site isolation is very similar, in terms of
technical behavior, to a full site outage; the main difference is that
the isolated site is still fully operational and powered up (but needs
to be forced into I/O suspension), unlike a site failure, where the
failed site is not operational.
In these cases the failure semantics and VPLEX Witness behavior are
effectively the same; however, two further actions are taken at the
site that becomes isolated:
◆ I/O is shut off/suspended at the isolated site.
◆ The VPLEX cluster will attempt to call home.
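The triangulated topology makes it easy to reason about which site becomes
isolated after a dual link failure. The following Python sketch is purely
illustrative, using the A, B, and C domain labels from above: a domain is cut
off when both of its links have failed, and when that domain hosts a VPLEX
cluster the isolated cluster suspends I/O and attempts to call home, as just
described.

# Illustrative reasoning about dual link failures in the A-B-C triangle above.

LINKS = {frozenset("AB"), frozenset("AC"), frozenset("BC")}   # inter-cluster link + 2 management links

def isolated_domain(failed_links):
    """Return the domain cut off from both of its peers, or None."""
    failed = {frozenset(link) for link in failed_links}
    for domain in "ABC":
        incident = {link for link in LINKS if domain in link}
        if incident <= failed:          # both links touching this domain are down
            return domain
    return None

print(isolated_domain({"AB", "AC"}))    # 'A' -> Cluster 1 isolated: suspends I/O, calls home
print(isolated_domain({"AB", "BC"}))    # 'B' -> Cluster 2 isolated: suspends I/O, calls home
print(isolated_domain({"AC", "BC"}))    # 'C' -> Witness isolated; clusters remain in contact
print(isolated_domain({"AB"}))          # None -> single failure, handled as described earlier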
Note: If best practices are followed, these scenarios are significantly less
likely than even the rare isolation incidents discussed above, mainly because
the faults would have to disrupt components in totally different fault
domains that may be spread over many miles.
Figure 45 Highly unlikely dual failure scenarios that require manual intervention
Figure 46 Two further dual failure scenarios that would require manual
intervention
Metro HA overview
From a technical perspective VPLEX Metro HA solutions are
effectively three new flavors of reference architecture which utilize
the new VPLEX Witness feature in VPLEX v5.0 and therefore greatly
enhance an overall solutions ability to tolerate component failure
causing less or no disruption than legacy solutions with little or no
human intervention over either Cross-Cluster or Metro distances
The two main architecture types enabled by the VPLEX Witness
feature are:
◆ Metro HA with Cross-Cluster Connect, defined as clusters that
are within the limitations of host ISL cross connectivity.
◆ Metro HA over distances greater than the limitations of ISL cross
connectivity.
This section will look at each of these solutions in turn and show how
value can be derived by stepping through the different failure
scenarios.
The key benefit of this solution is that, in most cases, it can eliminate
RTO altogether if objects or components were to fail.
Failure scenarios
Although the following VPLEX Metro HA environments are
compatible with multiple cluster technologies, including Hyper-V and
Microsoft Cluster Services, we will assume for these failure scenarios
that vSphere 4.1 or higher is configured in a stretched HA topology
with DRS so that all of the physical hosts (ESX servers) are within the
same HA cluster.
As discussed previously, this type of configuration brings the ability
to teleport virtual machines over distance, which is extremely useful
in disaster avoidance, load balancing, and cloud infrastructure use
cases, all using out-of-the-box features and functions. However,
additional value can be derived from deploying the VPLEX Metro
HA Cross-Cluster Connect solution to ensure total availability.
Figure 48 on page 109 shows the topology of a Metro HA
Cross-Cluster Connect environment divided up into logical fault
domains. The following sections will demonstrate the recovery
automation for a single failure within any of these domains and show
how no single fault in any domain can take down the system as a
whole, and in most cases without even an interruption of service.
The next example describes what will happen in the unlikely event
that a VPLEX cluster was to fail in either domain A2 or B2. Examples
of how this could happen would include power outage, flood or fire.
In this instance there would be no interruption of service to any of the
virtual machines.
The next example describes what will happen in the event of a failure
to one (or all of) the back end storage arrays in either domain A3 or
B3.
Again in this instance there would be no interruption to any of the
virtual machines.
Figure 51 shows a failure of all storage arrays that reside in domain
A3. Because a cache-coherent VPLEX Metro distributed volume is
configured between domains A2 and B2, I/O can continue to be
actively serviced from the VPLEX cluster in A2 even though the local
back-end storage has failed. This is due to the embedded VPLEX cache
coherency, which efficiently caches reads into the A2 domain while
also propagating writes to the back-end storage in domain B3 via the
remote VPLEX cluster in site B2.
The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
Again in this instance there would be no interruption to any of the
virtual machines or VPLEX clusters.
Figure 53 on page 115 shows the inter-cluster link failed between
domains A2 and B2. In this instance the static bias rule set defined
previously is invoked, since neither VPLEX cluster can communicate
with the other (though the VPLEX Witness Server can communicate
with both VPLEX clusters); therefore, access to the given distributed
volume in one of the domains A2 or B2 is suspended. Because in this
example there are alternate paths still available to the remote VPLEX
cluster, where the volume remains online, VMware simply re-routes
the traffic to the alternate VPLEX cluster, so the virtual machine
remains online and unaffected whichever site it was running on.
Failure scenarios
Again for this section we will assume for these failure scenarios that
vSphere 4.1 or higher is configured in a stretched HA topology so
that all of the physical hosts at either site (ESX servers) are within the
same HA cluster. Also, as with the previous section, deploying a
stretched VMware configuration with Metro HA makes it possible to
enable long-distance virtual machine teleportation, since the virtual
machine datastores still reside on a VPLEX Metro distributed
volume.
Figure 55 shows the topology of a Metro HA environment divided
up into logical fault domains. The next section will demonstrate the
recovery automation for a single failure within any of these domains.
The next example describes what will happen in the event of a failure
to the inter-cluster link between domains A2 and B2.
One of two outcomes of this scenario will happen:
◆ If the static bias for a given distributed volume was set to Cluster
1 detaches (assuming Cluster 1 resides in domain A2) and the
virtual machine was running at the same site where the volume
remains online (aka the preferred site) then there is no
interruption to service.
◆ If the static bias for a given distributed volume was set to Cluster
1 detaches (assuming Cluster 1 resides in domain A2) and the
virtual machine was running at the remote site (Domain B1) then
the virtual machine’s storage will be in the suspended state.
Most guest operating systems will fail in this case, allowing the
The remaining failure scenarios with this solution are identical to the
previously discussed VPLEX Metro HA Cross-Cluster Connect
solutions. For failure handling in domains A1, B1, A3, B3 or C, see
“VPLEX Metro HA with Cross-Cluster Connect” on page 107.
Conclusion
Conclusion
As outlined in this book, using VPLEX AccessAnywhere™
technology in combination with High Availability and VPLEX
Witness, storage administrators and data center managers will be
able to provide absolute physical and logical high availability for
their organizations’ mission critical applications with less resource
overhead and dependency on manual intervention. Increasingly,
those mission critical applications are virtualized and in most cases
using VMware vSphere or Microsoft Hyper-V “virtual machine”
technologies. It is expected that VPLEX customers use the HA /
VPLEX Witness solution to incorporate several application-specific
clustering and virtualization technologies to provide HA benefits for
targeted mission critical applications.
As described, the storage administrator is provided with two specific
VPLEX Metro-based solutions for High Availability, as outlined
specifically for VMware ESX 4.1 or higher, integrated into the VPLEX
Metro HA Cross-Cluster Connect and Metro environments.
VPLEX Metro HA Cross-Cluster Connect provides a slightly higher
level of HA than the VPLEX Metro HA deployment without
Cross-Cluster connectivity; however, it is limited to in-data-center
use or cases where the network latency between data centers is
negligible.
Both solutions are ideal for customers who are highly virtualized, or
planning to become so, and are looking for the following:
◆ Elimination of the “night shift” storage and server administrator
positions. To accomplish this, they must be comfortable that their
applications will ride through any failures that happen during
the night.
◆ Reduction of capital expenditures by moving from an
active/passive data center replication model to a fully active
highly available data center model.
◆ Increased application availability by protecting against flood and
fire disasters that could affect their entire data center.
From a holistic view of both types of solutions and what they provide
to the storage administrator, the following benefits are held in
common, with some variances. What EMC VPLEX technology with
Witness provides to consumers is as follows:
A
AccessAnywhere The breakthrough technology that enables VPLEX clusters to provide
access to information between clusters that are separated by distance.
active/active A cluster with no primary or standby servers, because all servers can
run applications and interchangeably act as backup for one another.
array A collection of disk drives where user data and parity data may be
stored. Devices can consist of some or all of the drives within an
array.
asynchronous Describes objects or events that are not coordinated in time. A process
operates independently of other processes, being initiated and left for
another task before being acknowledged.
For example, a host writes data to the blades and then begins other
work while the data is transferred to a local disk and across the WAN
asynchronously. See also ”synchronous.”
B
bandwidth The range of transmission frequencies a network can accommodate,
expressed as the difference between the highest and lowest
frequencies of a transmission cycle. High bandwidth allows fast or
high-volume transmissions.
bias When a cluster has the bias for a given DR1, it will remain online if
connectivity is lost to the remote cluster (in some cases this may be
overruled by VPLEX Cluster Witness).
block The smallest amount of data that can be transferred following SCSI
standards, which is traditionally 512 bytes. Virtual volumes are
presented to users as a contiguous list of blocks.
C
cache Temporary storage for recent writes and recently accessed data. Disk
data is read through the cache so that subsequent read references are
found in the cache.
cache coherency Managing the cache so data is not lost, corrupted, or overwritten.
With multiple processors, data blocks may have several copies, one in
the main memory and one in each of the cache memories. Cache
coherency propagates the blocks of multiple users throughout the
system in a timely fashion, ensuring the data blocks do not have
inconsistent versions in the different processors’ caches.
controller A device that controls the transfer of data to and from a computer and
a peripheral device.
D
data sharing The ability to share access to the same data with multiple servers
regardless of time and location.
detach rule A rule set applied to a DR1 to declare a winning and a losing cluster
in the event of a failure.
director A CPU module that runs GeoSynchrony, the core VPLEX software.
There are two directors in each engine, and each has dedicated
resources and is capable of functioning independently.
dirty data The write-specific data stored in the cache memory that has yet to be
written to disk.
disaster recovery (DR) The ability to restart system operations after an error, preventing data
loss.
disk cache A section of RAM that provides cache between the disk and the CPU.
RAMs access time is significantly faster than disk access time;
therefore, a disk-caching program enables the computer to operate
faster by placing recently accessed data in the disk cache.
distributed file system (DFS) Supports the sharing of files and resources in the form of persistent
storage over a network.
Distributed RAID1 device (DR1) A cache coherent VPLEX Metro or Geo volume that is distributed
between two VPLEX clusters.
E
engine Enclosure that contains two directors, management modules, and
redundant power.
Ethernet A Local Area Network (LAN) protocol. Ethernet uses a bus topology,
meaning all devices are connected to a central cable, and supports
data transfer rates of between 10 megabits per second and 10 gigabits
per second. For example, 100 Base-T supports data transfer rates of
100 Mb/s.
event A log message that results from a significant action initiated by a user
or the system.
F
failover Automatically switching to a redundant or standby device, system,
or data path upon the failure or abnormal termination of the
currently active device, system, or data path.
Fibre Channel (FC) A protocol for transmitting data between computer devices. Longer
distance requires the use of optical fiber; however, FC also works
using coaxial cable and ordinary telephone twisted pair media. Fibre
channel offers point-to-point, switched, and loop interfaces. Used
within a SAN to carry SCSI traffic.
field replaceable unit (FRU) A unit or component of a system that can be replaced on site as
opposed to returning the system to the manufacturer for repair.
firmware Software that is loaded on and runs from the flash ROM on the
VPLEX directors.
G
Geographically distributed system A system physically distributed across two or more geographically
separated sites. The degree of distribution can vary widely, from
different locations on a campus or in a city to different continents.
gigabit Ethernet The version of Ethernet that supports data transfer rates of 1 Gigabit
per second.
I
input/output (I/O) Any operation, program, or device that transfers data to or from a
computer.
internet Fibre Channel protocol (iFCP) Connects Fibre Channel storage devices to SANs or the Internet in
geographically distributed systems using TCP.
intranet A network operating like the World Wide Web but with access
restricted to a limited group of authorized users.
K
kilobit (Kb) 1,024 (2^10) bits. Often rounded to 10^3.
L
latency Amount of time it requires to fulfill an I/O request.
local area network (LAN) A group of computers and associated devices that share a common
communications line and typically share the resources of a single
processor or server within a small geographic area.
logical unit number (LUN) Used to identify SCSI devices, such as external hard drives,
connected to a computer. Each device is assigned a LUN number
which serves as the device's unique address.
M
megabit (Mb) 1,048,576 (2^20) bits. Often rounded to 10^6.
metadata Data about data, such as data quality, content, and condition.
metavolume A storage volume used by the system that contains the metadata for
all the virtual volumes managed by the system. There is one
metadata storage volume per cluster.
mirroring The writing of data to two or more disks simultaneously. If one of the
disk drives fails, the system can instantly switch to one of the other
disks without losing data or service. RAID 1 provides mirroring.
miss An operation where the cache is searched but does not contain the
data, so the data instead must be accessed from disk.
N
namespace A set of names recognized by a file system in which all names are
unique.
network partition When one site loses contact or communication with another site.
P
parity The even or odd number of 0s and 1s in binary code.
parity checking Checking for errors in binary data. Depending on whether the byte
has an even or odd number of bits, an extra 0 or 1 bit, called a parity
bit, is added to each byte in a transmission. The sender and receiver
agree on odd parity, even parity, or no parity. If they agree on even
parity, a parity bit is added that makes each byte even. If they agree
on odd parity, a parity bit is added that makes each byte odd. If the
data is transmitted incorrectly, the change in parity will reveal the
error.
R
RAID The use of two or more storage volumes to provide better
performance, error recovery, and fault tolerance.
RAID 1 Also called mirroring, this has been used longer than any other form
of RAID. It remains popular because of simplicity and a high level of
data availability. A mirrored array consists of two or more disks. Each
disk in a mirrored array holds an identical image of the user data.
RAID 1 has no striping. Read performance is improved since either
disk can be read at the same time. Write performance is lower than
single disk storage. Writes must be performed on all disks, or mirrors,
in the RAID 1. RAID 1 provides very good data reliability for
read-intensive applications.
RAID leg A copy of data, called a mirror, that is located at a user's current
location.
remote direct memory access (RDMA) Allows computers within a network to exchange data using their
main memories and without using the processor, cache, or operating
system of either computer.
Recovery Point Objective (RPO) The amount of data that can be lost before a given failure event.
Recovery Time Objective (RTO) The amount of time the service takes to fully recover after a failure
event.
S
scalability Ability to easily change a system in size or configuration to suit
changing conditions, to grow with your needs.
small computer system interface (SCSI) A set of evolving ANSI standard electronic interfaces that allow
personal computers to communicate faster and more flexibly than
previous interfaces with peripheral hardware such as disk drives,
tape drives, CD-ROM drives, printers, and scanners.
split brain Condition when a partitioned DR1 accepts writes from both clusters.
storage RTO The amount of time taken for the storage to be available after a failure
event (In all cases this will be a smaller time interval than the RTO
since the storage is a pre-requisite).
stripe depth The number of blocks of data stored contiguously on each storage
volume in a RAID 0 device.
striping A technique for spreading data over multiple disk drives. Disk
striping can speed up operations that retrieve data from disk storage.
Data is divided into units and distributed across the available disks.
RAID 0 provides disk striping.
T
throughput 1. The number of bits, characters, or blocks passing through a data
communication system or portion of that system.
2. The maximum capacity of a communications channel or system.
3. A measure of the amount of work performed by a system over a
period of time. For example, the number of I/Os per day.
tool command language (TCL) A scripting language often used for rapid prototypes and scripted
applications.
transmission control protocol/Internet protocol (TCP/IP) The basic communication language or protocol
used for traffic on a private network and the Internet.
U
uninterruptible power supply (UPS) A power supply that includes a battery to maintain power in the
event of a power failure.
universal unique identifier (UUID) A 64-bit number used to uniquely identify each VPLEX director. This
number is based on the hardware serial number assigned to each
director.
V
virtualization A layer of abstraction implemented in software that servers use to
divide available physical storage into storage volumes or virtual
volumes.
virtual volume A virtual volume looks like a contiguous volume, but can be
distributed over two or more storage volumes. Virtual volumes are
presented to hosts.
VPLEX Cluster Witness A new feature in VPLEX V5.x that can augment and improve upon
the failure handling semantics of Static Bias.
W
wide area network (WAN) A geographically dispersed telecommunications network. This term
distinguishes a broader telecommunication structure from a local
area network (LAN).
world wide name A specific Fibre Channel Name Identifier that is unique worldwide
(WWN) and represented by a 64-bit unsigned binary value.