
Introduction (Server Clusters: Storage Area Networks - For Windows 2000 and Windows Server 2003)


Updated: January 1, 2003

A storage area network (SAN) is defined as a set of interconnected devices (e.g. disks and tapes) and
servers that are connected to a common communication and data transfer infrastructure such as fibre
channel. The common communication and data transfer mechanism for a given deployment is commonly
known as the storage fabric. The purpose of the SAN is to allow multiple servers access to a pool of
storage in which any server can potentially access any storage unit. Clearly in this environment,
management plays a large role in providing security guarantees (who is authorized to access which
devices) and sequencing or serialization guarantees (who can access which devices at what point in time).

SANs evolved to address the increasingly difficult job of managing storage at a time when the storage
usage is growing explosively. With devices locally attached to a given server or in the server enclosure
itself, performing day-to-day management tasks becomes extremely complex; backing up the data in the
datacenter requires complex procedures as the data is distributed amongst the nodes and is accessible
only through the server it is attached to. As a given server outgrows its current storage pool, storage
specific to that server has to be acquired and attached, even if there are other servers with plenty of
storage space available. Other benefits can also be gained: multiple servers can share data (sequentially,
or in some cases in parallel), and backups can be performed by transferring data directly from device to
device without first transferring the data to a backup server.

So why use yet another set of interconnect technologies? A storage area network is a network like any
other (for example a LAN infrastructure). A SAN is used to connect many different devices and hosts to
provide access to any device from anywhere. Existing storage technologies such as SCSI are tuned to the
specific requirements of connecting mass storage devices to host computers. In particular, they are low
latency, high bandwidth connections with extremely high data integrity semantics. Network technology, on
the other hand, is tuned more to providing application-to-application connectivity in increasingly complex
and large-scale environments. Typical network infrastructures have high connectivity, can route data
across many independent network segments, potentially over very large distances (consider the Internet),
and are backed by a rich set of network management and troubleshooting tools.

Storage area networks try to capitalize on the best of the storage technologies and network technologies
to provide a low latency, high bandwidth interconnect which can span large distances, has high
connectivity and good management infrastructure from the start.

In summary, a SAN environment provides the following benefits:

• Centralization of storage into a single pool. This allows storage resources and server resources to grow
independently, and allows storage to be dynamically assigned from the pool as and when it is required.
Storage on a given server can be increased or decreased as needed without complex reconfiguration or
re-cabling of devices.

• Common infrastructure for attaching storage allows a single common management model for
configuration and deployment.

• Storage devices are inherently shared by multiple systems. Ensuring data integrity guarantees and
enforcing security policies for access rights to a given device is a core part of the infrastructure.

• Data can be transferred directly from device to device without server intervention. For example, data can
be moved from a disk to a tape without first being read into the memory of a backup server. This frees up
compute cycles for business logic rather than management-related tasks.

Because multiple servers have direct access to storage devices, SAN technology is particularly interesting
as a way to build clusters where shared access to a data set is required. Consider a clustered SQL Server
environment. At any point in time a SQL Server instance may be hosted on one machine in the cluster and
it must have exclusive access to its associated database on a disk from the node on which it is hosted. In
the event of a failure or an explicit management operation, the SQL Server instance may failover to
another node in the cluster. Once failed over, the SQL Server instance must be able to have exclusive
access to the database on disk from its new host node.

By deploying multiple clusters onto a single storage area network, all of the benefits of SAN technology
described above can be brought to the cluster environment. The rest of this paper describes how clusters
can be attached to storage area networks, what the requirements are, and what is supported today in
Windows 2000 and Windows Server 2003.

Storage Area Network Components


Updated: January 1, 2003

As previously discussed, the primary technology used in storage area networks today is Fibre Channel.
This section provides a basic overview of the components in a fibre channel storage fabric as well as
different topologies and configurations open to Windows deployments.

Fibre Channel Topologies

Fundamentally, fibre channel defines three configurations:

• Point-to-point

• Fibre Channel Arbitrated Loop (FC-AL)

• Switched Fibre Channel Fabrics (FC-SW).

Although the term fibre channel implies some form of fibre optic technology, the fibre channel
specification allows for both fibre optic interconnects and copper coaxial cables.

Point-to-point

Point-to-point fibre channel is a simple way to connect two (and only two) devices directly together, as
shown in Figure 1 below. It is the fibre channel equivalent of direct attached storage (DAS).

Figure 1: Point to point connection

From a cluster and storage infrastructure perspective, point-to-point is not a scalable enterprise
configuration and we will not consider it again in this document.

Arbitrated Loops

A fibre channel arbitrated loop is exactly what it says; it is a set of hosts and devices that are connected
into a single loop, as shown in Figure 2 below. It is a cost-effective way to connect up to 126 devices and
hosts into a single network.

Figure 2: Fibre Channel Arbitrated Loop

Devices on the loop share the media; each device is connected in series to the next device in the loop and
so on around the loop. Any packet traveling from one device to another must pass through all
intermediate devices. In the example shown, for host A to communicate with device D, all traffic between
the devices must flow through the adapters on host B and device C. The devices in the loop do not need to
look at the packet; they will simply pass it through. This is all done at the physical layer by the fibre
channel interface card itself; it does not require processing on the host or the device. This is very
analogous to the way a token-ring topology operates.

When a host or device wishes to communicate with another host or device, it must first arbitrate for the
loop. The initiating device does this by sending an arbitration packet around the loop that contains its own
loop address (more on addressing later). The arbitration packet travels around the loop and when the
initiating device receives its own arbitration packet back, the initiating device is considered to be the loop
owner. The initiating device next sends an open request to the destination device which sets up a logical
point-to-point connection between the initiating device and target. The initiating device can then send as
much data as required before closing down the connection. All intermediate devices simply pass the data
through. There is no limit on the length of time for any given connection and therefore other devices
wishing to communicate must wait until the data transfer is completed and the connection is closed before
they can arbitrate.

If multiple devices or hosts wish to communicate at the same time, each one sends out an arbitration
packet that travels around the loop. If an arbitrating device receives an arbitration packet from a different
device before it receives its own packet back, it knows there has been a collision. In this case, the device
with the lowest loop address is declared the winner and is considered the loop owner. There is a fairness
algorithm built into the standard that prohibits a device from re-arbitrating until all other devices have
been given an opportunity, however, this is an optional part of the standard.

Note
Not all devices and host bus adapters support loop configurations since it is an optional part of the
fibre channel standard. However, for a loop to operate correctly, all devices on the loop MUST have
arbitrated loop support [1].

Figure 3 below shows a schematic of the wiring for a simple arbitrated loop configuration.

Figure 3: FC-AL wiring schematic

With larger configurations, wiring a loop directly can be very cumbersome. Hubs allow for simpler,
centralized wiring of the loop (see section Hubs, Switches, Routers and Bridges). Communication in an
arbitrated loop can occur in both directions on the loop depending on the technology used to build the
loop, and in some cases communication can occur both ways simultaneously.

Loops can support up to 126 devices; however, as the number of devices on the arbitrated loop increases,
so does the length of the path, and therefore the latency of individual operations.

Many loop devices, such as JBODs, have dip switches to set the device address on the loop (known as
hard addressing). Most, if not all devices, implement hard addresses so it is possible to assign a loop ID to
a device, however, just as in a SCSI configuration, different devices must have unique hard IDs. In cases
where a device on the loop already has a conflicting address when a new device is added, the new device
either picks a different ID or it does not get an ID at all (non-participating).

Note

Most current FC-AL devices are configured automatically to avoid address conflicts. However, if a
conflict does occur, it can lead to I/O disruptions or failures.

Unlike many bus technologies, the devices on an arbitrated loop do not have to be given fixed addresses,
either by software configuration or via hardware switches. When the loop initializes, each device on the
loop must obtain an Arbitrated Loop Physical Address (AL_PA), which is dynamically assigned. This process
is initiated when a host or device sends out a LIP (loop initialization primitive); a master is dynamically
selected for the loop, and the master controls a well-defined process in which each device is assigned an
address.

A LIP is generated by a device or host when the adapter is powered up or when a loop failure is detected
(such as loss of carrier). Unfortunately, this means that when new devices are added to a loop or when
devices on the loop are power-cycled, all the devices and hosts on the loop can (and probably will) change
their physical addresses. This can lead to unstable configurations if the operating system is not fully aware
of the changes.

For these reasons, arbitrated loops are best suited to small numbers of hosts and devices in relatively
static configurations.

Fibre Channel Switched Fabric

In a switched fibre channel fabric, devices are connected in a many-to-many topology using fibre channel
switches, as shown in Figure 4 below. When a host or device communicates with another host or device,
the source and target set up a point-to-point connection (much like a virtual circuit) between them and
communicate directly with each other. The fabric itself routes data from the source to the target. In a fibre
channel switched fabric, the media is not shared. Any device can communicate with any other device
(assuming it is not busy), and communication occurs at full bus speed (1 Gbit/sec or 2 Gbit/sec today,
depending on the technology) irrespective of other devices and hosts communicating.

Figure 4: Switched Fibre Channel Fabric

When a host or device is powered on, it must first login to the fabric. This enables the device to determine
the type of fabric (there is a set of characteristics about what the fabric will support) and it causes a host
or device to be given a fabric address. A given host or device continues to use the same fabric address
while it is logged into the fabric and the fabric address is guaranteed to be unique for that fabric. When a
host or device wishes to communicate with another device, it must establish a connection to that device
before transmitting data in a way similar to the arbitrated loop. However, unlike the arbitrated loop, the
connection open packets and the data packets are sent directly from the source to the target (the
switches take care of routing the packets in the fabric).

Fibre channel fabrics can be extended in many different ways such as by federating switches or cascading
switches, and therefore, fibre channel switched fabrics provide a much more scalable infrastructure for
large configurations. Because device addresses do not change dynamically once a device has logged in to
the fabric, switched fabrics provide a much more stable storage area network environment than is possible
using an arbitrated loop configuration.

Fibre channel arbitrated loop configurations can be deployed in larger switched SANs. Many of the newer
switches from vendors like Brocade incorporate functionality to allow arbitrated loop or point-to-point
devices to be connected to any given port. The ports can typically sense whether the device is a loop
device or not and adapt the protocols and port semantics accordingly. This allows platforms or specific
host adapters or devices which only support arbitrated loop configurations today, to be attached to
switched SAN fabrics.
Note that not all switches are created equal. Brocade switches are easy to deploy; Vixel and Gadzoox
switches behave more like hubs with respect to addressing.

Loops versus Fabrics

Both fibre channel arbitrated loops and switched fabrics have pros and cons. Before deploying either, you
need to understand the restrictions and issues as well as the benefits of each technology. The vendor’s
documentation provides specific features and restrictions; however, the following helps to position the
different technologies.

FC-AL

Pros

Low cost

Loops are easily expanded and combined with up to 126 hosts and devices

Easy for vendors to develop

Cons

Difficult to deploy

Maximum 126 devices

Devices share media thus lower overall bandwidth

Switched Fabric

Pros

Easy to deploy

Supports 16 million hosts and devices

Communicate at full wire-speed, no shared media

Switches provide fault isolation and re-routing

Cons

Difficult for vendors to develop

Interoperability issues between components from different vendors

Switches can be expensive

Host Bus Adapters


A host bus adapter (HBA) is an interface card that resides inside a server or a host computer. It is the
functional equivalent of the network adapter in a traditional Ethernet network. All traffic to the storage
fabric or loop is done via the HBA.

HBAs, with the exception of older Compaq cards and early Tachyon-based cards, have supported both
FC-AL and fabric operation since 1999. However, configuration is not as simple or as automatic as might
be expected. It can be difficult to determine whether an HBA has configured itself to the appropriate
setting. On a Brocade fabric, it is possible to get everything connected while some of it is operating as
loop and still appears to work. It is important to verify from the switch side that the hosts are operating
in the appropriate mode.

Note

Be sure to select the correct HBA for the topology that you are using. Although some switches can
auto-detect the type of HBA in use, using the wrong HBA in a topology can lead to data loss and can
cause many problems in the storage fabric.

Hubs, Switches, Routers and Bridges

Thus far, we have discussed the storage fabric as a generic infrastructure that allows hosts and devices to
communicate with each other. As you have seen, there are fundamentally different fibre channel
topologies, and these different topologies use different components to provide the infrastructure.

Hubs

Hubs are the simplest form of fibre channel devices and are used to connect devices and hosts into
arbitrated loop configurations. Hubs typically have 4, 8, 12 or 16 ports allowing up to 16 devices and
hosts to be attached, however, the bandwidth on a hub is shared by all devices on the hub. In addition,
hubs are typically half-duplex (newer full duplex hubs are becoming available). In other words,
communication between devices or hosts on a hub can only occur in one direction at a time. Because of
these performance constraints, hubs are typically used in small and/or low bandwidth configurations.

Figure 5 below shows two hosts and two storage devices connected to the hub with the dark arrows
showing the physical loop provided by the hub.
Figure 5: FC-AL hub configuration

A typical hub detects empty ports on the hub and does not configure them into the loop. Some hubs
provide higher levels of control over how the ports are configured and when devices are inserted into the
loop.

Switches

A switch is a more complex storage fabric device that provides the full fibre channel bandwidth to each
port independently, as shown in Figure 6 below. Typical switches allow ports to be configured in either an
arbitrated loop or a switched mode fabric.

When a switch is used in an arbitrated loop configuration, the ports are typically full-bandwidth and bi-
directional, allowing devices and hosts to communicate at full fibre channel speed in both directions. In this
mode, ports are configured into a loop, providing a high-performance arbitrated loop configuration.

Switches are the basic infrastructure used for large, point-to-point, switched fabrics. In this mode, a
switch allows any device to communicate directly with any other device at full fibre channel speed
(1 Gbit/sec or 2 Gbit/sec today).
Figure 6: Switched fibre configuration

Switches typically support 16, 32, 64 or even 128 ports today. This allows for complex fabric
configurations. In addition, switches can be connected together in a variety of ways to provide larger
configurations that consist of multiple switches. Several manufacturers such as Brocade and McData
provide a range of switches for different deployment configurations, from very high performance switches
that can be connected together to provide a core fabric to edge switches that connect servers and devices
with less intensive requirements.

Figure 7 below shows how switches can be interconnected to provide a scalable storage fabric supporting
many hundreds of devices and hosts (these configurations are almost always deployed in highly
available topologies; the section Highly Available Solutions deals with high availability).
Figure 7: Core and edge switches in a SAN fabric

The core backbone of the SAN fabric is provided by high performance (and typically high port density)
switches. The inter-switch bandwidth in the core is typically 8 Gbit/sec and above. Large data center class
machines and large storage pools can be connected directly to the backbone for maximum performance.
Servers and storage with lower performance requirements (such as departmental servers) may be connected
via large arrays of edge switches (each of which may have 16 to 64 ports).

Bridges and Routers

In an ideal world, all devices and hosts would be SAN-aware and all would interoperate in a single,
ubiquitous environment. Unfortunately, many hosts and storage components are already deployed using
different interconnect technologies. To allow these types of devices to participate in a storage fabric
environment, a wide variety of bridge and router devices allow the technologies to interoperate. For example,
SCSI-to-fibre bridges or routers allow parallel SCSI devices (typically SCSI-2 and SCSI-3) to be connected
to a fibre channel network, as shown in Figure 8 below. In the future, bridges will allow iSCSI devices
(iSCSI is a device interconnect that uses IP as the communications mechanism and layers the SCSI protocol
on top of IP) to connect into a switched SAN fabric.
Figure 8: SCSI to Fibre Channel bridge

Storage Components

Thus far, we have discussed devices being attached to the storage bus as though individual disks are
attached. While in some very small, arbitrated loop configurations, this is possible, it is highly unlikely that
this configuration will persist. More likely, storage devices such as disk and tape are attached to the
storage fabric using a storage controller such as an EMC Symmetrix or a Compaq StorageWorks RAID
controller. IBM would refer to these types of components as Fibre RAID controllers.

In its most basic form, a storage controller is a box that houses a set of disks and provides a single
(potentially redundant and highly available) connection to a SAN fabric. Typically, disks in this type of
controller appear as individual devices that map directly to the individual spindles housed in the controller.
This is known as a JBOD (just a bunch of disks) configuration. The controller provides no value-add; it is
just a concentrator that makes it easy to connect multiple devices to a single fabric switch port (or a small
number of ports for high availability).

Modern controllers almost always provide some level of redundancy for data. For example, many
controllers offer a wide variety of RAID levels such as RAID 1, RAID 5, RAID 0+1 and many other
algorithms to ensure data availability in the event of the failure of an individual disk drive. In this case,
the hosts do not see devices that correspond directly to the individual spindles; rather, the controller
presents to the hosts a virtual view of highly available storage devices, called logical devices.
Figure 9: Logical devices

In the example in Figure 9, although there are five physical disk drives in the storage cabinet, only two
logical devices are visible to the hosts and can be addressed through the storage fabric. The controller
does not expose the physical disks themselves.

Many controllers today are capable of connecting directly to a switched fabric; however, the disk drives
themselves are typically either SCSI or, more commonly now, disks that have a built-in FC-AL interface.
As you can see in Figure 10 below, the storage infrastructure that the disks connect to is totally
independent of the infrastructure presented to the storage fabric.

A controller typically has a small number of ports for connection to the fibre channel fabric (at least two
are required for highly available storage controllers). The logical devices themselves are exposed through
the controller ports as logical units (LUNs).
Figure 10: Internal components of a storage controller

Highly Available Solutions

One of the benefits of storage area networks is that the storage can be managed as a centralized pool of
resources that can be allocated and re-allocated as required. This powerful paradigm is changing the way
data centers and enterprises are built, however, one of the biggest issues to overcome is that of
guaranteed availability of data. With all of the data detached from the servers, the infrastructure must be
architected to provide highly available access so that the loss of one or more components in the storage
fabric does not lead to the servers being unable to access the application data. All areas must be
considered including:

• No single point of failure of cables or components such as switches, HBAs or storage controllers. Typical
highly available storage controller solutions from storage vendors have redundant components and can
tolerate many different kinds of failures.

• Transparent and dynamic path detection and failover at the host. This typically involves multi-path drivers
running on the host to present a single storage view to the application across multiple, independent HBAs.

• Built-in hot-swap and hot-plug for all components from HBAs to switches and controllers. Many high-end
switches and most if not all enterprise class storage controllers allow interface cards, memory, CPUs and
disk drives to be hot-swapped.

There are many different storage area network designs that have different performance and availability
characteristics. Different switch vendors provide different levels of support and different topologies;
however, most of the topologies are derived from standard network topology design (after all, a SAN is a
network, just with the interconnect technology tuned to a given application). Topologies include:

• Multiple independent fabrics

• Federated fabrics

• Core Backbone

Multiple Independent Fabrics

In a multiple fabric configuration, each device or host is connected to multiple fabrics, as shown in Figure
11 below. In the event of the failure of one fabric, hosts and devices can communicate using the
remaining fabric.

Figure 11: Multiple independent fabrics

Pros

Resilient to management or user errors. For example, if security is changed or zones are deleted, the
configuration on the alternate fabric is untouched and can be re-applied to the broken fabric.

Cons

Managing multiple independent fabrics can be costly and error prone. Each fabric should have the same
zoning and security information to ensure a consistent view of the fabric regardless of the communication
port chosen.

Hosts and devices must have multiple adapters. In the case of a host, multiple adapters are typically
treated as different storage buses. Additional multi-pathing software such as Compaq SecurePath or EMC
PowerPath is required to ensure that the host gets a single view of the devices across the two HBAs.

Federated Fabrics

In a federated fabric, multiple switches are connected together, as shown in Figure 12 below. Individual
hosts and devices are connected to at least two switches.
Figure 12: Federated switches for single fabric view

Pros

Management is simplified: the configuration is a highly available, single fabric, and therefore there is only
one set of zoning information and one set of security information to manage.

The fabric itself can route around failures such as link failures and switch failures.

Cons

Hosts with multiple adapters must run additional multi-pathing software such as Compaq SecurePath or
EMC PowerPath to ensure that the host gets a single view of the devices where there are multiple paths
from the HBAs to the devices.

Management errors are propagated to the entire fabric.

Core Backbone

A core backbone configuration is really a way to scale out a federated fabric environment. Figure 7 shows
a backbone configuration. The core of the fabric is built using highly scalable, high performance switches
where the inter-switch connections provide high performance communication (e.g. 8-10 Gbit/sec using
today's technology). Redundant edge switches can be cascaded from the core infrastructure to provide
high numbers of ports for storage and host devices.

Pros

Highly scalable and available storage area network configuration.

Management is simplified: the configuration is a highly available, single fabric, and therefore there is only
one set of zoning information and one set of security information to manage.

The fabric itself can route around failures such as link failures and switch failures.

Cons
Hosts with multiple adapters must run additional multi-pathing software such as Compaq SecurePath or
EMC PowerPath to ensure that the host gets a single view of the devices where there are multiple paths
from the HBAs to the devices.

Management errors are propagated to the entire fabric.

[1] Most devices today, except for some McData switches, support FC-AL.

Management
Updated: January 1, 2003

As you can see from the previous section, storage area networks are increasingly complex and large
configurations are becoming more and more common. While storage area networks certainly provide
many benefits over direct attach storage, the big issue is how to manage this complexity.

Zoning

A storage fabric can have many devices and hosts attached to it. With all of the data stored in a single,
ubiquitous cloud of storage, controlling which hosts have access to what data is extremely important. It is
also important that the security mechanism be an end-to-end solution so that badly behaved devices or
hosts cannot circumvent security and access unauthorized data.

Zoning is a mechanism, implemented at the switch level, which provides an isolation boundary. A port
(either host adapters or storage controller ports) can be configured as part of a zone. Only ports in a given
zone can communicate with other ports in that zone. The zoning is configured and access control is
implemented by the switches in the fabric, so a host adapter cannot spoof the zones that it is in and gain
access to data for which it has not been configured.

Figure 13: Zoning

In Figure 13 above, hosts A and B can access data from storage controller S1, however host C cannot as it
is not in Zone A. Host C can access data from storage S2.
Many switches today allow overlapping zones. This enables a storage controller to reside in more than one
zone, thus enabling the devices in that controller to be shared amongst different servers in different
zones, as shown in Figure 14 below. Finer-grained access controls are required to protect individual disks
against access from unauthorized servers in this environment.

Zoning can be implemented in either hardware or software. Hardware zoning is done by the ASIC in the
switch ports themselves. Every packet is checked at line speed to ensure that it is authorized. Software
zoning is done by the name server or other fabric access software. When a host tries to open a connection
to a device, access controls can be checked at that time.

Figure 14: Storage controller in multiple zones

Zoning is an extremely important concept. Not only is it a security feature, but it also limits the traffic flow
within a given SAN environment. Traffic (I/O requests and other storage requests) between ports is only
routed to those pieces of the fabric that are in the same zone. Typically with modern switches, as new
switches are added to an existing fabric, the new switches are automatically updated with the current
zoning information.

I/Os (either read/write or such things as device reset or LIP) from hosts or devices in a fabric cannot leak
out and affect other zones in the fabric causing noise or cross-talk between zones. As we shall see, this is
fundamental to deploying Server clusters on a SAN.

Fine-grain Security and Access Control

While zoning provides a high-level security infrastructure in the storage fabric, it does not provide the
fine-grain level of access control needed for large storage devices. In a typical environment, a storage
controller may have many Gigabytes or Terabytes of storage to be shared amongst a set of servers.
Storage controllers typically provide LUN-level access controls that enable an administrator to restrict
access to a given LUN to one or more hosts. By providing this access control at the storage controller, the
controller itself can enforce access policies to the data.

LUN masking is a host-based mechanism that hides specific LUNs from applications. Although the host bus
adapter and the lower layers of the operating system have access to and could communicate with a set of
devices, LUN masking prevents the higher layers from knowing that the device exists and therefore
applications cannot use those devices. LUN masking is a policy-driven software security and access control
mechanism enforced at the host. For this policy to be successful, the administrator has to trust the drivers
and the operating systems to adhere to the policies.

SAN Management

SAN management is a huge topic on its own and is outside the scope of this document. Different vendors
(both vendors that provide SAN fabric components as well as software vendors that provide storage
management tools) provide a wide range of tools for setting up, configuring, monitoring and managing the
SAN fabric, as well as the state of devices and hosts on the fabric.

Virtualized View of Storage

The previous section touched on virtualization of storage when describing the various RAID levels. The logical
devices presented by the controller to the storage fabric are some composite of the real physical devices
in the storage cabinet. Moving forward, the ideal for storage management is that the devices
presented to the storage infrastructure are not tied to any physical storage. In other words, the set of
spindles in the cabinet is treated as a pool of storage blocks. Logical devices can be materialized from that
storage pool with specific attributes, such as the ability to survive a single failure or to deliver given
performance characteristics. The storage controller is then free to store the data associated with the logical
devices anywhere (and indeed to change the placement at will) as long as the desired characteristics are
maintained.

At this point, there are no real physical characteristics associated with a logical disk; any physical notions,
such as a disk serial number or identity, are purely software-generated, virtualized views. See Figure 15
below.

By taking this route, storage vendors can drive many value-added storage management features down
into the storage infrastructure itself without requiring host involvement. We are seeing the first few
steps down this path today with the snapshot functionality provided by some storage controllers.
Figure 15: Storage virtualization by the controller

Deploying Server Clusters in a SAN Environment


Updated: January 1, 2003

This section covers best practices for MSCS in a SAN environment. Clusters are supported by Microsoft in
a SAN environment; however, there are some specific requirements and restrictions placed on the
configurations.

Note

In a SAN environment, the storage fabric provides access to data for a wide range of applications. If
the stability of the storage fabric is compromised, the availability of the entire data center could be at risk;
no amount of clustering can protect against an unstable or unavailable storage fabric.

Qualified Configurations
As with all existing cluster configurations, only complete cluster solutions that appear on the Microsoft
Hardware Compatibility List (HCL) will be supported by Microsoft. Clusters cannot be arbitrarily built up
from device-level components (even those components such as RAID controllers, multi-cluster devices
etc. that are qualified as cluster components) and put together into a supported configuration.

A single cluster can be qualified and placed on the HCL using fibre channel storage interconnects and
switch technology and there are many examples of complete configurations on the HCL today. This,
however, does not really constitute a storage area network (SAN) configuration.

Microsoft fully supports multiple clusters and/or servers deployed on a single fibre channel switched fabric
and sharing the same storage controllers as long as the configuration adheres to the following rules:

• The storage controller must be on the Cluster/Multi-Cluster Device HCL list if it is shared between

clusters.

• The complete configuration for any individual cluster must be on the Cluster HCL list.

Take, for example, the following HCL lists:

Cluster/Multi-cluster device HCL list:

Storage Controller St1

Storage Controller St2

Cluster HCL list

2-node advanced server cluster AS1

Server 1: Server Box S1, 256 MB, 700 MHz PIII, HBA H1

Server 2: Server Box S2, 256 MB, 700 MHz PIII, HBA H1

Storage: Storage Controller St1

4-node advanced server cluster AS2

Server 1: Server Box S5, 512 MB, 1.2 GHz PIV, HBA H1

Server 2: Server Box S6, 512 MB, 1.2 GHz PIV, HBA H1

Server 3: Server Box S7, 512 MB, 1.2 GHz PIV, HBA H1

Server 4: Server Box S8, 512 MB, 1.2 GHz PIV, HBA H1

Storage: Storage Controller St1

2-node advanced server cluster AS3

Server 1: Server Box S10, 256 MB, 700 MHz PIII, HBA H2

Server 2: Server Box S11, 256 MB, 700 MHz PIII, HBA H2

Storage: Storage Controller St2

In this case, the 2-node AS1 and the 4-node AS2 configurations can both be placed on the same storage
area network and can in fact share the same storage controller St1. It is also possible to have AS3 on the
same storage area network as long as it uses storage controller St2 and not St1.

With Windows 2000, the storage area network fabric itself is not on the HCL and is not qualified directly by
Microsoft. When building these configurations, you must ensure that the switches and other fabric
components are compatible with the HBAs and the storage controllers.

Arbitrated Loops and Switched Fabrics

Fibre channel arbitrated loops can be configured to support multiple hosts and multiple storage devices,
however, arbitrated loop configurations typically have restrictions due to the nature of the technology. For
example, in some cases, a complete storage controller must be assigned to a given server or cluster.
Individual devices in the controller cannot be assigned to different servers or clusters. While
manufacturers and vendors allow multiple clusters to be hosted on a single arbitrated loop, due to the
configuration restrictions and the mechanisms that the cluster service uses to protect disks in a cluster,
Microsoft recommends that only one cluster is attached to any single arbitrated loop configuration and
that arbitrated loop configurations are limited to small, relatively static cluster configurations.

Fabrics are fully supported by server clusters for both a single cluster and for multiple clusters and
independent servers on the same storage fabric. Fabrics provide a much more stable environment where
multiple server clusters are deployed using the same storage infrastructure. Nodes (and indeed storage
devices) can leave or enter the SAN independently without affecting other parts of the fabric. Highly
available fabrics can be built up, and in conjunction with multi-path drivers, can provide a highly available
and scalable storage infrastructure.

Hints, Tips and Don’t Dos

This section describes the dos and don’ts of deploying one or more clusters in a SAN.

MUST Do

Each cluster on a SAN MUST be deployed in its own zone. The cluster uses mechanisms to protect access
to the disks that can have an adverse effect on other clusters that are in the same zone. By using zoning
to separate the cluster traffic from other cluster or non-cluster traffic, there is no chance of interference.
Figure 16 shows two clusters sharing a single storage controller. Each cluster is in its own zone. The LUNs
presented by the storage controller must be allocated to individual clusters using fine-grained security
provided by the storage controller itself. LUNs must be setup as visible to all nodes in the cluster and a
given LUN should only be visible to a single cluster.

The multi-cluster device test used to qualify storage configurations for the multi-cluster HCL list tests the
isolation guarantees when multiple clusters are connected to a single storage controller in this way.
Figure 16: Clusters assigned to individual zones

All HBAs in a single cluster must be the same type and at the same firmware revision level. Many storage
and switch vendors require that ALL HBAs in the same zone, and in some cases the same fabric, are the
same type and have the same firmware revision number.

All storage device drivers and HBA device drivers in a cluster must be at the same software version.

SCSI bus resets are not used on a fibre channel arbitrated loop; they are interpreted by the HBA and
driver software and cause a LIP to be sent. As previously described, this resets all devices on the loop.

When adding a new server to a SAN, ensure that the HBA is appropriate for the topology. In some
configurations, adding an arbitrated loop HBA to a switched fibre fabric can result in widespread failures of
the storage fabric. There have been real-world examples of this causing serious downtime.

The base Windows 2000 platform will mount any device that it can see when the system boots. The
cluster software ensures that access to devices that can be accessed by multiple hosts in the same cluster
is controlled and only one host actually mounts the disk at any one time. When first creating a cluster,
make sure that only one node can access the disks that are to be managed by the cluster. This can be
done either by leaving the other (to be) cluster members powered off, or by using access controls or
zoning to stop the other hosts from accessing the disks. Once a single node cluster has been created, the
disks marked as cluster-managed will be protected and other hosts can be either booted or the disks
made visible to other hosts to be added to the cluster.

This is no different from any cluster configuration that has disks that are accessible from multiple
hosts.
Note

In Windows Server 2003, you can use the new mountvol /n command option to disable dynamic scanning
(automatic mounting of new volumes). In a SAN environment, it is recommended that dynamic scanning be
disabled before the servers are connected to the SAN. Creating a new cluster, or adding and removing
nodes in a server cluster, should be done while dynamic scanning is turned off. It is recommended that
dynamic scanning remain turned off as long as the servers are connected to the storage infrastructure.

MUST NOT Do

NEVER allow multiple hosts access to the same storage devices unless they are in the SAME cluster. If
multiple hosts that are not in the same cluster can access a given disk, this will lead to data corruption.

NEVER put any non-disk device into the same zone as cluster disk storage devices.

Other Hints

Highly available systems such as clustered servers should typically be deployed with multiple HBAs and a
highly available storage fabric. In these cases, be sure to ALWAYS load the multi-path driver software. If
the I/O subsystem in the Windows 2000 platform sees two HBAs, it will assume they are different buses
and enumerate all the devices as though they were different devices on each bus, when in fact the
host is seeing multiple paths to the same disks. Failure to load the multi-path driver will lead to data
corruption. A simple manifestation of this is that the disk signature is re-written. If the Windows platform
sees what it thinks are two independent disks with the same signature, it will re-write one of the signatures
to ensure that all disks have unique signatures. This is covered in KB article 293778, "Multiple-Path Software
May Cause Disk Signature to Change."

Note

Windows Server 2003 will detect that the same volume is being exposed twice. If such a
situation arises, Windows Server 2003 will not mount the volumes exposed by the second controller that
were already exposed by the first controller.

Many controllers today provide snapshots at the controller level that can be exposed to the cluster as a
completely separate LUN. The cluster does not react well to multiple devices having the same signature. If
the snapshot is exposed back to the host with the original disk online, the base I/O subsystem will re-
write the signature as in the previous example, however, if the snapshot is exposed to another node in the
cluster, the cluster software will not recognize it as a different disk. DO NOT expose a hardware snapshot
of a clustered disk back to a node in the same cluster. While this is not specifically a SAN issue, the
controllers that provide this functionality are typically deployed in a SAN environment.

Adding and Removing Disks from a Cluster

In Windows 2000 (SP3 onwards) and Windows Server 2003, adding a disk to the cluster is straightforward.
Simply add the storage (in a SAN this usually means adding the physical drives to a storage controller and
then creating a logical unit that is available in the correct zone and with the correct security attributes).

Once the disk is visible to the operating system, you can make the disk a cluster-managed disk by adding
a physical disk resource in Cluster Administrator. The new disk will appear as eligible to be clustered.
Note

Some controllers use a different cluster resource type than Physical Disk; for those environments, create a
resource of the appropriate type.

Only basic, MBR-format disks that contain at least one NTFS partition can be managed by the cluster.
Before adding a disk, it must be formatted.
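
As an illustration only, the resource can also be created from the command line with cluster.exe; this is a
minimal sketch that assumes the Physical Disk resource type and uses a hypothetical resource name
"Disk Z:" and a hypothetical group "Disk Group 1":

Copy Code
REM Create a physical disk resource for the new disk (names are examples only)
cluster res "Disk Z:" /create /group:"Disk Group 1" /type:"Physical Disk"

REM Bring the new resource online once its disk parameters are configured
cluster res "Disk Z:" /online

Note that when a disk resource is created this way, the disk-specific private properties (such as the disk
signature) that Cluster Administrator normally sets automatically may still need to be configured before
the resource will come online.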

Remember that the same rules apply when adding disks as in creating a cluster. If multiple nodes can see
the disk BEFORE any node in the cluster is managing it, this will lead to data corruption. When adding a
new disk, first make the disk visible to only one cluster node and then once it is added as a cluster
resource, make the disk visible to the other cluster nodes.

To remove a disk from a cluster, first remove the cluster resource corresponding to that disk. Once it has
been removed from the cluster, the disk can be removed (either the drive can be physically removed, or
the LUN can be deleted or re-purposed).
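
Again for illustration only, a sketch of the removal step with cluster.exe, reusing the hypothetical
resource name from the earlier example:

Copy Code
REM Take the disk resource offline and then delete it from the cluster
cluster res "Disk Z:" /offline
cluster res "Disk Z:" /delete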

There are several KB articles on replacing a cluster-managed disk. While disks in a cluster should typically
be RAID sets or mirror sets, there are sometimes issues that cause catastrophic failures leading to a disk
having to be rebuilt from the ground up. There are also other cases where cluster disks are not redundant
and failure of those disks also leads to a disk having to be replaced. The steps outlined in those articles
should be used if you need to rebuild a LUN due to failures.

243195 - Replacing a cluster managed disk in Windows NT 4.0

280425 - Recovering from an Event ID 1034 on a Server Cluster

Expanding disks

Volumes can now be expanded dynamically without requiring a reboot. The Microsoft-provided DiskPart
tool can be used to extend volumes dynamically. DiskPart is available for both Windows 2000 and Windows
Server 2003; the Windows 2000 version of DiskPart can be downloaded from the www.microsoft.com Web site.
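
For illustration, a minimal sketch of an interactive DiskPart session that extends a volume; the volume
number used here is purely an example and must be replaced with the volume to be extended:

Copy Code
diskpart
DISKPART> list volume
DISKPART> select volume 3
DISKPART> extend
DISKPART> exit

In a SAN, the underlying LUN typically has to be grown at the storage controller before the volume can be
extended.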

SAN Backup

Storage area networks provide many opportunities to offload work from the application hosts. Many of the
devices in the SAN (either hosts or storage controllers) have CPUs and memory and are capable of
executing complex code paths. In addition, because any device can communicate with any other device, the
SAN provides a peer-to-peer communication mechanism. This leads to such things as SAN-based backups. A
storage controller can easily initiate the backup of a disk device to a tape device on the SAN without host
intervention. In some cases, hybrid backup solutions are implemented where file system-related
information is provided by the host, but bulk copying of the data blocks is done directly from the storage
controller to the tape device.

The cluster software uses disk reservations to protect devices that could be accessed by multiple
computers simultaneously. The host that currently owns a disk protects it so that no other host can write
to it. This is necessary to avoid writes that are in the pipeline when failover occurs from corrupting the
disk. When failover occurs, the new owner protects the disk. This means that a cluster disk is always
reserved and therefore can only be accessed by the owning host. No other host or device (including the
controller that is hosting the disk) can access the disk simultaneously. This means that SAN-based backup
solutions where data transfers from disk to tape are initiated by a 3rd party (i.e. initiated by a device
other than the owning host) cannot be supported in a cluster environment.

Booting from a SAN

Microsoft supports booting from a SAN in limited environments. There is a set of configuration
restrictions around how Windows boots from a storage area network; see KB article 305547.

Windows 2000 Server clusters require that the startup disk, page file disk and system disk be on a
different storage bus from the cluster disks. To boot from a SAN, you must have a separate HBA for
the boot, system and pagefile disks from the HBA used for the cluster disks. You MUST ensure that the
cluster disks are isolated from the boot, system and pagefile disks by zoning the cluster disks into their
own zone.

Note

Windows Server 2003 allows the startup disk and the cluster server disks to be hosted on the same bus.
However, you need to use Storport miniport HBA drivers for this functionality to work. This is NOT a
supported configuration with any other combination (for example, SCSIport miniport or full port
drivers).

Cluster Service Features in Windows Server 2003


Updated: January 1, 2003

The Windows Server 2003 release has a number of enhancements. The following enhancements are
specific to supporting Server clusters in a SAN environment:

Targeted Reset

Historically, server clusters have used the SCSI reservation mechanism to protect disks against access,
guaranteeing that only the host that has the disk online can actually access it. To ensure that devices can
be failed over in the event of failures, server clusters implement a challenge/response mechanism that
can detect dead or unresponsive server nodes even though they may not have crashed and the storage
fabric is therefore unaware that the server is not responding. To do this, the reservations are periodically
broken by other nodes in the cluster using SCSI bus reset commands. In a SAN fabric, SCSI reset
commands can be very detrimental to the fabric since they are typically not implemented the same way
by different vendors, and they typically result in a LIP that takes the fabric some time to re-settle.

In Windows Server 2003, the server cluster code uses a new mechanism that first tries a targeted device
reset, then a LUN reset, and, if all else fails, falls back to a full bus reset. This feature requires the storage
miniport drivers to interpret the new control codes. At this time, several HBA vendors are modifying their
miniport drivers to provide this feature, thus enabling much more stable cluster configurations in a
switched fabric environment.

This feature requires no administration to enable it. If the device driver supports the targeted reset
functions, they will be used automatically.

Single Storage Bus Configurations

As described previously, in Windows 2000, only storage devices on a different bus to the system disk will
be considered eligible as cluster-managed devices. In a SAN environment, the goal is to centralize all
storage into a single fabric accessible through a single port (actually in most cases the host will have
multiple HBAs and multi-path drivers to provide a single port view to the cluster software).

In Windows Server 2003, the Cluster service has a switch that, when enabled, allows any disk on the system,
regardless of the bus it is on, to be eligible as a cluster-managed disk. Using this, the system disk, startup
disk, pagefile disks and any cluster-managed disks can be attached to the same HBA. This feature is
enabled by setting the following registry key:

Copy Code
HKLM\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\ManageDisksOnSystemBuses 0x01
This feature is enabled by a registry key to ensure that it is not accidentally enabled by customers who do
not understand the implications of this configuration. It is intended for OEMs to ship qualified and tested
configurations, not for a typical end-user or administrator to set up in an ad hoc manner.

A single storage bus configuration MUST have device drivers that support the targeted reset functionality
previously defined.
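
As an illustration only, a minimal sketch of setting this value from a command prompt with reg.exe,
assuming the key is exactly as listed above and that the value is a REG_DWORD (as the 0x01 notation
suggests):

Copy Code
REM Set ManageDisksOnSystemBuses to 1 (OEM-qualified configurations only)
reg add HKLM\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters /v ManageDisksOnSystemBuses /t REG_DWORD /d 1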

Related Issues
Updated: January 1, 2003

Shared disk versus Shared-nothing

You may see various documents that use terms like shared disk clusters and non-shared disk or shared-
nothing clusters. These terms are very misleading and can cause confusion since they depend on the
context of the discussion.

When talking about the physical connectivity of devices, shared disk clusters means that multiple
computers have direct physical access to any given storage unit (for example, multiple hosts are directly
connected to a disk drive on a SCSI bus that the computers are both connected to). Non-shared disk or
shared-nothing clusters in this context means that any given disk is only physically connected to one
computer. See Figure 17.
Figure 17: Physical view of cluster topologies

In the context of file systems or data access from applications, shared disk means that applications
running on multiple computers in a cluster can access the same disk directly at the same time. To support
this application, the file system must coordinate concurrent access to a single disk from multiple hosts
(e.g. a cluster file system). Clearly, shared physical access is required for this configuration. When talking
about application or data access, non-shared disk means that only applications running on one computer
can access data on any given disk directly. In this case, the physical disk may or may not be connected to
multiple computers, but if it is, then only the connection from one computer is in use at any one time. See
Figure 18 below.

Figure 18: Application view of cluster topologies

SAN Versus NAS


There are two industry-wide terms that refer to externally attached storage:

• Storage Area Networks (SAN)

• Network Attached Storage (NAS)

Having two similar-sounding terms leads to some confusion, and therefore it is worth discussing the
differences between the two technologies before delving into storage area network details.

Storage area networks (SANs), see Figure 19 below, are typically built-up using storage-specific network
technologies. Fibre channel is the current technology leader in this space. Servers connect to storage and
access data at the block level. In other words, to the server, a disk drive out on the storage area network
is accessed using the same read and write disk block primitives as though it were a locally attached disk.
Typically, data and requests are transmitted using a storage-specific protocol (usually based on the SCSI
family of protocols). These protocols are tuned for low latency, high bandwidth data transfers required by
storage infrastructure.

Figure 19: Storage Area Network

While fibre channel is by far the leading technology today, other SAN technologies have been proposed,
for example SCSI over InfiniBand and iSCSI (the SCSI protocol running over a standard IP network). All
these technologies allow a pool of devices to be accessed from a set of servers, decoupling the compute
needs from the storage needs.

In contrast, network attached storage (NAS), see Figure 20 below, is built using standard network
components such as Ethernet or other LAN technologies. The application servers access storage using file
system functions such as open file, read file, write file, close file, and so on. These higher-level functions are
encapsulated in protocols such as CIFS, NFS or AppleShare and run across standard IP-based connections.
Figure 20: Network Attached Storage

In a NAS solution, the file servers hide the details of how data is stored on disks and present a high level
file system view to application servers. In a NAS environment, the file servers provide file system
management functions such as the ability to back up a file server.

As SAN technology prices decrease and the need for highly scalable and highly available storage solutions
increases, vendors are turning to hybrid solutions that combine the centralized file server simplicity of NAS
with the scalability and availability offered by SAN as shown in Figure 21 below.
Figure 21: Hybrid NAS and SAN solution

The following table contrasts the SAN and NAS technologies.

SAN Versus NAS Technologies

Application server access method
  Storage Area Network (SAN): Block-level access
  Network Attached Storage (NAS): File-level access

Communication protocol
  SAN: SCSI over Fibre Channel; iSCSI (SCSI over IP)
  NAS: CIFS, NFS, AppleShare

Network physical technology
  SAN: Typically storage-specific (e.g. Fibre Channel), but may be high-speed Ethernet
  NAS: General purpose LAN (e.g. Gigabit Ethernet)

Example storage vendors
  SAN: Compaq StorageWorks SAN family; EMC Symmetrix
  NAS: Network Appliance NetApp Filers; Maxtor NASxxxx; Compaq TaskSmart N-series

Some believe that, in the future, several different technologies will win in the storage space; others believe
that, in the end, a single network interconnect will cover SAN and NAS needs as well as basic
inter-computer networking needs. Over time, the Windows platform and Windows Clustering technologies
will support different interconnect technologies as they become important to end-customer deployments.

Summary (Server Clusters: Storage Area Networks - For Windows 2000 and Windows Server 2003)
Updated: January 1, 2003

Storage area networks provide a broad range of advantages over locally connected devices. They allow
compute resources to be decoupled from storage units, thereby providing flexible deployment and re-purposing
of servers and storage to suit current business needs. You do not have to be concerned about buying the
right devices for a given server, or about re-cabling a datacenter to attach storage to a specific server.

Microsoft fully supports storage area networks both as part of the base Windows platform and as part of a
complete Windows Clustering high-availability solution. One or more server clusters can be deployed in a
single SAN environment, along with standalone Windows servers and/or non-Windows-based platforms.

Related Links (Server Clusters : Storage Area Networks - For Windows 2000 and Windows Server 2003)
Updated: January 1, 2003

See the following resources for further information:

• For hardware and software compatibility information, see the Windows Server Catalog at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=4287).

• For a library of technical articles about Windows Server 2003, see the Windows Server 2003 TechCenter at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=4281).

• For the latest information about Windows Server 2003, see the Windows Server 2003 Web site at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=48237).

• For interactive help in solving a problem with your computer or to research your problem, see Product Support Services at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=281).

• For additional information about cluster deployment, see "Designing and Deploying Clusters" at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=48238).

• For information about troubleshooting, see "Troubleshooting cluster node installations" at the Microsoft Web site (http://go.microsoft.com/fwlink/?LinkId=48239).


Server Clusters: Cluster Configuration Best Practices for
Windows Server 2003
Cluster and Application Deployments
Updated: January 1, 2003

Historically, we have described clusters and applications as being active/active or active/passive, which is
a natural result of the fact that the earliest products supported only 2-node clusters. In some cases,
however, this has led to confusion because the terms can be interpreted in different ways depending on
whether the context is 1) how a given application instance is running, 2) how different instances of an
application are running or 3) whether the nodes in the cluster are performing useful work. To fully
understand the specific application and cluster deployments, we need to understand the following:

• How many instances of the application are actually up and running?

• How many instances of the same data are in the cluster?

• Do instances of the data move around?

• Do different instances of the same application run on different cluster nodes?

• Can different applications run on different cluster nodes?

• What kind of load does the application put on a server, and how will this load be redistributed after a failover?

Before describing how an application is deployed, we must first define what an application is. An
application, for the purposes of this document, is defined as the running code and data required to provide
a single service to an end-user or client. Consider a couple of different examples to illustrate the point: A
single instance of Microsoft Word running on a workstation is a single application instance. If there are
multiple instances of Word running, each instance is considered a different application instance. There are,
however, more complex applications. For example, with Microsoft SQL Server, a single database is
considered to be a single application instance. Independent databases are considered to be different
application instances. A single database, however, can be partitioned into multiple SQL Server instances
and tied together using the SQL Server query engine. In this case, the set of SQL Server instances that
are tied together to provide a single database image is considered a single application instance.

Fundamentally, there are five different attributes that together provide a complete view of the deployment
and can be used to reason about or characterize the deployment:

• Server load: How much of a server's resources are consumed by the applications it supports, and how will this resource utilization be affected by failover?

• Application style: Is the application a single monolithic application that runs on a single node, or is the complete application split into smaller pieces that can be deployed around the cluster?

• Application deployment: How are the pieces of the application spread around the cluster in a given deployment?

• Failover policies: How is the application configured to behave after a failover?

• Application implementation: How is the application itself implemented?


Server Load
Updated: January 1, 2003

When deploying an application, it is always important to consider what demands it will make on a server's
resources. With clustering, there is a related issue that also needs to be taken into account: how is the
load redistributed after a failover?

Server Load: The Basics

Consider one of the simplest cases: an active/active, 2-node file server cluster, with node A and node B
each serving a single share. If node A fails, its resources will move to node B, placing an additional load
on node B. In fact, if nodes A and B were each running at only 50% capacity before the failure, node B will
be completely saturated (100% of capacity) after the failover is completed, and performance may suffer.

While this situation may not be optimal, it is important to remember that having all of the applications still
running, even in a reduced performance scenario, is a 100% improvement over what you would have
without the high availability protection that clusters provide. But this does bring up the notion of risk, and
what amount of it you are willing to accept in order to protect the performance, and ultimately the
availability, of your applications.

We have intentionally chosen the worst case (an active/active, 2-node cluster with each node running a
single application that consumes half of the server's resources) for the purpose of clarity. With an
additional node, the equation changes: there are more servers to support the workload, but if all three
nodes are running at 50% capacity and there are two failures, the single remaining server will simply not
be able to handle the accumulated load of the applications from both of the failed servers. Of course, the
likelihood of two failures is considerably less than that of a single failure, so the risk is mitigated
somewhat.

Nevertheless, the load/risk tradeoff must be considered when deploying applications on a cluster. The
more nodes in a cluster, the more options you have for distributing the workload. If your requirements
dictate that all of your clustered applications must run with no performance degradation, then you may
need to consider some form of active/passive configuration. But even in this scenario, you must consider
the risks of the various configurations. If you cannot accept even the slightest risk of any reduced
performance under any conditions whatsoever, you will need a dedicated passive node for each active
node.

If, on the other hand, you are convinced that the risk of multiple failures is small, you have other choices.
If you have a 4-node, or 8-node cluster, you may want to consider an N+I configuration. N+I, which is
discussed in more detail in section 1.4.3, is a variant of active/passive, where N nodes are active, and I
nodes are passive, or reserve nodes. Typically, the value for I is less than the value for N, and an N+I
cluster topology can handle I failures before any performance degradation is likely. The risk is that with
more than I failures, performance will likely decline, but once again, the likelihood of multiple failures is
increasingly remote.

For this reason, N+I clusters are a useful configuration that balances the hardware cost of having 100%
passive server capacity against the relatively low risk of multiple cluster node failures.

Server Load: Some More Realistic Configurations

The scenarios above were intentionally simplistic, assuming that one application imposed a monolithic load
on each server, and thus its resource utilization could not be spread among more than one other server in
the event of a failover. That is often not the case in the real world, especially for file and print servers, so
we will take a look at some additional scenarios with a 4-node cluster, named ABCD, and having nodes A,
B, C, and D.

Typically, a single server will support the load of more than one application. If, under normal conditions,
each server were loaded at 25%, then the ABCD cluster could survive the loss of three members before a
likely loss of application availability, which would be nearly a worst-case scenario.

The following series of figures illustrates what would happen with the application load in a 4-node cluster
for successive node failures. The shaded, or patterned, areas indicate the capacity demands of the running
applications. Further, the example below assumes that the application load on any given server is
divisible, and can be redistributed among any of the surviving nodes.

Figure 1.1: Cluster under normal operating conditions (each node loaded at 25%)

Figure 1.2: Cluster after a single node failure. Note redistribution of application load.

Figure 1.3: Cluster after two node failures. Each surviving node is now approximately 50%
loaded.
Figure 1.4: After three node failures, single surviving node is at full capacity.

If each node were running at 75% capacity, then without sensible failover policies, even a single node
failure could result in loss of application availability. Depending on the application(s), however, you can
specify that, in the event of a server failure, some percentage of the applications should fail over to node
A, node B, node C, and node D. If the applications are spread evenly among the surviving nodes, then this
cluster could now survive the loss of a single machine, because one third of the failed server's load (one
third of 75% is 25%) is allocated to each of the surviving three machines. The result is three fully loaded
servers (each of the nodes running at 75% capacity now has an additional 25%), but all applications are
still available.
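
The arithmetic in these divisible-load scenarios is simple enough to check with a short script. The following Python sketch is purely illustrative; the node names, capacity figures and the even-redistribution assumption mirror the examples above and are not part of any Windows clustering API.

# Illustrative sketch: evenly redistribute a failed node's load across survivors.
# Assumes the application load on each node is divisible (as in Figures 1.1-1.4).

def fail_nodes(loads, failed):
    """loads: dict of node -> % utilization; failed: list of nodes to fail, in order."""
    loads = dict(loads)
    for node in failed:
        share = loads.pop(node) / len(loads)   # spread the failed node's load evenly
        for survivor in loads:
            loads[survivor] += share
    return loads

if __name__ == "__main__":
    cluster = {"A": 25, "B": 25, "C": 25, "D": 25}
    print(fail_nodes(cluster, ["A"]))            # each survivor at ~33%
    print(fail_nodes(cluster, ["A", "B", "C"]))  # single survivor at 100%

    cluster75 = {"A": 75, "B": 75, "C": 75, "D": 75}
    after = fail_nodes(cluster75, ["A"])         # each survivor at 100%
    print(after, "overloaded:", [n for n, u in after.items() if u > 100])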

As a variation on the example that was illustrated previously, the following series of figures depicts
what happens to a 4-node cluster in which each node is running at approximately 33% capacity. Further, in
this case, the application load is indivisible and cannot be spread among multiple other servers in the
event of a failover (perhaps it is a single application or multiple applications that depend on the same
resource).

Figure 2.1: Cluster under normal operating conditions (each node loaded at approximately
33%)

Figure 2.2: Cluster after a single node failure. Note redistributed application load.
Figure 2.3: Cluster after second node failure. Each surviving node is now running at
approximately 66% capacity. Note that in the event of another node failure, this cluster will no
longer be capable of supporting all four of these applications.

Figure 2.4: After third failure, the single surviving server can only support three of the four
applications.
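
When the application load is indivisible, the question becomes whether every whole application instance can still be placed on some surviving node. The sketch below is a minimal illustration under that assumption, using a simple first-fit placement; the 33% loads and node names follow the example figures, and everything else is assumed.

# Illustrative sketch: can every indivisible application instance still be hosted
# after some nodes fail? Uses a simple first-fit placement, not an optimal one.

def can_host_all(app_loads, surviving_nodes, capacity=100):
    """app_loads: list of % loads, one per indivisible application instance."""
    free = {node: capacity for node in surviving_nodes}
    for load in sorted(app_loads, reverse=True):   # place the largest loads first
        target = next((n for n in free if free[n] >= load), None)
        if target is None:
            return False
        free[target] -= load
    return True

if __name__ == "__main__":
    apps = [33, 33, 33, 33]                      # one indivisible instance per node
    print(can_host_all(apps, ["B", "C", "D"]))   # one failure: True (Figure 2.2)
    print(can_host_all(apps, ["C", "D"]))        # two failures: True (Figure 2.3)
    print(can_host_all(apps, ["D"]))             # three failures: False (Figure 2.4)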

Another style of failover policy is Failover Pairs, also known as Buddy Pairs. Assuming each of the four
servers is loaded at 50% or less, a failover buddy can be associated with each machine. This allows a
cluster in this configuration to survive two failures, provided that both failures do not occur within the
same pair. More details on Failover Pairs can be found in section 1.4.1.

Taking the previous Failover Pair example, we can convert it to an active/passive configuration by loading
two servers at 100% capacity, and having two passive backup servers. This active/passive configuration
can also survive two failures, but note that under ordinary circumstances, two servers remain unused.
Furthermore, performance of these servers at 100% load is not likely to be as good as with the Failover
Pair configuration, where each machine is only running at 50% utilization. Between these two examples,
note that you have the same number of servers, the same number of applications, and the same
survivability (in terms of how many nodes can fail without jeopardizing application availability). However,
the Failover Pair configuration clearly comes out ahead of active/passive in terms of both performance and
economy.

Still using our ABCD 4-node cluster, consider what happens if we configure it as an N+I cluster (explained
in more detail in section 1.4.3) where three nodes are running at 100% capacity, and there is a single
standby node. This cluster can only survive a single failure. As before, however, comparing it to the
example where each server is running at 75% capacity, you again have the same number of servers,
applications, and same survivability, but the performance and economy can suffer when you have passive
servers backing up active servers running at 100% load.

Application Style
Now it is time to consider some aspects of the application's design, and how that affects the way it is
deployed. Applications running in a cluster can be characterized as one of:

• Single Instance

In a single instance application, only one application instance is running in the cluster at any
point in time. An example of this is the DHCP service in a server cluster. At any point in time,
there is only one instance of the DHCP service running in a cluster. The service is made highly
available using the failover support provided with server clusters. Single instance applications
typically have state that cannot be partitioned across the nodes. In the case of DHCP, the set of
IP addresses that have been leased is relatively small, but potentially highly volatile in a large
environment. To avoid the complexities of synchronizing the state around the cluster, the service
runs as a single instance application.

Figure 3: Single instance application

In the example above, the cluster has four nodes. The single instance application, by definition,
can be running on only one node at a time.

• Multiple Instance

A multiple instance application is one where multiple instances of the same code, or different
pieces of code that cooperate to provide a single service, can be executing around the cluster.
Together these instances provide the illusion of a single service to an end-user or client
computer. There are two different types of multiple instance applications, depending on the type
of application and data:

• Cloned Application. A cloned application is a single logical application that consists of two or more instances of the same code running against the same data set. Each instance
of the application is self-contained, thus enabling a client to make a request to any instance of
the application with the guarantee that regardless of the instance, the client will receive the
same result. A cloned application is scalable since additional application instances can be
deployed to service client requests and the instances can be deployed across different nodes,
thus enabling an application to grow beyond the capacity of a single node. By deploying
application instances on different nodes, the application is made highly available. In the event
that a node hosting application instances fails, there are instances running on other nodes
that are still available to service client requests.
Typically, client requests are load-balanced across the nodes in the cluster to spread the client
requests amongst the application instances. Applications that run in this environment do not
have long-running in-memory state that spans client requests, since each client request is
treated as an independent operation and each request can be load-balanced independently. In
addition, the data set must be kept consistent across the entire application. As a result,
cloning is a technique typically used for applications that have read-only data or data that is
changed infrequently. Web server front-ends, scalable edge-servers such as arrays of firewalls
and middle-tier business logic fall into this category.
While these applications are typically called stateless applications,1 they can save client-specific,
session-oriented state in a persistent store that is available to all instances of the cloned
application. However, the client must be given a token or a key that it can present with
subsequent requests so that whichever application instance services each request can
associate the request with the appropriate client state.
The Microsoft Windows platform provides Network Load Balancing as the basic infrastructure
for building scale-out clusters with the ability to spray client requests across the nodes.
Application Center provides a single image view of management for these environments.

Figure 4: Cloned Application

In the example above, the same application App is running on each node. Each instance of the
application is accessing the same data set, in this case the data set A-Z. This example shows
each instance having access to its own copy of the data set (created and kept consistent using
some form of staging or replication technique); some applications can instead share the same
data set (where the data is made available to the cluster through a file share, for example).
Figure 5: Cloned Application using a file share

• Partitioned Applications. Applications that have long-running in-memory state or have large, frequently updated data sets cannot be easily cloned. These applications are
typically called stateful applications. The cost of keeping the data consistent across many
instances of the application would be prohibitive.
Fortunately, however, many of these applications have data sets or functionality that can be
readily partitioned. For example, a large file server can be partitioned by dividing the files
along the directory structure hierarchy or a large customer database can be partitioned along
customer number or customer name boundaries (customers from A to L in one database,
customers from M to Z in another database, for example). In other cases, the functionality
can be partitioned or componentized. Once an application is partitioned, the different
partitions can be deployed across a set of nodes in a cluster, thus enabling the complete
application to scale beyond the bounds of a single node. In order to present a single
application image to the clients, a partitioned application requires an application-dependent
decomposition, routing and aggregation mechanism that allows a single client request to be
distributed across the set of partitions and the results from the partitions to be combined into
a single response back to the client. For example, in a partitioned customer database, a
single SQL query to return all of the accounts that have overdue payments requires that the
query be sent to every partition of the database. Each partition will contain a subset of
customer records that must be combined into a single data set to be returned to the client.
Partitioning an application allows it to scale but does not provide high availability. If a node
fails that is hosting a partition of the application, that piece of the data set or that piece of
functionality is no longer accessible.
Partitioning is typically application-specific, since the aggregation mechanism is dependent on
the type of application and the type of data returned. SQL Server, Exchange data stores and
DFS are all examples of applications and services that can be partitioned for scalability.
Figure 6: Partitioned Application - Data Partitioning
In the example above, each node is running multiple instances of the same application against
different pieces of the complete data set. A single request from a client application can span
multiple instances, for example a query to return all the records in the database. This splitting
of the client request across the different application instances, and the aggregation of the
partial results into the single response passed back to the client, can be done either by the
applications on the server cooperating amongst themselves (for example, the SQL Server query
engine) or by the client (as is the case with Exchange 2000 data stores). A sketch of this
scatter/gather pattern appears at the end of this section.
Applications may also be partitioned along functional lines as well as data sets.

Figure 7: Partitioned Application -

Functional Partitions
In the above example, each node in the cluster is performing a different function; however,
the cluster together provides a single, uniform service. One node is providing a catalog
service; one is providing a billing service, etc. The cluster, though, provides a single book
buying service to clients.
Computational clusters are built specifically to support massively parallel applications. In other
words, the applications are written to decompose a problem into many (potentially thousands of)
sub-operations and execute them in parallel across a set of machines. This type of application is
also a partitioned application. Computational clusters typically provide a set of libraries offering
cluster-wide communication and synchronization primitives tailored to the environment (MPI is
one such set of libraries). These clusters have been termed Beowulf clusters. Microsoft has
created the High Performance Computing initiative to provide support for this type of cluster.
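
To make the partitioned, scatter/gather style concrete, here is a minimal Python sketch of a customer data set split into two partitions, with a query sent to every partition and the partial results combined into one response. The partition boundaries, field names and the overdue-payment predicate are invented for illustration; this is not SQL Server or Exchange code.

# Illustrative sketch of a partitioned application: data is split by customer name,
# a query is scattered to every partition, and the partial results are gathered
# into a single response for the client. All names and data here are invented.

PARTITIONS = {
    "A-L": [{"name": "Adams", "overdue": True},  {"name": "Lee", "overdue": False}],
    "M-Z": [{"name": "Meyer", "overdue": False}, {"name": "Zhang", "overdue": True}],
}

def query_partition(rows, predicate):
    """Run the query against a single partition's data set."""
    return [row for row in rows if predicate(row)]

def scatter_gather(predicate):
    """Send the query to every partition and combine the partial results."""
    results = []
    for rows in PARTITIONS.values():
        results.extend(query_partition(rows, predicate))
    return results

if __name__ == "__main__":
    overdue = scatter_gather(lambda row: row["overdue"])
    print([row["name"] for row in overdue])   # ['Adams', 'Zhang']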

Application Deployments

Applications can be deployed in different ways across a cluster depending on what style of application is
being deployed and the number of applications that are being deployed on the same cluster. They can be
characterized as follows:

1. One, single instance application

In this type of deployment, one node of the cluster is executing the application so that it can service
user requests; the other nodes are in standby, waiting to host the application in the event of failure.
This type of deployment is typically suited to 2-node failover clusters and is not typically used for N-
node clusters, since only 1/N of the total capacity is being used. This is what some people term an
active/passive cluster.

2. Several, single instance applications

Several single instance applications can be deployed on the same cluster. Each application is
independent of the others. In a failover cluster, each application can fail over independently of the
others, so in the case of a node hosting multiple applications, if that node fails, the different
applications may fail over to different nodes in the cluster. This type of deployment is typically used
for consolidation scenarios where multiple applications are hosted on a set of nodes and the cluster
provides a highly available environment. In this environment, careful capacity planning is required to
ensure that, in the event of failures, the nodes have sufficient capacity (CPU, memory, IO bandwidth,
etc.) to support the increased load.

3. A single, multiple instance application

In this environment, while the cluster is only supporting one application, various pieces of the
application are running on the different nodes. In the case of a cloned application, each node is
running the same code against the same data. This is the typical web front-end scenario (all of the
nodes in the cluster are identical). In the event of a failure, the capacity is reduced. In a partitioned
application, the individual partitions are typically deployed across the nodes in the cluster. If a failure
occurs, multiple partitions may be hosted on a single node. Careful planning is required to ensure
that, in the event of a failure, the application SLA is achieved. Take a 4-node cluster as an example. If
an application is partitioned into four pieces, then in normal running, each node would host a
partition. In the event that one node failed, two nodes would continue to host one partition each, and
the remaining node would end up hosting two partitions, potentially giving that node twice the load. If
the application were split into 12 partitions, each node would host three partitions in the normal case.
In the event of a failure, each partition could be configured to fail over to a different node, so that
the remaining nodes each host four partitions, spreading the load in the event of a failure. The cost,
however, is that supporting 12 partitions may involve more overhead than supporting four partitions.
(The arithmetic behind this trade-off is sketched after this list.)
4. Several, multiple instance applications

Of course, several multiple instance applications may be deployed on the same cluster (indeed, single
instance and multiple instance applications may be deployed on the same cluster). In the case of
multiple cloned applications, each node simply runs one instance of each application. In the case of a
partitioned application, capacity planning, as well as defining failover targets, becomes increasingly
complex.
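
The four-versus-twelve partition trade-off described in item 3 above reduces to simple arithmetic. The sketch below is an assumed illustration of that reasoning; it only counts partitions per surviving node and ignores any per-partition overhead.

# Illustrative sketch: worst-case number of partitions per surviving node when a
# partitioned application is spread as evenly as possible across a cluster.

import math

def worst_case_partitions(total_partitions, nodes, failures):
    survivors = nodes - failures
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return math.ceil(total_partitions / survivors)

if __name__ == "__main__":
    # 4 partitions on 4 nodes: after one failure, some node hosts 2 partitions (twice its normal load).
    print(worst_case_partitions(4, 4, 1))    # 2
    # 12 partitions on 4 nodes: normally 3 per node, after one failure 4 per node (1.33x normal load).
    print(worst_case_partitions(12, 4, 0))   # 3
    print(worst_case_partitions(12, 4, 1))   # 4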

Failover Policies

Failover is the mechanism that single instance applications and the individual partitions of a partitioned
application typically employ for high availability (the term Pack has been coined to describe a highly
available, single instance application or partition).

In a 2-node cluster, defining failover policies is trivial. If one node fails, the only option is to failover to the
remaining node. As the size of a cluster increases, different failover policies are possible and each one has
different characteristics.

Failover Pairs
In a large cluster, failover policies can be defined such that each application is set to fail over between two
nodes. The simple example below shows two applications, App1 and App2, in a 4-node cluster.

Figure 8: Failover pairs

This configuration has pros and cons:

Pro Good for clusters that are supporting heavy-weight2 applications, such as databases. This
configuration ensures that in the event of failure, two applications will not be hosted on the same
node.

Pro Very easy to plan capacity. Each node is sized based on the application that it will need to host
(just like a 2-node cluster hosting one application).
Pro Effect of a node failure on availability and performance of the system is very easy to determine.

Pro Get the flexibility of a larger cluster. In the event that a node is taken out for maintenance, the
buddy for a given application can be changed dynamically (may end up with standby policy
below).

Con In simple configurations, such as the one above, only 50% of the capacity of the cluster is in use.

Con Administrator intervention may be required in the event of multiple failures.

Failover pairs are supported by server clusters on all versions of Windows by limiting the possible owner
list for each resource to a given pair of nodes.

Hot-Standby Server
To reduce the overhead of failover pairs, the spare node for each pair may be consolidated into a single
node, providing a hot standby server that is capable of picking up the work in the event of a failure.

Figure 9: Standby Server

The standby server configuration has pros and cons:

Pro Good for clusters that are supporting heavy-weight applications such as databases. This
configuration ensures that in the event of a single failure, two applications will not be hosted on
the same node.

Pro Very easy to plan capacity. Each node is sized based on the application that it will need to host,
the spare is sized to be the maximum of the other nodes.

Pro Effect of a node failure on availability and performance of the system is very easy to determine.

Con Configuration is designed around handling a single failure at a time.


Con Does not really handle multiple failures well. This may be an issue during scheduled maintenance
where the spare may be in use.

Server clusters support standby servers today using a combination of the possible owners list and the
preferred owners list. The preferred node should be set to the node that the application will run on by
default and the possible owners for a given resource should be set to the preferred node and the spare
node.

N+I
Standby server works well for 4-node clusters in some configurations; however, its ability to handle
multiple failures is limited. N+I configurations are an extension of the standby server concept where there
are N nodes hosting applications and I nodes that act as spares.

Figure 10: N+I Spare node configuration

N+I configurations have the following pros and cons:

Pro Good for clusters that are supporting heavy-weight applications such as databases or Exchange.
This configuration ensures that in the event of a failure, an application instance will failover to a
spare node, not one that is already in use.

Pro Very easy to plan capacity. Each node is sized based on the application that it will need to host.

Pro Effect of a node failure on availability and performance of the system is very easy to determine.

Pro Configuration works well for multiple failures.

Con Does not really handle multiple applications running in the same cluster well. This policy is best
suited to applications running on a dedicated cluster.
Server clusters support N+I scenarios in the Windows Server 2003 release using a cluster group public
property, AntiAffinityClassNames. This property can contain an arbitrary string of characters. In the
event of a failover, if the group being failed over has a non-empty string in the AntiAffinityClassNames
property, the failover manager will check all other nodes. If there are any nodes in the possible owners list
for the resource that are NOT hosting a group with the same value in AntiAffinityClassNames, then
those nodes are considered good targets for failover. If all nodes in the cluster are hosting groups that
contain the same value in the AntiAffinityClassNames property, then the preferred node list is used to
select a failover target.
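
The selection logic described above can be modeled in a few lines. The following sketch is a simplified illustration of that documented behavior, not the actual failover manager implementation; the node and group names are assumptions.

# Illustrative model of the AntiAffinityClassNames behavior described above:
# prefer possible-owner nodes that are not already hosting a group with the same
# anti-affinity value; otherwise fall back to the preferred node list.

def choose_failover_node(group, possible_owners, preferred_list, hosted_groups):
    """hosted_groups: dict of node -> list of AntiAffinityClassNames values hosted there."""
    anti = group.get("AntiAffinityClassNames", "")
    if anti:
        friendly = [n for n in possible_owners
                    if anti not in hosted_groups.get(n, [])]
        if friendly:
            return friendly[0]
    # All candidate nodes host a conflicting group (or no anti-affinity value is set):
    # fall back to the first available node on the preferred node list.
    return next((n for n in preferred_list if n in possible_owners), None)

if __name__ == "__main__":
    hosted = {"N2": ["SQL"], "N3": [], "N4": ["SQL"]}
    group = {"name": "SQL-Instance-1", "AntiAffinityClassNames": "SQL"}
    print(choose_failover_node(group, ["N2", "N3", "N4"], ["N2", "N3", "N4"], hosted))  # N3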

Failover Ring
Failover rings allow each node in the cluster to run an application instance. In the event of a failure, the
application on the failed node is moved to the next node in sequence.

Figure 11: Failover Ring

This configuration has pros and cons:

Pro Good for clusters that are supporting several small application instances where the capacity of any
node is large enough to support several at the same time.

Pro Effect on performance of a node failure is easy to predict.

Pro Easy to plan capacity for a single failure.

Con Configuration does not work well for all cases of multiple failures. If Node 1 fails, Node 2 will host
two application instances and Nodes 3 and 4 will host one application instance. If Node 2 then
fails, Node 3 will be hosting three application instances and Node 4 will be hosting one instance.
Con Not well suited to heavy-weight applications since multiple instances may end up being hosted on
the same node even if there are lightly-loaded nodes.

Failover rings are supported by server clusters on the Windows Server 2003 release. This is done by
defining the order of failover for a given group using the preferred owner list. A node order should be
chosen and then the preferred node list should be set up with each group starting at a different node.
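
One way to picture this setup is to rotate a fixed node order for each group, as in the minimal sketch below. The node and group names are assumptions, and in practice the preferred owner lists are configured through the cluster administration tools rather than code.

# Illustrative sketch: build a rotated preferred-owner list for each group so that
# each group starts on a different node, giving the failover-ring behavior.

def ring_preferred_lists(nodes, groups):
    lists = {}
    for i, group in enumerate(groups):
        start = i % len(nodes)
        lists[group] = nodes[start:] + nodes[:start]
    return lists

if __name__ == "__main__":
    nodes = ["Node1", "Node2", "Node3", "Node4"]
    groups = ["App1", "App2", "App3", "App4"]
    for group, order in ring_preferred_lists(nodes, groups).items():
        print(group, order)
    # App1 -> [Node1, Node2, Node3, Node4], App2 -> [Node2, Node3, Node4, Node1], ...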

Random
In large clusters or even 4-node clusters that are running several applications, defining specific failover
targets or policies for each application instance can be extremely cumbersome and error prone. The best
policy in some cases is to allow the target to be chosen at random, with a statistical probability that this
will spread the load around the cluster in the event of a failure.

Random failover policies have pros and cons:

Pro Good for clusters that are supporting several small application instances where the capacity of any
node is large enough to support several at the same time.

Pro Does not require an administrator to decide where any given application should failover to.

Pro Provided that there are sufficient applications or the applications are partitioned finely enough,
this provides a good mechanism to statistically load-balance the applications across the cluster in
the event of a failure.

Pro Configuration works well for multiple failures.

Pro Well suited to handling multiple applications or many instances of the same application
running in the same cluster.

Con Can be difficult to plan capacity. There is no real guarantee that the load will be balanced across
the cluster.

Con Effect on performance of a node failure is not easy to predict.

Con Not well suited to heavy-weight applications since multiple instances may end up being hosted on
the same node even if there are lightly-loaded nodes.

The Windows Server 2003 release of server clusters randomizes the failover target in the event of node
failure. Each resource group that has an empty preferred owners list will be failed over to a random node
in the cluster in the event that the node currently hosting it fails.

Customized control
There are some cases where specific nodes may be preferred for a given application instance.

A configuration that ties applications to nodes has pros and cons:


Pro Administrator has full control over what happens when a failure occurs.

Pro Capacity planning is easy, since failure scenarios are predictable.

Con With many applications running in a cluster, defining a good policy for failures can be extremely
complex.

Con Very hard to plan for multiple, cascaded failures.

Server clusters provide full control over the order of failover using the preferred node list feature. The full
semantics of the preferred node list can be defined as:

Preferred node list contains all nodes in the cluster
  Move group to best possible node (initiated via administrator): Group is moved to the highest node in the preferred node list that is up and running in the cluster.
  Failover due to node or group failure: Group is moved to the next node on the preferred node list.

Preferred node list contains a subset of the nodes in the cluster
  Move group to best possible node (initiated via administrator): Group is moved to the highest node in the preferred node list that is up and running in the cluster. If no nodes in the preferred node list are up and running, the group is moved to a random node.
  Failover due to node or group failure: Group is moved to the next node on the preferred node list. If the node that was hosting the group is the last on the list or was not in the preferred node list, the group is moved to a random node.

Preferred node list is empty
  Move group to best possible node (initiated via administrator): Group is moved to a random node.
  Failover due to node or group failure: Group is moved to a random node.
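
The table above can be read as a small decision procedure. The sketch below models it directly; it is a simplified illustration only (it ignores possible-owner restrictions and other real-world constraints) and is not the actual cluster service logic.

# Illustrative model of the preferred-node-list semantics in the table above.

import random

def move_to_best_possible(preferred, up_nodes):
    """Administrator-initiated move: highest (first) preferred node that is up, else a random node."""
    for node in preferred:
        if node in up_nodes:
            return node
    return random.choice(list(up_nodes))

def failover_target(preferred, current, up_nodes):
    """Failover due to node/group failure: next node on the list after the current host."""
    if current in preferred and preferred.index(current) + 1 < len(preferred):
        candidate = preferred[preferred.index(current) + 1]
        if candidate in up_nodes:
            return candidate
    # Empty list, host not on the list, or host was last on the list: pick a random node.
    return random.choice(list(up_nodes))

if __name__ == "__main__":
    up = {"B", "C", "D"}
    print(move_to_best_possible(["A", "B", "C"], up))   # B (A is down)
    print(failover_target(["A", "B", "C"], "A", up))    # B
    print(failover_target([], "A", up))                 # random surviving node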

1 This is really a misnomer; the applications do have state, but the state does not span individual client requests.

2 A heavy-weight application is one that consumes a significant amount of system resources, such as CPU, memory or IO bandwidth.

Load Balancing
Updated: January 1, 2003

Load balancing is a term that means many different things. Within a cluster environment, load balancing is
typically used in two different ways:

• Spreading client requests across a set of clones: In Windows, Network Load Balancing (NLB)
provides the capability to spread client TCP/IP requests across a set of nodes (typically Web
servers, ISA Server arrays, or Terminal Server machines). This functionality can also be provided
by external network infrastructure such as the Cisco LocalDirector.

• Optimizing the compute resources across a failover cluster: A cluster that supports failover allows
application instances to move around the cluster. Historically, this has been used to ensure that,
on failure, the application can be restarted elsewhere. However, this same mechanism could be
combined with performance counters and application load metrics to provide a way to balance the
load of applications across the nodes in a cluster. This can be done in one of two ways:

• Manual load balancing: Each group in a cluster can be moved around independently of
the other cluster groups. By breaking down applications into finely defined groups where
possible, manual load balancing can be done using the cluster administration tools. If the load
on a given node is high relative to the other nodes in the cluster, one or more groups can be
manually moved to other cluster nodes (a sketch of this decision appears after this list).

• Dynamic load balancing: Today, server clusters do not provide any load balancing
mechanism (beyond a randomized failover policy in Windows Server 2003). Dynamic load
balancing is an extremely complex issue; while moving an application may at first appear to
make sense, the cost of that move can be hard to determine. For example, moving SQL
Server not only incurs the cost of having to re-mount the database, it also loses the
in-memory cache and state, which may have a significant impact on performance from the
client perspective.
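
As a simple illustration of the manual approach described above, the following sketch picks a group to move from the busiest node to the least loaded node. All load figures and names are invented, and in a real cluster the move itself would be carried out with the cluster administration tools, not with this code.

# Illustrative sketch of a manual load-balancing decision: find the busiest node
# and suggest moving one of its groups to the least loaded node. All figures are invented.

def suggest_move(node_loads, groups_by_node):
    """node_loads: node -> % utilization; groups_by_node: node -> list of (group, % load)."""
    busiest = max(node_loads, key=node_loads.get)
    idlest = min(node_loads, key=node_loads.get)
    if busiest == idlest or not groups_by_node.get(busiest):
        return None
    group, load = min(groups_by_node[busiest], key=lambda g: g[1])  # move the cheapest group
    return f"Move group '{group}' ({load}%) from {busiest} to {idlest}"

if __name__ == "__main__":
    loads = {"Node1": 85, "Node2": 40, "Node3": 55}
    groups = {"Node1": [("FileShare1", 20), ("Print1", 10)],
              "Node2": [("FileShare2", 15)],
              "Node3": [("SQLGroup", 35)]}
    print(suggest_move(loads, groups))   # Move group 'Print1' (10%) from Node1 to Node2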
