
SQL Server 2005 Clustering
Clustering is a complex technology with lots of messy details. To make it easier to understand,
let's take a look at the big picture of how clustering works. A server cluster is a group of
independent servers that are managed as a single system for higher availability, easier
manageability, and greater scalability.

In this article we take a look at:

- Active nodes and passive nodes
- Shared disk array
- The quorum
- Public and private networks
- The virtual server name
- How a failover works

Active Nodes and Passive Nodes


Although a SQL Server 2005 cluster can support up to eight nodes, clustering actually only
occurs between two nodes at a time. This is because a single SQL Server 2005 instance can only
run on a single node at a time, and should a failover occur, the failed instance can only fail over
to another individual node. This adds up to two nodes. Clusters of three or more nodes are only
used where you need to cluster multiple instances of SQL Server 2005.

In a two-node SQL Server 2005 cluster, one of the physical server nodes is referred to as the
active node, and the other one is referred to as the passive node. It doesn't matter which physical
server in the cluster is designated as the active or the passive, but it is easier, from an
administrative point of view, to assign one node as the active and the other as the
passive. This way, you won't get confused about which physical server is performing which role
at the current time.

When we refer to an active node, we mean that this particular node is currently running an active
instance of SQL Server 2005 and that it is accessing the instance's databases, which are located
on a shared data array.

When we refer to a passive node, we mean that this particular node is not currently in production
and it is not accessing the instance's databases. When the passive node is not in production, it is
in a state of readiness, so that if the active node fails, and a failover occurs, it can automatically
go into production and begin accessing the instance's databases located on the shared disk array.
In this case, the passive node then becomes the active node, and the formerly active node now
becomes the passive node (or failed node should a failure occur that prevents it from operating).
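
If you want to see which role each physical server is playing right now, the instance itself can tell you: SERVERPROPERTY('ComputerNamePhysicalNetBIOS') reports the physical node the instance is currently running on, and SERVERPROPERTY('IsClustered') reports whether the instance is clustered at all. Below is a minimal sketch of querying both, assuming Python with the pyodbc module; the virtual server name VSQL01 is a hypothetical example, not something from this article.

import pyodbc

# "VSQL01" is a hypothetical virtual server name; adjust the connection string
# for your environment (assumes the Windows "SQL Server" ODBC driver and
# Windows authentication).
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=VSQL01;DATABASE=master;Trusted_Connection=yes"
)
row = conn.cursor().execute(
    "SELECT CAST(SERVERPROPERTY('IsClustered') AS int), "
    "       CAST(SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS nvarchar(128))"
).fetchone()

print("Clustered instance:", bool(row[0]))
print("Currently active node:", row[1])
conn.close()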


Shared Disk Array


So what is a shared disk array? Unlike non-clustered SQL Server 2005 instances, which usually
store their databases on locally attached disk storage, clustered SQL Server 2005 instances store
data on a shared disk array. By shared, we mean that both nodes of the cluster are physically
connected to the disk array, but that only the active node can access the instance's databases.
There is never a case where both nodes of a cluster are accessing an instance's databases at the
same time. This is to ensure the integrity of the databases.

Generally speaking, a shared disk array is a SCSI- or fiber-connected RAID 5 or RAID 10 disk
array housed in a stand-alone unit, or it might be a SAN. This shared array must have at least two
logical partitions. One partition is used for storing the clustered instance's SQL Server databases,
and the other is used for the quorum.
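
One way to see this layout from the SQL Server side is to list the instance's database files; on a clustered instance the physical paths resolve to drives on the shared array rather than to local disks. A minimal sketch, assuming Python with pyodbc and the hypothetical virtual server name VSQL01 from the previous example:

import pyodbc

# List every database file the instance knows about. On a clustered instance
# these paths point at the shared disk array, not at local storage.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=VSQL01;DATABASE=master;Trusted_Connection=yes"
)
for db, name, path in conn.cursor().execute(
    "SELECT DB_NAME(database_id), name, physical_name FROM sys.master_files"
):
    print(db, name, path)
conn.close()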

The Quorum

When both nodes of a cluster are up and running, participating in their relevant roles (active and
passive), they communicate with each other over the network. For example, if you change a
configuration setting on the active node, this configuration change is automatically sent to the
passive node and the same change is made there. This generally occurs very quickly, and ensures that
both nodes are synchronized.

But, as you might imagine, it is possible for the active node to fail after you make a change on it
but before that change has been sent over the network and applied to the passive node (which
will become the active node after the failover). In that case, the change never reaches the
passive node. Depending on the nature of the change, this could cause problems, even causing
both nodes of the cluster to fail.

To prevent this from happening, a SQL Server 2005 cluster uses what is called a quorum, which
is stored on the quorum drive of the shared array. A quorum is essentially a log file, similar in
concept to database logs. Its purpose is to record any change made on the active node. Should a
change recorded here fail to reach the passive node because the active node has failed and cannot
send it over the network, then the passive node can read the quorum file during failover, find out
what the change was, and apply it before it takes over as the new active node.

In order for this to work, the quorum file must reside on what is called the quorum drive. A
quorum drive is a logical drive on the shared array devoted to the function of storing the quorum.


Public and Private Networks

Each node of a cluster must have at least two network cards. One network card will be connected
to the public network, and the other to a private network.

The public network is the network to which the SQL Server 2005 clients are attached, and this is how
they communicate with a clustered SQL Server 2005 instance.

The private network is used solely for communications between the nodes of the cluster. It is
used mainly for what is called the heartbeat signal. In a cluster, the active node puts out a
heartbeat signal, which tells the other nodes in the cluster that it is working. Should the heartbeat
signal stop, a passive node in the cluster becomes aware that the active node has failed and
that it should initiate a failover so that it can become the active node and take control
of the SQL Server 2005 instance.
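
The heartbeat and failover logic belongs to the Windows clustering service itself, but the idea is simple enough to sketch. The following Python fragment is a conceptual illustration only, not how MSCS actually works; the timeout value and the two callback functions are assumptions made up for the example.

import time

HEARTBEAT_TIMEOUT_SECONDS = 5  # illustrative value only, not the real MSCS setting


def monitor_heartbeat(receive_heartbeat, initiate_failover):
    """Conceptual loop run by the passive node over the private network.

    receive_heartbeat() is assumed to return True when a heartbeat arrives;
    initiate_failover() stands in for taking over the shared array, quorum,
    and virtual server name.
    """
    last_seen = time.time()
    while True:
        if receive_heartbeat():
            last_seen = time.time()                  # active node is alive
        elif time.time() - last_seen > HEARTBEAT_TIMEOUT_SECONDS:
            initiate_failover()                      # assume the active node failed
            return


if __name__ == "__main__":
    # Toy demonstration with stubs: a few heartbeats arrive, then they stop.
    beats = iter([True, True, True])

    def fake_receive():
        time.sleep(1)                                # stand-in for a network wait
        return next(beats, False)

    monitor_heartbeat(fake_receive, lambda: print("failover: taking over as active node"))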

The Virtual Server Name


One of the biggest mysteries of clustering is how clients know when and how to switch
communication from a failed cluster node to the new active node. The answer may be a
surprise: they don't. SQL Server 2005 clients don't need to know anything about
specific nodes of a cluster (such as the NETBIOS name or IP address of individual cluster
nodes). This is because each clustered SQL Server 2005 instance is given a virtual name and IP
address, which clients use to connect to the cluster. In other words, clients don't connect to a
node's specific name or IP address, but instead connect to a virtual name and IP address that
stays the same no matter what node in a cluster is active.

When you create a cluster, one of the steps is to create a virtual cluster name and IP address. This
name and IP address are used by the active node to communicate with clients. Should a failover
occur, then the new active node uses this same virtual name and IP address to communicate with
clients. This way, clients only need to know the virtual name or IP address of the clustered
instance of SQL Server, and a failover between nodes doesn't change this. At worst, when a
failover occurs, there may be an interruption of service from the client to the clustered SQL
Server 2005 instance, but once the failover has occurred, the client can once again reconnect to
the instance using the same virtual name or IP address.
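
Concretely, a client's connection string names only the virtual server (and, for a named instance, the instance name); no physical node appears anywhere. A minimal sketch, assuming Python with pyodbc and a hypothetical virtual name VSQL01 hosting a named instance SQL2005:

import pyodbc

# The SERVER value is the cluster's virtual name (here a hypothetical
# VSQL01\SQL2005), never the NetBIOS name of an individual node. The same
# string keeps working no matter which physical node currently owns the instance.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=VSQL01\\SQL2005;"
    "DATABASE=master;Trusted_Connection=yes"
)
print(conn.cursor().execute("SELECT @@SERVERNAME").fetchone()[0])
conn.close()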


How a Failover Works




While there can be many different causes of a failover, let's look at the case where the power
stops for the active node of a cluster and the passive node has to take over. This will provide a
general overview of how a failover occurs.

Let's assume that a single SQL Server 2005 instance is running on the active node of a cluster,
and that a passive node is ready to take over when needed. At this time, the active node is
communicating with both the database and the quorum on the shared array. Because only a
single node at a time can be communicating with the shared array, the passive node is not
communicating with the database or the quorum. In addition, the active node is sending out
heartbeat signals over the private network, and the passive node is monitoring them to see if they
stop. Clients are also interacting with the active node via the virtual name and IP address,
running production transactions.

Now, for whatever reason, the active node stops working because it is no longer receiving any
electricity. The passive node, which is monitoring the heartbeats from the active node, now
notices that it is not receiving the heartbeat signal. After a predetermined delay, the passive node
assumes that the active node has failed and it initiates a failover. As part of the failover process,
the passive node (now the active node) takes over control of the shared array and reads the
quorum, looking for any unsynchronized configuration changes. It also takes over control of the
virtual server name and IP address. In addition, as the node takes over the databases, it has to
perform a SQL Server startup, just as if it were starting after a shutdown, going through
database recovery. The time this takes depends on many factors, including the speed of the
system and the number of transactions that might have to be rolled forward or back during the
database recovery process. Once the recovery process is complete, the new active node
announces itself on the network with the virtual name and IP address, which allows the clients to
reconnect and begin using the SQL Server 2005 instance with minimal interruption.
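
From the client's side, surviving that interruption usually amounts to catching the broken connection and retrying against the same virtual name until recovery completes. A hedged sketch, again assuming Python with pyodbc; the retry count and delay are arbitrary illustration values:

import time
import pyodbc

CONN_STR = ("DRIVER={SQL Server};SERVER=VSQL01;"   # hypothetical virtual name
            "DATABASE=master;Trusted_Connection=yes")


def query_with_retry(sql, attempts=10, delay_seconds=15):
    """Retry a query across a cluster failover.

    The connection string targets the virtual name, so once the new active
    node has finished database recovery the same call simply succeeds again.
    """
    for attempt in range(attempts):
        try:
            conn = pyodbc.connect(CONN_STR, timeout=10)
            try:
                return conn.cursor().execute(sql).fetchall()
            finally:
                conn.close()
        except pyodbc.Error:
            time.sleep(delay_seconds)   # wait out the failover and recovery
    raise RuntimeError("instance did not come back within the retry window")


rows = query_with_retry("SELECT @@SERVERNAME")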

In a Windows environment, server clusters can be defined in two basic ways:

- Active/Active
  - There are multiple independent, redundant servers
  - The load is distributed through round-robin DNS
  - The load is balanced by a load-balancing solution (for example, WLBS)
- Active/Passive
  - Multiple servers are configured to provide a service
  - Only a single server provides the service at any given time
  - Other servers serve as hot spares in case of a server (service) problem
Clustering Tips


Start with a One-Node Cluster

Even if all you need at the moment is a single server, consider creating a one-node cluster. This
gives you the option of upgrading to a cluster later, thus avoiding a rebuild. Just be sure that the
hardware you choose is on the cluster portion of the Windows Catalog.

It's not merely for high availability that you'd want the option to add a node at a later date.
Consider what happens if you find that your server just doesn't have the necessary capacity. That
translates to a migration, and that takes time and effort. If you have a one-node cluster, migration
is easier, with far less downtime. You add the new node to the cluster, add the SQL Server
binaries and service packs to the new node, and then fail over to the new node. Then you add any
post-service pack updates and, finally, evict the old node. The downtime is only the time it takes
to fail over and add the updates (if any).
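
If you script that migration, both the group move and the eviction of the old node can be driven through CLUSTER.EXE. The following Python outline is only a sketch: the group and node names are hypothetical, and you should verify the exact CLUSTER.EXE switches against your Windows Server version before relying on them.

import subprocess

# Hypothetical names; substitute your own cluster group and node names.
SQL_GROUP = "SQL Server (MSSQLSERVER)"
NEW_NODE = "NEWNODE"
OLD_NODE = "OLDNODE"


def run(args):
    print(">", " ".join(args))
    subprocess.run(args, check=True)


# Move the SQL Server group to the new node, then evict the old one.
# Verify these CLUSTER.EXE switches on your Windows Server version first.
run(["cluster", "group", SQL_GROUP, "/moveto:" + NEW_NODE])
run(["cluster", "node", OLD_NODE, "/evict"])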

Adding and Rebuilding Nodes
Since all nodes in a cluster must be the same, you'll want to act sooner, rather than later, to get
that extra node. If you wait too long, the node may go out of production. On one project, I had to
rebuild a node in a SQL Server 2000 cluster. I had the OS/network admin handle the basic
machine build, then I jumped in to add it back to the cluster and prepare it for service as a SQL
Server node. All went well until I failed over to the new node. Much to my dismay, it failed right
back. To make a long story short, although I had prepared a detailed document on building a new
cluster, including adding the cluster service and SQL Server service accounts to both nodes, the
document wasn't followed explicitly. The admin didn't add those service accounts to the rebuilt
node, so the privileges they had before the rebuild no longer existed.
It took me a long time to track that one down. One day it occurred to me to look at local group
membership. Once I added the two accounts, failover went smoothly. Then I got to thinking.
Rebuilding a node is something you don't do frequently and, if you do, it's an emergency.
Although I had a document in place, it wasn't used. We could have automated the security part
by simply writing a brief script to add those two accounts and make any other necessary
customizations. Things have improved in SQL Server 2005, though. The installer requires you to
set domain-level groups for the SQL Server service accounts.
Of course, this got me thinking even more. You can create scripts that invoke CLUSTER.EXE to
add the node to your Microsoft® Cluster Server (MSCS) cluster. All you have to do is feed the
script the name of the node and it can handle the rest. In an emergency, automation is really your
friend.
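
The same approach covers the security step from the rebuild story above: a short script that adds the cluster and SQL Server service accounts to the required local group means nobody can skip that step. A minimal sketch in Python; the group and account names are hypothetical placeholders.

import subprocess

LOCAL_GROUP = "SQLServerAdmins"                            # hypothetical local group
SERVICE_ACCOUNTS = [r"CORP\svcCluster", r"CORP\svcSQL"]    # hypothetical accounts

# "net localgroup <group> <account> /add" is the standard Windows command for
# adding a domain account to a local group on the node being rebuilt.
for account in SERVICE_ACCOUNTS:
    subprocess.run(
        ["net", "localgroup", LOCAL_GROUP, account, "/add"],
        check=True,
    )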

N+1 Clusters
Sometimes, the reason for adding a node to a cluster isn't that you're replacing a node. You could
be adding more SQL Server instances to your cluster and each instance needs separate disk
resources. Though multiple instances can run on a single node, they would be sharing CPU and
RAM, and that could spell poor performance. Ideally, only a single instance should run on a
single node. How do you ensure that when you fail over? Simple: one node has
no services running on it, while the other nodes each run one SQL Server instance. In fact, that's
the definition of an N+1 cluster: N instances running on N+1 nodes. The extra node is the
backup.

Virtualization and Clustering

Virtualization allows you to run one or more operating systems concurrently on a single physical
server. Virtualization software adds another layer of capabilities to the cluster concept because
you can cluster the software. Consequently, if the server on which the host is running fails, then
it, along with its guest OSs, fails over to a backup node. This could be a very easy way to migrate a
guest server. Plus, the guest OS does not have to be cluster-capable. Thus, you could run SQL Server
Workgroup Edition inside a guest Windows Server 2003 installation, running on Microsoft Virtual
Server 2005 on a cluster.

Diagram: Using a virtual server

Figure: Basic clustering concepts

This figure illustrates how server clustering can make two or more servers
(Server 1 through Server N) appear as one virtual resource to a dependent
application.

Asymmetric Clusters

In an asymmetric cluster, a standby server exists only to take over for another server in the event
of failure. This type of cluster is usually used to provide high availability and scalability for
read/write stores such as databases, messaging systems, and file and print services. If one of the
nodes in a cluster becomes unavailable, due to either planned downtime for maintenance or
unplanned downtime due to failure, another node takes over the function of the failed node.

The standby server performs no other useful work and is either as capable as or less capable than
a primary server. A less capable, less expensive standby server is often used when primary
servers are configured for high availability and fault tolerance with multiple redundant
subsystems. One common type of asymmetric cluster is known as a failover cluster (see the
Failover Cluster pattern).

Symmetric Clusters

A symmetric cluster presents a virtual resource to an application. Requests are
divided among healthy servers to distribute load and increase scalability.

One common type of symmetric cluster is a load-balanced cluster (see the Load-Balanced Cluster
pattern). Load-balanced clusters enhance the performance, availability, and scalability of
services such as Web servers, media servers, VPN servers, and read-only stores by distributing
requests across all of the healthy servers in the server cluster.
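
Conceptually, the request distribution in a load-balanced cluster is just "send each incoming request to the next healthy server." The Python sketch below illustrates only that round-robin idea; it is not how Windows NLB or any particular load balancer is implemented, and the server names are made up.

import itertools

SERVERS = ["web1", "web2", "web3"]      # hypothetical cluster members


def round_robin_dispatcher(healthy):
    """Yield the next healthy server for each incoming request."""
    cycle = itertools.cycle(SERVERS)
    while True:
        server = next(cycle)
        if server in healthy:           # skip members that failed health checks
            yield server


dispatcher = round_robin_dispatcher(healthy={"web1", "web3"})
for _ in range(4):
    print("route request to", next(dispatcher))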

Server clustering results in the following benefits and liabilities:
Benefits

- Improved scalability. Server clustering enables applications to handle more load.

- Higher availability. Server clustering helps applications avoid interruptions in service.

- Greater flexibility. The ability of clustering to present a virtual unified computing resource
provides IT personnel with more options for configuring the infrastructure to support application
performance, availability, and scalability requirements.

Liabilities

- Increased infrastructure complexity. Some clustering designs significantly increase the
complexity of your solution, which may affect operational and support requirements. For
example, clustering can increase the numbers of servers to manage, storage devices to maintain,
and network connections to configure and monitor.

- Additional design and code requirements. Applications may require specific design and
coding changes to function properly when used in an infrastructure that uses clustering. For
example, the need to manage session state can become more difficult across multiple servers and
could require coding changes to accommodate maintaining state so that session information is
not lost if a server fails.

- Incompatibility. An existing application or application component may not be able to support
clustering technologies. For example, a limitation in the technology used to develop the
application or component may not support clustering even through code changes.
