
What is Cluster Manager (cman)?

It depends on which version of the code you are running. Basically, cluster manager
is a component of the cluster project that handles communications between nodes
in the cluster.
In the latest cluster code, cman is just a userland program that interfaces with the
OpenAIS membership and messaging system.
In the previous versions, cman was a kernel module whose job was to keep a
"heartbeat" message moving throughout the cluster, letting all the nodes know that
the others are alive.
It also handles cluster membership messages, determining when a node enters or
leaves the cluster.
What does Quorum mean and why is it necessary?
Quorum is a voting algorithm used by the cluster manager.
A cluster can only function correctly if there is general agreement between the
members about things. We say a cluster has 'quorum' if a majority of nodes are
alive, communicating, and agree on the active cluster members. So in a thirteen-node
cluster, quorum is reached only if seven or more nodes are communicating. If
the seventh node dies, the cluster loses quorum and can no longer function.
It's necessary for a cluster to maintain quorum to prevent 'split-brain' problems. If
we didn't enforce quorum, a communication error on that same thirteen-node
cluster could cause a situation where six nodes are operating on the shared disk, and
another six are also operating on it, independently. Because of the communication
error, the two partial-clusters would overwrite areas of the disk and corrupt the file
system. With quorum rules enforced, only one of the partial clusters can use the
shared storage, thus protecting data integrity.
Quorum doesn't prevent split-brain situations, but it does decide who is dominant
and allowed to function in the cluster. Should split-brain occur, quorum prevents
more than one cluster group from doing anything.
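The majority rule above is simple integer arithmetic. As a rough sketch (the node counts are illustrative, not from any particular cluster), the quorum threshold for a cluster where every node has one vote is:

```shell
# Quorum is a strict majority of expected votes: floor(votes / 2) + 1.
expected_votes=13                        # thirteen-node cluster, one vote each
quorum=$(( expected_votes / 2 + 1 ))
echo "quorum threshold: $quorum nodes"   # 7 for a 13-node cluster
```

With six nodes on each side of an even split, neither partition reaches the threshold of seven, so neither partition is quorate.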
How can I define a two-node cluster if a majority is needed to reach
quorum?
We had to allow two-node clusters, so we made a special exception to the quorum
rules. There is a special setting "two_node" in the /etc/cluster/cluster.conf file that
looks like this:
<cman expected_votes="1" two_node="1"/>

This will allow one node to be considered enough to establish a quorum. Note that if
you configure a quorum disk/partition, you don't want two_node="1".
What is a tie-breaker, and do I need one in two-node clusters?
Tie-breakers are additional heuristics that allow a cluster partition to decide whether
or not it is quorate in the event of an even split, prior to fencing. A typical
tie-breaker construct is an IP tie-breaker, sometimes called a ping node. With such a
tie-breaker, nodes not only monitor each other, but also an upstream router that is
on the same path as cluster communications. If the two nodes lose contact with each
other, the one that wins is the one that can still ping the upstream router. Of course,
there are cases, such as a switch loop, where it is possible for two nodes to see the
upstream router but not each other, causing what is called a split brain. This is why
fencing is required in cases where tie-breakers are used.
Other types of tie-breakers include where a shared partition, often called a quorum
disk, provides additional details. clumanager 1.2.x (Red Hat Cluster Suite 3) had a
disk tie-breaker that allowed operation if the network went down as long as both
nodes were still communicating over the shared partition.
More complex tie-breaker schemes exist, such as QDisk (part of linux-cluster). QDisk
allows arbitrary heuristics to be specified. These allow each node to determine its
own fitness for participation in the cluster. It is often used as a simple IP tie-breaker,
however. See the qdisk(5) manual page for more information.
CMAN has no internal tie-breakers for various reasons. However, tie-breakers can be
implemented using the API. This API allows quorum device registration and
updating. For an example, look at the QDisk source code.
You might need a tie-breaker if you:

Have a two node configuration with the fence devices on a different network
path than the path used for cluster communication

Have a two node configuration where fencing is at the fabric level - especially
for SCSI reservations

However, if you have a correct network & fencing configuration in your cluster, a tie-breaker only adds complexity, except in pathological cases.
If both nodes in a two-node cluster lose contact with each other, don't
they try to fence each other?

They do. When each node recognizes that the other has stopped responding, it will
try to fence the other. It can be like a gunfight at the O.K. Corral, and the node that's
quickest on the draw (first to fence the other) wins. Unfortunately, both nodes can
end up going down simultaneously, losing the whole cluster.
It's possible to avoid this by using a network power switch that serializes the two
fencing operations. That ensures that one node is rebooted and the second never
fences the first. For other configurations, see below.
What is the best two-node network & fencing configuration?
In a two node cluster (where you are using two_node="1" in the cluster
configuration, and w/o QDisk), there are several considerations you need to be
aware of:
If you are using per-node power management of any sort where the device is not
shared between cluster nodes, it must be connected to the same network used by
CMAN for cluster communication. Failure to do so can result in both nodes
simultaneously fencing each other, leaving the entire cluster dead, or end up in a
fence loop. Typically, this includes all integrated power management solutions (iLO,
IPMI, RSA, ERA, IBM Blade Center, Egenera Blade Frame, Dell DRAC, etc.), but also
includes remote power switches (APC, WTI) if the devices are not shared between
the two nodes.
It is best to use power-type fencing. SAN or SCSI-reservation fencing might work, as
long as it meets the above requirements. If you cannot meet those requirements, you
should consider using a quorum disk or partition.
What if the fenced node comes back up and still can't contact the other?
Will it corrupt my file system?
The two_node cluster.conf option allows one node to have quorum by itself. A
network partition between the nodes won't result in a corrupt fs because each node
will try to fence the other when it comes up prior to mounting gfs.
Strangely, if you have a persistent network problem and the fencing device is still
accessible to both nodes, this can result in a "A reboots B, B reboots A" fencing
loop.
This problem can be worked around by using a quorum disk or partition to break the
tie, or by using a specific network & fencing configuration.
I lost quorum on my six-node cluster, but my remaining three nodes can
still write to my GFS volume. Did you just lie?
It's possible to still write to a GFS volume, even without quorum, but ONLY if the
three nodes that left the cluster didn't have the GFS volume mounted. It's not a
problem because if a partitioned cluster is ever formed that gains quorum, it will
fence the nodes in the inquorate partition before doing anything.
If, on the other hand, nodes failed while they had gfs mounted and quorum was lost,
then gfs activity on the remaining nodes will be mostly blocked. If it's not then it
may be a bug.
Can I have a mixed cluster with some nodes at RHEL4U1 and some at
RHEL4U2?
You can't mix RHEL4 U1 and U2 systems in a cluster because there were changes
between U1 and U2 that changed the format of internal messages that are sent
around the cluster.
Since U2, we now require these messages to be backward-compatible, so mixing U2
and U3 or U3 and U4 shouldn't be a problem.
How do I add a third node to my two-node cluster?
Unfortunately, two-node clusters are a special case. A two-node cluster needs two
nodes to establish quorum, but only one node to maintain quorum. This special
status is set by a special "two_node" option in the cman section of cluster.conf.
Unfortunately, this setting can only be reset by shutting down the cluster. Therefore,
the only way to add a third node is to:

Shut down the cluster software on both nodes.

Add the third node into your /etc/cluster/cluster.conf file.

Get rid of the two_node="1" option in cluster.conf.

Copy the modified cluster.conf to your third node.

Restart all three nodes.

The system-config-cluster gui gets rid of the two_node option automatically when
you add a third node. Also, note that this does not apply to two-node clusters with a
quorum disk/partition. If you have a quorum disk/partition defined, you don't want
to use the two_node option to begin with.
Adding subsequent nodes to a three-or-more node cluster is easy and the cluster
does not need to be stopped to do it.

Add the node to your cluster.conf

Increment the config file version number near the top and save the changes

Do ccs_tool update /etc/cluster/cluster.conf to propagate the file to the cluster

Use cman_tool status | grep "Config version" to get the internal version
number.

Use cman_tool version -r <new config version>.

Start the cluster software on the additional node.
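Putting the steps above together on an existing cluster member, the session might look like this sketch (the version number 36 is a made-up example; use your own incremented value):

```shell
# 1. Edit /etc/cluster/cluster.conf: add the new <clusternode> entry
#    and increment config_version (say, from 35 to 36).
# 2. Propagate the updated file to the cluster:
ccs_tool update /etc/cluster/cluster.conf
# 3. Check which configuration version the cluster is running:
cman_tool status | grep "Config version"
# 4. Tell cman to activate the new version:
cman_tool version -r 36
# 5. Then start the cluster software on the new node.
```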

I removed a node from cluster.conf but the cluster software and services
kept running. What did I do wrong?
You're supposed to stop the node before removing it from the cluster.conf.
How can I rename my cluster?
Here's the procedure:

Unmount all GFS partitions and stop all clustering software on all nodes in the
cluster.

Change name="old_cluster_name" to name="new_cluster_name" in
/etc/cluster/cluster.conf

If you have GFS partitions in your cluster, you need to change their
superblock to use the new name. For example:

gfs_tool sb /dev/vg_name/gfs1 table new_cluster_name:gfs1

Restart the clustering software on all nodes in the cluster.

Remount your GFS partitions.
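As a concrete sketch of the procedure (the cluster names "alpha" and "beta" and the volume path are placeholders):

```shell
# With the cluster stopped and GFS unmounted everywhere:
# edit /etc/cluster/cluster.conf so that
#   <cluster name="alpha" ...>   becomes   <cluster name="beta" ...>
# then point each GFS superblock at the new cluster name:
gfs_tool sb /dev/vg_name/gfs1 table beta:gfs1
# finally restart the cluster software and remount GFS on all nodes.
```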

What's the proper way to shut down my cluster?


Halting a single node in the cluster will seem like a communication failure to the
other nodes. Errors will be logged and the fencing code will get called, etc. So
there's a procedure for properly shutting down a cluster. Here's what you should do:
Use the "cman_tool leave remove" command before shutting down each node. That
will force the remaining nodes to adjust quorum to accommodate the missing node
and not treat it as an error.
Follow these steps:
for i in rgmanager gfs2 gfs; do service ${i} stop; done
fence_tool leave
cman_tool leave remove

Why does the cman daemon keep shutting down and reconnecting?
Additional info: When I try to start cman, I see these messages in
/var/log/messages:
Sep 11 11:44:26 server01 ccsd[24972]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.5
Sep 11 11:44:26 server01 ccsd[24972]: Initial status:: Inquorate
Sep 11 11:44:57 server01 ccsd[24972]: Cluster is quorate. Allowing connections.
Sep 11 11:44:57 server01 ccsd[24972]: Cluster manager shutdown. Attemping to reconnect...

I see these messages in dmesg:

CMAN: forming a new cluster
CMAN: quorum regained, resuming activity
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster.
CMAN: Waiting to join or form a Linux-cluster
CMAN: sendmsg failed: -13

This is almost always caused by a mismatch between the kernel and user space
CMAN code. Update the CMAN user tools to fix the problem.
I've heard there are issues with using an even/odd number of nodes. Is it
true?
No, it's not true. There is only one special case: two node clusters have special rules
for determining quorum. See question 3 above.
What is a quorum disk/partition and what does it do for you?
A quorum disk or partition is a section of a disk that's set up for use with
components of the cluster project. It has a couple of purposes. Again, I'll explain
with an example.
Suppose you have nodes A and B, and node A fails to get several of cluster
manager's "heartbeat" packets from node B. Node A doesn't know why it hasn't
received the packets, but there are several possibilities: either node B has failed,
the network switch or hub has failed, node A's network adapter has failed, or maybe
node B was just too busy to send the packet. That can happen if your cluster is
extremely large, your systems are extremely busy, or your network is flaky.
Node A doesn't know which is the case, and it doesn't know whether the problem
lies within itself or with node B. This is especially problematic in a two-node cluster
because both nodes, out of touch with one another, can try to fence the other.

So before fencing a node, it would be nice to have another way to check if the other
node is really alive, even though we can't seem to contact it. A quorum disk gives
you the ability to do just that. Before fencing a node that's out of touch, the cluster
software can check whether the node is still alive based on whether it has written
data to the quorum partition.
In the case of two-node systems, the quorum disk also acts as a tie-breaker. If a
node has access to the quorum disk and the network, that counts as two votes.
A node that has lost contact with the network or the quorum disk has lost a vote,
and therefore may safely be fenced.
Is a quorum disk/partition needed for a two-node cluster?
In older versions of the Cluster Project, a quorum disk was needed to break ties in a
two-node cluster. Early versions of Red Hat Enterprise Linux 4 (RHEL4) did not have
quorum disks, but the feature was added back as an option in RHEL4U4.
In RHCS 4 update 4 and beyond, see the man page for qdisk for more information.
As of September 2006, you need to edit your configuration file by hand to add
quorum disk support. The system-config-cluster gui does not currently support
adding or editing quorum disk properties.
Whether or not a quorum disk is needed is up to you. It is possible to configure a
two-node cluster in such a manner that no tie-breaker (or quorum disk) is required.
Here are some reasons you might want/need a quorum disk:

If you have a special requirement to go down from X -> 1 nodes in a single
transition. For example, if you have a 3/1 network partition in a 4-node
cluster - here the 1-node partition is the only node which still has network
connectivity. (Generally, the surviving node is not going to be able to handle
the load...)

If you have a special situation causing a need for a tie-breaker in general.

If you have a need to determine node-fitness based on factors which are not
handled by CMAN

In any case, please be aware that use of a quorum disk requires additional
configuration information and testing.
How do I set up a quorum disk/partition?
The best way to start is to do "man qdisk" and read the qdisk.5 man page. This has
good information about the setup of quorum disks.
Note that if you configure a quorum disk/partition, you don't want two_node="1" or
expected_votes="2" since the quorum disk solves the voting imbalance. You want
two_node="0" and expected_votes="3" (or nodes + 1 if it's not a two-node cluster).

However, since 0 is the default value for two_node, you don't need to specify it at
all. If this is an existing two-node cluster and you're changing the two_node value
from "1" to "0", you'll have to stop the entire cluster and restart it after the
configuration is changed (normally, the cluster doesn't have to be stopped and
restarted for configuration changes, but two_node is a special case.) Basically, you
want something like this in your /etc/cluster/cluster.conf:
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
<clusternode name="node1" votes="1" .../>
<clusternode name="node2" votes="1" .../>
</clusternodes>
<quorumd device="/dev/mapper/lun01" votes="1"/>
Note: You don't have to use a disk or partition to prevent two-node fence-cycles; you
can also set your cluster up this way. You can set up a number of different heuristics
for the qdisk daemon. For example, you can set up a redundant NIC with a
crossover cable and use ping operations to the local router/switch to break the tie
(this is typical, actually, and is called an IP tie breaker). A heuristic can be made to
check anything, as long as it is a shared resource.
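A quorumd section with such an IP tie-breaker heuristic might look like the sketch below; the device path, router address, and scoring/timing values are only examples, so check qdisk(5) for the exact attribute semantics:

```xml
<quorumd interval="1" tko="10" votes="1" device="/dev/mapper/lun01">
  <!-- The node counts as "fit" only while it can ping the router. -->
  <heuristic program="ping -c1 -t1 192.168.0.254" score="1" interval="2"/>
</quorumd>
```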
Do I really need a shared disk to use QDisk?
Currently, yes. There have been suggestions to make qdiskd operate in a 'diskless'
mode in order to help prevent a fence-race (i.e. prevent a node from attempting to
fence another node), but no work has been done in this area (yet).
Are the quorum disk votes reported in "Total_votes" from cman_tool
nodes?
Yes. If the quorum disk is registered correctly with cman you should see the votes it
contributes and also its "node name" in cman_tool nodes.
What's the minimum size of a quorum disk/partition?
The official answer is 10MB. The real number is something like 100KB, but we'd like
to reserve 10MB for possible future expansion and features.
Is quorum disk/partition reserved for two-node clusters, and if not, how
many nodes can it support?
Currently a quorum disk/partition may be used in clusters of up to 16 nodes.
In a 2 node cluster, what happens if both nodes lose the heartbeat but
they can still see the quorum disk? Don't they still have quorum and cause
split-brain?

First of all, no, they don't cause split-brain. As soon as heartbeat contact is lost, both
nodes will realize something is wrong and lock GFS until it gets resolved and
someone is fenced.
What actually happens depends on the configuration and the heuristics you build.
The qdisk code allows you to build non-cluster heuristics to determine the fitness of
each node beyond the heartbeat. With the heuristics in place, you can, for example,
allow the node running a specific service to have priority over the other node. It's a
way of saying "This node should win any tie" in case of a heartbeat failure. The
winner fences the loser.
If both nodes still have a majority score according to their heuristics, then both
nodes will try to fence each other, and the fastest node kills the other. Showdown at
the Cluster Corral. The remaining node will have quorum along with the qdisk, and
GFS will run normally under that node. When the "loser" reboots, unlike with a
cman-only setup, it will not become quorate with just the quorum disk/partition, so it
cannot cause split-brain that way either.
At this point (4-Apr-2007), if there are no heuristics defined whatsoever, the QDisk
master node wins (and fences the non-master node).
If my cluster is mission-critical, can I override quorum rules and have a
"last-man-standing" cluster that's still functioning?
This may not be a good idea in most cases because of the dangers of split-brain, but
there is a way you can do this: you can adjust the "votes" for the quorum disk to be
equal to the number of nodes in the cluster, minus 1.
For example, if you have a four-node cluster, you can set the quorum disk votes to
3, and expected_votes to 7. That way, even if three of the four nodes die, the
remaining node may still function. That's because the quorum disk's 3 votes plus
the remaining node's 1 vote makes a total of 4 votes out of 7, which is enough to
establish quorum. Additionally, all of the nodes can be online - but not the qdiskd
(which you might need to take down for maintenance or reconfiguration).
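For the four-node example above, the relevant cluster.conf fragment might look like this sketch (node names and the device path are placeholders):

```xml
<cman expected_votes="7"/>
<clusternodes>
  <clusternode name="node1" votes="1"/>
  <clusternode name="node2" votes="1"/>
  <clusternode name="node3" votes="1"/>
  <clusternode name="node4" votes="1"/>
</clusternodes>
<!-- qdisk votes = nodes - 1: one node (1) + qdisk (3) = 4 of 7 votes -->
<quorumd device="/dev/mapper/lun01" votes="3"/>
```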
My cluster won't come up. It says: kernel: CMAN: Cluster membership
rejected. What do I do?
One or more of the nodes in your cluster is rejecting the membership of this node.
Check the syslog (/var/log/messages) on all remaining nodes in the cluster for
messages regarding why the membership was rejected.
This message will appear when another node is rejecting the node in question and it
WILL tell syslog (/var/log/messages) why unless you have kernel logging switched
off for some reason. There are several reasons your node may be rejected:

Mismatched cluster.conf version numbers.

Mismatched cluster names.

Mismatched cluster number (a hash of the name).

Node has the wrong node ID (i.e. it joined with the same name and a different
node ID or vice versa).

CMAN protocol version differs (or other software mismatch - there are several
error messages for these but they boil down to the same thing).

Something else you might like to try is changing the port number that this cluster is
using, or changing the cluster name to something totally different.
If you find that things work after doing this then you can be sure there is another
cluster with that name or number on the network. If not, then you need to
double/triple check that the config files really do all match on all nodes.
I've seen this message happen when I've accidentally done something like this:

Created a cluster.conf file with 5 nodes: A, B, C, D and E.

Tried to bring up the cluster.

Realized that node E has the wrong software, has no access to the SAN, has a
hardware problem or whatever.

Removed node E from cluster.conf because it really doesn't belong in the
cluster after all.

Updated all five machines with the new cluster.conf.

Rebooted nodes A, B, C and D to restart the cluster.

Guess what? None of the nodes come up in a cluster. Can you guess why?
It's because node E still thinks it's part of the cluster and still has a claim on the
cluster name. You still need to shut down the cluster software on E, or else reboot it
before the correct nodes can form a cluster.
Is it a problem if node order isn't the same for all nodes in cman_tool
services?
No, this isn't a problem and can be ignored. Some nodes may report [1 2 3 4 5]
while others report a different order, like [4 3 5 2 1]. This merely has to do with the
order in which cman join messages are received.
Why does cman_tool leave say "cman_tool: Can't leave cluster while there
are X active subsystems"?

This message indicates that you tried to leave the cluster from a node that still has
active cluster resources, such as mounted GFS file systems.
A node cannot leave the cluster if there are subsystems (e.g. DLM, GFS, rgmanager)
active. You should unmount all GFS filesystems, stop the rgmanager service, stop
the clvmd service, stop fenced and anything else using the cluster manager before
using cman_tool leave. You can use cman_tool status and cman_tool services to see
how many (and which) services are running.
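Putting those steps in order on the node that wants to leave, a sketch might look like this (the exact service list depends on what you run):

```shell
umount -a -t gfs        # unmount all GFS file systems
service rgmanager stop  # stop the resource group manager
service clvmd stop      # stop clustered LVM
fence_tool leave        # leave the fence domain
cman_tool leave         # should now succeed: no active subsystems
```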
What are these services/subsystems and how do I make sense of what
cman_tool services prints?
Although this may be an over-simplification, you can think of the services as a big
membership roster for different special interest groups or clubs. Each "service-name"
pair corresponds to access to a unique resource, and each node corresponds
to a voting member in the club.
So let's weave an inane piece of fiction around this concept: let's pretend that a
journalist named Sam wants to write an article for her newspaper, "The National
Conspiracy Theorist." To write her article, she needs access to secret knowledge
kept hidden for centuries by a secret society known only as "The Group." The only
way she can become a member is to petition the existing members to join, and the
decision must be unanimously in her favor. But The Group is so secretive, they don't
even know each other's names; every member is assigned a unique ID number.
Their only means of communication is through a chat room, and they won't even
speak to you unless you're a member or unless you know how to become a member.
So she logs into the chat room and joins the channel #default. In the chat room, she
can see there are seven members of The Group. They're not listed in order, but
they're all there.
[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5]

She finds a blog (called "cluster.conf") and reads from it that her own ID number is
8. So she sends them a message: "Node 8 wants to join the default group".
Secretly, the other members take attendance to make sure all the members are
present and accounted for. Then they take a vote. If all of them vote yes, she's
allowed into the group and she becomes the next member. Her ID number is added
to the list of members.
[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]

Now that she's a member of the Group, she is told that the secrets of the order are
not given to ordinary newbies; they're kept in a locked space. They are stored in an
office building owned by the order, which they oddly call "clvmd." Since she's a
newbie, she has to petition the other members to get a key to the clvmd office
building. After a similar vote, they agree to give her a key, and they keep track of
everyone who has a key.
[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 2 3 4 5 8]

Eager to write her article, she drives to the clvmd office building, unlocks the door,
and goes inside. She's heard rumors that the secrets are kept in a suite labeled
"secret". She goes from room to room until she finds a door marked "secret." Then
she discovers that the door is locked and her key doesn't fit. Again, she has to
petition the others for a key. They tell her that there are actually two adjacent rooms
inside the suite, the "DLM" room and the "GFS" room, each holding a different set of
secrets.
Four of the members (3, 4, 6 and 7) never really cared what was in those rooms, so
they never bothered to learn the grueling rituals, and consequently, they were
never issued keys to the two secret rooms. So after months of training, Sam once
again petitions the other members to join the "secret rooms" group. She writes
"Node 8 wants to join the 'secret' DLM group" and sends it to the members who
have a key: #1, #2 and #5. She sends them a similar message for the other room
as well: "Node 8 wants to join the 'secret' GFS group". Having performed all the
necessary rituals, they agree, and she's issued a duplicate key for both secret
rooms.
[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 2 3 4 5 8]
DLM Lock Space:  "secret"                           12   8 run       -
[1 2 5 8]
GFS Mount Group: "secret"                           13   9 run       -
[1 2 5 8]

Then something shocking rocks the secret society: member 2 went into cardiac
arrest and died on the operating table. Clearly, something must be done to recover
the keys held by member 2. In order to secure the contents of both rooms, no one is
allowed to touch the information in the secret rooms until they've verified member 2
was really dead and recovered his keys. The members decide to leave that task to
the most senior member, member 7.

That night, when no one is watching, member 7 breaks into the morgue, verifies #2
is really dead, and steals back the key from his pocket. Then #7 drives to the office
building and returns all the secrets #2 had borrowed from the secret room. (They call
it "recovery".) He also informs the other members that #2 is truly dead, and #2 is
taken off the group membership lists. Relieved that their secrets are safe, the others
are now allowed access to the secret rooms.
[root@roth-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[7 6 1 3 4 5 8]
DLM Lock Space:  "clvmd"                             7   3 run       -
[7 6 1 3 4 5 8]
DLM Lock Space:  "secret"                           12   8 run       -
[1 5 8]
GFS Mount Group: "secret"                           13   9 run       -
[1 5 8]

You get the picture... Each of these "services" keeps a list of members who are
allowed access, and that's how the cluster software on each node knows which
others to contact for locking purposes. Each GFS file system has two groups that are
joined when the file system is mounted; one for GFS and one for DLM.
The "state" of each service corresponds to its status in the group: "run" means it's a
normal member. There are also states corresponding to joining the group, leaving
the group, recovering its locks, etc.
What can cause a node to leave the cluster?
A node may leave the cluster for many reasons. Among them:

Shutdown: cman_tool leave was run on this node

Killed by another node: The node was killed either by cman_tool kill or by
qdisk.

Panic: cman failed to allocate memory for a critical data structure or some
other very bad internal failure.

Removed: Like a shutdown, but the remainder of the cluster can adjust quorum
downwards to keep working.

Membership Rejected: The node attempted to join a cluster but its
cluster.conf file did not match that of the other nodes. To find the real reason
for this you need to examine the syslog of all the valid cluster members to
find out why it was rejected.

Inconsistent cluster view: This is usually indicative of a bug but it can also
happen if the network is extremely unreliable.

Missed too many heartbeats: This means what it says. All nodes are expected
to broadcast a heartbeat every 5 seconds (by default). If none is received
within 21 seconds (by default) then the node is removed for this reason. The
heartbeat values may be changed from their defaults.

No response to messages: This usually happens during a state transition to
add or remove another node from a group. The reporting node sent a
message five times (by default) to the named node and did not get a
response.

How do I change the time interval for the heartbeat messages?


Just add hello_timer="value" to the cman section in your cluster.conf file. For
example:

<cman hello_timer="5"/>

The default value is 5 seconds.
How do I change the time after which a non-responsive node is considered
dead?

For RHEL4 and STABLE branches: Just add deadnode_timeout="value" to the
cman section in your cluster.conf file. For example:

<cman deadnode_timeout="21"/>

The default value is 21 seconds.

For RHEL5 and STABLE2 branches: Just add token="value" to the totem
section in your cluster.conf file. Note that the totem token timeout value is
specified in milliseconds, not seconds. The equivalent for the above example
is:

<totem token="21000"/>

The default value is 10000 milliseconds (or 10 seconds). It is important to
change this value if you are using QDisk on RHEL5/STABLE2; 21000 should
work if you left QDiskd's interval/tko at their default values.
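Combining the RHEL5/STABLE2 advice above with QDiskd left at its defaults, the relevant cluster.conf fragments might look like this sketch (the device path is a placeholder):

```xml
<totem token="21000"/>
<!-- QDiskd defaults shown explicitly: 1 s interval, 10 missed cycles -->
<quorumd interval="1" tko="10" device="/dev/mapper/lun01" votes="1"/>
```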

What does "split-brain" mean?

"Split brain" is a condition whereby two or more computers or groups of computers
lose contact with one another but still act as if the cluster were intact. This is like
having two governments trying to rule the same country. If multiple computers are
allowed to write to the same file system without knowledge of what the other nodes
are doing, it will quickly lead to data corruption and other serious problems.
Split-brain is prevented by enforcing quorum rules (which say that no group of
nodes may operate unless they are in contact with a majority of all nodes) and
fencing (which makes sure nodes outside of the quorum are prevented from
interfering with the cluster).
What's the "right" way to get cman to use a different NIC, say, eth2 rather
than eth0?
There are several reasons for doing this. First, you may want the cman heartbeat
messages on a dedicated network so that a heavily used network doesn't cause
heartbeat messages to be missed (and nodes in your cluster to be fenced). Second,
you may have security reasons for wanting to keep these messages off of an
Internet-facing network.
First, you want to configure your alternate NIC to have its own IP address, and the
settings that go with that (subnet, etc).
Next, add an entry to /etc/hosts (on all nodes) for the IP address associated with
the NIC you want to use, in this case eth2. One way to do this is to append a suffix
to the original host name. For example, if your node is "node-01" you could give it
the name "node-01-p" (-p for private network). Your /etc/hosts file might then
look like this:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1     localhost.localdomain localhost
::1           localhost6.localdomain6 localhost6
10.0.0.1      node-01
192.168.0.1   node-01-p
Once you've done this, make sure that your cluster.conf uses the name with the
-p suffix rather than the old name. Note that -p is just a suggestion; you could
use -internal or anything else, really.
If you're using RHEL4.4 or above, or 5.1 or above, that's all you need to do. There is
code in cman to look at all the active network interfaces on the node and find the
one that corresponds to the entry in cluster.conf. Note that this only works on
IPv4 interfaces.
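For example, the clusternode entries in cluster.conf would then use the private names (the node names and nodeids here are illustrative):

```xml
<clusternodes>
  <clusternode name="node-01-p" nodeid="1" votes="1"/>
  <clusternode name="node-02-p" nodeid="2" votes="1"/>
</clusternodes>
```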
Does cluster suite use multicast or broadcast?
By default, the older cluster infrastructure (RHEL4, STABLE and so on) uses
broadcast. By default, the newer cluster infrastructure with openais (RHEL5, HEAD

and so on) uses multicast. You can configure a RHEL4 cluster to use multicast rather
than broadcast. However, you can't switch openais to use broadcast.
Is it possible to configure a cluster with nodes running on different
networks (subnets)?
Yes, it is. If you configure the cluster to use multicast rather than broadcast (there is
an option for this in system-config-cluster) then the nodes can be on different
subnets.
Be careful that any switches and/or routers between the nodes are of good
specification and are set to pass multicast traffic through.
How can I configure my RHEL4 cluster to use multicast rather than
broadcast?
Put something like this in your cluster.conf file:
<clusternode name="node1">
<multicast addr="224.0.0.1" interface="eth0"/>
</clusternode>
On RHEL5, why do I get "cman not started: Can't bind to local cman
socket /usr/sbin/cman_tool"?
There is currently a known problem with RHEL5 whereby system-config-cluster
looks for cman_tool in the wrong place (/usr/sbin; cman_tool currently resides in
/sbin). We'll correct the problem, but in the meantime you can work around it by
creating a symlink from /sbin/cman_tool into /usr/sbin/. For example:
[root@node-01 ~]# ln -s /sbin/cman_tool /usr/sbin/cman_tool
If this is not your problem, read on:
Ordinarily, this message would mean that cman could not create the local socket
in /var/run for communication with the cluster clients.
cman tries to create /var/run/cman_client and /var/run/cman_admin. Things like
cman_tool, groupd and ccsd talk to cman over this link. If the sockets can't be
created, you'll get this error.
Check that /var/run is writable and able to hold Unix domain sockets.
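That check can be performed by hand. The sketch below probes a throwaway directory rather than /var/run itself; the socket file name is hypothetical:

```python
import os
import socket
import tempfile

def can_hold_unix_socket(directory):
    """Return True if 'directory' is writable and can hold a Unix domain
    socket, the same precondition cman needs for /var/run/cman_client."""
    path = os.path.join(directory, "cman_client_test")
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
        return True
    except OSError:
        return False
    finally:
        s.close()
        if os.path.exists(path):
            os.unlink(path)

# Probe a temporary directory rather than the live /var/run.
print(can_hold_unix_socket(tempfile.mkdtemp()))
```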
On Fedora 8, CMAN won't start, complaining about "aisexec not started".
How do I fix it?
On Fedora 8 and other distributions where the core supports multiple architectures
(e.g. x86, x86_64), you must have a matched set of packages installed. A cman
package for x86_64 will not work with an x86 (i386/i686) openais package, and
vice versa. To see if you have a mixed set, run:
WRONG:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for
GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
RIGHT:
[root@ayanami ~]# file `which cman_tool`; file `which aisexec`
/usr/sbin/cman_tool: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for
GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
/usr/sbin/aisexec: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for
GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, stripped
You need to use the same architecture as your kernel for running the userland parts
of the cluster packages; on x86_64, this generally means you should only have the
x86_64 versions of the cluster packages installed.
rpm -e cman.i386 openais.i386 rgmanager.i386 ...
yum install -y cman.x86_64 openais.x86_64 rgmanager.x86_64 ...
Note: If you were having trouble getting things up, there's a chance that an old
aisexec process might be running on one of the nodes; make sure you kill it before
trying to start again!
My RHEL5 or similar cluster won't work with my Cisco switch.
Some nodes cannot see each other, even though ping works. Why?
Two nodes in different blade frames cannot see each other. Why?
These problems are caused by multicast routing problems. Assuming you have
already checked your firewall configuration, read on.
Solution #1: Fix your switch
Some Cisco (and other software-based) switches do not support IP multicast in their
default configuration.
Since openais uses multicast for cluster communications, you may have to enable it
in the switch in order to use the cluster software.

Before making any changes to your Cisco switches, it is advisable to contact your
Cisco TAC to ensure the changes will have no negative consequences in your
network. Please visit this page for more information: OpenAIS - Cisco Switches
Solution #2: Work around the switch
In some environments, it's possible to simply change the multicast address that
OpenAIS uses from within cluster.conf rather than reconfiguring your switch. To do
this, add a multicast tag to your cluster.conf:
<cman ... >
<multicast addr="225.0.0.13" />
</cman>
The address 225.0.0.x is known to work in some environments when the standard
openais multicast address does not. Note, however, that this address lies within a
reserved range of multicast addresses and may not be suitable for use in the future:
225.000.000.000-231.255.255.255  Reserved  [IANA]
Source: IANA - Multicast Addresses
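If you want to check whether a candidate address falls inside that reserved block, Python's standard ipaddress module makes the comparison easy (the two sample addresses are just illustrations):

```python
import ipaddress

# Boundaries of the IANA "Reserved" block quoted above.
RESERVED_LO = ipaddress.IPv4Address("225.0.0.0")
RESERVED_HI = ipaddress.IPv4Address("231.255.255.255")

def in_reserved_block(addr):
    return RESERVED_LO <= ipaddress.IPv4Address(addr) <= RESERVED_HI

print(in_reserved_block("225.0.0.13"))   # True: the workaround address is in the block
print(in_reserved_block("239.192.0.1"))  # False: outside this reserved range
```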


My RHEL5 or similar cluster won't work with my HP switch.
Some HP servers and switches do not play well together when using Linux. More
information, and a workaround, is available here.
I created a large RHEL5 cluster but it falls apart when I boot it.
The default parameters for a RHEL5 cluster are usually enough to get a small to
medium sized cluster running, say up to around 16 nodes.
Beyond that limit some tuning needs to be done. Here are some parameters I have
used to get larger clusters running. Note that this increases the time taken for dead
nodes to be detected quite considerably.
The following cluster.conf extract allowed me to get 45 nodes running:
<totem token="50000" consensus="45000" join="6000" send_join="880"
token_retransmits_before_loss_const="10"/>
To get beyond that some seriously large numbers are needed. Here's what I did to
get 60 nodes working:
<totem token="60000" consensus="45000" join="15000" send_join="1000"
token_retransmits_before_loss_const="100"/>

These numbers are not definitive and might not work perfectly at your site. Other
variables such as network and host load come into play. But they should, I hope, be
a good starting point for people wanting to run larger RHEL5 clusters.
[MAIN ] Killing node mynode01 because it has rejoined the cluster with
existing state
What this message means is that a node was a valid member of the cluster once; it
then left the cluster (without being fenced) and rejoined automatically. This can
sometimes happen if the Ethernet is disconnected for a time, usually a few seconds.
If a node leaves the cluster, it MUST rejoin using the cman_tool join command with
no services running. The usual way to make this happen is to reboot the node and
let the init script do its job; if fencing is configured correctly, that is what
normally happens. It could be that fencing is too slow to manage this, or that the
cluster is made up of two nodes without a quorum disk, so that the 'other' node
doesn't have quorum and cannot initiate fencing. What must not happen is that the
node is ejected from the cluster and the system manager simply reruns the init
script from the command line. This will almost certainly not clear out running
services.
Another (more common) cause of this, is slow responding of some Cisco switches as
documented above.
What is the "Dirty" flag that cman_tool shows, and should I be worried?
The short answer is "No, you should not be worried".
All this flag indicates is that there are cman daemons running that have state which
cannot be rebuilt without a node reboot. This can be as simple (in concept!) as a
DLM lockspace or a fence domain. When a cluster has state, the dirty flag is set (it
cannot be reset), and this prevents two stateful clusters from merging, as the two
states cannot be reconciled. In some cases this can cause the message shown above.
Many daemons can set this flag, e.g. fence_tool join will set it, via fenced, as will
clvmd (because it instantiates a lockspace). Think of it as a "we have some
daemons running" flag, if you like!
The main reason for the flag is to prevent state corruption where the cluster is
evenly split (so that fencing cannot occur) and tries to merge back again. Neither
side of the cluster knows if the other side's state has changed and there is no
mechanism for performing a state merge. So one side gets marked disallowed or is
fenced, depending on quorum. Fencing can only be done by a quorate partition.
This flag has been renamed to "HaveState?" in STABLE3 so as to panic people less.
In general most users can ignore this flag.
Chrissie's plea to people submitting logs for bug reports

Please, please *always* attach full logs. I'd much rather have 2GB of log files to
wade through than 1K of truncated logs that don't show what I'm looking for.
I'm very good at filtering log files, it's my job and I've been doing it for a very long
time now! And it's quite possible that I might spot something important that looks
insignificant to you.
Cluster name limitations
* 15 non-NUL (ASCII 0) characters maximum
* You can use the 'alias' attribute to make a more descriptive name.
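A quick sanity check for the length limit, as a sketch:

```python
def valid_cluster_name(name):
    # cman limits cluster names to 15 non-NUL (ASCII 0) characters.
    return 0 < len(name) <= 15 and "\x00" not in name

print(valid_cluster_name("mycluster"))  # True
print(valid_cluster_name("a" * 16))     # False: one character too long
```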
https://fedorahosted.org/cluster/wiki/FAQ/CMAN
