
Using the fence_kdump agent (RHEL 6.2 and later)
================================================

kdump configuration
===================
1. Install the package that provides kdump, kexec-tools, for example:
rpm -ivh kexec-tools-1.102pre-75.el5.x86_64.rpm
2. Add the following to the kernel line(s) in /boot/grub/grub.conf
crashkernel=128M@16M

For example:
============
title Red Hat Enterprise Linux (2.6.32-358.el6.i686)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-358.el6.i686 ro root=/dev/mapper/VolGroup-lv_root nomodeset rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=VolGroup/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M@16M rd_LVM_LV=VolGroup/lv_root KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
        initrd /initramfs-2.6.32-358.el6.i686.img
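
As a sanity check, after the node is rebooted in step 3 below, you can verify that the crash kernel memory was actually reserved. This check is not part of the original steps; the address range shown is only illustrative for crashkernel=128M@16M:
# cat /proc/cmdline | grep crashkernel
# grep -i "crash kernel" /proc/iomem
01000000-08ffffff : Crash kernel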

3. Shut down and restart the node. In the case of a physical node, disable ASR (Automatic Server Recovery) in the BIOS.
4. Check with showmount -e <ipaddress-nfs-server> that the NFS share can be accessed.
Add the following to /etc/kdump.conf (a minimal example of the resulting file is shown after this step):
path /kdump
net <NFS_server_IP>:/dumps
        a. With net, specify the IP address of the NFS server and its exported mount point. Make sure there is enough free space on the NFS share.
        b. With path, specify the path relative to the mount point, so the vmcore will be written to /dumps/kdump on node <NFS Server IP>. This directory needs world read, write and execute access, so run chmod -R 777 /dumps/kdump. Rebuilding the kdump initrd fails if this is not configured correctly.
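
A minimal sketch of what the NFS check and the resulting /etc/kdump.conf could look like after this step; the server address is a placeholder and the export list is only an assumed example:
# showmount -e <NFS_server_IP>
Export list for <NFS_server_IP>:
/dumps *
# grep -v '^#' /etc/kdump.conf
path /kdump
net <NFS_server_IP>:/dumps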
5. Start the kdump service. A new kdump initrd file will be created, and no errors should be displayed. After this, check that the kdump service is operational with service kdump status, for example:
# service kdump restart
# service kdump status
Kdump is operational
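
To make sure kdump also starts automatically at boot (not spelled out in the step above), it can be enabled with chkconfig; this is a suggested addition rather than part of the original procedure:
# chkconfig kdump on
# chkconfig --list kdump
kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off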

6. If not already set to 1, modify /etc/sysctl.conf to set kernel.sysrq to 1, and apply the value with the commands below (note the >> append, so existing sysctl settings are not overwritten):
# echo "kernel.sysrq = 1" >> /etc/sysctl.conf ; sysctl -p
# sysctl -a | grep sysrq
kernel.sysrq = 1

7. Test kdump by forcing a kernel crash:
# echo c > /proc/sysrq-trigger
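
After the node has dumped and rebooted, the vmcore should appear under the configured path on the NFS server, typically in a directory named after the client IP address and a timestamp. The exact directory name and file size below are only an illustration:
# ls -l /dumps/kdump/
drwxr-xr-x 2 root root 4096 Feb  7 07:16 192.168.244.11-2013-02-07-07:15:32
# ls -l /dumps/kdump/192.168.244.11-2013-02-07-07:15:32/
-rw------- 1 root root 142314496 Feb  7 07:16 vmcore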

fence_kdump configuration
=========================
In Red Hat Enterprise Linux 6.2 and later, the fence_kdump agent can be used to detect that a failed cluster node has entered the kdump crash recovery service and to mark the node as fenced.
Using the fence_kdump agent results in significantly shorter recovery time compared to using post_fail_delay. With post_fail_delay, recovery does not complete until kdump core collection has finished. The fence_kdump agent reduces recovery time because fencing completes as soon as the cluster is notified that the failed node has entered the kdump crash recovery service, allowing the cluster to recover before kdump core collection completes.
The fence_kdump agent must be used in conjunction with another fence agent; it must not be used by itself. The fence_kdump agent is only capable of detecting that a cluster node has entered the kdump kernel. Other events that require fencing (e.g. a network outage) must be handled by other fencing methods.
The fence_kdump fence agent has two components:
===============================================
1. fence_kdump: the fencing agent. When fencing occurs, this agent listens for a message from the node that is being fenced. If the agent does not receive a message from the failed node within a certain amount of time, the agent returns failure and other fencing methods should be attempted. If the agent does receive a message from the failed node, the agent returns success and the node is considered to be fenced.
2. fence_kdump_send: the utility that sends the message. This is normally run from within the kdump kernel while the kdump crash recovery service is performing core collection. Messages are sent continuously at a regular interval to all nodes in the cluster.
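
For a quick manual check of the messaging path (not part of the original procedure), fence_kdump_send can also be run by hand from a healthy node. The -p and -i options are the same ones used later in /etc/sysconfig/fence_kdump, the node names are placeholders for the cluster members, and the command keeps sending until interrupted with Ctrl-C:
# fence_kdump_send -p 8452 -i 5 node1 node2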
To use the fence_kdump agent, first edit the /etc/cluster/cluster.conf file.
# ccs -h node2 --getconf
<cluster config_version="43" name="Cluster_BD">
  <fence_daemon post_fail_delay="5" post_join_delay="10"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="kdump">
          <device name="kdump"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="kdump">
          <device name="kdump"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_kdump" name="kdump" port="8452" timeout="180"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="failover1" nofailback="0" ordered="1" restricted="0">
        <failoverdomainnode name="node1" priority="1"/>
        <failoverdomainnode name="node2" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <fs device="/dev/CLUSTERVG/lv1" fstype="ext4" mountpoint="/cluster/fs1" name="fs"/>
    </resources>
    <service domain="failover1" name="FS" recovery="relocate">
      <fs ref="fs"/>
    </service>
  </rm>
</cluster>
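
If cluster.conf is edited by hand, it is worth validating it before distributing it to the other nodes. The ccs_config_validate utility ships with the RHEL 6 cluster packages; on success it should report output along these lines:
# ccs_config_validate
Configuration validates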

3. If the fence_kdump agent is configured to listen on a port other than the default, the fence_kdump_send utility must be configured to send to that same port number. The behavior of fence_kdump_send can be modified by setting options in the /etc/sysconfig/fence_kdump file:

# echo 'FENCE_KDUMP_OPTS="-i 5 -p 8452"' > /etc/sysconfig/fence_kdump

In this example, fence_kdump_send sends a message to port 8452 every 5 seconds. For a complete list of fence_kdump_send options, refer to the fence_kdump_send man page. Note that any change to /etc/sysconfig/fence_kdump requires the kdump service to be restarted.
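
A quick check that the option actually ended up in the file (purely illustrative):
# cat /etc/sysconfig/fence_kdump
FENCE_KDUMP_OPTS="-i 5 -p 8452"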

4. It is important to note that if a node fails for any reason other than a kernel panic, the total recovery time will be delayed by the time that fence_kdump waits for a message. In the example above, if "node1" fails for any reason other than a kernel panic, the next fencing method (for example a fence_apc device, if one is configured) will not attempt to fence the node until fence_kdump has returned failure after the configured timeout (180 seconds in the configuration above).
Once the /etc/cluster/cluster.conf file has been modified to use fence_kdump, restart the kdump service.
This step is required so that the kdump service detects that it should send messages to the cluster nodes. The kdump service will detect that the /etc/cluster/cluster.conf file has changed and rebuild the kexec initrd image. When this occurs, kdump will extract a list of cluster nodes that should receive notification messages when the node enters the kdump crash recovery service. For this reason, the kdump service should be restarted whenever /etc/cluster/cluster.conf is modified.
5. By default, the fence_kdump agent listens for UDP messages on port 7410, and the default timeout is 60 seconds. These values can be modified by setting parameters on the fence_kdump fence device. In the following example, the fence_kdump agent will listen for UDP messages on port 8452 and will time out if no message is received within 180 seconds:

<fencedevice name="kdump" agent="fence_kdump" timeout="180" port="8452"/>

The same device can also be configured with ccs:
# ccs -h node1 --addfencedev kdump1 agent="fence_kdump" port="8452" timeout="180"
# ccs -h node1 --addmethod kdump1 node1
# ccs -h node1 --addfenceinst kdump1 node1 kdump1

# service kdump restart

6. With the kdump service enabled and running, bring up the cluster by starting (or restarting) the cman service on each node:
# service cman start
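
Once cman is up on both nodes, quorum, membership and fencing status can be checked with the standard RHEL 6 cluster tools; these commands are shown here only as a suggested verification step:
# cman_tool status
# clustat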
7. The fence_kdump agent can now be tested by forcing a node to panic. For this example, assume that the node being forced to panic is "node1" with an IP address of 192.168.244.11 (as in the log excerpt below).
# echo c > /proc/sysrq-trigger
8. Once the node has entered the kdump kernel, fence_kdump_send will begin sending messages to all cluster nodes. Of the remaining cluster nodes, the node with the lowest node ID will be responsible for fencing the failed node "node1". Inspecting /var/log/messages on the node performing the fence operation should show the following:

# tail -f /var/log/messages
Example with the ccs command
============================
# ccs -h node1 --addfencedev kdump1 agent="fence_kdump" port="8452" timeout="180"
# ccs -h node1 --addmethod kdump1 node1
# ccs -h node1 --addmethod kdump1 node2
# ccs -h node1 --addfenceinst kdump1 node1 kdump1
# ccs -h node1 --addfenceinst kdump1 node2 kdump1
# ccs -h node2 --getconf
# ccs -h node1 --getconf
# ccs -h node1 --rmmethod kdump1 node1
# ccs -h node1 --rmmethod kdump1 node2
# ccs -h node1 --rmfencedev kdump1
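
If the configuration is maintained with ccs in this way, the usual last step is to push and activate the updated configuration on all nodes. The --sync and --activate options exist in the RHEL 6 ccs tool; the invocation below is a sketch of that step:
# ccs -h node1 --sync --activate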

[root@node2 ~]# tail -f /var/log/messages
Feb  7 07:10:54 node2 kdump: stopped
Feb  7 07:11:58 node2 kdump: kexec: loaded kdump kernel
Feb  7 07:11:58 node2 kdump: started up
Feb  7 07:12:47 node2 corosync[25688]: [TOTEM ] A processor failed, forming new configuration.
Feb  7 07:12:49 node2 corosync[25688]: [QUORUM] Members[1]: 2
Feb  7 07:12:49 node2 corosync[25688]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb  7 07:12:49 node2 corosync[25688]: [CPG   ] chosen downlist: sender r(0) ip(192.168.244.12) ; members(old:2 left:1)
Feb  7 07:12:49 node2 corosync[25688]: [MAIN  ] Completed service synchronization, ready to provide service.
Feb  7 07:12:49 node2 kernel: dlm: closing connection to node 1
Feb  7 07:12:49 node2 rgmanager[25955]: State change: node1 DOWN
Feb  7 07:12:54 node2 fenced[25754]: fencing node node1
Feb  7 07:12:54 node2 fence_kdump[3606]: waiting for message from '192.168.244.11'
Feb  7 07:12:55 node2 fence_kdump[3606]: received valid message from '192.168.244.11'
Feb  7 07:12:55 node2 fenced[25754]: fence node1 success
Feb  7 07:12:56 node2 rgmanager[25955]: Taking over service service:FS from down member node1
