From: Vladislav Bogdanov <bubble@hoster-ok.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons
Date: Fri, 19 Aug 2011 23:32:17 +0300
Message-ID: <4E4EC851.9030608@hoster-ok.com>

Hi all,

I originally posted this to the linux-cluster list, but got no answer
there, so I suspect this list is more suitable.

Several days ago the clvmd CPG in my cluster went into the kern_stop
state after high load caused problems on the corosync ring.
The cluster currently has three nodes: two bare-metal and one VM. The VM
was not scheduled often enough because of load on its host, so the
cluster went into split-brain for one second and then quickly recovered.

CPG issued a CPG_REASON_NODEDOWN event, after which clvmd went to
kern_stop on the two bare-metal nodes and to kern_stop,fencing on the VM
(naturally, since it did not have quorum).

I expected the VM to be fenced, but no fencing actually happened. The
clvmd CPG stayed stuck in kern_stop even after the VM was fenced
manually, so I had to take the whole cluster down to recover.

I found the reason why the node was not fenced on the
CPG_REASON_NODEDOWN event. Here is what I see in dlm_tool dump:

1313579105 Processing membership 80592
1313579105 Skipped active node 939787530: born-on=80580, last-seen=80592, this-event=80592, last-event=80580
1313579105 Skipped active node 956564746: born-on=80564, last-seen=80592, this-event=80592, last-event=80580
1313579105 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1543767306"
1313579105 Removed inactive node 1543767306: born-on=80572, last-seen=80580, this-event=80592, last-event=80580
1313579105 dlm:controld conf 2 0 1 memb 939787530 956564746 join left 1543767306
1313579105 dlm:ls:clvmd conf 2 0 1 memb 939787530 956564746 join left 1543767306
1313579105 clvmd add_change cg 4 remove nodeid 1543767306 reason 3
1313579105 clvmd add_change cg 4 counts member 2 joined 0 remove 1 failed 1
1313579105 clvmd stop_kernel cg 4
1313579105 write "0" to "/sys/kernel/dlm/clvmd/control"
1313579105 Node 1543767306/mgmt01 has not been shot yet
1313579105 clvmd check_fencing 1543767306 wait add 1313562825 fail 1313579105 last 0
1313579107 Node 1543767306/mgmt01 was last shot 'now'
1313579107 clvmd check_fencing 1543767306 done add 1313562825 fail 1313579105 last 1313579107
1313579107 clvmd check_fencing done
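
To make the relevant lines of the dump easier to pick out, here is a
minimal Python sketch that extracts the removal reason and the
check_fencing state changes. The regular expressions and the
summarize() helper are illustrative and only cover the line shapes shown
above. The reason table follows cpg_reason_t from corosync's cpg.h.
Treating "last >= fail" as "fenced since the failure" is my reading of
the dump, not something taken from the dlm_controld source.

```python
import re

# Sample lines copied from the dump (wrapped lines rejoined).
DUMP = """\
1313579105 clvmd add_change cg 4 remove nodeid 1543767306 reason 3
1313579105 clvmd check_fencing 1543767306 wait add 1313562825 fail 1313579105 last 0
1313579107 clvmd check_fencing 1543767306 done add 1313562825 fail 1313579105 last 1313579107
"""

# Numeric values of cpg_reason_t in corosync's cpg.h.
CPG_REASONS = {1: "JOIN", 2: "LEAVE", 3: "NODEDOWN", 4: "NODEUP", 5: "PROCDOWN"}

# Illustrative patterns; they match only the two line shapes used here.
REMOVE_RE = re.compile(
    r"^(\d+) (\S+) add_change cg (\d+) remove nodeid (\d+) reason (\d+)$")
FENCING_RE = re.compile(
    r"^(\d+) (\S+) check_fencing (\d+) (wait|done) add (\d+) fail (\d+) last (\d+)$")


def summarize(dump):
    """Return a list of tuples describing removals and fencing checks."""
    events = []
    for line in dump.splitlines():
        m = REMOVE_RE.match(line)
        if m:
            reason = CPG_REASONS.get(int(m.group(5)), "UNKNOWN")
            events.append(("remove", int(m.group(4)), reason))
            continue
        m = FENCING_RE.match(line)
        if m:
            fail, last = int(m.group(6)), int(m.group(7))
            # Assumption: the wait ends once the last fencing time is not
            # older than the failure time.
            events.append(("fencing", int(m.group(3)), m.group(4), last >= fail))
    return events


for event in summarize(DUMP):
    print(event)
```

On the sample lines this reports the removal as NODEDOWN, then one
"wait" entry with last < fail, then one "done" entry after the manual
fencing updated the last-fenced time.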

This means that dlm_controld received a CPG_REASON_NODEDOWN event for
the clvmd CPG and did not call kick_node_from_cluster(), so pacemaker
did not fence the node on behalf of the clvmd CPG.

Please correct me if I'm wrong:
* Requesting fencing of a node on a CPG_REASON_NODEDOWN event was
  historically left to groupd.
* That is why all daemons (fenced, dlm_controld, gfs2_controld) call
  kick_node_from_cluster() only on a CPG_REASON_PROCDOWN event, not on
  CPG_REASON_NODEDOWN.
* groupd is obsolete in 3.x.

Shouldn't the daemons request fencing on CPG_REASON_NODEDOWN too?
At the moment they only mark the node as failed and increase the cg
failed count.
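
As a rough illustration of the behaviour I am describing, here is a
small Python model. It is not the daemons' actual code. handle_left(),
the kick callback, and the kick_on_nodedown switch are hypothetical
names; the callback stands in for kick_node_from_cluster(). Ignoring
every reason other than NODEDOWN and PROCDOWN is a simplifying
assumption.

```python
from enum import IntEnum


class CpgReason(IntEnum):
    # Numeric values of cpg_reason_t in corosync's cpg.h.
    JOIN = 1
    LEAVE = 2
    NODEDOWN = 3
    NODEUP = 4
    PROCDOWN = 5


def handle_left(left_list, kick, kick_on_nodedown=False):
    """Model of processing the 'left' list of a CPG configuration change.

    left_list is an iterable of (nodeid, reason) pairs. kick is a
    callback standing in for kick_node_from_cluster(). Returns the set
    of node ids marked as failed.
    """
    failed = set()
    for nodeid, reason in left_list:
        # Assumption: only NODEDOWN and PROCDOWN count as failures here.
        if reason not in (CpgReason.NODEDOWN, CpgReason.PROCDOWN):
            continue
        failed.add(nodeid)
        if reason == CpgReason.PROCDOWN:
            # Behaviour as I understand it today.
            kick(nodeid)
        elif kick_on_nodedown:
            # The change I am asking about.
            kick(nodeid)
    return failed


left = [(1543767306, CpgReason.NODEDOWN)]

kicked_now = []
print(handle_left(left, kicked_now.append), kicked_now)

kicked_proposed = []
print(handle_left(left, kicked_proposed.append, kick_on_nodedown=True),
      kicked_proposed)
```

With the default setting the node is marked as failed but never kicked,
which matches what I see; with kick_on_nodedown=True it is also kicked.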

I use a pacemaker-based setup and actually use only the (obsolete)
dlm_controld.pcmk, but the problem seems to be somewhat wider than that
one daemon.
Setup is:
corosync-1.4.1
openais-1.1.4
pacemaker-tip
clusterlib-3.1.1
dlm_controld.pcmk from 3.0.17
lvm2-cluster-2.0.85
Best,
Vladislav