cluster-devel.redhat.com archive mirror
* [Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons
@ 2011-08-19 20:32 Vladislav Bogdanov
  2011-09-01 13:02 ` Vladislav Bogdanov
  0 siblings, 1 reply; 2+ messages in thread
From: Vladislav Bogdanov @ 2011-08-19 20:32 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi all,

I originally posted the same content to the linux-cluster list, but there
was no answer there, so I suspect this list is more suitable.

Several days ago I found that the clvmd CPG in my cluster went into the
kern_stop state after some problems on the corosync ring caused by high load.

The cluster currently contains three nodes: two bare-metal and one VM. The
VM suffered from insufficient scheduling due to host load, and the cluster
went into split-brain for about a second and then quickly recovered.

CPG issued a CPG_REASON_NODEDOWN event, and after that clvmd went to
kern_stop on the two bare-metal nodes and to kern_stop,fencing on the VM
(which is natural, since it did not have quorum).

I would expect the VM to be fenced, but the actual fencing did not happen.
The clvmd CPG stayed stuck in kern_stop even after the VM was fenced
manually, so I had to take the whole cluster down to recover.

I discovered the reason why the node was not fenced on the
CPG_REASON_NODEDOWN event.

Here is what I see in the dlm_tool dump:
1313579105 Processing membership 80592
1313579105 Skipped active node 939787530: born-on=80580, last-seen=80592, this-event=80592, last-event=80580
1313579105 Skipped active node 956564746: born-on=80564, last-seen=80592, this-event=80592, last-event=80580
1313579105 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1543767306"
1313579105 Removed inactive node 1543767306: born-on=80572, last-seen=80580, this-event=80592, last-event=80580
1313579105 dlm:controld conf 2 0 1 memb 939787530 956564746 join left 1543767306
1313579105 dlm:ls:clvmd conf 2 0 1 memb 939787530 956564746 join left 1543767306
1313579105 clvmd add_change cg 4 remove nodeid 1543767306 reason 3
1313579105 clvmd add_change cg 4 counts member 2 joined 0 remove 1 failed 1
1313579105 clvmd stop_kernel cg 4
1313579105 write "0" to "/sys/kernel/dlm/clvmd/control"
1313579105 Node 1543767306/mgmt01 has not been shot yet
1313579105 clvmd check_fencing 1543767306 wait add 1313562825 fail 1313579105 last 0
1313579107 Node 1543767306/mgmt01 was last shot 'now'
1313579107 clvmd check_fencing 1543767306 done add 1313562825 fail 1313579105 last 1313579107
1313579107 clvmd check_fencing done

That means dlm_controld received a CPG_REASON_NODEDOWN event for the
clvmd CPG and did not call kick_node_from_cluster(), so pacemaker did not
do fencing on behalf of the clvmd CPG.
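
To make that concrete, here is a rough sketch of the reason-dispatch
pattern I mean (hand-written against the corosync CPG API, not the actual
cluster-3 source; struct memb and find_member() are just placeholders for
the daemon's own bookkeeping):

#include <stdint.h>
#include <corosync/cpg.h>

struct memb {
	uint32_t nodeid;
	int failed;
};

/* placeholders for the daemon's own helpers */
extern struct memb *find_member(uint32_t nodeid);
extern void kick_node_from_cluster(uint32_t nodeid);

static void confchg_cb(cpg_handle_t handle,
		       const struct cpg_name *group_name,
		       const struct cpg_address *member_list,
		       size_t member_list_entries,
		       const struct cpg_address *left_list,
		       size_t left_list_entries,
		       const struct cpg_address *joined_list,
		       size_t joined_list_entries)
{
	size_t i;

	for (i = 0; i < left_list_entries; i++) {
		struct memb *m = find_member(left_list[i].nodeid);
		if (!m)
			continue;

		switch (left_list[i].reason) {
		case CPG_REASON_NODEDOWN:
			/* "reason 3" in the dump above: the member is only
			   marked failed, nobody asks for it to be fenced */
			m->failed = 1;
			break;
		case CPG_REASON_PROCDOWN:
			/* the daemon died while the node stayed up; this is
			   the only case where fencing is requested */
			m->failed = 1;
			kick_node_from_cluster(left_list[i].nodeid);
			break;
		default:
			/* clean CPG_REASON_LEAVE, nothing to do */
			break;
		}
	}
}

The dump above shows exactly the first branch: cg 4 records "remove
nodeid 1543767306 reason 3", the kernel is stopped, and check_fencing then
waits for a fencing event that nobody ever requested.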

Please correct me if I'm wrong:
* Requesting fencing of a node on a CPG_REASON_NODEDOWN event was
historically left to groupd.
* That's why all daemons (fenced, dlm_controld, gfs2_controld) call
kick_node_from_cluster() only on the CPG_REASON_PROCDOWN event, not on
CPG_REASON_NODEDOWN.
* groupd is obsolete in 3.x.

Shouldn't the daemons request fencing on CPG_REASON_NODEDOWN too?
Right now they only mark the node as failed and increase the cg fail count.
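
The change I have in mind is roughly the following (again only a sketch
built on the pattern above, not a tested patch; mark_member_failed() and
we_have_quorum() are made-up placeholders, and the quorum guard is just one
possible way to keep the inquorate side from kicking the majority):

#include <stdint.h>
#include <corosync/cpg.h>

/* made-up placeholders for the daemon's own helpers */
extern void mark_member_failed(uint32_t nodeid);
extern void kick_node_from_cluster(uint32_t nodeid);
extern int we_have_quorum(void);

/* Proposed handling of one departed CPG member: treat an unclean
   NODEDOWN like PROCDOWN and request fencing, instead of only
   recording the failure and bumping the cg fail count. */
static void handle_left_member(const struct cpg_address *left)
{
	switch (left->reason) {
	case CPG_REASON_NODEDOWN:
	case CPG_REASON_PROCDOWN:
		mark_member_failed(left->nodeid);
		if (we_have_quorum())
			kick_node_from_cluster(left->nodeid);
		break;
	default:
		break;
	}
}

In my pacemaker-based setup that kick is exactly what would let pacemaker
do the fencing on behalf of the clvmd CPG, which is what did not happen
above.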

I use a pacemaker-based setup, and actually run only the (obsolete)
dlm_controld.pcmk, but the problem seems to be a bit wider than that one
daemon.

Setup is:
corosync-1.4.1
openais-1.1.4
pacemaker-tip
clusterlib-3.1.1
dlm_controld.pcmk from 3.0.17
lvm2-cluster-2.0.85

Best,
Vladislav




* [Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons
  2011-08-19 20:32 [Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons Vladislav Bogdanov
@ 2011-09-01 13:02 ` Vladislav Bogdanov
  0 siblings, 0 replies; 2+ messages in thread
From: Vladislav Bogdanov @ 2011-09-01 13:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

No reply...
Did I ask something extremely stupid?

One more addition: pacemaker saw a transitional membership change event
at the same time.

19.08.2011 23:32, Vladislav Bogdanov wrote:
> [...]



