[Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons

From: Vladislav Bogdanov <bubble@hoster-ok.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] (Repost from linux-cluster) Handling of CPG_REASON_NODEDOWN in daemons
Date: Thu, 01 Sep 2011 16:02:30 +0300
Message-ID: <4E5F8266.3020907@hoster-ok.com>
In-Reply-To: <4E4EC851.9030608@hoster-ok.com>

No reply...
Did I ask something extremely stupid?

One more addition: pacemaker saw a transitional membership change event
at the same time.

19.08.2011 23:32, Vladislav Bogdanov wrote:
> Hi all,
> 
> I originally posted the same content to the linux-cluster list, but got
> no answer there, so I suspect that this list is more suitable.
> 
> Several days ago I found that the clvmd CPG in my cluster had gone into the
> kern_stop state after some problems on the corosync ring caused by high load.
> 
> The cluster currently contains three nodes, two bare-metal and one VM. The VM
> was starved of CPU time because of host load, and the cluster went into
> split-brain for one second and then quickly recovered.
> 
> CPG issued a CPG_REASON_NODEDOWN event, and after that clvmd went to
> kern_stop on the two bare-metal nodes and to kern_stop,fencing on the VM
> (naturally, since it didn't have quorum).
> 
> I would expect the VM to be fenced, but no actual fencing happened. The clvmd
> CPG stayed stuck in kern_stop even after the VM was fenced manually, so I
> had to take the whole cluster down to recover.
> 
> I discovered the reason why the node was not fenced on the
> CPG_REASON_NODEDOWN event.
> 
> Here is what I see in dlm_tool dump:
> 1313579105 Processing membership 80592
> 1313579105 Skipped active node 939787530: born-on=80580, last-seen=80592, this-event=80592, last-event=80580
> 1313579105 Skipped active node 956564746: born-on=80564, last-seen=80592, this-event=80592, last-event=80580
> 1313579105 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1543767306"
> 1313579105 Removed inactive node 1543767306: born-on=80572, last-seen=80580, this-event=80592, last-event=80580
> 1313579105 dlm:controld conf 2 0 1 memb 939787530 956564746 join left 1543767306
> 1313579105 dlm:ls:clvmd conf 2 0 1 memb 939787530 956564746 join left 1543767306
> 1313579105 clvmd add_change cg 4 remove nodeid 1543767306 reason 3
> 1313579105 clvmd add_change cg 4 counts member 2 joined 0 remove 1 failed 1
> 1313579105 clvmd stop_kernel cg 4
> 1313579105 write "0" to "/sys/kernel/dlm/clvmd/control"
> 1313579105 Node 1543767306/mgmt01 has not been shot yet
> 1313579105 clvmd check_fencing 1543767306 wait add 1313562825 fail 1313579105 last 0
> 1313579107 Node 1543767306/mgmt01 was last shot 'now'
> 1313579107 clvmd check_fencing 1543767306 done add 1313562825 fail 1313579105 last 1313579107
> 1313579107 clvmd check_fencing done
> 
> That means that dlm_controld received a CPG_REASON_NODEDOWN event for the
> clvmd CPG and did not call kick_node_from_cluster(), so pacemaker didn't
> do fencing on behalf of the clvmd CPG.
> 
> Please correct me if I'm wrong:
> * Requesting fencing of a node on a CPG_REASON_NODEDOWN event was
> historically left to groupd.
> * That's why all daemons (fenced, dlm_controld, gfs2_controld) call
> kick_node_from_cluster() only on a CPG_REASON_PROCDOWN event, not on
> CPG_REASON_NODEDOWN.
> * groupd is obsolete in 3.x.
> 
> Shouldn't the daemons request fencing on CPG_REASON_NODEDOWN too?
> Currently they only mark the node as failed and increase the cg failcount.
> 
> I use a pacemaker-based setup, and actually use only the (obsolete)
> dlm_controld.pcmk, but the problem seems to be somewhat wider than that
> one daemon.
> 
> Setup is:
> corosync-1.4.1
> openais-1.1.4
> pacemaker-tip
> clusterlib-3.1.1
> dlm_controld.pcmk from 3.0.17
> lvm2-cluster-2.0.85
> 
> Best,
> Vladislav
>
