From: Eric Ren <zren@suse.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection
Date: Mon, 16 May 2016 15:44:27 +0800
Message-ID: <57397A5B.3070302@suse.com>
In-Reply-To: <20160513154913.GA28849@redhat.com>

Hi David,

On 05/13/2016 11:49 PM, David Teigland wrote:
> If both sides of the merged partition are kicking the other out of the
> cluster at the same time, it's hard to predict which nodes will remain
> (and it could be none).  To resolve an even partition merge, you need to
> remove/restart the nodes on one of the former halves, i.e. either A,B or
> C,D.  I never thought of a way to do that automatically in this code
> (maybe higher level code would have more options to resolve it.)

Thanks! Hmm, according to the long comments, you've handled the 2/2 even
split by having the low nodeid kill the stateful merged members. We
can do that because the "clean nodes" == "stateful nodes", right? I
guess you're referring to the case where 3 or more partitions
merge, and none of them can see enough clean nodes?

Yes, agreed. But the pacemaker guys may complain that there's not enough
info for them to tell what state DLM is in. Right now DLM outputs a log
message like: "fence work wait to clear merge $merge_count clean
$clean_count part $part_count gone $gone_count". I'm wondering if we
could expose this info, plus how long DLM has been stuck, via
"dlm_tool $some_cmd"?
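
For illustration only, here is a minimal sketch of how a monitoring
script could pull those counters out of the log today. It assumes the
line has exactly the layout quoted above, with plain integers in place
of the $..._count placeholders; the function name and the regular
expression are hypothetical and not taken from the dlm_controld source.

```python
import re

# Assumed layout: the message quoted above with integers substituted for
# the $merge_count, $clean_count, $part_count and $gone_count placeholders.
FENCE_WAIT_RE = re.compile(
    r"fence work wait to clear "
    r"merge (?P<merge>\d+) clean (?P<clean>\d+) "
    r"part (?P<part>\d+) gone (?P<gone>\d+)"
)


def parse_fence_wait(line):
    """Return the four counters as a dict of ints, or None if no match."""
    match = FENCE_WAIT_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

A caller could record the timestamp of the first matching line and
compare it with the latest one to estimate how long the daemon has been
waiting.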

Also, I'm working on an option (like enable_force_kick) as you suggested;-)

>
> Remove the bad fix and it should work better.
>

Yes, will try to persuade pacemaker to drop that patch;-)

>
> Two node clusters are a special case of an even partition merge; I'm sure
> you've seen the lengthy comment about that.  In a 2|2 partition merge, we
> don't kick any nodes from the cluster, as explained above, and it
> generally requires manual resolution.
>
> But a 1|1 partition merge can sometimes be resolved automatically.  Quorum
> can be disabled in a two node cluster, and the fencing system allowed to
> race between two partitioned nodes to select a survivor (there are caveats
> with that.)  The area of code you've been looking at (with the long
> comment) uses the result of the fencing race to resolve the possible
> partition merge by kicking out the node that was fenced.

Yes, I can understand this;-)

Thanks a lot!
Eric

>
> Dave
>


