* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: Eric Ren @ 2016-05-12  9:16 UTC
To: cluster-devel.redhat.com

DLM can get stuck in the "need fencing" state even though the cluster
regains quorum very quickly after a transient network disconnection.

It is possible for the disconnection and the merge to happen within the
same monotonic-clock second, in which case "cluster_quorate_monotime"
can equal "node->daemon_rem_time". The current check then skips the
chance to tell corosync to kill the node for the stateful merge, and as
a result fencing cannot proceed any further.

Signed-off-by: Eric Ren <zren@suse.com>
---
 dlm_controld/daemon_cpg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dlm_controld/daemon_cpg.c b/dlm_controld/daemon_cpg.c
index 356e80d..cd8a4e2 100644
--- a/dlm_controld/daemon_cpg.c
+++ b/dlm_controld/daemon_cpg.c
@@ -1695,7 +1695,7 @@ static void receive_protocol(struct dlm_header *hd, int len)
 		node->stateful_merge = 1;
 
 	if (cluster_quorate && node->daemon_rem_time &&
-	    cluster_quorate_monotime < node->daemon_rem_time) {
+	    cluster_quorate_monotime <= node->daemon_rem_time) {
 		if (!node->killed) {
 			if (cluster_two_node) {
 				/*
-- 
2.6.6
* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: David Teigland @ 2016-05-12 16:51 UTC
To: cluster-devel.redhat.com

On Thu, May 12, 2016 at 05:16:08PM +0800, Eric Ren wrote:
> DLM can get stuck in the "need fencing" state even though the cluster
> regains quorum very quickly after a transient network disconnection.
>
> It is possible for the disconnection and the merge to happen within the
> same monotonic-clock second, in which case "cluster_quorate_monotime"
> can equal "node->daemon_rem_time". The current check then skips the
> chance to tell corosync to kill the node for the stateful merge, and as
> a result fencing cannot proceed any further.

Hi Eric, thanks for looking at this, it's a notoriously difficult
situation to sort out.  I'm not sure we have the same understanding of
how the behavior will change with your patch, so let's look at an
example, and please let me know if you think these examples don't match
what you see (it's been quite a while since I actually tested this).

T = time in seconds, A,B,C = cluster nodes.

At T=1 A,B,C become members and have quorum.
At T=10 a partition creates A,B | C.
At T=11 it merges and creates A,B,C.

At T=12, A,B will have:
  cluster_quorate=1
  cluster_quorate_monotime=1
  C->daemon_rem_time=10

At T=12, C will have:
  cluster_quorate=1
  cluster_quorate_monotime=11
  A->daemon_rem_time=10
  B->daemon_rem_time=10

Result:

A,B will kick C from the cluster because
cluster_quorate_monotime (1) < C->daemon_rem_time (10),
which is what we want.

C will not kick A,B from the cluster because
cluster_quorate_monotime (11) > A->daemon_rem_time (10),
which is what we want.

It's the simpler case, but does that sound right so far?

...

If the partition and merge occur within the same second, then:

At T=1 A,B,C become members and get quorum.
At T=10 a partition creates A,B | C.
At T=10 it merges and creates A,B,C.

At T=12, A,B will have:
  cluster_quorate=1
  cluster_quorate_monotime=1
  C->daemon_rem_time=10

At T=12, C will have:
  cluster_quorate=1
  cluster_quorate_monotime=10
  A->daemon_rem_time=10
  B->daemon_rem_time=10

Result:

A,B will kick C from the cluster because
cluster_quorate_monotime (1) < C->daemon_rem_time (10),
which is what we want.

C will not kick A,B from the cluster because
cluster_quorate_monotime (10) = A->daemon_rem_time (10),
which is what we want.

If that's correct, there doesn't seem to be a problem so far.
If we apply your patch, won't it allow C to kick A,B from the
cluster since cluster_quorate_monotime = A->daemon_rem_time?

...

If you're looking at a cluster with an equal partition, e.g. A,B | C,D,
then it becomes messy because cluster_quorate_monotime = daemon_rem_time
everywhere after the merge.  In this case, no nodes will kick others from
the cluster, but with your patch, each side will kick the other side from
the cluster.  Neither option is good.  In the past we decided to let the
cluster sit in this state so an admin could choose which nodes to remove.
Do you prefer the alternative of kicking nodes in this case (with somewhat
unpredictable results)?  If so, we could make that an optional setting,
but we'd want to keep the existing behavior for non-even partitions in the
example above.
> diff --git a/dlm_controld/daemon_cpg.c b/dlm_controld/daemon_cpg.c
> index 356e80d..cd8a4e2 100644
> --- a/dlm_controld/daemon_cpg.c
> +++ b/dlm_controld/daemon_cpg.c
> @@ -1695,7 +1695,7 @@ static void receive_protocol(struct dlm_header *hd, int len)
>  		node->stateful_merge = 1;
>  
>  	if (cluster_quorate && node->daemon_rem_time &&
> -	    cluster_quorate_monotime < node->daemon_rem_time) {
> +	    cluster_quorate_monotime <= node->daemon_rem_time) {
>  		if (!node->killed) {
>  			if (cluster_two_node) {
>  				/*
> --
> 2.6.6
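To make the effect of the "<" versus "<=" check concrete, here is a small
standalone C sketch; it is an illustration only, not dlm_controld code
(the helper and its arguments are invented for this example), and it
simply replays the second scenario above on both sides of the merge:

/* Illustration only -- not dlm_controld code.  Replays the second
 * example (partition and merge within the same second) against the
 * original "<" check and the proposed "<=" check. */
#include <stdio.h>
#include <stdint.h>

static int would_kick(uint64_t quorate_monotime, uint64_t daemon_rem_time,
		      int patched)
{
	if (!daemon_rem_time)
		return 0;
	return patched ? (quorate_monotime <= daemon_rem_time)
		       : (quorate_monotime <  daemon_rem_time);
}

int main(void)
{
	/* On A,B: quorate since T=1, C was removed at T=10. */
	printf("A,B kick C:  original=%d  patched=%d\n",
	       would_kick(1, 10, 0), would_kick(1, 10, 1));

	/* On C: quorate again at T=10, A and B were removed at T=10. */
	printf("C kicks A,B: original=%d  patched=%d\n",
	       would_kick(10, 10, 0), would_kick(10, 10, 1));
	return 0;
}

Both checks let A,B kick C, but only the patched "<=" check also lets C
kick A,B, which is the regression David points out above.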
* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: Eric Ren @ 2016-05-13  5:45 UTC
To: cluster-devel.redhat.com

Hi David,

Thanks very much for explaining this to me in such a nice way ;-)

On 05/13/2016 12:51 AM, David Teigland wrote:
>
> T = time in seconds, A,B,C = cluster nodes.
>
> At T=1 A,B,C become members and have quorum.
> At T=10 a partition creates A,B | C.
> At T=11 it merges and creates A,B,C.
>
> At T=12, A,B will have:
>   cluster_quorate=1
>   cluster_quorate_monotime=1
>   C->daemon_rem_time=10
>
> At T=12, C will have:
>   cluster_quorate=1
>   cluster_quorate_monotime=11
>   A->daemon_rem_time=10
>   B->daemon_rem_time=10
>
> Result:
>
> A,B will kick C from the cluster because
> cluster_quorate_monotime (1) < C->daemon_rem_time (10),
> which is what we want.
>
> C will not kick A,B from the cluster because
> cluster_quorate_monotime (11) > A->daemon_rem_time (10),
> which is what we want.
>
> It's the simpler case, but does that sound right so far?

Sure ;-)

> ...
>
> If the partition and merge occur within the same second, then:
>
> At T=1 A,B,C become members and get quorum.
> At T=10 a partition creates A,B | C.
> At T=10 it merges and creates A,B,C.
>
> At T=12, A,B will have:
>   cluster_quorate=1
>   cluster_quorate_monotime=1
>   C->daemon_rem_time=10
>
> At T=12, C will have:
>   cluster_quorate=1
>   cluster_quorate_monotime=10
>   A->daemon_rem_time=10
>   B->daemon_rem_time=10
>
> Result:
>
> A,B will kick C from the cluster because
> cluster_quorate_monotime (1) < C->daemon_rem_time (10),
> which is what we want.
>
> C will not kick A,B from the cluster because
> cluster_quorate_monotime (10) = A->daemon_rem_time (10),
> which is what we want.
>
> If that's correct, there doesn't seem to be a problem so far.
> If we apply your patch, won't it allow C to kick A,B from the
> cluster since cluster_quorate_monotime = A->daemon_rem_time?

Aha, yes! Also, this patch doesn't really make sense, because it doesn't
help when cluster_quorate_monotime (the time the network reconnected) >
A->daemon_rem_time (the time the network disconnected), as can happen
when three or more partitions merge.

> ...
>
> If you're looking at a cluster with an equal partition, e.g. A,B | C,D,
> then it becomes messy because cluster_quorate_monotime = daemon_rem_time
> everywhere after the merge.  In this case, no nodes will kick others from
> the cluster, but with your patch, each side will kick the other side from
> the cluster.  Neither option is good.  In the past we decided to let the
> cluster sit in this state so an admin could choose which nodes to remove.
> Do you prefer the alternative of kicking nodes in this case (with somewhat
> unpredictable results)?  If so, we could make that an optional setting,
> but we'd want to keep the existing behavior for non-even partitions in the
> example above.

Gotcha, thanks! But could you please elaborate a bit more on what you
mean by "with somewhat unpredictable results"? Do you mean that some
inconsistency problems may happen, or just that all services would be
interrupted because all nodes would be fenced?
Actually, the reason I'm working on this problem is that a patch for
this issue has been merged into pacemaker:

[1] https://github.com/ClusterLabs/pacemaker/pull/839

Personally, I think it's a really bad fix that renders the careful
thought and effort in dlm_controld pointless, because it makes pacemaker
fence the node whenever the dlm resource agent notices that fencing is
going on, via "dlm_tool ls | grep -q "wait fencing"", right?

That fix is for:
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1268313

We've seen unnecessary fencing happen in a two-node cluster with this
fix, e.g. both nodes get fenced when we kill corosync on one of them.

Thanks for your suggestion. So far, an optional setting looks much
better than that fix. I'd like to give it a try ;-)

Thanks,
Eric

>> diff --git a/dlm_controld/daemon_cpg.c b/dlm_controld/daemon_cpg.c
>> index 356e80d..cd8a4e2 100644
>> --- a/dlm_controld/daemon_cpg.c
>> +++ b/dlm_controld/daemon_cpg.c
>> @@ -1695,7 +1695,7 @@ static void receive_protocol(struct dlm_header *hd, int len)
>>  		node->stateful_merge = 1;
>>  
>>  	if (cluster_quorate && node->daemon_rem_time &&
>> -	    cluster_quorate_monotime < node->daemon_rem_time) {
>> +	    cluster_quorate_monotime <= node->daemon_rem_time) {
>>  		if (!node->killed) {
>>  			if (cluster_two_node) {
>>  				/*
>> --
>> 2.6.6
* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: David Teigland @ 2016-05-13 15:49 UTC
To: cluster-devel.redhat.com

On Fri, May 13, 2016 at 01:45:47PM +0800, Eric Ren wrote:
> > the cluster.  Neither option is good.  In the past we decided to let the
> > cluster sit in this state so an admin could choose which nodes to remove.
> > Do you prefer the alternative of kicking nodes in this case (with somewhat
> > unpredictable results)?  If so, we could make that an optional setting,
> > but we'd want to keep the existing behavior for non-even partitions in the
> > example above.
>
> Gotcha, thanks! But could you please elaborate a bit more on what you
> mean by "with somewhat unpredictable results"? Do you mean that some
> inconsistency problems may happen, or just that all services would be
> interrupted because all nodes would be fenced?

If both sides of the merged partition are kicking the other out of the
cluster at the same time, it's hard to predict which nodes will remain
(and it could be none).  To resolve an even partition merge, you need to
remove/restart the nodes on one of the former halves, i.e. either A,B or
C,D.  I never thought of a way to do that automatically in this code
(maybe higher-level code would have more options to resolve it).

> Actually, the reason I'm working on this problem is that a patch for
> this issue has been merged into pacemaker:
>
> [1] https://github.com/ClusterLabs/pacemaker/pull/839
>
> Personally, I think it's a really bad fix that renders the careful
> thought and effort in dlm_controld pointless, because it makes pacemaker
> fence the node whenever the dlm resource agent notices that fencing is
> going on, via "dlm_tool ls | grep -q "wait fencing"", right?
>
> That fix is for:
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1268313

"wait fencing" is normal.  Reading those links, it appears the real issue
was not identified before the patch was applied.

> We've seen unnecessary fencing happen in a two-node cluster with this
> fix, e.g. both nodes get fenced when we kill corosync on one of them.

Remove the bad fix and it should work better.

> Thanks for your suggestion. So far, an optional setting looks much
> better than that fix. I'd like to give it a try ;-)

Two-node clusters are a special case of an even partition merge; I'm sure
you've seen the lengthy comment about that.  In a 2|2 partition merge, we
don't kick any nodes from the cluster, as explained above, and it
generally requires manual resolution.

But a 1|1 partition merge can sometimes be resolved automatically.  Quorum
can be disabled in a two-node cluster, and the fencing system allowed to
race between two partitioned nodes to select a survivor (there are caveats
with that).  The area of code you've been looking at (with the long
comment) uses the result of the fencing race to resolve the possible
partition merge by kicking out the node that was fenced.

Dave
* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: Eric Ren @ 2016-05-16  7:44 UTC
To: cluster-devel.redhat.com

Hi David,

On 05/13/2016 11:49 PM, David Teigland wrote:
> If both sides of the merged partition are kicking the other out of the
> cluster at the same time, it's hard to predict which nodes will remain
> (and it could be none).  To resolve an even partition merge, you need to
> remove/restart the nodes on one of the former halves, i.e. either A,B or
> C,D.  I never thought of a way to do that automatically in this code
> (maybe higher-level code would have more options to resolve it).

Thanks! Hmm, according to the long comment, you've handled the 2|2 even
split by having the low nodeid kill the stateful merged members; we can
do that because the "clean nodes" == "stateful nodes", right? I guess
you're referring to the case where three or more partitions merge and
none of them can see enough clean nodes? Yes, agreed.

But the pacemaker folks may complain that there isn't enough information
for them to judge what state DLM is in. Currently DLM logs messages like:
"fence work wait to clear merge $merge_count clean $clean_count part
$part_count gone $gone_count". I'm wondering whether we could expose this
information, and how long DLM has been stuck, via "dlm_tool $some_cmd"?

Also, I'm working on an option (like enable_force_kick) as you
suggested ;-)

> Remove the bad fix and it should work better.

Yes, I will try to persuade the pacemaker folks to drop that patch ;-)

> Two-node clusters are a special case of an even partition merge; I'm sure
> you've seen the lengthy comment about that.  In a 2|2 partition merge, we
> don't kick any nodes from the cluster, as explained above, and it
> generally requires manual resolution.
>
> But a 1|1 partition merge can sometimes be resolved automatically.  Quorum
> can be disabled in a two-node cluster, and the fencing system allowed to
> race between two partitioned nodes to select a survivor (there are caveats
> with that).  The area of code you've been looking at (with the long
> comment) uses the result of the fencing race to resolve the possible
> partition merge by kicking out the node that was fenced.

Yes, I can understand this ;-) Thanks a lot!

Eric

> Dave
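For reference, the shape Eric seems to have in mind is a dlm.conf
key=value entry along the following lines; the option name comes from
his message and did not exist in dlm_controld at the time, so this is
only a sketch of the intent, not an actual setting:

# hypothetical dlm.conf entry -- the name "enable_force_kick" is taken
# from the discussion above and is not an existing dlm_controld option
enable_force_kick=1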
* [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection

From: David Teigland @ 2016-05-16 17:02 UTC
To: cluster-devel.redhat.com

On Mon, May 16, 2016 at 03:44:27PM +0800, Eric Ren wrote:
> Thanks! Hmm, according to the long comment, you've handled the 2|2 even
> split by having the low nodeid kill the stateful merged members;

Interesting, I'd forgotten about that bit of code, so I was wrong to say
that we do nothing after a 2|2 partition merge.

> we can do that because the "clean nodes" == "stateful nodes", right? I
> guess you're referring to the case where three or more partitions merge
> and none of them can see enough clean nodes?

That sounds about right; this is a fairly narrow special case that mainly
helps in the case of two evenly split partitions.  There could be some
cases with more nodes and 3+ partitions where it might help.

> Currently DLM logs messages like:
> "fence work wait to clear merge $merge_count clean $clean_count part
> $part_count gone $gone_count". I'm wondering whether we could expose this
> information, and how long DLM has been stuck, via "dlm_tool $some_cmd"?

Yes, that's a good idea.  dlm_tool should clearly report if the dlm is
blocked because of a stateful partition merge.  Perhaps a new global
variable printed by 'dlm_tool status -v'?  I'm not exactly sure of the
condition when we'd set this new variable; could you try it out?  For a
start, maybe something similar to this:

static int stateful_merge_wait;
...

	if (!cluster_two_node && merge_count) {
		log_retry(retry_fencing, "fence work wait to clear merge %d clean %d part %d gone %d",
			  merge_count, clean_count, part_count, gone_count);

		if ((clean_count >= merge_count) && !part_count &&
		    (low == our_nodeid))
			kick_stateful_merge_members();

+		if ((clean_count < merge_count) && !part_count)
+			stateful_merge_wait = 1;

		retry = 1;
		goto out;
	}

(and added to print_state_daemon)

Dave
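As a rough sketch of the reporting half that David mentions in passing,
the new flag could be surfaced to 'dlm_tool status -v' roughly as below.
Note that the real print_state_daemon() in dlm_controld has its own
signature and output helpers, so the function shape and the output line
here are assumptions, not the actual API:

/* Sketch only: how the new flag might be reported to dlm_tool.  The
 * real print_state_daemon() in dlm_controld uses its own output
 * conventions; this stand-in just shows the idea of printing one
 * extra key=value line when the daemon is stuck on a stateful merge. */
#include <stdio.h>

static int stateful_merge_wait;

static void print_state_daemon(FILE *out)
{
	/* ... existing daemon state fields would be printed here ... */
	fprintf(out, "stateful_merge_wait=%d\n", stateful_merge_wait);
}

A non-zero stateful_merge_wait in the status output would then tell an
admin (or a resource agent) that fencing is blocked by a stateful
partition merge rather than by an outstanding fence operation, which is
the distinction Eric says pacemaker currently cannot make.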