* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-17 12:10 UTC
To: cluster-devel.redhat.com

Hi David,

This is just a draft patch for you to review ;-) There's one issue I'm not
sure about: where should we clear "stateful_merge_wait"?

And I need to discuss this more with the pacemaker folks, and spend more
time on testing. I will send you the formal patch once things are done ;-)

Thanks,
Eric

* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-17 12:10 UTC
To: cluster-devel.redhat.com

When there are 3 or more partitions that merge, none of them may see enough
clean nodes. In that case, the DLM would be stuck forever until the
administrator manually resets/restarts enough nodes to produce sufficient
clean nodes.

So, output explicit information about the stateful-merge state for
higher-level code (e.g. pcmk). Higher-level code can now use `dlm status -v`
to get "stateful_merge_wait". If it equals "1", we know the DLM is waiting
for manual intervention, and the higher-level code can choose one of the
nodes to fence. The DLM will continue to work once
"clean nodes >= stateful merged nodes" becomes true.

David advised me to do the right thing ;-) Thanks a lot!

Signed-off-by: Eric Ren <zren@suse.com>
---
 dlm_controld/daemon_cpg.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/dlm_controld/daemon_cpg.c b/dlm_controld/daemon_cpg.c
index 356e80d..8f6434f 100644
--- a/dlm_controld/daemon_cpg.c
+++ b/dlm_controld/daemon_cpg.c
@@ -118,6 +118,7 @@ static int zombie_count;
 
 static int fence_result_pid;
 static unsigned int fence_result_try;
+static int stateful_merge_wait; /* cluster is stuck waiting for manual intervention */
 
 static void send_fence_result(int nodeid, int result, uint32_t flags, uint64_t walltime);
 static void send_fence_clear(int nodeid, int result, uint32_t flags, uint64_t walltime);
@@ -847,10 +848,13 @@ static void daemon_fence_work(void)
 
 		if ((clean_count >= merge_count) && !part_count && (low == our_nodeid))
 			kick_stateful_merge_members();
 
+		if ((clean_count < merge_count) && !part_count)
+			stateful_merge_wait = 1;
 		retry = 1;
 		goto out;
 	}
 
+	stateful_merge_wait = 0; /* where should this line go? */
 	/*
 	 * startup fencing
@@ -2382,7 +2386,8 @@ static int print_state_daemon(char *str)
 		 "fence_pid=%d "
 		 "fence_in_progress_unknown=%d "
 		 "zombie_count=%d "
-		 "monotime=%llu ",
+		 "monotime=%llu "
+		 "stateful_merge_wait=%d ",
 		 daemon_member_count,
 		 daemon_joined_count,
 		 daemon_remove_count,
@@ -2392,7 +2397,8 @@ static int print_state_daemon(char *str)
 		 daemon_fence_pid,
 		 fence_in_progress_unknown,
 		 zombie_count,
-		 (unsigned long long)monotime());
+		 (unsigned long long)monotime(),
+		 stateful_merge_wait);
 
 	return strlen(str) + 1;
 }
-- 
2.6.6

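As a rough sketch of how higher-level code might consume this flag (only the
"stateful_merge_wait=%d" field added by the patch above is from the source;
the helper function itself is hypothetical, not part of the patch):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): returns 1 if
 * dlm_controld reports stateful_merge_wait=1, 0 if it reports 0,
 * -1 on error. It assumes only the "stateful_merge_wait=%d" field
 * that the patch above adds to `dlm status -v` output. */
static int check_stateful_merge_wait(void)
{
	FILE *fp;
	char line[4096];
	char *p;
	int val = -1;

	fp = popen("dlm status -v", "r");
	if (!fp)
		return -1;

	while (fgets(line, sizeof(line), fp)) {
		p = strstr(line, "stateful_merge_wait=");
		if (p) {
			sscanf(p, "stateful_merge_wait=%d", &val);
			break;
		}
	}
	pclose(fp);
	return val;
}

A resource manager like pcmk could poll this and, on seeing 1, pick one of
the stateful merged nodes to fence, as the commit message describes.
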
* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-18 6:53 UTC
To: cluster-devel.redhat.com

Hi David,

Ken Gaillot challenged me with this question: since corosync/pcmk can heal
from such a case, why not DLM? Please see the detailed discussion here:

[1] https://github.com/ClusterLabs/pacemaker/pull/839

Here are my thoughts, but I'm not sure; please correct me if I'm wrong.

Say we have a cluster of nodes A, B and C, a lockspace named after $uuid for
a shared disk volume, and a CPG for lockspace $uuid. The $uuid CPG has
members A, B and C when things are OK, but:

T:   quorum is lost; the cluster partitions into 3 parts; lockspace $uuid
     cannot perform any lockspace operations because the cluster is not
     quorate;
T+1: quorum is regained; the dlm_controld daemon CPG has not yet done its
     merging/fencing work.

So here are 2 questions:

Q1: what is a stateful merged node? I've seen the comments in the code ;-)
Does it mean a lockspace existed on the node before it sent the protocol
message?

Q2: what if we add the stateful merged nodes to the dlm_controld daemon CPG
instead of fencing them? If so, the $uuid CPG, e.g. from the perspective of
A, may have only one member - A itself - and A can perform lockspace
operations now because the cluster is quorate (and we skipped fencing).
B and C do likewise. Then it looks like every node owns this volume, so
corruption may happen?

Thanks a lot,
Eric

On 05/17/2016 08:10 PM, Eric Ren wrote:
> Hi David,
>
> This is just a draft patch for you to review ;-) There's one issue I'm not
> sure about: where should we clear "stateful_merge_wait"?
>
> And I need to discuss this more with the pacemaker folks, and spend more
> time on testing. I will send you the formal patch once things are done ;-)

* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: David Teigland @ 2016-05-18 18:50 UTC
To: cluster-devel.redhat.com

On Wed, May 18, 2016 at 02:53:00PM +0800, Eric Ren wrote:
> Q1: what is a stateful merged node?
>
> Q2: what if we add the stateful merged nodes to the dlm_controld daemon
> CPG instead of fencing them?

The details here are fundamental to the way the dlm works, because the dlm
depends on the properties of Virtual Synchrony. Partitions obviously violate
VS. ("Extended" forms of virtual synchrony deal with partitions, but they
are not very practical. Unfortunately, corosync implements one of these
extended forms of VS, which means any application that requires strict VS
has to implement an equivalent of this "stateful merging" detection that's
in the dlm.)

With VS, message/membership events change the state being kept consistent
among nodes. When a partition occurs, nodes have divergent events and
inconsistent state. The partition itself is simple to understand, because
partitioned nodes are indistinguishable from failed nodes and are treated as
such. But if partitioned nodes merge, the inconsistent state has to be made
consistent. This must be done in the same way a new node is added to an
existing node: by doing a "state transfer" from the existing node to the new
node to make the state consistent between them.

If the "new" node previously had state because of a partition/merge, it must
drop that old state and replace it with the state being transferred to it.
After this, they will be consistent and can continue. With a simple process,
you might just kill it, restart it and add the transferred state. But the
dlm isn't a process that can simply be restarted: its state is spread
through the applications using it, and through the kernel. The only
mechanism for resetting the dlm state is resetting the kernel, which means
resetting/rebooting the machine.

> If so, the $uuid CPG, e.g. from the perspective of A, may have only one
> member - A itself - and A can perform lockspace operations now because the
> cluster is quorate (and we skipped fencing). B and C do likewise. Then it
> looks like every node owns this volume, so corruption may happen?

When the nodes are partitioned, the situation is fairly straightforward --
each node thinks the others have failed, and normal operation is blocked
until recovery happens for the failed nodes.

The harder problem is what to do when they merge. The dlm effectively
ignores the invalid addition of the merged nodes and calls it a "stateful
merge". The merged nodes continue to be considered failed (from the
partition) and require a full restart before being added.

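A simplified sketch of the detection idea described above (the names and
structure here are hypothetical, not the actual dlm_controld code): a merged
node is recognized as stateful when it reappears without having been
restarted, and it stays failed until it is.

#include <stdint.h>

struct node {
	int nodeid;
	int failed;		/* lost in a partition, not yet reset */
	uint64_t start_time;	/* daemon start time from its protocol message */
};

/* Called when a previously-partitioned node reappears in the cpg.
 * If the start time it reports shows it was NOT restarted since the
 * partition, it still carries old, divergent state: a "stateful
 * merge". It must stay failed and be fully reset before it can be
 * added back via state transfer. */
static int is_stateful_merge(struct node *n, uint64_t reported_start_time)
{
	if (n->failed && reported_start_time == n->start_time)
		return 1;	/* same instance; old state survived the merge */

	/* a new start time means the node was rebooted/restarted and
	 * comes back clean, so it can be added like any new node */
	n->start_time = reported_start_time;
	n->failed = 0;
	return 0;
}

The key point from the explanation above is that a merged node carrying old
state cannot be made consistent in place; only a full reset gives it a clean
state to receive a state transfer into.
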
* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-20 9:03 UTC
To: cluster-devel.redhat.com

Hi David,

On 05/19/2016 02:50 AM, David Teigland wrote:
> On Wed, May 18, 2016 at 02:53:00PM +0800, Eric Ren wrote:
>> Q1: what is a stateful merged node?
>>
>> Q2: what if we add the stateful merged nodes to the dlm_controld daemon
>> CPG instead of fencing them?
>
> The details here are fundamental to the way the dlm works, because the dlm
> depends on the properties of Virtual Synchrony. Partitions obviously
> violate VS. ("Extended" forms of virtual synchrony deal with partitions,
> but they are not very practical. Unfortunately, corosync implements one of
> these extended forms of VS, which means any application that requires
> strict VS has to implement an equivalent of this "stateful merging"
> detection that's in the dlm.)
>
> With VS, message/membership events change the state being kept consistent
> among nodes. When a partition occurs, nodes have divergent events and
> inconsistent state. The partition itself is simple to understand, because
> partitioned nodes are indistinguishable from failed nodes and are treated
> as such. But if partitioned nodes merge, the inconsistent state has to be
> made consistent. This must be done in the same way a new node is added to
> an existing node: by doing a "state transfer" from the existing node to
> the new node to make the state consistent between them.
>
> If the "new" node previously had state because of a partition/merge, it
> must drop that old state and replace it with the state being transferred
> to it. After this, they will be consistent and can continue. With a simple
> process, you might just kill it, restart it and add the transferred state.
> But the dlm isn't a process that can simply be restarted: its state is
> spread through the applications using it, and through the kernel. The only
> mechanism for resetting the dlm state is resetting the kernel, which means
> resetting/rebooting the machine.
>
>> If so, the $uuid CPG, e.g. from the perspective of A, may have only one
>> member - A itself - and A can perform lockspace operations now because
>> the cluster is quorate (and we skipped fencing). B and C do likewise.
>> Then it looks like every node owns this volume, so corruption may happen?
>
> When the nodes are partitioned, the situation is fairly straightforward --
> each node thinks the others have failed, and normal operation is blocked
> until recovery happens for the failed nodes.
>
> The harder problem is what to do when they merge. The dlm effectively
> ignores the invalid addition of the merged nodes and calls it a "stateful
> merge". The merged nodes continue to be considered failed (from the
> partition) and require a full restart before being added.

Thanks a lot for explaining this valuable knowledge to me! I've also shared
it with the pacemaker folks. They'll make the corresponding changes on the
pcmk side once the dlm_controld patch is merged.

I've sent the patch to you. Please take a look ;-)

With best regards,
Eric