* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-17 12:10 UTC
To: cluster-devel.redhat.com

Hi David,

This is just a draft patch for you to review ;-) There's one issue I'm not
sure about: where should we clear "stateful_merge_wait"?

And I need to discuss this more with the pacemaker folks, and spend more
time on testing. I will send you the formal patch once things are done ;-)

Thanks,
Eric

* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-17 12:10 UTC
To: cluster-devel.redhat.com

When there are 3 or more partitions that merge, none of them may see enough
clean nodes. In that case, the DLM would be stuck forever until the
administrator manually resets/restarts enough nodes to produce sufficient
clean nodes.

So, output explicit information about the stateful-merge state for
higher-level code (e.g. pcmk). Higher-level code can now use `dlm status -v`
to get "stateful_merge_wait". If it equals "1", we know the DLM is waiting
for manual intervention, and the higher-level code can choose one of the
nodes to fence. The DLM will continue to work once
"clean nodes >= stateful merged nodes" becomes true.

David advised me to do the right thing ;-) Thanks a lot!

Signed-off-by: Eric Ren <zren@suse.com>
---
 dlm_controld/daemon_cpg.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/dlm_controld/daemon_cpg.c b/dlm_controld/daemon_cpg.c
index 356e80d..8f6434f 100644
--- a/dlm_controld/daemon_cpg.c
+++ b/dlm_controld/daemon_cpg.c
@@ -118,6 +118,7 @@ static int zombie_count;
 
 static int fence_result_pid;
 static unsigned int fence_result_try;
+static int stateful_merge_wait; /* cluster is stuck waiting for manual intervention */
 
 static void send_fence_result(int nodeid, int result, uint32_t flags, uint64_t walltime);
 static void send_fence_clear(int nodeid, int result, uint32_t flags, uint64_t walltime);
@@ -847,10 +848,13 @@ static void daemon_fence_work(void)
 
 		if ((clean_count >= merge_count) && !part_count && (low == our_nodeid))
 			kick_stateful_merge_members();
 
+		if ((clean_count < merge_count) && !part_count)
+			stateful_merge_wait = 1;
 		retry = 1;
 		goto out;
 	}
 
+	stateful_merge_wait = 0; /* where should this line go? */
 	/*
 	 * startup fencing
@@ -2382,7 +2386,8 @@ static int print_state_daemon(char *str)
 		 "fence_pid=%d "
 		 "fence_in_progress_unknown=%d "
 		 "zombie_count=%d "
-		 "monotime=%llu ",
+		 "monotime=%llu "
+		 "stateful_merge_wait=%d ",
 		 daemon_member_count,
 		 daemon_joined_count,
 		 daemon_remove_count,
@@ -2392,7 +2397,8 @@ static int print_state_daemon(char *str)
 		 daemon_fence_pid,
 		 fence_in_progress_unknown,
 		 zombie_count,
-		 (unsigned long long)monotime());
+		 (unsigned long long)monotime(),
+		 stateful_merge_wait);
 
 	return strlen(str) + 1;
 }
-- 
2.6.6

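As a rough sketch of how higher-level code might consume this flag (only the
"stateful_merge_wait=%d" field added by the patch above is from the source;
the helper function itself is hypothetical, not part of the patch):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): returns 1 if
 * dlm_controld reports stateful_merge_wait=1, 0 if it reports 0,
 * -1 on error. It assumes only the "stateful_merge_wait=%d" field
 * that the patch above adds to `dlm status -v` output. */
static int check_stateful_merge_wait(void)
{
	FILE *fp;
	char line[4096];
	char *p;
	int val = -1;

	fp = popen("dlm status -v", "r");
	if (!fp)
		return -1;

	while (fgets(line, sizeof(line), fp)) {
		p = strstr(line, "stateful_merge_wait=");
		if (p) {
			sscanf(p, "stateful_merge_wait=%d", &val);
			break;
		}
	}
	pclose(fp);
	return val;
}

A resource manager like pcmk could poll this and, on seeing 1, pick one of
the stateful merged nodes to fence, as the commit message describes.
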
* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-18 6:53 UTC
To: cluster-devel.redhat.com

Hi David,

Ken Gaillot challenged me with this question: since corosync/pcmk can heal
from such a case, why not DLM? Please see the detailed discussion here:

[1] https://github.com/ClusterLabs/pacemaker/pull/839

Here are my thoughts, but I'm not sure; please correct me if I'm wrong.

Say we have a cluster of nodes A, B and C, a lockspace named after $uuid for
a shared disk volume, and a CPG for lockspace $uuid. The $uuid CPG has
members A, B and C when things are OK, but:

T:   quorum is lost; the cluster partitions into 3 parts; lockspace $uuid
     cannot perform any lockspace operations because the cluster is not
     quorate;
T+1: quorum is regained; the dlm_controld daemon CPG has not yet done its
     merging/fencing work.

So here are 2 questions:

Q1: what is a stateful merged node? I've seen the comments in the code ;-)
Does it mean a lockspace existed on the node before it sent the protocol
message?

Q2: what if we add the stateful merged nodes to the dlm_controld daemon CPG
instead of fencing them? If so, the $uuid CPG, e.g. from the perspective of
A, may have only one member - A itself - and A can perform lockspace
operations now because the cluster is quorate (and we skipped fencing).
B and C do likewise. Then it looks like every node owns this volume, so
corruption may happen?

Thanks a lot,
Eric

On 05/17/2016 08:10 PM, Eric Ren wrote:
> Hi David,
>
> This is just a draft patch for you to review ;-) There's one issue I'm not
> sure about: where should we clear "stateful_merge_wait"?
>
> And I need to discuss this more with the pacemaker folks, and spend more
> time on testing. I will send you the formal patch once things are done ;-)

* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: David Teigland @ 2016-05-18 18:50 UTC
To: cluster-devel.redhat.com

On Wed, May 18, 2016 at 02:53:00PM +0800, Eric Ren wrote:
> Q1: what is a stateful merged node?
>
> Q2: what if we add the stateful merged nodes to the dlm_controld daemon
> CPG instead of fencing them?

The details here are fundamental to the way the dlm works, because the dlm
depends on the properties of Virtual Synchrony. Partitions obviously violate
VS. ("Extended" forms of virtual synchrony deal with partitions, but they
are not very practical. Unfortunately, corosync implements one of these
extended forms of VS, which means any application that requires strict VS
has to implement an equivalent of this "stateful merging" detection that's
in the dlm.)

With VS, message/membership events change the state being kept consistent
among nodes. When a partition occurs, nodes have divergent events and
inconsistent state. The partition itself is simple to understand, because
partitioned nodes are indistinguishable from failed nodes and are treated as
such. But if partitioned nodes merge, the inconsistent state has to be made
consistent. This must be done in the same way a new node is added to an
existing node: by doing a "state transfer" from the existing node to the new
node to make the state consistent between them.

If the "new" node previously had state because of a partition/merge, it must
drop that old state and replace it with the state being transferred to it.
After this, they will be consistent and can continue. With a simple process,
you might just kill it, restart it and add the transferred state. But the
dlm isn't a process that can simply be restarted: its state is spread
through the applications using it, and through the kernel. The only
mechanism for resetting the dlm state is resetting the kernel, which means
resetting/rebooting the machine.

> If so, the $uuid CPG, e.g. from the perspective of A, may have only one
> member - A itself - and A can perform lockspace operations now because the
> cluster is quorate (and we skipped fencing). B and C do likewise. Then it
> looks like every node owns this volume, so corruption may happen?

When the nodes are partitioned, the situation is fairly straightforward --
each node thinks the others have failed, and normal operation is blocked
until recovery happens for the failed nodes.

The harder problem is what to do when they merge. The dlm effectively
ignores the invalid addition of the merged nodes and calls it a "stateful
merge". The merged nodes continue to be considered failed (from the
partition) and require a full restart before being added.

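A simplified sketch of the detection idea described above (the names and
structure here are hypothetical, not the actual dlm_controld code): a merged
node is recognized as stateful when it reappears without having been
restarted, and it stays failed until it is.

#include <stdint.h>

struct node {
	int nodeid;
	int failed;		/* lost in a partition, not yet reset */
	uint64_t start_time;	/* daemon start time from its protocol message */
};

/* Called when a previously-partitioned node reappears in the cpg.
 * If the start time it reports shows it was NOT restarted since the
 * partition, it still carries old, divergent state: a "stateful
 * merge". It must stay failed and be fully reset before it can be
 * added back via state transfer. */
static int is_stateful_merge(struct node *n, uint64_t reported_start_time)
{
	if (n->failed && reported_start_time == n->start_time)
		return 1;	/* same instance; old state survived the merge */

	/* a new start time means the node was rebooted/restarted and
	 * comes back clean, so it can be added like any new node */
	n->start_time = reported_start_time;
	n->failed = 0;
	return 0;
}

The key point from the explanation above is that a merged node carrying old
state cannot be made consistent in place; only a full reset gives it a clean
state to receive a state transfer into.
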
* [Cluster-devel] [DLM PATCH] dlm_controld: outputs explicit info about stateful merging

From: Eric Ren @ 2016-05-20 9:03 UTC
To: cluster-devel.redhat.com

Hi David,

On 05/19/2016 02:50 AM, David Teigland wrote:
> On Wed, May 18, 2016 at 02:53:00PM +0800, Eric Ren wrote:
>> Q1: what is a stateful merged node?
>>
>> Q2: what if we add the stateful merged nodes to the dlm_controld daemon
>> CPG instead of fencing them?
>
> The details here are fundamental to the way the dlm works, because the dlm
> depends on the properties of Virtual Synchrony. Partitions obviously
> violate VS. ("Extended" forms of virtual synchrony deal with partitions,
> but they are not very practical. Unfortunately, corosync implements one of
> these extended forms of VS, which means any application that requires
> strict VS has to implement an equivalent of this "stateful merging"
> detection that's in the dlm.)
>
> With VS, message/membership events change the state being kept consistent
> among nodes. When a partition occurs, nodes have divergent events and
> inconsistent state. The partition itself is simple to understand, because
> partitioned nodes are indistinguishable from failed nodes and are treated
> as such. But if partitioned nodes merge, the inconsistent state has to be
> made consistent. This must be done in the same way a new node is added to
> an existing node: by doing a "state transfer" from the existing node to
> the new node to make the state consistent between them.
>
> If the "new" node previously had state because of a partition/merge, it
> must drop that old state and replace it with the state being transferred
> to it. After this, they will be consistent and can continue. With a simple
> process, you might just kill it, restart it and add the transferred state.
> But the dlm isn't a process that can simply be restarted: its state is
> spread through the applications using it, and through the kernel. The only
> mechanism for resetting the dlm state is resetting the kernel, which means
> resetting/rebooting the machine.
>
>> If so, the $uuid CPG, e.g. from the perspective of A, may have only one
>> member - A itself - and A can perform lockspace operations now because
>> the cluster is quorate (and we skipped fencing). B and C do likewise.
>> Then it looks like every node owns this volume, so corruption may happen?
>
> When the nodes are partitioned, the situation is fairly straightforward --
> each node thinks the others have failed, and normal operation is blocked
> until recovery happens for the failed nodes.
>
> The harder problem is what to do when they merge. The dlm effectively
> ignores the invalid addition of the merged nodes and calls it a "stateful
> merge". The merged nodes continue to be considered failed (from the
> partition) and require a full restart before being added.

Thanks a lot for explaining this valuable knowledge to me! I've also shared
it with the pacemaker folks. They'll make the corresponding changes on the
pcmk side once the dlm_controld patch is merged.

I've sent the patch to you. Please take a look ;-)

With best regards,
Eric