From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Teigland
Date: Mon, 16 May 2016 12:02:46 -0500
Subject: [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection
In-Reply-To: <57397A5B.3070302@suse.com>
References: <1463044568-19583-1-git-send-email-zren@suse.com>
 <20160512165114.GB13651@redhat.com>
 <57356A0B.60100@suse.com>
 <20160513154913.GA28849@redhat.com>
 <57397A5B.3070302@suse.com>
Message-ID: <20160516170246.GB20979@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, May 16, 2016 at 03:44:27PM +0800, Eric Ren wrote:
> Thanks! Hum, according to the long comments, you've handled the 2/2
> even split by way of the low nodeid killing statefull merged
> numbers.

Interesting, I'd forgotten about that bit of code, so I was wrong to say
that we do nothing after a 2/2 partition merge.

> we can do that because the "clean nodes" == "stateful nodes", right? I
> guess you're saying the case that there're 3 or more partitions that
> merge, and none could see enough clean nodes?

That sounds about right, this is a fairly narrow special case that mainly
helps in the case of two evenly split partitions.  There could be some
cases with more nodes and 3+ partitions where it might help.

> Now DLM outputs log message like:
> "fence work wait to clear merge $merge_count clean $clean_count part
> $part_count gone $gone_count". I'm wondering if we can provide these
> info and how long DLM has been stuck by "dlm_tool $some_cmd"?

Yes, that's a good idea.  dlm_tool should clearly report if the dlm is
blocked because of a stateful partition merge.  Perhaps a new global
variable printed by 'dlm_tool status -v'?  I'm not exactly sure of the
condition when we'd set this new variable, could you try it out?
For a start, maybe something similar to this:

static int stateful_merge_wait;

...
        if (!cluster_two_node && merge_count) {
                log_retry(retry_fencing, "fence work wait to clear merge %d clean %d part %d gone %d",
                          merge_count, clean_count, part_count, gone_count);

                if ((clean_count >= merge_count) && !part_count && (low == our_nodeid))
                        kick_stateful_merge_members();
+               if ((clean_count < merge_count) && !part_count)
+                       stateful_merge_wait=1;

                retry = 1;
                goto out;
        }

(and added to print_state_daemon)

Dave
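
For trying out the condition outside dlm_controld, here is a rough
standalone sketch of the same branch.  It is not the daemon's code:
struct merge_state, enum merge_action, check_stateful_merge() and
format_merge_status() are hypothetical names, the counter meanings are
only inferred from the log message, and clearing the flag on every pass
is an assumption (the snippet above only ever sets it).  The
"stateful_merge_wait=%d" output format is likewise a guess, not what
print_state_daemon actually emits.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the counters named in the log message. */
struct merge_state {
	int merge_count;
	int clean_count;
	int part_count;
	int gone_count;
};

/* Hypothetical result codes, one per outcome of the branch. */
enum merge_action {
	MERGE_NONE,	/* branch not taken: two-node mode or nothing merged */
	MERGE_RETRY,	/* branch taken, nothing to do but retry later */
	MERGE_KICK,	/* we are the low nodeid and should kick merged members */
	MERGE_WAIT	/* too few clean nodes to resolve the merge */
};

static int stateful_merge_wait;

static enum merge_action check_stateful_merge(const struct merge_state *s,
					      int cluster_two_node,
					      int low, int our_nodeid)
{
	assert(s != NULL);

	/* Assumption: recompute the flag on every pass so it cannot go stale. */
	stateful_merge_wait = 0;

	if (cluster_two_node || !s->merge_count)
		return MERGE_NONE;

	/* Nodes are still partitioned: neither condition applies yet. */
	if (s->part_count)
		return MERGE_RETRY;

	/* Enough clean nodes: only the low nodeid does the kicking. */
	if (s->clean_count >= s->merge_count)
		return (low == our_nodeid) ? MERGE_KICK : MERGE_RETRY;

	/* clean_count < merge_count with no partitioned nodes: stuck. */
	stateful_merge_wait = 1;
	return MERGE_WAIT;
}

/* Hypothetical status line; returns what snprintf() returns. */
static int format_merge_status(char *buf, size_t len)
{
	return snprintf(buf, len, "stateful_merge_wait=%d\n",
			stateful_merge_wait);
}
```

Calling check_stateful_merge() with merge_count 2 and clean_count 1
should report MERGE_WAIT and leave stateful_merge_wait set, which is the
case dlm_tool would want to show.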