From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Ren
Date: Mon, 16 May 2016 15:44:27 +0800
Subject: [Cluster-devel] [DLM PATCH] dlm_controld: handle the case of network transient disconnection
In-Reply-To: <20160513154913.GA28849@redhat.com>
References: <1463044568-19583-1-git-send-email-zren@suse.com> <20160512165114.GB13651@redhat.com> <57356A0B.60100@suse.com> <20160513154913.GA28849@redhat.com>
Message-ID: <57397A5B.3070302@suse.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi David,

On 05/13/2016 11:49 PM, David Teigland wrote:
> If both sides of the merged partition are kicking the other out of the
> cluster at the same time, it's hard to predict which nodes will remain
> (and it could be none). To resolve an even partition merge, you need to
> remove/restart the nodes on one of the former halves, i.e. either A,B or
> C,D. I never thought of a way to do that automatically in this code
> (maybe higher level code would have more options to resolve it.)

Thanks!

Hmm, according to the long comment, you've handled the 2/2 even split by
having the low nodeid kill the stateful merged members. We can do that
because the "clean nodes" == "stateful nodes", right? I guess you mean
the case where 3 or more partitions merge, and none of them can see
enough clean nodes?

Yes, agreed. But the pacemaker guys may complain that there's not enough
info for them to judge what state DLM is in. Right now DLM outputs a log
message like:

"fence work wait to clear merge $merge_count clean $clean_count part $part_count gone $gone_count"

I'm wondering if we can provide this info, along with how long DLM has
been stuck, via "dlm_tool $some_cmd"?

Also, I'm working on an option (like enable_force_kick) as you
suggested;-)

>
> Remove the bad fix and it should work better.
>

Yes, will try to persuade pacemaker to drop that patch;-)

>
> Two node clusters are a special case of an even partition merge; I'm sure
> you've seen the lengthy comment about that. In a 2|2 partition merge, we
> don't kick any nodes from the cluster, as explained above, and it
> generally requires manual resolution.
>
> But a 1|1 partition merge can sometimes be resolved automatically. Quorum
> can be disabled in a two node cluster, and the fencing system allowed to
> race between two partitioned nodes to select a survivor (there are caveats
> with that.) The area of code you've been looking at (with the long
> comment) uses the result of the fencing race to resolve the possible
> partition merge by kicking out the node that was fenced.

Yes, I can understand this;-)

Thanks a lot!
Eric

>
> Dave
>
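
For anyone who wants to watch these counters before any dlm_tool command exists, here is a rough sketch of pulling them out of the log line quoted above. It assumes the message has exactly the quoted shape and that all four counters are non-negative integers. parse_fence_wait is a hypothetical helper for illustration, not part of dlm_controld or dlm_tool, and the sample values are made up.

```python
import re

# Assumed shape of the dlm_controld message quoted in the mail above.
# search() is used so that any timestamp or prefix before it is ignored.
FENCE_WAIT_RE = re.compile(
    r"fence work wait to clear "
    r"merge (?P<merge>\d+) clean (?P<clean>\d+) "
    r"part (?P<part>\d+) gone (?P<gone>\d+)"
)


def parse_fence_wait(line):
    """Return the four counters as a dict of ints, or None if no match."""
    m = FENCE_WAIT_RE.search(line)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


# Made-up sample line, only to show the call.
sample = "fence work wait to clear merge 2 clean 0 part 0 gone 0"
counts = parse_fence_wait(sample)
```

A script like this only reports what the daemon already logs; it does not tell you how long DLM has been stuck, which is the part that would still need support from the daemon itself.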