From mboxrd@z Thu Jan 1 00:00:00 1970
From: Steven Whitehouse
Date: Thu, 19 Nov 2009 16:15:58 +0000
Subject: [Cluster-devel] fencing conditions: what should trigger a fencing operation?
In-Reply-To: <20091119170404.GA23287@redhat.com>
References: <4B052D69.3010502@redhat.com> <20091119170404.GA23287@redhat.com>
Message-ID: <1258647358.6052.935.camel@localhost.localdomain>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hi,

On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
>
> > - what are the current fencing policies?
>
> node failure
>
I think what Fabio is asking is: what event is considered to be a node
failure? It sounds from your description that it means a failure of
corosync communications.

Are there other things which can feed into this though? For example,
dlm seems to have some kind of timeout mechanism which sends a message
to userspace, and I wonder whether that contributes to the decision too?

It certainly isn't desirable for all types of filesystem failure to
result in fencing & automatic recovery. I think we've got that wrong in
the past. I posted a patch a few days back to try to address some of
that. If we find an invalid block in a journal during recovery, we
certainly don't want to try to recover the journal on another node, nor
even kill the recovering node, since that will only result in another
node trying to recover the same journal and hitting the same error.
Eventually it will bring down the whole cluster.

The aim of the patch was to return a suitable status indicating why
journal recovery failed, so that it can then be handled appropriately.

Steve.
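
As a rough illustration of the point about returning a status from journal recovery, the sketch below maps a recovery outcome to a follow-up action. It is only a sketch under stated assumptions: `RecoveryStatus`, `Action` and `next_action` are hypothetical names invented for this example, not identifiers from the patch, GFS2, dlm or corosync, and the choice to retry elsewhere after a lost node is an assumed policy rather than documented behaviour.

```python
from enum import Enum, auto


class RecoveryStatus(Enum):
    """Hypothetical outcomes of one journal recovery attempt."""
    SUCCESS = auto()
    NODE_LOST = auto()        # assumed: the recovering node left the cluster
    INVALID_JOURNAL = auto()  # an invalid block was found in the journal


class Action(Enum):
    """Hypothetical follow-up actions for the cluster."""
    NONE = auto()
    RETRY_ON_OTHER_NODE = auto()
    REPORT_TO_ADMIN = auto()


def next_action(status: RecoveryStatus) -> Action:
    """Pick a follow-up action from the reason recovery stopped."""
    if status is RecoveryStatus.SUCCESS:
        return Action.NONE
    if status is RecoveryStatus.NODE_LOST:
        # Assumed policy: the fault was in the node, not the journal,
        # so another node may be able to complete the recovery.
        return Action.RETRY_ON_OTHER_NODE
    # The journal itself is bad. Any other node would hit the same
    # error, so neither retrying elsewhere nor fencing the recovering
    # node helps; surface the problem instead.
    return Action.REPORT_TO_ADMIN
```

The only point of the sketch is that the caller decides based on why recovery failed, so that a damaged journal does not cascade from node to node.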