From: David Teigland <teigland@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] fencing conditions: what should trigger a fencing operation?
Date: Thu, 19 Nov 2009 11:28:09 -0600
Message-ID: <20091119172809.GC23287@redhat.com>
In-Reply-To: <1258647358.6052.935.camel@localhost.localdomain>

On Thu, Nov 19, 2009 at 04:15:58PM +0000, Steven Whitehouse wrote:
> Hi,
>
> On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> >
> > > - what are the current fencing policies?
> >
> > node failure
> >
> I think what Fabio is asking is what event is considered to be a node
> failure? It sounds from your description that it means a failure of
> corosync communications.

corosync's main job is to define node up/down states and notify everyone
when they change, i.e. "cluster membership".

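As a rough illustration of "node failure" derived from membership (this is
not the corosync API; the function names and the clean_leavers argument are
made up for the example), a node that drops out of the membership without
having left cleanly is the one that becomes a fencing candidate:

```python
def membership_change(previous, current):
    """Return (joined, left) node sets between two membership views."""
    previous, current = set(previous), set(current)
    return current - previous, previous - current


def nodes_to_fence(previous, current, clean_leavers=()):
    """Nodes that vanished from membership without a clean leave.

    Assumption for this sketch: a clean shutdown is reported separately
    (clean_leavers) and does not count as a node failure.
    """
    _, left = membership_change(previous, current)
    return sorted(left - set(clean_leavers))
```

For example, going from members {1, 2, 3} to {1, 3} with no clean leave
reported would make node 2 the only fencing candidate.
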
> Are there other things which can feed into this though? For example dlm
> seems to have some kind of timeout mechanism which sends a message to
> userspace, and I wonder whether that contributes to the decision too?

Lock timeouts? Lock timeouts are just a normal lock manager feature,
although we don't use them. (The dlm also has a variation on lock
timeouts where it doesn't cancel the timed-out lock, but instead sends a
notice to the deadlock detection code that there may be a deadlock, so a
new deadlock detection cycle is started.)

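To make the two variants concrete, here is a minimal sketch (not dlm code;
LockRequest, check_timeouts and the notify callback are hypothetical names).
With cancel_on_timeout set, a timed-out request is cancelled; without it, the
request keeps waiting and only a notice is sent, which a deadlock detector
could use as its cue to start a new cycle:

```python
from dataclasses import dataclass


@dataclass
class LockRequest:
    lock_id: int
    requested_at: float  # time the request started waiting, in seconds
    timeout: float       # how long it may wait before it counts as timed out
    cancelled: bool = False


def check_timeouts(requests, now, cancel_on_timeout, notify):
    """Scan waiting requests and return the ids that have timed out.

    cancel_on_timeout=True:  plain lock timeout, the request is cancelled.
    cancel_on_timeout=False: the request is left waiting and notify(lock_id)
                             is called so deadlock detection can run.
    """
    timed_out = []
    for req in requests:
        # Skip requests already cancelled or still within their timeout.
        if req.cancelled or now - req.requested_at < req.timeout:
            continue
        timed_out.append(req.lock_id)
        if cancel_on_timeout:
            req.cancelled = True
        else:
            notify(req.lock_id)
    return timed_out
```

Neither variant says anything about node failure, so neither feeds into the
fencing decision.
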
> It certainly isn't desirable for all types of filesystem failure to
> result in fencing & automatic recovery. I think we've got that wrong in
> the past. I posted a patch a few days back to try and address some of
> that. In the case we find an invalid block in a journal during recovery
> we certainly don't want to try and recover the journal on another node,
> nor even kill the recovering node since it will only result in another
> node trying to recover the same journal and hitting the same error.
> Eventually it will bring down the whole cluster.
>
> The aim of the patch was to return a suitable status indicating why
> journal recovery failed so that it can then be handled appropriately,
Historically, gfs will panic if it finds an error that will keep it from
making progress or handling further fs access. This, of course, was in
the interest of HA since you don't want one bad fs on one node to prevent
all the *other* nodes from working too.
Dave