cluster-devel.redhat.com archive mirror
* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
@ 2009-11-19 11:35 Fabio M. Di Nitto
  2009-11-19 17:04 ` David Teigland
  0 siblings, 1 reply; 9+ messages in thread
From: Fabio M. Di Nitto @ 2009-11-19 11:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi guys,

I have just hit what I think is a bug, and I think we need to review our
fencing policies.

This is what I saw:

- 6-node cluster (node1-3 x86, node4-6 x86_64)
- node1 and node4 perform a simple mount gfs2 -> wait -> umount -> wait
-> mount -> and loop forever (sketched below)
- node2 and node5 perform read/write operations on the same gfs2
partition (nothing fancy, really)
- node3 is in charge of creating and removing clustered LVM volumes.
- node6 is in charge of constantly relocating rgmanager services.

The cluster is running qdisk too.
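
For reference, the mount/umount loop on node1 and node4 is nothing
fancier than something like this (device and mount point here are made
up):

  while true; do
      mount -t gfs2 /dev/cluster_vg/test_lv /mnt/test
      sleep 30
      umount /mnt/test
      sleep 30
  done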

It is a known issue that node1 will crash at some point (kernel OOPS).

Here are the interesting bits:

node1 is hanging in mount/umount (expected)
node2, node4 and node5 continue to operate as normal.
node3 is now hanging while creating a VG.
node6 is trying to stop the service from node1 (it happened to be
located there at the time of the crash).

I was expecting that, after a failure, node1 would be fenced, but
nothing happens automatically.

Manually fencing the node will recover all hanging operations.

Talking to Steven W., it appears that our methods for defining and
detecting a failure should be improved.

My questions, simply driven by the fact that I am not a fence expert, are:

- what are the current fencing policies?
- what can we do to improve them?
- should we monitor for more failures than we do now?

Cheers
Fabio




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 17:04 ` David Teigland
@ 2009-11-19 16:15   ` Steven Whitehouse
  2009-11-19 17:28     ` David Teigland
  2009-11-19 17:16   ` David Teigland
  2009-11-19 18:10   ` Fabio M. Di Nitto
  2 siblings, 1 reply; 9+ messages in thread
From: Steven Whitehouse @ 2009-11-19 16:15 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> 
> > - what are the current fencing policies?
> 
> node failure
> 
I think what Fabio is asking is what event is considered to be a node
failure? It sounds from your description that it means a failure of
corosync communications. Are there other things which can feed into this
though? For example dlm seems to have some kind of timeout mechanism
which sends a message to userspace, and I wonder whether that
contributes to the decision too?

It certainly isn't desirable for all types of filesystem failure to
result in fencing & automatic recovery. I think we've got that wrong in
the past. I posted a patch a few days back to try and address some of
that. In the case we find an invalid block in a journal during recovery
we certainly don't want to try and recover the journal on another node,
nor even kill the recovering node since it will only result in another
node trying to recover the same journal and hitting the same error.
Eventually it will bring down the whole cluster.

The aim of the patch was to return a suitable status indicating why
journal recovery failed, so that it can then be handled appropriately.

Steve.





* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 11:35 [Cluster-devel] fencing conditions: what should trigger a fencing operation? Fabio M. Di Nitto
@ 2009-11-19 17:04 ` David Teigland
  2009-11-19 16:15   ` Steven Whitehouse
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: David Teigland @ 2009-11-19 17:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:

> - what are the current fencing policies?

node failure

> - what can we do to improve them?

node failure is a simple, black and white, fact

> - should we monitor for more failures than we do now?

corosync *exists* to detect node failure

> It is a known issue that node1 will crash at some point (kernel OOPS).

oops is not necessarily node failure; if you *want* it to be, then you
sysctl -w kernel.panic_on_oops=1

(gfs has also had its own mount options over the years to force this
behavior, even if the sysctl isn't set properly; it's a common issue.
It seems panic_on_oops has had inconsistent default values over various
releases, sometimes 0, sometimes 1; setting it has historically been part
of cluster/gfs documentation since most customers want it to be 1.)
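
For completeness, making that persistent is typically just a sysctl.conf
entry (the file location below is an assumption; distributions vary):

  # /etc/sysctl.conf
  kernel.panic_on_oops = 1

  # apply without rebooting
  sysctl -p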

Dave




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 17:04 ` David Teigland
  2009-11-19 16:15   ` Steven Whitehouse
@ 2009-11-19 17:16   ` David Teigland
  2009-11-19 18:10   ` Fabio M. Di Nitto
  2 siblings, 0 replies; 9+ messages in thread
From: David Teigland @ 2009-11-19 17:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Nov 19, 2009 at 11:04:04AM -0600, David Teigland wrote:
> (gfs has also had its own mount options over the years to force this
> behavior, even if the sysctl isn't set properly; it's a common issue.

gfs1 does still have "-o oopses_ok"; I think gfs2 recently changed this
due to a customer who couldn't get it to work right.

http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=blob;f=gfs/man/gfs_mount.8;h=faf5d8345801070b7ce3183a62d81c21db6b6023;hb=RHEL4#l137
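
(Just as an illustration, with a made-up device and mount point, that
option would be passed like:

  mount -t gfs -o oopses_ok /dev/cluster_vg/gfs_lv /mnt/gfs

It only forces the panic-on-oops behaviour for gfs errors even when the
sysctl is 0.)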




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 16:15   ` Steven Whitehouse
@ 2009-11-19 17:28     ` David Teigland
  0 siblings, 0 replies; 9+ messages in thread
From: David Teigland @ 2009-11-19 17:28 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Nov 19, 2009 at 04:15:58PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> > 
> > > - what are the current fencing policies?
> > 
> > node failure
> > 
> I think what Fabio is asking is what event is considered to be a node
> failure? It sounds from your description that it means a failure of
> corosync communications. 

corosync's main job is to define node up/down states and notify everyone
when they change, i.e. "cluster membership".

> Are there other things which can feed into this though? For example dlm
> seems to have some kind of timeout mechanism which sends a message to
> userspace, and I wonder whether that contributes to the decision too?

lock timeouts?  Lock timeouts are just a normal lock manager feature,
although we don't use them.  (The dlm also has a variation on lock
timeouts where it doesn't cancel the timed-out lock, but instead sends a
notice to the deadlock detection code that there may be a deadlock, so a
new deadlock detection cycle is started.)

> It certainly isn't desirable for all types of filesystem failure to
> result in fencing & automatic recovery. I think we've got that wrong in
> the past. I posted a patch a few days back to try and address some of
> that. In the case we find an invalid block in a journal during recovery
> we certainly don't want to try and recover the journal on another node,
> nor even kill the recovering node since it will only result in another
> node trying to recover the same journal and hitting the same error.
> Eventually it will bring down the whole cluster.
> 
> The aim of the patch was to return a suitable status indicating why
> journal recovery failed, so that it can then be handled appropriately.

Historically, gfs would panic if it found an error that would keep it from
making progress or handling further fs access.  This, of course, was in
the interest of HA since you don't want one bad fs on one node to prevent
all the *other* nodes from working too.

Dave




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 17:04 ` David Teigland
  2009-11-19 16:15   ` Steven Whitehouse
  2009-11-19 17:16   ` David Teigland
@ 2009-11-19 18:10   ` Fabio M. Di Nitto
  2009-11-19 19:49     ` David Teigland
  2 siblings, 1 reply; 9+ messages in thread
From: Fabio M. Di Nitto @ 2009-11-19 18:10 UTC (permalink / raw)
  To: cluster-devel.redhat.com

David Teigland wrote:
> On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> 
>> - what are the current fencing policies?
> 
> node failure
> 
>> - what can we do to improve them?
> 
> node failure is a simple, black and white, fact
> 
>> - should we monitor for more failures than we do now?
> 
> corosync *exists* to detect node failure
> 
>> It is a known issue that node1 will crash at some point (kernel OOPS).
> 
> oops is not necessarily node failure; if you *want* it to be, then you
> sysctl -w kernel.panic_on_oops=1
> 
> (gfs has also had its own mount options over the years to force this
> behavior, even if the sysctl isn't set properly; it's a common issue.
> It seems panic_on_oops has had inconsistent default values over various
> releases, sometimes 0, sometimes 1; setting it has historically been part
> of cluster/gfs documentation since most customers want it to be 1.)

So a cluster can hang because our code failed, but we don't detect that
it did fail.... so what determines a node failure? only when corosync dies?

panic_on_oops is not cluster-specific and not all OOPSes are panics == not
a clean solution.

Fabio





* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 18:10   ` Fabio M. Di Nitto
@ 2009-11-19 19:49     ` David Teigland
  2009-11-20  7:26       ` Fabio M. Di Nitto
  0 siblings, 1 reply; 9+ messages in thread
From: David Teigland @ 2009-11-19 19:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
> David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> > 
> >> - what are the current fencing policies?
> > 
> > node failure
> > 
> >> - what can we do to improve them?
> > 
> > node failure is a simple, black and white, fact
> > 
> >> - should we monitor for more failures than we do now?
> > 
> corosync *exists* to detect node failure
> > 
> >> It is a known issue that node1 will crash at some point (kernel OOPS).
> > 
> > oops is not necessarily node failure; if you *want* it to be, then you
> > sysctl -w kernel.panic_on_oops=1
> > 
> (gfs has also had its own mount options over the years to force this
> > behavior, even if the sysctl isn't set properly; it's a common issue.
> > It seems panic_on_oops has had inconsistent default values over various
> > releases, sometimes 0, sometimes 1; setting it has historically been part
> > of cluster/gfs documentation since most customers want it to be 1.)
> 
> So a cluster can hang because our code failed, but we don't detect that
> it did fail.... so what determines a node failure? only when corosync dies?

The error is detected in gfs.  For every error in every bit of code, the
developer needs to consider what the appropriate error handling should be:
What are the consequences (with respect to availability and data
integrity), both locally and remotely, of the error handling they choose?
It's case by case.

If the error could lead to data corruption, then the proper error handling
is usually to fail fast and hard.

If the error can result in remote nodes being blocked, then the proper
error handling is usually self-sacrifice to avoid blocking other nodes.

Self-sacrifice means forcibly removing the local node from the cluster so
that others can recover for it and move on.  There are different ways of
doing self-sacrifice:

- panic the local machine (kernel code usually uses this method)
- killing corosync on the local machine (daemons usually do this)
- calling reboot (I think rgmanager has used this method)
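
As a rough illustration of the last two (these commands are only
examples, not literally what the daemons run):

  # daemon-style self-sacrifice: drop the local node out of membership
  killall -9 corosync

  # or force an immediate reboot, roughly what the last option amounts to
  reboot -fn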

> panic_on_oops is not cluster-specific and not all OOPSes are panics == not
> a clean solution.

So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
result in a panic?  There's probably a combination of options that would
produce this effect.  Most people interested in HA will want all oopses to
result in a panic and recovery since an oops puts a node in a precarious
position regardless of where it came from.

Dave




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-19 19:49     ` David Teigland
@ 2009-11-20  7:26       ` Fabio M. Di Nitto
  2009-11-20 17:40         ` David Teigland
  0 siblings, 1 reply; 9+ messages in thread
From: Fabio M. Di Nitto @ 2009-11-20  7:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
>
> The error is detected in gfs.  For every error in every bit of code, the
> developer needs to consider what the appropriate error handling should be:
> What are the consequences (with respect to availability and data
> integrity), both locally and remotely, of the error handling they choose?
> It's case by case.
> 
> If the error could lead to data corruption, then the proper error handling
> is usually to fail fast and hard.

Of course, agreed.

> 
> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.

OK, so this is the case we are seeing here. The cluster is half blocked,
but there is no self-sacrifice action happening.

> 
> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on.  There are different ways of
> doing self-sacrifice:
> 
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)

I don't have an opinion on how it happens really, as long as it works.

> 
>> panic_on_oops is not cluster-specific and not all OOPSes are panics == not
>> a clean solution.
> 
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic? 

Well, partially yes.

We can't take decisions for OOPSes that are not generated within our
code. The user will have to configure that via panic_on_oops or other
means. Maybe our task is to make sure users are aware of this
situation/option (I didn't check whether it is documented).

You have a point in saying that it varies from error to error, and this
is exactly where I'd like to head. Maybe it's time to review our error
paths and make better decisions on what to do, at least within our code.

> There's probably a combination of options that would
> produce this effect.  Most people interested in HA will want all oopses to
> result in a panic and recovery since an oops puts a node in a precarious
> position regardless of where it came from.

I agree, but I don't think we can kill the node on every OOPS by
default. We can agree that it has to be a user-configurable choice, but we
can improve our stuff to do the right thing (or do better at what it does now).

Fabio




* [Cluster-devel] fencing conditions: what should trigger a fencing operation?
  2009-11-20  7:26       ` Fabio M. Di Nitto
@ 2009-11-20 17:40         ` David Teigland
  0 siblings, 0 replies; 9+ messages in thread
From: David Teigland @ 2009-11-20 17:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Nov 20, 2009 at 08:26:57AM +0100, Fabio M. Di Nitto wrote:
> We can't take decisions for OOPSes that are not generated within our
> code. The user will have to configure that via panic_on_oops or other
> means. Maybe our task is to make sure users are aware of this
> situation/option (I didn't check whether it is documented).

Yeah, in the past we've told (and documented) people to set panic_on_oops=1
if it's not already set that way (see the gfs_mount man page I gave a link
to for one example).

As I said, in some releases, like RHEL4 and RHEL5, panic_on_oops is 1 by
default, so everyone tends to forget about it.  But I think upstream
kernels currently default to 0, so this will bite people using upstream
kernels who don't happen to read our documentation about setting the
sysctl.
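
(For anyone checking their own kernel, the current value shows up with:

  sysctl kernel.panic_on_oops
  # or equivalently
  cat /proc/sys/kernel/panic_on_oops

1 means a panic, and therefore fencing/recovery, will follow any oops.)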

Dave



