From mboxrd@z Thu Jan 1 00:00:00 1970
From: Fabio M. Di Nitto
Date: Fri, 20 Nov 2009 08:26:57 +0100
Subject: [Cluster-devel] fencing conditions: what should trigger a fencing operation?
In-Reply-To: <20091119194952.GE23287@redhat.com>
References: <4B052D69.3010502@redhat.com> <20091119170404.GA23287@redhat.com>
 <4B058A2E.9070600@redhat.com> <20091119194952.GE23287@redhat.com>
Message-ID: <4B0644C1.2020804@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
>
> The error is detected in gfs. For every error in every bit of code, the
> developer needs to consider what the appropriate error handling should be:
> What are the consequences (with respect to availability and data
> integrity), both locally and remotely, of the error handling they choose?
> It's case by case.
>
> If the error could lead to data corruption, then the proper error handling
> is usually to fail fast and hard.

Of course, agreed.

> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.

OK, so this is the case we are seeing here. The cluster is half blocked, but
there is no self-sacrifice action happening.

> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on. There are different ways of
> doing self-sacrifice:
>
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)

I don't have an opinion on how it happens really, as long as it works.

>> panic_on_oops is not cluster specific and not all OOPS are panic == not
>> a clean solution.
>
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic?

Well, partially yes. We can't take decisions for OOPSes that are not
generated within our code. The user will have to configure that via
panic_on_oops or other means. Maybe our task is to make sure users are
aware of this situation/option (I didn't check whether it is documented).

You have a point in saying that it varies from error to error, and this is
exactly where I'd like to head. Maybe it's time to review our error paths
and make better decisions about what to do, at least within our code.

> There's probably a combination of options that would produce this effect.
> Most people interested in HA will want all oopses to result in a panic and
> recovery since an oops puts a node in a precarious position regardless of
> where it came from.

I agree, but I don't think we can kill the node on every OOPS by default.
We can agree that it has to be a user-configurable choice, but we can
improve our stuff to do the right thing (or do better at what it does now).

Fabio
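
As an aside, the three self-sacrifice methods David lists can be sketched in
a few lines of C. This is illustrative only, not code from gfs, cman, or
fenced; the corosync pid is assumed to be already known to the caller, and
option 1 (panic the machine) is only noted in a comment because it lives in
kernel code rather than in a daemon.

    /*
     * Illustrative sketch only (not gfs/cman/fenced code): one way a
     * userspace cluster daemon could "self-sacrifice" after hitting an
     * error that would otherwise leave other nodes blocked.
     * Option 1 from the list (panic the machine) is the kernel-side
     * equivalent and would simply be a panic() call in filesystem code.
     */
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/reboot.h>

    /* corosync_pid is a placeholder; a real daemon would get it from the
     * cluster infrastructure itself. */
    static void self_sacrifice(pid_t corosync_pid)
    {
            /* Option 2: kill corosync so the node drops out of membership
             * and the surviving nodes fence it and recover on its behalf. */
            if (corosync_pid > 0 && kill(corosync_pid, SIGKILL) == 0)
                    return;

            /* Option 3: force a reboot (needs CAP_SYS_BOOT), comparable to
             * the reboot() approach mentioned for rgmanager. */
            sync();
            reboot(RB_AUTOBOOT);
    }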
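
On the panic_on_oops point: this is the standard Linux sysctl, nothing
cluster-specific, so documenting it for users would boil down to something
like the following (a typical sysctl configuration, shown only as an
example):

    # /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
    kernel.panic_on_oops = 1
    # Optionally reboot N seconds after a panic instead of hanging
    kernel.panic = 30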