From: David Teigland
Date: Wed, 8 Oct 2008 09:00:41 -0500
Subject: [Cluster-devel] unifying fencing/stonith
Message-ID: <20081008140041.GA25435@redhat.com>
To: cluster-devel.redhat.com

As discussed at the Prague cluster summit, here is a description of what a
unified fencing/stonith system might look like. I've used fenced as a
starting point, and added/changed things according to what I learned about
pacemaker/stonith.

. what causes fencing

The fenced daemon will join a cpg. When a cpg member fails (confchg from
libcpg), that node will be fenced. After a node leaves the cpg, it will not
be fenced if it fails. The join should happen prior to any shared data
being used (e.g. mounting a cluster fs), and the leave should be prevented
until all shared data is done being used (e.g. all cluster fs's are
unmounted).

If another program (pacemaker, a control daemon, etc.) decides that a
cluster member is unfit to be in the cluster, that program can tell
corosync to kill the node (through a new lib api). corosync will forcibly
remove the member; it will appear as a failed node in the corresponding
confchg callbacks. If the node was a member of the fenced cpg, it will be
fenced as a result.

A program that wants to know when a node has been fenced can query fenced
(or register a callback) via libfenced.

. fenced communication

When fenced needs to fence a node, the cpg members will communicate among
themselves (via cpg messages) to decide which of them should carry out the
fencing. When a fencing operation completes, the result is communicated to
the cpg members. If the fencing operation failed (or the node doing the
fencing fails), a different node will try.

. calling agents

To carry out fencing, fenced needs to read the fencing configuration for
the victim node. fenced will have a fence_node_config structure that
defines all the information for fencing a node. One of two different
plugins will be called to fill in this structure (until we have a common
configuration system). Plugin A will fill in the structure using info from
cluster.conf (libccs). Plugin B will fill it in using the pacemaker config
source. (This config structure will probably be filled in prior to the
negotiation above so that the specific config can factor into the decision
about who fences.)

Once the fence_node_config structure is filled in, fenced will call the
first agent specified in the config, passing it the necessary parameters.
If the first agent fails, it can try the second, etc. If all methods fail,
another node can make an attempt. It would be simplest if fenced could
fork/exec all agents.

I'm not familiar with fence device monitoring, but it sounds like
something that's probably best done by the resource manager, outside the
scope of fenced.

Plugins would fill in these structures using specific config files/sources.
Most of the time, only a single method is used. If multiple methods are
defined, each is tried sequentially until one succeeds, e.g. if the first
method is power-reset and the second is SAN-disconnect, fenced will first
try to reset the power, but if that fails, will try to disable the node's
SAN port.

struct fence_node_config {
        int nodeid;
        char node_name[MAX];
        int num_methods;
        struct fence_node_method methods[8];
};

/* A method will usually contain one device. It contains multiple devices
   in the case of dual power supplies or multiple SAN paths that all need
   to be disabled. Each device in a method is a separate invocation of the
   agent. All device invocations for a method must succeed in order for
   the method to be considered successful. */

struct fence_node_method {
        int num_devices;
        struct fence_node_device devices[8];
};

/* Agent parameter strings are opaque to fenced, which doesn't need to
   grok them. fenced just copies them directly from the config plugin to
   the agent. */

struct fence_node_device {
        char agent[MAX];          /* name of script/program */
        char general_params[MAX]; /* ip address, password, etc */
        char node_params[MAX];    /* port number, etc */
        char other_params[MAX];   /* on, off, querying? */
};
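To make the method/device semantics concrete, here's a minimal sketch (not
actual fenced code) of what the agent-calling loop could look like. The
run_agent() helper here is a placeholder; real fence agents conventionally
take their options as key=value lines on stdin, which is glossed over by
passing the opaque parameter strings as argv.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork/exec one agent invocation, return 0 if the
   agent exited successfully. */
static int run_agent(struct fence_node_device *dev)
{
        pid_t pid;
        int status;

        pid = fork();
        if (pid < 0)
                return -1;
        if (!pid) {
                execlp(dev->agent, dev->agent, dev->general_params,
                       dev->node_params, dev->other_params, (char *)NULL);
                _exit(127); /* exec failed */
        }
        waitpid(pid, &status, 0);
        return (WIFEXITED(status) && !WEXITSTATUS(status)) ? 0 : -1;
}

/* Try each method in order. A method succeeds only if every device
   invocation in it succeeds (e.g. both power supplies switched off);
   if any device fails, fall through to the next method. */
static int fence_victim(struct fence_node_config *cfg)
{
        int m, d, error;

        for (m = 0; m < cfg->num_methods; m++) {
                struct fence_node_method *method = &cfg->methods[m];

                error = 0;
                for (d = 0; d < method->num_devices; d++) {
                        error = run_agent(&method->devices[d]);
                        if (error)
                                break;
                }
                if (!error)
                        return 0; /* victim fenced */
        }
        return -1; /* all methods failed; another node can try */
}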
. quorum

fenced will use the new quorum plugin being developed for corosync. When
the cluster loses quorum, fenced will not execute fence agents, but it
otherwise continues to operate normally, i.e. it queues fencing operations
for failed cpg members. When quorum is regained, fenced will act upon any
queued fencing operations for failed cpg members.

If a failed node rejoins the fenced cpg before the other cpg members have
carried out the fencing operation against it, the operation is skipped.
This is common when quorum is lost, e.g.

. fenced cpg members A,B,C
. B,C fail; A loses quorum
. A queues fencing operations against B,C
. A does not execute fencing operations without quorum
. B is rebooted somehow and rejoins the cpg
. A removes B from the list of queued victims (instead of fencing it)
. A,B now have quorum and can execute fencing of C
  (unless C also rejoins the cpg beforehand)

[This assumes that B rejoining the cpg implies that B has been rebooted
and no longer needs to be fenced. That isn't necessarily true without
some extra enforcement.]

. startup fencing

This has always been a thorny problem, and fenced has never had a great
solution. The approach fenced takes is described here; I'm hoping
pacemaker might give us some better options for dealing with this, or
maybe we can collectively come up with a better solution.

An example is the simplest way to define the problem:

. A,B,C are cluster members and fenced cpg members
. A,B,C all have cluster fs foo mounted and are writing to it
. A,B experience a power reset and C freezes/hangs, all simultaneously
. A,B reboot, and form a new cluster (C remains hung and unresponsive)
. A,B both join the fenced cpg

A,B know nothing about node C. Is it safe for A,B to mount and use cluster
fs foo? No, not until C is fenced.

fenced's solution has been: when nodes first join the fenced cpg, and have
quorum, they fence any nodes listed in the static cluster configuration
that are not members of the cluster (or the fenced cpg). In this example,
A,B will fence C at the last step listed above.
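A rough sketch of that startup rule follows. get_static_node_list(),
is_member() and queue_victim() are hypothetical placeholders (the real
lookups would come from the config plugin and the cpg/quorum state), not
an existing fenced or libfenced api:

#include <stdint.h>

#define MAX_NODES 64

/* hypothetical: static node list from cluster.conf or the pacemaker
   config source, via the config plugin */
extern int get_static_node_list(uint32_t *nodeids, int max);

/* hypothetical: is the node a current cluster / fenced cpg member? */
extern int is_member(uint32_t nodeid);

/* hypothetical: add to the queue of victims fenced while quorate */
extern void queue_victim(uint32_t nodeid);

/* Called when nodes first join the fenced cpg and have quorum: any
   statically configured node that is not a current member (node C in
   the example above) is queued for fencing. */
void startup_fencing(void)
{
        uint32_t nodeids[MAX_NODES];
        int i, count;

        count = get_static_node_list(nodeids, MAX_NODES);

        for (i = 0; i < count; i++) {
                if (!is_member(nodeids[i]))
                        queue_victim(nodeids[i]);
        }
}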