[Cluster-devel] unfencing


* [Cluster-devel] unfencing
@ 2009-02-20 21:44 David Teigland
  2009-02-23  6:27 ` Fabio M. Di Nitto
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: David Teigland @ 2009-02-20 21:44 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Fencing devices that do not reboot a node, but just cut off storage have
always required the impractical step of re-enabling storage access after the
node has been reset.  We've never provided a mechanism to automate this
unfencing.

Below is an outline of how we might automate unfencing with some simple
extensions to the existing fencing library, config scheme and agents.  It does
not involve the fencing daemon (fenced).  Nodes would unfence themselves when
they start up.  We might also consider a scheme where a node is unfenced by
*other* nodes when it starts up, if that has any advantage over
self-unfencing.

cluster3 is the context, but a similar thing would apply to a next generation
unified fencing system, e.g.
https://www.redhat.com/archives/cluster-devel/2008-October/msg00005.html

init.d/cman would run:
	cman_tool join
	fence_node -U <ourname>
	qdiskd
	groupd
	fenced
	dlm_controld
	gfs_controld
	fence_tool join

The new step fence_node -U <name> would call libfence:fence_node_undo(name).
[fence_node <name> currently calls libfence:fence_node(name) to fence a node.]

libfence:fence_node_undo(node_name) logic:
	for each device_name under given node_name,
	if an unfencedevice exists with name=device_name, then
	run the unfencedevice agent with first arg of "undo"
	and other args the normal combination of node and device args
	(any agent used with unfencing must recognize/support "undo")

[logic derived from cluster.conf structure and similar to fence_node logic]

Example 1:

<clusternode name="foo" nodeid="3">
	<fence>
	<method name="1">
		<device name="san" node="foo"/>
	</method>
	</fence>
</clusternode>

<fencedevices>
	<fencedevice name="san" agent="fence_scsi"/>
</fencedevices>

<unfencedevices>
	<unfencedevice name="san" agent="fence_scsi"/>
</unfencedevices>

fence_node_undo("foo") would:
- fork fence_scsi
- pass arg string: undo node="foo" agent="fence_scsi"

[Note: we've talked about fence_scsi getting a device list from
 /etc/cluster/fence_scsi.conf instead of from clvm.  It would require
 more user configuration, but would create fewer problems and should
 be more robust.]

Example 2:

<clusternode name="bar" nodeid="4">
	<fence>
	<method name="1">
		<device name="switch1" port="4"/>
		<device name="switch2" port="6"/>
	</method>
	<method name="2">
		<device name="apc" port="4"/>
	</method>
	</fence>
</clusternode>

<fencedevices>
	<fencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
	<fencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
	<fencedevice name="apc" agent="fence_apc" ipaddr="3.3.3.3"/>
</fencedevices>

<unfencedevices>
	<unfencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
	<unfencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
</unfencedevices>

fence_node_undo("bar") would:
- fork fence_brocade
- pass arg string: undo port="4" agent="fence_brocade" ipaddr="1.1.1.1"
- fork fence_brocade
- pass arg string: undo port="6" agent="fence_brocade" ipaddr="2.2.2.2"
- ignore device "apc" because it's not found under <unfencedevices>
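
The matching logic above can be sketched in a few lines of Python. This is only an illustration of the proposed lookup, not libfence code: the <cluster> wrapper element, the function name fence_node_undo_args, and the idea of returning the arg string instead of forking the agent are all assumptions made so the sketch can run on its own. The <method> elements carry a name= attribute so the XML parses.

```python
import xml.etree.ElementTree as ET

# Example 2, wrapped in a hypothetical <cluster> root so it parses as one document.
CLUSTER_CONF = """
<cluster>
  <clusternode name="bar" nodeid="4">
    <fence>
      <method name="1">
        <device name="switch1" port="4"/>
        <device name="switch2" port="6"/>
      </method>
      <method name="2">
        <device name="apc" port="4"/>
      </method>
    </fence>
  </clusternode>
  <fencedevices>
    <fencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
    <fencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
    <fencedevice name="apc" agent="fence_apc" ipaddr="3.3.3.3"/>
  </fencedevices>
  <unfencedevices>
    <unfencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
    <unfencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
  </unfencedevices>
</cluster>
"""


def fence_node_undo_args(conf_xml, node_name):
    """Return (agent, arg_string) pairs that the proposed undo step would run.

    Hypothetical helper: it only builds the strings, it does not fork agents.
    """
    root = ET.fromstring(conf_xml)
    # Index <unfencedevice> entries by name for the name-matching step.
    unfence = {d.get("name"): d for d in root.iter("unfencedevice")}
    calls = []
    for node in root.iter("clusternode"):
        if node.get("name") != node_name:
            continue
        for dev in node.iter("device"):
            undev = unfence.get(dev.get("name"))
            if undev is None:
                continue  # e.g. "apc": no matching <unfencedevice>, so ignored
            # Node-level device args first, then the unfencedevice args.
            pairs = [(k, v) for k, v in dev.attrib.items() if k != "name"]
            pairs += [(k, v) for k, v in undev.attrib.items() if k != "name"]
            args = "undo " + " ".join('%s="%s"' % kv for kv in pairs)
            calls.append((undev.get("agent"), args))
    return calls


for agent, args in fence_node_undo_args(CLUSTER_CONF, "bar"):
    print(agent, "<-", args)
```

For node "bar" this yields two fence_brocade invocations (ports 4 and 6) and nothing for "apc", which is the behaviour listed above.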


* [Cluster-devel] unfencing
  2009-02-20 21:44 [Cluster-devel] unfencing David Teigland
@ 2009-02-23  6:27 ` Fabio M. Di Nitto
  2009-02-23 18:15   ` David Teigland
  2009-02-23 19:36 ` [Cluster-devel] unfencing Ryan O'Hara
  2009-02-26 21:35 ` David Teigland
  2 siblings, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-23  6:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi David,

On Fri, 2009-02-20 at 15:44 -0600, David Teigland wrote:
> Fencing devices that do not reboot a node, but just cut off storage have
> always required the impractical step of re-enabling storage access after the
> node has been reset.  We've never provided a mechanism to automate this
> unfencing.
> 
> Below is an outline of how we might automate unfencing with some simple
> extensions to the existing fencing library, config scheme and agents.  It does
> not involve the fencing daemon (fenced).  Nodes would unfence themselves when
> they start up.  We might also consider a scheme where a node is unfenced by
> *other* nodes when it starts up, if that has any advantage over
> self-unfencing.

One use case where we need remote unfencing is recovering nodes that boot
from the shared storage, and those are not that uncommon.

I personally don't like the idea of exposing a -U option to users. It's
a shortcut that could easily be misused in an attempt to recover a node
and do more damage than anything else, but I can't see another
solution either.

> cluster3 is the context, but a similar thing would apply to a next generation
> unified fencing system, e.g.
> https://www.redhat.com/archives/cluster-devel/2008-October/msg00005.html
> 
> init.d/cman would run:
> 	cman_tool join
> 	fence_node -U <ourname>
> 	qdiskd
> 	groupd
> 	fenced
> 	dlm_controld
> 	gfs_controld
> 	fence_tool join
> 
> The new step fence_node -U <name> would call libfence:fence_node_undo(name).
> [fence_node <name> currently calls libfence:fence_node(name) to fence a node.]
> 
> libfence:fence_node_undo(node_name) logic:
> 	for each device_name under given node_name,
> 	if an unfencedevice exists with name=device_name, then
> 	run the unfencedevice agent with first arg of "undo"
> 	and other args the normal combination of node and device args
> 	(any agent used with unfencing must recognize/support "undo")

All our agents already support on/off enable/disable operations. It's
probably best to align them to have the same config options rather than
adding a new one across the board.

> 
> [logic derived from cluster.conf structure and similar to fence_node logic]
> 
> Example 1:
> 
> <clusternode name="foo" nodeid="3">
> 	<fence>
> 	<method name="1">
> 		<device name="san" node="foo"/>
> 	</method>
> 	</fence>
> </clusternode>
> 
> <fencedevices>
> 	<fencedevice name="san" agent="fence_scsi"/>
> </fencedevices>
> 
> <unfencedevices>
> 	<unfencedevice name="san" agent="fence_scsi"/>
> </unfencedevices>

I think that we can avoid the whole <unfence* structure either by
overriding the default action="" for that fence method or possibly
consider unfencing a special case method. The idea is to contain the
whole fence config for the node within the <clusternode> object rather
than spreading it even more.

For example:

<method name="1">
 <device name="san" node="foo"/>
</method>
<method name="unfence">
 ...
</method>

OR

<method name="1">
 <device name="san" node="foo"/>
</method>
<method name="2" operation="unfence">
 ...
</method>

(clearly names and format are up for discussion)


> 
> [Note: we've talked about fence_scsi getting a device list from
>  /etc/cluster/fence_scsi.conf instead of from clvm.  It would require
>  more user configuration, but would create fewer problems and should
>  be more robust.]

I think we should really consider firing up a separate thread for this.
It seems to be a more and more often recurring issue.

Fabio




* [Cluster-devel] unfencing
  2009-02-23  6:27 ` Fabio M. Di Nitto
@ 2009-02-23 18:15   ` David Teigland
  2009-02-23 18:31     ` Fabio M. Di Nitto
  0 siblings, 1 reply; 23+ messages in thread
From: David Teigland @ 2009-02-23 18:15 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 07:27:20AM +0100, Fabio M. Di Nitto wrote:
> > libfence:fence_node_undo(node_name) logic:
> > 	for each device_name under given node_name,
> > 	if an unfencedevice exists with name=device_name, then
> > 	run the unfencedevice agent with first arg of "undo"
> > 	and other args the normal combination of node and device args
> > 	(any agent used with unfencing must recognize/support "undo")
> 
> All our agents already support on/off enable/disable operations. It's
> probably best to align them to have the same config options rather than
> adding a new one across the board.

Yes, I have those options in mind, and would prefer to use them as well.
We'll have to wait and see during the implementation phase; for the time being
they complicate things, so I'm using "undo" to avoid those details.

(I did reuse those options back in my first unfencing attempt which I
eventually removed:
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=c781fbb6df57f9780cdadf42126cdcc9a2ff3878)


> > <clusternode name="foo" nodeid="3">
> > 	<fence>
> > 	<method name="1">
> > 		<device name="san" node="foo"/>
> > 	</method>
> > 	</fence>
> > </clusternode>
> > 
> > <fencedevices>
> > 	<fencedevice name="san" agent="fence_scsi"/>
> > </fencedevices>
> > 
> > <unfencedevices>
> > 	<unfencedevice name="san" agent="fence_scsi"/>
> > </unfencedevices>
> 
> I think that we can avoid the whole <unfence* structure either by
> overriding the default action="" for that fence method or possibly
> consider unfencing a special case method. The idea is to contain the
> whole fence config for the node within the <clusternode> object rather
> than spreading it even more.
> 
> For e.g.:
> 
> <method name="1">
>  <device name="san" node="foo"/>
> </method>
> <method name="unfence">
>  ...
> </method>
> 
> OR
> 
> <method name="1">
>  <device name="san" node="foo"/>
> </method>
> <method name="2" operation="unfence">
>  ...
> </method>
> 
> (clearly names and format are up for discussion)

The meanings of those fencing structures have never changed since being
introduced many years ago, and both of those fundamentally change it.  It
would be very unfortunate to redefine them.

A good alternative to <unfencedevices> would be an <unfence> section within
the node sections (it would not require a method level)....  Now that I've
thought more about it, it seems a better choice than "unfencedevices".  It
defines explicitly what should be done, rather than depending on the implicit
effects of matching names between fencedevice/unfencedevice.

<clusternode name="foo" nodeid="3">
	<fence>
	<method name="1">
		<device name="san" node="foo"/>
	</method>
	</fence>

	<unfence>
		<device name="san" node="foo"/>
	</unfence>
</clusternode>

<fencedevices>
	<fencedevice name="san" agent="fence_scsi"/>
</fencedevices>

and

<clusternode name="bar" nodeid="4">
	<fence>
	<method name="1">
		<device name="switch1" port="4"/>
		<device name="switch2" port="6"/>
	</method>
	<method name="2">
		<device name="apc" port="4"/>
	</method>
	</fence>

	<unfence>
		<device name="switch1" port="4"/>
		<device name="switch2" port="6"/>
	</unfence>
</clusternode>

<fencedevices>
        <fencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
        <fencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
        <fencedevice name="apc" agent="fence_apc" ipaddr="3.3.3.3"/>
</fencedevices>

The key thing I've realized since the previous attempt in 2004, is that we
need to explicitly configure what unfencing should happen, rather than just
trying to apply the normal fencing config in reverse.
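
With an explicit <unfence> section the lookup no longer depends on matching names between two device lists; each listed device just resolves to its <fencedevice>. A rough Python sketch under assumptions: the <cluster> wrapper, the function name unfence_calls, and the "undo" first argument are illustrative only, <method> is written with name=, and port 6 is assumed to sit on switch2 as in the <fence> section.

```python
import xml.etree.ElementTree as ET

# Node "bar" with an explicit <unfence> section; <cluster> root is hypothetical.
CONF = """
<cluster>
  <clusternode name="bar" nodeid="4">
    <fence>
      <method name="1">
        <device name="switch1" port="4"/>
        <device name="switch2" port="6"/>
      </method>
      <method name="2">
        <device name="apc" port="4"/>
      </method>
    </fence>
    <unfence>
      <device name="switch1" port="4"/>
      <device name="switch2" port="6"/>
    </unfence>
  </clusternode>
  <fencedevices>
    <fencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
    <fencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
    <fencedevice name="apc" agent="fence_apc" ipaddr="3.3.3.3"/>
  </fencedevices>
</cluster>
"""


def unfence_calls(conf_xml, node_name):
    """Return (agent, arg_string) pairs for the node's <unfence> devices only."""
    root = ET.fromstring(conf_xml)
    devices = {d.get("name"): d for d in root.iter("fencedevice")}
    calls = []
    for node in root.iter("clusternode"):
        if node.get("name") != node_name:
            continue
        section = node.find("unfence")
        if section is None:
            break  # no unfencing configured for this node
        for dev in section.findall("device"):
            fdev = devices.get(dev.get("name"))
            if fdev is None:
                raise ValueError("unfence device %r has no <fencedevice>"
                                 % dev.get("name"))
            # Device args from <unfence>, then the shared <fencedevice> args.
            pairs = [(k, v) for k, v in dev.attrib.items() if k != "name"]
            pairs += [(k, v) for k, v in fdev.attrib.items() if k != "name"]
            calls.append((fdev.get("agent"),
                          "undo " + " ".join('%s="%s"' % kv for kv in pairs)))
    return calls


for agent, args in unfence_calls(CONF, "bar"):
    print(agent, "<-", args)
```

Nothing under <fence> is consulted here, so "apc" is never touched unless someone lists it under <unfence>.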

Dave




* [Cluster-devel] unfencing
  2009-02-23 18:15   ` David Teigland
@ 2009-02-23 18:31     ` Fabio M. Di Nitto
  2009-02-23 18:40       ` David Teigland
  0 siblings, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-23 18:31 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, 2009-02-23 at 12:15 -0600, David Teigland wrote:
> On Mon, Feb 23, 2009 at 07:27:20AM +0100, Fabio M. Di Nitto wrote:
> > > libfence:fence_node_undo(node_name) logic:
> > > 	for each device_name under given node_name,
> > > 	if an unfencedevice exists with name=device_name, then
> > > 	run the unfencedevice agent with first arg of "undo"
> > > 	and other args the normal combination of node and device args
> > > 	(any agent used with unfencing must recognize/support "undo")
> > 
> > All our agents already support on/off enable/disable operations. It's
> > probably best to align them to have the same config options rather than
> > adding a new one across the board.
> 
> Yes, I have those options in mind, and would prefer to use them as well.
> We'll have to wait and see during the implementation phase; for the time being
> they complicate things, so I'm using "undo" to avoid those details.
> 

I know Marek is about to start a "matrix" to map fence agents features
and options. It might be a good thing to talk to him soon'ish. We were
discussing it only a few hours ago.

> The meanings of those fencing structures have never changed since being
> introduced many years ago, and both of those fundamentally change it.  It
> would be very unfortunate to redefine them.

I agree. It's a good point.

> 
> A good alternative to <unfencedevices> would be an <unfence> section within
> the node sections (it would not require a method level)....  Now that I've
> thought more about it, it seems a better choice than "unfencedevices".  It
> defines explicitly what should be done, rather than depending on the implicit
> effects of matching names between fencedevice/unfencedevice.

Agreed.

> 
> <clusternode name="foo" nodeid="3">
> 	<fence>
> 	<method name="1">
> 		<device name="san" node="foo"/>
> 	</method>
> 	</fence>
> 
> 	<unfence>
> 		<device name="san" node="foo"/>
> 	</unfence>
> </clusternode>
> 
> <fencedevices>
> 	<fencedevice name="san" agent="fence_scsi"/>
> </fencedevices>
> 
> and
> 
> <clusternode name="bar" nodeid="4">
> 	<fence>
> 	<method name="1">
> 		<device name="switch1" port="4"/>
> 		<device name="switch2" port="6"/>
> 	</method>
> 	<method name="2">
> 		<device name="apc" port="4"/>
> 	</method>
> 	</fence>
> 
> 	<unfence>
> 		<device name="switch1" port="4"/>
> 		<device name="switch2" port="6"/>
> 	</unfence>
> </clusternode>
> 
> <fencedevices>
>         <fencedevice name="switch1" agent="fence_brocade" ipaddr="1.1.1.1"/>
>         <fencedevice name="switch2" agent="fence_brocade" ipaddr="2.2.2.2"/>
>         <fencedevice name="apc" agent="fence_apc" ipaddr="3.3.3.3"/>
> </fencedevices>
> 
> The key thing I've realized since the previous attempt in 2004, is that we
> need to explicitly configure what unfencing should happen, rather than just
> trying to apply the normal fencing config in reverse.

I think I was trying to apply this same logic and stalled at some point
in the apc+brocade example. With more than one fence agent, the number of
combinations needed to fence and then safely unfence a node simply
grows exponentially.

Given this last example, a reasonable unfence operation would be to try
to poweron via apc too.

There is no guarantee that it was only method="1" fencing the node and
the node could be powered off.

if we succeed in enabling the switch port, we still don't guarantee that
the node will come back because of lack of power..

How do we protect a node that failed to be fenced, from being unfenced?

Example 2:
both method="1" and method="2" fail to fence node X.
At this point any unfence operation is extremely dangerous.

Fabio




* [Cluster-devel] unfencing
  2009-02-23 18:31     ` Fabio M. Di Nitto
@ 2009-02-23 18:40       ` David Teigland
  2009-02-23 18:52         ` Fabio M. Di Nitto
  0 siblings, 1 reply; 23+ messages in thread
From: David Teigland @ 2009-02-23 18:40 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 07:31:29PM +0100, Fabio M. Di Nitto wrote:
> Given this last example, a reasonable unfence operation would be to try
> to poweron via apc too.
> 
> There is no guarantee that it was only method="1" fencing the node and
> the node could be powered off.
> 
> if we succeed in enabling the switch port, we still don't guarantee that
> the node will come back because of lack of power..
> 
> How do we protect a node that failed to be fenced, from being unfenced?
> 
> Example 2:
> both method="1" and method="2" fail to fence node X.
> At this point any unfence operation is extremely dangerous.

A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
make sense; unfencing is only meant to reverse storage fencing.

Dave




* [Cluster-devel] unfencing
  2009-02-23 18:40       ` David Teigland
@ 2009-02-23 18:52         ` Fabio M. Di Nitto
  2009-02-23 19:09           ` David Teigland
  0 siblings, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-23 18:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, 2009-02-23 at 12:40 -0600, David Teigland wrote:
> On Mon, Feb 23, 2009 at 07:31:29PM +0100, Fabio M. Di Nitto wrote:
> > Given this last example, a reasonable unfence operation would be to try
> > to poweron via apc too.
> > 
> > There is no guarantee that it was only method="1" fencing the node and
> > the node could be powered off.
> > 
> > if we succeed in enabling the switch port, we still don't guarantee that
> > the node will come back because of lack of power..
> > 
> > How do we protect a node that failed to be fenced, from being unfenced?
> > 
> > Example 2:
> > both method="1" and method="2" fail to fence node X.
> > At this point any unfence operation is extremely dangerous.
> 
> A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> make sense; unfencing is only meant to reverse storage fencing.

What can stop a user to run fence_node -U from another node to do remote
(un)fencing?

How do we address the problem of nodes booting from that same shared
storage?

Fabio




* [Cluster-devel] unfencing
  2009-02-23 18:52         ` Fabio M. Di Nitto
@ 2009-02-23 19:09           ` David Teigland
  2009-02-23 19:22             ` Ryan O'Hara
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: David Teigland @ 2009-02-23 19:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > make sense; unfencing is only meant to reverse storage fencing.
> 
> What can stop a user to run fence_node -U from another node to do remote
> (un)fencing?

It would work.  Users can do anything they like, that's beside the point.

The point is to make storage fencing more practical by automating storage
unfencing.  Otherwise, users have to invent ad hoc methods of doing it
themselves, often manually.  And, we end up solving the problem in painful,
one-off cases like scsi_reserve/fence_scsi, which cry out for a better
approach.

> How do we address the problem of nodes booting from that same shared
> storage?

Use power fencing (that's not the problem I'm trying to solve.)

Dave




* [Cluster-devel] unfencing
  2009-02-23 19:09           ` David Teigland
@ 2009-02-23 19:22             ` Ryan O'Hara
  2009-02-23 19:27               ` David Teigland
  2009-02-23 20:24             ` Ryan O'Hara
  2009-02-26  6:51             ` Fabio M. Di Nitto
  2 siblings, 1 reply; 23+ messages in thread
From: Ryan O'Hara @ 2009-02-23 19:22 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 01:09:58PM -0600, David Teigland wrote:
> On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > > make sense; unfencing is only meant to reverse storage fencing.
> > 
> > What can stop a user to run fence_node -U from another node to do remote
> > (un)fencing?
> 
> It would work.  Users can do anything they like, that's beside the point.
> 
> The point is to make storage fencing more practical by automating storage
> unfencing.  Otherwise, users have to invent ad hoc methods of doing it
> themselves, often manually.  And, we end up solving the problem in painful,
> one-off cases like scsi_reserve/fence_scsi, which cry out for a better
> approach.

I was going to ask about this. From the sounds of it, we could
eliminate the need for scsi_reserve completely. At least I can't think
of any reason that it could not be eliminated at this point.




* [Cluster-devel] unfencing
  2009-02-23 19:22             ` Ryan O'Hara
@ 2009-02-23 19:27               ` David Teigland
  0 siblings, 0 replies; 23+ messages in thread
From: David Teigland @ 2009-02-23 19:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 01:22:26PM -0600, Ryan O'Hara wrote:
> On Mon, Feb 23, 2009 at 01:09:58PM -0600, David Teigland wrote:
> > On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > > > make sense; unfencing is only meant to reverse storage fencing.
> > > 
> > > What can stop a user to run fence_node -U from another node to do remote
> > > (un)fencing?
> > 
> > It would work.  Users can do anything they like, that's beside the point.
> > 
> > The point is to make storage fencing more practical by automating storage
> > unfencing.  Otherwise, users have to invent ad hoc methods of doing it
> > themselves, often manually.  And, we end up solving the problem in painful,
> > one-off cases like scsi_reserve/fence_scsi, which cry out for a better
> > approach.
> 
> I was going to ask about this. From the sounds of it, we could
> eliminate the need for scsi_reserve completely. At least I can't think
> of any reason that it could not be eliminated at this point.

Yep, that's the idea.




* [Cluster-devel] unfencing
  2009-02-20 21:44 [Cluster-devel] unfencing David Teigland
  2009-02-23  6:27 ` Fabio M. Di Nitto
@ 2009-02-23 19:36 ` Ryan O'Hara
  2009-02-23 19:44   ` David Teigland
  2009-02-26 21:35 ` David Teigland
  2 siblings, 1 reply; 23+ messages in thread
From: Ryan O'Hara @ 2009-02-23 19:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Feb 20, 2009 at 03:44:32PM -0600, David Teigland wrote:

> [Note: we've talked about fence_scsi getting a device list from
>  /etc/cluster/fence_scsi.conf instead of from clvm.  It would require
>  more user configuration, but would create fewer problems and should
>  be more robust.]

Agreed. The "discovery" concept in scsi_reserve/fence_scsi limits its
use to clvm volumes. How to configure devices for use with scsi
reservations is a problem that will need to be addressed.

> fence_node_undo("bar") would:
> - fork fence_brocade
> - pass arg string: undo port="4" agent="fence_brocade" ipaddr="1.1.1.1"
> - fork fence_brocade
> - pass arg string: undo port="6" agent="fence_brocade" ipaddr="2.2.2.2"
> - ignore device "apc" because it's not found under <unfencedevices>

What happens if unfencing fails? Is it safe to say that a node that
fails to unfence itself will be prohibited from joining the fence
domain? This is important for fence_scsi, since unfencing is
equivalent to re-registering with the scsi devices. Failure to
unfence (i.e. register) precludes that node from being able to fence
other nodes. I'm not sure if other fencing methods have this type of
requirement.


* [Cluster-devel] unfencing
  2009-02-23 19:36 ` [Cluster-devel] unfencing Ryan O'Hara
@ 2009-02-23 19:44   ` David Teigland
  0 siblings, 0 replies; 23+ messages in thread
From: David Teigland @ 2009-02-23 19:44 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 01:36:04PM -0600, Ryan O'Hara wrote:
> What happens if unfencing fails? Is it safe to say that a node that
> fails to unfence itself will be prohibited from joining the fence
> domain? This is important for fence_scsi, since unfencing is
> equivalent to re-registering with the scsi devices. Failure to
> unfence (i.e. register) precludes that node from being able to fence
> other nodes. I'm not sure if other fencing methods have this type of
> requirement.

Good point, it would probably be as simple as init.d/cman exiting with a
failure if fence_node -U fails.
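
As a rough illustration of that behaviour (init.d/cman itself is a shell script; this Python sketch only models the control flow), the startup steps from the original proposal run in order and the sequence stops at the first non-zero exit. The function name start_cman and the injected run callable are assumptions made for the sketch.

```python
def start_cman(node_name, run):
    """Run the proposed startup steps in order; stop at the first failure.

    `run` is any callable taking an argv list and returning an exit status
    (hypothetical hook, so the sequence can be exercised without a cluster).
    Returns 0 if every step succeeded, else the first non-zero status.
    """
    steps = [
        ["cman_tool", "join"],
        ["fence_node", "-U", node_name],  # unfence ourselves before anything else
        ["qdiskd"],
        ["groupd"],
        ["fenced"],
        ["dlm_controld"],
        ["gfs_controld"],
        ["fence_tool", "join"],
    ]
    for argv in steps:
        status = run(argv)
        if status != 0:
            # A failed unfence means fenced is never started and the node
            # never joins the fence domain.
            return status
    return 0


# Dry run with a fake runner in which unfencing fails.
ran = []

def fake_run(argv):
    ran.append(" ".join(argv))
    return 1 if argv[0] == "fence_node" else 0

print(start_cman("foo", fake_run), ran)
```

On a real node, subprocess.call could be passed as the run argument.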

Dave




* [Cluster-devel] unfencing
  2009-02-23 19:09           ` David Teigland
  2009-02-23 19:22             ` Ryan O'Hara
@ 2009-02-23 20:24             ` Ryan O'Hara
  2009-02-23 20:28               ` David Teigland
  2009-02-26  6:51             ` Fabio M. Di Nitto
  2 siblings, 1 reply; 23+ messages in thread
From: Ryan O'Hara @ 2009-02-23 20:24 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 01:09:58PM -0600, David Teigland wrote:
> On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > What can stop a user to run fence_node -U from another node to do remote
> > (un)fencing?
> 
> It would work.  Users can do anything they like, that's beside the point.

It would not work for scsi reservations. With scsi reservations, an
unfence operation is as simple as registering with the device(s). It
cannot be done remotely. A registration exists on an "IT nexus": the
relationship between initiator and target. Bottom line is that a
remote node cannot register another node --- the registration
(sg_persist command) has to be run on the node that wants to "unfence"
itself.


* [Cluster-devel] unfencing
  2009-02-23 20:24             ` Ryan O'Hara
@ 2009-02-23 20:28               ` David Teigland
  0 siblings, 0 replies; 23+ messages in thread
From: David Teigland @ 2009-02-23 20:28 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 23, 2009 at 02:24:13PM -0600, Ryan O'Hara wrote:
> On Mon, Feb 23, 2009 at 01:09:58PM -0600, David Teigland wrote:
> > On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > What can stop a user to run fence_node -U from another node to do remote
> > > (un)fencing?
> > 
> > It would work.  Users can do anything they like, that's beside the point.
> 
> It would not work for scsi reservations. With scsi reservations, an
> unfence operation is as simple as registering with the device(s). It
> cannot be done remotely. A registration exists on an "IT nexus": the
> relationship between initiator and target. Bottom line is that a
> remote node cannot register another node --- the registration
> (sg_persist command) has to be run on the node that wants to "unfence"
> itself.

OK, thanks, that's good to keep in mind.  The "other scheme" I mentioned
originally where *other* nodes would unfence a node (instead of
self-unfencing) wouldn't work for scsi.

Dave




* [Cluster-devel] unfencing
  2009-02-23 19:09           ` David Teigland
  2009-02-23 19:22             ` Ryan O'Hara
  2009-02-23 20:24             ` Ryan O'Hara
@ 2009-02-26  6:51             ` Fabio M. Di Nitto
  2009-02-26 14:33               ` David Teigland
  2 siblings, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-26  6:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, 2009-02-23 at 13:09 -0600, David Teigland wrote:
> On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > > make sense; unfencing is only meant to reverse storage fencing.
> > 
> > What can stop a user to run fence_node -U from another node to do remote
> > (un)fencing?
> 
> It would work.  Users can do anything they like, that's beside the point.

I was thinking about 2 little points..

Given the time at which fence_node -U will fire, you probably want to
add a cman_init + cman_is_active + cman_finish loop in fence_node to
make sure cman is ready to reply to our ccs queries, otherwise we might
have a race condition at boot time (it might be already there.. didn't
really check the code). All our daemons do that to give cman time to
bootstrap.
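
The shape of such a wait loop, sketched in Python rather than C. The is_active callable is a stand-in for one cman_init + cman_is_active + cman_finish round trip; its name, the timeout and interval defaults, and the injectable sleep/clock hooks are all assumptions for illustration.

```python
import time


def wait_for_cman(is_active, timeout=10.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until is_active() returns true or the timeout expires.

    is_active: hypothetical probe standing in for a
               cman_init + cman_is_active + cman_finish round trip.
    Returns True if cman became active in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if is_active():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Dry run: the probe answers "not yet" twice, then "active".
answers = iter([False, False, True])
print(wait_for_cman(lambda: next(answers), interval=0.01))
```

fence_node -U would only go on to read its config once this returns True, and would exit with an error otherwise.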

The second thing would be to set a minimal protection mechanism by
allowing fence_node -U to be fired only for the node that is invoking
it. So if we run on node A, fence_node -U can only execute unfencing
operations for node A. For testing purposes then we could add a manual
override such as "--i-understand-this-operation-can-destroy-the-world".
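
The check itself would be tiny. A Python sketch, where the names UnfenceRefused and check_unfence_target and the boolean override parameter (standing in for the long option above) are made up for illustration:

```python
class UnfenceRefused(Exception):
    """Raised when unfencing a remote node is requested without the override."""


def check_unfence_target(target_node, local_node, override=False):
    """Allow unfencing only of the local node unless explicitly overridden.

    Returns the node name to unfence, or raises UnfenceRefused.
    """
    if target_node != local_node and not override:
        raise UnfenceRefused(
            "refusing to unfence %s from %s" % (target_node, local_node))
    return target_node


print(check_unfence_target("A", "A"))                 # local node: allowed
print(check_unfence_target("B", "A", override=True))  # testing override
try:
    check_unfence_target("B", "A")
except UnfenceRefused as err:
    print("refused:", err)
```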

Fabio


* [Cluster-devel] unfencing
  2009-02-26  6:51             ` Fabio M. Di Nitto
@ 2009-02-26 14:33               ` David Teigland
  2009-02-26 18:06                 ` [Cluster-devel] unfencing (cman startup) Fabio M. Di Nitto
  0 siblings, 1 reply; 23+ messages in thread
From: David Teigland @ 2009-02-26 14:33 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Feb 26, 2009 at 07:51:57AM +0100, Fabio M. Di Nitto wrote:
> On Mon, 2009-02-23 at 13:09 -0600, David Teigland wrote:
> > On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > > > make sense; unfencing is only meant to reverse storage fencing.
> > > 
> > > What can stop a user to run fence_node -U from another node to do remote
> > > (un)fencing?
> > 
> > It would work.  Users can do anything they like, that's beside the point.
> 
> I was thinking about 2 little points..
> 
> Given the time at which fence_node -U will fire, you probably want to
> add a cman_init + cman_is_active + cman_finish loop in fence_node to
> make sure cman is ready to reply to our ccs queries, otherwise we might
> have a race condition at boot time (it might be already there.. didn't
> really check the code). All our daemons do that to give cman time to
> bootstrap.

Yes, good point.  I wonder if we'd be better off having cman_tool join
effectively do an is_active wait before exiting?  Then we could probably
avoid doing it many other places.  (It's also annoying when corosync crashes
after is_active completes, but before I've read what I need from cman/ccs.)

> The second thing would be to set a minimal protection mechanism by
> allowing fence_node -U to be fired only for the node that is invoking
> it. So if we run on node A, fence_node -U can only execute unfencing
> operations for node A. For testing purposes then we could add a manual
> override such as "--i-understand-this-operation-can-destroy-the-world".

I plan to use "fence_node -U" (no name) to unfence self.  I'm inclined to
just allow any node name after that, but not advertise it.




* [Cluster-devel] unfencing (cman startup)
  2009-02-26 14:33               ` David Teigland
@ 2009-02-26 18:06                 ` Fabio M. Di Nitto
  2009-02-27 12:54                   ` Chrissie Caulfield
  0 siblings, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-26 18:06 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, 2009-02-26 at 08:33 -0600, David Teigland wrote:
> On Thu, Feb 26, 2009 at 07:51:57AM +0100, Fabio M. Di Nitto wrote:
> > On Mon, 2009-02-23 at 13:09 -0600, David Teigland wrote:
> > > On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
> > > > > A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
> > > > > make sense; unfencing is only meant to reverse storage fencing.
> > > > 
> > > > What can stop a user to run fence_node -U from another node to do remote
> > > > (un)fencing?
> > > 
> > > It would work.  Users can do anything they like, that's beside the point.
> > 
> > I was thinking about 2 little points..
> > 
> > Given the time at which fence_node -U will fire, you probably want to
> > add a cman_init + cman_is_active + cman_finish loop in fence_node to
> > make sure cman is ready to reply to our ccs queries, otherwise we might
> > have a race condition at boot time (it might be already there.. didn't
> > really check the code). All our daemons do that to give cman time to
> > bootstrap.
> 
> Yes, good point.  I wonder if we'd be better off having cman_tool join
> effectively do an is_active wait before exiting?  Then we could probably
> avoid doing it many other places.  (It's also annoying when corosync crashes
> after is_active completes, but before I've read what I need from cman/ccs.)

Hmm, it might be reasonable to ask cman_tool to do that. But look at how
cmannotifyd works, for example (I know not all daemons can do that): it
can cope with cman going away and coming back, whatever the reason.
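
For illustration, the kind of wait loop being discussed could be sketched in
shell roughly as below. This is only a sketch under assumptions:
wait_until_ready and its arguments are made-up names, and the check command
is a placeholder for whatever probe reports that cman is answering (the
daemons do this in C with cman_init + cman_is_active + cman_finish).

```shell
#!/bin/sh
# Hypothetical helper: retry a readiness check until it succeeds or we
# run out of attempts.
# Usage: wait_until_ready <max_tries> <delay_seconds> <command> [args...]
wait_until_ready()
{
    max_tries=$1
    delay=$2
    shift 2

    try=0
    while [ "$try" -lt "$max_tries" ]
    do
        # The check command stands in for a cman_is_active-style probe.
        if "$@" > /dev/null 2>&1
        then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done

    # The check never succeeded within max_tries attempts.
    return 1
}
```

A caller would run something like "wait_until_ready 30 1 <probe>" before
issuing its first ccs query, and bail out if the function returns non-zero.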

> 
> > The second thing would be to set a minimal protection mechanism by
> > allowing fence_node -U to be fired only for the node that it is invoking
> > it. So if we run on node A, fence_node -U can only execute unfencing
> > operations for node A. For testing purposes then we could add a manual
> > override such as "--i-understand-this-operation-can-destroy-the-world".
> 
> I plan to use "fence_node -U" (no name) to unfence self.  I'm inclined to
> just allow any node name after that, but not advertise it.

a bit too late.. we are on a public mailing list ;)

Fabio




* [Cluster-devel] unfencing
  2009-02-20 21:44 [Cluster-devel] unfencing David Teigland
  2009-02-23  6:27 ` Fabio M. Di Nitto
  2009-02-23 19:36 ` [Cluster-devel] unfencing Ryan O'Hara
@ 2009-02-26 21:35 ` David Teigland
  2009-02-27  7:04   ` Fabio M. Di Nitto
  2 siblings, 1 reply; 23+ messages in thread
From: David Teigland @ 2009-02-26 21:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Feb 20, 2009 at 03:44:32PM -0600, David Teigland wrote:
> init.d/cman would run:
> 	cman_tool join
> 	fence_node -U <ourname>

How does this look?  I'm not up on the ins and outs of init scripts.
In the common case, no unfencing will happen, and nothing is printed.
If unfencing is defined and fails, the whole script exits with failure.
That's important for fence_scsi, but not necessarily for other forms
of unfencing... should we make that behavior conditional on fence_scsi?
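
The exit-status handling described above can also be sketched as a
standalone function, which is easier to try out with a stub than the init
script itself. Assumptions in this sketch: FENCE_NODE is a made-up override
variable, and any status other than 0 or 1 is taken to mean "no unfencing
defined for this node"; the exact value fence_node returns in that case is
not specified here.

```shell
#!/bin/sh
# FENCE_NODE is a hypothetical override so the sketch can be exercised
# with a stub instead of the real fence_node binary.
: "${FENCE_NODE:=fence_node}"

unfence_self()
{
    "$FENCE_NODE" -U > /dev/null 2>&1
    case $? in
        0)
            echo "   Unfencing self... done"
            return 0
            ;;
        1)
            echo "   Unfencing self... failed"
            return 1
            ;;
        *)
            # Any other status is treated as "nothing to unfence":
            # stay silent and let startup continue.
            return 0
            ;;
    esac
}
```

The case statement maps statuses the same way as the nested if/else in the
patch below; it is only a different way of writing the same three branches.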


diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
index 9303a0b..b8303cb 100644
--- a/cman/init.d/cman.in
+++ b/cman/init.d/cman.in
@@ -236,6 +236,25 @@ start_cman()
     return 0
 }
 
+unfence_self()
+{
+    fence_node -U > /dev/null 2>&1
+    error=$?
+
+    if [ $error -eq 0 ]
+    then
+        echo "   Unfencing self... done"
+        return 0
+    else
+        if [ $error -eq 1 ]
+        then
+            echo "   Unfencing self... failed"
+            return 1
+        else
+            return 0
+        fi
+    fi
+}
 
 start_qdiskd()
 {
@@ -502,6 +521,12 @@ start()
 	return 1
     fi
 
+    unfence_self
+    if [ $? -eq 1 ]
+    then
+        return 1
+    fi
+
     start_qdiskd
 
     echo -n "   Starting daemons... "




* [Cluster-devel] unfencing
  2009-02-26 21:35 ` David Teigland
@ 2009-02-27  7:04   ` Fabio M. Di Nitto
  0 siblings, 0 replies; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-27  7:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, 2009-02-26 at 15:35 -0600, David Teigland wrote:
> On Fri, Feb 20, 2009 at 03:44:32PM -0600, David Teigland wrote:
> > init.d/cman would run:
> > 	cman_tool join
> > 	fence_node -U <ourname>
> 
> How does this look?  I'm not up on the ins and outs of init scripts.
> In the common case, no unfencing will happen, and nothing is printed.
> If unfencing is defined and fails, the whole script exits with failure.
> That's important for fence_scsi, but not necessarily for other forms
> of unfencing... should we make that behavior conditional on fence_scsi?
> 
> 

The logic seems OK. We use other commands to print success/failure, but
you can commit it and I'll fix it later, or I can merge it myself once I
get to it, which should be soon.

> diff --git a/cman/init.d/cman.in b/cman/init.d/cman.in
> index 9303a0b..b8303cb 100644
> --- a/cman/init.d/cman.in
> +++ b/cman/init.d/cman.in
> @@ -236,6 +236,25 @@ start_cman()
>      return 0
>  }
>  
> +unfence_self()
> +{
> +    fence_node -U > /dev/null 2>&1
> +    error=$?
> +
> +    if [ $error -eq 0 ]
> +    then
> +        echo "   Unfencing self... done"
> +        return 0
> +    else
> +        if [ $error -eq 1 ]
> +        then
> +            echo "   Unfencing self... failed"
> +            return 1
> +        else
> +            return 0
> +        fi
> +    fi
> +}
>  
>  start_qdiskd()
>  {
> @@ -502,6 +521,12 @@ start()
>  	return 1
>      fi
>  
> +    unfence_self
> +    if [ $? -eq 1 ]
> +    then
> +        return 1
> +    fi
> +
>      start_qdiskd
>  
>      echo -n "   Starting daemons... "




* [Cluster-devel] unfencing (cman startup)
  2009-02-26 18:06                 ` [Cluster-devel] unfencing (cman startup) Fabio M. Di Nitto
@ 2009-02-27 12:54                   ` Chrissie Caulfield
  2009-02-27 15:52                     ` David Teigland
  0 siblings, 1 reply; 23+ messages in thread
From: Chrissie Caulfield @ 2009-02-27 12:54 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Fabio M. Di Nitto wrote:
> On Thu, 2009-02-26 at 08:33 -0600, David Teigland wrote:
>> On Thu, Feb 26, 2009 at 07:51:57AM +0100, Fabio M. Di Nitto wrote:
>>> On Mon, 2009-02-23 at 13:09 -0600, David Teigland wrote:
>>>> On Mon, Feb 23, 2009 at 07:52:55PM +0100, Fabio M. Di Nitto wrote:
>>>>>> A node unfences *itself* when it boots up.  As such, power-unfencing doesn't
>>>>>> make sense; unfencing is only meant to reverse storage fencing.
>>>>> What can stop a user to run fence_node -U from another node to do remote
>>>>> (un)fencing?
>>>> It would work.  Users can do anything they like, that's beside the point.
>>> I was thinking about 2 little points..
>>>
>>> Given the time at which fence_node -U will fire, you probably want to
>>> add a cman_init + cman_is_active + cman_finish loop in fence_node to
>>> make sure cman is ready to reply to our ccs queries, otherwise we might
>>> have a race condition at boot time (it might be already there.. didn't
>>> really check the code). All our daemons do that to give cman time to
>>> bootstrap.
>> Yes, good point.  I wonder if we'd be better off having cman_tool join
>> effectively do an is_active wait before exiting?  Then we could probably
>> avoid doing it many other places.  (It's also annoying when corosync crashes
>> after is_active completes, but before I've read what I need from cman/ccs.)
> 

Err, cman_tool already does this with the -w switch, and the init script
uses it.
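
As a rough sketch of what that means for a caller: once the join is run
with -w, the next step needs no wait loop of its own. CMAN_TOOL,
JOIN_TIMEOUT and join_then are made-up names for this sketch, and the use
of -t to bound the wait is from my reading of the cman_tool man page, so
check it against your version.

```shell
#!/bin/sh
# CMAN_TOOL and JOIN_TIMEOUT are hypothetical knobs for this sketch.
: "${CMAN_TOOL:=cman_tool}"
: "${JOIN_TIMEOUT:=60}"

# Join the cluster, wait for the join to complete, then run the given
# command. Returns 1 without running the command if the join fails.
join_then()
{
    # -w makes "cman_tool join" wait until the join has completed;
    # -t bounds that wait, in seconds.
    "$CMAN_TOOL" -t "$JOIN_TIMEOUT" -w join || return 1
    "$@"
}
```

For example, "join_then fence_node -U" would only attempt unfencing after
the join has returned successfully.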

> hmm.. it might be reasonable to ask cman_tool to do that, b

Chrissie




* [Cluster-devel] unfencing (cman startup)
  2009-02-27 12:54                   ` Chrissie Caulfield
@ 2009-02-27 15:52                     ` David Teigland
  2009-02-27 16:27                       ` Chrissie Caulfield
  2009-02-27 17:46                       ` Fabio M. Di Nitto
  0 siblings, 2 replies; 23+ messages in thread
From: David Teigland @ 2009-02-27 15:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Feb 27, 2009 at 12:54:20PM +0000, Chrissie Caulfield wrote:
> >>> Given the time at which fence_node -U will fire, you probably want to
> >>> add a cman_init + cman_is_active + cman_finish loop in fence_node to
> >>> make sure cman is ready to reply to our ccs queries, otherwise we might
> >>> have a race condition at boot time (it might be already there.. didn't
> >>> really check the code). All our daemons do that to give cman time to
> >>> bootstrap.
> >> Yes, good point.  I wonder if we'd be better off having cman_tool join
> >> effectively do an is_active wait before exiting?  Then we could probably
> >> avoid doing it many other places.  (It's also annoying when corosync crashes
> >> after is_active completes, but before I've read what I need from cman/ccs.)
> > 
> 
> Err, cman_tool already does this with the -w switch, and the init script
> uses it.

Great, so the constant flogging to add cman_is_active checks everywhere will
end!?  Can I remove all my cman_is_active loops?




* [Cluster-devel] unfencing (cman startup)
  2009-02-27 15:52                     ` David Teigland
@ 2009-02-27 16:27                       ` Chrissie Caulfield
  2009-02-27 17:46                       ` Fabio M. Di Nitto
  1 sibling, 0 replies; 23+ messages in thread
From: Chrissie Caulfield @ 2009-02-27 16:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

David Teigland wrote:
> On Fri, Feb 27, 2009 at 12:54:20PM +0000, Chrissie Caulfield wrote:
>>>>> Given the time at which fence_node -U will fire, you probably want to
>>>>> add a cman_init + cman_is_active + cman_finish loop in fence_node to
>>>>> make sure cman is ready to reply to our ccs queries, otherwise we might
>>>>> have a race condition at boot time (it might be already there.. didn't
>>>>> really check the code). All our daemons do that to give cman time to
>>>>> bootstrap.
>>>> Yes, good point.  I wonder if we'd be better off having cman_tool join
>>>> effectively do an is_active wait before exiting?  Then we could probably
>>>> avoid doing it many other places.  (It's also annoying when corosync crashes
>>>> after is_active completes, but before I've read what I need from cman/ccs.)
>> Err, cman_tool already does this with the -w switch, and the init script
>> uses it.
> 
> Great, so the constant flogging to add cman_is_active checks everywhere will
> end!?  Can I remove all my cman_is_active loops?

Yes.

And if it doesn't work, file a bug :)

Chrissie




* [Cluster-devel] unfencing (cman startup)
  2009-02-27 15:52                     ` David Teigland
  2009-02-27 16:27                       ` Chrissie Caulfield
@ 2009-02-27 17:46                       ` Fabio M. Di Nitto
  2009-03-02  7:59                         ` Chrissie Caulfield
  1 sibling, 1 reply; 23+ messages in thread
From: Fabio M. Di Nitto @ 2009-02-27 17:46 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, 2009-02-27 at 09:52 -0600, David Teigland wrote:
> On Fri, Feb 27, 2009 at 12:54:20PM +0000, Chrissie Caulfield wrote:
> > >>> Given the time at which fence_node -U will fire, you probably want to
> > >>> add a cman_init + cman_is_active + cman_finish loop in fence_node to
> > >>> make sure cman is ready to reply to our ccs queries, otherwise we might
> > >>> have a race condition at boot time (it might be already there.. didn't
> > >>> really check the code). All our daemons do that to give cman time to
> > >>> bootstrap.
> > >> Yes, good point.  I wonder if we'd be better off having cman_tool join
> > >> effectively do an is_active wait before exiting?  Then we could probably
> > >> avoid doing it many other places.  (It's also annoying when corosync crashes
> > >> after is_active completes, but before I've read what I need from cman/ccs.)
> > > 
> > 
> > Err, cman_tool already does this with the -w switch, and the init script
> > uses it.
> 
> Great, so the constant flogging to add cman_is_active checks everywhere will
> end!?  Can I remove all my cman_is_active loops?

This works fine via the init script. We could theoretically kill all those
loops, but at least for us developers, who start stuff by hand, they
could still be useful, and maybe a good failsafe if we ask users to run
something manually for debugging. Dunno, just a thought. I don't have
a strong opinion on this matter.

Fabio






* [Cluster-devel] unfencing (cman startup)
  2009-02-27 17:46                       ` Fabio M. Di Nitto
@ 2009-03-02  7:59                         ` Chrissie Caulfield
  0 siblings, 0 replies; 23+ messages in thread
From: Chrissie Caulfield @ 2009-03-02  7:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Fabio M. Di Nitto wrote:
> On Fri, 2009-02-27 at 09:52 -0600, David Teigland wrote:
>> On Fri, Feb 27, 2009 at 12:54:20PM +0000, Chrissie Caulfield wrote:
>>>>>> Given the time at which fence_node -U will fire, you probably want to
>>>>>> add a cman_init + cman_is_active + cman_finish loop in fence_node to
>>>>>> make sure cman is ready to reply to our ccs queries, otherwise we might
>>>>>> have a race condition at boot time (it might be already there.. didn't
>>>>>> really check the code). All our daemons do that to give cman time to
>>>>>> bootstrap.
>>>>> Yes, good point.  I wonder if we'd be better off having cman_tool join
>>>>> effectively do an is_active wait before exiting?  Then we could probably
>>>>> avoid doing it many other places.  (It's also annoying when corosync crashes
>>>>> after is_active completes, but before I've read what I need from cman/ccs.)
>>> Err, cman_tool already does this with the -w switch, and the init script
>>> uses it.
>> Great, so the constant flogging to add cman_is_active checks everywhere will
>> end!?  Can I remove all my cman_is_active loops?
> 
> This works fine via init script. We could theoretically kill all those
> loops but at least for us developers, that start stuff by hand, they
> could still be useful.. and maybe a good failsafe if we ask users to run
> something manually for debugging.. dunno.. just a thought. I don't have
> a strong opinion on this matter.
> 

You might as well take them out, to be honest. Those loops are mostly a
holdover from the RHEL4 cman, where cman started up but could take 20-30
seconds to start or join a cluster. With openais/corosync, once the
daemon is up you can talk to it.

It might not be quorate ... but that IS your problem :-)

Chrissie




Thread overview: 23+ messages
2009-02-20 21:44 [Cluster-devel] unfencing David Teigland
2009-02-23  6:27 ` Fabio M. Di Nitto
2009-02-23 18:15   ` David Teigland
2009-02-23 18:31     ` Fabio M. Di Nitto
2009-02-23 18:40       ` David Teigland
2009-02-23 18:52         ` Fabio M. Di Nitto
2009-02-23 19:09           ` David Teigland
2009-02-23 19:22             ` Ryan O'Hara
2009-02-23 19:27               ` David Teigland
2009-02-23 20:24             ` Ryan O'Hara
2009-02-23 20:28               ` David Teigland
2009-02-26  6:51             ` Fabio M. Di Nitto
2009-02-26 14:33               ` David Teigland
2009-02-26 18:06                 ` [Cluster-devel] unfencing (cman startup) Fabio M. Di Nitto
2009-02-27 12:54                   ` Chrissie Caulfield
2009-02-27 15:52                     ` David Teigland
2009-02-27 16:27                       ` Chrissie Caulfield
2009-02-27 17:46                       ` Fabio M. Di Nitto
2009-03-02  7:59                         ` Chrissie Caulfield
2009-02-23 19:36 ` [Cluster-devel] unfencing Ryan O'Hara
2009-02-23 19:44   ` David Teigland
2009-02-26 21:35 ` David Teigland
2009-02-27  7:04   ` Fabio M. Di Nitto
