[Ocfs2-devel] Fencing in OCFS2

* [Ocfs2-devel] Fencing in OCFS2
@ 2006-05-17 14:20 Sum Sha
  2006-05-18 14:46 ` Sum Sha
  2006-05-25 21:35 ` Daniel Phillips
  0 siblings, 2 replies; 9+ messages in thread
From: Sum Sha @ 2006-05-17 14:20 UTC (permalink / raw)
  To: ocfs2-devel

Hi,
Just wanted to understand how OCFS2 fencing works. Sorry if this has
already been discussed...
(1)
--------------
A node has quorum when:
        * it sees an odd number of heartbeating nodes and has network
          connectivity to more than half of them.
                or
        * it sees an even number of heartbeating nodes and has network
          connectivity to at least half of them *and* has connectivity to
          the heartbeating node with the lowest node number.
--------------
Now, think about a case where there are 5 nodes in an OCFS2 cluster.
Suppose a split-brain happens and the cluster is divided into 2
subclusters of 3 nodes and 2 nodes. In this case the algorithm works
fine and the 3-node subcluster wins the race. But consider a serial
split-brain, where 2 split-brains at the same time leave 3 subclusters
of 2 nodes, 2 nodes and 1 node. In this case the algorithm fails and
all subclusters panic, because in no subcluster does any node have
connectivity to more than (5/2 = 2) nodes, while each node still gets
disk heartbeats from all 5 nodes.

This can happen with any cluster configuration if there are serial
split-brains. Was the algorithm designed to handle serial
split-brains? If so, how?
Is there anything else which is to be considered?
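
The rule quoted above can be sketched in Python. This is only an
illustration, not the kernel code in quorum.c: the names has_quorum()
and fenced_nodes() are made up here, and the sketch assumes that a node
counts itself among the nodes it has connectivity to (otherwise the
3-node subcluster in the first example could not win).

```python
def has_quorum(node, heartbeating, reachable):
    """Apply the quoted quorum rule to one node.

    heartbeating: set of node numbers seen heartbeating on disk
    reachable:    set of node numbers this node can reach over the network
    """
    # Assumption: the node always counts as connected to itself.
    connected = (set(reachable) | {node}) & set(heartbeating)
    total = len(heartbeating)
    if total % 2 == 1:
        # Odd case: connectivity to more than half of the heartbeating nodes.
        return 2 * len(connected) > total
    # Even case: at least half, and the lowest-numbered node must be reachable.
    return 2 * len(connected) >= total and min(heartbeating) in connected


def fenced_nodes(partitions):
    """Return the nodes that lose quorum when the network splits into
    `partitions` while every node still heartbeats on the shared disk."""
    heartbeating = set().union(*partitions)
    fenced = set()
    for part in partitions:
        for node in part:
            if not has_quorum(node, heartbeating, part):
                fenced.add(node)
    return fenced


three_two = fenced_nodes([{0, 1, 2}, {3, 4}])
two_two_one = fenced_nodes([{0, 1}, {2, 3}, {4}])
```

Under these assumptions the 3/2 split fences only the 2-node
subcluster, while the 2/2/1 split fences all 5 nodes.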

(2) In ocfs2_faq I read that the quorum process may take up to 28
seconds to stabilize.
--------------
Q05     How long does the quorum process take?
A05     First a node will realize that it doesn't have connectivity with
        another node.  This can happen immediately if the connection is closed
        but can take a maximum of 10 seconds of idle time.  Then the node
        must wait long enough to give heartbeating a chance to declare the
        node dead.  It does this by waiting two iterations longer than
        the number of iterations needed to consider a node dead (see Q03 in
        the Heartbeat section of this FAQ).  The current default of 7
        iterations of 2 seconds results in waiting for 9 iterations or 18
        seconds.  By default, then, a maximum of 28 seconds can pass from the
        time a network fault occurs until a node fences itself.
--------------
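
The arithmetic in A05 can be written out as a small Python sketch. The
function and parameter names are hypothetical; only the default values
(10 seconds of idle time, 7 iterations, 2 extra iterations, 2-second
iterations) come from the FAQ text.

```python
def max_quorum_delay(idle_timeout_s=10, dead_iterations=7,
                     extra_iterations=2, iteration_s=2):
    """Worst-case seconds from a network fault until a node fences itself,
    following the FAQ's description."""
    # The node waits two iterations longer than heartbeat needs
    # to consider a node dead: 7 + 2 = 9 iterations.
    wait_iterations = dead_iterations + extra_iterations
    # Idle timeout to notice the lost connection, then the heartbeat wait.
    return idle_timeout_s + wait_iterations * iteration_s


default_delay = max_quorum_delay()  # 10 + 9 * 2
```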

I don't understand why we give heartbeating an extra 2 iterations to
declare a node dead in case of split-brain. As I see it, if we are
already missing disk heartbeats for a node, then its missed-heartbeat
counter has already started and we would declare that node dead after
7 iterations. How are these extra 2 iterations included?

What I mean is this: after the 10 seconds of TCP idle timeout for a
node, we assume that we will start missing disk heartbeats for that
node, and we allow 9 iterations of such missed heartbeats. But how do
you inform the other thread, which is already doing this
missed-heartbeat calculation (because we are missing disk heartbeats),
that it needs to wait for 2 more iterations before declaring the node
dead? If that thread is not informed, it will declare the other node
dead after only 7 iterations. So how do the extra 2 iterations come
into the picture?

Thanks.
Sumsha.


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-17 14:20 [Ocfs2-devel] Fencing in OCFS2 Sum Sha
@ 2006-05-18 14:46 ` Sum Sha
  2006-05-18 18:48   ` Zach Brown
  2006-05-25 21:35 ` Daniel Phillips
  1 sibling, 1 reply; 9+ messages in thread
From: Sum Sha @ 2006-05-18 14:46 UTC (permalink / raw)
  To: ocfs2-devel

After some experiments and going through OCFS2's quorum code, I am
sure that in the case of a serial split-brain the quorum algorithm
will break and cause a complete cluster shutdown: all the subcluster
nodes will panic themselves. Please correct me if I am wrong...

The o2quo_make_decision() function, which is responsible for making
the final decision on hb_up and hb_down, makes many assumptions (which
may not hold), and it may make the wrong decision in serial
split-brain cases.

Probably this problem will be resolved once "we have some more
rational approach that is driven from userspace", as mentioned in
quorum.c.

Thanks.
Sumsha.


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-18 14:46 ` Sum Sha
@ 2006-05-18 18:48   ` Zach Brown
  2006-05-18 19:21     ` Daniel Phillips
  0 siblings, 1 reply; 9+ messages in thread
From: Zach Brown @ 2006-05-18 18:48 UTC (permalink / raw)
  To: ocfs2-devel

Sum Sha wrote:
> With some experiments and going through OCFS2's quorum code, I am sure
> that in case of serial split-brain, quorum algorithm will surely break
> and will cause complete cluster shutdown. It will cause all the
> subcluster nodes to panic themselves. Please correct me if I am
> wrong...

Can you provide the details that lead you to be sure of your analysis?

- z


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-18 18:48   ` Zach Brown
@ 2006-05-18 19:21     ` Daniel Phillips
  2006-05-19 14:40       ` Sum Sha
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Phillips @ 2006-05-18 19:21 UTC (permalink / raw)
  To: ocfs2-devel

Zach Brown wrote:
> Sum Sha wrote:
> 
>>With some experiments and going through OCFS2's quorum code, I am sure
>>that in case of serial split-brain, quorum algorithm will surely break
>>and will cause complete cluster shutdown. It will cause all the
>>subcluster nodes to panic themselves. Please correct me if I am
>>wrong...
> 
> Can you provide the details that lead you to be sure of your analysis?

It seems to be the intended behavior.  No node is a member of a quorum
group so all nodes should be fenced.  This is correct.

What is wrong is the idea that fencing is equivalent to panicking.  No!
Fencing means preventing a particular node from writing to shared storage.
The FAQ item is incorrect.  How about this instead?

Q03     What is fencing?
A03     Fencing is the act of preventing a node from writing to shared
         cluster storage.

Regards,

Daniel


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-18 19:21     ` Daniel Phillips
@ 2006-05-19 14:40       ` Sum Sha
  2006-05-20  2:00         ` Daniel Phillips
  0 siblings, 1 reply; 9+ messages in thread
From: Sum Sha @ 2006-05-19 14:40 UTC (permalink / raw)
  To: ocfs2-devel

I don't know whether I am looking at very old code or not
understanding what you mean.
The code for OCFS2 version 1.2.0-1 effectively says: if a node detects
that it is not part of the quorum, it panics itself.

Inside fs/ocfs2/cluster/quorum.c: o2quo_make_decision() {
-> A node detects whether it is part of the quorum
-> If it is not, it calls o2quo_fence_self()
-> o2quo_fence_self() stops all the regions by calling
o2hb_stop_all_regions() and then calls panic() directly with the
message "ocfs2 is very sorry to be fencing this system by
panicing\n"...
}

So in this case, does fencing mean panic or not?
If you want to stop a node from accessing shared storage, then
panicking may be a good idea (that is what is done here), but I don't
understand how it can be a good idea if this algorithm stops all the
nodes and causes a complete cluster shutdown!

Probably I am looking at an older version of the code, or some more
explanation is needed here :)

Thanks.
Sumsha.


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-19 14:40       ` Sum Sha
@ 2006-05-20  2:00         ` Daniel Phillips
  2006-05-22  7:42           ` Sum Sha
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Phillips @ 2006-05-20  2:00 UTC (permalink / raw)
  To: ocfs2-devel

Sum Sha wrote:
> Don't know if I am looking at very old code or not getting what you
> want to say.
> Code for OCFS2 version 1.2.0-1 says that "if a node detects that it's
> not part of quorum, then panic itself".
> 
> Inside fs/ocfs2/cluster/quorum.c: o2quo_make_decision() {
> -> A Node detects if it's part of quorum
> -> If it's not, then it calls o2quo_fence_self()
> -> o2quo_fence_self() function stops all the regions by calling
> o2hb_stop_all_regions() and then calls panic() directly with the
> message "ocfs2 is very sorry to be fencing this system by
> panicing\n"...
> }
> 
> Now tell me if in this case fencing means panic or not?
> If you want to stop a node from accessing a shared storage, then
> panicking may be a good idea (that's what you are doing here), but
> don't understand if this algorithm stops all the nodes and causes
> complete cluster shutdown, then how it can be a good idea !
> 
> Probably I am looking at the older version of the code or some more
> explanation is needed here :)

You are looking at a quick hack appropriate for a first try.  Now let's look at
what has to be done to make this more generic and less panic-oriented.

1) Self-fencing is just one possible fencing method, so we need a way of plugging
in and configuring other fencing methods.

2) There are really two parts to self-fencing:
      * Target.  Each fencing method includes a specified behavior of the
        node that is to be fenced.  We must define such behavior accurately,
        or we won't be able to use self-fencing.  For fencing methods other
        than self-fencing we still may want to define target behaviour, such
        as rebooting, or attempting self-cleanup and rejoin.  Each target
        fencing method specifies the initiation method to be used in order
        to fence this node.

      * Initiator.  Fencing must be initiated by some quorum node.  A
        particular fencing method initiates fencing by some means.  For a
        self-fencing target the initiator method simply waits some number of
        heartbeats then reports success.
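
The target/initiator split described above could be sketched roughly
as follows. Everything here is hypothetical: the class names, the
registry and the method names are invented for illustration and are
not an existing OCFS2 interface or the proposal mentioned in this
mail. The sketch only shows one method (self-fencing) whose initiator
waits a number of heartbeats and then reports success.

```python
import time
from abc import ABC, abstractmethod


class FencingMethod(ABC):
    """Hypothetical plug-in interface: a target behaviour plus an initiator."""

    @abstractmethod
    def target_behaviour(self) -> str:
        """Describe what the node being fenced is specified to do."""

    @abstractmethod
    def initiate(self, node: int) -> bool:
        """Run on a quorum node; return True once `node` counts as fenced."""


class SelfFencing(FencingMethod):
    """Target fences itself; the initiator just waits long enough."""

    def __init__(self, wait_heartbeats, heartbeat_period_s, sleep=time.sleep):
        self.wait_heartbeats = wait_heartbeats
        self.heartbeat_period_s = heartbeat_period_s
        self._sleep = sleep  # injectable so the wait can be replaced in tests

    def target_behaviour(self) -> str:
        return "panic when quorum is lost"

    def initiate(self, node: int) -> bool:
        # Wait some number of heartbeats, then report success.
        self._sleep(self.wait_heartbeats * self.heartbeat_period_s)
        return True


# Hypothetical registry so that other fencing methods can be plugged in.
FENCING_METHODS = {"self": SelfFencing}
```

A storage-side method would be a second FencingMethod subclass
registered under another name, with an initiate() that actively cuts
the node off instead of waiting.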

OCFS2 only implements one degenerate form of self-fencing target, and no methods
of initiation.  This needs to be fixed.  I am preparing a specific proposal for
a better fencing harness for OCFS2.  Since it is too long to write in the margin
of this email, I will send it to the list next week in its own email.

Regards,

Daniel


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-20  2:00         ` Daniel Phillips
@ 2006-05-22  7:42           ` Sum Sha
  0 siblings, 0 replies; 9+ messages in thread
From: Sum Sha @ 2006-05-22  7:42 UTC (permalink / raw)
  To: ocfs2-devel

Thanks for this information. I will wait for your proposal and assume
that, for now, OCFS2 fencing works as I described in my previous
mails.

Thanks.
Sumsha.

* [Ocfs2-devel] Fencing in OCFS2
  2006-05-17 14:20 [Ocfs2-devel] Fencing in OCFS2 Sum Sha
  2006-05-18 14:46 ` Sum Sha
@ 2006-05-25 21:35 ` Daniel Phillips
  2006-05-26 17:23   ` Daniel Phillips
  1 sibling, 1 reply; 9+ messages in thread
From: Daniel Phillips @ 2006-05-25 21:35 UTC (permalink / raw)
  To: ocfs2-devel

Sum Sha wrote:
> --------------
> Q05     How long does the quorum process take?
> A05     First a node will realize that it doesn't have connectivity with
>         another node.  This can happen immediately if the connection is closed
>         but can take a maximum of 10 seconds of idle time.  Then the node
>         must wait long enough to give heartbeating a chance to declare the
>         node dead.  It does this by waiting two iterations longer than
>         the number of iterations needed to consider a node dead (see Q03 in
>         the Heartbeat section of this FAQ).  The current default of 7
>         iterations of 2 seconds results in waiting for 9 iterations or 18
>         seconds.  By default, then, a maximum of 28 seconds can pass from the
>         time a network fault occurs until a node fences itself.
> --------------
> 
> I don't understand why are we giving heartbeating extra 2 iterations
> to declare a node dead in case of split brain? What I think is, if we
> are already missing disk heartbeat for a node, then its missed
> heartbeat counter has already been started and we would declare that
> node dead after 7 iterations. How do we include these extra 2
> iterations?

While working on the fencing harness RFC I realized why that extra wait
is necessary.  Heartbeat will continue pinging a node some number of
periods even while it receives no responses from the node.  The trouble
is, the remote node may be receiving the pings and answering them, but
the answers are getting lost somewhere along the route back.  So the
remote node does not yet know it is incommunicado.  Then heartbeat
gives up and stops pinging.  It is only at this point that the
remote node is sure to start its watchdog count.

Given:

   A = number of missed answers before heartbeat stops pinging
   B = number of missed pings before watchdog triggers
   H = heartbeat period
   L = maximum network latency within some confidence factor
   W = maximum latency between watchdog trigger and shutdown

the time to declare a node dead is:

   H(A + B) + 2L + W

so with:

   A = 2
   B = 2
   H = 2 seconds
   L = .5 seconds
   W = 10 seconds

we have:

   8 + 1 + 10 = 19 seconds

Network latency includes the maximum time to notice a ping and respond to
it, and the time required for heartbeat to notice the answer.  There is no
need to incorporate a safety factor because allowing more than one missed
ping is already a safety factor.

Did I miss anything in my bookkeeping?  I did not check to see if OCFS2's
heartbeat obeys this formula.

Unfortunately, it is difficult to establish dependable bounds for network
latency, so heartbeating is really a game of probabilities.  We should set
the safety factor high enough so that false positives do not cost more
downtime than would be saved by shorter timeouts.

Now, if we use a storage-side fencing method instead of a watchdog we can
set B and W to zero, giving 5 seconds using the example above.  This is
three times better and shows why we need a proper fencing harness sooner
rather than later.
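
The two figures above can be reproduced with a short Python sketch. It
reads the formula as H(A + B) + 2L + W, which is what the worked
arithmetic (8 + 1 + 10) uses; the function name is hypothetical and
the values are the example values from this message, not measured
OCFS2 defaults.

```python
def time_to_declare_dead(a, b, h, l, w):
    """Seconds until a node can safely be considered dead.

    a: missed answers before heartbeat stops pinging
    b: missed pings before the watchdog triggers
    h: heartbeat period in seconds
    l: maximum network latency in seconds
    w: maximum latency between watchdog trigger and shutdown, in seconds
    """
    return h * (a + b) + 2 * l + w


# Watchdog (self-fencing) case with the example values.
watchdog_wait = time_to_declare_dead(a=2, b=2, h=2, l=0.5, w=10)

# Storage-side fencing: no watchdog, so B and W drop to zero.
storage_wait = time_to_declare_dead(a=2, b=0, h=2, l=0.5, w=0)
```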

Regards,

Daniel


* [Ocfs2-devel] Fencing in OCFS2
  2006-05-25 21:35 ` Daniel Phillips
@ 2006-05-26 17:23   ` Daniel Phillips
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Phillips @ 2006-05-26 17:23 UTC (permalink / raw)
  To: ocfs2-devel

Daniel Phillips wrote:
> so with:
> 
>    A = 2
>    B = 2
>    H = 2 seconds
>    L = .5 seconds
>    W = 10 seconds
> 
> we have:
> 
>    8 + 1 + 10 = 19 seconds

Oops, sorry, I should not have set W (maximum latency between watchdog
trigger and shutdown) as high as 10, since the panic shuts down
interrupts much faster than that, which in theory stops any disk or
network hardware from transmitting.  This should just take a few ms,
however there may be write traffic in flight, so we need to set W to
a second or two.  This gets our "safe" watchdog wait down to 10 seconds
or so, which is still twice as bad as the 5 seconds we get with storage
side fencing.

Regards,

Daniel
