* [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
@ 2004-09-24 14:29 Lars Ellenberg
  2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
  2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
  0 siblings, 2 replies; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-24 14:29 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

some of this applies to replicated resources in general,
so Andrew may have some ideas to generalize it...

the source of it is a pod'ed perl script that I am trying to tweak into
calculating all the possible transitions for me (and then filtering out
the relevant ones ...)

=======

DRBD cluster states and state transitions
    We want to consolidate all DRBD state changes and "recovery strategies"
    into one prominent and obvious place, something like a state machine.
    This is necessary to serialize state changes properly, and to make
    error recovery maintainable.

    Given this background, this script generates a set of cluster states
    (and transitions later) of two DRBD peers from the point of view of an
    all knowing higher level intelligence (cluster manager/operator).

    We (as DRBD) should actually only be concerned about the single node
    state transitions, but the CRM (wink to Andrew) may want to twist its
    brain with the two node states to think about what can happen with
    replicated resources...

    This overview can and should be improved, so that we provably cover
    all corner cases and the recovery strategies are as good as they can
    be.

    Currently this covers only the states, and outlines the transitions. It
    should help to define the actions to be taken on every possible "input"
    to the DRBD internal "state machine".

    Please think about it and give us feedback, especially about whether
    the set of states is complete. We do not want to miss a single corner
    case.

    Thank you very much.

    Lars Ellenberg

  Node states
    Each node has several attributes, which may change more or less
    independently. A node can be

    *   up or down

    *   with backing storage or diskless

        We need to distinguish between data storage and meta-data storage.
        If we don't have meta-data storage, the node may as well be down
        (and in the event of losing the md storage, it should take
        appropriate emergency action and commit suicide).

        Obviously a diskless node cannot take part in synchronization.

    *   Active or Standby

        Promoting an unconnected diskless non-active node is not possible.

        Promoting a connected diskless non-active node should not be
        possible.

    *   target or source of synchronization, consistent, or inconsistent.

        Even though consistent, we might know or assume that we are
        outdated.

    *   connected or unconnected.

        Obviously an unconnected node cannot take part in synchronization.

    Some of the attributes depend on others, and the information about the
    node status could be easily encoded in one single letter.

    But since HA is all about redundancy, we will encode the node status
    redundantly in *four* letters, to make it more obvious to human readers.

     _        down,
     S        up, standby (non-active, but ready to become active)
     s        up, not-active, but target of sync
     i        up, not-active, unconnected, inconsistent
     o        up, not-active, unconnected, outdated
     d        up, not-active, diskless
     A        up, active
     a        up, active, but target of sync
     b        up, blocking, because unconnected active and inconsistent
                            (no valid data available)
     B        up, blocking, because unconnected active and diskless
                            (no valid data available)
     D        up, active, but diskless (implies connection to good data)
      M       meta-data storage available
      _       meta-data storage unavailable
       *      backing storage available
       o      backing storage consistent but outdated
              (refuses to become active)
       i      backing storage inconsistent (unfinished sync)
       _      diskless 
        :     unconnected, stand alone
        ?     unconnected, looking for peer
        -     connected
        >     connected, sync source
        <     connected, sync target

    Note however that "S" does NOT necessarily mean it has up-to-date data,
    only that it thinks its data is consistent, it was not explicitly told
    that it is outdated, and it had no reason to assume so! E.g. directly
    after boot, if it was not yet connected, the data may well be
    consistent, but outdated. But since this is information not directly
    available to the node, let alone DRBD, this is difficult to map in here.
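    Read mechanically, the four-letter encoding splits into four independent
    columns. A minimal decoding sketch in Python (the mapping names are mine,
    for illustration only, not DRBD identifiers):

```python
# Decode the four-letter node status described above.
# The descriptions are paraphrased from the legend; names are illustrative.

ROLE = {'_': 'down', 'S': 'standby', 's': 'standby/sync target',
        'i': 'standby/inconsistent', 'o': 'standby/outdated',
        'd': 'standby/diskless', 'A': 'active', 'a': 'active/sync target',
        'b': 'active/blocking (inconsistent)',
        'B': 'active/blocking (diskless)', 'D': 'active/diskless'}
META = {'M': 'md available', '_': 'md unavailable'}
DISK = {'*': 'consistent', 'o': 'outdated', 'i': 'inconsistent',
        '_': 'diskless'}
LINK = {':': 'stand alone', '?': 'looking for peer', '-': 'connected',
        '>': 'sync source', '<': 'sync target'}

def decode(state):
    """Split a node state such as 'AM*-' into its four attributes."""
    role, meta, disk, link = state
    return ROLE[role], META[meta], DISK[disk], LINK[link]
```

    For example, decode('aM*<') yields ('active/sync target', 'md available',
    'consistent', 'sync target'), showing the deliberate redundancy of the
    encoding.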

    Since everything else is irrelevant if the node is down, since
    synchronisation implies backing storage, since we refuse to do anything
    without meta-data storage, and since some of the states resolve
    immediately (e.g. outdated => sync target upon connect), we now have 24
    distinguishable host states.

     ___:
     AM*- DM_- AM*> aM*< AM*?  AM*: bM*?  bM*: BM_?  BM_:
     SM*- dM_- SM*> sM*< SM*?  SM*: iM*?  iM*: dM_?  dM_:
          oM*-                      oM*?  oM*:

    This is our starting point, so please double check. Did I miss
    something?
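    As a mechanical cross-check of the 24-state claim, the list above can be
    spelled out and tested for internal consistency. A sketch (the helper and
    its rules are my own simplification, not DRBD code):

```python
# The 24 node states listed above, spelled out for cross-checking.

HOST_STATES = """
___:
AM*- DM_- AM*> aM*< AM*? AM*: bM*? bM*: BM_? BM_:
SM*- dM_- SM*> sM*< SM*? SM*: iM*? iM*: dM_? dM_:
     oM*-                oM*? oM*:
""".split()

def consistent(state):
    """Rough sanity rules derived from the text above."""
    role, meta, disk, link = state
    if role == '_':                  # down: everything else is irrelevant
        return state == '___:'
    if role in 'dDB':                # diskless roles imply diskless storage
        return disk == '_'
    if role in 'sa':                 # sync targets must be connected targets
        return link == '<'
    return disk != '_'               # everyone else has backing storage

assert len(HOST_STATES) == len(set(HOST_STATES)) == 24
assert all(consistent(s) for s in HOST_STATES)
```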

    Because a non-active unconnected diskless node can just as well be
    down, to simplify we *could* introduce this equivalence, which reduces
    the number of cluster states: dM_[:?] => __:

    We *could* merge both unconnected states into one for the purpose of
    describing and testing it. This needs some thought. It would reduce
    the number of possible node states by 5 or 6, respectively, and the
    resulting cluster states by a considerable factor. (..)[:?] => $1:

  Cluster States
    For the cluster:

     left-node -- network -- right-node

    the list of all possible pairwise combinations of these states needs to
    be filtered: combining a connected left state with an unconnected right
    state does not give a valid cluster state.

    Since connected states with more than one active node are currently
    still invalid, too, and will be immediately disconnected, we don't
    mention these, either. See also the note about the "split brain" cluster
    states below.

    States where both nodes are looking for the peer (should) resolve
    automatically into some connected mode "immediately" (unless the
    network is broken).

    Since we assume an "all knowing" CM, the state of the network link is
    stated explicitly: "-" for ok, "%" for broken.

    For the purpose of describing and testing it we may choose to merge :%:
    and :-: into :_:, because if neither node tries to connect, the link
    status is irrelevant.

    We leave out states whose symmetric mirror images are already
    listed.
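    The pairing and filtering described above can be sketched as follows
    (a simplification of mine of what the script has to do; the
    connectedness test keys off the fourth letter of each node state):

```python
from itertools import product

CONNECTED = '-><'   # fourth-letter codes that imply a connection

def valid_cluster_state(left, link, right):
    """left/right are four-letter node states, link is '-' (ok) or '%' (broken)."""
    l, r = left[3], right[3]
    if l in CONNECTED or r in CONNECTED:
        # both ends must be connected, over a working link
        if link != '-' or l not in CONNECTED or r not in CONNECTED:
            return False
        # a sync source must face a sync target (and vice versa)
        if ('>' in (l, r)) != ('<' in (l, r)):
            return False
        # connected double-active states are currently still invalid
        return not (left[0] in 'AaD' and right[0] in 'AaD')
    return True      # both unconnected: any link state is possible

def cluster_states(node_states):
    """Pair every node state with every node state and link state, filtered."""
    return [(a, net, b)
            for a, net, b in product(node_states, '-%', node_states)
            if valid_cluster_state(a, net, b)]
```

    The symmetry reduction (dropping mirror images) is left out here for
    brevity; it would halve the resulting list.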

  Classify
    These states can be classified as sane "[OK]", degraded "[deg]", not
    operational "{bad}", and fatal "[BAD]".

    A "[deg]" state is still operational. This means that applications can
    run and client requests are satisfied. But they are only one failure
    apart from being rendered non-operational, so you still should *run*
    and fix it...

    If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
    some of the "{bad}" states may find a transition to an operational state,
    though most likely only to some "{deg}" one. For example if the network
    comes back, or the cluster manager promotes a currently non-active node
    to be active.

    If "[BAD]" modes ever do occur, intervention of a higher level
    intelligence (cluster manager/operator) is necessary to restore an
    operational state.

    There is one additional class: the "BRAIN" class, which typically can
    only occur in and after split brain situations, and can never occur with
    an "all knowing" cluster manager, so these are special cases here.

    Note that some of these (the "double A" states) may eventually become
    legal, when we start to support [Open]GFS or other shared access modes;
    others will then just be "[BAD]" or even only "{bad}".

    The outcome is

    225 states: [OK]: AM*---*MS SM*---*MS

     {deg}: AM*---_Md   DM_---*MS   AM*>->*Ms   aM*<-<*MS   AM*?-:*Mo
            AM*?%:*Mo   AM*?%?*Mo   AM*?%?*MS   AM*?-:*MS   AM*?%:*MS
            AM*?%?*Mi   AM*?-:*Mi   AM*?%:*Mi   AM*?%?_Md   AM*?-:_Md
            AM*?%:_Md   AM*?-:___   AM*?%:___   AM*:-:*Mo   AM*:%:*Mo
            AM*:-?*Mo   AM*:%?*Mo   AM*:-?*MS   AM*:%?*MS   AM*:-:*MS
            AM*:%:*MS   AM*:-?*Mi   AM*:%?*Mi   AM*:-:*Mi   AM*:%:*Mi
            AM*:-?_Md   AM*:%?_Md   AM*:-:_Md   AM*:%:_Md   AM*:-:___
            AM*:%:___   SM*>->*Ms

     {bad}: bM*?%?*MS   bM*?-:*MS   bM*?%:*MS   bM*:-?*MS   bM*:%?*MS
            bM*:-:*MS   bM*:%:*MS   BM_?%?*MS   BM_?-:*MS   BM_?%:*MS
            BM_:-?*MS   BM_:%?*MS   BM_:-:*MS   BM_:%:*MS   SM*---_Md
            oM*:-?*MS   oM*:%?*MS   oM*:-:*MS   oM*:%:*MS   oM*?%?*MS
            oM*?-:*MS   oM*?%:*MS   SM*?%?*MS   SM*?-:*MS   SM*?%:*MS
            SM*?%?*Mi   SM*?-:*Mi   SM*?%:*Mi   SM*?%?_Md   SM*?-:_Md
            SM*?%:_Md   SM*?-:___   SM*?%:___   SM*:-:*MS   SM*:%:*MS
            SM*:-?*Mi   SM*:%?*Mi   SM*:-:*Mi   SM*:%:*Mi   SM*:-?_Md
            SM*:%?_Md   SM*:-:_Md   SM*:%:_Md   SM*:-:___   SM*:%:___

     [BAD]: DM_---*Mo   DM_---_Md   bM*?-:*Mo   bM*?%:*Mo   bM*?%?*Mo
            bM*?%?*Mi   bM*?-:*Mi   bM*?%:*Mi   bM*?%?_Md   bM*?-:_Md
            bM*?%:_Md   bM*?-:___   bM*?%:___   bM*:-:*Mo   bM*:%:*Mo
            bM*:-?*Mo   bM*:%?*Mo   bM*:-?*Mi   bM*:%?*Mi   bM*:-:*Mi
            bM*:%:*Mi   bM*:-?_Md   bM*:%?_Md   bM*:-:_Md   bM*:%:_Md
            bM*:-:___   bM*:%:___   BM_?-:*Mo   BM_?%:*Mo   BM_?%?*Mo
            BM_?%?*Mi   BM_?-:*Mi   BM_?%:*Mi   BM_?%?_Md   BM_?-:_Md
            BM_?%:_Md   BM_?-:___   BM_?%:___   BM_:-:*Mo   BM_:%:*Mo
            BM_:-?*Mo   BM_:%?*Mo   BM_:-?*Mi   BM_:%?*Mi   BM_:-:*Mi
            BM_:%:*Mi   BM_:-?_Md   BM_:%?_Md   BM_:-:_Md   BM_:%:_Md
            BM_:-:___   BM_:%:___   oM*---*Mo   oM*---_Md   oM*:-:*Mo
            oM*:%:*Mo   oM*:-?*Mo   oM*:%?*Mo   oM*:-?*Mi   oM*:%?*Mi
            oM*:-:*Mi   oM*:%:*Mi   oM*:-?_Md   oM*:%?_Md   oM*:-:_Md
            oM*:%:_Md   oM*:-:___   oM*:%:___   oM*?%?*Mo   oM*?%?*Mi
            oM*?-:*Mi   oM*?%:*Mi   oM*?%?_Md   oM*?-:_Md   oM*?%:_Md
            oM*?-:___   oM*?%:___   dM_---_Md   iM*?%?*Mi   iM*?-:*Mi
            iM*?%:*Mi   iM*?%?_Md   iM*?-:_Md   iM*?%:_Md   iM*?-:___
            iM*?%:___   iM*:-:*Mi   iM*:%:*Mi   iM*:-?_Md   iM*:%?_Md
            iM*:-:_Md   iM*:%:_Md   iM*:-:___   iM*:%:___   dM_?%?_Md
            dM_?-:_Md   dM_?%:_Md   dM_?-:___   dM_?%:___   dM_:-:_Md
            dM_:%:_Md   dM_:-:___   dM_:%:___   ___:-:___   ___:%:___

     BRAIN: AM*?%?*MA   AM*?-:*MA   AM*?%:*MA   AM*?%?*Mb   AM*?-:*Mb
            AM*?%:*Mb   AM*?%?_MB   AM*?-:_MB   AM*?%:_MB   AM*:-:*MA
            AM*:%:*MA   AM*:-?*Mb   AM*:%?*Mb   AM*:-:*Mb   AM*:%:*Mb
            AM*:-?_MB   AM*:%?_MB   AM*:-:_MB   AM*:%:_MB   bM*?%?*Mb
            bM*?-:*Mb   bM*?%:*Mb   bM*?%?_MB   bM*?-:_MB   bM*?%:_MB
            bM*:-:*Mb   bM*:%:*Mb   bM*:-?_MB   bM*:%?_MB   bM*:-:_MB
            bM*:%:_MB   BM_?%?_MB   BM_?-:_MB   BM_?%:_MB   BM_:-:_MB
            BM_:%:_MB

  Possible Transitions
    Now, by comparing each state with every other state, and finding all
    pairs which differ in exactly one "attribute", we have all possible
    state transitions.
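    The "differs in exactly one attribute" rule is mechanical. A sketch
    (helper names are mine):

```python
def differs_in_one(a, b):
    """True if two equal-length state strings differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def transitions(states):
    """All unordered pairs of states that are one attribute change apart."""
    return [(a, b) for i, a in enumerate(states)
                   for b in states[i + 1:] if differs_in_one(a, b)]
```

    For node states this compares the four letters; for cluster states the
    same comparison runs over the nine-character left-link-right strings.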

    We ignore certain node state transitions which are refused by drbd.
    Allowed node state transition "inputs" or "reactions" are

    *   up or down the node

    *   add/remove the disk (by administrative request or in response to io
        error)

        if it was the last accessible good data, should this result in
        suicide, or block all further io, or just fail all further io?

        if this lost the meta-data storage at the same time (meta-data
        internal), do we handle this differently?

    *   fail meta-data storage

        should result in suicide.

    *   establish or lose the connection; quit/start retrying to establish a
        connection.

    *   promote to active / demote to non-active

        To promote an unconnected inconsistent non-active node you need
        brute force. Similar if it thinks it is outdated.

        Promoting an unconnected diskless node is not possible. But those
        should have been mapped to a "down" node, anyways.

    *   start/finish synchronization

        One must not request a running and up-to-date active node to become
        target of synchronization.

    *   block/unblock all io requests

        This is in response to drbdadm suspend/resume, or a result of an
        "execption handler".

    *   commit suicide

        This is our last resort emergency handler. It should not be
        implemented as "panic", though currently it is.

    Again, this is important, please double check: Did I miss something?
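    The promote/demote rules above can be sketched as transformations on the
    four-letter node encoding (my own simplification; None stands for a
    refused request, and the brute-force override is deliberately left out):

```python
def promote(state):
    """Promote to active. Only a consistent standby node ('S') is promoted
    without brute force; inconsistent ('i'), outdated ('o') and diskless
    ('d') nodes are refused, as described above."""
    role, rest = state[0], state[1:]
    return 'A' + rest if role == 'S' else None

def demote(state):
    """Demote to non-active; only a plain active node ('A') demotes cleanly."""
    role, rest = state[0], state[1:]
    return 'S' + rest if role == 'A' else None
```

    So promote('SM*-') gives 'AM*-', while promote('oM*?') is refused, as
    an outdated node must not become active on its own.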

    Because the fatal "[BAD]" and "BRAIN" states can only be resolved by the
    operator, for these we consider only transitions to a non-fatal state.
    Connected fatal states will immediately be disconnected.

    Transitions are consequences of certain events. An event can be an
    operator/cluster manager Request, a Failure, or a self-healing (of the
    network link, for example).

    While simulating the events, we will at any time modify exactly one node
    attribute of one node, or the status of the network link.

    The "establish connection" event is special in that we cannot simulate
    it: this is a DRBD-internal event. And from only looking at the cluster
    state before this event, we cannot directly know what cluster state will
    result, unless we want to add the "up-to-date-ness" of the data as
    additional node attribute...

    So as soon as the connection between DRBD-peers is established, they
    will auto-resolve to some other state.

======

what should follow are the relevant state transitions...
I am still not satisfied with the output of my script, though.


* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
  2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies Lars Ellenberg
@ 2004-09-24 21:11 ` Lars Marowsky-Bree
  2004-09-24 23:04   ` Lars Ellenberg
  2004-09-26 18:40   ` Andrew Beekhof
  2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
  1 sibling, 2 replies; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-24 21:11 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

On 2004-09-24T16:29:25,
   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

> some of this applies to replicated resources in general,
> so Andrew may have some ideas to generalize it...

I think the modelling we have so far (with the recent addendum) captures
this quite nicely for the time being. But of course, it'll help us to
verify this.

>     Some of the attributes depend on others, and the information about the
>     node status could be easily encoded in one single letter.
> 
>     But since HA is all about redundancy, we will encode the node status
>     redundantly in *four* letters, to make it more obvious to human readers.
> 
>      _        down,
>      S        up, standby (non-active, but ready to become active)
>      s        up, not-active, but target of sync
>      i        up, not-active, unconnected, inconsistent
>      o        up, not-active, unconnected, outdated
>      d        up, not-active, diskless
>      A        up, active
>      a        up, active, but target of sync
>      b        up, blocking, because unconnected active and inconsistent
>                             (no valid data available)
>      B        up, blocking, because unconnected active and diskless
>                             (no valid data available)
>      D        up, active, but diskless (implies connection to good data)
>       M       meta-data storage available
>       _       meta-data storage unavailable
>        *      backing storage available
>        o      backing storage consistent but outdated
>               (refuses to become active)
>        i      backing storage inconsistent (unfinished sync)
>        _      diskless 
>         :     unconnected, stand alone
>         ?     unconnected, looking for peer
>         -     connected
>         >     connected, sync source
>         <     connected, sync target

I'd structure this somewhat differently into the node states (Up, Down),
our assumption about the other node (up, down, fenced), Backing Store
states (available, outdated, inconsistent, unavailable), the connection
(up or down) and the relationship between the GCs (higher, lower,
equal).

(Whether we are syncing and in what direction seems to be a function of
that; the same goes for whether or not we are blocking.)

It's essentially the same as your list, but it seems to be more
accessible to me. But, it's late ;-)

>     Because non-active unconnected diskless node can as well be down, to
>     simplify, we *could* introduce this equivalence, which reduces some
>     cluster states: dM_[:?] => __:

Seems sane.

>     We *could* merge both unconnected states into one for the purpose of
>     describing and testing it. This needs some thought. It would reduce the
>     number of possible node states by 5 resp. 6, and the resulting cluster
>     states by a considerable factor. (..)[:?] => $1:

Are they really, though? Doesn't it affect the GC if I tell it to go
standalone as opposed to leaving it looking for its peer?

>   Classify
>     These states can be classified as sane "[OK]", degraded "[deg]", not
>     operational "{bad}", and fatal "[BAD]".

Makes sense, mostly, but...

>     A "[deg]" state is still operational. This means that applications can
>     run and client requests are satisfied. But they are only one failure
>     appart from being rendered non-operational, so you still should *run*
>     and fix it...
> 
>     If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
>     some of the "{bad}" states may find a transition to a operational state,
>     though most likely only to some "{deg}" one. For example if the network
>     comes back, or the cluster manager promotes a currently non-active node
>     to be active.

A bad state seems 'degenerate' to me. Are those two really distinct?
Self-healing would be an on-going sync or something like it.

>     The outcome is
> 
>     225 states: [OK]: AM*---*MS SM*---*MS

I see now why you backed away from my proposal of the state machine for
the testing back then and instead suggested the CTH ;-)

>   Possible Transitions
>     Now, by comparing each with every state, and finding all pairs which
>     differ in exactly one "attribute", we have all possible state
>     transitions.

These seem sane and complete.

Just some comments:

>     We ignore certain node state transitions which are refused by drbd.
>     Allowed node state transition "inputs" or "reactions" are
> 
>     *   up or down the node
> 
>     *   add/remove the disk (by administrative request or in response to io
>         error)
> 
>         if it was the last accessible good data, should this result in
>         suicide, or block all further io, or just fail all further io?
> 
>         if this lost the meta-data storage at the same time (meta-data
>         internal), do we handle this differently?
> 
>     *   fail meta-data storage
> 
>         should result in suicide.

In fact, even for meta-data loss, we can switch to detached mode and
increase some version counters on the other side, and still do a smooth
transition. We can no longer touch the local disk, which is bad, but we
also can't make it worse.

This will work as long as we don't explicitly have to _set_ a
dirty/outdated bit, but instead explicitly clear it when we shut down
smoothly.

I don't see any difference here between meta-data and backing store
loss, actually, that complicates things unnecessarily.

>     *   establish or lose the connection; quit/start retrying to establish a
>         connection.
> 
>     *   promote to active / demote to non-active
> 
>         To promote an unconnected inconsistent non-active node you need
>         brute force. Similar if it thinks it is outdated.
> 
>         Promoting an unconnected diskless node is not possible. But those
>         should have been mapped to a "down" node, anyways.
> 
>     *   start/finish synchronization
> 
>         One must not request a running and up-to-date active node to become
>         target of synchronization.
> 
>     *   block/unblock all io requests
> 
>         This is in response to drbdadm suspend/resume, or a result of an
>         "execption handler".
> 
>     *   commit suicide
> 
>         This is our last resort emergency handler. It should not be
>         implemented as "panic", though currently it is.
> 


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company



* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
  2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
@ 2004-09-24 23:04   ` Lars Ellenberg
  2004-09-25  8:54     ` Lars Marowsky-Bree
  2004-09-26 18:40   ` Andrew Beekhof
  1 sibling, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-24 23:04 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

/ 2004-09-24 23:11:33 +0200
\ Lars Marowsky-Bree:
> On 2004-09-24T16:29:25,
>    Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> 
> > some of this applies to replicated resources in general,
> > so Andrew may have some ideas to generalize it...
> 
> I think the modelling we have so far (with the recent addendum) captures
> this quite nicely for the time being. But of course, it'll help us to
> verify this.
> 
> >     Some of the attributes depend on others, and the information about the
> >     node status could be easily encoded in one single letter.
> > 
> >     But since HA is all about redundancy, we will encode the node status
> >     redundantly in *four* letters, to make it more obvious to human readers.
> > 
> >      _        down,
> >      S        up, standby (non-active, but ready to become active)
> >      s        up, not-active, but target of sync
> >      i        up, not-active, unconnected, inconsistent
> >      o        up, not-active, unconnected, outdated
> >      d        up, not-active, diskless
> >      A        up, active
> >      a        up, active, but target of sync
> >      b        up, blocking, because unconnected active and inconsistent
> >                             (no valid data available)
> >      B        up, blocking, because unconnected active and diskless
> >                             (no valid data available)
> >      D        up, active, but diskless (implies connection to good data)
> >       M       meta-data storage available
> >       _       meta-data storage unavailable
> >        *      backing storage available
> >        o      backing storage consistent but outdated
> >               (refuses to become active)
> >        i      backing storage inconsistent (unfinished sync)
> >        _      diskless 
> >         :     unconnected, stand alone
> >         ?     unconnected, looking for peer
> >         -     connected
> >         >     connected, sync source
> >         <     connected, sync target
> 
> I'd structure this somewhat differently into the node states (Up, Down),
> our assumption about the other node (up, down, fenced), Backing Store
> states (available, outdated, inconsistent, unavailable), the connection
> (up or down) and the relationship between the GCs (higher, lower,
> equal).
> 
> (Whether we are syncing and in what direction seems to be a function of
> that, same whether or not we are blocking or not.)
> 
> It's essentially the same as your list, but it seems to be more
> accessible to me. But, it's late ;-)

well, it is really the same, I guess.

but I'll try to write pseudo code for the state tuples that should make
it clear, and post that here, before I go implement it.

> >   Classify
> >     These states can be classified as sane "[OK]", degraded "[deg]", not
> >     operational "{bad}", and fatal "[BAD]".
> 
> Makes sense, mostly, but...
> 
> >     A "[deg]" state is still operational. This means that applications can
> >     run and client requests are satisfied. But they are only one failure
> >     appart from being rendered non-operational, so you still should *run*
> >     and fix it...
> > 
> >     If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
> >     some of the "{bad}" states may find a transition to a operational state,
> >     though most likely only to some "{deg}" one. For example if the network
> >     comes back, or the cluster manager promotes a currently non-active node
> >     to be active.
> 
> A bad state seems 'degenerate' to me. Are those two really distinct?
> Self-healing would be an on-going sync or something like it.

well, the difference is: "degenerate" is bad,
but I still have access to good data (which is what I mean by operational)!

> >     225 states: [OK]: AM*---*MS SM*---*MS
> 
> I see know why you backed away from my proposal of the state machine for
> the testing back then and instead suggested the CTH ;-)

at that point of time we did not know yet about "outdated",
and it was "only" 171 states iirc ...

> >     We ignore certain node state transitions which are refused by drbd.
> >     Allowed node state transition "inputs" or "reactions" are
> > 
> >     *   up or down the node
> > 
> >     *   add/remove the disk (by administrative request or in response to io
> >         error)
> > 
> >         if it was the last accessible good data, should this result in
> >         suicide, or block all further io, or just fail all further io?
> > 
> >         if this lost the meta-data storage at the same time (meta-data
> >         internal), do we handle this differently?
> > 
> >     *   fail meta-data storage
> > 
> >         should result in suicide.
> 
> In fact, even for meta-data loss, we can switch to detached mode and
> increase some version counters on the other side, and still do a smooth
> transition. We can't any longer touch the local disk, which is bad, but
> we also can't make it worse.
> 
> This will work as long as we don't explicitly have to _set_ a
> dirty/outdated bit, but if we explicitly clear it instead when we
> smoothly shutdown.
> 
> I don't see any difference here between meta-data and backing store
> loss, actually, that complicates things unnecessarily.

well, DRBD needs to make a difference, because the meta-data storage
and the data storage may be physically different devices, and can
therefore fail independently. (ok, single blocks can fail independently
on the same physical storage, too, but that is another matter)

but yes, meta-data loss is not per definition catastrophic...

	lge


* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
  2004-09-24 23:04   ` Lars Ellenberg
@ 2004-09-25  8:54     ` Lars Marowsky-Bree
  2004-09-25  9:50       ` Lars Ellenberg
  0 siblings, 1 reply; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-25  8:54 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

On 2004-09-25T01:04:57,
   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

> > I don't see any difference here between meta-data and backing store
> > loss, actually, that complicates things unnecessarily.
> 
> well, DRBD needs to make a difference, because they meta-data storage
> and data storage may be physically different devices, and therefore can
> fail independently. (ok, single blocks can fail on the same physical
> storage independently, too, but this is an other thing)

The point I was trying to make is that meta-data loss and backing
storage loss can essentially be mapped to a generic local IO failure.

The special case where we only lose access to the backing store and not
to the meta-data allows us to set a flag there (for whatever use it may
be the next time we compare GCs), but then it amounts to the same: loss
of the local storage.

I don't see any benefit in keeping the two as distinct failure modes...


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company



* Re: [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
  2004-09-25  8:54     ` Lars Marowsky-Bree
@ 2004-09-25  9:50       ` Lars Ellenberg
  2004-09-25  9:59         ` Lars Marowsky-Bree
  0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-25  9:50 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

/ 2004-09-25 10:54:28 +0200
\ Lars Marowsky-Bree:
> On 2004-09-25T01:04:57,
>    Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> 
> > > I don't see any difference here between meta-data and backing store
> > > loss, actually, that complicates things unnecessarily.
> > 
> > well, DRBD needs to make a difference, because they meta-data storage
> > and data storage may be physically different devices, and therefore can
> > fail independently. (ok, single blocks can fail on the same physical
> > storage independently, too, but this is an other thing)
> 
> The point I was trying to make is that meta-data loss and backing
> storage loss can essentially be mapped to a generic local IO failure.
> 
> The special case where we only loss access to the backing store and not
> to the meta-data allows us to set a flag there (for whatever use it may
> be the next time we compare GCs), but then it amounts to the same: Loss
> of the local storage.
> 
> I don't see any benefit in keeping the two as distinct failure modes...

well. they are different events, and I must handle both of them.
there indeed is no benefit to it. but this is not about theoretically
describing it; in the end I want to code a state machine from it,
and know that it is complete.

in any case, if you look at the listed states, all of them have the "M"
on, so I already "simplified" in so far...

currently in drbd it IS indeed both mapped to "local io error",
then "trying" (without further action on failure) to set the
"inconsistent, need full sync" bit in the meta-data.

but currently in drbd recovery code is distributed in small pieces all
over the code, and I want to try to put it all into one place,
and be sure I deal with every possible corner case.

and for example if we are the only remaining node (we have no
connection), we may rather choose to continue, passing on IO errors if
they happen, than to "detach" the partially broken storage.

apart from the fact that the local storage should never fail,
and should possibly be some sort of raid itself...

	lge


* Re: [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
  2004-09-25  9:50       ` Lars Ellenberg
@ 2004-09-25  9:59         ` Lars Marowsky-Bree
  0 siblings, 0 replies; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-25  9:59 UTC (permalink / raw)
  To: drbd-dev, linux-ha-dev

On 2004-09-25T11:50:04,
   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

> well, they are different events, and I must handle both of them.
> there is indeed no benefit in it. but this is not about theoretically
> describing it; in the end I want to code a state machine from it,
> and know that it is complete.

Ah. Ok. Then we agree.

> apart from the fact that the local storage should never fail,
> and should possibly be some sort of raid itself...

Right! It'd all be best if nothing ever failed! ;-)


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company



* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
  2004-09-24 23:04   ` Lars Ellenberg
@ 2004-09-26 18:40   ` Andrew Beekhof
  2004-09-27 12:19     ` Lars Ellenberg
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Beekhof @ 2004-09-26 18:40 UTC (permalink / raw)
  To: High-Availability Linux Development List; +Cc: drbd-dev


On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:

> On 2004-09-24T16:29:25,
>    Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>
>> some of this applies to replicated resources in general,
>> so Andrew may have some ideas to generalize it...
>
> I think the modelling we have so far (with the recent addendum) 
> captures
> this quite nicely for the time being. But of course, it'll help us to
> verify this.

right.  i'll also be thinking more about this in the coming days when I 
start on the incarnations.

>
>>     Some of the attributes depend on others, and the information 
>> about the
>>     node status could be easily encoded in one single letter.
>>
>>     But since HA is all about redundancy, we will encode the node 
>> status
>>     redundantly in *four* letters, to make it more obvious to human 
>> readers.
>>
>>      _        down,
>>      S        up, standby (non-active, but ready to become active)
>>      s        up, not-active, but target of sync
>>      i        up, not-active, unconnected, inconsistent
>>      o        up, not-active, unconnected, outdated
>>      d        up, not-active, diskless
>>      A        up, active
>>      a        up, active, but target of sync
>>      b        up, blocking, because unconnected active and 
>> inconsistent
>>                             (no valid data available)
>>      B        up, blocking, because unconnected active and diskless
>>                             (no valid data available)
>>      D        up, active, but diskless (implies connection to good 
>> data)
>>       M       meta-data storage available
>>       _       meta-data storage unavailable
>>        *      backing storage available
>>        o      backing storage consistent but outdated
>>               (refuses to become active)
>>        i      backing storage inconsistent (unfinished sync)
>>        _      diskless
>>         :     unconnected, stand alone
>>         ?     unconnected, looking for peer
>>         -     connected
>>         >     connected, sync source
>>         <     connected, sync target
>
> I'd structure this somewhat differently into the node states (Up, 
> Down),
> our assumption about the other node (up, down, fenced), Backing Store
> states (available, outdated, inconsistent, unavailable), the connection
> (up or down) and the relationship between the GCs (higher, lower,
> equal).
>
> (Whether we are syncing and in what direction seems to be a function of
> that; the same goes for whether we are blocking or not.)
>
> It's essentially the same as your list, but it seems to be more
> accessible to me. But, it's late ;-)

I agree here... restricting the attributes to true inputs (rather than 
derived values) helps stop my brain going into meltdown trying to 
absorb the matrix :)  I could also imagine that it makes your life 
easier if you decide to change how you decide something like syncing 
direction later on.



-- 
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM



* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-26 18:40   ` Andrew Beekhof
@ 2004-09-27 12:19     ` Lars Ellenberg
  2004-09-27 12:38       ` Andrew Beekhof
  0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-27 12:19 UTC (permalink / raw)
  To: High-Availability Linux Development List, drbd-dev

/ 2004-09-26 20:40:36 +0200
\ Andrew Beekhof:
> 
> On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:
> 
> >On 2004-09-24T16:29:25,
> >   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> >
> >>some of this applies to replicated resources in general,
> >>so Andrew may have some ideas to generalize it...
> >
> >I think the modelling we have so far (with the recent addendum) 
> >captures
> >this quite nicely for the time being. But of course, it'll help us to
> >verify this.
> 
> right.  i'll also be thinking more about this in the coming days when I 
> start on the incarnations.
> 
> >
> >>    Some of the attributes depend on others, and the information 
> >>about the
> >>    node status could be easily encoded in one single letter.
> >>
> >>    But since HA is all about redundancy, we will encode the node 
> >>status
> >>    redundantly in *four* letters, to make it more obvious to human 
> >>readers.
> >>
> >>     _        down,
> >>     S        up, standby (non-active, but ready to become active)
> >>     s        up, not-active, but target of sync
> >>     i        up, not-active, unconnected, inconsistent
> >>     o        up, not-active, unconnected, outdated
> >>     d        up, not-active, diskless
> >>     A        up, active
> >>     a        up, active, but target of sync
> >>     b        up, blocking, because unconnected active and 
> >>inconsistent
> >>                            (no valid data available)
> >>     B        up, blocking, because unconnected active and diskless
> >>                            (no valid data available)
> >>     D        up, active, but diskless (implies connection to good 
> >>data)
> >>      M       meta-data storage available
> >>      _       meta-data storage unavailable
> >>       *      backing storage available
> >>       o      backing storage consistent but outdated
> >>              (refuses to become active)
> >>       i      backing storage inconsistent (unfinished sync)
> >>       _      diskless
> >>        :     unconnected, stand alone
> >>        ?     unconnected, looking for peer
> >>        -     connected
> >>        >     connected, sync source
> >>        <     connected, sync target
> >
> >I'd structure this somewhat differently into the node states (Up, 
> >Down),
> >our assumption about the other node (up, down, fenced), Backing Store
> >states (available, outdated, inconsistent, unavailable), the connection
> >(up or down) and the relationship between the GCs (higher, lower,
> >equal).
> >
> >(Whether we are syncing and in what direction seems to be a function of
> >that; the same goes for whether we are blocking or not.)
> >
> >It's essentially the same as your list, but it seems to be more
> >accessible to me. But, it's late ;-)
> 
> I agree here... restricting the attributes to true inputs (rather than 
> derived values) helps stop my brain going into meltdown trying to 
> absorb the matrix :)  I could also imagine that it makes your life 
> easier if you decide to change how you decide something like syncing 
> direction later on.

ok.
I'll try again, and maybe create a dot file while I'm at it :)

even though my "states" are not actually "derived".
but maybe only "obvious" to the "insider" (me).

so, for the time being, could you please have a look at
 http://wiki.trick.ca/linux-ha/DRBD/StateMachine

and maybe comment it.

	lge


* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-27 12:19     ` Lars Ellenberg
@ 2004-09-27 12:38       ` Andrew Beekhof
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Beekhof @ 2004-09-27 12:38 UTC (permalink / raw)
  To: High-Availability Linux Development List; +Cc: drbd-dev


On Sep 27, 2004, at 2:19 PM, Lars Ellenberg wrote:

> / 2004-09-26 20:40:36 +0200
> \ Andrew Beekhof:
>>
>> On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:
>>
>>> On 2004-09-24T16:29:25,
>>>   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>>>
>>>> some of this applies to replicated resources in general,
>>>> so Andrew may have some ideas to generalize it...
>>>
>>> I think the modelling we have so far (with the recent addendum)
>>> captures
>>> this quite nicely for the time being. But of course, it'll help us to
>>> verify this.
>>
>> right.  i'll also be thinking more about this in the coming days when I
>> start on the incarnations.
>>
>>>
>>>>    Some of the attributes depend on others, and the information
>>>> about the
>>>>    node status could be easily encoded in one single letter.
>>>>
>>>>    But since HA is all about redundancy, we will encode the node
>>>> status
>>>>    redundantly in *four* letters, to make it more obvious to human
>>>> readers.
>>>>
>>>>     _        down,
>>>>     S        up, standby (non-active, but ready to become active)
>>>>     s        up, not-active, but target of sync
>>>>     i        up, not-active, unconnected, inconsistent
>>>>     o        up, not-active, unconnected, outdated
>>>>     d        up, not-active, diskless
>>>>     A        up, active
>>>>     a        up, active, but target of sync
>>>>     b        up, blocking, because unconnected active and
>>>> inconsistent
>>>>                            (no valid data available)
>>>>     B        up, blocking, because unconnected active and diskless
>>>>                            (no valid data available)
>>>>     D        up, active, but diskless (implies connection to good
>>>> data)
>>>>      M       meta-data storage available
>>>>      _       meta-data storage unavailable
>>>>       *      backing storage available
>>>>       o      backing storage consistent but outdated
>>>>              (refuses to become active)
>>>>       i      backing storage inconsistent (unfinished sync)
>>>>       _      diskless
>>>>        :     unconnected, stand alone
>>>>        ?     unconnected, looking for peer
>>>>        -     connected
>>>>        >     connected, sync source
>>>>        <     connected, sync target
>>>
>>> I'd structure this somewhat differently into the node states (Up,
>>> Down),
>>> our assumption about the other node (up, down, fenced), Backing Store
>>> states (available, outdated, inconsistent, unavailable), the 
>>> connection
>>> (up or down) and the relationship between the GCs (higher, lower,
>>> equal).
>>>
>>> (Whether we are syncing and in what direction seems to be a function of
>>> that; the same goes for whether we are blocking or not.)
>>>
>>> It's essentially the same as your list, but it seems to be more
>>> accessible to me. But, it's late ;-)
>>
>> I agree here... restricting the attributes to true inputs (rather than
>> derived values) helps stop my brain going into meltdown trying to
>> absorb the matrix :)  I could also imagine that it makes your life
>> easier if you decide to change how you decide something like syncing
>> direction later on.
>
> ok.
> I'll try again, and maybe create a dot file while I'm at it :)

I think you'll find that it helps a lot (I know it helped me with the 
CRM's)

>
> even though my "states" are not actually "derived".
> but maybe only "obvious" to the "insider" (me).

I think it depends on the POV you take and how you break it up...
I would say "syncing" is a state, but the direction is derived, and who 
we and the other side are (master/slave/...) are also inputs.
Otherwise you need a state for every combination, and (at least in the 
CRM) I didn't find that gave me much (except for a really huge .dot 
file :)

>
> so, for the time being, could you please have a look at
>  http://wiki.trick.ca/linux-ha/DRBD/StateMachine
>
> and maybe comment it.

sure thing

>
> 	lge
-- 
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM



* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies Lars Ellenberg
  2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
@ 2004-09-27 14:52 ` Philipp Reisner
  2004-09-29 12:58   ` Philipp Reisner
  1 sibling, 1 reply; 13+ messages in thread
From: Philipp Reisner @ 2004-09-27 14:52 UTC (permalink / raw)
  To: drbd-dev

Am Freitag, 24. September 2004 16:29 schrieb Lars Ellenberg:

[...]

>     Currently this covers only the states, and outlines the transitions. It
>     should help to define the actions to be taken on every possible "input"
>     to the DRBD internal "state machine".
>

While reading through this giant e-mail I lost my confidence that it 
could be a good idea to have a "central" state switching function in 
DRBD, but of course I will see what this discussion gives...

We have a huge space of possible combinations of these attributes, but
a lot of those are impossible/invalid... etc. Currently these constraints
are expressed implicitly by the code ...

The question is, what is easier to read/understand/code/get right.

[...]
>
>     Allowed node state transition "inputs" or "reactions" are
>
>     *   up or down the node
>
>     *   add/remove the disk (by administrative request or in response to io
>         error)
>
>         if it was the last accessible good data, should this result in
>         suicide, or block all further io, or just fail all further io?
>
>         if this lost the meta-data storage at the same time (meta-data
>         internal), do we handle this differently?

I guess this is a question we cannot answer here for all of our users;
someone might want this, others that... etc. If it is a question 
you cannot answer, it probably needs to be configurable.

>     *   fail meta-data storage
>
>         should result in suicide.
>
>     *   establish or lose the connection; quit/start retrying to establish
> a connection.
>
>     *   promote to active / demote to non-active
>
>         To promote an unconnected inconsistent non-active node you need
>         brute force. Similar if it thinks it is outdated.
>
>         Promoting an unconnected diskless node is not possible. But those
>         should have been mapped to a "down" node, anyways.
>

Hmmm ? 

Just had a look at what we are currently doing. Probably we should 
drop the DISKLESS bit and replace it by an enum

dstate: inconsistent, 
        outdated  (known to be outdated -- happens via drbdadm outdate, or
                   when the data was consistent but the negotiation's outcome
                   was that this is old data and sync is Paused),
        consistent (this reflects the meta-data meaning of consistent, i.e.
                    might be outdated),
        na (=diskless), 
        uptodate

and display this in /proc/drbd "ld:"

>     *   start/finish synchronization
>
>         One must not request a running and up-to-date active node to become
>         target of synchronization.
>
>     *   block/unblock all io requests
>
>         This is in response to drbdadm suspend/resume, or a result of an
>         "exception handler".
>
>     *   commit suicide
>
>         This is our last resort emergency handler. It should not be
>         implemented as "panic", though currently it is.
>
>     Again, this is important, please double check: Did I miss something?
>

I think everything is there... (and reading it is quite inspiring)

-Philipp


* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
@ 2004-09-29 12:58   ` Philipp Reisner
  2004-09-29 17:07     ` Lars Ellenberg
  0 siblings, 1 reply; 13+ messages in thread
From: Philipp Reisner @ 2004-09-29 12:58 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

> [...]
> >     Currently this covers only the states, and outlines the transitions.
> > It should help to define the actions to be taken on every possible
> > "input" to the DRBD internal "state machine".
>
> While reading through this giant e-mail I lost my confidence that it
> could be a good idea to have a "central" state switching function in
> DRBD, but of course I will see what this discussion gives...

Thought about this a bit more... and came to the conclusion that it 
would be a good idea. What do you think of this skeleton -
pseudo code (it actually compiles).

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

[-- Attachment #2: new_st.c --]
[-- Type: text/x-csrc, Size: 2996 bytes --]

typedef struct {
	unsigned role    : 2 ;   // 3   primary/secondary/unknown
	unsigned peer    : 2 ;   // 3   primary/secondary/unknown
	unsigned conn    : 5 ;   // 17  cstates
	unsigned disk    : 3 ;   // 6   from Diskless to UpToDate
	unsigned multi_p : 1 ;   // 2   multiple primaries allowed.
} drbd_state_t;

typedef enum {
	Unconfigured,
	StandAlone,
	Unconnected,
	Timeout,
	BrokenPipe,
	NetworkFailure,
	WFConnection,
	WFReportParams, // we have a socket
	Connected,      // we have introduced each other
	SkippedSyncS,   // we should have synced, but user said no
	SkippedSyncT,
	WFBitMapS,
	WFBitMapT,
	SyncSource,     // The distance between original state and pause
	SyncTarget,     // state must be the same for source and target. (+2)
	PausedSyncS,    // see _drbd_rs_resume() and _drbd_rs_pause()
	PausedSyncT,    // is sync target, but higher priority groups first
} drbd_conns_t;

typedef enum {
	Diskless,
	Failed,         // Moves on to Diskless as soon as we reported it to the peer
	Inconsistent,
	Outdated,
	Consistent,     // Might be outdated, might be UpToDate ...
	UpToDate,
} drbd_disks_t;

typedef enum {
	Unknown=0,
	Primary=1,     // role
	Secondary=2,   // role
} drbd_role_t;


#include <stdio.h>

int panic();      // forward declaration, called from state_change()

drbd_state_t st;  // here for mdev->state;

void lock(){}
void unlock(){}

static int state_change(drbd_state_t ns, int hard)
{
	int rv;

	lock();

	if( ns.role == st.role &&
	    ns.peer == st.peer &&
	    ns.conn == st.conn &&
	    ns.disk == st.disk &&
	    ns.multi_p == st.multi_p ) { 
		rv = 1; 
		goto out; 
	}

	if(!hard) {

		// pre-state-change checks
		if(!ns.multi_p && 
		   ns.role == Primary && 
		   ns.peer == Primary) {
			rv = 0;
			goto out;
		}

		// ...

	}

	// State sanitising
	if ( ns.conn < Connected ) ns.peer = Unknown;

	st = ns;

	// post-state-change actions...
	if( ns.conn < Connected && 
	    ns.disk <= Inconsistent && 
	    ns.role == Primary ) {
		panic(); // Just for illustration
	}
	// ...

	// Probably it also makes sense to run a dynamic list of 
	// post-state-change-callbacks sitting on a post-state-change-hook.
	//
	// Could use this to implement the sync groups more sanely.
	// Could also replace the cstate_wait

	rv = 1;
 out:
	unlock();
	return rv;
}


int set_cstate(drbd_conns_t new, int hard)
{
	drbd_state_t ns = st;
	ns.conn = new;

	return state_change(ns,hard);
}

int set_dstate(drbd_disks_t new, int hard)
{
	drbd_state_t ns = st;
	ns.disk = new;

	return state_change(ns,hard);
}

int set_rstate(drbd_role_t new, int hard)
{
	drbd_state_t ns = st;
	ns.role = new;

	return state_change(ns,hard);
}

int set_pstate(drbd_role_t new, int hard)
{
	drbd_state_t ns = st;
	ns.peer = new;

	return state_change(ns,hard);
}

int panic()
{ 
	printf("PANIC\n");
	return 0;
}

int main()
{
	st = (drbd_state_t){ Secondary,Unknown,Unconfigured,UpToDate,0 };

	set_cstate(Connected,0);
	set_pstate(Primary,0);
	set_rstate(Primary,0);  // This one fails: peer is Primary and multi_p is off
	set_cstate(WFConnection,0);
	set_rstate(Primary,0);  // succeeds now: sanitising reset peer to Unknown
	set_dstate(Diskless,1); // causes panic...
	return 0;
}


* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-29 12:58   ` Philipp Reisner
@ 2004-09-29 17:07     ` Lars Ellenberg
  2004-10-06 11:55       ` Philipp Reisner
  0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-29 17:07 UTC (permalink / raw)
  To: drbd-dev

/ 2004-09-29 14:58:47 +0200
\ Philipp Reisner:
> > [...]
> > >     Currently this covers only the states, and outlines the transitions.
> > > It should help to define the actions to be taken on every possible
> > > "input" to the DRBD internal "state machine".
> >
> > While reading through this giant e-mail I lost my confidence that it
> > could be a good idea to have a "central" state switching function in
> > DRBD, but of course I will see what this discussion gives...
> 
> Thought about this a bit more... and came to the conclusion that it 
> would be a good idea. What do you think of this skeleton -
> pseudo code (it actually compiles).

I think it probably should come out more like a real state machine,
with a defined set of possible INPUTS,
a defined set of states (which should not have the same detail depth as
the actual drbd internal state set with all its different attributes),
a set of actions, and a defined state[INPUT] => action => newstate
matrix.

maybe that is overkill.
it may as well be that this turns out to be the easier and cleaner way.
I'm not yet sure. the back of my head is still busy and did not give me
a "completion event" on it yet ...

	lge


* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
  2004-09-29 17:07     ` Lars Ellenberg
@ 2004-10-06 11:55       ` Philipp Reisner
  0 siblings, 0 replies; 13+ messages in thread
From: Philipp Reisner @ 2004-10-06 11:55 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 2125 bytes --]

On Wednesday 29 September 2004 19:07, Lars Ellenberg wrote:
> / 2004-09-29 14:58:47 +0200
>
> \ Philipp Reisner:
> > > [...]
> > >
> > > >     Currently this covers only the states, and outlines the
> > > > transitions. It should help to define the actions to be taken on
> > > > every possible "input" to the DRBD internal "state machine".
> > >
> > > While reading through this giant e-mail I lost my confidence that it
> > > could be a good idea to have a "central" state switching function in
> > > DRBD, but of course I will see what this discussion gives...
> >
> > Thought about this a bit more... and came to the conclusion that it
> > would be a good idea. What do you think of this skeleton -
> > pseudo code (it actually compiles).
>
> I think it probably should come out more like a real state machine,
> with a defined set of possible INPUTS,
> a defined set of states (which should not have the same detail depth as
> the actual drbd internal state set with all its different attributes),
> a set of actions, and a defined state[INPUT] => action => newstate
> matrix.
>
> maybe that is overkill.

Currently I want to go the way that was outlined with the new_st.c
skeleton.

Regarding: only worker should do state changes.
  We have quite a lot of inputs that are asynchronous by their nature.
  E.g. Disk fails. It does not make any sense to synchronize an advance
  in the disk-state state-machine with anything.

While it makes a lot of sense to synchronize changes to the node
state machine. 

At first I drew a directed graph of the cstates we have in 
drbd-0.7 (see cstates-7.ps)

You will immediately realize that the differentiation between
Unconfigured and StandAllone is a leftover from drbd-0.6

Then I drew directed graphs of the "state machines" as I see
them for drbd-0.8

conn-states-8.ps, disk-states-8.ps, node-states-8.ps (has 2 pages)

PS: The program is graphviz

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

[-- Attachment #2: conn-states-8.dot --]
[-- Type: text/plain, Size: 833 bytes --]

digraph conn_states {
	StandAllone  -> WFConnection   [ label = "ioctl_set_net()" ]
	WFConnection -> Unconnected    [ label = "unable to bind()" ]
	WFConnection -> WFReportParams [ label = "in connect() after accept" ]
	WFReportParams -> StandAllone  [ label = "checks in receive_param()" ]
	WFReportParams -> Connected    [ label = "in receive_param()" ]
	WFReportParams -> WFBitMapS    [ label = "sync_handshake()" ]
	WFReportParams -> WFBitMapT    [ label = "sync_handshake()" ]
	WFBitMapS -> SyncSource        [ label = "receive_bitmap()" ]
	WFBitMapT -> SyncTarget        [ label = "receive_bitmap()" ]
	SyncSource -> Connected
	SyncTarget -> Connected
	SyncSource -> PausedSyncS
	SyncTarget -> PausedSyncT
	PausedSyncS -> SyncSource
	PausedSyncT -> SyncTarget
	Connected   -> WFConnection    [ label = "* on network error" ]
}

[-- Attachment #3: disk-states-8.dot --]
[-- Type: text/plain, Size: 913 bytes --]

digraph disk_states {
	Diskless -> Inconsistent       [ label = "ioctl_set_disk()" ]
	Diskless -> Consistent         [ label = "ioctl_set_disk()" ]
	Diskless -> Outdated           [ label = "ioctl_set_disk()" ]
	Consistent -> Outdated         [ label = "receive_param()" ]
	Consistent -> UpToDate         [ label = "receive_param()" ]
	Consistent -> Inconsistent     [ label = "start resync" ]
	Outdated   -> Inconsistent     [ label = "start resync" ]
	UpToDate   -> Inconsistent     [ label = "ioctl_replicate" ]
	Inconsistent -> UpToDate       [ label = "resync completed" ]
	Consistent -> Failed           [ label = "io completion error" ]
	Outdated   -> Failed           [ label = "io completion error" ]
	UpToDate   -> Failed           [ label = "io completion error" ]
	Inconsistent -> Failed         [ label = "io completion error" ]
	Failed -> Diskless             [ label = "sending notify to peer" ]
}

[-- Attachment #4: disk-states-8.ps --]
[-- Type: application/postscript, Size: 12540 bytes --]

[-- Attachment #5: node-states-8.dot --]
[-- Type: text/plain, Size: 540 bytes --]

digraph node_states {
	Secondary -> Primary           [ label = "ioctl_set_state()" ]
	Primary   -> Secondary 	       [ label = "ioctl_set_state()" ]
}

digraph peer_states {
	Secondary -> Primary           [ label = "recv state packet" ]
	Primary   -> Secondary 	       [ label = "recv state packet" ]
	Primary   -> Unknown 	       [ label = "connection lost" ]
	Secondary  -> Unknown  	       [ label = "connection lost" ]
	Unknown   -> Primary           [ label = "connected" ]
	Unknown   -> Secondary         [ label = "connected" ]
}


[-- Attachment #6: node-states-8.ps --]
[-- Type: application/postscript, Size: 9774 bytes --]

[-- Attachment #7: cstates-7.dot --]
[-- Type: text/plain, Size: 1318 bytes --]

digraph cstate {
	Unconfigured -> StandAllone    [ label = "ioctl_set_disk()" ]
	StandAllone  -> Unconnected    [ label = "ioctl_set_net()" ]
	Unconfigured -> Unconnected    [ label = "ioctl_set_net()" ]
	Unconnected  -> WFConnection   [ label = "connect()[1]" ]
	WFConnection -> Unconnected    [ label = "unable to bind()" ]
	WFConnection -> WFReportParams [ label = "in connect() after accept" ]
	WFReportParams -> StandAllone  [ label = "checks in receive_param()" ]
	WFReportParams -> Connected    [ label = "in receive_param()" ]
	WFReportParams -> WFBitMapS    [ label = "sync_handshake()" ]
	WFReportParams -> WFBitMapT    [ label = "sync_handshake()" ]
	WFBitMapS -> SyncSource        [ label = "receive_bitmap()" ]
	WFBitMapT -> SyncTarget        [ label = "receive_bitmap()" ]
	SyncSource -> Connected
	SyncTarget -> Connected
	SyncSource -> PausedSyncS
	SyncTarget -> PausedSyncT
	PausedSyncS -> SyncSource
	PausedSyncT -> SyncTarget
	Connected   -> BrokenPipe       [ label = "* recv error" ]
	BrokenPipe  -> WFConnection     [ label = "connect()[1]" ]
	Connected   -> NetworkFailure   [ label = "* set by asender()" ]
	NetworkFailure -> WFConnection  [ label = "connect()[1]" ]
	Connected   -> Timeout          [ label = "* set drbd_send()" ]
	Timeout     -> WFConnection     [ label = "connect()[1]" ]
}

[-- Attachment #8: conn-states-8.ps --]
[-- Type: application/postscript, Size: 13380 bytes --]

[-- Attachment #9: cstates-7.ps --]
[-- Type: application/postscript, Size: 17743 bytes --]


end of thread, other threads:[~2004-10-06 11:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies Lars Ellenberg
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
2004-09-24 23:04   ` Lars Ellenberg
2004-09-25  8:54     ` Lars Marowsky-Bree
2004-09-25  9:50       ` Lars Ellenberg
2004-09-25  9:59         ` Lars Marowsky-Bree
2004-09-26 18:40   ` Andrew Beekhof
2004-09-27 12:19     ` Lars Ellenberg
2004-09-27 12:38       ` Andrew Beekhof
2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
2004-09-29 12:58   ` Philipp Reisner
2004-09-29 17:07     ` Lars Ellenberg
2004-10-06 11:55       ` Philipp Reisner
