* [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
@ 2004-09-24 14:29 Lars Ellenberg
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
0 siblings, 2 replies; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-24 14:29 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
some of this applies to replicated resources in general,
so Andrew may have some ideas to generalize it...
the source of it is a pod'ed perl script that I am trying to tweak into
calculating all the possible transitions for me (and then filtering out
the relevant ones ...)
=======
DRBD cluster states and state transitions
We want to consolidate all DRBD state changes and "recovery strategies"
into one prominent and obvious place, something like a state machine.
This is necessary to serialize state changes properly, and to make
error recovery maintainable.
Given this background, this script generates a set of cluster states
(and transitions later) of two DRBD peers from the point of view of an
all knowing higher level intelligence (cluster manager/operator).
We (as DRBD) should actually only be concerned about the single node
state transitions, but the CRM (wink to Andrew) may want to twist its
brain with the two node states to think about what can happen with
replicated resources...
This overview can and should be improved, so that we provably cover all
corner cases and the recovery strategies are as good as they can be.
Currently this covers only the states, and outlines the transitions. It
should help to define the actions to be taken on every possible "input"
to the DRBD internal "state machine".
Please think about it and give us feedback, especially about whether the
set of states is complete. We do not want to miss a single corner case.
Thank you very much.
Lars Ellenberg
Node states
Each node has several attributes, which may change more or less
independently. A node can be
* up or down
* with backing storage or diskless
We need to distinguish between data storage and meta-data storage.
If we don't have meta-data storage, the node may as well be down (and
in the event of losing the meta-data storage, it should take appropriate
emergency action and commit suicide).
Obviously a diskless node cannot take part in synchronization.
* Active or Standby
Promoting an unconnected diskless non-active node is not possible.
Promoting a connected diskless non-active node should not be
possible.
* target or source of synchronization, consistent, or inconsistent.
Even though consistent, we might know or assume that we are
outdated.
* connected or unconnected.
Obviously an unconnected node cannot take part in synchronization.
Some of the attributes depend on others, and the information about the
node status could be easily encoded in one single letter.
But since HA is all about redundancy, we will encode the node status
redundantly in *four* letters, to make it more obvious to human readers.
First letter, node role:
_ down
S up, standby (non-active, but ready to become active)
s up, not-active, but target of sync
i up, not-active, unconnected, inconsistent
o up, not-active, unconnected, outdated
d up, not-active, diskless
A up, active
a up, active, but target of sync
b up, blocking, because unconnected active and inconsistent
(no valid data available)
B up, blocking, because unconnected active and diskless
(no valid data available)
D up, active, but diskless (implies connection to good data)
Second letter, meta-data storage:
M meta-data storage available
_ meta-data storage unavailable
Third letter, backing storage:
* backing storage available
o backing storage consistent but outdated
(refuses to become active)
i backing storage inconsistent (unfinished sync)
_ diskless
Fourth letter, connection:
: unconnected, stand alone
? unconnected, looking for peer
- connected
> connected, sync source
< connected, sync target
Note however that "S" does NOT necessarily mean it has up-to-date data,
only that it thinks its data is consistent: it was not explicitly told
that it is outdated, and it had no reason to assume so. E.g. directly
after boot, before the first connect, the data may well be consistent,
but outdated. Since this information is not directly available to the
node, let alone to DRBD, it is difficult to map in here.
Since everything else is irrelevant if the node is down, since
synchronisation implies backing storage, since we refuse to do anything
without meta-data storage, and since some of the states resolve
immediately (e.g. outdated => sync target upon connect), we now have 24
distinguishable host states.
___:
AM*- DM_- AM*> aM*< AM*? AM*: bM*? bM*: BM_? BM_:
SM*- dM_- SM*> sM*< SM*? SM*: iM*? iM*: dM_? dM_:
oM*- oM*? oM*:
This is our starting point, so please double check. Did I miss
something?
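As a cross-check, the 24 host states above can be regenerated mechanically. A minimal sketch in Python (rather than the pod'ed perl); the RULES table is transcribed from the state listing above, not taken from DRBD itself:

```python
# role letter -> (meta-data + disk letters, allowed connection letters),
# transcribed from the 24-state listing above.
RULES = {
    "A": ("M*", "->?:"),  # active: connected, sync source, searching, or stand-alone
    "S": ("M*", "->?:"),  # standby: same connection possibilities
    "a": ("M*", "<"),     # active sync target
    "s": ("M*", "<"),     # standby sync target
    "o": ("M*", "-?:"),   # outdated: consistent but stale, never a sync endpoint
    "i": ("M*", "?:"),    # inconsistent: unconnected (on connect it becomes 's'/'a')
    "b": ("M*", "?:"),    # blocking: active, unconnected, inconsistent
    "d": ("M_", "-?:"),   # diskless standby
    "D": ("M_", "-"),     # diskless active: requires the connection to good data
    "B": ("M_", "?:"),    # blocking: active, unconnected, diskless
}

STATES = ["___:"] + [role + mid + conn          # the "down" state, plus all up states
                     for role, (mid, conns) in RULES.items()
                     for conn in conns]
assert len(STATES) == 24
```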
Because a non-active unconnected diskless node may as well be down, to
simplify, we *could* introduce this equivalence, which reduces some
cluster states: dM_[:?] => ___:
We *could* merge both unconnected states into one for the purpose of
describing and testing it. This needs some thought. It would reduce the
number of possible node states by 5 or 6 respectively, and the resulting
cluster states by a considerable factor: (..)[:?] => $1:
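Expressed as regex rewrites over the four-letter encoding, the two proposed equivalences might look like this (a Python sketch of the perl-style substitutions above; it uses a three-character group, since the connection letter is the fourth):

```python
import re

def simplify(state):
    """Apply the two proposed equivalences to a four-letter node state."""
    # a non-active unconnected diskless node may as well be down:
    state = re.sub(r"^dM_[:?]$", "___:", state)
    # merge "stand alone" and "looking for peer" into one unconnected mode:
    return re.sub(r"^(...)[:?]$", r"\1:", state)
```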
Cluster States
For the cluster:
left-node -- network -- right-node
the list of all possible pairwise combinations of these states needs to
be filtered: combining a connected left state with an unconnected right
state does not give a valid cluster state.
Since connected states with more than one active node currently are
still invalid, too, and will be immediately disconnected, we don't
mention these, either. See also the note about the "split brain" cluster
states below.
States where both nodes are looking for the peer (should) resolve
automatically into some connected mode "immediately" (unless the network
is broken).
Since we assume an "all knowing" CM, the state of the network link is
stated explicitly:
- ok
% broken
For the purpose of describing and testing it we may choose to merge :%:
and :-: into :_:, because if neither node tries to connect, the link
status is irrelevant.
We leave out states whose symmetrically reversed counterparts are
already listed.
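A sketch of that pairing and filtering, in Python; the validity checks paraphrase the rules above and are illustrative, not the generator's actual code:

```python
def cluster_states(node_states):
    """Enumerate valid cluster states "LLLL k RRRR" from four-letter node
    states; k is the link ('-' ok, '%' broken), and the right node is
    written mirrored, as in the listing.  Sketch only: the real script
    also flips the '<'/'>' glyphs when mirroring, which we omit here."""
    ACTIVE = "AaDbB"
    seen = set()
    for left in node_states:
        for right in node_states:
            for link in "-%":
                lc, rc = left[3], right[3]
                if (lc in "-><") != (rc in "-><"):
                    continue   # one side connected, the other not: invalid
                if lc in "-><" and link == "%":
                    continue   # cannot stay connected over a broken link
                if lc in "-><" and left[0] in ACTIVE and right[0] in ACTIVE:
                    continue   # two connected active nodes: still invalid
                if (lc == ">") != (rc == "<") or (lc == "<") != (rc == ">"):
                    continue   # a sync source needs a sync target opposite
                cs = left + link + right[::-1]
                mirror = cs[5:][::-1] + link + cs[:4][::-1]
                if mirror in seen:
                    continue   # leave out the symmetric duplicates
                seen.add(cs)
                yield cs

# a few of the 24 node states, enough to exercise the filters:
demo = ["AM*-", "SM*-", "AM*?", "oM*:", "___:"]
```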
Classify
These states can be classified as sane "[OK]", degraded "{deg}", not
operational "{bad}", and fatal "[BAD]".
A "{deg}" state is still operational. This means that applications can
run and client requests are satisfied. But they are only one failure
apart from being rendered non-operational, so you should still *run*
and fix it...
If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
some of the "{bad}" states may find a transition to an operational
state, though most likely only to some "{deg}" one. For example, if the
network comes back, or the cluster manager promotes a currently
non-active node to be active.
Should "[BAD]" states ever occur, intervention of a higher level
intelligence (cluster manager/operator) is necessary to restore an
operational state.
There is one additional class: the "BRAIN" class, which typically can
only occur in and after split brain situations, and can never occur with
an "all knowing" cluster manager, so these are special cases here.
Note that some of these (the "double A" states) may eventually become
legal, when we start to support [Open]GFS or other shared access modes;
others will then just be "[BAD]" or even only "{bad}".
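The classification could be approximated like this (a Python sketch; the predicates paraphrase the prose above, and the generating script may draw some of these lines differently):

```python
SERVING    = set("AaD")    # active roles that actually serve client requests
ACTIVEISH  = set("AaDbB")  # every role claiming the active part, blocked or not
PROMOTABLE = set("S")      # a consistent standby the CM could still promote

def classify(state):
    """Classify a nine-letter cluster state such as "AM*?-:*Mo"."""
    left, right = state[:4], state[5:][::-1]   # un-mirror the right node
    roles = left[0] + right[0]

    def serves(me, peer):
        # a diskless active node only serves while connected to good data
        if me[0] == "D":
            return me[3] == "-" and peer[0] in "AS" and peer[2] == "*"
        return me[0] in SERVING

    if sum(r in ACTIVEISH for r in roles) > 1:
        return "BRAIN"    # two sides claiming active: split brain territory
    if all(s[1:] == "M*-" for s in (left, right)) and all(r in "AS" for r in roles):
        return "[OK]"     # connected, consistent, at most one active node
    if serves(left, right) or serves(right, left) or ">" in state:
        return "{deg}"    # operational, but only one failure from trouble
    if any(r in PROMOTABLE for r in roles):
        return "{bad}"    # not operational, but promotion could self-heal it
    return "[BAD]"        # operator intervention required
```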
The outcome is
225 states: [OK]: AM*---*MS SM*---*MS
{deg}: AM*---_Md DM_---*MS AM*>->*Ms aM*<-<*MS AM*?-:*Mo
AM*?%:*Mo AM*?%?*Mo AM*?%?*MS AM*?-:*MS AM*?%:*MS
AM*?%?*Mi AM*?-:*Mi AM*?%:*Mi AM*?%?_Md AM*?-:_Md
AM*?%:_Md AM*?-:___ AM*?%:___ AM*:-:*Mo AM*:%:*Mo
AM*:-?*Mo AM*:%?*Mo AM*:-?*MS AM*:%?*MS AM*:-:*MS
AM*:%:*MS AM*:-?*Mi AM*:%?*Mi AM*:-:*Mi AM*:%:*Mi
AM*:-?_Md AM*:%?_Md AM*:-:_Md AM*:%:_Md AM*:-:___
AM*:%:___ SM*>->*Ms
{bad}: bM*?%?*MS bM*?-:*MS bM*?%:*MS bM*:-?*MS bM*:%?*MS
bM*:-:*MS bM*:%:*MS BM_?%?*MS BM_?-:*MS BM_?%:*MS
BM_:-?*MS BM_:%?*MS BM_:-:*MS BM_:%:*MS SM*---_Md
oM*:-?*MS oM*:%?*MS oM*:-:*MS oM*:%:*MS oM*?%?*MS
oM*?-:*MS oM*?%:*MS SM*?%?*MS SM*?-:*MS SM*?%:*MS
SM*?%?*Mi SM*?-:*Mi SM*?%:*Mi SM*?%?_Md SM*?-:_Md
SM*?%:_Md SM*?-:___ SM*?%:___ SM*:-:*MS SM*:%:*MS
SM*:-?*Mi SM*:%?*Mi SM*:-:*Mi SM*:%:*Mi SM*:-?_Md
SM*:%?_Md SM*:-:_Md SM*:%:_Md SM*:-:___ SM*:%:___
[BAD]: DM_---*Mo DM_---_Md bM*?-:*Mo bM*?%:*Mo bM*?%?*Mo
bM*?%?*Mi bM*?-:*Mi bM*?%:*Mi bM*?%?_Md bM*?-:_Md
bM*?%:_Md bM*?-:___ bM*?%:___ bM*:-:*Mo bM*:%:*Mo
bM*:-?*Mo bM*:%?*Mo bM*:-?*Mi bM*:%?*Mi bM*:-:*Mi
bM*:%:*Mi bM*:-?_Md bM*:%?_Md bM*:-:_Md bM*:%:_Md
bM*:-:___ bM*:%:___ BM_?-:*Mo BM_?%:*Mo BM_?%?*Mo
BM_?%?*Mi BM_?-:*Mi BM_?%:*Mi BM_?%?_Md BM_?-:_Md
BM_?%:_Md BM_?-:___ BM_?%:___ BM_:-:*Mo BM_:%:*Mo
BM_:-?*Mo BM_:%?*Mo BM_:-?*Mi BM_:%?*Mi BM_:-:*Mi
BM_:%:*Mi BM_:-?_Md BM_:%?_Md BM_:-:_Md BM_:%:_Md
BM_:-:___ BM_:%:___ oM*---*Mo oM*---_Md oM*:-:*Mo
oM*:%:*Mo oM*:-?*Mo oM*:%?*Mo oM*:-?*Mi oM*:%?*Mi
oM*:-:*Mi oM*:%:*Mi oM*:-?_Md oM*:%?_Md oM*:-:_Md
oM*:%:_Md oM*:-:___ oM*:%:___ oM*?%?*Mo oM*?%?*Mi
oM*?-:*Mi oM*?%:*Mi oM*?%?_Md oM*?-:_Md oM*?%:_Md
oM*?-:___ oM*?%:___ dM_---_Md iM*?%?*Mi iM*?-:*Mi
iM*?%:*Mi iM*?%?_Md iM*?-:_Md iM*?%:_Md iM*?-:___
iM*?%:___ iM*:-:*Mi iM*:%:*Mi iM*:-?_Md iM*:%?_Md
iM*:-:_Md iM*:%:_Md iM*:-:___ iM*:%:___ dM_?%?_Md
dM_?-:_Md dM_?%:_Md dM_?-:___ dM_?%:___ dM_:-:_Md
dM_:%:_Md dM_:-:___ dM_:%:___ ___:-:___ ___:%:___
BRAIN: AM*?%?*MA AM*?-:*MA AM*?%:*MA AM*?%?*Mb AM*?-:*Mb
AM*?%:*Mb AM*?%?_MB AM*?-:_MB AM*?%:_MB AM*:-:*MA
AM*:%:*MA AM*:-?*Mb AM*:%?*Mb AM*:-:*Mb AM*:%:*Mb
AM*:-?_MB AM*:%?_MB AM*:-:_MB AM*:%:_MB bM*?%?*Mb
bM*?-:*Mb bM*?%:*Mb bM*?%?_MB bM*?-:_MB bM*?%:_MB
bM*:-:*Mb bM*:%:*Mb bM*:-?_MB bM*:%?_MB bM*:-:_MB
bM*:%:_MB BM_?%?_MB BM_?-:_MB BM_?%:_MB BM_:-:_MB
BM_:%:_MB
Possible Transitions
Now, by comparing each with every state, and finding all pairs which
differ in exactly one "attribute", we have all possible state
transitions.
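A sketch of that pairwise comparison in Python; note the caveat in the comment about the redundant encoding:

```python
def transitions(cluster_states):
    """Candidate transitions: unordered pairs of cluster states that
    differ in exactly one letter.  Caveat: because the node encoding is
    redundant, some single *events* change several letters at once
    (connecting, e.g., flips both connection letters), so the real
    script must diff attributes, not characters."""
    states = list(cluster_states)
    for i, a in enumerate(states):
        for b in states[i + 1:]:
            if sum(x != y for x, y in zip(a, b)) == 1:
                yield a, b
```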
We ignore certain node state transitions which are refused by drbd.
Allowed node state transition "inputs" or "reactions" are
* up or down the node
* add/remove the disk (by administrative request or in response to io
error)
if it was the last accessible good data, should this result in
suicide, or block all further io, or just fail all further io?
if this lost the meta-data storage at the same time (meta-data
internal), do we handle this differently?
* fail meta-data storage
should result in suicide.
* establish or lose the connection; quit/start retrying to establish a
connection.
* promote to active / demote to non-active
To promote an unconnected inconsistent non-active node you need
brute force. Similarly if it thinks it is outdated.
Promoting an unconnected diskless node is not possible. But those
should have been mapped to a "down" node, anyway.
* start/finish synchronization
One must not request a running and up-to-date active node to become
target of synchronization.
* block/unblock all io requests
This is in response to drbdadm suspend/resume, or a result of an
"exception handler".
* commit suicide
This is our last resort emergency handler. It should not be
implemented as "panic", though currently it is.
Again, this is important, please double check: Did I miss something?
Because the fatal "[BAD]" and "BRAIN" states can only be resolved by the
operator, for these we consider only transitions to a non-fatal state.
Connected fatal states will immediately be disconnected.
Transitions are consequences of certain events. An event can be an
operator/cluster manager Request, a Failure, or a self-healing (of the
network link, for example).
While simulating the events, we will at any time modify exactly one node
attribute of one node, or the status of the network link.
The "establish connection" event is special in that we cannot simulate
it: this is a DRBD-internal event. And from only looking at the cluster
state before this event, we cannot directly know what cluster state will
result, unless we want to add the "up-to-date-ness" of the data as an
additional node attribute...
So as soon as the connection between DRBD-peers is established, they
will auto-resolve to some other state.
======
what should follow are the relevant state transitions...
I am still not satisfied with the output of my script, though.
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies Lars Ellenberg
@ 2004-09-24 21:11 ` Lars Marowsky-Bree
2004-09-24 23:04 ` Lars Ellenberg
2004-09-26 18:40 ` Andrew Beekhof
2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
1 sibling, 2 replies; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-24 21:11 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
On 2004-09-24T16:29:25,
Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> some of this applies to replicated resources in general,
> so Andrew may have some ideas to generalize it...
I think the modelling we have so far (with the recent addendum) captures
this quite nicely for the time being. But of course, it'll help us to
verify this.
> Some of the attributes depend on others, and the information about the
> node status could be easily encoded in one single letter.
>
> But since HA is all about redundancy, we will encode the node status
> redundantly in *four* letters, to make it more obvious to human readers.
>
> _ down,
> S up, standby (non-active, but ready to become active)
> s up, not-active, but target of sync
> i up, not-active, unconnected, inconsistent
> o up, not-active, unconnected, outdated
> d up, not-active, diskless
> A up, active
> a up, active, but target of sync
> b up, blocking, because unconnected active and inconsistent
> (no valid data available)
> B up, blocking, because unconnected active and diskless
> (no valid data available)
> D up, active, but diskless (implies connection to good data)
> M meta-data storage available
> _ meta-data storage unavailable
> * backing storage available
> o backing storage consistent but outdated
> (refuses to become active)
> i backing storage inconsistent (unfinished sync)
> _ diskless
> : unconnected, stand alone
> ? unconnected, looking for peer
> - connected
> > connected, sync source
> < connected, sync target
I'd structure this somewhat differently into the node states (Up, Down),
our assumption about the other node (up, down, fenced), Backing Store
states (available, outdated, inconsistent, unavailable), the connection
(up or down) and the relationship between the GCs (higher, lower,
equal).
(Whether we are syncing and in what direction seems to be a function of
that; same for whether or not we are blocking.)
It's essentially the same as your list, but it seems to be more
accessible to me. But, it's late ;-)
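Such a restructuring might look like the following as (hypothetical) Python types; the member names and the two derived properties are illustrative only, not DRBD's actual model:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Store(Enum):        # backing store
    AVAILABLE = auto()
    OUTDATED = auto()
    INCONSISTENT = auto()
    UNAVAILABLE = auto()

class PeerView(Enum):     # our assumption about the other node
    UP = auto()
    DOWN = auto()
    FENCED = auto()

class GC(Enum):           # generation-counter relationship to the peer
    HIGHER = auto()
    LOWER = auto()
    EQUAL = auto()

@dataclass(frozen=True)
class NodeState:
    up: bool              # node state: Up / Down
    peer: PeerView
    store: Store
    connected: bool       # replication link up / down
    gc: GC

    # syncing and blocking then become derived values, not inputs:
    @property
    def sync_target(self) -> bool:
        return self.connected and self.store is Store.INCONSISTENT

    @property
    def blocking(self) -> bool:
        return not self.connected and self.store in (
            Store.INCONSISTENT, Store.UNAVAILABLE)
```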
> Because non-active unconnected diskless node can as well be down, to
> simplify, we *could* introduce this equivalence, which reduces some
> cluster states: dM_[:?] => __:
Seems sane.
> We *could* merge both unconnected states into one for the purpose of
> describing and testing it. This needs some thought. It would reduce the
> number of possible node states by 5 resp. 6, and the resulting cluster
> states by a considerable factor. (..)[:?] => $1:
Are they really, though? Doesn't it affect the GC if I tell it to go
standalone as opposed to leaving it looking for its peer?
> Classify
> These states can be classified as sane "[OK]", degraded "[deg]", not
> operational "{bad}", and fatal "[BAD]".
Makes sense, mostly, but...
> A "[deg]" state is still operational. This means that applications can
> run and client requests are satisfied. But they are only one failure
> apart from being rendered non-operational, so you still should *run*
> and fix it...
>
> If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
> some of the "{bad}" states may find a transition to an operational state,
> though most likely only to some "{deg}" one. For example if the network
> comes back, or the cluster manager promotes a currently non-active node
> to be active.
A bad state seems 'degenerate' to me. Are those two really distinct?
Self-healing would be an on-going sync or something like it.
> The outcome is
>
> 225 states: [OK]: AM*---*MS SM*---*MS
I see now why you backed away from my proposal of the state machine for
the testing back then and instead suggested the CTH ;-)
> Possible Transitions
> Now, by comparing each with every state, and finding all pairs which
> differ in exactly one "attribute", we have all possible state
> transitions.
These seem sane and complete.
Just some comments:
> We ignore certain node state transitions which are refused by drbd.
> Allowed node state transition "inputs" or "reactions" are
>
> * up or down the node
>
> * add/remove the disk (by administrative request or in response to io
> error)
>
> if it was the last accessible good data, should this result in
> suicide, or block all further io, or just fail all further io?
>
> if this lost the meta-data storage at the same time (meta-data
> internal), do we handle this differently?
>
> * fail meta-data storage
>
> should result in suicide.
In fact, even for meta-data loss, we can switch to detached mode and
increase some version counters on the other side, and still do a smooth
transition. We can no longer touch the local disk, which is bad, but
we also can't make it worse.
This will work as long as we don't explicitly have to _set_ a
dirty/outdated bit, but instead explicitly clear it when we shut down
smoothly.
I don't see any difference here between meta-data and backing store
loss, actually, that complicates things unnecessarily.
> * establish or lose the connection; quit/start retrying to establish a
> connection.
>
> * promote to active / demote to non-active
>
> To promote an unconnected inconsistent non-active node you need
> brute force. Similar if it thinks it is outdated.
>
> Promoting an unconnected diskless node is not possible. But those
> should have been mapped to a "down" node, anyways.
>
> * start/finish synchronization
>
> One must not request a running and up-to-date active node to become
> target of synchronization.
>
> * block/unblock all io requests
>
> This is in response to drbdadm suspend/resume, or a result of an
> "exception handler".
>
> * commit suicide
>
> This is our last resort emergency handler. It should not be
> implemented as "panic", though currently it is.
>
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
@ 2004-09-24 23:04 ` Lars Ellenberg
2004-09-25 8:54 ` Lars Marowsky-Bree
2004-09-26 18:40 ` Andrew Beekhof
1 sibling, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-24 23:04 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
/ 2004-09-24 23:11:33 +0200
\ Lars Marowsky-Bree:
> On 2004-09-24T16:29:25,
> Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>
> > some of this applies to replicated resources in general,
> > so Andrew may have some ideas to generalize it...
>
> I think the modelling we have so far (with the recent addendum) captures
> this quite nicely for the time being. But of course, it'll help us to
> verify this.
>
> > Some of the attributes depend on others, and the information about the
> > node status could be easily encoded in one single letter.
> >
> > But since HA is all about redundancy, we will encode the node status
> > redundantly in *four* letters, to make it more obvious to human readers.
> >
> > _ down,
> > S up, standby (non-active, but ready to become active)
> > s up, not-active, but target of sync
> > i up, not-active, unconnected, inconsistent
> > o up, not-active, unconnected, outdated
> > d up, not-active, diskless
> > A up, active
> > a up, active, but target of sync
> > b up, blocking, because unconnected active and inconsistent
> > (no valid data available)
> > B up, blocking, because unconnected active and diskless
> > (no valid data available)
> > D up, active, but diskless (implies connection to good data)
> > M meta-data storage available
> > _ meta-data storage unavailable
> > * backing storage available
> > o backing storage consistent but outdated
> > (refuses to become active)
> > i backing storage inconsistent (unfinished sync)
> > _ diskless
> > : unconnected, stand alone
> > ? unconnected, looking for peer
> > - connected
> > > connected, sync source
> > < connected, sync target
>
> I'd structure this somewhat differently into the node states (Up, Down),
> our assumption about the other node (up, down, fenced), Backing Store
> states (available, outdated, inconsistent, unavailable), the connection
> (up or down) and the relationship between the GCs (higher, lower,
> equal).
>
> (Whether we are syncing and in what direction seems to be a function of
> that, same whether or not we are blocking or not.)
>
> It's essentially the same as your list, but it seems to be more
> accessible to me. But, it's late ;-)
well, it is really the same, I guess.
but I'll try to write pseudo code for the state tuples that should make
it clear, and post that here, before I go implement it.
> > Classify
> > These states can be classified as sane "[OK]", degraded "[deg]", not
> > operational "{bad}", and fatal "[BAD]".
>
> Makes sense, mostly, but...
>
> > A "[deg]" state is still operational. This means that applications can
> > run and client requests are satisfied. But they are only one failure
> > apart from being rendered non-operational, so you still should *run*
> > and fix it...
> >
> > If it is not fatal, but only "{bad}", it *can* be "self healing", i.e.
> > some of the "{bad}" states may find a transition to an operational state,
> > though most likely only to some "{deg}" one. For example if the network
> > comes back, or the cluster manager promotes a currently non-active node
> > to be active.
>
> A bad state seems 'degenerate' to me. Are those two really distinct?
> Self-healing would be an on-going sync or something like it.
well, the difference is that "degenerate" is bad,
but I still have access to good data (which is what I mean by operational)!
> > 225 states: [OK]: AM*---*MS SM*---*MS
>
> I see now why you backed away from my proposal of the state machine for
> the testing back then and instead suggested the CTH ;-)
at that point of time we did not know yet about "outdated",
and it was "only" 171 states iirc ...
> > We ignore certain node state transitions which are refused by drbd.
> > Allowed node state transition "inputs" or "reactions" are
> >
> > * up or down the node
> >
> > * add/remove the disk (by administrative request or in response to io
> > error)
> >
> > if it was the last accessible good data, should this result in
> > suicide, or block all further io, or just fail all further io?
> >
> > if this lost the meta-data storage at the same time (meta-data
> > internal), do we handle this differently?
> >
> > * fail meta-data storage
> >
> > should result in suicide.
>
> In fact, even for meta-data loss, we can switch to detached mode and
> increase some version counters on the other side, and still do a smooth
> transition. We can't any longer touch the local disk, which is bad, but
> we also can't make it worse.
>
> This will work as long as we don't explicitly have to _set_ a
> dirty/outdated bit, but if we explicitly clear it instead when we
> smoothly shutdown.
>
> I don't see any difference here between meta-data and backing store
> loss, actually, that complicates things unnecessarily.
well, DRBD needs to make a difference, because the meta-data storage
and the data storage may be physically different devices, and therefore
can fail independently. (ok, single blocks can fail independently on the
same physical storage, too, but this is another thing)
but yes, meta-data loss is not per definition catastrophic...
lge
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-24 23:04 ` Lars Ellenberg
@ 2004-09-25 8:54 ` Lars Marowsky-Bree
2004-09-25 9:50 ` Lars Ellenberg
0 siblings, 1 reply; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-25 8:54 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
On 2004-09-25T01:04:57,
Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> > I don't see any difference here between meta-data and backing store
> > loss, actually, that complicates things unnecessarily.
>
> well, DRBD needs to make a difference, because the meta-data storage
> and the data storage may be physically different devices, and therefore
> can fail independently. (ok, single blocks can fail independently on the
> same physical storage, too, but this is another thing)
The point I was trying to make is that meta-data loss and backing
storage loss can essentially be mapped to a generic local IO failure.
The special case where we only lose access to the backing store and not
to the meta-data allows us to set a flag there (for whatever use it may
be the next time we compare GCs), but then it amounts to the same: loss
of the local storage.
I don't see any benefit in keeping the two as distinct failure modes...
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company
* Re: [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-25 8:54 ` Lars Marowsky-Bree
@ 2004-09-25 9:50 ` Lars Ellenberg
2004-09-25 9:59 ` Lars Marowsky-Bree
0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-25 9:50 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
/ 2004-09-25 10:54:28 +0200
\ Lars Marowsky-Bree:
> On 2004-09-25T01:04:57,
> Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>
> > > I don't see any difference here between meta-data and backing store
> > > loss, actually, that complicates things unnecessarily.
> >
> > well, DRBD needs to make a difference, because the meta-data storage
> > and the data storage may be physically different devices, and therefore
> > can fail independently. (ok, single blocks can fail independently on
> > the same physical storage, too, but this is another thing)
>
> The point I was trying to make is that meta-data loss and backing
> storage loss can essentially be mapped to a generic local IO failure.
>
> The special case where we only lose access to the backing store and not
> to the meta-data allows us to set a flag there (for whatever use it may
> be the next time we compare GCs), but then it amounts to the same: Loss
> of the local storage.
>
> I don't see any benefit in keeping the two as distinct failure modes...
well. they are different events, and I must handle both of them.
there indeed is no benefit to it. but this is not about theoretically
describing it; in the end I want to code a state machine from it,
and know that it is complete.
in any case, if you look at the listed states, all of them have the "M"
on, so I have already "simplified" in that respect...
currently in drbd it IS indeed both mapped to "local io error",
then "trying" (without further action on failure) to set the
"inconsistent, need full sync" bit in the meta-data.
but currently in drbd the recovery code is distributed in small pieces
all over the code, and I want to try to put it all into one place,
and be sure I deal with every possible corner case.
and for example if we are the only remaining node (we have no
connection), we may rather choose to continue, passing on IO errors if
they happen, than to "detach" the partially broken storage.
apart from the fact that the local storage should never fail,
and should possibly be some sort of raid itself...
lge
* Re: [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-25 9:50 ` Lars Ellenberg
@ 2004-09-25 9:59 ` Lars Marowsky-Bree
0 siblings, 0 replies; 13+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-25 9:59 UTC (permalink / raw)
To: drbd-dev, linux-ha-dev
On 2004-09-25T11:50:04,
Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> well. they are different events, and I must handle both of them.
> there indeed is no benefit to it. but this is not about theoretically
> describing it; in the end I want to code a state machine from it,
> and know that it is complete.
Ah. Ok. Then we agree.
> apart from the fact that the local storage should never fail,
> and should possibly be some sort of raid itself...
Right! It'd all be best if nothing ever failed! ;-)
Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transitions, recovery strategies
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
2004-09-24 23:04 ` Lars Ellenberg
@ 2004-09-26 18:40 ` Andrew Beekhof
2004-09-27 12:19 ` Lars Ellenberg
1 sibling, 1 reply; 13+ messages in thread
From: Andrew Beekhof @ 2004-09-26 18:40 UTC (permalink / raw)
To: High-Availability Linux Development List; +Cc: drbd-dev
On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:
> On 2004-09-24T16:29:25,
> Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>
>> some of this applies to replicated resources in general,
>> so Andrew may have some ideas to generalize it...
>
> I think the modelling we have so far (with the recent addendum)
> captures
> this quite nicely for the time being. But of course, it'll help us to
> verify this.
Right. I'll also be thinking more about this in the coming days when I
start on the incarnations.
>
>> Some of the attributes depend on others, and the information
>> about the
>> node status could be easily encoded in one single letter.
>>
>> But since HA is all about redundancy, we will encode the node
>> status
>> redundantly in *four* letters, to make it more obvious to human
>> readers.
>>
>> _ down,
>> S up, standby (non-active, but ready to become active)
>> s up, not-active, but target of sync
>> i up, not-active, unconnected, inconsistent
>> o up, not-active, unconnected, outdated
>> d up, not-active, diskless
>> A up, active
>> a up, active, but target of sync
>> b up, blocking, because unconnected active and
>> inconsistent
>> (no valid data available)
>> B up, blocking, because unconnected active and diskless
>> (no valid data available)
>> D up, active, but diskless (implies connection to good
>> data)
>> M meta-data storage available
>> _ meta-data storage unavailable
>> * backing storage available
>> o backing storage consistent but outdated
>> (refuses to become active)
>> i backing storage inconsistent (unfinished sync)
>> _ diskless
>> : unconnected, stand alone
>> ? unconnected, looking for peer
>> - connected
>> > connected, sync source
>> < connected, sync target
>
> I'd structure this somewhat differently into the node states (Up,
> Down),
> our assumption about the other node (up, down, fenced), Backing Store
> states (available, outdated, inconsistent, unavailable), the connection
> (up or down) and the relationship between the GCs (higher, lower,
> equal).
>
> (Whether we are syncing and in what direction seems to be a function of
> that, same whether or not we are blocking or not.)
>
> It's essentially the same as your list, but it seems to be more
> accessible to me. But, it's late ;-)
I agree here... restricting the attributes to true inputs (rather than
derived values) helps stop my brain going into meltdown trying to
absorb the matrix :) I could also imagine that it makes your life
easier if you later decide to change how you determine something like
sync direction.
--
Andrew Beekhof
"Ooo Ahhh, Glenn McRath" - TISM
^ permalink raw reply [flat|nested] 13+ messages in thread
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-26 18:40 ` Andrew Beekhof
@ 2004-09-27 12:19 ` Lars Ellenberg
2004-09-27 12:38 ` Andrew Beekhof
0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-27 12:19 UTC (permalink / raw)
To: High-Availability Linux Development List, drbd-dev
/ 2004-09-26 20:40:36 +0200
\ Andrew Beekhof:
>
> On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:
>
> >On 2004-09-24T16:29:25,
> > Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
> >
> >>some of this applies to replicated resources in general,
> >>so Andrew may have some ideas to generalize it...
> >
> >I think the modelling we have so far (with the recent addendum)
> >captures
> >this quite nicely for the time being. But of course, it'll help us to
> >verify this.
>
> Right. I'll also be thinking more about this in the coming days when I
> start on the incarnations.
>
> >
> >> Some of the attributes depend on others, and the information
> >>about the
> >> node status could be easily encoded in one single letter.
> >>
> >> But since HA is all about redundancy, we will encode the node
> >>status
> >> redundantly in *four* letters, to make it more obvious to human
> >>readers.
> >>
> >> _ down,
> >> S up, standby (non-active, but ready to become active)
> >> s up, not-active, but target of sync
> >> i up, not-active, unconnected, inconsistent
> >> o up, not-active, unconnected, outdated
> >> d up, not-active, diskless
> >> A up, active
> >> a up, active, but target of sync
> >> b up, blocking, because unconnected active and
> >>inconsistent
> >> (no valid data available)
> >> B up, blocking, because unconnected active and diskless
> >> (no valid data available)
> >> D up, active, but diskless (implies connection to good
> >>data)
> >> M meta-data storage available
> >> _ meta-data storage unavailable
> >> * backing storage available
> >> o backing storage consistent but outdated
> >> (refuses to become active)
> >> i backing storage inconsistent (unfinished sync)
> >> _ diskless
> >> : unconnected, stand alone
> >> ? unconnected, looking for peer
> >> - connected
> >>> connected, sync source
> >> < connected, sync target
> >
> >I'd structure this somewhat differently into the node states (Up,
> >Down),
> >our assumption about the other node (up, down, fenced), Backing Store
> >states (available, outdated, inconsistent, unavailable), the connection
> >(up or down) and the relationship between the GCs (higher, lower,
> >equal).
> >
> >(Whether we are syncing and in what direction seems to be a function of
> >that, same whether or not we are blocking or not.)
> >
> >It's essentially the same as your list, but it seems to be more
> >accessible to me. But, it's late ;-)
>
> I agree here... restricting the attributes to true inputs (rather than
> derived values) helps stop my brain going into meltdown trying to
> absorb the matrix :) I could also imagine that it makes your life
> easier if you decide to change how you decide something like syncing
> direction later on.
Ok. I'll try again, and maybe create a dot file while I'm at it :)
Even though my "states" are not actually "derived",
maybe they are only "obvious" to the "insider" (me).
So, for the time being, could you please have a look at
http://wiki.trick.ca/linux-ha/DRBD/StateMachine
and maybe comment on it.
lge
* [Drbd-dev] Re: [Linux-ha-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-27 12:19 ` Lars Ellenberg
@ 2004-09-27 12:38 ` Andrew Beekhof
0 siblings, 0 replies; 13+ messages in thread
From: Andrew Beekhof @ 2004-09-27 12:38 UTC (permalink / raw)
To: High-Availability Linux Development List; +Cc: drbd-dev
On Sep 27, 2004, at 2:19 PM, Lars Ellenberg wrote:
> / 2004-09-26 20:40:36 +0200
> \ Andrew Beekhof:
>>
>> On Sep 24, 2004, at 11:11 PM, Lars Marowsky-Bree wrote:
>>
>>> On 2004-09-24T16:29:25,
>>> Lars Ellenberg <Lars.Ellenberg@linbit.com> said:
>>>
>>>> some of this applies to replicated resources in general,
>>>> so Andrew may have some ideas to generalize it...
>>>
>>> I think the modelling we have so far (with the recent addendum)
>>> captures
>>> this quite nicely for the time being. But of course, it'll help us to
>>> verify this.
>>
>> Right. I'll also be thinking more about this in the coming days when I
>> start on the incarnations.
>>
>>>
>>>> Some of the attributes depend on others, and the information
>>>> about the
>>>> node status could be easily encoded in one single letter.
>>>>
>>>> But since HA is all about redundancy, we will encode the node
>>>> status
>>>> redundantly in *four* letters, to make it more obvious to human
>>>> readers.
>>>>
>>>> _ down,
>>>> S up, standby (non-active, but ready to become active)
>>>> s up, not-active, but target of sync
>>>> i up, not-active, unconnected, inconsistent
>>>> o up, not-active, unconnected, outdated
>>>> d up, not-active, diskless
>>>> A up, active
>>>> a up, active, but target of sync
>>>> b up, blocking, because unconnected active and
>>>> inconsistent
>>>> (no valid data available)
>>>> B up, blocking, because unconnected active and diskless
>>>> (no valid data available)
>>>> D up, active, but diskless (implies connection to good
>>>> data)
>>>> M meta-data storage available
>>>> _ meta-data storage unavailable
>>>> * backing storage available
>>>> o backing storage consistent but outdated
>>>> (refuses to become active)
>>>> i backing storage inconsistent (unfinished sync)
>>>> _ diskless
>>>> : unconnected, stand alone
>>>> ? unconnected, looking for peer
>>>> - connected
>>>>> connected, sync source
>>>> < connected, sync target
>>>
>>> I'd structure this somewhat differently into the node states (Up,
>>> Down),
>>> our assumption about the other node (up, down, fenced), Backing Store
>>> states (available, outdated, inconsistent, unavailable), the
>>> connection
>>> (up or down) and the relationship between the GCs (higher, lower,
>>> equal).
>>>
>>> (Whether we are syncing and in what direction seems to be a function
>>> of
>>> that, same whether or not we are blocking or not.)
>>>
>>> It's essentially the same as your list, but it seems to be more
>>> accessible to me. But, it's late ;-)
>>
>> I agree here... restricting the attributes to true inputs (rather than
>> derived values) helps stop my brain going into meltdown trying to
>> absorb the matrix :) I could also imagine that it makes your life
>> easier if you decide to change how you decide something like syncing
>> direction later on.
>
> ok.
> I'll try again, and maybe create a dot file while I'm at it :)
I think you'll find that helps a lot (I know it helped me with the
CRM's).
>
> even though my "states" are not actually "derived".
> but maybe only "obvious" to the "insider" (me).
I think it depends on the POV you take and how you break it up...
I would say "syncing" is a state, but the direction is derived, and who
we and the other side are (master/slave/...) are also inputs.
Otherwise you need states for every combination, and (at least in the
CRM) I didn't find that gave me much (except for a really huge .dot
file :)
>
> so, for the time being, could you please have a look at
> http://wiki.trick.ca/linux-ha/DRBD/StateMachine
>
> and maybe comment it.
sure thing
>
> lge
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
>
--
Andrew Beekhof
"Ooo Ahhh, Glenn McRath" - TISM
* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies Lars Ellenberg
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
@ 2004-09-27 14:52 ` Philipp Reisner
2004-09-29 12:58 ` Philipp Reisner
1 sibling, 1 reply; 13+ messages in thread
From: Philipp Reisner @ 2004-09-27 14:52 UTC (permalink / raw)
To: drbd-dev
On Friday, 24 September 2004 16:29, Lars Ellenberg wrote:
[...]
> Currently this covers only the states, and outlines the transitions. It
> should help to define the actions to be taken on every possible "input"
> to the DRBD internal "state machine".
>
While reading through this giant e-mail I lost my confidence that it
could be a good idea to have a "central" state switching function in
DRBD, but of course I will see what this discussion gives...
We have a huge space of possible combinations of these attributes, but
a lot of those are impossible/invalid... etc. Currently these constraints
are expressed by the code...
The question is: what is easier to read/understand/code/get right?
[...]
>
> Allowed node state transition "inputs" or "reactions" are
>
> * up or down the node
>
> * add/remove the disk (by administrative request or in response to io
> error)
>
> if it was the last accessible good data, should this result in
> suicide, or block all further io, or just fail all further io?
>
> if this lost the meta-data storage at the same time (meta-data
> internal), do we handle this differently?
I guess this is a question we cannot answer here for all of our users;
someone might want this, others that... etc. If it is a question
you cannot answer, it probably needs to be configurable.
> * fail meta-data storage
>
> should result in suicide.
>
> * establish or lose the connection; quit/start retrying to establish
> a connection.
>
> * promote to active / demote to non-active
>
> To promote an unconnected inconsistent non-active node you need
> brute force. Similar if it thinks it is outdated.
>
> Promoting an unconnected diskless node is not possible. But those
> should have been mapped to a "down" node, anyways.
>
Hmmm?
Just had a look at what we are currently doing. Probably we should
drop the DISKLESS bit and replace it by an enum
dstate: inconsistent,
outdated (known to be outdated -- happens via drbdadm outdate, or when
the data was consistent but the negotiation's outcome was that this
is old data and sync is paused),
consistent (this reflects the meta-data meaning of consistent, i.e.
it might be outdated),
na (= diskless),
uptodate
and display this in /proc/drbd as "ld:"
> * start/finish synchronization
>
> One must not request a running and up-to-date active node to become
> target of synchronization.
>
> * block/unblock all io requests
>
> This is in response to drbdadm suspend/resume, or a result of an
> "execption handler".
>
> * commit suicide
>
> This is our last resort emergency handler. It should not be
> implemented as "panic", though currently it is.
>
> Again, this is important, please double check: Did I miss something?
>
I think everything is there... (and reading it is quite inspiring)
-Philipp
* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
@ 2004-09-29 12:58 ` Philipp Reisner
2004-09-29 17:07 ` Lars Ellenberg
0 siblings, 1 reply; 13+ messages in thread
From: Philipp Reisner @ 2004-09-29 12:58 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 806 bytes --]
> [...]
> > Currently this covers only the states, and outlines the transitions.
> > It should help to define the actions to be taken on every possible
> > "input" to the DRBD internal "state machine".
>
> While reading through this giant e-mail I lost my confidence that it
> could be a good idea to have a "central" state switching function in
> DRBD, but of course I will see what this discussion gives...
Thought about this a bit more... and came to the conclusion that it
would be a good idea. What do you think of this skeleton --
pseudo code (it actually compiles)?
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: new_st.c --]
[-- Type: text/x-csrc, Size: 2996 bytes --]
#include <stdio.h>

typedef struct {
unsigned role : 2 ; // 3 primary/secondary/unknown
unsigned peer : 2 ; // 3 primary/secondary/unknown
unsigned conn : 5 ; // 17 cstates
unsigned disk : 3 ; // 6 from Diskless to UpToDate
unsigned multi_p : 1 ; // 2 multiple primaries allowed.
} drbd_state_t;
typedef enum {
Unconfigured,
StandAlone,
Unconnected,
Timeout,
BrokenPipe,
NetworkFailure,
WFConnection,
WFReportParams, // we have a socket
Connected, // we have introduced each other
SkippedSyncS, // we should have synced, but user said no
SkippedSyncT,
WFBitMapS,
WFBitMapT,
SyncSource, // The distance between original state and pause
SyncTarget, // state must be the same for source and target. (+2)
PausedSyncS, // see _drbd_rs_resume() and _drbd_rs_pause()
PausedSyncT, // is sync target, but higher priority groups first
} drbd_conns_t;
typedef enum {
Diskless,
Failed, // Moves on to Diskless as soon as we reported it to the peer
Inconsistent,
Outdated,
Consistent, // Might be outdated, might be UpToDate ...
UpToDate,
} drbd_disks_t;
typedef enum {
Unknown=0,
Primary=1, // role
Secondary=2, // role
} drbd_role_t;
drbd_state_t st; // here for mdev->state;
void lock(void) {}   /* stub; the real code would take a spinlock */
void unlock(void) {} /* stub; the real code would release it */
static int state_change(drbd_state_t ns, int hard)
{
int rv;
lock();
if( ns.role == st.role &&
ns.peer == st.peer &&
ns.conn == st.conn &&
ns.disk == st.disk &&
ns.multi_p == st.multi_p ) {
rv = 1;
goto out;
}
if(!hard) {
// pre-state-change checks
if(!ns.multi_p &&
ns.role == Primary &&
ns.peer == Primary) {
rv = 0;
goto out;
}
// ...
}
// State sanitising
if ( ns.conn < Connected ) ns.peer = Unknown;
st = ns;
// post-state-change actions...
if( ns.conn < Connected &&
ns.disk <= Inconsistent &&
ns.role == Primary ) {
panic(); // Just for illustration
}
// ...
// Probably it also makes sense to run a dynamic list of
// post-state-change callbacks sitting on a post-state-change hook.
//
// Could use this to implement the sync groups more sanely.
// Could also replace the cstate_wait
rv = 1;
out:
unlock();
return rv;
}
int set_cstate(drbd_conns_t new, int hard)
{
drbd_state_t ns = st;
ns.conn = new;
return state_change(ns,hard);
}
int set_dstate(drbd_disks_t new, int hard)
{
drbd_state_t ns = st;
ns.disk = new;
return state_change(ns,hard);
}
int set_rstate(drbd_role_t new, int hard)
{
drbd_state_t ns = st;
ns.role = new;
return state_change(ns,hard);
}
int set_pstate(drbd_role_t new, int hard)
{
drbd_state_t ns = st;
ns.peer = new;
return state_change(ns,hard);
}
void panic(void)
{
printf("PANIC\n");
}
int main(void)
{
st = (drbd_state_t){ Secondary,Unknown,Unconfigured,UpToDate,0 };
set_cstate(Connected,0);
set_pstate(Primary,0);
set_rstate(Primary,0); // This one fails...
set_cstate(WFConnection,0);
set_rstate(Primary,0);
set_dstate(Diskless,1); // causes panic...
return 0;
}
* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-29 12:58 ` Philipp Reisner
@ 2004-09-29 17:07 ` Lars Ellenberg
2004-10-06 11:55 ` Philipp Reisner
0 siblings, 1 reply; 13+ messages in thread
From: Lars Ellenberg @ 2004-09-29 17:07 UTC (permalink / raw)
To: drbd-dev
/ 2004-09-29 14:58:47 +0200
\ Philipp Reisner:
> > [...]
> > > Currently this covers only the states, and outlines the transitions.
> > > It should help to define the actions to be taken on every possible
> > > "input" to the DRBD internal "state machine".
> >
> > While reading through this giant e-mail I lost my confidence that it
> > could be a good idea to have a "central" state switching function in
> > DRBD, but of course I will see what this discussion gives...
>
> Thought about this a bit more... and came to the conclusion that it
> would be a good idea. What do you think of this skeleton -
> pseudo code (it actually compiles)?
I think it probably should come out more like a real state machine,
with a defined set of possible INPUTS,
a defined set of states (which should not have the same detail depth as
the actual drbd internal state set with all its different attributes),
a set of actions, and a defined state[INPUT] => action => newstate
matrix.
Maybe that is overkill; it may as well be that this turns out to be the
easier and cleaner way. I'm not yet sure. The back of my head is still
busy and did not give me a "completion event" on it yet...
lge
* Re: [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies
2004-09-29 17:07 ` Lars Ellenberg
@ 2004-10-06 11:55 ` Philipp Reisner
0 siblings, 0 replies; 13+ messages in thread
From: Philipp Reisner @ 2004-10-06 11:55 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 2125 bytes --]
On Wednesday 29 September 2004 19:07, Lars Ellenberg wrote:
> / 2004-09-29 14:58:47 +0200
>
> \ Philipp Reisner:
> > > [...]
> > >
> > > > Currently this covers only the states, and outlines the
> > > > transitions. It should help to define the actions to be taken on
> > > > every possible "input" to the DRBD internal "state machine".
> > >
> > > While reading through this giant e-mail I lost my confidence that it
> > > could be a good idea to have a "central" state switching function in
> > > DRBD, but of course I will see what this discussion gives...
> >
> > Thought about this a bit more... and came to the conclusion that it
> > would be a good idea. What do you think of this skeleton -
> > pseudo code (it actually compiles)?
>
> I think it probably should come out more like a real state machine,
> with a defined set of possible INPUTS,
> a defined set of states (which should not have the same detail depth as
> the actual drbd internal state set with all its different attributes),
> a set of actions, and a defined state[INPUT] => action => newstate
> matrix.
>
> maybe that is overkill.
Currently I want to go the way that was outlined with the new_st.c
skeleton.
Regarding "only the worker should do state changes": we have quite a
lot of inputs that are asynchronous by their nature, e.g. a disk fails.
It does not make any sense to synchronize an advance in the disk-state
state machine with anything, while it makes a lot of sense to
synchronize changes to the node state machine.
At first I drew a directed graph of the cstates we have in
drbd-0.7 (see cstates-7.ps).
You will immediately realize that the differentiation between
Unconfigured and StandAlone is a leftover from drbd-0.6.
Then I drew directed graphs of the "state machines" as I see
them for drbd-0.8:
conn-states-8.ps, disk-states-8.ps, node-states-8.ps (has 2 pages).
PS: The program is graphviz.
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: conn-states-8.dot --]
[-- Type: text/plain, Size: 833 bytes --]
digraph conn_states {
StandAlone -> WFConnection [ label = "ioctl_set_net()" ]
WFConnection -> Unconnected [ label = "unable to bind()" ]
WFConnection -> WFReportParams [ label = "in connect() after accept" ]
WFReportParams -> StandAlone [ label = "checks in receive_param()" ]
WFReportParams -> Connected [ label = "in receive_param()" ]
WFReportParams -> WFBitMapS [ label = "sync_handshake()" ]
WFReportParams -> WFBitMapT [ label = "sync_handshake()" ]
WFBitMapS -> SyncSource [ label = "receive_bitmap()" ]
WFBitMapT -> SyncTarget [ label = "receive_bitmap()" ]
SyncSource -> Connected
SyncTarget -> Connected
SyncSource -> PausedSyncS
SyncTarget -> PausedSyncT
PausedSyncS -> SyncSource
PausedSyncT -> SyncTarget
Connected -> WFConnection [ label = "* on network error" ]
}
[-- Attachment #3: disk-states-8.dot --]
[-- Type: text/plain, Size: 913 bytes --]
digraph disk_states {
Diskless -> Inconsistent [ label = "ioctl_set_disk()" ]
Diskless -> Consistent [ label = "ioctl_set_disk()" ]
Diskless -> Outdated [ label = "ioctl_set_disk()" ]
Consistent -> Outdated [ label = "receive_param()" ]
Consistent -> UpToDate [ label = "receive_param()" ]
Consistent -> Inconsistent [ label = "start resync" ]
Outdated -> Inconsistent [ label = "start resync" ]
UpToDate -> Inconsistent [ label = "ioctl_replicate" ]
Inconsistent -> UpToDate [ label = "resync completed" ]
Consistent -> Failed [ label = "io completion error" ]
Outdated -> Failed [ label = "io completion error" ]
UpToDate -> Failed [ label = "io completion error" ]
Inconsistent -> Failed [ label = "io completion error" ]
Failed -> Diskless [ label = "sending notify to peer" ]
}
[-- Attachment #4: disk-states-8.ps --]
[-- Type: application/postscript, Size: 12540 bytes --]
[-- Attachment #5: node-states-8.dot --]
[-- Type: text/plain, Size: 540 bytes --]
digraph node_states {
Secondary -> Primary [ label = "ioctl_set_state()" ]
Primary -> Secondary [ label = "ioctl_set_state()" ]
}
digraph peer_states {
Secondary -> Primary [ label = "recv state packet" ]
Primary -> Secondary [ label = "recv state packet" ]
Primary -> Unknown [ label = "connection lost" ]
Secondary -> Unknown [ label = "connection lost" ]
Unknown -> Primary [ label = "connected" ]
Unknown -> Secondary [ label = "connected" ]
}
[-- Attachment #6: node-states-8.ps --]
[-- Type: application/postscript, Size: 9774 bytes --]
[-- Attachment #7: cstates-7.dot --]
[-- Type: text/plain, Size: 1318 bytes --]
digraph cstate {
Unconfigured -> StandAlone [ label = "ioctl_set_disk()" ]
StandAlone -> Unconnected [ label = "ioctl_set_net()" ]
Unconfigured -> Unconnected [ label = "ioctl_set_net()" ]
Unconnected -> WFConnection [ label = "connect()[1]" ]
WFConnection -> Unconnected [ label = "unable to bind()" ]
WFConnection -> WFReportParams [ label = "in connect() after accept" ]
WFReportParams -> StandAlone [ label = "checks in receive_param()" ]
WFReportParams -> Connected [ label = "in receive_param()" ]
WFReportParams -> WFBitMapS [ label = "sync_handshake()" ]
WFReportParams -> WFBitMapT [ label = "sync_handshake()" ]
WFBitMapS -> SyncSource [ label = "receive_bitmap()" ]
WFBitMapT -> SyncTarget [ label = "receive_bitmap()" ]
SyncSource -> Connected
SyncTarget -> Connected
SyncSource -> PausedSyncS
SyncTarget -> PausedSyncT
PausedSyncS -> SyncSource
PausedSyncT -> SyncTarget
Connected -> BrokenPipe [ label = "* recv error" ]
BrokenPipe -> WFConnection [ label = "connect()[1]" ]
Connected -> NetworkFailure [ label = "* set by asender()" ]
NetworkFailure -> WFConnection [ label = "connect()[1]" ]
Connected -> Timeout [ label = "* set drbd_send()" ]
Timeout -> WFConnection [ label = "connect()[1]" ]
}
[-- Attachment #8: conn-states-8.ps --]
[-- Type: application/postscript, Size: 13380 bytes --]
[-- Attachment #9: cstates-7.ps --]
[-- Type: application/postscript, Size: 17743 bytes --]
Thread overview: 13+ messages
2004-09-24 14:29 [Drbd-dev] [RFC] (CRM and) DRBD (0.8) states and transistions, recovery strategies Lars Ellenberg
2004-09-24 21:11 ` [Drbd-dev] Re: [Linux-ha-dev] " Lars Marowsky-Bree
2004-09-24 23:04 ` Lars Ellenberg
2004-09-25 8:54 ` Lars Marowsky-Bree
2004-09-25 9:50 ` Lars Ellenberg
2004-09-25 9:59 ` Lars Marowsky-Bree
2004-09-26 18:40 ` Andrew Beekhof
2004-09-27 12:19 ` Lars Ellenberg
2004-09-27 12:38 ` Andrew Beekhof
2004-09-27 14:52 ` [Drbd-dev] " Philipp Reisner
2004-09-29 12:58 ` Philipp Reisner
2004-09-29 17:07 ` Lars Ellenberg
2004-10-06 11:55 ` Philipp Reisner