[Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources

Distributed Replicated Block Device (DRBD) development

* [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
@ 2004-09-10 18:55 Lars Marowsky-Bree
  2004-09-20 15:09 ` Philipp Reisner
  2004-09-20 16:03 ` Lars Ellenberg
  0 siblings, 2 replies; 7+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-10 18:55 UTC
  To: linux-ha-dev; +Cc: drbd-dev

Hi there,

this is a call for help on how to handle internal split brain with
multiple state / replicated resources in the new CRM. I'm cc'ing the
drbd-dev list because I'm using drbd as the example in the discussion,
for it is the one replicated resource type we best understand. But the
problem is applicable to all such scenarios, which is why I'd like to
continue the discussion on the linux-ha-dev list.

My intent here is to first explain the problem, our goals, and discuss
some approaches to solving it, none of which currently satisfy me, and
thus I'm asking for feedback ;-) Anything from criticism to new angles
on the problem or an approach to the solution is welcome.  It's also a
braindump for myself of the discussions I've had with lge in the hope of
better understanding the problem.  Maybe someone finds it helpful.

I will assume the reader has read the wiki page on Multiple Incarnations
and Multiple States...

PROBLEM:

With replicated resources managed by the CRM (and correct handling of
replicated resources is a stated goal of the CRM work), we can run into
the case where the resource internally loses its ability to replicate
due to a software bug, link failure or whatever; but the CRM itself,
running on top of the heartbeat infrastructure, may still be able to
talk to both incarnations.

I think in most scenarios, we will want to continue to operate in
degraded mode, i.e. with only one of the two nodes. This implies that the
data is _diverging_ between the two nodes, and there are transactions
being committed on the active node which are not replicated. Thus,
essentially, we have lost the ability to fail over when a second fault
occurs and takes down the active node. So there are certain
double failures from which this does not protect, but it can still
protect against a single failure.

We need to make sure the node which we are not proceeding with knows
this and marks its data as 'outdated', 'desync' or whatever.

(In a strict replicated scenario where a write quorum of two or higher
is required, the only option would be to freeze all IO until the
internal split brain is resolved. This is requested by some database
vendors and some customers (e.g. banks), but then addressed by either
bringing a hot-spare for the replication target online and/or using
additional redundant replication links. Thus, it is a different
problem.)

The problem arises from the overlap of this scenario with the 'one node
down' scenario from the point of view of the resource itself as I will
go on to try and show.

Consider first the complex solution:

Time	N1	Link	N2
1	Master	ok	Replica		Everything's fine.
2	Master	fail	Replica		Link fails, one of the two nodes
					notices.

(It does not matter whether N1 or N2 tells us first that it noticed the
loss of internal connectivity; first it's very very unlikely that only
one incarnation notices the split-brain, and second it doesn't matter,
for the vote by one incarnation is sufficient.)

Notice that this failure case is _not_ a regular monitoring failure; the
incarnations themselves are still just fine. (Or they should report a
real monitoring failure instead.) This means 'monitor' needs more
semantics, essentially a special return code.

Essentially, at this point in time, the Master has to suspend IO for a
while, because the WriteAcks from the Replica are obviously not
arriving. (This already happens with drbd protocol C.)

We need to explicitly tell N1 that it is allowed to proceed, and make
sure N2 knows that from that point on its local data is Outdated (which
is a special case of 'Inconsistent') and that it must refuse to become
Master unless forced manually (with "--I-want-to-lose-transactions"). The
sequence obviously is to first tell N2 'mark-desync', and only when that
has completed successfully allow N1 to resume.
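
The ordering constraint in the paragraph above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
Incarnation class and its methods are hypothetical names and are not part
of heartbeat, the CRM or drbd.

```python
class Incarnation:
    """Hypothetical stand-in for one incarnation of a replicated resource."""

    def __init__(self, name, role):
        self.name = name
        self.role = role          # "master" or "replica"
        self.outdated = False     # replica side: local data known to be stale
        self.io_frozen = False    # master side: IO currently suspended

    def mark_desync(self):
        # Record that the local data is outdated; True means it was recorded.
        self.outdated = True
        return True

    def resume_io(self):
        self.io_frozen = False

    def promote(self, force=False):
        # An outdated replica refuses promotion unless forced manually.
        if self.outdated and not force:
            return False
        self.role = "master"
        return True


def handle_link_failure(master, replica):
    """Mark the replica outdated first; only then let the master resume."""
    master.io_frozen = True            # no WriteAcks arrive, so IO is suspended
    if not replica.mark_desync():
        return False                   # replica state unknown: stay frozen
    master.resume_io()
    return True
```

With this ordering the replica can never be promoted on stale data while
the master is already accepting writes again; the price is that the master
stays frozen for as long as 'mark-desync' has not been confirmed.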

This is, from the master resource point of view, identical to:

Time	N1	Link	N2
1	Master	ok	Replica		
2	Master	ok	crash

Master freezes, tells us about its internal split-brain, and we
eventually tell it that yeah, we know, we have fenced N2 (post-fence is
equivalent to a post-mark-desync notification). Here it also doesn't
matter whether we receive the notification from N1 before or after we
have noticed that N2 went down or failed. N2 has to know that if it
crashed while being connected to a Master, it's by definition outdated.

The ugliness arises, as hinted at above, from the overlap with another
failure case, which I'm now going to illustrate.

Time	N1	Link	N2
1	Master	ok	Replica		Everything's fine.
2	crash	ok	Replica

If we notice that N1 is crashed first, that's fine. Everything will
happen just as always, and N2 can proceed as soon as it sees the
post-fence/stop notification, which it will see before being promoted to
master or even being asked about it.

But, from the point of view of the replicated resource on N2, this is
indistinguishable from the split-brain; all it knows is that it lost
connection to its peer. So it goes on to report this.

If this event occurs before we have noticed a monitoring failure or full
node failure on N1 and were using the recovery method explained so far,
we are going to assume an internal split-brain, and tell N2 to mark
itself outdated, and then try to tell N1 to resume.  Oops. No more
talky-talky to N1, and we just told N2 it's supposed to refuse to become
master.

So, this requires special logic - whenever one incarnation reports an
internal split-brain, we actively need to go and verify the status of
the other incarnations first.

In which case we'd notice that, ah, N1 is down or experiencing a local
resource failure, and instead of outdating N2, would fence / stop N1 and
then promote N2.
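
To make the race concrete, here is a toy comparison of the two recovery
strategies for the case where N2 reports the lost connection first. It is
a sketch only; the function names and the returned strings are invented
for illustration and do not correspond to any CRM interface.

```python
def recover_naive(n1_alive):
    """Outdate N2 the moment it reports the lost connection."""
    # N2 is marked outdated unconditionally, before N1 is checked.
    if n1_alive:
        return "N1 resumes, N2 outdated"
    return "stuck: N1 is gone and N2 refuses promotion"


def recover_verified(n1_alive):
    """Verify the other incarnation first, then decide."""
    if n1_alive:
        return "N1 resumes, N2 outdated"
    return "N1 fenced, N2 promoted"


for alive in (True, False):
    print(alive, "|", recover_naive(alive), "|", recover_verified(alive))
```

Both strategies agree while N1 is alive; they differ only when N1 has in
fact crashed, which is exactly the case N2 cannot tell apart from a link
failure.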

This is the special logic I don't much like. As Rusty put it in his
keynote, "Fear of complexity" is good for programmers. And this reeks of
it - extending the monitor semantics, needing an additional command on
the secondary, _and_ needing to talk to all incarnations and then
figuring out what to do. (I don't want to think much about partitions
with >2 resources involved.) Alas, the problem seems to be real.

Here are some other alternatives I've thought about which seem simpler,
but which I then noticed don't solve the problem completely.

A) Rely on the internal split-brain timeout being larger than our
deadtime of N1 and the resource monitoring interval.

This _seems_ to solve it - because then the problematic ordering does
not occur, but relies quite a bit on timing. And if the resource on N2
notices, for example, a connection loss immediately, this basically
can't be made to work. Oh yeah, it can be worked around by adding delays
etc, but that smells a bit dung-ish, too.

B) Instead of reporting the split-brain on both nodes as a special case,
fail the monitoring operation on the secondary instead. In response to
this, we'd act just as we always would, and try to stop the replica (and
maybe even try to move it somewhere else). As soon as the primary
receives this stop/fence post-notification, it is allowed to resume, and
the data on the replica is implicitly outdated until resynced.

Unfortunately, this again runs into the race as above. If the secondary
notices first and fails, we are going to respond by stopping it; but,
woah, now we figure the primary is screwed, but we just outdated /
stopped the secondary, which now refuses to become a master. So we
cannot continue.

C) Claim the damn problem is not ours to solve and that the internal
replication must be designed redundantly too. ;-) And the race _is_
fairly unlikely...

Maybe I'm over-estimating the complexity of the approach suggested,
but this is a hard problem and I definitely want to run my thoughts past
the public to make sure it's bulletproof.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	   \\\  /// 
SUSE Labs, Research and Development \honk/ 
SUSE LINUX AG - A Novell company     \\// 


* Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-10 18:55 [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources Lars Marowsky-Bree
@ 2004-09-20 15:09 ` Philipp Reisner
  2004-09-20 15:36   ` Lars Ellenberg
  2004-09-24 15:17   ` [Linux-ha-dev] " Lars Marowsky-Bree
  2004-09-20 16:03 ` Lars Ellenberg
  1 sibling, 2 replies; 7+ messages in thread
From: Philipp Reisner @ 2004-09-20 15:09 UTC
  To: drbd-dev; +Cc: linux-ha-dev

[ I am not subscribed to linux-ha-dev ]

Hi Lars,

[...]
> If we notice that N1 is crashed first, that's fine. Everything will
> happen just as always, and N2 can proceed as soon as it sees the
> post-fence/stop notification, which it will see before being promoted to
> master or even being asked about it.
>
> But, from the point of view of the replicated resource on N2, this is
> indistinguishable from the split-brain; all it knows is that it lost
> connection to its peer. So it goes on to report this.
>
> If this event occurs before we have noticed a monitoring failure or full
> node failure on N1 and were using the recovery method explained so far,
> we are going to assume an internal split-brain, and tell N2 to mark
> itself outdated, and then try to tell N1 to resume.  Oops. No more
> talky-talky to N1, and we just told N2 it's supposed to refuse to become
> master.

So the algorithm in HB/CRM seems to be:

If I see that the resource (drbd) got disconnected from its peer then {
 If the resource is a replica (secondary) then {
  tell it that it should mark itself as "desync". 
 } else /* Resource is master (primary) */ {
  Wait for the post fence event and thaw the resource.
 }
}

> So, this requires special logic - whenever one incarnation reports an
> internal split-brain, we actively need to go and verify the status of
> the other incarnations first.
>
> In which case we'd notice that, ah, N1 is down or experiencing a local
> resource failure, and instead of outdating N2, would fence / stop N1 and
> then promote N2.
>
> This is the special logic I don't much like. As Rusty put it in his
> keynote, "Fear of complexity" is good for programmers. And this reeks of
> it - extending the monitor semantics, needing an additional command on
> the secondary, _and_ needing to talk to all incarnations and then
> figuring out what to do. (I don't want to think much about partitions
> with >2 resources involved.) Alas, the problem seems to be real.
>

What about:

If I see that the resource (drbd) got disconnected from its peer then {
 If the resource is a replica (secondary) then {
  /* do nothing */
 } else /* Resource is master (primary) */ {
  Ask the other node to do the fencing.
 }
}

If I see a fence ack then {
 Thaw the resource.
}

There is no special case in there...

> Here's some other alternatives I've thought about which seem simpler,
> but which I then noticed don't solve the problem completely.
>
>
> A) Rely on the internal split-brain timeout being larger than our
> deadtime of N1 and the resource monitoring interval.
>
> This _seems_ to solve it - because then the problematic ordering does
> not occur, but relies quite a bit on timing. And if the resource on N2
> notices, for example, a connection loss immediately, this basically
> can't be made to work. Oh yeah, it can be worked around by adding delays
> etc, but that smells a bit dung-ish, too.
>

Right, ...
Currently drbd's timeout needs to be smaller than heartbeat's deadtime;
making this the other way round is asking for trouble, I think...

[...]

BTW, from the text I realized that heartbeat will monitor the resource (drbd).
Probably by calling the resource script with a new method. Basically
heartbeat polls DRBD for a change in the connection state.

Would you like to have an active notification from DRBD?

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-20 15:09 ` Philipp Reisner
@ 2004-09-20 15:36   ` Lars Ellenberg
  2004-09-24 15:17   ` [Linux-ha-dev] " Lars Marowsky-Bree
  1 sibling, 0 replies; 7+ messages in thread
From: Lars Ellenberg @ 2004-09-20 15:36 UTC
  To: drbd-dev, linux-ha-dev

/ 2004-09-20 17:09:36 +0200
\ Philipp Reisner:
> [ I am not subscribed to linux-ha-dev ]
> 
> Hi Lars,
> 
> [...]
> > If we notice that N1 is crashed first, that's fine. Everything will
> > happen just as always, and N2 can proceed as soon as it sees the
> > post-fence/stop notification, which it will see before being promoted to
> > master or even being asked about it.
> >
> > But, from the point of view of the replicated resource on N2, this is
> > indistinguishable from the split-brain; all it knows is that it lost
> > connection to its peer. So it goes on to report this.
> >
> > If this event occurs before we have noticed a monitoring failure or full
> > node failure on N1 and were using the recovery method explained so far,
> > we are going to assume an internal split-brain, and tell N2 to mark
> > itself outdated, and then try to tell N1 to resume.  Oops. No more
> > talky-talky to N1, and we just told N2 it's supposed to refuse to become
> > master.
> 
> So the algorithm in HB/CRM seems to be:
> 
> If I see that resource (drbd) got disconnected from its peer. then {
>  If the resource is a replica (secondary) then {
>   tell it that it should mark itself as "desync". 
>  } else /* Resource is master (primary) */ {
>   Wait for the post fence event and thaw the resource.
>  }
> }
> 
> > So, this requires special logic - whenever one incarnation reports an
> > internal split-brain, we actively need to go and verify the status of
> > the other incarnations first.
> >
> > In which case we'd notice that, ah, N1 is down or experiencing a local
> > resource failure, and instead of outdating N2, would fence / stop N1 and
> > then promote N2.
> >
> > This is the special logic I don't much like. As Rusty put it in his
> > keynote, "Fear of complexity" is good for programmers. And this reeks of
> > it - extending the monitor semantics, needing an additional command on
> > the secondary, _and_ needing to talk to all incarnations and then
> > figuring out what to do. (I don't want to think much about partitions
> > with >2 resources involved.) Alas, the problem seems to be real.
> >
> 
> What about:
> 
> If I see that resource (drbd) got disconnected from its peer. then {
>  If the resource is a replica (secondary) then {
>   /* do nothing */
>  } else /* Resource is master (primary) */ {
>   Ask the other node to do the fencing.
>  }
> }
> 
> If I see a fence ack then {
>  Thaw the resource.
> }
> 
> There is no special case in there...

and that is about what I meant when discussing with lmb...
I'll explain how this works out in another follow-up to the original post.

> BTW, from the text I realized that heartbeat will monitor the resource (drbd).
> Probably by calling the resource script with a new method. Basically
> heartbeat polls DRBD for a change in the connection state.
> 
> Would you like to have an active notification from DRBD ? 

now, I'd like to make active drbd event notification possible.
I see basically two ways to do so:
 a)
  provide a special read-only file like /proc/drbd/event or so, allow
  exactly one opener, and allow that to select on it.
  define some simple, say line-based, notification messages.

  one needs to write a daemon to dispatch on those.

 b)
  make some hooks within the drbd code itself, and upon certain
  events do a fork/execle with special arguments from the worker
  thread.

  one needs to provide some external script(s)/executable(s) that
  act appropriately on those events.

 and there is, of course,
 c)
  a combination of both
 
from the CRM point of view, this is about how the
replicated/multistate/multipeer resource can help
in monitoring itself. it is an optimisation and probably not a
substitute for regular monitoring polls.
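
a rough sketch of what the dispatching daemon for option a) might look
like, assuming a line-based message format of "<event-name> <args...>".
the event names, the message format and the handler table are all
invented for this example; no such event file exists in drbd today.

```python
import os
import select


def dispatch_events(fd, handlers, timeout=1.0):
    """Read line-based events from fd until EOF or timeout; call handlers."""
    buf = b""
    handled = []
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            break                      # nothing arrived within the timeout
        chunk = os.read(fd, 4096)
        if not chunk:
            break                      # the writing side closed the file
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            fields = line.decode().split()
            if not fields:
                continue               # ignore empty lines
            name, args = fields[0], fields[1:]
            handler = handlers.get(name)
            if handler is not None:    # unknown events are silently skipped
                handler(*args)
                handled.append(name)
    return handled
```

a real daemon would loop forever instead of returning on timeout, and
would open the event file once, since only one opener is allowed.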

	Lars Ellenberg


* Re: [Linux-ha-dev] Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-20 15:09 ` Philipp Reisner
  2004-09-20 15:36   ` Lars Ellenberg
@ 2004-09-24 15:17   ` Lars Marowsky-Bree
  1 sibling, 0 replies; 7+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-24 15:17 UTC
  To: High-Availability Linux Development List, drbd-dev

On 2004-09-20T17:09:36,
   Philipp Reisner <philipp.reisner@linbit.com> said:

> BTW, from the text I realized that heartbeat will monitor the resource (drbd).
> Probably by calling the resource script with a new method. Basically
> heartbeat polls DRBD for a change in the connection state.
> 
> Would you like to have an active notification from DRBD ? 

Active notification is of course welcome. The CRM is event-based anyway,
so we could throw the event from drbd into the same machinery; that
would be a welcome addition.

Though not only connection state is interesting, but also the loss of
its local disk.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company



* Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-10 18:55 [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources Lars Marowsky-Bree
  2004-09-20 15:09 ` Philipp Reisner
@ 2004-09-20 16:03 ` Lars Ellenberg
  2004-09-21 12:58   ` Philipp Reisner
  2004-09-24 15:15   ` [Linux-ha-dev] " Lars Marowsky-Bree
  1 sibling, 2 replies; 7+ messages in thread
From: Lars Ellenberg @ 2004-09-20 16:03 UTC
  To: linux-ha-dev, drbd-dev

/ 2004-09-10 20:55:53 +0200
\ Lars Marowsky-Bree:
> Hi there,
> 
> this is a call for help on how to handle internal split brain with
> multiple state / replicated resources in the new CRM. I'm cc'ing the
> drbd-dev list because I'm using drbd as the example in the discussion,
> for it is the one replicated resource type we best understand. But the
> problem is applicable to all such scenarios, which is why I'd like to
> continue the discussion on the linux-ha-dev list.
> 
> My intent here is to first explain the problem, our goals, and discuss
> some approaches to solving it, none of which currently satisfy me, and
> thus I'm asking for feedback ;-) Anything from criticism to new angles
> on the problem or an approach to the solution is welcome.  It's also a
> braindump for myself of the discussions I've had with lge in the hope of
> better understanding the problem.  Maybe someone finds it helpful.
> 
> I will assume the reader has read the wiki page on Multiple Incarnations
> and Multiple States...
> 
> 
> PROBLEM:
> 
> With replicated resources managed by the CRM (and correct handling of
> replicated resources is a stated goal of the CRM work), we can run into
> the case where the resource internally loses its ability to replicate
> due to a software bug, link failure or whatever; but the CRM itself,
> running on top of the heartbeat infrastructure, may still be able to
> talk to both incarnations.
> 
> I think in most scenarios, we will want to continue to operate in
> degraded mode, i.e. with only one of the two nodes. This implies that the
> data is _diverging_ between the two nodes, and there are transactions
> being committed on the active node which are not replicated. Thus,
> essentially, we have lost the ability to fail over when a second fault
> occurs and takes down the active node. So there are certain
> double failures from which this does not protect, but it can still
> protect against a single failure.
> 
> We need to make sure the node which we are not proceeding with knows
> this and marks its data as 'outdated', 'desync' or whatever.
> 
> (In a strict replicated scenario where a write quorum of two or higher
> is required, the only option would be to freeze all IO until the
> internal split brain is resolved. This is requested by some database
> vendors and some customers (e.g. banks), but then addressed by either
> bringing a hot-spare for the replication target online and/or using
> additional redundant replication links. Thus, it is a different
> problem.)
> 
> The problem arises from the overlap of this scenario with the 'one node
> down' scenario from the point of view of the resource itself as I will
> go on to try and show.
> 
> Consider first the complex solution:
> 
> Time	N1	Link	N2
> 1	Master	ok	Replica		Everything's fine.
> 2	Master	fail	Replica		Link fails, one of the two nodes
> 					notices.
> 
> (It does not matter whether N1 or N2 tells us first that it noticed the
> loss of internal connectivity; first it's very very unlikely that only
> one incarnation notices the split-brain, and second it doesn't matter,
> for the vote by one incarnation is sufficient.)
> 
> Notice that this failure case is _not_ a regular monitoring failure; the
> incarnations themselves are still just fine. (Or they should report a
> real monitoring failure instead.) This means 'monitor' needs more
> semantics, essentially a special return code.
> 
> Essentially, at this point in time, the Master has to suspend IO for a
> while, because the WriteAcks from the Replica are obviously not
> arriving. (This already happens with drbd protocol C.)
> 
> We need to explicitly tell N1 that it is allowed to proceed, and make
> sure N2 knows that from that point on its local data is Outdated (which
> is a special case of 'Inconsistent') and that it must refuse to become
> Master unless forced manually (with "--I-want-to-lose-transactions"). The
> sequence obviously is to first tell N2 'mark-desync', and only when that
> has completed successfully allow N1 to resume.
> 
> This is, from the master resource point of view, identical to:
> 
> Time	N1	Link	N2
> 1	Master	ok	Replica		
> 2	Master	ok	crash
> 
> Master freezes, tells us about its internal split-brain, and we
> eventually tell it that yeah, we know, we have fenced N2 (post-fence is
> equivalent to a post-mark-desync notification). Here it also doesn't
> matter whether we receive the notification from N1 before or after we
> have noticed that N2 went down or failed. N2 has to know that if it
> crashed while being connected to a Master, it's by definition outdated.
> 
> 
> The ugliness arises, as hinted at above, from the overlap with another
> failure case, which I'm now going to illustrate.
> 
> Time	N1	Link	N2
> 1	Master	ok	Replica		Everything's fine.
> 2	crash	ok	Replica
> 
> If we notice that N1 is crashed first, that's fine. Everything will
> happen just as always, and N2 can proceed as soon as it sees the
> post-fence/stop notification, which it will see before being promoted to
> master or even being asked about it.
> 
> But, from the point of view of the replicated resource on N2, this is
> indistinguishable from the split-brain; all it knows is that it lost
> connection to its peer. So it goes on to report this.
> 
> If this event occurs before we have noticed a monitoring failure or full
> node failure on N1 and were using the recovery method explained so far,
> we are going to assume an internal split-brain, and tell N2 to mark
> itself outdated, and then try to tell N1 to resume.  Oops. No more
> talky-talky to N1, and we just told N2 it's supposed to refuse to become
> master.
> 
> So, this requires special logic - whenever one incarnation reports an
> internal split-brain, we actively need to go and verify the status of
> the other incarnations first.
> 
> In which case we'd notice that, ah, N1 is down or experiencing a local
> resource failure, and instead of outdating N2, would fence / stop N1 and
> then promote N2.
> 
> This is the special logic I don't much like. As Rusty put it in his
> keynote, "Fear of complexity" is good for programmers. And this reeks of
> it - extending the monitor semantics, needing an additional command on
> the secondary, _and_ needing to talk to all incarnations and then
> figuring out what to do. (I don't want to think much about partitions
> with >2 resources involved.) Alas, the problem seems to be real.


if a resource not in "Primary" state reports that it no longer knows
about its peer, there is no need to hurry and mark it outdated.
we just do nothing (well, or as an optimisation trigger an immediate
monitoring poll on the Primary, or even on all other peers).
nothing bad can happen.
since a passive replica can not do any harm, there is no point in
forcing it to refuse anything...

if it really was a primary crash, we will eventually recognize and do
the failover. if it is "just" a communication problem between the
replicas, the master will soon notice itself, too, freeze io, and
wait for confirmation of some fence operation (whether this is stonith
or "mark-outdated" is not important to the master). then it will
continue.

the point why we are concerned at all is that if the Primary lost
connection to its peer, and continues to just confirm transactions,
it may have been a total communications loss, and the CRM may decide to
fence it, and fail over to the other node. in which case transactions
that have been committed and confirmed between the connection loss event
and the actual stonith and failover are lost.

since the resource does not know, it has to block io until it gets
confirmation that the peer won't consider this node dead and continue in
master mode while this node still is in master mode... 
confirmation can be given when the CRM still can see the peer (and mark
it outdated), or if it can no longer see the peer (and stonith it).

the algorithm within the CRM is
 res = some replicating resource which no longer sees its peer
 if res is in master state
    fence the peer (by marking it outdated or stonithing it)
    tell res about that, and to continue
 if res is in passive state
    trigger immediate monitoring of the peer(s),
    but otherwise do nothing
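
the same decision, written as a small function. this is only a sketch of
the pseudo-code above; the action names returned here are placeholders
made up for the example and are not real CRM operations.

```python
def on_peer_lost(res_state, peer_reachable):
    """Return the CRM actions for a resource that no longer sees its peer.

    res_state:      "master" or "passive"
    peer_reachable: True if the CRM itself can still talk to the peer node
    """
    if res_state == "master":
        # fence the peer: outdate it if we can still reach it, else stonith
        fence = "mark-peer-outdated" if peer_reachable else "stonith-peer"
        return [fence, "tell-res-to-continue"]
    # passive replica: it cannot do any harm, so just look at the peer(s)
    return ["monitor-peers-now"]
```

note that the passive branch never outdates anything, which is what
removes the race with a crashed primary.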


	Lars

btw, sorry for the master,active,primary,slave,passive,secondary
confusion... we should probably agree on some terminology :-/


* Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-20 16:03 ` Lars Ellenberg
@ 2004-09-21 12:58   ` Philipp Reisner
  2004-09-24 15:15   ` [Linux-ha-dev] " Lars Marowsky-Bree
  1 sibling, 0 replies; 7+ messages in thread
From: Philipp Reisner @ 2004-09-21 12:58 UTC
  To: drbd-dev

[...]
> the algorithm within the CRM is
>  res = some replicating resource which no longer sees its peer
>  if res is in master state
>     fence the peer (by marking it outdated or stonithing it)
>     tell res about that, and to continue
>  if res is in passive state
>     trigger immediate monitoring of the peer(s),
>     but otherwise do nothing

Seems that we have exactly the same idea about this...

-Philipp
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :


* Re: [Linux-ha-dev] Re: [Drbd-dev] [RFC] Handling of internal split-brain in multiple state resources
  2004-09-20 16:03 ` Lars Ellenberg
  2004-09-21 12:58   ` Philipp Reisner
@ 2004-09-24 15:15   ` Lars Marowsky-Bree
  1 sibling, 0 replies; 7+ messages in thread
From: Lars Marowsky-Bree @ 2004-09-24 15:15 UTC
  To: linux-ha-dev, drbd-dev

On 2004-09-20T18:03:42,
   Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

Sorry it took me so long to reply, but I wanted a few moments to
actually read the mails, and I was somewhat busy this week.

I'm replying to you because it's the last mail in the thread and agrees
with what Philipp said ;-)

> the algorithm within the CRM is
>  res = some replicating resource which no longer sees its peer
>  if res is in master state
>     fence the peer (by marking it outdated or stonithing it)
>     tell res about that, and to continue
>  if res is in passive state
>     trigger immediate monitoring of the peer(s),
>     but otherwise do nothing

Yes, that's essentially the algorithm. I should have stuck with
pseudo-code instead of explaining it and fumbling around...

Philipp: This is a "special" case for the CRM, in as far as it's quite
different from anything we did before ;-)

It requires at least an extension to the 'monitor' semantics and a new
action for the secondary.

But, this seems to be what everyone who has considered it so far deems
necessary. (I've tried to make the problem go away and figure out how
the resource could handle it internally, but it really seems to need
support from the CRM as above.)

Ok.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company
