* [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Philipp Reisner @ 2004-08-20 12:52 UTC
  To: Lars Ellenberg, Lars Marowsky-Bree, drbd-dev

On Thursday 19 August 2004 14:14, Lars Ellenberg wrote:
[...]
> > Split-brain scenarios that end in Primary/Primary (both StandAlone) are
> > already covered in the new design I am writing up right now. What else?
>
> not all that unlikely:
>
> if the primary dies (or gets killed), but before dying somehow still
> manages to lose its drbd connection _and_ therefore increments its
> "ConnectedCount"...
>
> the "slave" now goes Secondary->Primary, but because it is < Connected
> it increments the ArbitraryCount...
>
> situation at the next connect:
>
> Flags: consistent, been primary last time
>
>   former Primary   1:X:Y:a+1:b  :10   (now Secondary after a reboot)
>   current Primary  1:X:Y:a  :b+1:10
>
> doh. the current Primary is supposed to become SyncTarget... shitty.
> --> current Primary goes StandAlone.
>
> next connection attempt (initiated by the operator)
> ... -> "split brain detected"
> --> both go StandAlone
>
> we may need to introduce an additional counter, a "CRM count", and the
> CRM, once it has shot the other node, should do a drbdsetup "--crm"
> (cf. --human) primary just to be safe; that would at least resolve the
> scenario described above...

Hi,

Right, old topic: what should we do after a split-brain situation?
I have looked up my papers from 2001 to understand why it is done
the way it is today.

The situation:

  N1       N2
  P  ---   S     Everything ok.
  P  - -   S     Link breaks.
  P  - -   P     A (also split-brained) Cluster-mgr makes N2 primary too.
  X        X     Both nodes down.
  P  ---   S     The current behaviour.

What should be done after split brain?

The current policy is that the node that was Primary before the
split-brain situation should be Primary afterwards.

This policy is hard-coded into DRBD. It is an arbitrary decision;
I thought it was a good idea.

The questions are:
  Should this policy be configurable? (IMO: yes)
  Which policies do we want to offer?

  * The node that was primary before split brain (current behaviour)
  * The node that became primary during split brain
  * The node that modified more of its data during the split-brain
    situation [ Do not think about implementation yet, just about
    the policy ]
  * others? ...

The second question to answer is:
What should we do when the connecting network heals? I.e.

  N1       N2
  P  ---   S     Everything ok.
  P  - -   S     Link breaks.
  P  - -   P     A (also split-brained) Cluster-mgr makes N2 primary too.
  ?  ---   ?     What now?

Current policy: the two nodes will refuse to connect. The administrator
has to resolve this.

Are there any other policies that would make sense?

-Philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
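For concreteness, a minimal, self-contained sketch of the generation-count
comparison and the policy list being discussed; all names, values and the
counter ordering are illustrative stand-ins reconstructed from this thread,
not DRBD's actual meta-data code:

  #include <stdio.h>

  /* Illustrative stand-ins; field names follow the discussion, not drbd_int.h. */
  struct gen_cnt {
          int consistent;     /* data usable at all?                            */
          int human_cnt;      /* bumped by "drbdsetup primary --human"          */
          int timeout_cnt;    /* bumped when the startup dialog times out       */
          int connected_cnt;  /* bumped when an established connection is lost  */
          int arbitrary_cnt;  /* bumped when becoming primary while unconnected */
  };

  /* Hypothetical after-split-brain policies, mirroring the list above. */
  enum sb_policy {
          SB_PRIMARY_BEFORE_SPLIT,   /* current, hard-coded behaviour */
          SB_PRIMARY_DURING_SPLIT,
          SB_MOST_MODIFIED,
          SB_WAIT_FOR_OPERATOR
  };

  /* Compare counters in descending priority; >0 means a has the "better" data. */
  static int cmp_gen(const struct gen_cnt *a, const struct gen_cnt *b)
  {
          if (a->consistent    != b->consistent)    return a->consistent    - b->consistent;
          if (a->human_cnt     != b->human_cnt)     return a->human_cnt     - b->human_cnt;
          if (a->timeout_cnt   != b->timeout_cnt)   return a->timeout_cnt   - b->timeout_cnt;
          if (a->connected_cnt != b->connected_cnt) return a->connected_cnt - b->connected_cnt;
          return a->arbitrary_cnt - b->arbitrary_cnt;
  }

  int main(void)
  {
          /* the quoted scenario: 1:X:Y:a+1:b vs. 1:X:Y:a:b+1 */
          struct gen_cnt former_primary  = { 1, 4, 2, 6, 3 };
          struct gen_cnt current_primary = { 1, 4, 2, 5, 4 };

          printf("sync source: %s primary\n",
                 cmp_gen(&former_primary, &current_primary) > 0 ? "former" : "current");
          return 0;
  }

With the example values from the quoted scenario this prints "former primary",
which is exactly the awkward outcome the mail complains about: the node that
is currently running as Primary would be asked to become SyncTarget.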
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Lars Ellenberg @ 2004-08-20 13:32 UTC
  To: drbd-dev

/ 2004-08-20 14:52:52 +0200
\ Philipp Reisner:
> On Thursday 19 August 2004 14:14, Lars Ellenberg wrote:
[...]
> Current policy: the two nodes will refuse to connect. The administrator
> has to resolve this.
>
> Are there any other policies that would make sense?

and one more scenario, which I described above and consider to be the
most likely one... and you seem to have missed the point...

  N1       N2
  P  ---   S     Everything ok.
  P  - -   S     N1 is failing, but for the moment it just can no longer
                 answer on the network; it is still able to update
                 drbd's generation counts
  ?  - -   S     Now N1 may be dead, or maybe not
  X  - -   S     A sane Cluster-mgr makes N2 primary, but stoniths N1 first ...
  X  - -   P     N1 now is really dead.
  S  ---   P     N1 comes back
  S  -:-   P     oops, N1 has "better" generation counts than N2.
                 N2 shall become SyncTarget, but since it is currently
                 Primary, it will refuse this. It goes StandAlone.

Now, I think in that case N1 needs special handling of the situation,
too, which it currently does not have.

Currently this situation is not readily resolvable. One would need to
first make N2 Secondary, too, and then either make it Primary again
using the --human flag (N2 will become SyncSource), or just reconnect
now (N2 will become SyncTarget).

I think we should allow drbdadm invalidate in
StandAlone(WFConnection) Secondary/Unknown, too.
It would then just clear MDF_Consistent.

Yet another deficiency:
we still do not handle the gen counts correctly in this situation:

  S --- S
  P --- S    drbdsetup primary --human

now N1 increments its human cnt, N2 only its connection count. After a
failure of N1, N2 will take over, and maybe be primary for a whole week.
Then N1 comes back, has the higher human count, and will
either [see above] (if N2 still is Primary)
or wipe out a week's worth of changes (if N2 was demoted to Secondary
in the meantime).

Oops :-(

	Lars Ellenberg
* [Drbd-dev] gen_counts and primary --human
  From: Lars Ellenberg @ 2004-08-23 14:28 UTC
  To: drbd-dev

> Yet another deficiency:
> we still do not handle the gen counts correctly in this situation:
>
>   S --- S
>   P --- S    drbdsetup primary --human
>
> now N1 increments its human cnt, N2 only its connection count. After a
> failure of N1, N2 will take over, and maybe be primary for a whole week.
> Then N1 comes back, has the higher human count, and will
> either [see above] (if N2 still is Primary)
> or wipe out a week's worth of changes (if N2 was demoted to Secondary
> in the meantime).
>
> Oops :-(

I'd suggest to ignore the --human flag when connected.
It does not make sense in the connected case.

	lge
* Re: [Drbd-dev] gen_counts and primary --human
  From: Lars Marowsky-Bree @ 2004-08-23 21:57 UTC
  To: drbd-dev

On 2004-08-23T16:28:04, Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

> I'd suggest to ignore the --human flag when connected.
> It does not make sense in the connected case.

That's even better.
* Re: [Drbd-dev] gen_counts and primary --human
  From: Philipp Reisner @ 2004-08-25 9:42 UTC
  To: drbd-dev

On Monday 23 August 2004 16:28, Lars Ellenberg wrote:
> > Yet another deficiency:
> > we still do not handle the gen counts correctly in this situation:
[...]
> > Oops :-(
>
> I'd suggest to ignore the --human flag when connected.
> It does not make sense in the connected case.

Right. We should fix that rather sooner than later.

-Philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Lars Marowsky-Bree @ 2004-08-23 21:56 UTC
  To: drbd-dev

On 2004-08-20T15:32:15, Lars Ellenberg <Lars.Ellenberg@linbit.com> said:

>   N1       N2
>   P  ---   S     Everything ok.
>   P  - -   S     N1 is failing, but for the moment it just can no longer
>                  answer on the network; it is still able to update
>                  drbd's generation counts
>   ?  - -   S     Now N1 may be dead, or maybe not
>   X  - -   S     A sane Cluster-mgr makes N2 primary, but stoniths N1 first ...

As you pointed out, the sane cluster manager (or admin) ought to be
setting a Kain flag when it knows for sure it has slain its brother...

Now, that would even help catch the case where the CRM had a malfunction
and made both sides primary, in which case it really really shouldn't
automatically connect, but will require higher-level help (to make one
side secondary first).

>   X  - -   P     N1 now is really dead.
>   S  ---   P     N1 comes back
>   S  -:-   P     oops, N1 has "better" generation counts than N2.
>                  N2 shall become SyncTarget, but since it is currently
>                  Primary, it will refuse this. It goes StandAlone.
>
> Now, I think in that case N1 needs special handling of the situation,
> too, which it currently does not have.

> Yet another deficiency:
> we still do not handle the gen counts correctly in this situation:
>
>   S --- S
>   P --- S    drbdsetup primary --human
>
> now N1 increments its human cnt, N2 only its connection count. After a
> failure of N1, N2 will take over, and maybe be primary for a whole week.
> Then N1 comes back, has the higher human count, and will
> either [see above] (if N2 still is Primary)
> or wipe out a week's worth of changes (if N2 was demoted to Secondary
> in the meantime).
>
> Oops :-(

Ouchie. Probably should send that across the wire...

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering       \ This space  /
SUSE Labs, Research and Development  | intentionally |
SUSE LINUX AG - A Novell company     \ left blank  /
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Philipp Reisner @ 2004-08-25 9:42 UTC
  To: drbd-dev

[...]
> and one more scenario, which I described above and consider to be the
> most likely one... and you seem to have missed the point...
>
>   N1       N2
>   P  ---   S     Everything ok.
>   P  - -   S     N1 is failing, but for the moment it just can no longer
>                  answer on the network; it is still able to update
>                  drbd's generation counts
>   ?  - -   S     Now N1 may be dead, or maybe not
>   X  - -   S     A sane Cluster-mgr makes N2 primary, but stoniths N1 first ...
>   X  - -   P     N1 now is really dead.
>   S  ---   P     N1 comes back
>   S  -:-   P     oops, N1 has "better" generation counts than N2.
>                  N2 shall become SyncTarget, but since it is currently
>                  Primary, it will refuse this. It goes StandAlone.
>
> Now, I think in that case N1 needs special handling of the situation,
> too, which it currently does not have.

So, the current policy is:

* The primary node refuses to connect to a peer with higher generation
  counts. This keeps the data intact. This is closely related to the
  other after-split-brain policy I want to make explicit.

* Remember the options so far (for primary-after-split-brain):

  - The node that was primary before split brain (current behaviour)
  - The node that became primary during split brain
  - The node that modified more of its data during the split-brain
    situation [ Do not think about implementation yet, just about
    the policy ]
  - None, wait for the operator's decision. [suggested by LMB]
  - The node that is currently primary [see example above by LGE]

* We should probably have a second configurable policy
  (losers-data-after-split-brain):

  - Keep
  - Overwrite

  Currently we have no clear line in regard to the
  losers-data-after-split-brain question.

> Currently this situation is not readily resolvable. One would need to
> first make N2 Secondary, too, and then either make it Primary again
> using the --human flag (N2 will become SyncSource), or just reconnect
> now (N2 will become SyncTarget).

Hmmm.

> I think we should allow drbdadm invalidate in
> StandAlone(WFConnection) Secondary/Unknown, too.
> It would then just clear MDF_Consistent.

For 0.7 that is a good idea, I think.

> Yet another deficiency:
> we still do not handle the gen counts correctly in this situation:
>
>   S --- S
>   P --- S    drbdsetup primary --human
>
> now N1 increments its human cnt, N2 only its connection count. After a
> failure of N1, N2 will take over, and maybe be primary for a whole week.
> Then N1 comes back, has the higher human count, and will
> either [see above] (if N2 still is Primary)
> or wipe out a week's worth of changes (if N2 was demoted to Secondary
> in the meantime).

The real bug here is that we allow the counters to become different
while the two nodes are connected. [I have to blame myself for allowing
the patches in; I blame Lars for writing them :)]

Here is an excerpt from the
http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
paper [middle of page 7]:

  With the exception of the consistency flag, connection indicator and
  the primary indicator, all parts of the meta-data are synchronized
  while communication is working. After system start the secondary node
  inherits the counter values from the newly selected primary node.

PS: I really like having documents that describe the ideas and
algorithms first, and then writing the code to conform to these
documents.

PS2: Sorry for the late answers lately...

-Philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
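The connect-time rule stated above ("the primary node refuses to connect to a
peer with higher generation counts") can be written down as a tiny decision
function. A sketch only; the names are illustrative, and the comparison input
is assumed to come from something like the counter sketch earlier in the
thread:

  #include <stdio.h>

  enum role   { R_SECONDARY, R_PRIMARY };
  enum action { A_NO_SYNC, A_SYNC_SOURCE, A_SYNC_TARGET, A_STAND_ALONE };

  /* gen_cmp: >0 means our data is "better", <0 means the peer's data is better. */
  static enum action on_connect(enum role my_role, int gen_cmp)
  {
          if (gen_cmp > 0)
                  return A_SYNC_SOURCE;
          if (gen_cmp < 0 && my_role == R_PRIMARY)
                  return A_STAND_ALONE;   /* never sync onto a running primary */
          if (gen_cmp < 0)
                  return A_SYNC_TARGET;
          return A_NO_SYNC;               /* counters equal: nothing to do */
  }

  int main(void)
  {
          static const char *names[] =
                  { "no sync", "sync source", "sync target", "stand alone" };

          printf("secondary, peer better: %s\n", names[on_connect(R_SECONDARY, -1)]);
          printf("primary,   peer better: %s\n", names[on_connect(R_PRIMARY,   -1)]);
          return 0;
  }

The second printf line is the case both Lars and Philipp describe: the node
that is still Primary goes StandAlone instead of becoming SyncTarget.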
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Lars Marowsky-Bree @ 2004-08-25 10:28 UTC
  To: Philipp Reisner, drbd-dev

On 2004-08-25T11:42:18, Philipp Reisner <philipp.reisner@linbit.com> said:

> So, the current policy is:
>
> * The primary node refuses to connect to a peer with higher generation
>   counts. This keeps the data intact. This is closely related to the
>   other after-split-brain policy I want to make explicit.

Makes sense.

> * Remember the options so far (for primary-after-split-brain):
>
>   - The node that was primary before split brain (current behaviour)
>   - The node that became primary during split brain
>   - The node that modified more of its data during the split-brain
>     situation [ Do not think about implementation yet, just about
>     the policy ]
>   - None, wait for the operator's decision. [suggested by LMB]
>   - The node that is currently primary [see example above by LGE]

Minor clarification: I think the question is not about "who becomes
primary", as the Sync* role is decoupled from that status, but about
which side drbd deems to have the good data and thus the SyncMaster.

Looking at it from this angle, we have two dimensions:

- Node state after the split-brain heals. Each side can either be
  primary or secondary.

- The data state on each side.

Now, obviously, if the node state of both sides is "primary", drbd can't
automatically do something, but _must_ wait for admin intervention. It
can't resolve this internally, because it would destroy the layers above.
-> _MUST_ wait for operator intervention.

(Embedded environments with a dumb cluster manager... Hmm... Ok, maybe
crashing one side (which inherently stops the higher layers and triggers
recovery) and thus reducing the problem to one of the somewhat simpler
ones below might work...)

If only one side is primary, and the algorithms determine that this one
has the good data, and the other side has not touched the data in
between, this is also a simple case.

If both sides are secondary, but only one side has modified the data
since or been primary, again it's simple.

If one side is primary, but the other side has been primary in between
(but not at the time of the connect), drbd can either wait for a
higher-level intervention, or sync the now-secondary. Only two options,
nothing else makes sense. (Changing the data underneath the primary
strikes me as an exceptionally bad idea.)

If both sides are secondary, but both sides have modified the data since,
then we have several choices, like picking the most recent (timestamp?),
most data modified, throwing a coin, or again waiting for admin
intervention. (Personally, I'd say operator intervention, after very
careful consideration of the problem, is in fact the only choice; this
scenario is only reached by a combination of several _severe_ faults.)

A special case obviously exists if one secondary side has inconsistent
data and the other has a consistent snapshot, in which case it is a
somewhat safer assumption to sync automatically from the consistent to
the inconsistent side. This should be the default, but may be
configurable...

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering          \\\  ///
SUSE Labs, Research and Development      \honk/
SUSE LINUX AG - A Novell company          \\//
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Philipp Reisner @ 2004-08-25 11:30 UTC
  To: drbd-dev

On Wednesday 25 August 2004 12:28, Lars Marowsky-Bree wrote:
> On 2004-08-25T11:42:18, Philipp Reisner <philipp.reisner@linbit.com> said:
> > So, the current policy is:
> >
> > * The primary node refuses to connect to a peer with higher generation
> >   counts. This keeps the data intact. This is closely related to the
> >   other after-split-brain policy I want to make explicit.
>
> Makes sense.
>
> Minor clarification: I think the question is not about "who becomes
> primary", as the Sync* role is decoupled from that status, but about
> which side drbd deems to have the good data and thus the SyncMaster.

Right. It is really a shame that I have not yet fully realized that
drbd-7 is the current stable release :)

> Looking at it from this angle, we have two dimensions:
>
> - Node state after the split-brain heals. Each side can either be
>   primary or secondary.
>
> - The data state on each side.
[...]

I will post a new version of the "Summary" mail...

-Philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem
  From: Lars Ellenberg @ 2004-08-25 13:38 UTC
  To: drbd-dev

> > Yet another deficiency:
> > we still do not handle the gen counts correctly in this situation:
> >
> >   S --- S
> >   P --- S    drbdsetup primary --human
> >
> > now N1 increments its human cnt, N2 only its connection count. After a
> > failure of N1, N2 will take over, and maybe be primary for a whole week.
> > Then N1 comes back, has the higher human count, and will
> > either [see above] (if N2 still is Primary)
> > or wipe out a week's worth of changes (if N2 was demoted to Secondary
> > in the meantime).
>
> The real bug here is that we allow the counters to become different
> while the two nodes are connected. [I have to blame myself for allowing
> the patches in; I blame Lars for writing them :)]
>
> Here is an excerpt from the
> http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
> paper [middle of page 7]:
>
>   With the exception of the consistency flag, connection indicator and
>   the primary indicator, all parts of the meta-data are synchronized
>   while communication is working. After system start the secondary node
>   inherits the counter values from the newly selected primary node.
>
> PS: I really like having documents that describe the ideas and
> algorithms first, and then writing the code to conform to these
> documents.
>
> PS2: Sorry for the late answers lately...

to defend myself, and to suggest it as the solution:

  Author: phil
  Date: 2004-07-27 18:58:09 +0200 (Tue, 27 Jul 2004)
  New Revision: 1459

  Modified:
     trunk/drbd/drbd_fs.c
     trunk/drbd/drbd_int.h
     trunk/drbd/linux/drbd.h
     trunk/drbd/linux/drbd_config.h
     trunk/user/drbdadm_main.c
     trunk/user/drbdsetup.c

  Log:
  * Increment the human-count if someone types in "yes" at the user's dialog.
  * Make sure the timeout-count is increased if the timeout expires at the
    user's dialog.

  Modified: trunk/drbd/drbd_fs.c
  ===================================================================
  --- trunk/drbd/drbd_fs.c	2004-07-27 09:01:05 UTC (rev 1458)
  +++ trunk/drbd/drbd_fs.c	2004-07-27 16:58:09 UTC (rev 1459)
  @@ -714,15 +714,6 @@
   			return -EIO;
   		}
   	}
  -#if 0
  -	else if (mdev->cstate >= Connected) {
  -		/* do NOT increase the Human count if we are connected,
  -		 * and there is no reason for it. I'm not yet sure
  -		 * wether this is what I mean, though...
  -		 */
  -		newstate &= ~(Human|DontBlameDrbd);
  -	}
  -#endif
   }
  ...

	lge
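Re-enabling the guard that revision 1459 removed is essentially the fix agreed
on earlier in the thread ("ignore --human when connected"). A standalone model
of that check, with the flag and state values as illustrative stand-ins for
the kernel definitions rather than the real drbd headers:

  #include <stdio.h>

  /* Illustrative stand-ins for the flags and connection states in the patch above. */
  enum { Human = 0x01, DontBlameDrbd = 0x02 };
  enum cstate { StandAlone = 0, WFConnection = 1, Connected = 2 };

  /* Drop the Human / DontBlameDrbd bits from the requested new state while we
   * are connected, so both nodes keep identical generation counts. */
  static int filter_newstate(enum cstate cs, int newstate)
  {
          if (cs >= Connected)
                  newstate &= ~(Human | DontBlameDrbd);
          return newstate;
  }

  int main(void)
  {
          printf("connected:  0x%x\n", filter_newstate(Connected,  Human | DontBlameDrbd));
          printf("standalone: 0x%x\n", filter_newstate(StandAlone, Human | DontBlameDrbd));
          return 0;
  }

With the guard in place, "drbdsetup primary --human" on a connected pair would
no longer push the two nodes' human counts apart.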
* [Drbd-dev] Another drbd race
  From: Lars Marowsky-Bree @ 2004-09-04 9:48 UTC
  To: drbd-dev

Hi,

lge and I yesterday discussed a 'new' drbd race condition and also
touched on its resolution.

Scope: in a split brain, drbd might confirm writes to the clients and
might, on a subsequent failover, lose transactions which _have been
confirmed_. This is not acceptable.

Sequence:

  Step  N1      Link    N2
  1     P       ok      S
  2     P       breaks  S       node1 notices, goes into stand alone,
                                stops waiting for N2 to confirm.
  3     P       broken  S       S notices, initiates fencing
  4     x       broken  P       N2 becomes primary

Writes which have been done between steps 2-4 will have been confirmed
to the higher layers, but are not actually available on N2. This is data
loss; N2 is still consistent, but has lost confirmed transactions.

Partially, this is solved by the Oracle-requested "only ever confirm if
committed to both nodes", but of course then, if it's not a broken link
but N2 really went down, we'd be blocking on N1 forever, which we don't
want to do for HA.

So, here's the new sequence to solve this:

  Step  N1      Link    N2
  1     P       ok      S
  2     P(blk)  ok      X       P blocks waiting for acks; heartbeat
                                notices that it has lost N2, and
                                initiates fencing.
  3     P(blk)  ok      fenced  heartbeat tells drbd on N1 that yes, we
                                know it's dead, we fenced it, no point
                                waiting.
  4     P       ok      fenced  Cluster proceeds to run.

Now, in this super-safe mode, if N1 also fails after step 3 but before
N2 comes back up and is resynced, we need to make sure that N2 refuses
to become primary itself. This will probably require additional magic in
the cluster manager to handle correctly, but N2 needs an additional flag
to prevent this from happening by accident.

Lars?

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering          \\\  ///
SUSE Labs, Research and Development      \honk/
SUSE LINUX AG - A Novell company          \\//
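A toy model of the "super-safe" sequence just described: a write is only
confirmed upward once the peer has acknowledged it, or once the cluster
manager has reported the peer as fenced. All names are illustrative; this
sketches the policy, not drbd's actual request handling:

  #include <stdio.h>

  enum peer_state { PEER_ALIVE, PEER_LOST, PEER_CONFIRMED_DEAD };

  struct pending_write { int acked_by_peer; };

  /* May we complete this write towards the file system / database? */
  static int may_complete(const struct pending_write *w, enum peer_state p)
  {
          if (w->acked_by_peer)
                  return 1;                     /* normal case: both copies have it */
          return p == PEER_CONFIRMED_DEAD;      /* only after fencing was reported  */
          /* PEER_LOST alone keeps the write blocked: that is exactly the window
           * in which the old behaviour could confirm a transaction N2 never saw. */
  }

  int main(void)
  {
          struct pending_write w = { 0 };
          printf("link just broke:   %d\n", may_complete(&w, PEER_LOST));           /* 0: block  */
          printf("after fencing ack: %d\n", may_complete(&w, PEER_CONFIRMED_DEAD)); /* 1: resume */
          return 0;
  }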
* Re: [Drbd-dev] Another drbd race
  From: Lars Ellenberg @ 2004-09-04 10:00 UTC
  To: drbd-dev

On Sat, Sep 04, 2004 at 11:48:14AM +0200, Lars Marowsky-Bree wrote:
[...]
> Now, in this super-safe mode, if N1 also fails after step 3 but before
> N2 comes back up and is resynced, we need to make sure that N2 refuses
> to become primary itself. This will probably require additional magic in
> the cluster manager to handle correctly, but N2 needs an additional flag
> to prevent this from happening by accident.
>
> Lars?

I think we can already do this detection with the combination of the
Consistent and Connected as well as the HaveBeenPrimary flags. Only the
logic needs to be built in.

Most likely, right after connection loss the Primary should block for a
configurable (default: infinity?) amount of time before giving end_io
events back to the upper layer.

We then need to be able to tell it to resume operation (we can do this
as soon as we have taken precautions to prevent the Secondary from
becoming Primary without being forced or resynced before).

Or, if the cluster decides to do so, the Secondary has time to STONITH
the Primary (while that is still blocking) and take over.

I want to include a timeout, so the cluster manager does not need to
know about a "peer is dead" notification; it only needs to know about
STONITH.

Maybe we want to introduce this functionality as a new wire protocol,
or only in protocol C.

	lge
* Re: [Drbd-dev] Another drbd race
  From: Lars Marowsky-Bree @ 2004-09-04 10:18 UTC
  To: Lars Ellenberg, drbd-dev

On 2004-09-04T12:00:08, Lars Ellenberg <lars.ellenberg@linbit.com> said:

Yep, that should be enough to detect this on the secondary. But:

> Most likely, right after connection loss the Primary should block for a
> configurable (default: infinity?) amount of time before giving end_io
> events back to the upper layer.
> We then need to be able to tell it to resume operation (we can do this
> as soon as we have taken precautions to prevent the Secondary from
> becoming Primary without being forced or resynced before).
>
> Or, if the cluster decides to do so, the Secondary has time to STONITH
> the Primary (while that is still blocking) and take over.
>
> I want to include a timeout, so the cluster manager does not need to
> know about a "peer is dead" notification; it only needs to know about
> STONITH.

If it defaults to an 'infinite' timeout, which is safe, we need the
resume operation. (Or rather, notification about the successful "peer is
dead now" event.) This is easy to add.

And it is needed, because

a) if the fencing _failed_, the primary needs to stay blocked until it
   eventually succeeds. This is a correctness issue.

b) otherwise drbd would _always_ block for at least that amount of time
   when it lost the secondary, even though it has been fenced for seconds
   (or we may even have fenced it before drbd's internal peer timeout
   hits, in which case it wouldn't ever block). This is a performance
   issue.

The combination of a+b gives a very good argument for having a resume
operation, which the new CRM will be able to drive in a couple of weeks
;-)

> Maybe we want to introduce this functionality as a new wire protocol,
> or only in protocol C.

It doesn't actually need to be a new wire protocol; it just needs an
additional option set (i.e., the Oracle mode) and the 'resume' operation
on the primary; or actually, that could be mapped to an explicit switch
from WFConnection to StandAlone.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering          \\\  ///
SUSE Labs, Research and Development      \honk/
SUSE LINUX AG - A Novell company          \\//
* Re: [Drbd-dev] Another drbd race
  From: Lars Ellenberg @ 2004-09-04 10:43 UTC
  To: Lars Marowsky-Bree
  Cc: drbd-dev

On Sat, Sep 04, 2004 at 12:18:14PM +0200, Lars Marowsky-Bree wrote:
> If it defaults to an 'infinite' timeout, which is safe, we need the
> resume operation. (Or rather, notification about the successful "peer is
> dead now" event.) This is easy to add.
>
> And it is needed, because
>
> a) if the fencing _failed_, the primary needs to stay blocked until it
>    eventually succeeds. This is a correctness issue.
>
> b) otherwise drbd would _always_ block for at least that amount of time
>    when it lost the secondary, even though it has been fenced for seconds
>    (or we may even have fenced it before drbd's internal peer timeout
>    hits, in which case it wouldn't ever block). This is a performance
>    issue.
>
> The combination of a+b gives a very good argument for having a resume
> operation, which the new CRM will be able to drive in a couple of weeks
> ;-)

I did not say we need either/or; I say I want an _additional_ timeout,
which defaults to infinity, so I have the _choice_ to run with a cluster
manager that only knows about stonith (and yes, then there still remains
a race, and it might just block for, say, 2 minutes even if it won't
need to, but it won't lose any writes anymore). Of course we can
optimize, and I'd like to; but we need to be correct first.

So don't argue if you don't disagree.

> > Maybe we want to introduce this functionality as a new wire protocol,
> > or only in protocol C.
>
> It doesn't actually need to be a new wire protocol; it just needs an
> additional option set (i.e., the Oracle mode) and the 'resume' operation
> on the primary; or actually, that could be mapped to an explicit switch
> from WFConnection to StandAlone.

I did not say it needs to be; I suggest it would make sense, and that it
won't make much sense to have that option with protocol A or B, because
it would make the user "feel" he never loses writes while the
asynchronous protocols might lose commits anyway.

The "Oracle" option I'd like to call "write quorum", and that is a
different, though related, issue: we either make sure a write reaches at
least two (we currently cannot do more than that) independent stable
storages, or we do not acknowledge the write at all (or maybe even fail
it, if that makes any sense) to the application layer. We then no longer
have service HA (unless we introduce a concept of additional peers and
multiple hot-standby mirrors), but we have data security.

This is indeed not a protocol change, but an option, the implementation
of which needs to be verified and improved. But at least we have it.

	lge
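The "write quorum" idea can be stated as a one-line rule. A sketch, assuming
acknowledgements per stable storage are simply counted; quorum 2 is the safe
mode discussed here, quorum 1 is today's behaviour:

  #include <stdio.h>

  /* Illustrative: acknowledge a write to the application only once it is on
   * at least `quorum` independent stable storages (currently at most 2). */
  static int write_quorum_ok(int on_local_disk, int on_peer_disk, int quorum)
  {
          return (on_local_disk + on_peer_disk) >= quorum;
  }

  int main(void)
  {
          /* with quorum 2, losing the peer means writes are not acknowledged at all */
          printf("peer gone, quorum 2: %d\n", write_quorum_ok(1, 0, 2)); /* 0 */
          printf("both ok,   quorum 2: %d\n", write_quorum_ok(1, 1, 2)); /* 1 */
          /* quorum 1 keeps the service available but gives the weaker guarantee */
          printf("peer gone, quorum 1: %d\n", write_quorum_ok(1, 0, 1)); /* 1 */
          return 0;
  }

This is the trade-off named in the mail: quorum 2 buys data security at the
price of service availability when only one copy is reachable.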
* Re: [Drbd-dev] Another drbd race
  From: Lars Marowsky-Bree @ 2004-09-04 10:51 UTC
  To: Lars Ellenberg
  Cc: drbd-dev

On 2004-09-04T12:43:38, Lars Ellenberg <lars.ellenberg@linbit.com> said:

> I did not say we need either/or; I say I want an _additional_ timeout,

Ah, misread that. Ok, then we agree. Seems ok.

> need to, but it won't lose any writes anymore). Of course we can
> optimize, and I'd like to; but we need to be correct first.
> So don't argue if you don't disagree.

I wasn't realizing I wasn't disagreeing ;-) I thought you meant the
additional timeout relative to the current situation and instead of the
resume.

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

--
High Availability & Clustering          \\\  ///
SUSE Labs, Research and Development      \honk/
SUSE LINUX AG - A Novell company          \\//
* Re: [Drbd-dev] Another drbd race
  From: Philipp Reisner @ 2004-09-07 9:39 UTC
  To: drbd-dev

On Saturday 04 September 2004 12:00, Lars Ellenberg wrote:
> On Sat, Sep 04, 2004 at 11:48:14AM +0200, Lars Marowsky-Bree wrote:
[...]
> > Now, in this super-safe mode, if N1 also fails after step 3 but before
> > N2 comes back up and is resynced, we need to make sure that N2 refuses
> > to become primary itself. This will probably require additional magic
> > in the cluster manager to handle correctly, but N2 needs an additional
> > flag to prevent this from happening by accident.
> >
> > Lars?
>
> I think we can already do this detection with the combination of the
> Consistent and Connected as well as the HaveBeenPrimary flags. Only the
> logic needs to be built in.

I do not want to "misuse" the Consistent bit for this.

  !Consistent ... means that we are in the middle of a sync,
                  i.e. the data is not usable at all.
  Fenced      ... our data is 100% okay, but not the latest copy.

> Most likely, right after connection loss the Primary should block for a
> configurable (default: infinity?) amount of time before giving end_io
> events back to the upper layer.
> We then need to be able to tell it to resume operation (we can do this
> as soon as we have taken precautions to prevent the Secondary from
> becoming Primary without being forced or resynced before).
>
> Or, if the cluster decides to do so, the Secondary has time to STONITH
> the Primary (while that is still blocking) and take over.
>
> I want to include a timeout, so the cluster manager does not need to
> know about a "peer is dead" notification; it only needs to know about
> STONITH.

I see. Makes sense, but on the other hand STONITH (more general: FENCING)
might fail, as LMB points out in one of the other mails.

-> We should probably _not_ offer a timeout here: as soon as
   "on-disconnect freeze_io;" is set, it stays frozen forever,
   or until it gets a "drbdadm resume-io r0" from the cluster manager.

> Maybe we want to introduce this functionality as a new wire protocol,
> or only in protocol C.

I see it controlled by the

  "on-disconnect freeze_io;" option.

For N2 we need a "drbdadm fence-off r0" command, and for N1 we need
a "drbdadm resume-io r0".

* The fenced bit gets cleared when the resync is finished.
* A node refuses to become primary when the fenced bit is set.
* "drbdadm -- --do-what-I-say primary r0" overrules (and clears?)
  the fenced bit.

To be defined: what should we do at node startup with the fenced bit?
(At least display it at the user dialog.)

-philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
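The rules listed above for the proposed fenced/Outdated bit, collected into
one small model. The command names and the --do-what-I-say override follow
this mail; they are proposals under discussion, not existing drbdadm
behaviour:

  #include <stdio.h>

  struct node_md { int outdated; };   /* the proposed "fenced"/"Outdated" meta-data bit */

  /* peer-side fencing ("drbdadm fence r0" there): mark this node's data as outdated */
  static void cmd_outdate(struct node_md *md)        { md->outdated = 1; }

  /* the bit is cleared when a resync to this node has finished */
  static void on_resync_finished(struct node_md *md) { md->outdated = 0; }

  /* a node refuses to become primary while the bit is set,
   * unless the operator overrules it with --do-what-I-say */
  static int may_become_primary(const struct node_md *md, int do_what_i_say)
  {
          if (!md->outdated)
                  return 1;
          return do_what_i_say;   /* overrules (and perhaps clears) the bit */
  }

  int main(void)
  {
          struct node_md n2 = { 0 };
          cmd_outdate(&n2);
          printf("outdated, promote:         %d\n", may_become_primary(&n2, 0)); /* 0 */
          printf("outdated, --do-what-I-say: %d\n", may_become_primary(&n2, 1)); /* 1 */
          on_resync_finished(&n2);
          printf("after resync, promote:     %d\n", may_become_primary(&n2, 0)); /* 1 */
          return 0;
  }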
* Re: [Drbd-dev] Another drbd race
  From: Lars Ellenberg @ 2004-09-07 10:13 UTC
  To: drbd-dev

On Tue, Sep 07, 2004 at 11:39:29AM +0200, Philipp Reisner wrote:
[...]
> I do not want to "misuse" the Consistent bit for this.
>
>   !Consistent ... means that we are in the middle of a sync,
>                   i.e. the data is not usable at all.
>   Fenced      ... our data is 100% okay, but not the latest copy.

Let's call it "Outdated".

My idea is that a crashed Secondary will come up as !Primary|Connected,
so it can assume it is outdated. (Similar to the choice about wfc-degr...)

We can only possibly lose write transactions in the very moment we
promote a Secondary to Primary. Until we do that, and as long as the
hard disk the transactions have been written to is still physically
intact, the data is still there, though maybe not available.

So we can try to make sure that we never promote a Secondary that is
possibly (or knowingly) outdated.

See below.

> I see. Makes sense, but on the other hand STONITH (more general: FENCING)
> might fail, as LMB points out in one of the other mails.
>
> -> We should probably _not_ offer a timeout here: as soon as
>    "on-disconnect freeze_io;" is set, it stays frozen forever,
>    or until it gets a "drbdadm resume-io r0" from the cluster manager.
>
> I see it controlled by the
>
>   "on-disconnect freeze_io;" option.
>
> For N2 we need a "drbdadm fence-off r0" command, and for N1 we need
> a "drbdadm resume-io r0".
>
> * The fenced bit gets cleared when the resync is finished.
> * A node refuses to become primary when the fenced bit is set.
> * "drbdadm -- --do-what-I-say primary r0" overrules (and clears?)
>   the fenced bit.
>
> To be defined: what should we do at node startup with the fenced bit?
> (At least display it at the user dialog.)

I would like to introduce an additional node state for the o_state:
Dead. It is never "recognized" internally, but can be set by the
operator or cluster manager. Basically, if we go to WhatEver/Unknown,
we don't accept anything (since we don't want to risk split brain).
Some higher authority can and needs to resolve this, telling us the peer
is dead (after a successful stonith, when we are Secondary and shall be
promoted).

Now we have this:

  P/S --- S/P
  P/? -:- S/?

A) if this is in fact (from the pov of heartbeat)
     P/? -..        XXX
   we stonith it (just to be sure) and tell it "peer dead"
     P/D -..
   (and there it resumes).

B) if this is in fact (from the pov of heartbeat)
     P/?    XXX     S/?
   - we do nothing
     (blocks until the network is fixed again)
   - we tell S that it is outdated,
     then tell P to resume
   - or we make it (by STONITH) into either A or C

C) if this is in fact (from the pov of heartbeat)
     XXX        ..- S/?
   we stonith it (just to be sure) and tell it "peer dead"
     XXX        ..- S/D
   (and there it accepts to be promoted again).

Similarly after bootup:
we refuse to be promoted to Primary from Secondary/Unknown,
unless we got an explicit "peer dead" confirmation from someone.

Does that make any sense?

	lge
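The promotion rule at the end of this mail ("refuse to be promoted to Primary
from Secondary/Unknown unless someone confirmed the peer dead") as a small
decision function; the extra Dead peer state is the one proposed above,
everything else is illustrative:

  #include <stdio.h>

  enum peer_state { PEER_SECONDARY, PEER_PRIMARY, PEER_UNKNOWN, PEER_DEAD /* proposed */ };

  /* May a Secondary be promoted to Primary right now? */
  static int may_promote(enum peer_state peer)
  {
          switch (peer) {
          case PEER_SECONDARY: return 1;  /* connected, peer stays secondary          */
          case PEER_DEAD:      return 1;  /* operator / CRM confirmed "peer dead"     */
          case PEER_PRIMARY:   return 0;  /* would be a second primary                */
          case PEER_UNKNOWN:   return 0;  /* could be split brain: wait for authority */
          }
          return 0;
  }

  int main(void)
  {
          printf("peer unknown: %d\n", may_promote(PEER_UNKNOWN)); /* refused */
          printf("peer dead:    %d\n", may_promote(PEER_DEAD));    /* allowed */
          return 0;
  }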
* Re: [Drbd-dev] Another drbd race
  From: Philipp Reisner @ 2004-09-07 11:32 UTC
  To: drbd-dev

On Tuesday 07 September 2004 12:13, Lars Ellenberg wrote:
[...]
> I would like to introduce an additional node state for the o_state:
> Dead. It is never "recognized" internally, but can be set by the
> operator or cluster manager. Basically, if we go to WhatEver/Unknown,
> we don't accept anything (since we don't want to risk split brain).
> Some higher authority can and needs to resolve this, telling us the peer
> is dead (after a successful stonith, when we are Secondary and shall be
> promoted).
[...]
> Similarly after bootup:
> we refuse to be promoted to Primary from Secondary/Unknown,
> unless we got an explicit "peer dead" confirmation from someone.
>
> Does that make any sense?

I like it a lot!

Thus we will not call it "drbdadm resume-io r0" but
"drbdadm peer-dead r0".

I think the assertion that the peer is dead (short: "peer-dead") is a
lot easier to understand than a "resume-io" command.

Also the question at the startup user dialog:

  Is the peer dead?

is easier to get right...

-Philipp
--
: Dipl-Ing Philipp Reisner                       Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH           Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com  :
* Re: [Drbd-dev] Another drbd race
  From: Lars Ellenberg @ 2004-09-07 12:05 UTC
  To: drbd-dev

On Tue, Sep 07, 2004 at 01:32:02PM +0200, Philipp Reisner wrote:
[...]
> Thus we will not call it "drbdadm resume-io r0" but
> "drbdadm peer-dead r0".
>
> I think the assertion that the peer is dead (short: "peer-dead") is a
> lot easier to understand than a "resume-io" command.
>
> Also the question at the startup user dialog:
>
>   Is the peer dead?
>
> is easier to get right...

Maybe we still need to make this a two-stage process:

After a reboot, when we remain in Secondary/Unknown, we need to be told
"peer dead", but we also need to get the confirmation "up-to-date"
(just to cover our ass).

When it was just a connection loss, we *are* up to date, and just need
the confirmation "peer dead"; or we get the confirmation "link dead,
peer alive", which basically means "you are outdated!".

Just so we cannot be blamed for "automatically losing transactions",
even in a multiple-failure scenario.

	lge
* Re: [Drbd-dev] Another drbd race 2004-09-07 12:05 ` Lars Ellenberg @ 2004-09-07 12:12 ` Lars Marowsky-Bree 0 siblings, 0 replies; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-09-07 12:12 UTC (permalink / raw) To: Lars Ellenberg, drbd-dev On 2004-09-07T14:05:02, Lars Ellenberg <lars.ellenberg@linbit.com> said: > maybe we still need to have this a two-stage process: > after reboot, and we remain in Secondary/Unknown, > we need to be told "peer dead", but we also need to get the confirmation > "up-to-date" (just to cover our ass). > when it was just a connection loss, we *are* up-to-date, and just need the > confirmation "peer dead"; or we get the confirmation "link dead, peer > alive", which basically is "you are outdated!". > > just so we cannot be blamed for "automatically losing transactions", > even in a multiple failure scenario. I think peer-dead is sufficient. I don't see the additional problem solved here? Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \\\ /// SUSE Labs, Research and Development \honk/ SUSE LINUX AG - A Novell company \\// ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-07 11:32 ` Philipp Reisner 2004-09-07 12:05 ` Lars Ellenberg @ 2004-09-07 12:06 ` Lars Marowsky-Bree 1 sibling, 0 replies; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-09-07 12:06 UTC (permalink / raw) To: Philipp Reisner, drbd-dev On 2004-09-07T13:32:02, Philipp Reisner <philipp.reisner@linbit.com> said: > > similar after bootup: > > we refuse to be promoted to Primary from Secondary/Unknown, > > unless we got an explicit "peer dead" confirmation by someone. > > > > does that make any sense? > > > I like it a lot! > > Thus we will not call it "drbdadm resume-io r0" but > "drbdadm peer-dead r0" The drbd proof of concept Resource Agent I wrote actually calls this command "mark-peer-dead", so we are quite in alignment ;-) Note that in particular this could be set before even drbd itself notices; which is more easily understood than getting a "resume_io" command before drbd had even suspended IO. > Also the question at the startup-user-dialog: > > Is the peer dead ? > > Is easier to get right.... Yep. It's also much easier to code for. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \\\ /// SUSE Labs, Research and Development \honk/ SUSE LINUX AG - A Novell company \\// ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-07 10:13 ` Lars Ellenberg 2004-09-07 11:32 ` Philipp Reisner @ 2004-09-07 12:19 ` Philipp Reisner 2004-09-07 12:28 ` Lars Marowsky-Bree 2004-09-07 15:55 ` Lars Ellenberg 1 sibling, 2 replies; 32+ messages in thread From: Philipp Reisner @ 2004-09-07 12:19 UTC (permalink / raw) To: drbd-dev > > I do not want to "misuse" the Consistent Bit for this. > > > > !Consistent .... means that we are in the middle of a sync. > > = data is not usable at all. > > Fenced .... our data is 100% okay, but not the latest copy. > > lets call it "Outdated" > > my idea is that a crashed Secondary will come up as !Primary|Connected, so > it can assume it is outdated. (similar to the choice about wfc-degr...) > > we can only possibly lose write transaction in the very moment we > promote a Secondary to Primary. until we do that, and the harddisk where > the transactions have been written to is still physically intact, the > data is still there, though maybe not available. > > we can try to make sure that we never promote a Secondary that possibly > (or knowingly) is outdated. > > see below. > Let us assume that we have two boxes (N1 and N2) and that these two boxes are connected by two networks (net and cnet [ clients'-net ]). Net is used by DRBD, while heartbeat uses both net and cnet. I know that you are talking about fencing by STONITH, but DRBD is not limited to that. Here is my understanding of how fencing (other than STONITH) could work with DRBD-0.8: N1 net N2 P/S --- S/P everything up and running. P/? - - S/? network breaks ; N1 freezes IO P/? - - S/? N1 fences N2: In the STONITH case: turn off N2. In the "smart" case: N1 asks N2 to fence itself from the storage via cnet. HB calls "drbdadm fence r0" on N2. N2 replies to N1 via cnet that fencing is done. N1 calls "drbdadm peer-dead r0". P/D - - S/? N1 thaws IO N2 got the "Outdated" flag set in its meta-data by the "fence" command. I am not sure if it should be called "fence"; other ideas: "considered-dead", "die", "fence", "outdate". What do you think? My question is: Is it planned that heartbeat will be able to perform this kind of fencing? -Philipp -- : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com : ^ permalink raw reply [flat|nested] 32+ messages in thread
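Philipp's message above reads as a small recovery procedure driven by heartbeat over the surviving client network. The sketch below just spells that sequence out; the drbdadm sub-commands are the ones being proposed in this thread ("fence"/"outdate", "peer-dead") and are assumptions about a future interface, not documented options.

    # Illustrative walk-through of the "smart" (non-STONITH) fencing sequence.
    # run() only prints the calls; the sub-command names follow the proposal in
    # this thread and are assumptions, not a documented drbdadm interface.

    def run(node: str, cmd: str) -> None:
        print(f"[{node}] {cmd}")

    def recover_after_replication_link_loss(primary: str = "N1", secondary: str = "N2") -> None:
        # 1. On loss of the replication link the Primary freezes IO
        #    (the proposed "on-disconnect freeze_io;" behaviour).
        print(f"[{primary}] replication link lost, IO frozen")

        # 2. Via the still-working client network (cnet), heartbeat tells the
        #    Secondary to fence itself, i.e. set the Outdated flag in its meta-data.
        run(secondary, "drbdadm fence r0")

        # 3. Once the Secondary has confirmed, the Primary may treat the peer as
        #    dead and resume IO.
        run(primary, "drbdadm peer-dead r0")
        print(f"[{primary}] IO resumed")

    if __name__ == "__main__":
        recover_after_replication_link_loss()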
* Re: [Drbd-dev] Another drbd race 2004-09-07 12:19 ` Philipp Reisner @ 2004-09-07 12:28 ` Lars Marowsky-Bree 2004-09-07 12:47 ` Philipp Reisner 2004-09-07 15:55 ` Lars Ellenberg 1 sibling, 1 reply; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-09-07 12:28 UTC (permalink / raw) To: Philipp Reisner, drbd-dev On 2004-09-07T14:19:55, Philipp Reisner <philipp.reisner@linbit.com> said: > N1 net N2 > P/S --- S/P everything up and running. > P/? - - S/? network breaks ; N1 freezes IO > P/? - - S/? N1 fences N2: > In the Stonith case: turn off N2. > In the "smart" case: > N1 asks N2 to fence itself from the storage via cnet. > HB calls "drbdadm fence r0" on N2. > N2 replies to N1 that fencins is done via cnet. > N1 calls "drbdadm peer-dead r0". > P/D - - S/? N1 thaws IO In this scenario, there is no need to have a special 'fence' operation. IO is already frozen, so we can simply 'stop' drbd on N2 if we so chose. (Which will freeze the generation counter, will set the Outdated flag etc.) > N2 got the the "Outdated" flag set in its meta-data, by the "fence" > command. I am not sure if it should be called "fence", other ideas: > "considered-dead","die","fence","outdate". What do you think ? > > My question is: > Is it planed that heartbeat will be able to perform this kind of fencing ? It actually comes for free. If however heartbeat fences or stops N1 in this case, we'll deliver a (successful) stop or fence notification to the incarnation running on N2, which will call out to 'drbdadm mark-peer-dead' and basically do the same. You may find http://wiki.trick.ca/linux-ha/ClusterResourceManager_2fMultipleIncarnationResources an interesting read here. (Also the link to the MultiStateResources in there.) It's almost implemented, the drbd agent is already written (proof of concept code stage), and Andrew tells me we'll have the multiple incarnations within a few weeks too. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \\\ /// SUSE Labs, Research and Development \honk/ SUSE LINUX AG - A Novell company \\// ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-07 12:28 ` Lars Marowsky-Bree @ 2004-09-07 12:47 ` Philipp Reisner 2004-09-08 11:20 ` Lars Marowsky-Bree 0 siblings, 1 reply; 32+ messages in thread From: Philipp Reisner @ 2004-09-07 12:47 UTC (permalink / raw) To: drbd-dev On Tuesday 07 September 2004 14:28, Lars Marowsky-Bree wrote: > On 2004-09-07T14:19:55, > > Philipp Reisner <philipp.reisner@linbit.com> said: > > N1 net N2 > > P/S --- S/P everything up and running. > > P/? - - S/? network breaks ; N1 freezes IO > > P/? - - S/? N1 fences N2: > > In the Stonith case: turn off N2. > > In the "smart" case: > > N1 asks N2 to fence itself from the storage via cnet. > > HB calls "drbdadm fence r0" on N2. > > N2 replies to N1 that fencins is done via cnet. > > N1 calls "drbdadm peer-dead r0". > > P/D - - S/? N1 thaws IO > > In this scenario, there is no need to have a special 'fence' operation. > IO is already frozen, so we can simply 'stop' drbd on N2 if we so > chose. (Which will freeze the generation counter, will set the Outdated > flag etc.) > > > N2 got the the "Outdated" flag set in its meta-data, by the "fence" > > command. I am not sure if it should be called "fence", other ideas: > > "considered-dead","die","fence","outdate". What do you think ? > > > > My question is: > > Is it planed that heartbeat will be able to perform this kind of fencing > > ? > > It actually comes for free. > > If however heartbeat fences or stops N1 in this case, we'll deliver a > (successful) stop or fence notification to the incarnation running on > N2, which will call out to 'drbdadm mark-peer-dead' and basically do the > same. (I think you swapped N1 and N2 accidentally in the paragraph above.) No. It would be better to have a "drbdadm fence r0" operation on N2! The "drbdadm fence r0" command would only set the "Outdated" flag. The big advantage over stopping DRBD on N2 is that in case the network recovers, N2 will be resynced to up-to-date automatically. > > It's almost implemented, the drbd agent is already written (proof of > concept code stage), and Andrew tells me we'll have the multiple > incarnations within a few weeks too. > I hope that it is possible to make this agent call "drbdadm fence r0" on the secondary instead of "/etc/init.d/drbd stop". -philipp -- : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com : ^ permalink raw reply [flat|nested] 32+ messages in thread
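The difference Philipp is pointing at shows up when the replication link comes back. A tiny comparison sketch (purely illustrative, restating the argument in the mail) of the two alternatives:

    # What happens on the secondary when the network heals, depending on how it
    # was fenced?  Purely illustrative.

    def on_network_heal(secondary_state: str) -> str:
        if secondary_state == "outdated":
            # DRBD still runs on N2: the Outdated secondary reconnects, becomes
            # SyncTarget and is brought back to up-to-date automatically.
            return "reconnect -> resync -> UpToDate"
        if secondary_state == "stopped":
            # DRBD was stopped on N2: nothing happens until someone starts it again.
            return "no replication until drbd is started again"
        return "unchanged"

    print(on_network_heal("outdated"))
    print(on_network_heal("stopped"))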
* Re: [Drbd-dev] Another drbd race 2004-09-07 12:47 ` Philipp Reisner @ 2004-09-08 11:20 ` Lars Marowsky-Bree 2004-09-08 11:31 ` Lars Ellenberg 2004-09-08 11:33 ` Philipp Reisner 0 siblings, 2 replies; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-09-08 11:20 UTC (permalink / raw) To: Philipp Reisner, drbd-dev On 2004-09-07T14:47:45, Philipp Reisner <philipp.reisner@linbit.com> said: > No. It would be better to have a "drbdadm fence r0" operation on N2! > The "drbdadm fence r0" command would only set the "Outdated" flag. Well, it's automatically supposed to assume it's outdated when it crashes in S-P mode. When the secondary loses connection to the primary, a mark-peer-dead would prevent that flag from being set. So, why an explicit drbdadm fence operation? I'm missing what that would catch. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \\\ /// SUSE Labs, Research and Development \honk/ SUSE LINUX AG - A Novell company \\// ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-08 11:20 ` Lars Marowsky-Bree @ 2004-09-08 11:31 ` Lars Ellenberg 2004-09-08 15:11 ` Lars Marowsky-Bree 2004-09-08 11:33 ` Philipp Reisner 1 sibling, 1 reply; 32+ messages in thread From: Lars Ellenberg @ 2004-09-08 11:31 UTC (permalink / raw) To: drbd-dev On Wed, Sep 08, 2004 at 01:20:01PM +0200, Lars Marowsky-Bree wrote: > On 2004-09-07T14:47:45, > Philipp Reisner <philipp.reisner@linbit.com> said: > > > No. It would be better to have a "drbdadm fence r0" operation on N2! > > The "drbdadm fence r0" command would only set the "Outdated" flag. > > Well, it's automatically supposed to assume it's outdated when it > crashes in S-P mode. > > When the secondary loses connection to the primary, a mark-peer-dead > would prevent that flag from being set. > > So, why an explicit drbdadm fence operation? I'm missing what that would > catch. we probably can cope without, but it is more "polite" if we have it. if we _can_ handle it explicitly, why not? implicit things are easier to overlook... and: P --- S P xxx S link breaks [ you can insert here even a complete cluster crash ] X xxx S N2 receives "Peer dead", but still is outdated. the point is: just receiving a "peer definitely dead" in S/? is not enough to know that we are not outdated. lge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-08 11:31 ` Lars Ellenberg @ 2004-09-08 15:11 ` Lars Marowsky-Bree 2004-09-08 15:22 ` Lars Ellenberg 0 siblings, 1 reply; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-09-08 15:11 UTC (permalink / raw) To: Lars Ellenberg, drbd-dev On 2004-09-08T13:31:10, Lars Ellenberg <lars.ellenberg@linbit.com> said: > > So, why an explicit drbdadm fence operation? I'm missing what that would > > catch. > > we probably can cope without, but it is more "polite" if we have it. > if we _can_ handle it explicit, why not? We _need_ to handle it implicitly in case we lose the connection in that scenario. _Explicitly_ setting the outdated flag in some more scenarios may also be appropriate, yes. > implicit things are more easy to overlook... > > and: > P --- S > P xxx S link breaks > > [ you can insert here even a complete cluster crash ] That's a triple fault already! > X xxx S N2 receives "Peer dead", but still is outdated. That is a quad-fault!!! (Link lost, two nodes down, one node not coming up) Yes, and it knows that because of the implicit "lost connection to primary or died while being connected" already, even if the crash happened before the CRM could invoke the 'mark_outdated' operation. The mark_peer_dead in this case should not reset the 'Outdated' flag. It should only do so in case it's received after a connection loss to the primary; the 'unclean reboot' should be taken into consideration (and I think there's a flag for that already.) An S-P should always consider itself outdated unless it receives the mark_peer_dead under the right circumstances. But we are already pretty far in lala land. > the point is: just receiving a "peer definetely dead" in S/? > is not enough to know that we are not outdated. Right. But the fence doesn't help much either, for we need to set that flag in that scenario even if the 'fence' event just isn't delivered. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \\\ /// SUSE Labs, Research and Development \honk/ SUSE LINUX AG - A Novell company \\// ^ permalink raw reply [flat|nested] 32+ messages in thread
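The rule the discussion converges on can be summarised as a small flag lifecycle (again only a model with invented names): a Secondary that crashed while its peer was Primary comes back up assuming it is Outdated, an explicit outdate/fence also sets the flag, "mark-peer-dead" never clears it, and only a finished resync does.

    # Model of the Outdated flag lifecycle as discussed here; names are invented
    # and this only captures the policy, not DRBD itself.

    class Secondary:
        def __init__(self) -> None:
            self.outdated = False
            self.peer_dead = False

        def unclean_reboot_while_peer_was_primary(self) -> None:
            # a crashed member of an S-P pair assumes it is outdated
            self.outdated = True

        def mark_outdated(self) -> None:
            # explicit resource-level fencing from the cluster manager
            self.outdated = True

        def mark_peer_dead(self) -> None:
            # asserts only that the peer is gone; it must NOT clear Outdated,
            # otherwise a stale copy could be promoted after a multiple failure
            self.peer_dead = True

        def resync_finished(self) -> None:
            self.outdated = False      # the only thing that clears the flag

        def promotable(self) -> bool:
            return self.peer_dead and not self.outdated

    crashed = Secondary()
    crashed.unclean_reboot_while_peer_was_primary()
    crashed.mark_peer_dead()
    assert not crashed.promotable()    # peer-dead alone is not enough: copy may be stale

    disconnected = Secondary()         # plain link loss, no crash: still up to date
    disconnected.mark_peer_dead()
    assert disconnected.promotable()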
* Re: [Drbd-dev] Another drbd race 2004-09-08 15:11 ` Lars Marowsky-Bree @ 2004-09-08 15:22 ` Lars Ellenberg 0 siblings, 0 replies; 32+ messages in thread From: Lars Ellenberg @ 2004-09-08 15:22 UTC (permalink / raw) To: drbd-dev On Wed, Sep 08, 2004 at 05:11:30PM +0200, Lars Marowsky-Bree wrote: > On 2004-09-08T13:31:10, > Lars Ellenberg <lars.ellenberg@linbit.com> said: > > > > So, why an explicit drbdadm fence operation? I'm missing what that would > > > catch. > > > > we probably can cope without, but it is more "polite" if we have it. > > if we _can_ handle it explicit, why not? > > We _need_ to handle it implicitly in case we lose the connection in that > scenario. > > _Explicitly_ setting the outdated flag in some more scenarios may also > be appropriate, yes. that is what I said. :) > > > implicit things are more easy to overlook... > > > > and: > > P --- S > > P xxx S link breaks > > > > [ you can insert here even a complete cluster crash ] > > That's a triple fault already! > > > X xxx S N2 receives "Peer dead", but still is outdated. > > That is a quad-fault!!! (Link lost, two nodes down, one node not coming > up) ... > But, we are already pretty far in lala land. right :) > > > the point is: just receiving a "peer definetely dead" in S/? > > is not enough to know that we are not outdated. > > Right. But the fence doesn't help much either, for we need to set that > flag in that scenario even if the 'fence' event just isn't delivered. what I want to have in there is just a cover-my-ass thingy to require explicit confirmation before possibly losing (application-wise) confirmed data transactions. Lars ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-08 11:20 ` Lars Marowsky-Bree 2004-09-08 11:31 ` Lars Ellenberg @ 2004-09-08 11:33 ` Philipp Reisner 1 sibling, 0 replies; 32+ messages in thread From: Philipp Reisner @ 2004-09-08 11:33 UTC (permalink / raw) To: drbd-dev On Wednesday 08 September 2004 13:20, Lars Marowsky-Bree wrote: > On 2004-09-07T14:47:45, > > Philipp Reisner <philipp.reisner@linbit.com> said: > > No. It would be better to have a "drbdadm fence r0" operation on N2! > > The "drbdadm fence r0" command would only set the "Outdated" flag. > > Well, it's automatically supposed to assume it's outdated when it > crashes in S-P mode. > > When the secondary loses connection to the primary, a mark-peer-dead > would prevent that flag from being set. > > So, why an explicit drbdadm fence operation? I'm missing what that would > catch. > Here is the text you did not quote: The big advantage over stopping DRBD on N2 is that in case the network recovers, N2 will be resynced to up-to-date automatically. -Philipp -- : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com : ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Another drbd race 2004-09-07 12:19 ` Philipp Reisner 2004-09-07 12:28 ` Lars Marowsky-Bree @ 2004-09-07 15:55 ` Lars Ellenberg 1 sibling, 0 replies; 32+ messages in thread From: Lars Ellenberg @ 2004-09-07 15:55 UTC (permalink / raw) To: drbd-dev On Tue, Sep 07, 2004 at 02:19:55PM +0200, Philipp Reisner wrote: > > > > I do not want to "misuse" the Consistent Bit for this. > > > > > > !Consistent .... means that we are in the middle of a sync. > > > = data is not usable at all. > > > Fenced .... our data is 100% okay, but not the latest copy. > > > > lets call it "Outdated" > > > > my idea is that a crashed Secondary will come up as !Primary|Connected, so > > it can assume it is outdated. (similar to the choice about wfc-degr...) > > > > we can only possibly lose write transaction in the very moment we > > promote a Secondary to Primary. until we do that, and the harddisk where > > the transactions have been written to is still physically intact, the > > data is still there, though maybe not available. > > > > we can try to make sure that we never promote a Secondary that possibly > > (or knowingly) is outdated. > > > > see below. > > > > Let us assume that we have two boxes (N1 and N2) and that tese > two boxes are connected by two networks (net and cnet [ clinets'-net ]). > > Net is used by DRBD, while heartbeat uses both, net and cnet > > I know that you are talking about fencing by STONITH, but DRBD is > not limited to that. Here comes my understanding of how fencing > (other tan STONITH) could work with DRBD-0.8 : > > N1 net N2 > P/S --- S/P everything up and running. > P/? - - S/? network breaks ; N1 freezes IO > P/? - - S/? N1 fences N2: > In the Stonith case: turn off N2. > In the "smart" case: > N1 asks N2 to fence itself from the storage via cnet. > HB calls "drbdadm fence r0" on N2. > N2 replies to N1 that fencins is done via cnet. > N1 calls "drbdadm peer-dead r0". the above lines are basically what happens in the recovery path of the cluster resource manager. yes. > P/D - - S/? N1 thaws IO > > N2 got the the "Outdated" flag set in its meta-data, by the "fence" > command. I am not sure if it should be called "fence", other ideas: > "considered-dead","die","fence","outdate". What do you think ? > > My question is: > Is it planed that heartbeat will be able to perform this kind of fencing ? that is more or less what we are going to do. the "fence" in the above "smart" case I'd call "drbdadm mark-outdated r0". yes, heartbeat 2.x will do resource level fencing when possible. lge ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem 2004-08-20 12:52 ` [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem Philipp Reisner 2004-08-20 13:32 ` Lars Ellenberg @ 2004-08-20 14:10 ` Helmut Wollmersdorfer 2004-08-23 22:01 ` Lars Marowsky-Bree 2 siblings, 0 replies; 32+ messages in thread From: Helmut Wollmersdorfer @ 2004-08-20 14:10 UTC (permalink / raw) To: drbd-dev Philipp Reisner wrote: [...] > Which policies do we want to offer ? > > * The node that was primary before split brain (current behaviour) > * The node that becaume primary during split brain > * The node that modified more of it's data during the split-brain > situation [ Do not think about implementation yet, just about > the policy ] > * others ?... * The node preferred by the user, if both nodes can be Primary. This makes sense when the standby node does not provide the same performance. * The more stable node - "stable" TBD [To Be Defined] Helmut Wollmersdorfer ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem 2004-08-20 12:52 ` [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem Philipp Reisner 2004-08-20 13:32 ` Lars Ellenberg 2004-08-20 14:10 ` [Drbd-dev] Re: drbd Frage zu secondary vs primary; drbddisk status problem Helmut Wollmersdorfer @ 2004-08-23 22:01 ` Lars Marowsky-Bree 2 siblings, 0 replies; 32+ messages in thread From: Lars Marowsky-Bree @ 2004-08-23 22:01 UTC (permalink / raw) To: Philipp Reisner, Lars Ellenberg, drbd-dev On 2004-08-20T14:52:52, Philipp Reisner <philipp.reisner@linbit.com> said: > The situation: > > N1 N2 > P --- S Everything ok. > P - - S Link breaks. > P - - P A (also split-brained) Cluster-mgr makes N2 primary too. Big fat bug in the setup and in the cluster manager. ;-) Thus, while it must be resolveable, it doesn't need to be resolved efficiently. > X X Both nodes down. > P --- S The current behaviour. > > What should be done after Split brain ? Both sides should detect this and by default refuse to connect until a human (or higher up being such as the cluster manager) interferes and explicitly and force-fully demotes one side to secondary again. > The question are: > Should this policy be configurable ? (IMO: yes) > Which policies do we want to offer ? > > * The node that was primary before split brain (current behaviour) > * The node that becaume primary during split brain > * The node that modified more of it's data during the split-brain > situation [ Do not think about implementation yet, just about > the policy ] > * others ?... See above. None of your three choices seems the safe answer, because it will need an admin to sort out which side really has the 'better' data, or even worse, may require an image to be taken of both sides and the changes merged. > The second question to answer is: > What should we do if the connecting network heals ? I.e. > > N1 N2 > P --- S Everything ok. > P - - S Link breaks. > P - - P A (also split-brained) Cluster-mgr makes N2 primary too. (Comment about broken setup applies again.) > ? --- ? What now ? > > Current policy: The two nodes will refuse to connect. The administrator > has to resove this. > > Are there any other policies that would make sense ? This is the best solution I can think of for the above reasons. As there may be higher level services running on both nodes, you can't (internally to drbd) resolve this. The higher level services need to be stopped, and one side explicitly demoted. Or both demoted and one explicitly promoted, which should come out the same. Mit freundlichen Grüßen, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ This space / SUSE Labs, Research and Development | intentionally | SUSE LINUX AG - A Novell company \ left blank / ^ permalink raw reply [flat|nested] 32+ messages in thread
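Lars' answer boils down to a connect-time check: if both sides have independently been Primary since they last talked, they must refuse to connect until someone forcefully demotes one of them. Below is a schematic sketch of that check; the boolean field stands in for what DRBD's generation counts actually encode and is invented purely for the illustration.

    from dataclasses import dataclass

    # Schematic split-brain check at connect time.  The boolean below stands in
    # for what the generation counts encode; it is invented for this illustration.

    @dataclass
    class PeerInfo:
        name: str
        was_primary_while_disconnected: bool

    def on_connect(a: PeerInfo, b: PeerInfo) -> str:
        if a.was_primary_while_disconnected and b.was_primary_while_disconnected:
            # both sides diverged: refuse to connect and wait for an operator
            # (or the cluster manager) to demote one side explicitly
            return "split brain detected: both go StandAlone"
        return "connect and resync as usual"

    print(on_connect(PeerInfo("N1", True), PeerInfo("N2", True)))    # split brain
    print(on_connect(PeerInfo("N1", True), PeerInfo("N2", False)))   # normal resync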