* [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
@ 2008-09-10 20:13 Graham, Simon
  2008-09-11  9:24 ` Lars Ellenberg
  2008-09-11 10:33 ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
  0 siblings, 2 replies; 6+ messages in thread
From: Graham, Simon @ 2008-09-10 20:13 UTC (permalink / raw)
  To: drbd-dev

A change made to DRBD 8.2 on 3/25/08 (commit e817888f) that forcibly
disconnects in receive_uuids in some cases is causing us problems in
one scenario. The specific sequence is this (see the detailed trace
extracts at the end of this message):

. We start with node1 Secondary and node0 Primary, then disconnect and
  reconnect.
. This causes us to enter PausedSyncT on node1 and PausedSyncS on node0
  (because higher-priority devices are syncing).
. Now we swap roles, so node1 is Primary/PausedSyncT/Inconsistent and
  node0 is Secondary/PausedSyncS/UpToDate.
. We lose the connection again. Because node1 is Primary, it creates a
  new current UUID.
. Now we connect again. The code in receive_uuids on node1 refuses to
  connect because node1 is Primary/Inconsistent and the other node is
  using a different UUID.
. We are now stuck in a loop, continually trying to connect and dropping
  the connection. There is no way to fix this automatically either,
  since no event is generated that a user-mode program could use to
  trigger recovery.

I'm not sure what the right answer is here, but I have a couple of
observations:
1. When we lose the connection and are Primary/Inconsistent, should we
   really create a new current UUID? The local data is bad.
2. When trying to connect while Primary/Inconsistent, shouldn't we
   accept the connection even if the UUIDs are different? Surely the
   UUID handshake will calculate the correct way to synchronize in this
   case? It certainly would in the above case (you can see this in the
   trace from node1).

Here's the code in question:

    if (mdev->state.conn < Connected &&
        mdev->state.disk < Outdated &&
        mdev->state.role == Primary &&
        (mdev->ed_uuid & ~((u64)1)) != (p_uuid[Current] & ~((u64)1))) {
        ERR("Can only connect to data with current UUID=%016llX\n",
            (unsigned long long)mdev->ed_uuid);
        drbd_force_state(mdev,NS(conn,Disconnecting));
        return FALSE;
    }
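
As a rough illustration of the UUID comparison in that condition, the
sketch below isolates the masking step: bit 0 is cleared on both sides
before the values are compared. The function name uuid_same_ignoring_bit0
is hypothetical and not part of DRBD; the constants in the usage note are
taken from the traces below.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a DRBD function): compare two 64-bit UUIDs
 * the way the check above does, with bit 0 cleared on both sides. */
bool uuid_same_ignoring_bit0(uint64_t a, uint64_t b)
{
    const uint64_t mask = ~(uint64_t)1;  /* same mask as ~((u64)1) */

    return (a & mask) == (b & mask);
}
```

With the values from the traces, 3633090B73366ED2 and 3633090B73366ED3
compare as equal under this mask, while node1's exposed data UUID
A398D35216141B4D and node0's current UUID 678FDAD7A9A6D636 do not, which
is why the connection is refused.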

Any thoughts?
Simon

Trace from node1:

   6719:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFConnection -> WFReportParams )
   6722:Aug 18 22:45:05 node1 kernel: drbd38: drbd_sync_handshake:
   6723:Aug 18 22:45:05 node1 kernel: drbd38: self CADAA290A29A95F0:0000000000000000:4CC7FE9AE8243C78:0000000000000006
   6724:Aug 18 22:45:05 node1 kernel: drbd38: peer 678FDAD7A9A6D636:CADAA290A29A95F0:8ACAF333A2BE196A:4CC7FE9AE8243C78
   6725:Aug 18 22:45:05 node1 kernel: drbd38: uuid_compare()=-1 by rule 5
   6726:Aug 18 22:45:05 node1 kernel: drbd38: Becoming sync target due to disk states.
   6727:Aug 18 22:45:05 node1 kernel: drbd38: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
   6729:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFBitMapT -> WFSyncUUID )
   6730:Aug 18 22:45:05 node1 kernel: drbd38:  exposed data uuid now 3633090B73366ED2
   6731:Aug 18 22:45:05 node1 kernel: drbd38:  uuid[Current] now 3633090B73366ED2
   6732:Aug 18 22:45:05 node1 kernel: drbd38:  uuid[Bitmap] now 0000000000000000
   6734:Aug 18 22:45:05 node1 kernel: drbd38: conn( WFSyncUUID -> PausedSyncT )
   6735:Aug 18 22:45:05 node1 kernel: drbd38: Began resync as PausedSyncT (will sync 118668 KB [29667 bits set]).
   7027:Aug 18 22:46:07 node1 kernel: drbd38: role( Secondary -> Primary )
   7437:Aug 18 22:46:39 node1 kernel: drbd38: meta connection shut down by peer.
   7438:Aug 18 22:46:39 node1 kernel: drbd38: peer( Secondary -> Unknown ) conn( PausedSyncT -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
   7456:Aug 18 22:46:39 node1 kernel: drbd38: sock was shut down by peer
   7460:Aug 18 22:46:39 node1 kernel: drbd38: Creating new current UUID
   7461:Aug 18 22:46:39 node1 kernel: drbd38:  uuid[Bitmap] now 3633090B73366ED3
   7462:Aug 18 22:46:39 node1 kernel: drbd38:  exposed data uuid now A398D35216141B4D
   7463:Aug 18 22:46:39 node1 kernel: drbd38:  uuid[Current] now A398D35216141B4D
   7464:Aug 18 22:46:39 node1 kernel: drbd38: Writing meta data super block now.
   7486:Aug 18 22:46:39 node1 kernel: drbd38: Connection closed
   7487:Aug 18 22:46:39 node1 kernel: drbd38: Considering state change from bad state. Error would be: 'Refusing to be Primary without at least one UpToDate disk'
   7488:Aug 18 22:46:39 node1 kernel: drbd38:  old = { cs:NetworkFailure st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7489:Aug 18 22:46:39 node1 kernel: drbd38:  new = { cs:Unconnected st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7490:Aug 18 22:46:39 node1 kernel: drbd38: conn( NetworkFailure -> Unconnected )
   7491:Aug 18 22:46:39 node1 kernel: drbd38: receiver terminated
   7492:Aug 18 22:46:39 node1 kernel: drbd38: receiver (re)started
   7493:Aug 18 22:46:39 node1 kernel: drbd38: Considering state change from bad state. Error would be: 'Refusing to be Primary without at least one UpToDate disk'
   7494:Aug 18 22:46:39 node1 kernel: drbd38:  old = { cs:Unconnected st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7495:Aug 18 22:46:39 node1 kernel: drbd38:  new = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7496:Aug 18 22:46:39 node1 kernel: drbd38: conn( Unconnected -> WFConnection )
   7560:Aug 18 22:46:40 node1 kernel: drbd38: Handshake successful: Agreed network protocol version 88
   7561:Aug 18 22:46:40 node1 kernel: drbd38: Considering state change from bad state. Error would be: 'Refusing to be Primary without at least one UpToDate disk'
   7562:Aug 18 22:46:40 node1 kernel: drbd38:  old = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7563:Aug 18 22:46:40 node1 kernel: drbd38:  new = { cs:WFReportParams st:Primary/Unknown ds:Inconsistent/DUnknown ra-- }
   7564:Aug 18 22:46:40 node1 kernel: drbd38: conn( WFConnection -> WFReportParams )
   7580:Aug 18 22:46:40 node1 kernel: drbd38: Can only connect to data with current UUID=A398D35216141B4D
   7581:Aug 18 22:46:40 node1 kernel: drbd38: conn( WFReportParams -> Disconnecting )
   7586:Aug 18 22:46:40 node1 kernel: drbd38: Connection closed

Trace from Node0:

   5185:Aug 18 22:45:06 node0 kernel: drbd38: conn( StandAlone -> Unconnected )
   5188:Aug 18 22:45:06 node0 kernel: drbd38: conn( Unconnected -> WFConnection )
   5197:Aug 18 22:45:06 node0 kernel: drbd38: aftr_isp( 0 -> 1 )
   5248:Aug 18 22:45:06 node0 kernel: drbd38: Handshake successful: Agreed network protocol version 88
   5249:Aug 18 22:45:06 node0 kernel: drbd38: conn( WFConnection -> WFReportParams )
   5250:Aug 18 22:45:06 node0 kernel: drbd38: Starting asender thread (from drbd38_receiver [16796])
   5251:Aug 18 22:45:06 node0 kernel: drbd38: data-integrity-alg: <not-used>
   5252:Aug 18 22:45:06 node0 kernel: drbd38: drbd_sync_handshake:
   5253:Aug 18 22:45:06 node0 kernel: drbd38: self 678FDAD7A9A6D636:CADAA290A29A95F0:8ACAF333A2BE196A:4CC7FE9AE8243C78
   5254:Aug 18 22:45:06 node0 kernel: drbd38: peer CADAA290A29A95F0:0000000000000000:4CC7FE9AE8243C78:0000000000000006
   5255:Aug 18 22:45:06 node0 kernel: drbd38: uuid_compare()=1 by rule 7
   5256:Aug 18 22:45:06 node0 kernel: drbd38: Becoming sync source due to disk states.
   5257:Aug 18 22:45:06 node0 kernel: drbd38: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent ) peer_isp( 0 -> 1 )
   5258:Aug 18 22:45:06 node0 kernel: drbd38: Writing meta data super block now.
   5263:Aug 18 22:45:06 node0 kernel: drbd38:  uuid[History_start] now CADAA290A29A95F0
   5264:Aug 18 22:45:06 node0 kernel: drbd38:  uuid[Bitmap] now 3633090B73366ED2
   5265:Aug 18 22:45:06 node0 kernel: drbd38: conn( WFBitMapS -> PausedSyncS )
   5266:Aug 18 22:45:06 node0 kernel: drbd38: Began resync as PausedSyncS (will sync 118668 KB [29667 bits set]).
   5267:Aug 18 22:45:06 node0 kernel: drbd38: Writing meta data super block now.
   5410:Aug 18 22:46:07 node0 kernel: drbd38: peer( Secondary -> Primary )
   5427:Aug 18 22:46:39 node0 kernel: drbd38: PingAck did not arrive in time.
   5428:Aug 18 22:46:39 node0 kernel: drbd38: peer( Primary -> Unknown ) conn( PausedSyncS -> NetworkFailure ) peer_isp( 1 -> 0 )
   5434:Aug 18 22:46:39 node0 kernel: drbd38: Connection closed
   5435:Aug 18 22:46:39 node0 kernel: drbd38: conn( NetworkFailure -> Unconnected )
   5438:Aug 18 22:46:39 node0 kernel: drbd38: conn( Unconnected -> WFConnection )
   5546:Aug 18 22:46:40 node0 kernel: drbd38: Handshake successful: Agreed network protocol version 88
   5547:Aug 18 22:46:40 node0 kernel: drbd38: conn( WFConnection -> WFReportParams )
   5561:Aug 18 22:46:40 node0 kernel: drbd38: meta connection shut down by peer.
   5562:Aug 18 22:46:40 node0 kernel: drbd38: conn( WFReportParams -> NetworkFailure )
   5573:Aug 18 22:46:40 node0 kernel: drbd38: Connection closed
   5574:Aug 18 22:46:40 node0 kernel: drbd38: conn( NetworkFailure -> Unconnected )
   5577:Aug 18 22:46:40 node0 kernel: drbd38: conn( Unconnected -> WFConnection )
   5598:Aug 18 22:46:43 node0 kernel: drbd38: Handshake successful: Agreed network protocol version 88
   5599:Aug 18 22:46:43 node0 kernel: drbd38: conn( WFConnection -> WFReportParams )
   5600:Aug 18 22:46:43 node0 kernel: drbd38: Starting asender thread (from drbd38_receiver [16796])
   5602:Aug 18 22:46:43 node0 kernel: drbd38: drbd_sync_handshake:
   5603:Aug 18 22:46:43 node0 kernel: drbd38: self 678FDAD7A9A6D636:3633090B73366ED2:CADAA290A29A95F0:8ACAF333A2BE196A
   5604:Aug 18 22:46:43 node0 kernel: drbd38: peer A398D35216141B4D:3633090B73366ED3:4CC7FE9AE8243C78:0000000000000006
   5605:Aug 18 22:46:43 node0 kernel: drbd38: uuid_compare()=100 by rule 9
   5606:Aug 18 22:46:43 node0 kernel: drbd38: Becoming sync source due to disk states.
   5607:Aug 18 22:46:43 node0 kernel: drbd38: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapS ) peer_isp( 0 -> 1 )
   5609:Aug 18 22:46:43 node0 kernel: drbd38: meta connection shut down by peer.
   5610:Aug 18 22:46:43 node0 kernel: drbd38: peer( Primary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) peer_isp( 1 -> 0 )
   5611:Aug 18 22:46:43 node0 kernel: drbd38: asender terminated
   5612:Aug 18 22:46:43 node0 kernel: drbd38: Terminating asender thread



* Re: [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
  2008-09-10 20:13 [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID Graham, Simon
@ 2008-09-11  9:24 ` Lars Ellenberg
  2008-09-11 10:33 ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
  1 sibling, 0 replies; 6+ messages in thread
From: Lars Ellenberg @ 2008-09-11  9:24 UTC (permalink / raw)
  To: drbd-dev

On Wed, Sep 10, 2008 at 04:13:43PM -0400, Graham, Simon wrote:
> A change made to DRBD 8.2 on 3/25/08 (commit e817888f) to forcibly
> disconnect in receive_uuids in some cases seems to be causing us
> problems in some scenarios - the specific sequence is this (see detailed
> trace extracts at the end of this)
> 
> . We start with node1 secondary and node0 primary and then disconnect
> and reconnect
> . This causes us to enter PausedSyncT on node1 and PausedSyncS on node0
> (because higher priority devices
>   are syncing)
> . Now we swap primaryness so node1 is Primary/PausedSyncT/Inconsistent
> and node0 is Secondary/PausedSyncS/UpToDate
> . Lose connection again - because Node1 is primary, we create a new
> current UUID.
> . Now we connect again - the code in receive_uuids on Node1 refuses to
> connect because it's Primary/Inconsistent and
>   the other node is using a different UUID.
> . We are now stuck in a loop trying to connect and dropping the
> connection continually - no way to fix automatically either
>   since no event is generated that a user mode program could use to
> trigger recovery.
> 
> I'm not sure what the right answer is here, but I have a couple of
> observations:
> 1. When we lose connection and are Primary/Inconsistent, should we
> really create a new current UUID? The local data is bad.

I don't think we should.

> 2. When trying to connect and we are Primary/Inconsistent, shouldn't we
> accept the connection even if the UUIDs are different?

And potentially time-warp the content?
No.

>    Surely the uuid handshake will calculate the correct way to
> synchronize in this case? It certainly would in the above
>    case (and you can see this in the trace from node1)

in the handshake, an inconsistent node will always lose, yes.

> Here's the code in question:
> 
>     if (mdev->state.conn < Connected &&
>         mdev->state.disk < Outdated &&
>         mdev->state.role == Primary &&
>         (mdev->ed_uuid & ~((u64)1)) != (p_uuid[Current] & ~((u64)1))) {
>         ERR("Can only connect to data with current UUID=%016llX\n",
>             (unsigned long long)mdev->ed_uuid);
>         drbd_force_state(mdev,NS(conn,Disconnecting));
>         return FALSE;
>     }

that piece of code is to avoid "time warps" of content.
the specific scenario this should protect against is:

Primary, Connected
 connection breaks.
Primary  ---?
 primary generates new "current uuid",
 and continues to write
 local disk breaks
Primary, Diskless, ---?
 connection heals
 
Now, if we allow this connection and accept the data of the secondary,
we have just jumped back in time.

If, however, the local disk "heals" (maybe it was an iSCSI disk and some
switch was flaky?), we allow it to attach, since it has the same UUID
as we currently have, and then allow the connection and resync to happen.

At least that is what "should" happen.

-- 
: Lars Ellenberg                
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* RE: [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
  2008-09-10 20:13 [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID Graham, Simon
  2008-09-11  9:24 ` Lars Ellenberg
@ 2008-09-11 10:33 ` Graham, Simon
  2008-09-15 14:40   ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
  1 sibling, 1 reply; 6+ messages in thread
From: Graham, Simon @ 2008-09-11 10:33 UTC (permalink / raw)
  To: drbd-dev

Thanks Lars,

> > I'm not sure what the right answer is here, but I have a couple of
> > observations:
> > 1. When we lose connection and are Primary/Inconsistent, should we
> > really create a new current UUID? The local data is bad.
> 
> I don't think we should.

OK - I'll look at implementing this.

> that piece of code is to avoid "time warps" of content.
> the specific scenario this should protect against is:
> 
> Primary, Connected
>  connection breaks.
> Primary  ---?
>  primary generates new "current uuid",
>  and continues to write
>  local disk breaks
> Primary, Diskless, ---?
>  connection heals
> 
> now, if we allow this connection and accept the data of the secondary,
> we just jumped back in time.

Ah OK -- would it perhaps then be OK to check for local disk state <=
Negotiating instead of < Outdated:

	if (mdev->state.conn < Connected &&
	    mdev->state.disk <= Negotiating &&
	    mdev->state.role == Primary &&
	    (mdev->ed_uuid & ~((u64)1)) != (p_uuid[Current] & ~((u64)1))) {
		ERR("Can only connect to data with current UUID=%016llX\n",
		    (unsigned long long)mdev->ed_uuid);
		drbd_force_state(mdev,NS(conn,Disconnecting));
		return FALSE;
	}

This would handle the case we see where we KNOW the local disk does not
have good data... (and we knew before the connection was lost, so we
would NOT have actually been able to change the local copy of the data
on disk).
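
To make the difference between the two disk-state bounds concrete, here
is a minimal standalone sketch. It assumes the disk states are ordered
Diskless < Attaching < Failed < Negotiating < Inconsistent < Outdated <
DUnknown < Consistent < UpToDate; that ordering, the sk_ names, and the
flattened struct are assumptions made for illustration, not DRBD's own
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ordering of the disk states, lowest first. Only the relative
 * order matters; the SK_ prefix marks these as sketch-local names. */
enum sk_disk_state {
    SK_Diskless, SK_Attaching, SK_Failed, SK_Negotiating,
    SK_Inconsistent, SK_Outdated, SK_DUnknown, SK_Consistent, SK_UpToDate
};

/* Hypothetical, flattened view of the fields the check reads. */
struct sk_dev {
    bool connected;           /* stands in for state.conn >= Connected */
    bool primary;             /* stands in for state.role == Primary   */
    enum sk_disk_state disk;  /* stands in for state.disk              */
    uint64_t ed_uuid;         /* the "exposed data" UUID               */
};

/* UUID comparison with bit 0 ignored, as in the quoted condition. */
bool sk_uuid_mismatch(const struct sk_dev *d, uint64_t peer_current)
{
    const uint64_t mask = ~(uint64_t)1;

    return (d->ed_uuid & mask) != (peer_current & mask);
}

/* Existing bound: refuse for any disk state below Outdated. */
bool sk_refuse_existing(const struct sk_dev *d, uint64_t peer_current)
{
    return !d->connected && d->disk < SK_Outdated && d->primary &&
           sk_uuid_mismatch(d, peer_current);
}

/* Proposed bound: refuse only up to and including Negotiating. */
bool sk_refuse_proposed(const struct sk_dev *d, uint64_t peer_current)
{
    return !d->connected && d->disk <= SK_Negotiating && d->primary &&
           sk_uuid_mismatch(d, peer_current);
}
```

Under that assumed ordering, a Primary/Inconsistent node with a
mismatched UUID is refused by the existing bound but accepted by the
proposed one, while a Primary/Diskless node (the time-warp case) is
refused by both.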

Simon


* RE: [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
  2008-09-11 10:33 ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
@ 2008-09-15 14:40   ` Graham, Simon
  2008-09-16 19:14     ` Graham, Simon
  2008-10-31 12:43     ` Philipp Reisner
  0 siblings, 2 replies; 6+ messages in thread
From: Graham, Simon @ 2008-09-15 14:40 UTC (permalink / raw)
  To: Graham, Simon, drbd-dev

[-- Attachment #1: Type: text/plain, Size: 2593 bytes --]

Attached is a proposed patch to address this problem. It does the
following:

1. When the connection is lost, a new UUID is not created on the primary
   if the local disk is inconsistent.
2. A connection is allowed to be established if the local disk is
   inconsistent, even if the remote is using a different UUID: our data
   is useless and cannot have been changed anyway. The handshake that
   follows will cause the appropriate resync.

Although only one of these is actually required to fix my problem, I
thought it was better to be consistent and change both.
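
The first change can be read as one extra condition on when a new
current UUID is generated after a connection loss. The sketch below
restates that condition as a standalone predicate; the function name,
the ug_ enum, and the assumed ordering of disk states are illustrative
and not taken from the DRBD sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ordering of the disk states, lowest first (sketch-local). */
enum ug_disk_state {
    UG_Diskless, UG_Attaching, UG_Failed, UG_Negotiating,
    UG_Inconsistent, UG_Outdated, UG_DUnknown, UG_Consistent, UG_UpToDate
};

/* Hypothetical predicate: generate a new current UUID only when we are
 * Primary, the local disk is at least Consistent, and the bitmap UUID
 * slot is still empty (no new UUID was generated already). */
bool ug_should_create_new_current_uuid(bool is_primary,
                                       enum ug_disk_state disk,
                                       uint64_t bitmap_uuid)
{
    return is_primary && disk >= UG_Consistent && bitmap_uuid == 0;
}
```

In the reported scenario node1 is Primary with an Inconsistent disk and
an empty bitmap UUID when the connection drops, so under this predicate
no new current UUID would be created.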

Simon


[-- Attachment #2: 0001-Better-handling-of-losing-connection-in-primary-inco.patch --]
[-- Type: application/octet-stream, Size: 2069 bytes --]

From 418cd46e033da83695b873a842f44abe7db0571f Mon Sep 17 00:00:00 2001
From: Simon P. Graham <Simon.Graham@stratus.com>
Date: Fri, 12 Sep 2008 10:41:49 -0400
Subject: [PATCH] Better handling of losing connection in primary/inconsistent state

1. Don't create a new UUID if primary/inconsistent when connection dropped
2. Allow connection to be established even if remote side has a different
   UUID - our data is useless anyway.
---
 drbd/drbd_main.c     |    1 +
 drbd/drbd_nl.c       |    1 +
 drbd/drbd_receiver.c |    2 +-
 3 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index 3d45907..4c396f2 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -1078,6 +1078,7 @@ static void after_state_ch(struct drbd_conf *mdev, union drbd_state_t os,
 		if (inc_local(mdev)) {
 			/* generate new uuid, unless we did already */
 			if (ns.role == Primary &&
+			    ns.disk >= Consistent &&
 			    mdev->bc->md.uuid[Bitmap] == 0)
 				drbd_uuid_new_current(mdev);
 			if (ns.peer == Primary) {
diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 3c76edc..ddb33f2 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -360,6 +360,7 @@ int drbd_set_role(struct drbd_conf *mdev, enum drbd_role new_role, int force)
 		if (inc_local(mdev)) {
 			if ( ((mdev->state.conn < Connected ||
 			       mdev->state.pdsk <= Failed)
+			      && mdev->state.disk >= Consistent
 			      && mdev->bc->md.uuid[Bitmap] == 0) || forced)
 				drbd_uuid_new_current(mdev);
 
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index c619c7a..5d38552 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -2454,7 +2454,7 @@ STATIC int receive_uuids(struct drbd_conf *mdev, struct Drbd_Header *h)
 	mdev->p_uuid = p_uuid;
 
 	if (mdev->state.conn < Connected &&
-	    mdev->state.disk < Outdated &&
+	    mdev->state.disk <= Negotiating &&
 	    mdev->state.role == Primary &&
 	    (mdev->ed_uuid & ~((u64)1)) != (p_uuid[Current] & ~((u64)1))) {
 		ERR("Can only connect to data with current UUID=%016llX\n",
-- 
1.5.4-rc1.GIT



* RE: [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
  2008-09-15 14:40   ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
@ 2008-09-16 19:14     ` Graham, Simon
  2008-10-31 12:43     ` Philipp Reisner
  1 sibling, 0 replies; 6+ messages in thread
From: Graham, Simon @ 2008-09-16 19:14 UTC (permalink / raw)
  To: Graham, Simon, drbd-dev

Hah! I just realized that the current HEAD in the 8.2 tree already
includes this - thanks Phil!

So - just ignore this one.
Simon


* Re: [Drbd-dev] Problems with code to disallow connection when peer has mismatched UUID
  2008-09-15 14:40   ` [Drbd-dev] Problems with code to disallow connection when peer has " Graham, Simon
  2008-09-16 19:14     ` Graham, Simon
@ 2008-10-31 12:43     ` Philipp Reisner
  1 sibling, 0 replies; 6+ messages in thread
From: Philipp Reisner @ 2008-10-31 12:43 UTC (permalink / raw)
  To: drbd-dev

Am Montag 15 September 2008 16:40:34 schrieb Graham, Simon:
> Attached is a proposed patch to address this problem that does the
> following:
>
> 1. When connection is lost, a new UUID is not created on the primary if
> the local disk is inconsistent
> 2. Allow connection to be established if local disk is inconsistent even
> if the remote is using
>    a different UUID - our data is useless and cannot have been changed
> anyway. The following handshake
>    will cause the appropriate resync.
>
> Although only one of these is actually required to fix my problem, I
> thought it was better to be consistent
> and change both.
>

Hi Simon,

Your patch is correct, and quite similar to the commit below. The second
hunk is missing from that commit, since a node in such a state cannot
get promoted to primary anyway.

commit a20ecb2e221a3a0e565ecdbb9ac5239b54dca395
Author: Philipp Reisner <philipp.reisner@linbit.com>
Date:   Thu Jul 24 14:06:41 2008 +0200

    Fixed the "exposed data" logic in case a sync target primary lost the network connection

    closes #98

-Phil
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

