* [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
@ 2007-03-30 22:01 Montrose, Ernest
2007-04-02 19:59 ` Philipp Reisner
0 siblings, 1 reply; 8+ messages in thread
From: Montrose, Ernest @ 2007-03-30 22:01 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1: Type: text/plain, Size: 3330 bytes --]
Hi all,
This is another hard to reproduce one but the proofs are in the logs
that the problem is alive and well.
I am hoping for at least some clues that may help reproduce
this...Essentially after one node in
Primary state is powered down(Not a graceful shutdown) both nodes ends
up In WFBitMapT for a drbd volume.
Here are some logs:
On one node========================
Drbd2: Writing metadata to superblock now.
......
....... This nodes is powered of and came back with:
Mar 23 13:16:53 jerry kernel: drbd2: rct = 0 in
/test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_rec
eiver.c:1878
Mar 23 13:16:55 [ OK ]
Mar 23 13:16:54 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:16:54 jerry kernel: drbd2: self
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:54 jerry kernel: drbd2: peer
F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:55 jerry kernel: drbd2: uuid_compare()=0 by rule 4
Mar 23 13:16:55 jerry kernel: drbd2: No resync, but bits in bitmap!
......
Mar 23 13:17:00 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:17:00 jerry kernel: drbd2: self
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:17:00 jerry kernel: drbd2: peer
0000000000000000:0000000000000000:F71E503A8179BC5C:3F430D4E1D59C3EA
Mar 23 13:17:00 jerry kernel: drbd2: uuid_compare()=-2 by rule 6
Mar 23 13:17:00 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:01 jerry kernel: drbd2: writing of bitmap took 11 jiffies
Mar 23 13:17:01 jerry kernel: drbd2: 12 GB marked out-of-sync by on disk
bit-map.
Mar 23 13:17:02 jerry kernel: drbd2: 13336132 KB now marked out-of-sync
by on disk bit-map.
Mar 23 13:17:02 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:02 jerry kernel: drbd2: uuid[History_start] now
F71E503A8179BC5D
Mar 23 13:17:02 jerry kernel: drbd2: uuid[Current] now 0000000000000000
Mar 23 13:17:03 jerry kernel: drbd2: conn( Connected -> WFBitMapT )
Mar 23 13:17:03 jerry kernel: drbd2: Writing meta data super block now.
On theh other node=============================
Mar 23 13:16:48 ben kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:16:48 ben kernel: drbd2: Handshake successful: DRBD Network
Protocol version 86
Mar 23 13:16:48 ben kernel: drbd2: peer( Unknown -> Secondary ) conn(
WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0
-> 1 )
Mar 23 13:16:48 ben kernel: drbd2: Writing meta data super block now.
....
Mar 23 13:16:49 ben kernel: drbd2: rct = 2 in
/test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_rec
eiver.c:1878
Mar 23 13:16:49 ben kernel: drbd2: drbd_sync_handshake:
Mar 23 13:16:49 ben kernel: drbd2: self
F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:49 ben kernel: drbd2: peer
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:49 ben kernel: drbd2: uuid_compare()=-1 by rule 4
Mar 23 13:16:49 ben kernel: drbd2: uuid[History_start] now
F71E503A8179BC5C
Mar 23 13:16:49 ben kernel: drbd2: uuid[Current] now 0000000000000000
Mar 23 13:16:49 ben kernel: drbd2: conn( Connected -> WFBitMapT )
Mar 23 13:16:49 ben kernel: drbd2: Writing meta data super block now.
[-- Attachment #2: Type: text/html, Size: 10906 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
2007-03-30 22:01 [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT Montrose, Ernest
@ 2007-04-02 19:59 ` Philipp Reisner
2007-04-02 21:44 ` Montrose, Ernest
0 siblings, 1 reply; 8+ messages in thread
From: Philipp Reisner @ 2007-04-02 19:59 UTC (permalink / raw)
To: drbd-dev; +Cc: Montrose, Ernest
Hi Ernest,
I removed all the linebreaks:
Drbd2: Writing metadata to superblock now.
......
....... This nodes is powered of and came back with:
13:16:53 je rct = 0 in
13:16:54 je drbd_sync_handshake:
13:16:54 je self F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:54 je peer F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:55 je uuid_compare()=0 by rule 4 <<<<<<<<=========---------------
13:16:55 je No resync, but bits in bitmap!
......
13:17:00 je drbd_sync_handshake:
13:17:00 je self F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:17:00 je peer 0000000000000000:0000000000000000:F71E503A8179BC5C:3F430D4E1D59C3EA
13:17:00 je uuid_compare()=-2 by rule 6
13:17:00 je Writing meta data super block now.
13:17:01 je writing of bitmap took 11 jiffies
13:17:01 je 12 GB marked out-of-sync by on disk bit-map.
13:17:02 je 13336132 KB now marked out-of-sync by on disk bit-map.
13:17:02 je Writing meta data super block now.
13:17:02 je uuid[History_start] now F71E503A8179BC5D
13:17:02 je uuid[Current] now 0000000000000000
13:17:03 je conn( Connected -> WFBitMapT )
13:17:03 je Writing meta data super block now.
On theh other node=============================
13:16:48 be aftr_isp( 0 -> 1 )
13:16:48 be Handshake successful: DRBD Network Protocol version 86
13:16:48 be peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
13:16:48 be Writing meta data super block now.
....
13:16:49 be rct = 2 in
13:16:49 be drbd_sync_handshake:
13:16:49 be self F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:49 be peer F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:49 be uuid_compare()=-1 by rule 4 <<<<<<<<=========---------------
13:16:49 be uuid[History_start] now F71E503A8179BC5C
13:16:49 be uuid[Current] now 0000000000000000
13:16:49 be conn( Connected -> WFBitMapT )
13:16:49 be Writing meta data super block now.
Unfortunately the clocks of the nodes seems to be not in sync (please
use NTP, this would make reading such logs much easiert!)
I assume that the line marked with the arrow ( <<==-- ) is
where the troubles begin. Because the 'self' and 'peer' line
is matches here inverse. Then the suspicious decision is
the outcome of uuid_compare() with rule 4.
While je (jerry) comes to the conclusion 0, gets ben -1.
The rtc (a few lines before) suggests us that jerry thought
that neither node was a CRASHED_PRIMARY, while ben thought
that jerry was a CRASHED_PRIMARY. Out of you comment
that jerry was powered off, I guess that jerry was really
a CRASHED_PRIMARY, and therefore it sent this information
over to ben. But somehow that CRASHED_PRIMARY bit got
cleared before jerry came to its drbd_uuid_compare()
function.
The question is: how did that happen ?
In the rest of the log entries on jerry we can see how he
tried to do a second sync handshake ...
Back to the question of the lost CRASHED_PRIMARY bit.
That bits can get cleared when:
1) We attach a disk.
2) Connection state goes to connected.
3) In the receive_bitmap.
I think that the clear_bit in receive_bitmap() is wrong, but
I currently can not think of a situation where this clear_bit
can get executed before the drbd_sync_handshake() function.
Ernest, could you provide a bit more context with a few more
lines of jerry's log before 13:16:53.
So far this my proposed patch. But there is still a missing
link in the reasoning:
--- drbd_receiver.c (revision 2821)
+++ drbd_receiver.c (working copy)
@@ -2509,7 +2509,6 @@
D_ASSERT(h->command == ReportBitMap);
}
- clear_bit(CRASHED_PRIMARY, &mdev->flags); // md_write() is in drbd_start_resync.
if (mdev->state.conn == WFBitMapS) {
drbd_start_resync(mdev,SyncSource);
} else if (mdev->state.conn == WFBitMapT) {
^ permalink raw reply [flat|nested] 8+ messages in thread* RE: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
2007-04-02 19:59 ` Philipp Reisner
@ 2007-04-02 21:44 ` Montrose, Ernest
2007-04-03 9:34 ` Philipp Reisner
0 siblings, 1 reply; 8+ messages in thread
From: Montrose, Ernest @ 2007-04-02 21:44 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
[-- Attachment #1: Type: text/plain, Size: 5376 bytes --]
Phil,
OK...I have learned a tad more since that last email. So before I even
try the proposed patch here is a way I was able to duplicate the
problem.
Perhaps that will help a bit. Here it is with two nodes 'a' and 'b' .
Suppose
They are in steady states with UUIDS:
Xa:0:Ha:HH:1:1:0:1:0:0
Xb:0:Hb:HH:1:1:0:1:0:0
Role Secondary/Secondary
1) Disconnect/detach /dev/drbdX on nodea
2) Move Current UUID of nodea to history-UUID of nodea and set
current_UUID of nodea to 00000000000 with drbdmeta..
0:0:Xa:HH:1:1:0:1:0:0
3) Now attach and connect /dev/drbdX and the problem will occur
I have attached the logs for my "manufactured" version of the problem
:). I have some extra instrumentation that I had placed in there so you
can ignore
Them...
I am not sure how we got that way under normal operation..:( One thing
is that
the test we are running calls for "pulling the plug" on one of the node.
So
the reboot was not the result of a graceful shutdown and right before
the reboot, the messed up volume had just written its meta data "to
super block" coming fresh from a resync.
Hope this helps
Thanks a lot...
EM--
-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner@linbit.com]
Sent: Monday, April 02, 2007 4:00 PM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
Hi Ernest,
I removed all the linebreaks:
Drbd2: Writing metadata to superblock now.
......
....... This nodes is powered of and came back with:
13:16:53 je rct = 0 in
13:16:54 je drbd_sync_handshake:
13:16:54 je self
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:54 je peer
F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:55 je uuid_compare()=0 by rule 4
<<<<<<<<=========---------------
13:16:55 je No resync, but bits in bitmap!
......
13:17:00 je drbd_sync_handshake:
13:17:00 je self
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:17:00 je peer
0000000000000000:0000000000000000:F71E503A8179BC5C:3F430D4E1D59C3EA
13:17:00 je uuid_compare()=-2 by rule 6
13:17:00 je Writing meta data super block now.
13:17:01 je writing of bitmap took 11 jiffies
13:17:01 je 12 GB marked out-of-sync by on disk bit-map.
13:17:02 je 13336132 KB now marked out-of-sync by on disk bit-map.
13:17:02 je Writing meta data super block now.
13:17:02 je uuid[History_start] now F71E503A8179BC5D
13:17:02 je uuid[Current] now 0000000000000000
13:17:03 je conn( Connected -> WFBitMapT )
13:17:03 je Writing meta data super block now.
On theh other node=============================
13:16:48 be aftr_isp( 0 -> 1 )
13:16:48 be Handshake successful: DRBD Network Protocol version 86
13:16:48 be peer( Unknown -> Secondary ) conn( WFReportParams ->
Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
13:16:48 be Writing meta data super block now.
....
13:16:49 be rct = 2 in
13:16:49 be drbd_sync_handshake:
13:16:49 be self
F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:49 be peer
F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
13:16:49 be uuid_compare()=-1 by rule 4
<<<<<<<<=========---------------
13:16:49 be uuid[History_start] now F71E503A8179BC5C
13:16:49 be uuid[Current] now 0000000000000000
13:16:49 be conn( Connected -> WFBitMapT )
13:16:49 be Writing meta data super block now.
Unfortunately the clocks of the nodes seems to be not in sync (please
use NTP, this would make reading such logs much easiert!)
I assume that the line marked with the arrow ( <<==-- ) is
where the troubles begin. Because the 'self' and 'peer' line
is matches here inverse. Then the suspicious decision is
the outcome of uuid_compare() with rule 4.
While je (jerry) comes to the conclusion 0, gets ben -1.
The rtc (a few lines before) suggests us that jerry thought
that neither node was a CRASHED_PRIMARY, while ben thought
that jerry was a CRASHED_PRIMARY. Out of you comment
that jerry was powered off, I guess that jerry was really
a CRASHED_PRIMARY, and therefore it sent this information
over to ben. But somehow that CRASHED_PRIMARY bit got
cleared before jerry came to its drbd_uuid_compare()
function.
The question is: how did that happen ?
In the rest of the log entries on jerry we can see how he
tried to do a second sync handshake ...
Back to the question of the lost CRASHED_PRIMARY bit.
That bits can get cleared when:
1) We attach a disk.
2) Connection state goes to connected.
3) In the receive_bitmap.
I think that the clear_bit in receive_bitmap() is wrong, but
I currently can not think of a situation where this clear_bit
can get executed before the drbd_sync_handshake() function.
Ernest, could you provide a bit more context with a few more
lines of jerry's log before 13:16:53.
So far this my proposed patch. But there is still a missing
link in the reasoning:
--- drbd_receiver.c (revision 2821)
+++ drbd_receiver.c (working copy)
@@ -2509,7 +2509,6 @@
D_ASSERT(h->command == ReportBitMap);
}
- clear_bit(CRASHED_PRIMARY, &mdev->flags); // md_write() is in
drbd_start_resync.
if (mdev->state.conn == WFBitMapS) {
drbd_start_resync(mdev,SyncSource);
} else if (mdev->state.conn == WFBitMapT) {
[-- Attachment #2: nodeA.txt --]
[-- Type: text/plain, Size: 3353 bytes --]
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: _drbd_set_state: No state new state to change to.
drbd2: short read expecting header on sock: r=-512
drbd2: asender terminated
drbd2: tl_clear()
drbd2: Connection closed
drbd2: Writing meta data super block now.
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( Disconnecting -> StandAlone )
drbd2: receiver terminated
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: disk( UpToDate -> Diskless )
drbd2: drbd_bm_resize called with capacity == 0
drbd2: worker terminated
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: disk( Diskless -> Attaching )
drbd2: No usable activity log found.
drbd2: max_segment_size ( = BIO size ) = 32768
drbd2: drbd_bm_resize called with capacity == 26672264
drbd2: resync bitmap: bits=3334033 words=104190
drbd2: size = 12 GB (13336132 KB)
drbd2: reading of bitmap took 2 jiffies
drbd2: recounting of set bits took additional 0 jiffies
drbd2: 0 KB marked out-of-sync by on disk bit-map.
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: disk( Attaching -> UpToDate )
drbd2: Writing meta data super block now.
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( StandAlone -> Unconnected )
drbd2: receiver (re)started
drbd2: drbd_connect: ****************Connecting********
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( Unconnected -> WFConnection )
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( WFConnection -> WFReportParams )
drbd2: Handshake successful: DRBD Network Protocol version 86
drbd2: drbd_connect: ****************sending procol while Connecting********
drbd2: drbd_connect: ****************Sending paran while Connecting********
drbd2: drbd_connect: **************** Sending sizes while Connecting********
drbd2: drbd_connect: ****************sendin uuids while Connecting********
drbd2: drbd_connect: ****************Sending states while Connecting********
drbd2: drbd_connect: ****************DONE!!!! Connecting********
drbd2: drbd_sync_handshake:
drbd2: self 0000000000000000:0000000000000000:B2E340FD0FB4F236:B2E340FD0FB4F219
drbd2: peer B2E340FD0FB4F236:0000000000000000:E515D7C70644AF92:B2E340FD0FB4F219
drbd2: uuid_compare()=-1 by rule 5
drbd2: uuid[Current] now 0000000000000000
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
drbd2: Writing meta data super block now.
[-- Attachment #3: nodeB.txt --]
[-- Type: text/plain, Size: 2312 bytes --]
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
drbd2: Writing meta data super block now.
drbd2: meta connection shut down by peer.
drbd2: asender terminated
drbd2: tl_clear()
drbd2: Connection closed
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( TearDown -> Unconnected )
drbd2: receiver terminated
drbd2: receiver (re)started
drbd2: drbd_connect: ****************Connecting********
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( Unconnected -> WFConnection )
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: conn( WFConnection -> WFReportParams )
drbd2: Handshake successful: DRBD Network Protocol version 86
drbd2: drbd_connect: ****************sending procol while Connecting********
drbd2: drbd_connect: ****************Sending paran while Connecting********
drbd2: drbd_connect: **************** Sending sizes while Connecting********
drbd2: drbd_connect: ****************sendin uuids while Connecting********
drbd2: drbd_connect: ****************Sending states while Connecting********
drbd2: drbd_connect: ****************DONE!!!! Connecting********
drbd2: drbd_sync_handshake:
drbd2: self B2E340FD0FB4F236:0000000000000000:E515D7C70644AF92:B2E340FD0FB4F219
drbd2: peer 0000000000000000:0000000000000000:B2E340FD0FB4F236:B2E340FD0FB4F219
drbd2: uuid_compare()=-2 by rule 6
drbd2: Writing meta data super block now.
drbd2: writing of bitmap took 4 jiffies
drbd2: 12 GB marked out-of-sync by on disk bit-map.
drbd2: 13336132 KB now marked out-of-sync by on disk bit-map.
drbd2: Writing meta data super block now.
drbd2: uuid[History_start] now B2E340FD0FB4F236
drbd2: uuid[Current] now 0000000000000000
drbd2: _drbd_set_state: Changing state.
drbd2: _drbd_set_state: setting conn to SyncSource or SyncTarget see message below
drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
drbd2: Writing meta data super block now.
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
2007-04-02 21:44 ` Montrose, Ernest
@ 2007-04-03 9:34 ` Philipp Reisner
0 siblings, 0 replies; 8+ messages in thread
From: Philipp Reisner @ 2007-04-03 9:34 UTC (permalink / raw)
To: drbd-dev; +Cc: Montrose, Ernest
[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]
Am Montag, 2. April 2007 23:44 schrieb Montrose, Ernest:
> Phil,
> OK...I have learned a tad more since that last email. So before I even
> try the proposed patch here is a way I was able to duplicate the
> problem.
> Perhaps that will help a bit. Here it is with two nodes 'a' and 'b' .
> Suppose
> They are in steady states with UUIDS:
> Xa:0:Ha:HH:1:1:0:1:0:0
> Xb:0:Hb:HH:1:1:0:1:0:0
> Role Secondary/Secondary
>
> 1) Disconnect/detach /dev/drbdX on nodea
> 2) Move Current UUID of nodea to history-UUID of nodea and set
> current_UUID of nodea to 00000000000 with drbdmeta..
> 0:0:Xa:HH:1:1:0:1:0:0
> 3) Now attach and connect /dev/drbdX and the problem will occur
>
> I have attached the logs for my "manufactured" version of the problem
>
Ernest,
You are right, that DRBD should get out of this situation. The attached
patch fixes this. (I will commit it when you confirm that it also fixes
the issue for your)
But I am still asking myself how the CRASHED_PRIMARY got lost. Ernest,
do you still have the log of jerry from Mar 23 13:16:54 ?
I would really like to see the last 30 lines before Mar 23 13:16:54.
Thanks!
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: fix_i2.diff --]
[-- Type: text/x-diff, Size: 675 bytes --]
Index: drbd_receiver.c
===================================================================
--- drbd_receiver.c (revision 2822)
+++ drbd_receiver.c (working copy)
@@ -1890,7 +1890,7 @@
*rule_nr = 5;
peer = mdev->p_uuid[Bitmap] & ~((u64)1);
- if (self == peer) return -1;
+ if (self == peer && self != ((u64)0)) return -1;
*rule_nr = 6;
for ( i=History_start ; i<=History_end ; i++ ) {
@@ -1901,7 +1901,7 @@
*rule_nr = 7;
self = mdev->bc->md.uuid[Bitmap] & ~((u64)1);
peer = mdev->p_uuid[Current] & ~((u64)1);
- if (self == peer) return 1;
+ if (self == peer && self != ((u64)0)) return 1;
*rule_nr = 8;
for ( i=History_start ; i<=History_end ; i++ ) {
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
@ 2007-04-03 12:49 Montrose, Ernest
2007-04-05 20:41 ` Philipp Reisner
0 siblings, 1 reply; 8+ messages in thread
From: Montrose, Ernest @ 2007-04-03 12:49 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
[-- Attachment #1: Type: text/plain, Size: 2321 bytes --]
Phil,
I will try that patch and let you know but first attached is the logs
for drbd2 for both Ben and Jerry. For over a week I tried to figure out
how we got into that state but failed. So thanks for the help. I
flagged some weirdness where there appear to be missing lines. I
suspect that this may have to do with the way we collect these logs
here. The problem is near the bottom of the logs since we stop the
tests as soon as we notice the deadlock. Bear in mind that the test is
to power cycle the nodes and see what happens to various subsystems.
And...O! sorry you're not going to be on the East coast we would have
been thrilled to meet you!
Thanks,
EM--
-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner@linbit.com]
Sent: Tuesday, April 03, 2007 5:35 AM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
Am Montag, 2. April 2007 23:44 schrieb Montrose, Ernest:
> Phil,
> OK...I have learned a tad more since that last email. So before I
even
> try the proposed patch here is a way I was able to duplicate the
> problem.
> Perhaps that will help a bit. Here it is with two nodes 'a' and 'b' .
> Suppose
> They are in steady states with UUIDS:
> Xa:0:Ha:HH:1:1:0:1:0:0
> Xb:0:Hb:HH:1:1:0:1:0:0
> Role Secondary/Secondary
>
> 1) Disconnect/detach /dev/drbdX on nodea
> 2) Move Current UUID of nodea to history-UUID of nodea and set
> current_UUID of nodea to 00000000000 with drbdmeta..
> 0:0:Xa:HH:1:1:0:1:0:0
> 3) Now attach and connect /dev/drbdX and the problem will occur
>
> I have attached the logs for my "manufactured" version of the problem
>
Ernest,
You are right, that DRBD should get out of this situation. The attached
patch fixes this. (I will commit it when you confirm that it also fixes
the issue for your)
But I am still asking myself how the CRASHED_PRIMARY got lost. Ernest,
do you still have the log of jerry from Mar 23 13:16:54 ?
I would really like to see the last 30 lines before Mar 23 13:16:54.
Thanks!
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: drbd2.ben.log --]
[-- Type: application/octet-stream, Size: 9232 bytes --]
Mar 23 13:02:35 drbd2: PingAck did not arrive in time.
Mar 23 13:02:35 drbd2: short read expecting header on sock: r=-512
Mar 23 13:02:35 drbd2: tl_clear()
Mar 23 13:02:35 ben kernel: drbd2: PingAck did not arrive in time.
Mar 23 13:02:35 ben kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Mar 23 13:02:35 ben kernel: drbd2: asender terminated
Mar 23 13:02:35 ben kernel: drbd2: short read expecting header on sock: r=-512
Mar 23 13:02:35 ben kernel: drbd2: tl_clear()
Mar 23 13:02:35 ben kernel: drbd2: Connection closed
Mar 23 13:02:35 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:02:35 ben kernel: drbd2: conn( NetworkFailure -> Unconnected )
Mar 23 13:02:35 ben kernel: drbd2: receiver terminated
Mar 23 13:02:35 ben kernel: drbd2: receiver (re)started
Mar 23 13:02:35 ben kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:03:54 ben kernel: drbd2: role( Secondary -> Primary )
Mar 23 13:03:54 ben kernel: drbd2: Creating new current UUID
Mar 23 13:03:54 ben kernel: drbd2: uuid[Bitmap] now 8A0F3A9F9E1A77C6
Mar 23 13:03:54 ben kernel: drbd2: uuid[Current] now 6B0B1DA5CB20689D
Mar 23 13:03:54 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:05:18 ben kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:05:19 ben kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:05:19 ben kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:05:19 ben kernel: drbd2: drbd_sync_handshake:
Mar 23 13:05:19 ben kernel: drbd2: self 6B0B1DA5CB20689D:8A0F3A9F9E1A77C6:B450183638BDD68A:F6CE1616F9917C9F
Mar 23 13:05:19 ben kernel: drbd2: peer 8A0F3A9F9E1A77C7:0000000000000000:B450183638BDD68A:F6CE1616F9917C9F
Mar 23 13:05:19 ben kernel: drbd2: uuid_compare()=1 by rule 7
Mar 23 13:05:19 ben kernel: drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:05:19 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:05:22 ben kernel: drbd2: uuid[History_start] now 8A0F3A9F9E1A77C6
Mar 23 13:05:22 ben kernel: drbd2: uuid[Bitmap] now CB02C14FA3081C52
Mar 23 13:05:23 ben kernel: drbd2: conn( WFBitMapS -> PausedSyncS ) pdsk( UpToDate -> Inconsistent )
Mar 23 13:05:23 ben kernel: drbd2: Began resync as PausedSyncS (will sync 659464 KB [164866 bits set]).
Mar 23 13:05:23 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:06:21 ben kernel: drbd2: peer_isp( 1 -> 0 )
Mar 23 13:06:21 ben kernel: drbd2: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 )
Mar 23 13:06:21 ben kernel: drbd2: Syncer continues.
Mar 23 13:06:46 ben kernel: drbd2: Resync done (total 88 sec; paused 63 sec; 26376 K/sec)
Mar 23 13:06:46 ben kernel: drbd2: uuid[History_start] now CB02C14FA3081C52
Mar 23 13:06:46 ben kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:06:46 ben kernel: drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Mar 23 13:06:46 ben kernel: drbd2: Writing meta data super block now.
<<<<<<<<<<<<<<<<<<<<POWER cycled here and rebooting ben>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Mar 23 13:10:57 ben kernel: drbd2: disk( Diskless -> Attaching )
Mar 23 13:10:57 ben kernel: drbd2: Found 6 transactions (324 active extents) in activity log.
Mar 23 13:10:57 ben kernel: drbd2: max_segment_size ( = BIO size ) = 32768
Mar 23 13:10:57 ben kernel: drbd2: drbd_bm_resize called with capacity == 26672264
Mar 23 13:10:57 ben kernel: drbd2: resync bitmap: bits=3334033 words=104190
Mar 23 13:10:57 ben kernel: drbd2: size = 12 GB (13336132 KB)
Mar 23 13:10:57 ben kernel: drbd2: reading of bitmap took 2 jiffies
Mar 23 13:10:57 ben kernel: drbd2: recounting of set bits took additional 0 jiffies
Mar 23 13:10:57 ben kernel: drbd2: 0 KB marked out-of-sync by on disk bit-map.
Mar 23 13:10:58 ben kernel: drbd2: Marked additional 1028 MB as out-of-sync based on AL.
Mar 23 13:10:58 ben kernel: drbd2: disk( Attaching -> UpToDate )
Mar 23 13:10:58 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:11:01 ben kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:11:01 ben kernel: drbd2: conn( StandAlone -> Unconnected )
Mar 23 13:11:02 ben kernel: drbd2: receiver (re)started
Mar 23 13:11:05 ben.sn.stratus.com login: Mar 23 13:11:02 ben kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:11:02 ben kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:11:03 ben kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:11:04 ben kernel: drbd2: drbd_sync_handshake:
Mar 23 13:11:04 ben kernel: drbd2: self 6B0B1DA5CB20689D:0000000000000000:CB02C14FA3081C52:8A0F3A9F9E1A77C6
Mar 23 13:11:04 ben kernel: drbd2: peer F71E503A8179BC5D:6B0B1DA5CB20689C:CB02C14FA3081C52:8A0F3A9F9E1A77C6
Mar 23 13:11:05 ben kernel: drbd2: uuid_compare()=-1 by rule 5
Mar 23 13:11:06 ben kernel: drbd2: uuid[History_start] now 6B0B1DA5CB20689D
Mar 23 13:11:08 ben kernel: drbd2: uuid[Current] now 6B0B1DA5CB20689C
Mar 23 13:11:09 ben kernel: drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:11:10 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:11:18 ben kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
Mar 23 13:11:18 ben kernel: drbd2: uuid[Current] now 3F430D4E1D59C3EA
Mar 23 13:11:18 ben kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:11:18 ben kernel: drbd2: conn( WFSyncUUID -> PausedSyncT ) disk( UpToDate -> Inconsistent )
Mar 23 13:11:18 ben kernel: drbd2: Began resync as PausedSyncT (will sync 1052696 KB [263174 bits set]).
Mar 23 13:11:18 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:12:09 ben kernel: drbd2: peer_isp( 1 -> 0 )
Mar 23 13:12:09 ben kernel: drbd2: conn( PausedSyncT -> SyncTarget ) aftr_isp( 1 -> 0 )
Mar 23 13:12:09 ben kernel: drbd2: Syncer continues.
Mar 23 13:12:46 ben kernel: drbd2: Resync done (total 111 sec; paused 74 sec; 28448 K/sec)
Mar 23 13:12:46 ben kernel: drbd2: uuid[Bitmap] now 6B0B1DA5CB20689C
Mar 23 13:12:46 ben kernel: drbd2: uuid[History_start] now CB02C14FA3081C52
Mar 23 13:12:46 ben kernel: drbd2: uuid[History_end] now 8A0F3A9F9E1A77C6
Mar 23 13:12:46 ben kernel: drbd2: uuid[History_start] now 6B0B1DA5CB20689C
Mar 23 13:12:46 ben kernel: drbd2: uuid[Bitmap] now 3F430D4E1D59C3EA
Mar 23 13:12:46 ben kernel: drbd2: uuid[Current] now F71E503A8179BC5C
Mar 23 13:12:46 ben kernel: drbd2: uuid[History_start] now 3F430D4E1D59C3EA
Mar 23 13:12:46 ben kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:12:46 ben kernel: drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 23 13:12:46 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:14:09 drbd2: PingAck did not arrive in time.
Mar 23 13:14:09 drbd2: short read expecting header on sock: r=-512
Mar 23 13:14:09 ben kernel: drbd2: PingAck did not arrive in time.
Mar 23 13:14:09 drbd2: tl_clear()
Mar 23 13:14:09 ben kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Mar 23 13:14:09 ben kernel: drbd2: asender terminated
Mar 23 13:14:09 ben kernel: drbd2: short read expecting header on sock: r=-512
Mar 23 13:14:09 ben kernel: drbd2: tl_clear()
Mar 23 13:14:09 ben kernel: drbd2: Connection closed
Mar 23 13:14:09 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:14:09 ben kernel: drbd2: conn( NetworkFailure -> Unconnected )
Mar 23 13:14:09 ben kernel: drbd2: receiver terminated
Mar 23 13:14:09 ben kernel: drbd2: receiver (re)started
Mar 23 13:14:09 ben kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:16:46 drbd2: rct = 2 in /test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_receiver.c:1878
<<<<<<<<<Log weirdness here we should have seen drbd_sync_hanshake messages..>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Mar 23 13:16:47 ben kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:16:48 ben kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:16:48 ben kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:16:48 ben kernel: drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:16:48 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:16:49 ben kernel: drbd2: rct = 2 in /test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_receiver.c:1878
Mar 23 13:16:49 ben kernel: drbd2: drbd_sync_handshake:
Mar 23 13:16:49 ben kernel: drbd2: self F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:49 ben kernel: drbd2: peer F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:49 ben kernel: drbd2: uuid_compare()=-1 by rule 4
Mar 23 13:16:49 ben kernel: drbd2: uuid[History_start] now F71E503A8179BC5C
Mar 23 13:16:49 ben kernel: drbd2: uuid[Current] now 0000000000000000
Mar 23 13:16:49 ben kernel: drbd2: conn( Connected -> WFBitMapT ) <<<<<<The problem>>>>>
Mar 23 13:16:49 ben kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:42 ben kernel: drbd2: peer_isp( 1 -> 0 )
Mar 23 13:17:42 ben kernel: drbd2: aftr_isp( 1 -> 0 )
[-- Attachment #3: drbd2.jerry.log --]
[-- Type: application/octet-stream, Size: 10363 bytes --]
Mar 23 13:05:21 jerry kernel: drbd2: disk( Diskless -> Attaching )
Mar 23 13:05:21 jerry kernel: drbd2: Found 6 transactions (162 active extents) in activity log.
Mar 23 13:05:21 jerry kernel: drbd2: max_segment_size ( = BIO size ) = 32768
Mar 23 13:05:21 jerry kernel: drbd2: drbd_bm_resize called with capacity == 26672264
Mar 23 13:05:21 jerry kernel: drbd2: resync bitmap: bits=3334033 words=104190
Mar 23 13:05:21 jerry kernel: drbd2: size = 12 GB (13336132 KB)
Mar 23 13:05:21 jerry kernel: drbd2: reading of bitmap took 2 jiffies
Mar 23 13:05:21 jerry kernel: drbd2: recounting of set bits took additional 0 jiffies
Mar 23 13:05:21 jerry kernel: drbd2: 0 KB marked out-of-sync by on disk bit-map.
Mar 23 13:05:21 jerry kernel: drbd2: Marked additional 644 MB as out-of-sync based on AL.
Mar 23 13:05:21 jerry kernel: drbd2: disk( Attaching -> UpToDate )
Mar 23 13:05:21 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:05:23 jerry kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:05:24 jerry kernel: drbd2: conn( StandAlone -> Unconnected )
Mar 23 13:05:24 jerry kernel: drbd2: receiver (re)started
Mar 23 13:05:24 jerry kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:05:24 jerry kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:05:24 jerry kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:05:26 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:05:26 jerry kernel: drbd2: self 8A0F3A9F9E1A77C7:0000000000000000:B450183638BDD68A:F6CE1616F9917C9F
Mar 23 13:05:26 jerry kernel: drbd2: peer 6B0B1DA5CB20689D:8A0F3A9F9E1A77C6:B450183638BDD68A:F6CE1616F9917C9F
Mar 23 13:05:27 jerry kernel: drbd2: uuid_compare()=-1 by rule 5
Mar 23 13:05:28 jerry.sn.stratus.com login: Mar 23 13:05:28 jerry kernel: drbd2: uuid[History_start] now 8A0F3A9F9E1A77C7
Mar 23 13:05:28 jerry kernel: drbd2: uuid[Current] now 8A0F3A9F9E1A77C6
Mar 23 13:05:28 jerry kernel: drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:05:30 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:05:35 jerry kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
Mar 23 13:05:35 jerry kernel: drbd2: uuid[Current] now CB02C14FA3081C52
Mar 23 13:05:35 jerry kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:05:36 jerry kernel: drbd2: conn( WFSyncUUID -> PausedSyncT ) disk( UpToDate -> Inconsistent )
Mar 23 13:05:36 jerry kernel: drbd2: Began resync as PausedSyncT (will sync 659464 KB [164866 bits set]).
Mar 23 13:05:36 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:06:22 jerry kernel: drbd2: aftr_isp( 1 -> 0 )
Mar 23 13:06:22 jerry kernel: drbd2: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 )
Mar 23 13:06:22 jerry kernel: drbd2: Syncer continues.
Mar 23 13:06:47 jerry kernel: drbd2: Resync done (total 88 sec; paused 63 sec; 26376 K/sec)
Mar 23 13:06:47 jerry kernel: drbd2: uuid[Bitmap] now 8A0F3A9F9E1A77C6
Mar 23 13:06:47 jerry kernel: drbd2: uuid[History_start] now B450183638BDD68A
Mar 23 13:06:47 jerry kernel: drbd2: uuid[History_end] now F6CE1616F9917C9F
Mar 23 13:06:47 jerry kernel: drbd2: uuid[History_start] now 8A0F3A9F9E1A77C6
Mar 23 13:06:47 jerry kernel: drbd2: uuid[Bitmap] now CB02C14FA3081C52
Mar 23 13:06:47 jerry kernel: drbd2: uuid[Current] now 6B0B1DA5CB20689C
Mar 23 13:06:47 jerry kernel: drbd2: uuid[History_start] now CB02C14FA3081C52
Mar 23 13:06:47 jerry kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:06:47 jerry kernel: drbd2: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 23 13:06:47 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:08:13 drbd2: PingAck did not arrive in time.
Mar 23 13:08:13 drbd2: short read expecting header on sock: r=-512
Mar 23 13:08:13 drbd2: tl_clear()
Mar 23 13:08:13 jerry kernel: drbd2: PingAck did not arrive in time.
Mar 23 13:08:13 jerry kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Mar 23 13:08:13 jerry kernel: drbd2: asender terminated
Mar 23 13:08:13 jerry kernel: drbd2: short read expecting header on sock: r=-512
Mar 23 13:08:13 jerry kernel: drbd2: tl_clear()
Mar 23 13:08:13 jerry kernel: drbd2: Connection closed
Mar 23 13:08:13 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:08:13 jerry kernel: drbd2: conn( NetworkFailure -> Unconnected )
Mar 23 13:08:13 jerry kernel: drbd2: receiver terminated
Mar 23 13:08:13 jerry kernel: drbd2: receiver (re)started
Mar 23 13:08:13 jerry kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:09:27 jerry kernel: drbd2: role( Secondary -> Primary )
Mar 23 13:09:27 jerry kernel: drbd2: Creating new current UUID
Mar 23 13:09:27 jerry kernel: drbd2: uuid[Bitmap] now 6B0B1DA5CB20689C
Mar 23 13:09:27 jerry kernel: drbd2: uuid[Current] now F71E503A8179BC5D
Mar 23 13:09:27 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:10:56 jerry kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:10:56 jerry kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:10:56 jerry kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:10:56 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:10:56 jerry kernel: drbd2: self F71E503A8179BC5D:6B0B1DA5CB20689C:CB02C14FA3081C52:8A0F3A9F9E1A77C6
Mar 23 13:10:56 jerry kernel: drbd2: peer 6B0B1DA5CB20689D:0000000000000000:CB02C14FA3081C52:8A0F3A9F9E1A77C6
Mar 23 13:10:56 jerry kernel: drbd2: uuid_compare()=1 by rule 7
Mar 23 13:10:56 jerry kernel: drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:10:56 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:10:57 jerry kernel: drbd2: uuid[History_start] now 6B0B1DA5CB20689C
Mar 23 13:10:57 jerry kernel: drbd2: uuid[Bitmap] now 3F430D4E1D59C3EA
Mar 23 13:10:57 jerry kernel: drbd2: conn( WFBitMapS -> PausedSyncS ) pdsk( UpToDate -> Inconsistent )
Mar 23 13:10:57 jerry kernel: drbd2: Began resync as PausedSyncS (will sync 1052696 KB [263174 bits set]).
Mar 23 13:10:57 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:12:09 jerry kernel: drbd2: aftr_isp( 1 -> 0 )
Mar 23 13:12:09 jerry kernel: drbd2: conn( PausedSyncS -> SyncSource ) peer_isp( 1 -> 0 )
Mar 23 13:12:09 jerry kernel: drbd2: Syncer continues.
Mar 23 13:12:47 jerry kernel: drbd2: Resync done (total 111 sec; paused 74 sec; 28448 K/sec)
Mar 23 13:12:47 jerry kernel: drbd2: uuid[History_start] now 3F430D4E1D59C3EA
Mar 23 13:12:47 jerry kernel: drbd2: uuid[Bitmap] now 0000000000000000
Mar 23 13:12:47 jerry kernel: drbd2: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Mar 23 13:12:47 jerry kernel: drbd2: Writing meta data super block now.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<POWER cycled here and rebooting>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Mar 23 13:16:46 drbd2: rct = 0 in /test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_receiver.c:1878
Mar 23 13:16:48 jerry kernel: drbd2: disk( Diskless -> Attaching )
Mar 23 13:16:49 jerry kernel: drbd2: Found 6 transactions (168 active extents) in activity log.
Mar 23 13:16:49 jerry kernel: drbd2: max_segment_size ( = BIO size ) = 32768
Mar 23 13:16:49 jerry kernel: drbd2: drbd_bm_resize called with capacity == 26672264
Mar 23 13:16:49 jerry kernel: drbd2: resync bitmap: bits=3334033 words=104190
Mar 23 13:16:49 jerry kernel: drbd2: size = 12 GB (13336132 KB)
Mar 23 13:16:49 jerry kernel: drbd2: reading of bitmap took 3 jiffies
Mar 23 13:16:49 jerry kernel: drbd2: recounting of set bits took additional 0 jiffies
Mar 23 13:16:49 jerry kernel: drbd2: 0 KB marked out-of-sync by on disk bit-map.
Mar 23 13:16:49 jerry kernel: drbd2: Marked additional 652 MB as out-of-sync based on AL.
Mar 23 13:16:49 jerry kernel: drbd2: disk( Attaching -> UpToDate )
Mar 23 13:16:49 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:16:52 jerry kernel: drbd2: conn( StandAlone -> Unconnected )
Mar 23 13:16:52 jerry kernel: drbd2: receiver (re)started
Mar 23 13:16:52 jerry kernel: drbd2: conn( Unconnected -> WFConnection )
Mar 23 13:16:52 jerry kernel: drbd2: conn( WFConnection -> WFReportParams )
Mar 23 13:16:52 jerry kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
Mar 23 13:16:52 jerry kernel: drbd2: aftr_isp( 0 -> 1 )
Mar 23 13:16:52 jerry kernel: drbd2: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Mar 23 13:16:52 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:16:53 jerry kernel: drbd2: rct = 0 in /test_logs/builds/SuperNova/trunk/070323/platform/drbd/src/drbd/drbd_receiver.c:1878
Mar 23 13:16:54 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:16:54 jerry kernel: drbd2: self F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:54 jerry kernel: drbd2: peer F71E503A8179BC5C:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:16:55 jerry kernel: drbd2: uuid_compare()=0 by rule 4
Mar 23 13:16:55 jerry kernel: drbd2: No resync, but bits in bitmap!
Mar 23 13:17:00 jerry kernel: drbd2: drbd_sync_handshake:
Mar 23 13:17:00 jerry kernel: drbd2: self F71E503A8179BC5D:0000000000000000:3F430D4E1D59C3EA:6B0B1DA5CB20689C
Mar 23 13:17:00 jerry kernel: drbd2: peer 0000000000000000:0000000000000000:F71E503A8179BC5C:3F430D4E1D59C3EA
Mar 23 13:17:00 jerry kernel: drbd2: uuid_compare()=-2 by rule 6
Mar 23 13:17:00 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:01 jerry kernel: drbd2: writing of bitmap took 11 jiffies
Mar 23 13:17:01 jerry kernel: drbd2: 12 GB marked out-of-sync by on disk bit-map.
Mar 23 13:17:02 jerry kernel: drbd2: 13336132 KB now marked out-of-sync by on disk bit-map.
Mar 23 13:17:02 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:02 jerry kernel: drbd2: uuid[History_start] now F71E503A8179BC5D
Mar 23 13:17:02 jerry kernel: drbd2: uuid[Current] now 0000000000000000
Mar 23 13:17:03 jerry kernel: drbd2: conn( Connected -> WFBitMapT ) <<<<<<<<The problem>>>>>>>>>
Mar 23 13:17:03 jerry kernel: drbd2: Writing meta data super block now.
Mar 23 13:17:42 jerry kernel: drbd2: aftr_isp( 1 -> 0 )
Mar 23 13:17:42 jerry kernel: drbd2: peer_isp( 1 -> 0 )
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
@ 2007-04-05 20:45 Montrose, Ernest
0 siblings, 0 replies; 8+ messages in thread
From: Montrose, Ernest @ 2007-04-05 20:45 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
Thanks for the patch! I will try this and let you know.
EM--
-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Philipp Reisner
Sent: Thursday, April 05, 2007 4:41 PM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
Hi Ernest,
I think with this patch it is no longer possible to end up
in the both nodes WFBitMapT situation.
http://lists.linbit.com/pipermail/drbd-cvs/2007-April/001501.html
-Phil
_______________________________________________
drbd-dev mailing list
drbd-dev@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-dev
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
@ 2007-04-06 19:00 Montrose, Ernest
0 siblings, 0 replies; 8+ messages in thread
From: Montrose, Ernest @ 2007-04-06 19:00 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
I tested the new patch and the indications are clear that the deadlock
is fixed. Thanks!
EM--
-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
On Behalf Of Philipp Reisner
Sent: Thursday, April 05, 2007 4:41 PM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT
Hi Ernest,
I think with this patch it is no longer possible to end up
in the both nodes WFBitMapT situation.
http://lists.linbit.com/pipermail/drbd-cvs/2007-April/001501.html
-Phil
_______________________________________________
drbd-dev mailing list
drbd-dev@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-dev
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2007-04-06 19:00 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-30 22:01 [Drbd-dev] DRBD8: drbd nodes deadlock in WFBitMapT Montrose, Ernest
2007-04-02 19:59 ` Philipp Reisner
2007-04-02 21:44 ` Montrose, Ernest
2007-04-03 9:34 ` Philipp Reisner
-- strict thread matches above, loose matches on Subject: below --
2007-04-03 12:49 Montrose, Ernest
2007-04-05 20:41 ` Philipp Reisner
2007-04-05 20:45 Montrose, Ernest
2007-04-06 19:00 Montrose, Ernest
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox