* [Drbd-dev] DRBD8: Deadlock in PausedSyncS
@ 2007-07-30 22:31 Montrose, Ernest
2007-08-02 9:44 ` Philipp Reisner
0 siblings, 1 reply; 4+ messages in thread
From: Montrose, Ernest @ 2007-07-30 22:31 UTC (permalink / raw)
To: drbd-dev
[-- Attachment #1.1: Type: text/plain, Size: 1067 bytes --]
Hi all,
We are seeing a problem where we deadlock if a pause sync request
happens while attaching. Below is an explaination of what I think is
occurring:
Consider two nodes X primary and Y Secondary.
1. X becomes Secondary/Diskless
2. Y becomes Primary
3. X tries to Attach and sends its states/uuids to Y
4. While Y is in receive_state() doing a drbd_sync_handshake(), it
receives a Paused Sync request.
This is where the trouble starts.
5. aftr_isp is changed from 0->1 on Y and after_state_ch() called. This
triggers
a drbd_send_state() from Y.
6. X receives States from Y but no uuids and runs a
drbd_sync_handshake() with the old uuids and
we deadlocked with PausedSyncS on both sides.
I am not sure how to best fix this. Perhaps we should not call
drbd_send_state()
in after_state_ch() for a sync request from the peer if the peer's disk
is diskless.
Or we do send states, sends the uuids as well. The attached patch will
at least serve as an illustration of the issue if not the correct fix.
Thank you!
EM--
[-- Attachment #1.2: Type: text/html, Size: 6292 bytes --]
[-- Attachment #2: pausedsync.patch --]
[-- Type: application/octet-stream, Size: 478 bytes --]
Index: drbd/drbd_main.c
===================================================================
--- drbd/drbd_main.c (revision 16853)
+++ drbd/drbd_main.c (working copy)
@@ -953,7 +953,8 @@
/* We want to pause/continue resync, tell peer. */
if ( ns.conn >= Connected &&
(( os.aftr_isp != ns.aftr_isp ) ||
- ( os.user_isp != ns.user_isp )) ) {
+ ( os.user_isp != ns.user_isp )) &&
+ mdev->state.pdsk != Diskless ) {
drbd_send_state(mdev);
}
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Drbd-dev] DRBD8: Deadlock in PausedSyncS
2007-07-30 22:31 [Drbd-dev] DRBD8: Deadlock in PausedSyncS Montrose, Ernest
@ 2007-08-02 9:44 ` Philipp Reisner
2007-08-02 12:07 ` Oren Nechushtan
0 siblings, 1 reply; 4+ messages in thread
From: Philipp Reisner @ 2007-08-02 9:44 UTC (permalink / raw)
To: drbd-dev; +Cc: Montrose, Ernest
On Tuesday 31 July 2007 00:31:19 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where we deadlock if a pause sync request
> happens while attaching. Below is an explaination of what I think is
> occurring:
>
> Consider two nodes X primary and Y Secondary.
> 1. X becomes Secondary/Diskless
> 2. Y becomes Primary
> 3. X tries to Attach and sends its states/uuids to Y
> 4. While Y is in receive_state() doing a drbd_sync_handshake(), it
> receives a Paused Sync request.
> This is where the trouble starts.
> 5. aftr_isp is changed from 0->1 on Y and after_state_ch() called. This
> triggers
> a drbd_send_state() from Y.
> 6. X receives States from Y but no uuids and runs a
> drbd_sync_handshake() with the old uuids and
> we deadlocked with PausedSyncS on both sides.
>
> I am not sure how to best fix this. Perhaps we should not call
> drbd_send_state()
> in after_state_ch() for a sync request from the peer if the peer's disk
> is diskless.
> Or we do send states, sends the uuids as well. The attached patch will
> at least serve as an illustration of the issue if not the correct fix.
>
Hi Ernest,
Your patch was correct I think. But in reality we have not only this
one case of the problem it is a whole class of such problems.
While the node that gets the new disk is in disk=Negotiating, it
will run drbd_sync_handshake() on each state packet that comes in.
We need to avoid this. Calling pause-sync is just one possible way
to cause the transmittion of a state packets, this is a whole class
of such problems.
Instead of fixing every place where we send a state packet, to not
send it if the peer has no disk, I decided to fix the receiving
side.
http://lists.linbit.com/pipermail/drbd-cvs/2007-August/001613.html
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [Drbd-dev] DRBD8: Deadlock in PausedSyncS
@ 2007-08-02 11:58 Montrose, Ernest
0 siblings, 0 replies; 4+ messages in thread
From: Montrose, Ernest @ 2007-08-02 11:58 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
Yeah, I agree. There are a few more situations other then this one.
I will try to actually test a few of those with this patch and let you
know if I find anything weird. Thanks!
EM--
-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner@linbit.com]
Sent: Thursday, August 02, 2007 5:45 AM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: Deadlock in PausedSyncS
On Tuesday 31 July 2007 00:31:19 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where we deadlock if a pause sync request
> happens while attaching. Below is an explaination of what I think is
> occurring:
>
> Consider two nodes X primary and Y Secondary.
> 1. X becomes Secondary/Diskless
> 2. Y becomes Primary
> 3. X tries to Attach and sends its states/uuids to Y
> 4. While Y is in receive_state() doing a drbd_sync_handshake(), it
> receives a Paused Sync request.
> This is where the trouble starts.
> 5. aftr_isp is changed from 0->1 on Y and after_state_ch() called.
This
> triggers
> a drbd_send_state() from Y.
> 6. X receives States from Y but no uuids and runs a
> drbd_sync_handshake() with the old uuids and
> we deadlocked with PausedSyncS on both sides.
>
> I am not sure how to best fix this. Perhaps we should not call
> drbd_send_state()
> in after_state_ch() for a sync request from the peer if the peer's
disk
> is diskless.
> Or we do send states, sends the uuids as well. The attached patch
will
> at least serve as an illustration of the issue if not the correct fix.
>
Hi Ernest,
Your patch was correct I think. But in reality we have not only this
one case of the problem it is a whole class of such problems.
While the node that gets the new disk is in disk=Negotiating, it
will run drbd_sync_handshake() on each state packet that comes in.
We need to avoid this. Calling pause-sync is just one possible way
to cause the transmittion of a state packets, this is a whole class
of such problems.
Instead of fixing every place where we send a state packet, to not
send it if the peer has no disk, I decided to fix the receiving
side.
http://lists.linbit.com/pipermail/drbd-cvs/2007-August/001613.html
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [Drbd-dev] DRBD8: Deadlock in PausedSyncS
2007-08-02 9:44 ` Philipp Reisner
@ 2007-08-02 12:07 ` Oren Nechushtan
0 siblings, 0 replies; 4+ messages in thread
From: Oren Nechushtan @ 2007-08-02 12:07 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev; +Cc: Montrose, Ernest
Hi,
I believe we have a similar problem, possibly a variant.
We are using 2.4.x kernel with drbd 0.7.22.
It seems that sometimes drbdadm state stalls, for 20sec or much longer
(possibly indefinitely.)
Can it be that the same problem exists in the older drbd branch?
More info..
We have a two node HA cluster with drbd. Usually drbdadm state works
great within less than a second.
However, sometime it stalls. It seems that in these cases drbdadm
returns code 20.
The output of the command varies between Primary/Secondary to
Primary/Unknown.
drbd.conf special flags
wfc-timeout 30;
degr-wfc-timeout 60;
For example, during one of the times drbdadm stalled for 20 sec,
/var/log/messages did not show any activity on this node (me) during and
immediately before this happened.
A second node (him) , attached to the same drbd device, shows the next
output around 2 minutes before this happened.
There were no messages during the drbdadm state check on the second
node.
him kernel: drbd0: drbdsetup [17327]: cstate StandAlone --> Unconnected
him kernel: drbd0: drbdsetup [17327]: cstate Unconnected --> StandAlone
him kernel: drbd0: drbdsetup [17327]: cstate StandAlone --> Unconfigured
him kernel: drbd0: worker terminated
him kernel: drbd: module cleanup done.
him kernel: drbd: initialized. Version: 0.7.22 (api:79/proto:74)
him kernel: drbd: SVN Revision: 2554 build by root@Build, 2007-02-21
09:49:13
him kernel: drbd: registered as block device major 147
him kernel: drbd0: resync bitmap: bits=15360148 words=480006
him kernel: drbd0: size = 58 GB (61440592 KB)
him kernel: klogd 1.4.1, ---------- state change ----------
him kernel: Loaded 551 symbols from 19 modules.
him kernel: drbd0: 8 KB marked out-of-sync by on disk bit-map.
him kernel: drbd0: Found 6 transactions (324 active extents) in activity
log.
him kernel: drbd0: drbdsetup [17349]: cstate Unconfigured --> StandAlone
him kernel: drbd0: drbdsetup [17353]: cstate StandAlone --> Unconnected
him kernel: drbd0: drbd0_receiver [17354]: cstate Unconnected -->
WFConnection
him kernel: drbd0: Unable to bind source sock (-99)
him kernel: drbd0: Unable to bind sock2 (-99)
him kernel: drbd0: drbd0_receiver [17354]: cstate WFConnection -->
Unconnected
him kernel: drbd0: worker terminated
him kernel: drbd0: drbd0_receiver [17354]: cstate Unconnected -->
Unconnected
him kernel: drbd0: Connection lost.
him kernel: drbd0: Discarding network configuration.
him kernel: drbd0: drbd0_receiver [17354]: cstate Unconnected -->
StandAlone
him kernel: drbd0: receiver terminated
him kernel: drbd0: drbdsetup [17534]: cstate StandAlone --> Unconnected
him kernel: drbd0: drbd0_receiver [17535]: cstate Unconnected -->
WFConnection
him kernel: drbd0: Unable to bind source sock (-99)
him kernel: drbd0: Unable to bind sock2 (-99)
him kernel: drbd0: drbd0_receiver [17535]: cstate WFConnection -->
Unconnected
him kernel: drbd0: worker terminated
him kernel: drbd0: drbd0_receiver [17535]: cstate Unconnected -->
Unconnected
him kernel: drbd0: Connection lost.
him kernel: drbd0: Discarding network configuration.
him kernel: drbd0: drbd0_receiver [17535]: cstate Unconnected -->
StandAlone
him kernel: drbd0: receiver terminated
In case the state was Primary/Unknown, a drbdadm connect results with
the next output, everything within a short time interval:
him kernel: drbd0: drbdsetup [18288]: cstate StandAlone --> Unconnected
him kernel: drbd0: drbd0_receiver [18289]: cstate Unconnected -->
WFConnection
me kernel: drbd0: drbd0_receiver [1915]: cstate WFConnection -->
WFReportParams
me kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
him kernel: drbd0: drbd0_receiver [18289]: cstate WFConnection -->
WFReportParams
me kernel: drbd0: Connection established.
him kernel: drbd0: Handshake successful: DRBD Network Protocol version
74
me kernel: drbd0: I am(P): 1:00000002:00000001:0000008a:00000028:10
him kernel: drbd0: Connection established.
me kernel: drbd0: Peer(S): 1:00000002:00000001:0000008c:00000027:00
him kernel: drbd0: I am(S): 1:00000002:00000001:0000008c:00000027:00
me kernel: drbd0: Current Primary shall become sync TARGET! Aborting to
prevent data corruption.
him kernel: drbd0: Peer(P): 1:00000002:00000001:0000008a:00000028:10
me kernel: drbd0: drbd0_receiver [1915]: cstate WFReportParams -->
StandAlone
him kernel: drbd0: drbd0_receiver [18289]: cstate WFReportParams -->
WFBitMapS
me kernel: drbd0: error receiving ReportParams, l: 72!
him kernel: drbd0: meta connection shut down by peer.
me kernel: drbd0: worker terminated
him kernel: drbd0: drbd0_asender [18290]: cstate WFBitMapS -->
NetworkFailure
me kernel: drbd0: asender terminated
him kernel: drbd0: asender terminated
me kernel: drbd0: drbd0_receiver [1915]: cstate StandAlone -->
StandAlone
him kernel: drbd0: short sent ReportBitMap size=4096 sent=-1001
me kernel: drbd0: Connection lost.
him kernel: drbd0: Secondary/Unknown --> Secondary/Primary
me kernel: drbd0: receiver terminated
him kernel: drbd0: sock was shut down by peer
him kernel: drbd0: drbd0_receiver [18289]: cstate NetworkFailure -->
BrokenPipe
him kernel: drbd0: short read expecting header on sock: r=0
him kernel: drbd0: worker terminated
him kernel: drbd0: drbd0_receiver [18289]: cstate BrokenPipe -->
Unconnected
him kernel: drbd0: Connection lost.
him kernel: drbd0: drbd0_receiver [18289]: cstate Unconnected -->
WFConnection
Thanks in advance,
Oren.
> -----Original Message-----
> From: drbd-dev-bounces@lists.linbit.com
> [mailto:drbd-dev-bounces@lists.linbit.com] On Behalf Of
> Philipp Reisner
> Sent: Thursday, August 02, 2007 12:45 PM
> To: drbd-dev@lists.linbit.com
> Cc: Montrose, Ernest
> Subject: Re: [Drbd-dev] DRBD8: Deadlock in PausedSyncS
>
> On Tuesday 31 July 2007 00:31:19 Montrose, Ernest wrote:
> > Hi all,
> > We are seeing a problem where we deadlock if a pause sync request
> > happens while attaching. Below is an explaination of what
> I think is
> > occurring:
> >
> > Consider two nodes X primary and Y Secondary.
> > 1. X becomes Secondary/Diskless
> > 2. Y becomes Primary
> > 3. X tries to Attach and sends its states/uuids to Y 4.
> While Y is in
> > receive_state() doing a drbd_sync_handshake(), it receives a Paused
> > Sync request.
> > This is where the trouble starts.
> > 5. aftr_isp is changed from 0->1 on Y and after_state_ch() called.
> > This triggers
> > a drbd_send_state() from Y.
> > 6. X receives States from Y but no uuids and runs a
> > drbd_sync_handshake() with the old uuids and
> > we deadlocked with PausedSyncS on both sides.
> >
> > I am not sure how to best fix this. Perhaps we should not call
> > drbd_send_state()
> > in after_state_ch() for a sync request from the peer if the peer's
> > disk is diskless.
> > Or we do send states, sends the uuids as well. The attached patch
> > will at least serve as an illustration of the issue if not
> the correct fix.
> >
>
>
> Hi Ernest,
>
> Your patch was correct I think. But in reality we have not
> only this one case of the problem it is a whole class of such
> problems.
>
> While the node that gets the new disk is in disk=Negotiating,
> it will run drbd_sync_handshake() on each state packet that comes in.
>
> We need to avoid this. Calling pause-sync is just one
> possible way to cause the transmittion of a state packets,
> this is a whole class of such problems.
>
> Instead of fixing every place where we send a state packet,
> to not send it if the peer has no disk, I decided to fix the
> receiving side.
>
> http://lists.linbit.com/pipermail/drbd-cvs/2007-August/001613.html
>
> -Phil
> --
> : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
> : Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
> _______________________________________________
> drbd-dev mailing list
> drbd-dev@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-dev
>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2007-08-02 12:07 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-30 22:31 [Drbd-dev] DRBD8: Deadlock in PausedSyncS Montrose, Ernest
2007-08-02 9:44 ` Philipp Reisner
2007-08-02 12:07 ` Oren Nechushtan
-- strict thread matches above, loose matches on Subject: below --
2007-08-02 11:58 Montrose, Ernest
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox