Distributed Replicated Block Device (DRBD) development
* RE: [Drbd-dev] DRBD8: Resync stalled at 100% due to race condition
@ 2007-06-01 14:27 Ernest Montrose
  0 siblings, 0 replies; 5+ messages in thread
From: Ernest Montrose @ 2007-06-01 14:27 UTC (permalink / raw)
  To: philipp.reisner, drbd-dev

Phil,
Thanks! But be aware that I sent multiple messages about this, since we are having
email issues here, and I don't know which email you've seen. At least one of the emails
has a proposed patch that would simply acquire the req_lock earlier.  But that would
deadlock, so ignore that patch. Just an FYI.

EM--
-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com] On Behalf Of Philipp Reisner
Sent: Friday, June 01, 2007 10:19 AM
To: drbd-dev@linbit.com
Subject: Re: [Drbd-dev] DRBD8: Resync stalled at 100% due to race condition
 
On Thursday 24 May 2007 14:41:53 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where a resync hangs on the SyncSource at the
> end.  The SyncTarget finished OK and shows Connected. The signature on
> the SyncSource is:
[...]
 
Hi Ernest,
 
I will try to go through this early next week (hopefully Monday).
 
-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
_______________________________________________
drbd-dev mailing list
drbd-dev@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-dev


       


^ permalink raw reply	[flat|nested] 5+ messages in thread
* RE: [Drbd-dev] DRBD8: Resync stalled at 100% due to race condition
@ 2007-06-04 18:15 Montrose, Ernest
  0 siblings, 0 replies; 5+ messages in thread
From: Montrose, Ernest @ 2007-06-04 18:15 UTC (permalink / raw)
  To: Philipp Reisner, drbd-dev

Phil,
Thanks for the patch.  I tested the change and it appears to be fine so far.
I am pretty confident this fixes it based on what I have tried before when
I tried to stage the problem. Please check it in.

EM-- 

-----Original Message-----
From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com] On Behalf Of Philipp Reisner
Sent: Monday, June 04, 2007 5:09 AM
To: drbd-dev@linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: Resync stalled at 100% due to race condition

On Thursday 24 May 2007 14:41:53 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where a resync hangs on the SyncSource at the
> end.  The SyncTarget finished OK and shows Connected. The signature on
> the SyncSource is:
>
[...]

I think you are right in saying that receive_state() is wrong, but
I have another interpretation of the logs.

> What I think is happening is that there is a race condition where
> drbd_resync_finished() races receive_state() in this manner:
> 1)  The resync finished on drbd0 and we enter drbd_resync_finished().
> But before it can set the state to Connected, drbd15, which is a higher
> priority device, starts syncing. This puts drbd0 in PausedSyncS from
> SyncSource.

Right.

> 2) Drbd_resync_finished() for drbd0 now tries to go to Connected from
> PausedSyncS.  The logs below print this transition but the transition
> was not actually committed since we print before we actually assign the
> new values.

No, the log line "conn (PausedSyncS -> Connected)" is printed by
_drbd_set_state() (with the PSC macros), which runs under the req_lock,
and that is also where the assignment "mdev->state.i = ns.i;" happens.
The log is okay.

I think the race is this: receive_state() sets nconn=PausedSyncS, and
then the resync finishes before receive_state() reaches its call to
spin_lock() (so by then mdev->state.conn = Connected).
Now when receive_state() finally continues, it assigns (the now
obsolete value of) nconn to mdev->state.conn again by calling
_drbd_set_state().
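
A minimal sketch of the interleaving described above, replayed in a single
thread so that the order of the three steps is fixed. Everything in it is a
hypothetical stand-in for illustration (the enum, struct device, and all
function names are assumptions, not DRBD code); it only shows how a value
computed before the lock is taken can overwrite a newer state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the DRBD definitions. */
enum conn_state { CONNECTED, SYNC_SOURCE, PAUSED_SYNC_S };

struct device {
    enum conn_state conn;   /* plays the role of mdev->state.conn */
};

/* Step 1: receive_state()-like code picks the next state without a lock. */
enum conn_state compute_nconn(const struct device *dev)
{
    assert(dev != NULL);
    return dev->conn == SYNC_SOURCE ? PAUSED_SYNC_S : dev->conn;
}

/* Step 2: the resync finishes and commits Connected. */
void resync_finished(struct device *dev)
{
    dev->conn = CONNECTED;
}

/* Step 3: the value from step 1 is written back without re-checking. */
void commit_nconn(struct device *dev, enum conn_state nconn)
{
    dev->conn = nconn;
}

/* Replays steps 1-3 in the problematic order and returns the final state. */
enum conn_state replay_race(void)
{
    struct device dev = { SYNC_SOURCE };
    enum conn_state nconn = compute_nconn(&dev); /* decides PAUSED_SYNC_S */

    resync_finished(&dev);                       /* state is now CONNECTED */
    commit_nconn(&dev, nconn);                   /* stale value wins */
    return dev.conn;
}
```

Calling replay_race() leaves the device in the paused state even though the
resync already completed, which matches the stalled SyncSource symptom in
spirit.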

> I include a patch that may at least help illustrate the issue if not fix
> it as I am not sure the req_lock can be held this early without causing
> a deadlock or other performance issues.
>
>

I considered making drbd_sync_handshake() run under the req_lock,
and having a simple retry mechanism in receive_state().

Since 8.0.x is not the stable branch I decided to go with that small
patch. (I did not do any testing of this, since I guess it is rather
hard to hit this exact timing...)
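
The idea of a retry mechanism mentioned above can be sketched roughly as
follows. This is not the patch discussed in this thread; it is an
illustration under stated assumptions. The types, the sketch_ function names
and the hook parameter are all hypothetical, and a pthread mutex stands in
for the req_lock spinlock. The decision is made outside the lock, and it is
committed only if the state is still the one the decision was based on.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the DRBD definitions. */
enum cs { CS_CONNECTED, CS_SYNC_SOURCE, CS_PAUSED_SYNC_S };

struct sketch_dev {
    pthread_mutex_t lock;   /* stands in for req_lock */
    enum cs conn;           /* stands in for mdev->state.conn */
};

/* Hypothetical handshake: derives the next state from the one observed. */
enum cs sketch_handshake(enum cs seen)
{
    return seen == CS_SYNC_SOURCE ? CS_PAUSED_SYNC_S : seen;
}

/* Simulated concurrent resync completion. */
void sketch_resync_finished(struct sketch_dev *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->conn = CS_CONNECTED;
    pthread_mutex_unlock(&dev->lock);
}

/*
 * Decide outside the lock, commit under the lock only if the state did not
 * change in between, otherwise decide again. The optional hook runs once in
 * the unlocked window to simulate a concurrent state change.
 * Returns the number of retries that were needed.
 */
int sketch_receive_state(struct sketch_dev *dev,
                         void (*hook)(struct sketch_dev *))
{
    int retries = 0;

    assert(dev != NULL);
    for (;;) {
        pthread_mutex_lock(&dev->lock);
        enum cs seen = dev->conn;           /* snapshot the old state */
        pthread_mutex_unlock(&dev->lock);

        enum cs nconn = sketch_handshake(seen);

        if (hook != NULL) {
            hook(dev);                      /* concurrent change, once */
            hook = NULL;
        }

        pthread_mutex_lock(&dev->lock);
        if (dev->conn == seen) {            /* snapshot still valid */
            dev->conn = nconn;
            pthread_mutex_unlock(&dev->lock);
            return retries;
        }
        pthread_mutex_unlock(&dev->lock);   /* state moved on: retry */
        retries++;
    }
}
```

With sketch_resync_finished passed as the hook, the first attempt is
discarded and the second one sees the Connected state, so the stale paused
state is never written. The trade-off compared with holding the lock across
the whole handshake is that the lock is held only briefly, at the cost of
possibly repeating the decision.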

Ernest, thanks for pointing this out!
As soon as you agree, I will commit this patch...

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-06-04 18:15 UTC | newest]

Thread overview: 5+ messages
     [not found] <BD7042533C2F8943A6A4257A9E31C454F47906@EXNA.corp.stratus.com>
2007-05-24 12:41 ` [Drbd-dev] DRBD8: Resync stalled at 100% due to race condition Montrose, Ernest
2007-06-01 14:19   ` Philipp Reisner
2007-06-04  9:08   ` Philipp Reisner
2007-06-01 14:27 Ernest Montrose
  -- strict thread matches above, loose matches on Subject: below --
2007-06-04 18:15 Montrose, Ernest
