* [Drbd-dev] Handling on-disk caches
@ 2007-11-07 3:54 Graham, Simon
2007-11-07 14:03 ` Lars Ellenberg
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Graham, Simon @ 2007-11-07 3:54 UTC
To: drbd-dev
A few months ago, we had a discussion about how to handle systems with
on-disk caches enabled in the face of failures which can cause the cache
to be lost after disk writes are completed back to DRBD. At the time,
the suggestion was to rely on the Linux barrier implementation which is
used by the file systems to ensure correct behavior in the face of disk
caches.
I've now had time to get back to this and review the Linux barrier
implementation, and it's become clear to me that barriers alone are
insufficient. Imagine the case where a write is being done, it completes
on the secondary (but is still in the disk cache there), and then we
power off that node. NO error is reported to Linux on the primary:
because the other half of the RAID set is still there, the original IO
completes successfully, BUT the two sides now differ.
So a failure of the secondary is NOT reflected back to Linux, and
therefore we can get out of sync without tracking the blocks that need
to be resynced, regardless of whether barriers are used.
Consider the following sequence of writes:
[1] [2] [3] [barrier] [4] [5]
If we've processed [1] through [3] and the writes have completed on both
primary and secondary but the data is sitting in the disk cache and then
the secondary is powered off, the following occurs:
1. The primary doesn't return any error to Linux.
2. The primary goes ahead and processes the [barrier] (which flushes
   [1]-[3] to disk), then performs [4] and [5] and includes the blocks
   covered by these in the DRBD bitmap.
3. Now the secondary comes back -- we ONLY resync [4] and [5] even
   though [1]-[3] never made it to disk there (because we didn't
   execute the [barrier] on the secondary).
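The sequence above can be replayed in a small userspace model. This is
only a sketch under stated assumptions: every name below (mirror_model,
model_write and so on) is hypothetical and not taken from the DRBD
source, and the model assumes the secondary acknowledges writes straight
out of a volatile cache.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NBLOCKS 8

/* Toy model; every name here is hypothetical and not taken from DRBD. */
struct mirror_model {
    bool written[NBLOCKS];     /* block was written by the primary */
    bool peer_cache[NBLOCKS];  /* acked by the secondary, still in volatile cache */
    bool peer_stable[NBLOCKS]; /* reached stable storage on the secondary */
    bool out_of_sync[NBLOCKS]; /* primary's bitmap: will be resynced */
    bool connected;
};

void model_init(struct mirror_model *m)
{
    memset(m, 0, sizeof(*m));
    m->connected = true;
}

void model_write(struct mirror_model *m, size_t blk)
{
    m->written[blk] = true;
    if (m->connected)
        m->peer_cache[blk] = true;   /* completes on both sides, no error */
    else
        m->out_of_sync[blk] = true;  /* tracked in the bitmap for later resync */
}

void model_barrier(struct mirror_model *m)
{
    if (!m->connected)
        return;                      /* no peer left to flush */
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (m->peer_cache[i]) {
            m->peer_stable[i] = true;
            m->peer_cache[i] = false;
        }
    }
}

void model_peer_power_off(struct mirror_model *m)
{
    memset(m->peer_cache, 0, sizeof(m->peer_cache)); /* cache contents are lost */
    m->connected = false;
}

/* Blocks that differ on the secondary but are NOT marked for resync. */
size_t model_untracked(const struct mirror_model *m)
{
    size_t n = 0;
    for (size_t i = 0; i < NBLOCKS; i++)
        if (m->written[i] && !m->peer_stable[i] && !m->out_of_sync[i])
            n++;
    return n;
}

/* Replays [1] [2] [3] <power off> [barrier] [4] [5]. */
size_t replay_scenario(void)
{
    struct mirror_model m;
    model_init(&m);
    model_write(&m, 1);
    model_write(&m, 2);
    model_write(&m, 3);
    model_peer_power_off(&m);
    model_barrier(&m);
    model_write(&m, 4);
    model_write(&m, 5);
    return model_untracked(&m);
}
```

In this model replay_scenario() counts blocks [1]-[3] as lost but
untracked, while [4] and [5] are covered by the bitmap. If the barrier
reaches the secondary before the power-off, nothing is left untracked.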
I think the solution to this consists of a number of changes:
1. As suggested previously, DRBD should respect barriers on the
   secondary (by passing the appropriate flags to the secondary) -- this
   will handle unexpected failure of the primary.
2. Meta-data updates (certainly the AL, the activity log, but possibly
   all meta-data updates) should be issued as barrier requests (so that
   we know these are on disk before issuing the associated writes).
   I don't think they are currently.
3. DRBD should include the area addressed by the AL when recovering from
   an unexpected secondary failure. There are two approaches for this:
   a) Maintain the AL on both sides -- when the secondary restarts, add
      the AL to the set of blocks needing to be resynced, as is done on
      the primary today.
   b) Add the current AL to the bitmap on the primary when it loses
      contact with the secondary.
   The second is probably easier and is, I think, just as effective --
   even if the primary fails as well (so we lose the in-memory bitmap),
   when it comes back it WILL add the on-disk AL to the bitmap, and we
   won't resync until it comes back...
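A minimal sketch of approach b), assuming one bitmap bit per 4 KiB block
and 4 MiB per AL extent. Both sizes are assumptions made for the
example, and al_to_bitmap() is a hypothetical helper, not an existing
DRBD function.

```c
#include <stddef.h>
#include <stdint.h>

#define BM_BLOCK_SIZE   4096u                  /* assumed: 4 KiB per bitmap bit */
#define AL_EXTENT_SIZE  (4u * 1024u * 1024u)   /* assumed: 4 MiB per AL extent */
#define BITS_PER_EXTENT (AL_EXTENT_SIZE / BM_BLOCK_SIZE)

/*
 * Hypothetical helper: mark every block covered by the active AL extents
 * as out-of-sync. Returns the number of bits that were newly set.
 */
size_t al_to_bitmap(const uint32_t *al_extents, size_t n_extents,
                    uint8_t *bitmap, size_t bitmap_bits)
{
    size_t newly_set = 0;

    for (size_t i = 0; i < n_extents; i++) {
        size_t first = (size_t)al_extents[i] * BITS_PER_EXTENT;

        for (size_t bit = first;
             bit < first + BITS_PER_EXTENT && bit < bitmap_bits; bit++) {
            uint8_t mask = (uint8_t)(1u << (bit % 8));

            if (!(bitmap[bit / 8] & mask)) {
                bitmap[bit / 8] |= mask;
                newly_set++;
            }
        }
    }
    return newly_set;
}
```

With these sizes each active extent adds 1024 bits, so the cost of a
lost connection grows with the number of AL extents rather than with
the number of writes actually in flight.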
What do you think?
Simon
* Re: [Drbd-dev] Handling on-disk caches
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
@ 2007-11-07 14:03 ` Lars Ellenberg
2007-11-07 14:16 ` Graham, Simon
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Lars Ellenberg @ 2007-11-07 14:03 UTC
To: Graham, Simon; +Cc: drbd-dev
On Tue, Nov 06, 2007 at 10:54:02PM -0500, Graham, Simon wrote:
> A few months ago, we had a discussion about how to handle systems with
> on-disk caches enabled in the face of failures which can cause the cache
> to be lost after disk writes are completed back to DRBD. At the time,
> the suggestion was to rely on the Linux barrier implementation which is
> used by the file systems to ensure correct behavior in the face of disk
> caches.
>
> I've now had time to get back to this and review the Linux barrier
> implementation and it's become clear to me that the barrier
> implementation is insufficient -- imagine the case where a write is
> being done, it completes on the secondary (but is still in disk cache
> there), then we power off this node -- NO errors are reported to Linux
> on the primary (because the other half of the raid set is still there,
> the original IO completes successfully BUT we have a difference side to
> side...
>
> So a failure of the secondary is NOT reflected back to linux and
> therefore we can get out of sync in a way that does not track the blocks
> that need to be resynced independent of the use of barriers.
>
> Consider the following sequence of writes:
>
> [1] [2] [3] [barrier] [4] [5]
>
> If we've processed [1] through [3] and the writes have completed on both
> primary and secondary but the data is sitting in the disk cache and then
> the secondary is powered off, the following occurs:
>
> 1. The primary doesn't return any error to Linux
> 2. The primary goes ahead and processes the [barrier] (which flushes
> [1]-[3] to disk then
> performs [4] and [5] and includes the blocks covered by these in the
> DRBD bitmap.
> 3. Now the Secondary comes back -- we ONLY resync [4] and [5] even
> though [1]-[3] never made it
> to disk (because we didn't execute the [barrier] on the secondary)
>
> I think the solution to this consists of a number of changes:
I think there is no solution to this short of fixing the lower
layers/hardware to not tell lies.
> 1. As suggested previously, DRBD should respect barriers on the
> secondary (by passing the appropriate flags to the secondary) -- this
> will handle unexpected failure of the primary.
actually, we do not support "barriers" in the sense of tagged command
queuing or even only "BIO_RW_BARRIER" at all, yet.
we only support a "flush" like barrier, i.e. if kernel wants a barrier,
it needs to wait for all outstanding requests to be finished.
we do however provide our own "drbd barriers",
to ensure that write ordering on the secondary is respected.
yes, we trust (as does the Linux kernel as a whole) that once a
completion event happens for a bio, it is indeed on stable storage.
if the storage lies, there is not much we can do about that.
we probably should start to support BIO_RW_BARRIER.
but still, we have to trust the lower layers.
> 2. Meta-data updates (certainly the AL but possibly all meta-data
> updates) should be issued as barrier requests (so that we know these
> are on disk before issuing the associated writes) (I don't think they
> are currently)
I may be wrong, but even with barrier requests,
I doubt that a device with volatile write cache enabled would
handle such a "barrier" thing any differently.
the assumptions are:
* a disk accepts a write request,
  and "completes" it (reports it as on stable storage)
  when it is in the on-disk cache, even when that cache is volatile,
  not stable (battery backed).
* that same disk would somehow treat a "barrier" write request
  differently, and this time in fact get the things from its on-disk
  cache to stable storage.
I think the second assumption will not hold true.
but my hardware knowledge is lacking, so I may be wrong.
what makes you sure that this assumption is valid?
if I am right,
there is no point in trying to do "3.",
because it would not "solve" the issue,
but only make it less likely to see any bad things.
> 3. DRBD should include the area addressed by the AL when recovering from
> an unexpected
> secondary failure. There are two approaches for this:
> a) Maintain the AL on both sides - when the secondary restarts, add
> the AL to the
> set of blocks needing to be resynced as is done on the primary
> today
> b) Add the current AL to the bitmap on the primary when it loses
> contact with the
> secondary.
> The second is probably easier and is, I think, just as effective --
> even if the primary
> fails as well (so we lose the in memory bitmap), when it comes back it
> WILL add the on-disk
> AL to the bitmap and we wont resync until it comes back...
and even if I am wrong, and "barrier" writes would in fact induce a
write through, so that we could trust our bitmap and metadata...
any network hiccup would cause a resync of the area equivalent to the
activity log (several GB in most cases), whereas it would have caused
only very few blocks to be resynced now.
hm. but better sync some GB too much than overlook one KB, right?
--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
* RE: [Drbd-dev] Handling on-disk caches
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
2007-11-07 14:03 ` Lars Ellenberg
@ 2007-11-07 14:16 ` Graham, Simon
2007-11-12 12:39 ` Philipp Reisner
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Graham, Simon @ 2007-11-07 14:16 UTC
To: Lars Ellenberg; +Cc: drbd-dev
> > I think the solution to this consists of a number of changes:
>
> I think there is no solution to this short of fixing the lower
> layers/hardware to not tell lies.
>
Don't think of it as lying -- instead consider it as an enhanced disk
interface that provides the means to get improved performance out of the
disk; the barrier implementations provide the tools necessary to get
this performance benefit safely, we just need to make use of them (and
ensure that all out of sync blocks are resynced).
> > 2. Meta-data updates (certainly the AL but possibly all meta-data
> > updates) should be issued as barrier requests (so that we know these
> > are on disk before issuing the associated writes) (I don't think
> > they are currently)
>
> I may be wrong, but even with barrier requests,
> I doubt that a device with volatile write cache enabled would
> handle such a "barrier" thing any different.
>
No, they do -- the block layer code will issue flush requests, and
writes with the FUA bit set, for example, to ensure barrier requests are
actually on rotating rust before continuing. Certainly this doesn't work
_unless_ the underlying HBA/disks provide at least a flush operation
(and preferably both flush and FUA). There's a good description of this
in the Linux doc tree -- Documentation/block/barrier.txt in your
favourite source tree -- see section 2, "Forced flushing to physical
medium".
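As a rough paraphrase of that section (a simplified sketch, not the
block layer's actual code; the enum and function names are made up for
illustration), the steps issued for one barrier write depend on what
the device supports:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step names; they only label the order of operations. */
enum barrier_step {
    STEP_DRAIN,      /* wait for all earlier requests to complete */
    STEP_PREFLUSH,   /* flush the write cache before the barrier write */
    STEP_WRITE,      /* the barrier write itself */
    STEP_WRITE_FUA,  /* the barrier write with the FUA bit set */
    STEP_POSTFLUSH   /* flush again so the barrier write is on the medium */
};

/*
 * Fills out[] with the steps for one barrier write and returns how many
 * were written. Returns 0 when a volatile write cache is enabled but the
 * device cannot flush, i.e. the barrier cannot be honoured safely.
 */
size_t plan_barrier_write(bool write_cache, bool can_flush, bool can_fua,
                          enum barrier_step out[4])
{
    size_t n = 0;

    if (write_cache && !can_flush)
        return 0;

    out[n++] = STEP_DRAIN;
    if (!write_cache) {
        out[n++] = STEP_WRITE;       /* write-through: completion means stable */
        return n;
    }

    out[n++] = STEP_PREFLUSH;
    if (can_fua) {
        out[n++] = STEP_WRITE_FUA;   /* FUA replaces the post-flush */
    } else {
        out[n++] = STEP_WRITE;
        out[n++] = STEP_POSTFLUSH;
    }
    return n;
}
```

With flush but no FUA this gives drain, pre-flush, write, post-flush;
with FUA the post-flush is dropped because the barrier write itself
bypasses the cache.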
> any network hickup would cause a resync of the area equivalent to the
> activity log (several GB in most cases), were it would have caused only
> very few blocks to be resynced now.
>
Well, this is a good argument for implementing option a) -- keep the AL
on both primary and secondary -- that way, we only do the extra resync
if the secondary actually crashes, as opposed to simply losing the
network connection for a while. More complex, but better performing in
the face of network glitches.
> hm. but better sync some GB to much than overlook one KB, right.
Correct!
Thanks,
Simon
* Re: [Drbd-dev] Handling on-disk caches
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
2007-11-07 14:03 ` Lars Ellenberg
2007-11-07 14:16 ` Graham, Simon
@ 2007-11-12 12:39 ` Philipp Reisner
2007-11-12 13:41 ` [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent Montrose, Ernest
` (2 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Philipp Reisner @ 2007-11-12 12:39 UTC
To: drbd-dev
On Wednesday 07 November 2007 04:54:02 Graham, Simon wrote:
> A few months ago, we had a discussion about how to handle systems with
> on-disk caches enabled in the face of failures which can cause the cache
> to be lost after disk writes are completed back to DRBD. At the time,
> the suggestion was to rely on the Linux barrier implementation which is
> used by the file systems to ensure correct behavior in the face of disk
> caches.
>
> I've now had time to get back to this and review the Linux barrier
> implementation and it's become clear to me that the barrier
> implementation is insufficient -- imagine the case where a write is
> being done, it completes on the secondary (but is still in disk cache
> there), then we power off this node -- NO errors are reported to Linux
> on the primary (because the other half of the raid set is still there,
> the original IO completes successfully BUT we have a difference side to
> side...
>
> So a failure of the secondary is NOT reflected back to linux and
> therefore we can get out of sync in a way that does not track the blocks
> that need to be resynced independent of the use of barriers.
>
> Consider the following sequence of writes:
>
> [1] [2] [3] [barrier] [4] [5]
>
> If we've processed [1] through [3] and the writes have completed on both
> primary and secondary but the data is sitting in the disk cache and then
> the secondary is powered off, the following occurs:
>
> 1. The primary doesn't return any error to Linux
> 2. The primary goes ahead and processes the [barrier] (which flushes
> [1]-[3] to disk then
> performs [4] and [5] and includes the blocks covered by these in the
> DRBD bitmap.
> 3. Now the Secondary comes back -- we ONLY resync [4] and [5] even
> though [1]-[3] never made it
> to disk (because we didn't execute the [barrier] on the secondary)
>
Right. So far I completely agree.
> I think the solution to this consists of a number of changes:
>
> 1. As suggested previously, DRBD should respect barriers on the
> secondary (by passing the appropriate
> flags to the secondary) -- this will handle unexpected failure of the
> primary.
Right. We should do that.
I think that we already do that for the BIO_RW_BARRIER and the
BIO_RW_SYNC flags.
> 2. Meta-data updates (certainly the AL but possibly all meta-data
> updates) should be
> issued as barrier requests (so that we know these are on disk before
> issuing the
> associated writes) (I don't think they are currently)
Right. We currently do not use BIO_RW_BARRIER here, but we should do so.
> 3. DRBD should include the area addressed by the AL when recovering from
> an unexpected
> secondary failure. There are two approaches for this:
> a) Maintain the AL on both sides - when the secondary restarts, add
> the AL to the
> set of blocks needing to be resynced as is done on the primary
> today
> b) Add the current AL to the bitmap on the primary when it loses
> contact with the
> secondary.
> The second is probably easier and is, I think, just as effective --
> even if the primary
> fails as well (so we lose the in memory bitmap), when it comes back it
> WILL add the on-disk
> AL to the bitmap and we wont resync until it comes back...
For item 3 I have another opinion.
On the primary we have a data structure called the "transfer log", or tl
in the code. Up to now this was mainly important for protocols A and B.
It is a data structure containing objects for all our self-generated
drbd-barriers in flight, and objects for all write requests between
these barriers.
If we lose the connection in protocol A or B, we need to mark everything
we find in the transfer_log as out-of-sync in the bitmap.
When we also do this for protocol C, _AND_ use BIO_RW_BARRIER for
the writes on the secondary, we have solved the issue you
described in the first part of the mail.
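To make the idea concrete, here is a minimal userspace sketch of such a
transfer log. The struct layout and the tl_* function names are
assumptions for illustration and do not mirror the real DRBD code;
requests are dropped only once the barrier closing their epoch has been
acknowledged, and whatever is left on connection loss goes into the
bitmap.

```c
#include <stddef.h>
#include <stdint.h>

#define TL_MAX_REQS 64

/* Hypothetical layout; the real transfer log in DRBD looks different. */
struct tl_req {
    uint32_t bm_bit;   /* bitmap bit covering the written block */
    uint32_t epoch;    /* epoch (interval between barriers) of the write */
};

struct transfer_log {
    struct tl_req reqs[TL_MAX_REQS];
    size_t n_reqs;
    uint32_t current_epoch;
};

void tl_init(struct transfer_log *tl)
{
    tl->n_reqs = 0;
    tl->current_epoch = 0;
}

/* Record a write in the current epoch; returns -1 if the log is full. */
int tl_add_write(struct transfer_log *tl, uint32_t bm_bit)
{
    if (tl->n_reqs == TL_MAX_REQS)
        return -1;
    tl->reqs[tl->n_reqs].bm_bit = bm_bit;
    tl->reqs[tl->n_reqs].epoch = tl->current_epoch;
    tl->n_reqs++;
    return 0;
}

/* Close the current epoch; returns the number of the epoch just closed. */
uint32_t tl_barrier(struct transfer_log *tl)
{
    return tl->current_epoch++;
}

/* Peer confirmed the barrier: forget all requests up to that epoch. */
void tl_barrier_ack(struct transfer_log *tl, uint32_t acked_epoch)
{
    size_t kept = 0;

    for (size_t i = 0; i < tl->n_reqs; i++)
        if (tl->reqs[i].epoch > acked_epoch)
            tl->reqs[kept++] = tl->reqs[i];
    tl->n_reqs = kept;
}

/*
 * Connection lost: mark every request still in the log as out-of-sync.
 * The caller must provide a bitmap large enough for all recorded bits.
 * Returns the number of requests moved into the bitmap.
 */
size_t tl_mark_out_of_sync(struct transfer_log *tl, uint8_t *bitmap)
{
    size_t n = tl->n_reqs;

    for (size_t i = 0; i < n; i++)
        bitmap[tl->reqs[i].bm_bit / 8] |=
            (uint8_t)(1u << (tl->reqs[i].bm_bit % 8));
    tl->n_reqs = 0;
    return n;
}
```

Applied to Simon's sequence [1] [2] [3] [barrier] [4] [5]: if the
secondary dies before the barrier is acknowledged, all five blocks are
marked; if the ack arrived first, only [4] and [5] are.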
I took this as an occasion to write down what we are currently up
to in the development of DRBD-8.2.
1 Online Verify.
Release the online-verify code from drbd-plus to drbd-8.2, creating a new
protocol version by the way.
2 Hot cache.
Finish the hot cache feature. With this feature enabled DRBD updates the
correct block caches (page cache) on the secondary node, as data gets
written and read on the primary.
Rationale: On database machines with huge amounts of RAM, the database
can only deliver reasonable performance if Linux's disk caches are hot.
With a conventional DRBD cluster for such a database, the performance
of the database is insufficient after a switchover, since the caches
on the secondary machine are cold.
3 Write quorum of 2.
There are users that want to use DRBD to mirror data but do not want
it to continue in case the connection to the secondary is lost. Such
a system is not an HA system but an always-redundant system. It should
freeze IO in case the connection to the secondary is lost or the
local disk gets detached, and thaw IO as soon as both paths
are available again.
4 Configurable write quorum weights.
For OCFS2/GFS users it even makes sense to have configurable weights
for the write quorum, so that one can set up a cluster in which node
A continues to run but node B freezes its IO when the brain splits.
I should mention that number 1 and 2 are already in the works and
will soon appear in DRBD-8.2.
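Items 3 and 4 can be read as one rule: IO continues only while the
weights of the reachable replicas add up to the quorum. The sketch
below is purely illustrative; write_quorum_met() and its parameters are
hypothetical and say nothing about how this would be configured in DRBD.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: IO may proceed only if the summed weights of all
 * currently reachable replicas (local disk, peer) reach the quorum.
 */
bool write_quorum_met(const unsigned weights[], const bool reachable[],
                      size_t n, unsigned quorum)
{
    unsigned sum = 0;

    for (size_t i = 0; i < n; i++)
        if (reachable[i])
            sum += weights[i];
    return sum >= quorum;
}
```

With weights {1, 1} and a quorum of 2, losing either the local disk or
the peer freezes IO (item 3). With weights {2, 1} and a quorum of 2,
the node holding the weight-2 replica keeps running after a split while
the other one freezes (item 4).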
Now I added to the list:
5 Use the kernel's write barriers
As support for write barriers is now available (this holds
true for the hardware as well as for the Linux kernel), we should make
use of it.
* Use BIO_RW_BARRIER writes for updates to our meta-data superblock.
* Use BIO_RW_BARRIER for writes to the AL.
* Implement the algorithm described in section 6 of
http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf .
* Delay setting of RQ_NET_DONE in the request objects until the right
BarrierAck comes in, also for protocol C.
Does this make sense?
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
* [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
` (2 preceding siblings ...)
2007-11-12 12:39 ` Philipp Reisner
@ 2007-11-12 13:41 ` Montrose, Ernest
2007-11-15 16:27 ` Philipp Reisner
2007-11-12 15:59 ` [Drbd-dev] Handling on-disk caches Graham, Simon
[not found] ` <BD7042533C2F8943A6A4257A9E31C454F47A31@EXNA.corp.stratus.com>
5 siblings, 1 reply; 14+ messages in thread
From: Montrose, Ernest @ 2007-11-12 13:41 UTC
To: Philipp Reisner, drbd-dev
[-- Attachment #1: Type: text/plain, Size: 9755 bytes --]
Hi,
We have been struggling with a problem where one side gets stuck in
WFBitMapS and Inconsistent state. Consider two nodes (node0 and node1).
* Device r5 on node0 starts syncing as the SyncTarget.
* Device r5 is done syncing, and on node0 we call drbd_resync_finished();
this gets delayed for a bit in drbd_rs_del_all().
* During this delay, device r0 wants to resync, so lower-priority
devices like r5 get paused. This is where the trouble starts.
* drbd_resync_finished() is not done on node0.
* node0 goes "Connected" and sends its state to node1, and that triggers
a drbd_sync_handshake() which results in Connected -> WFBitMapS and
UpToDate -> Inconsistent. We are stuck there.
So the problem, I think, is that while the SyncTarget is in
drbd_resync_finished() for a bit, a state change can occur, causing the
peer that already finished to go into WFBitMapS and beyond.
We had a couple of ideas on how this may be fixed:
1)
We could find a way to NOT actually pause the sync _if_ the sync is in
fact already completed -- perhaps modifying the pause-resync logic so it
doesn't pause if there's nothing left to be resynced (or if we're
already in the drbd_resync_finished() processing, though I don't know
that there's a way to tell this).
I attempted to implement this a couple of ways but failed.
2)
* Don't call drbd_sync_handshake() if the peer state is > Connected.
AND...
* Send the SyncTarget's state to the peer when the stalled sync is done.
This is suspiciously problematic but seems to actually close the racy
window. I include the implementation as a patch here.
Any other ideas, please?
Below is a set of logs when the problem occurs:
On node0:
Oct 4 14:55:58 node0 kernel: drbd60: Began resync as PausedSyncT (will sync 768 KB [192 bits set]).
Oct 4 14:55:58 node0 kernel: drbd60: Writing meta data super block now.
Oct 4 14:55:59 node0 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:56:00 node0 kernel: drbd60: conn( PausedSyncT -> SyncTarget ) peer_isp( 1 -> 0 )
Oct 4 14:56:00 node0 kernel: drbd60: Syncer continues.
Oct 4 14:56:00 node0 kernel: drbd60: conn( SyncTarget -> PausedSyncT ) aftr_isp( 0 -> 1 )
Oct 4 14:56:00 node0 kernel: drbd60: Resync suspended
Oct 4 14:56:01 node0 kernel: drbd60: conn( PausedSyncT -> SyncTarget ) aftr_isp( 1 -> 0 )
Oct 4 14:56:01 node0 kernel: drbd60: Syncer continues.
Oct 4 14:56:01 node0 kernel: drbd60: ASSERT( !test_bit(STOP_SYNC_TIMER,&mdev->flags) ) in /sandbox/sgraham/sn/trunk/platform/drbd/src/drbd/drbd_main.c:786
Oct 4 14:56:01 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:01 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:01 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:03 node0 kernel: drbd60: conn( SyncTarget -> PausedSyncT ) aftr_isp( 0 -> 1 )
Oct 4 14:56:03 node0 kernel: drbd60: Resync suspended
Oct 4 14:56:03 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:05 node0 kernel: drbd60: conn( PausedSyncT -> SyncTarget ) aftr_isp( 1 -> 0 )
Oct 4 14:56:05 node0 kernel: drbd60: Syncer continues.
Oct 4 14:56:05 node0 kernel: drbd60: ASSERT( !test_bit(STOP_SYNC_TIMER,&mdev->flags) ) in /sandbox/sgraham/sn/trunk/platform/drbd/src/drbd/drbd_main.c:786
Oct 4 14:56:05 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:06 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:07 node0 kernel: drbd60: conn( SyncTarget -> PausedSyncT ) aftr_isp( 0 -> 1 )
Oct 4 14:56:07 node0 kernel: drbd60: Resync suspended
Oct 4 14:56:07 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:07 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node0 kernel: drbd60: conn( PausedSyncT -> SyncTarget ) aftr_isp( 1 -> 0 )
Oct 4 14:56:09 node0 kernel: drbd60: Syncer continues.
Oct 4 14:56:09 node0 kernel: drbd60: ASSERT( !test_bit(STOP_SYNC_TIMER,&mdev->flags) ) in /sandbox/sgraham/sn/trunk/platform/drbd/src/drbd/drbd_main.c:786
Oct 4 14:56:09 node0 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0 sec; 384 K/sec)
Oct 4 14:56:09 node0 kernel: drbd60: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Oct 4 14:56:09 node0 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0 sec; 0 K/sec)
Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0 sec; 0 K/sec)
Oct 4 14:56:09 node0 kernel: drbd60: Connected in w_make_resync_request
Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0 sec; 0 K/sec)
Oct 4 14:56:10 node0 kernel: drbd60: unexpected cstate (Connected) in receive_bitmap
Oct 4 14:56:10 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:56:10 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:32 node0 kernel: drbd60: aftr_isp( 0 -> 1 )
Oct 4 14:57:33 node0 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:57:35 node0 kernel: drbd60: aftr_isp( 0 -> 1 )
Oct 4 14:57:36 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:36 node0 kernel: drbd60: peer_isp( 0 -> 1 )
Oct 4 14:57:36 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:36 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:36 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:37 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:37 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:39 node0 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:57:39 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
Oct 4 14:57:39 node0 kernel: drbd60: peer_isp( 1 -> 0 )
Oct 4 14:57:39 node0 kernel: drbd60: No resync, but 1048535 bits in bitmap!
On node1:
Oct 4 14:55:58 node1 kernel: drbd60: Began resync as PausedSyncS (will sync 768 KB [192 bits set]).
Oct 4 14:55:58 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:00 node1 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:56:00 node1 kernel: drbd60: conn( PausedSyncS -> SyncSource ) peer_isp( 1 -> 0 )
Oct 4 14:56:00 node1 kernel: drbd60: Syncer continues.
Oct 4 14:56:01 node1 kernel: drbd60: conn( SyncSource -> PausedSyncS ) aftr_isp( 0 -> 1 )
Oct 4 14:56:01 node1 kernel: drbd60: Resync suspended
Oct 4 14:56:01 node1 kernel: drbd60: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 )
Oct 4 14:56:01 node1 kernel: drbd60: Syncer continues.
Oct 4 14:56:01 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:04 node1 kernel: drbd60: conn( SyncSource -> PausedSyncS ) aftr_isp( 0 -> 1 )
Oct 4 14:56:04 node1 kernel: drbd60: Resync suspended
Oct 4 14:56:04 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:05 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:06 node1 kernel: drbd60: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 )
Oct 4 14:56:06 node1 kernel: drbd60: Syncer continues.
Oct 4 14:56:06 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:08 node1 kernel: drbd60: conn( SyncSource -> PausedSyncS ) aftr_isp( 0 -> 1 )
Oct 4 14:56:08 node1 kernel: drbd60: Resync suspended
Oct 4 14:56:08 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:08 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node1 kernel: drbd60: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 )
Oct 4 14:56:09 node1 kernel: drbd60: Syncer continues.
Oct 4 14:56:09 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node1 kernel: drbd60: Retrying drbd_rs_del_all() later. refcnt=1
Oct 4 14:56:09 node1 kernel: drbd60: Resync done (total 1 sec; paused 0 sec; 768 K/sec)
Oct 4 14:56:09 node1 kernel: drbd60: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Oct 4 14:56:09 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:09 node1 kernel: drbd60: Resync done (total 2 sec; paused 0 sec; 0 K/sec)
Oct 4 14:56:10 node1 kernel: drbd60: Becoming sync source due to disk states.
Oct 4 14:56:10 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:10 node1 kernel: drbd60: writing of bitmap took 2 jiffies
Oct 4 14:56:10 node1 kernel: drbd60: 4095 MB marked out-of-sync by on disk bit-map.
Oct 4 14:56:10 node1 kernel: drbd60: 4194140 KB now marked out-of-sync by on disk bit-map.
Oct 4 14:56:10 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:10 node1 kernel: drbd60: conn( Connected -> WFBitMapS ) pdsk( UpToDate -> Inconsistent )
Oct 4 14:56:10 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:56:10 node1 kernel: drbd60: pdsk( Inconsistent -> UpToDate )
Oct 4 14:56:10 node1 kernel: drbd60: Writing meta data super block now.
Oct 4 14:57:34 node1 kernel: drbd60: aftr_isp( 0 -> 1 )
Oct 4 14:57:35 node1 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:57:35 node1 kernel: drbd60: aftr_isp( 0 -> 1 )
Oct 4 14:57:35 node1 kernel: drbd60: peer_isp( 0 -> 1 )
Oct 4 14:57:35 node1 kernel: drbd60: peer_isp( 1 -> 0 )
Oct 4 14:57:35 node1 kernel: drbd60: peer_isp( 0 -> 1 )
Oct 4 14:57:40 node1 kernel: drbd60: aftr_isp( 1 -> 0 )
Oct 4 14:57:40 node1 kernel: drbd60: peer_isp( 1 -> 0 )
[-- Attachment #2: Wbimaps_stuck.patch --]
[-- Type: application/octet-stream, Size: 1105 bytes --]
Index: drbd_receiver.c
===================================================================
--- drbd_receiver.c (revision 20778)
+++ drbd_receiver.c (working copy)
@@ -2410,6 +2410,7 @@
if (nconn == WFReportParams ) nconn = Connected;
if (mdev->p_uuid && oconn <= Connected &&
+ peer_state.conn <= Connected &&
peer_state.disk >= Negotiating &&
inc_local_if_state(mdev,Negotiating) ) {
nconn=drbd_sync_handshake(mdev,peer_state.role,peer_state.disk);
Index: drbd_main.c
===================================================================
--- drbd_main.c (revision 20778)
+++ drbd_main.c (working copy)
@@ -961,6 +961,12 @@
drbd_send_state(mdev);
}
+ /* we just finished syncing as the target , tell peer in case we were delayed*/
+ if ((os.conn == SyncTarget || os.conn == PausedSyncT) &&
+ (ns.conn == Connected) ) {
+ drbd_send_state(mdev);
+ }
+
/* In case one of the isp bits got set, suspend other devices. */
if ( ( !os.aftr_isp && !os.peer_isp && !os.user_isp) &&
( ns.aftr_isp || ns.peer_isp || ns.user_isp) ) {
* RE: [Drbd-dev] Handling on-disk caches
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
` (3 preceding siblings ...)
2007-11-12 13:41 ` [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent Montrose, Ernest
@ 2007-11-12 15:59 ` Graham, Simon
2007-11-12 16:24 ` Philipp Reisner
[not found] ` <BD7042533C2F8943A6A4257A9E31C454F47A31@EXNA.corp.stratus.com>
5 siblings, 1 reply; 14+ messages in thread
From: Graham, Simon @ 2007-11-12 15:59 UTC
To: Philipp Reisner, drbd-dev
>
> For item 3 I have an other opinion.
>
> On the primary we have a data structure called the "transfer log" or tl
> in the code. Up to now this was mainly important for protocol A and B.
>
> It is a data structure conaining objects for all our self-generated
> drbd-barriers on the fly, and objects for all write requests between
> these barriers.
>
> If we loose connection in protocol A or B we need to mark everything
> we find in the transfer_log as out-of-sync in the bitmap.
>
> When we also do this for protocol C, _AND_ use BIO_RW_BARRIER for
> doing writing on the secondary we have solved the issue you
> described in the first part of the mail.
>
I like it! MUCH better than marking the GB's of data covered by the
AL...
I presume you will work on this as part of 8.2 rather than as a fix to
8.0? Since I am currently locked on 8.0, I will probably look at
implementing your suggestions as updates to 8.0 and submit them for
consideration in 8.2.
Simon
* Re: [Drbd-dev] Handling on-disk caches
2007-11-12 15:59 ` [Drbd-dev] Handling on-disk caches Graham, Simon
@ 2007-11-12 16:24 ` Philipp Reisner
0 siblings, 0 replies; 14+ messages in thread
From: Philipp Reisner @ 2007-11-12 16:24 UTC
To: Graham, Simon; +Cc: drbd-dev
On Monday 12 November 2007 16:59:24 Graham, Simon wrote:
> > For item 3 I have an other opinion.
> >
> > On the primary we have a data structure called the "transfer log" or tl
> > in the code. Up to now this was mainly important for protocol A and B.
> >
> > It is a data structure conaining objects for all our self-generated
> > drbd-barriers on the fly, and objects for all write requests between
> > these barriers.
> >
> > If we loose connection in protocol A or B we need to mark everything
> > we find in the transfer_log as out-of-sync in the bitmap.
> >
> > When we also do this for protocol C, _AND_ use BIO_RW_BARRIER for
> > doing writing on the secondary we have solved the issue you
> > described in the first part of the mail.
>
> I like it! MUCH better than marking the GB's of data covered by the
> AL...
>
> I presume you will work on this as part of 8.2 rather than as a fix to
> 8.0? Since I am currently locked on 8.0, I will probably look at
> implementing your suggestions as updates to 8.0 and submit them for
> consideration in 8.2.
Right, I thought about this, and forgot to mention it in the mail.
From a technical point of view it is possible to do the BIO_RW_BARRIER
stuff in DRBD-8.0 since it can be done without changing the protocol.
From my point of view as release manager, I get a bellyache when
considering such a big change for DRBD-8.0.
Please start the work based on the 8.0 tree; I will postpone the
decision on whether it should go into 8.2 only or into 8.0 (and be
propagated to 8.2 as well, of course) until we see the intrusiveness of
the patch.
Moving patches between 8.0 and 8.2 is still quite possible...
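For readers following along, the transfer-log idea quoted above can be
sketched as a toy model. This is only an illustration under stated
assumptions, not DRBD code: every name here (toy_dev, toy_tl_add_write,
toy_tl_barrier_acked, toy_tl_connection_lost) is hypothetical, the bitmap is
a single 64-bit word, and the rule that an acknowledged barrier lets the
primary forget the writes before it is an assumption about how the log is
pruned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: names and sizes are illustrative, not DRBD's. */
#define TOY_BM_BITS 64u   /* one bitmap bit per block, 64 blocks in total */
#define TOY_TL_MAX  16u   /* capacity of the toy transfer log */

struct toy_tl_entry {
    int      is_barrier;  /* 1 = barrier marker, 0 = write request */
    unsigned first_bit;   /* first bitmap bit covered by the write */
    unsigned nr_bits;     /* number of bitmap bits covered */
};

struct toy_dev {
    struct toy_tl_entry tl[TOY_TL_MAX];
    size_t   tl_len;
    uint64_t out_of_sync; /* bit i set => block i must be resynced */
};

/* Record a write that was sent to the peer. Returns 0 on success. */
int toy_tl_add_write(struct toy_dev *d, unsigned first_bit, unsigned nr_bits)
{
    if (d->tl_len == TOY_TL_MAX || nr_bits == 0 ||
        first_bit >= TOY_BM_BITS || nr_bits > TOY_BM_BITS - first_bit)
        return -1;
    d->tl[d->tl_len++] = (struct toy_tl_entry){ 0, first_bit, nr_bits };
    return 0;
}

/* Record a barrier that was sent to the peer. Returns 0 on success. */
int toy_tl_add_barrier(struct toy_dev *d)
{
    if (d->tl_len == TOY_TL_MAX)
        return -1;
    d->tl[d->tl_len++] = (struct toy_tl_entry){ 1, 0, 0 };
    return 0;
}

/* Assumption: once the peer acknowledges the oldest barrier, the writes
 * before it are on stable storage there and can be dropped from the log. */
void toy_tl_barrier_acked(struct toy_dev *d)
{
    size_t i;

    for (i = 0; i < d->tl_len; i++)
        if (d->tl[i].is_barrier)
            break;
    if (i == d->tl_len)
        return;           /* no barrier in flight, nothing to drop */
    i++;                  /* drop the barrier entry itself as well */
    memmove(d->tl, d->tl + i, (d->tl_len - i) * sizeof d->tl[0]);
    d->tl_len -= i;
}

/* Connection lost: every write still in the log is marked out of sync. */
void toy_tl_connection_lost(struct toy_dev *d)
{
    for (size_t i = 0; i < d->tl_len; i++) {
        const struct toy_tl_entry *e = &d->tl[i];

        if (e->is_barrier)
            continue;
        for (unsigned b = 0; b < e->nr_bits; b++)
            d->out_of_sync |= (uint64_t)1 << (e->first_bit + b);
    }
    d->tl_len = 0;
}
```

In this model, writes [1]-[3] that were never covered by an acknowledged
barrier end up in out_of_sync when the connection drops, which is the
property the original problem description was missing.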
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
* Re: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-12 13:41 ` [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent Montrose, Ernest
@ 2007-11-15 16:27 ` Philipp Reisner
2007-11-16 2:36 ` Ernest Montrose
0 siblings, 1 reply; 14+ messages in thread
From: Philipp Reisner @ 2007-11-15 16:27 UTC (permalink / raw)
To: drbd-dev; +Cc: Montrose, Ernest
[-- Attachment #1: Type: text/plain, Size: 2310 bytes --]
On Monday 12 November 2007 14:41:10 Montrose, Ernest wrote:
> Hi,
> We have been struggling with a problem where one side gets stuck in
> WFBitMapS and Inconsistent State. Consider two nodes (Node0 and node1).
>
>
> * Device r5 on node0 starts syncing as the synctarget.
> * Device r5 is done syncing and on node0 we call drbd_resync_finished()
> this gets delayed for a bit in drbd_rs_del_all()
> * During this delay, device R0 wants to resync. So the lower priority
> devices like R5 gets paused. This is were the trouble starts.
Right. But Something else happens...
[...]
> Oct 4 14:56:01 node0 kernel: drbd60: Syncer continues.
> Oct 4 14:56:01 node0 kernel: drbd60: ASSERT(
> !test_bit(STOP_SYNC_TIMER,&mdev->flags) ) in
> /sandbox/sgraham/sn/trunk/platform/drbd/src/drbd/drbd_main.c:786
That assert caught my attention, and this is my understanding of what
went wrong...
r5 was already finished with its resync timer and calling
w_make_resync_request(), but due to the continue event after the
pause the timer got restarted...
Unfortunately, drbd_bm_find_next() searched through the whole
bitmap and found those bits near the end that were not yet
cleared, and so resync requests were resent...
Therefore...
[...]
> Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0
> sec; 384 K/sec)
[...]
> Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0
> sec; 0 K/sec)
> Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0
> sec; 0 K/sec)
> Oct 4 14:56:09 node0 kernel: drbd60: Connected in w_make_resync_request
> Oct 4 14:56:09 node0 kernel: drbd60: Resync done (total 2 sec; paused 0
> sec; 0 K/sec)
... we got multiple calls to drbd_resync_finished().
Here is my suggestion to fix that.
1) Do not restart the timer after a syncpause, when the timer is no
longer needed.
2) To make the whole thing more robust against such bugs,
drbd_bm_find_next() should not reset the find_offset back to 0
after it hit the end of the bitmap once.
I have not tested it.... but I think this should do...
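To illustrate point 2, here is a minimal toy model of a find cursor over a
16-bit bitmap. It is a sketch only: toy_bitmap, toy_find_next and the
reset_at_end switch are made-up names for illustration, and the model
ignores locking and the real bitmap layout. It shows why resetting the
find offset at the end of the bitmap makes the next search hand out bits
that were already requested but not yet cleared.

```c
#include <stdint.h>

#define TOY_BITS 16u
#define TOY_NONE (~0UL)   /* plays the role of the -1UL "nothing found" value */

struct toy_bitmap {
    uint16_t      bits;   /* bit i set => block i is still out of sync */
    unsigned long fo;     /* find offset: where the next search starts */
};

/* Return the next set bit at or after the find offset, or TOY_NONE.
 * reset_at_end != 0 models the old behaviour (offset goes back to 0);
 * reset_at_end == 0 models the suggested behaviour (offset is left alone). */
unsigned long toy_find_next(struct toy_bitmap *b, int reset_at_end)
{
    unsigned long i;

    for (i = b->fo; i < TOY_BITS; i++)
        if ((b->bits >> i) & 1u)
            break;

    if (i >= TOY_BITS) {
        if (reset_at_end)
            b->fo = 0;
        /* else: leave b->fo unchanged */
        return TOY_NONE;
    }
    b->fo = i + 1;
    return i;
}
```

With bits 2 and 14 set and never cleared (the answers are still
outstanding), the old behaviour returns 2, 14, "none" and then 2 again,
while the suggested behaviour keeps returning "none" once the end has been
reached.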
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: Wbimaps_stuck_phil.patch --]
[-- Type: text/x-diff, Size: 1136 bytes --]
diff --git a/drbd/drbd_bitmap.c b/drbd/drbd_bitmap.c
index 015421a..7e118a6 100644
--- a/drbd/drbd_bitmap.c
+++ b/drbd/drbd_bitmap.c
@@ -954,7 +954,7 @@ unsigned long drbd_bm_find_next(drbd_dev *mdev)
}
if (i >= b->bm_bits) {
i = -1UL;
- b->bm_fo = 0;
+ /* leave b->bm_fo unchanged. */
} else {
b->bm_fo = i+1;
}
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index fe8f66d..e25bb3a 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -786,9 +786,13 @@ int _drbd_set_state(drbd_dev* mdev, drbd_state_t ns,enum chg_state_flags flags)
INFO("Syncer continues.\n");
mdev->rs_paused += (long)jiffies-(long)mdev->rs_mark_time;
if( ns.conn == SyncTarget ) {
- D_ASSERT(!test_bit(STOP_SYNC_TIMER,&mdev->flags));
- clear_bit(STOP_SYNC_TIMER,&mdev->flags);
- mod_timer(&mdev->resync_timer,jiffies);
+ if (!test_bit(STOP_SYNC_TIMER,&mdev->flags)) {
+ mod_timer(&mdev->resync_timer,jiffies);
+ }
+ /* This if (!test_bit is only needed for the case
+ that a device that has ceased to used its timer,
+ i.e. it is already in drbd_resync_finished() gets
+ paused and resumed. */
}
}
* RE: [Drbd-dev] DRBD8: incorrect state transition Connected->WFBitMapS and UpToDate->Inconsistent
[not found] ` <BD7042533C2F8943A6A4257A9E31C454F47A31@EXNA.corp.stratus.com>
@ 2007-11-15 16:34 ` Montrose, Ernest
0 siblings, 0 replies; 14+ messages in thread
From: Montrose, Ernest @ 2007-11-15 16:34 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
I will test and advise later.
Thanks.
EM--
* Re: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-15 16:27 ` Philipp Reisner
@ 2007-11-16 2:36 ` Ernest Montrose
2007-11-26 14:31 ` Philipp Reisner
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Ernest Montrose @ 2007-11-16 2:36 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev; +Cc: Montrose, Ernest
[-- Attachment #1: Type: text/plain, Size: 4914 bytes --]
Phil,
I tested the patch and unfortunately it does not fix the race condition, though I believe it fixes
the ASSERT issues.
Essentially, when the resync is done and we are in drbd_resync_finished(), if we pause the device
then we send the state to the peer. The peer is done syncing at that point. It does a
sync_handshake() that moves its state to WFBitMapS and pdsk=Inconsistent (since the target
has not yet changed its state to Connected and UpToDate). When resync_finished is done we go
UpToDate and Connected and we're stuck.
I tested yet another idea which seems to close the racy window. I split drbd_resync_finished
into two parts: a cleanup part that the worker can schedule to do the clean up, and a done part
that changes the state right away when the resync is done. I include an untested patch to
illustrate that idea.
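The ordering idea can be shown with a very small toy model. This is only a
sketch under assumptions: the toy_ names, the three-state enum and the
lru_entries counter are invented for illustration, and the model assumes a
pause request only acts on a device that still reports itself as
SyncTarget. It does not claim to capture every path in the real driver.

```c
/* Toy model of the "done first, cleanup later" split. */
enum toy_conn { TOY_SYNC_TARGET, TOY_PAUSED_SYNC_T, TOY_CONNECTED };

struct toy_dev {
    enum toy_conn conn;
    int lru_entries;      /* stands in for the resync LRU to be emptied */
};

/* "done" part: change the connection state right away. */
void toy_resync_done(struct toy_dev *d)
{
    d->conn = TOY_CONNECTED;
}

/* "cleanup" part: the slow work that may be delayed or rescheduled. */
void toy_resync_cleanup(struct toy_dev *d)
{
    d->lru_entries = 0;
}

/* A pause request from sync group serialisation.
 * Returns 1 if the device was paused, 0 if it was no longer syncing. */
int toy_pause(struct toy_dev *d)
{
    if (d->conn != TOY_SYNC_TARGET)
        return 0;
    d->conn = TOY_PAUSED_SYNC_T;
    return 1;
}
```

If the pause arrives while the device is still SyncTarget (cleanup running
before the state change), toy_pause() takes effect; if toy_resync_done()
has already run, the pause is a no-op and the cleanup can finish later.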
Thanks.
EM--
[-- Attachment #2: 552174691-my_3230.patch --]
[-- Type: text/x-patch; name="my_3230.patch", Size: 2254 bytes --]
Index: drbd/drbd_actlog.c
===================================================================
--- drbd/drbd_actlog.c (revision 20723)
+++ drbd/drbd_actlog.c (working copy)
@@ -800,7 +800,8 @@
( mdev->state.conn == SyncSource || mdev->state.conn == SyncTarget ||
mdev->state.conn == PausedSyncS || mdev->state.conn == PausedSyncT ) ) {
drbd_bm_lock(mdev);
- drbd_resync_finished(mdev);
+ drbd_resync_done(mdev);
+ drbd_resync_cleanup(mdev);
drbd_bm_unlock(mdev);
}
drbd_bcast_sync_progress(mdev);
Index: drbd/drbd_worker.c
===================================================================
--- drbd/drbd_worker.c (revision 20723)
+++ drbd/drbd_worker.c (working copy)
@@ -450,16 +450,14 @@
kfree(w);
drbd_bm_lock(mdev);
- drbd_resync_finished(mdev);
+ drbd_resync_cleanup(mdev);
drbd_bm_unlock(mdev);
return 1;
}
-int drbd_resync_finished(drbd_dev* mdev)
+int drbd_resync_cleanup(drbd_dev* mdev)
{
- unsigned long db,dt,dbdt;
- int dstate, pdstate;
struct drbd_work *w;
// Remove all elements from the resync LRU. Since future actions
@@ -483,6 +481,14 @@
ERR("Warn failed to drbd_rs_del_all() and to kmalloc(w).\n");
}
+ return 1;
+}
+
+int drbd_resync_done(drbd_dev* mdev)
+{
+ unsigned long db,dt,dbdt;
+ int dstate, pdstate;
+
dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;
if (dt <= 0) dt=1;
db = mdev->rs_total;
@@ -933,7 +939,8 @@
(unsigned long) mdev->rs_total);
if ( mdev->rs_total == 0 ) {
- drbd_resync_finished(mdev);
+ drbd_resync_done(mdev);
+ drbd_resync_cleanup(mdev);
return;
}
Index: drbd/drbd_int.h
===================================================================
--- drbd/drbd_int.h (revision 20723)
+++ drbd/drbd_int.h (working copy)
@@ -1302,7 +1302,8 @@
extern void drbd_start_resync(drbd_dev *mdev, drbd_conns_t side);
extern void resume_next_sg(drbd_dev* mdev);
extern void suspend_other_sg(drbd_dev* mdev);
-extern int drbd_resync_finished(drbd_dev *mdev);
+extern int drbd_resync_done(drbd_dev *mdev);
+extern int drbd_resync_cleanup(drbd_dev *mdev);
// maybe rather drbd_main.c ?
extern int drbd_md_sync_page_io(drbd_dev *mdev, struct drbd_backing_dev *bdev,
sector_t sector, int rw);
* Re: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-16 2:36 ` Ernest Montrose
@ 2007-11-26 14:31 ` Philipp Reisner
2007-11-26 14:43 ` Montrose, Ernest
2007-11-30 0:01 ` Montrose, Ernest
2 siblings, 0 replies; 14+ messages in thread
From: Philipp Reisner @ 2007-11-26 14:31 UTC (permalink / raw)
To: drbd-dev; +Cc: Montrose, Ernest
[-- Attachment #1: Type: text/plain, Size: 2628 bytes --]
On Friday 16 November 2007 03:36:19 Ernest Montrose wrote:
> Phil,
> I tested the patch and unfortunately it does not fix the race condition
> though I believe it fixes the ASSERT issues.
> Essentially, when the resync is done and we are in drbd_resync_finished(),
> if we pause the device then we send the state to the peer. The peer is
> done syncing at that point. It does a sync_handshake() that moves its state
> to WFBitMapS and pdsk=Inconsistent (since the target has not yet changed its
> state to Connected and UpToDate). When resync_finished is done we go
> UpToDate and Connected and we're stuck.
>
> I tested yet another idea which seems to close the racy window. I turned
> drbd_resync_finished into two parts. A cleanup part that the worker can
> schedule to do the clean up and a done part that
> changes the state right away when the resync is done. I include an
> untested patch to illustrate that idea.
>
Hi Ernest,
The attached patch finally made it into the GIT repository
(see 3a57119417c46c51dd4bc720ab7dbf14228f05bb.git.txt).
It is slightly different from the patch I suggested at first.
As you could not confirm that the bug is closed for you, I tried
to reproduce it here (with the attached patch), using some
instrumentation code that makes the call to drbd_resync_finished()
last 10 seconds.
I tested pausing and continuing on the SyncTarget and on the
SyncSource side. I could not find any issues:
[42949590.920000] drbd0: conn( StartingSyncT -> WFSyncUUID )
[42949590.920000] drbd0: conn( WFSyncUUID -> SyncTarget )
[42949590.920000] drbd0: Began resync as SyncTarget (will sync 262244 KB [65561 bits set]).
[42949590.920000] drbd0: Writing meta data super block now.
[42949669.270000] drbd0: Warn long sleep start
[42949672.120000] drbd0: conn( SyncTarget -> PausedSyncT ) user_isp( 0 -> 1 )
[42949672.120000] drbd0: Resync suspended
[42949678.610000] drbd0: conn( PausedSyncT -> SyncTarget ) user_isp( 1 -> 0 )
[42949678.610000] drbd0: Syncer continues.
[42949679.280000] drbd0: Warn long sleep stop
[42949679.280000] drbd0: Resync done (total 87 sec; paused 6 sec; 3236 K/sec)
[42949679.280000] drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
[42949679.280000] drbd0: Writing meta data super block now.
Ernest, can you please confirm that this issue is solved for you with that
patch, or provide logfile output of a failing test?
Thanks!
-phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
[-- Attachment #2: 3a57119417c46c51dd4bc720ab7dbf14228f05bb.git.txt --]
[-- Type: text/plain, Size: 2048 bytes --]
commit 3a57119417c46c51dd4bc720ab7dbf14228f05bb
Author: Philipp Reisner <philipp.reisner@linbit.com>
Date: Sun Nov 18 22:19:42 2007 +0100
make resync more robust, don't reset find bit offset too early.
by the sync group serialisation code,
a resync timer of a device may be rescheduled.
if it had already sent all requests (find reached the end of the bitmap),
but did not yet receive all the answers (some bits are still set),
it would re-request all those bits
if the start offset of the find gets reset too early.
diff --git a/drbd/drbd_bitmap.c b/drbd/drbd_bitmap.c
index dda2a29..68decd2 100644
--- a/drbd/drbd_bitmap.c
+++ b/drbd/drbd_bitmap.c
@@ -874,7 +874,7 @@ unsigned long drbd_bm_find_next(drbd_dev *mdev)
}
if (i >= b->bm_bits) {
i = -1UL;
- b->bm_fo = 0;
+ /* leave b->bm_fo unchanged. */
} else {
b->bm_fo = i+1;
}
@@ -898,7 +898,7 @@ void drbd_bm_set_find(drbd_dev *mdev, unsigned long i)
int drbd_bm_rs_done(drbd_dev *mdev)
{
- return mdev->bitmap->bm_fo == 0;
+ return (mdev->bitmap->bm_fo >= mdev->bitmap->bm_bits);
}
/* returns number of bits actually changed.
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index dec523a..7d0d024 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -787,10 +787,14 @@ int _drbd_set_state(drbd_dev* mdev, drbd_state_t ns,enum chg_state_flags flags)
(ns.conn == SyncTarget || ns.conn == SyncSource) ) {
INFO("Syncer continues.\n");
mdev->rs_paused += (long)jiffies-(long)mdev->rs_mark_time;
- if( ns.conn == SyncTarget ) {
- D_ASSERT(!test_bit(STOP_SYNC_TIMER,&mdev->flags));
- clear_bit(STOP_SYNC_TIMER,&mdev->flags);
- mod_timer(&mdev->resync_timer,jiffies);
+ if (ns.conn == SyncTarget) {
+ if (!test_bit(STOP_SYNC_TIMER,&mdev->flags)) {
+ mod_timer(&mdev->resync_timer,jiffies);
+ }
+ /* This if (!test_bit) is only needed for the case
+ that a device that has ceased to used its timer,
+ i.e. it is already in drbd_resync_finished() gets
+ paused and resumed. */
}
}
[-- Attachment #3: resync_finished_10_seconds.diff --]
[-- Type: text/x-diff, Size: 487 bytes --]
diff --git a/drbd/drbd_worker.c b/drbd/drbd_worker.c
index 227b024..ad9dbbe 100644
--- a/drbd/drbd_worker.c
+++ b/drbd/drbd_worker.c
@@ -481,6 +481,10 @@ int drbd_resync_finished(drbd_dev* mdev)
}
ERR("Warn failed to drbd_rs_del_all() and to kmalloc(w).\n");
}
+ ERR("Warn long sleep start\n");
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(10 * HZ);
+ ERR("Warn long sleep stop\n");
dt = (jiffies - mdev->rs_start - mdev->rs_paused) / HZ;
if (dt <= 0) dt=1;
* RE: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-16 2:36 ` Ernest Montrose
2007-11-26 14:31 ` Philipp Reisner
@ 2007-11-26 14:43 ` Montrose, Ernest
2007-11-26 15:09 ` Philipp Reisner
2007-11-30 0:01 ` Montrose, Ernest
2 siblings, 1 reply; 14+ messages in thread
From: Montrose, Ernest @ 2007-11-26 14:43 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
Well, as it turned out, my last idea did not completely fix the issue
either.
So it may be that my description and staging of it were not complete. I
will try to completely describe the problem and then test your patch. I
will get back to you with the results as soon as I get them.
FYI, your last idea also introduced an issue where a sync would stall
forever if paused and resumed quickly. If you checked it in somewhere,
you might want to back it out.
Thanks,
EM--
* Re: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-26 14:43 ` Montrose, Ernest
@ 2007-11-26 15:09 ` Philipp Reisner
0 siblings, 0 replies; 14+ messages in thread
From: Philipp Reisner @ 2007-11-26 15:09 UTC (permalink / raw)
To: Montrose, Ernest; +Cc: drbd-dev
On Monday 26 November 2007 15:43:11 Montrose, Ernest wrote:
> Phil,
> Well as it turned out, my last idea did not completely fix the issue
> either.
> So it may be that my description and staging of it was not complete. I
> will try to completely describe the problem and then test your patch. I
> will get back to you with the results as soon as I get them.
>
> FYI, your last idea also introduced an issue where a sync would stall
> for ever if paused and resumed quickly. If you checked it in somewhere,
> you might want to back out.
>
Oh, right.
This makes it work also with quick pause-resume cycles:
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -788,7 +788,7 @@ int _drbd_set_state(drbd_dev* mdev, drbd_state_t ns,enum chg_state_flags flags)
INFO("Syncer continues.\n");
mdev->rs_paused += (long)jiffies-(long)mdev->rs_mark_time;
if (ns.conn == SyncTarget) {
- if (!test_bit(STOP_SYNC_TIMER,&mdev->flags)) {
+ if (!test_and_clear_bit(STOP_SYNC_TIMER,&mdev->flags)) {
mod_timer(&mdev->resync_timer,jiffies);
}
/* This if (!test_bit) is only needed for the case
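Why the one-word change matters can be sketched with a toy model. Note the
assumptions: the toy_ names are invented, and the model assumes that the
resync timer handler consumes the stop flag when it fires (clears it and
does not re-arm), which is a reading of the discussion rather than
something shown in this thread. Under that assumption, a resume that only
tests the flag leaves it set for the still-pending timer to trip over.

```c
/* Toy model of a quick pause/resume cycle on a sync target. */
struct toy_sync {
    int stop_flag;      /* stands in for STOP_SYNC_TIMER */
    int timer_pending;  /* 1 while the resync timer is armed */
    int requests;       /* resync requests issued so far */
};

/* Pause: ask the timer handler to stop at its next expiry. */
void toy_pause(struct toy_sync *s)
{
    s->stop_flag = 1;
}

/* Timer expiry (assumed behaviour): if asked to stop, consume the flag
 * and stay idle; otherwise issue one request and re-arm. */
void toy_timer_fire(struct toy_sync *s)
{
    if (!s->timer_pending)
        return;
    s->timer_pending = 0;
    if (s->stop_flag) {
        s->stop_flag = 0;
        return;
    }
    s->requests++;
    s->timer_pending = 1;
}

/* Resume as in the first version: test the flag but leave it set. */
void toy_resume_test_only(struct toy_sync *s)
{
    if (!s->stop_flag)
        s->timer_pending = 1;
}

/* Resume as in the follow-up: test and clear in one step. */
void toy_resume_test_and_clear(struct toy_sync *s)
{
    int was_set = s->stop_flag;

    s->stop_flag = 0;
    if (!was_set)
        s->timer_pending = 1;
}
```

Starting with the timer armed, pause followed immediately by the test-only
resume leaves the flag set, so the next expiry stops the timer and no
further requests are issued. With test-and-clear, the pending timer finds
the flag already cleared and keeps going.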
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
* RE: [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent
2007-11-16 2:36 ` Ernest Montrose
2007-11-26 14:31 ` Philipp Reisner
2007-11-26 14:43 ` Montrose, Ernest
@ 2007-11-30 0:01 ` Montrose, Ernest
2 siblings, 0 replies; 14+ messages in thread
From: Montrose, Ernest @ 2007-11-30 0:01 UTC (permalink / raw)
To: Philipp Reisner, drbd-dev
Phil,
Sorry it took me a while to get to this, but I am still able to reproduce
the problem. Either:
1) I am using older code, as we are unable to get the latest code base for
now, OR
2) Your testing is a tad different than mine. I use your two patches,
but I wonder if you actually do the "drbdsetup dev0 pause-sync and
resume-sync" exactly between the time you get the first and last sleep
message. Also be aware that you have to have "syncer" set with
--after=[-1,0,1,2..] for drbd0,1,2 and 3...etc. You would then do an
"invalidate", say on drbd10, then a "pause" "resume" on drbd0.
In my case, I do a:
drbdsetup /dev/drbd27 invalidate
Then I wait for the message from drbd_resync_finished() and
I quickly do:
drbdsetup /dev/drbd0 pause-sync
usleep 1000
drbdsetup /dev/drbd0 resume-sync
usleep 1000
drbdsetup /dev/drbd0 pause-sync
usleep 1000
drbdsetup /dev/drbd0 resume-sync
I actually have a script that does this.
EM--
end of thread, other threads:[~2007-11-30 0:01 UTC | newest]
Thread overview: 14+ messages
2007-11-07 3:54 [Drbd-dev] Handling on-disk caches Graham, Simon
2007-11-07 14:03 ` Lars Ellenberg
2007-11-07 14:16 ` Graham, Simon
2007-11-12 12:39 ` Philipp Reisner
2007-11-12 13:41 ` [Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent Montrose, Ernest
2007-11-15 16:27 ` Philipp Reisner
2007-11-16 2:36 ` Ernest Montrose
2007-11-26 14:31 ` Philipp Reisner
2007-11-26 14:43 ` Montrose, Ernest
2007-11-26 15:09 ` Philipp Reisner
2007-11-30 0:01 ` Montrose, Ernest
2007-11-12 15:59 ` [Drbd-dev] Handling on-disk caches Graham, Simon
2007-11-12 16:24 ` Philipp Reisner
[not found] ` <BD7042533C2F8943A6A4257A9E31C454F47A31@EXNA.corp.stratus.com>
2007-11-15 16:34 ` [Drbd-dev] DRBD8: incorrect state transition Connected->WFBitMapS and UpToDate->Inconsistent Montrose, Ernest