linux-raid.vger.kernel.org archive mirror
* [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
@ 2008-04-02 22:09 Mike Snitzer
  2008-05-06  6:53 ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-04-02 22:09 UTC
  To: linux-raid; +Cc: linux-kernel, paul.clements, neilb

resync via bitmap if faulty's events+1 == bitmap's events_cleared

For more background please see:
http://marc.info/?l=linux-raid&m=120703208715865&w=2

Without this change validate_super() will prevent the previously faulty
member from recovering via bitmap, e.g.:

 md: nbd0 rdev's ev1 (30080) < mddev->bitmap->events_cleared (30081)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (30342) < mddev->bitmap->events_cleared (30343)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (30186) < mddev->bitmap->events_cleared (30187)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (30286) < mddev->bitmap->events_cleared (30287)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (30476) < mddev->bitmap->events_cleared (30477)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (30488) < mddev->bitmap->events_cleared (30489)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (30680) < mddev->bitmap->events_cleared (30681)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (31082) < mddev->bitmap->events_cleared (31083)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31264) < mddev->bitmap->events_cleared (31265)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (31108) < mddev->bitmap->events_cleared (31109)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (31126) < mddev->bitmap->events_cleared (31127)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31416) < mddev->bitmap->events_cleared (31417)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31432) < mddev->bitmap->events_cleared (31433)... rdev->raid_disk=-1
 md: nbd0 rdev's ev1 (31274) < mddev->bitmap->events_cleared (31275)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31448) < mddev->bitmap->events_cleared (31449)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31494) < mddev->bitmap->events_cleared (31495)... rdev->raid_disk=-1
 md: nbd1 rdev's ev1 (31512) < mddev->bitmap->events_cleared (31513)... rdev->raid_disk=-1

Note that 'mddev->bitmap->events_cleared' is _always_ odd and the
previously faulty member's 'ev1' (aka events) is _always_ even.  The
current validate_super() logic is blind to clean-to-dirty event
transitions, and as such it imposes potentially expensive full resyncs.
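
As a rough illustration, here is a user-space sketch (not kernel code;
the function name is invented, and 'ev1', 'events_cleared' and
'degraded' stand in for the rdev/mddev fields of the same names) of how
the proposed check below classifies the first log line above:

#include <stdio.h>

/* May a previously faulty member still resync via the bitmap? */
static int accept_via_bitmap(unsigned long long ev1,
                             unsigned long long events_cleared,
                             int degraded)
{
    if (ev1 >= events_cleared)
        return 1;   /* member is recent enough as-is */
    /* member missed exactly one clean->dirty (even->odd) transition */
    return degraded && (events_cleared & 1) &&
           (ev1 + 1 == events_cleared);
}

int main(void)
{
    /* nbd0 from the first log line: 30080 vs 30081 -> accepted */
    printf("%d\n", accept_via_bitmap(30080, 30081, 1));
    return 0;
}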

This change makes the bitmap's 'events_cleared' logic more nuanced than
that which is documented in include/linux/raid/bitmap.h:

 * (2) This event counter [events_cleared] is updated when the other one
 *    [events] is *if*and*only*if* the array is not degraded.  As bits are
 *    not cleared when the array is degraded, this represents the last
 *    time that any bits were cleared.  If a device is being added that
 *    has an event count with this value or higher, it is accepted as
 *    conforming to the bitmap.

But the question becomes: is the proposed change safe?

Considerable testing seems to indicate that it is.  But I welcome any
other suggestions for how to prevent such unnecessary full resyncs.
---
 drivers/md/md.c |   20 ++++++++++++++++++--
 1 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 61ccbd2..43425e4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -839,8 +839,16 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 	} else if (mddev->bitmap) {
 		/* if adding to array with a bitmap, then we can accept an
 		 * older device ... but not too old.
+		 *
+		 * if 'mddev->bitmap->events_cleared' is odd it implies a clean-to-dirty
+		 * transition occurred just before the array became degraded
+		 * - if rdev's on-disk 'events' is just one less (aka even) this
+		 *   dirty transition wasn't recorded; allow use of the bitmap to
+		 *   efficiently resync to this member
 		 */
-		if (ev1 < mddev->bitmap->events_cleared)
+		if (ev1 < mddev->bitmap->events_cleared &&
+		    !(mddev->degraded && (mddev->bitmap->events_cleared & 1) &&
+		      (ev1+1 == mddev->bitmap->events_cleared)))
 			return 0;
 	} else {
 		if (ev1 < mddev->events)
@@ -1214,8 +1222,16 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 	} else if (mddev->bitmap) {
 		/* If adding to array with a bitmap, then we can accept an
 		 * older device, but not too old.
+		 *
+		 * if 'mddev->bitmap->events_cleared' is odd it implies a clean-to-dirty
+		 * transition likely occurred just before the array became degraded
+		 * - if rdev's on-disk 'events' is just one less (aka even) this
+		 *   dirty transition wasn't recorded; allow use of the bitmap to
+		 *   efficiently resync to this member
 		 */
-		if (ev1 < mddev->bitmap->events_cleared)
+		if (ev1 < mddev->bitmap->events_cleared &&
+		    !(mddev->degraded && (mddev->bitmap->events_cleared & 1) &&
+		      (ev1+1 == mddev->bitmap->events_cleared)))
 			return 0;
 	} else {
 		if (ev1 < mddev->events)
-- 
1.5.3.5


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-04-02 22:09 [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition Mike Snitzer
@ 2008-05-06  6:53 ` Neil Brown
  2008-05-06 11:58   ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-06  6:53 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Wednesday April 2, snitzer@gmail.com wrote:
> resync via bitmap if faulty's events+1 == bitmap's events_cleared
> 
> For more background please see:
> http://marc.info/?l=linux-raid&m=120703208715865&w=2
> 
> Without this change validate_super() will prevent the previously faulty
> member from recovering via bitmap, e.g.:

I can't help thinking that you are misinterpreting something.  I don't
think there is a clean->dirty transition happening here.
You could confirm this by using --examine on both devices after the
messy shutdown and before re-assembling the array.

Even allowing for that possible confusion, I cannot quite see what is
going on.
It is fairly clear from the event counts that the NBD device is marked
clean, but if this is happening at array-shutdown time, I cannot see
why md would try to write to the NBD device and thereby detect an
error...

Do you have an internal bitmap or a bitmap in an external file?

In general, I would not like to make decisions based on the
oddness/evenness of the event counter.  I consider that to be an
internal implementation detail.  I am happy to make decisions based on
a difference-of-1.  I need to understand the big picture first though.

NeilBrown


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-06  6:53 ` Neil Brown
@ 2008-05-06 11:58   ` Mike Snitzer
  2008-05-08  6:13     ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-05-06 11:58 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Tue, May 6, 2008 at 2:53 AM, Neil Brown <neilb@suse.de> wrote:
> On Wednesday April 2, snitzer@gmail.com wrote:
>  > resync via bitmap if faulty's events+1 == bitmap's events_cleared
>  >
>  > For more background please see:
>  > http://marc.info/?l=linux-raid&m=120703208715865&w=2
>  >
>  > Without this change validate_super() will prevent the previously faulty
>  > member from recovering via bitmap, e.g.:
>
>  I can't help thinking that you are misinterpreting something.  I don't
>  think there is a clean->dirty transition happening here.
>  You could confirm this by using --examine on both devices after the
>  messy shutdown and before re-assembling the array.
>
>  Even allowing for that possible confusion, I cannot quite see what is
>  going on.
>  It is fairly clear from the event counts that the NBD device is marked
>  clean, but if this is happening at array-shutdown time, I cannot see
>  why md would try to write to the NBD device and thereby detect an
>  error...
>
>  Do you have an internal bitmap or a bitmap in an external file?
>
>  In general, I would not like to make decisions based on the
>  oddness/evenness of the event counter.  I consider that to be an
>  internal implementation detail.  I am happy to make decisions based on
>  a difference-of-1.  I need to understand the big picture first though.

Hi Neil,

I definitely could be misinterpreting something.  However, I did
determine that if the write-mostly NBD member of the raid1 becomes
degraded while writing to the raid1, it frequently has an 'events' that
is one less than the 'events_cleared' (of the local raid1 member that
the array gets reassembled with first).  The event counts indicate the
NBD member is clean and the local member is dirty.

I'm using internal bitmaps.  I've focused on the even->odd
(clean->dirty) transition to rationalize the safety of allowing the
NBD member to be off by one _and_ clean.  That could easily be
superficial but it seems significant.

It looks like bitmap_update_sb()'s incrementing of events_cleared (on
behalf of the local member) could be racing with the NBD member
becoming faulty (thereby making the array degraded).  This allows
events_cleared to record a clean->dirty transition that occurred just
before the array became degraded.  My reasoning is: if it was a
clean->dirty transition, the local member's bitmap still has the
associated dirty bit set, so using the bitmap to resync is valid.
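
For reference, these are the pre-patch bitmap_update_sb() lines in
question (they appear verbatim in the diffs later in this thread),
annotated with the suspected window; the annotation is an
interpretation of the race, not confirmed behaviour:

	sb->events = cpu_to_le64(bitmap->mddev->events);
	if (!bitmap->mddev->degraded)
		/* suspected window: mddev->events may already include
		 * the clean->dirty bump that the failing NBD member
		 * never got to write, while ->degraded is not yet set,
		 * leaving events_cleared one ahead of that member's
		 * recorded events */
		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);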

thanks,
Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-06 11:58   ` Mike Snitzer
@ 2008-05-08  6:13     ` Neil Brown
  2008-05-08 20:11       ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-08  6:13 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Tuesday May 6, snitzer@gmail.com wrote:
> 
> It looks like bitmap_update_sb()'s incrementing of events_cleared (on
> behalf of the local member) could be racing with the NBD member
> becoming faulty (thereby making the array degraded).  This allows
> events_cleared to record a clean->dirty transition that occurred just
> before the array became degraded.  My reasoning is: if it was a
> clean->dirty transition, the local member's bitmap still has the
> associated dirty bit set, so using the bitmap to resync is valid.
> 
> thanks,
> Mike

Thanks for persisting.  I think I understand what is going on now.

How about this patch?  It is similar to yours, but instead of depending
on the odd/even state of the event counter, it directly checks the
clean/dirty state of the array.

NeilBrown


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    5 +++++
 1 file changed, 5 insertions(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-05-02 14:49:05.000000000 +1000
+++ ./drivers/md/md.c	2008-05-08 16:10:48.000000000 +1000
@@ -843,6 +843,8 @@ static int super_90_validate(mddev_t *md
 		/* if adding to array with a bitmap, then we can accept an
 		 * older device ... but not too old.
 		 */
+		if (sb->state & (1<<MD_SB_CLEAN))
+			ev1++;
 		if (ev1 < mddev->bitmap->events_cleared)
 			return 0;
 	} else {
@@ -1218,6 +1220,9 @@ static int super_1_validate(mddev_t *mdd
 		/* If adding to array with a bitmap, then we can accept an
 		 * older device, but not too old.
 		 */
+		if (mddev->recovery_cp == MaxSector)
+			/* array was clean, so can allow 'next' event */
+			ev1++;
 		if (ev1 < mddev->bitmap->events_cleared)
 			return 0;
 	} else {


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-08  6:13     ` Neil Brown
@ 2008-05-08 20:11       ` Mike Snitzer
  2008-05-09  1:40         ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-05-08 20:11 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Thu, May 8, 2008 at 2:13 AM, Neil Brown <neilb@suse.de> wrote:
...
>  Thanks for persisting.  I think I understand what is going on now.
>
>  How about this patch?  It is similar to yours, but instead of depending
>  on the odd/even state of the event counter, it directly checks the
>  clean/dirty state of the array.

Hi Neil,

Your revised patch works great and is obviously cleaner.

Thanks!

Tested-by: Mike Snitzer <snitzer@gmail.com>


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-08 20:11       ` Mike Snitzer
@ 2008-05-09  1:40         ` Neil Brown
  2008-05-09  4:42           ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-09  1:40 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Thursday May 8, snitzer@gmail.com wrote:
...
> Your revised patch works great and is obviously cleaner.

But I'm still not happy with it :-(
I suspect there might be other cases where it will still do the wrong
thing.
The real problem is that we are updating events_cleared too early.  We
are setting it to the new event counter before that is even written out.

So I've come up with this patch, which I think more clearly
encapsulates what events_cleared means.  It is now set to the current
'events' counter immediately before we clear any bit.

If you could test it, I'd really appreciate it.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-05-09 11:02:13.000000000 +1000
+++ ./drivers/md/bitmap.c	2008-05-09 11:38:35.000000000 +1000
@@ -465,8 +465,6 @@ void bitmap_update_sb(struct bitmap *bit
 	spin_unlock_irqrestore(&bitmap->lock, flags);
 	sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
 	sb->events = cpu_to_le64(bitmap->mddev->events);
-	if (!bitmap->mddev->degraded)
-		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1094,9 +1092,19 @@ void bitmap_daemon_work(struct bitmap *b
 			} else
 				spin_unlock_irqrestore(&bitmap->lock, flags);
 			lastpage = page;
-/*
-			printk("bitmap clean at page %lu\n", j);
-*/
+
+			/* We are possibly going to clear some bits, so make
+			 * sure that events_cleared is up-to-date.
+			 */
+			if (bitmap->events_cleared < bitmap->mddev->events) {
+				bitmap_super_t *sb;
+				sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+				bitmap->events_cleared = bitmap->mddev->events;
+				sb->events_cleared =
+					cpu_to_le64(bitmap->events_cleared);
+				kunmap_atomic(sb, KM_USER0);
+				write_page(bitmap, bitmap->sb_page, 1);
+			}
 			spin_lock_irqsave(&bitmap->lock, flags);
 			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09  1:40         ` Neil Brown
@ 2008-05-09  4:42           ` Mike Snitzer
  2008-05-09  5:08             ` Mike Snitzer
  2008-05-09  6:01             ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Snitzer @ 2008-05-09  4:42 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Thu, May 8, 2008 at 9:40 PM, Neil Brown <neilb@suse.de> wrote:
...
>  But I'm still not happy with it :-(
>  I suspect there might be other cases where it will still do the wrong
>  thing.
>  The real problem is that we are updating events_cleared too early.  We
>  are setting it to the new event counter before that is even written out.
>
>  So I've come up with this patch, which I think more clearly
>  encapsulates what events_cleared means.  It is now set to the current
>  'events' counter immediately before we clear any bit.
>
>  If you could test it, I'd really appreciate it.

Unfortunately my testing with this patch results in a full resync.

Here is the state of the array after shutdown:
# mdadm -X /dev/nbd0 /dev/sdq
        Filename : /dev/nbd0
           Magic : 6d746962
         Version : 4
            UUID : 7140cc3c:8681416c:12c5668a:984ca55d
          Events : 896
  Events Cleared : 897
           State : OK
       Chunksize : 128 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 52428736 (50.00 GiB 53.69 GB)
          Bitmap : 409600 bits (chunks), 1 dirty (0.0%)

        Filename : /dev/sdq
           Magic : 6d746962
         Version : 4
            UUID : 7140cc3c:8681416c:12c5668a:984ca55d
          Events : 898
  Events Cleared : 897
           State : OK
       Chunksize : 128 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 52428736 (50.00 GiB 53.69 GB)
          Bitmap : 409600 bits (chunks), 0 dirty (0.0%)

# mdadm --examine /dev/nbd0 /dev/sdq
/dev/nbd0:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 7140cc3c:8681416c:12c5668a:984ca55d
  Creation Time : Thu May  8 06:55:32 2008
     Raid Level : raid1
  Used Dev Size : 52428736 (50.00 GiB 53.69 GB)
     Array Size : 52428736 (50.00 GiB 53.69 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu May  8 18:07:47 2008
          State : clean
Internal Bitmap : present
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : df65cb35 - correct
         Events : 0.896


      Number   Major   Minor   RaidDevice State
this     1      43        0        1      active sync write-mostly   /dev/nbd0

   0     0      65        0        0      active sync   /dev/sdq
   1     1      43        0        1      active sync write-mostly   /dev/nbd0


/dev/sdq:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 7140cc3c:8681416c:12c5668a:984ca55d
  Creation Time : Thu May  8 06:55:32 2008
     Raid Level : raid1
  Used Dev Size : 52428736 (50.00 GiB 53.69 GB)
     Array Size : 52428736 (50.00 GiB 53.69 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu May  8 18:07:49 2008
          State : clean
Internal Bitmap : present
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
       Checksum : df65c956 - correct
         Events : 0.898


      Number   Major   Minor   RaidDevice State
this     0      65        0        0      active sync   /dev/sdq

   0     0      65        0        0      active sync   /dev/sdq
   1     1       0        0        1      faulty removed

Was I supposed to use this latest patch in combination with your
previous patch (to validate_super)?  Because you'll note that with
your most recent patch nbd0's events (ev1) is still one less than
sdq's events_cleared.  As such the validate_super's "ev1 <
mddev->bitmap->events_cleared" check triggers a full rebuild.

The kernel log shows:
md: md0 stopped.
md: bind<nbd0>
md: bind<sdq>
md: kicking non-fresh nbd0 from array!
md: unbind<nbd0>
md: export_rdev(nbd0)
raid1: raid set md0 active with 1 out of 2 mirrors
md0: bitmap initialized from disk: read 13/13 pages, set 0 bits, status: 0
created bitmap (200 pages) for device md0
Nope!!! ev1 (896) < mddev->bitmap->events_cleared (897)
md: bind<nbd0>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sdq
 disk 1, wo:1, o:1, dev:nbd0
md: recovery of RAID array md0


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09  4:42           ` Mike Snitzer
@ 2008-05-09  5:08             ` Mike Snitzer
  2008-05-09  5:26               ` Mike Snitzer
  2008-05-09  6:01             ` Neil Brown
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-05-09  5:08 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Fri, May 9, 2008 at 12:42 AM, Mike Snitzer <snitzer@gmail.com> wrote:

>  Was I supposed to use this latest patch in combination with your
>  previous patch (to validate_super)?  Because you'll note that with
>  your most recent patch nbd0's events (ev1) is still one less than
>  sdq's events_cleared.  As such the validate_super's "ev1 <
>  mddev->bitmap->events_cleared" check triggers a full rebuild.
>
>  The kernel log shows:
>  md: md0 stopped.
>  md: bind<nbd0>
>  md: bind<sdq>
>  md: kicking non-fresh nbd0 from array!
>  md: unbind<nbd0>
>  md: export_rdev(nbd0)
>  raid1: raid set md0 active with 1 out of 2 mirrors
>  md0: bitmap initialized from disk: read 13/13 pages, set 0 bits, status: 0

Also, no bits were set in the bitmap.. bitmap_create() must've thrown
away the dirty bits.  Given your latest patch, does bitmap_create()'s
"bitmap->events_cleared == mddev->events" check need to be adjusted?

Before I would always see something like:
md0: bitmap initialized from disk: read 13/13 pages, set 1 bits, status: 0

Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09  5:08             ` Mike Snitzer
@ 2008-05-09  5:26               ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2008-05-09  5:26 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Fri, May 9, 2008 at 1:08 AM, Mike Snitzer <snitzer@gmail.com> wrote:
> On Fri, May 9, 2008 at 12:42 AM, Mike Snitzer <snitzer@gmail.com> wrote:
>
>  >  Was I supposed to use this latest patch in combination with your
>  >  previous patch (to validate_super)?  Because you'll note that with
>  >  your most recent patch nbd0's events (ev1) is still one less than
>  >  sdq's events_cleared.  As such the validate_super's "ev1 <
>  >  mddev->bitmap->events_cleared" check triggers a full rebuild.
>  >
>  >  The kernel log shows:
>  >  md: md0 stopped.
>  >  md: bind<nbd0>
>  >  md: bind<sdq>
>  >  md: kicking non-fresh nbd0 from array!
>  >  md: unbind<nbd0>
>  >  md: export_rdev(nbd0)
>  >  raid1: raid set md0 active with 1 out of 2 mirrors
>  >  md0: bitmap initialized from disk: read 13/13 pages, set 0 bits, status: 0
>
>  Also, no bits were set in the bitmap.. bitmap_create() must've thrown
>  away the dirty bits.  Given your latest patch, does bitmap_create()'s
>  "bitmap->events_cleared == mddev->events" check need to be adjusted?
>
>  Before I would always see something like:
>  md0: bitmap initialized from disk: read 13/13 pages, set 1 bits, status: 0

Actually, the mdadm -X output I provided shows that sdq's bitmap
doesn't have any bits set:
Bitmap : 409600 bits (chunks), 0 dirty (0.0%)

This can't be right, considering nbd0 was marked faulty and the array
became degraded, can it?


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09  4:42           ` Mike Snitzer
  2008-05-09  5:08             ` Mike Snitzer
@ 2008-05-09  6:01             ` Neil Brown
  2008-05-09 15:00               ` Mike Snitzer
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-09  6:01 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Friday May 9, snitzer@gmail.com wrote:
...
> Unfortunately my testing with this patch results in a full resync.
> 
> Here is the state of the array after shutdown:
> # mdadm -X /dev/nbd0 /dev/sdq
>         Filename : /dev/nbd0
>            Magic : 6d746962
>          Version : 4
>             UUID : 7140cc3c:8681416c:12c5668a:984ca55d
>           Events : 896
>   Events Cleared : 897

Events Cleared is *larger* than Events!!! Is that repeatable?  I can
only see it happening if a very small race were lost.  You don't have
any other patches in there, do you?

> 
> Was I supposed to use this latest patch in combination with your
> previous patch (to validate_super)?  Because you'll note that with
> your most recent patch nbd0's events (ev1) is still one less than
> sdq's events_cleared.  As such the validate_super's "ev1 <
> mddev->bitmap->events_cleared" check triggers a full rebuild.

No, you weren't supposed to combine it with the previous patch.

This patch should close the race, though I still find it hard to
believe that you lost the race.

NeilBrown


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-05-09 11:02:13.000000000 +1000
+++ ./drivers/md/bitmap.c	2008-05-09 16:00:07.000000000 +1000
@@ -465,8 +465,6 @@ void bitmap_update_sb(struct bitmap *bit
 	spin_unlock_irqrestore(&bitmap->lock, flags);
 	sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
 	sb->events = cpu_to_le64(bitmap->mddev->events);
-	if (!bitmap->mddev->degraded)
-		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1094,9 +1092,21 @@ void bitmap_daemon_work(struct bitmap *b
 			} else
 				spin_unlock_irqrestore(&bitmap->lock, flags);
 			lastpage = page;
-/*
-			printk("bitmap clean at page %lu\n", j);
-*/
+
+			/* We are possibly going to clear some bits, so make
+			 * sure that events_cleared is up-to-date.
+			 */
+			if (bitmap->events_cleared < bitmap->mddev->events) {
+				bitmap_super_t *sb;
+				bitmap->events_cleared = bitmap->mddev->events;
+				wait_event(mddev->sb_wait,
+				    !test_bit(MD_CHANGE_CLEAN, &mddev->flags));
+				sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+				sb->events_cleared =
+					cpu_to_le64(bitmap->events_cleared);
+				kunmap_atomic(sb, KM_USER0);
+				write_page(bitmap, bitmap->sb_page, 1);
+			}
 			spin_lock_irqsave(&bitmap->lock, flags);
 			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09  6:01             ` Neil Brown
@ 2008-05-09 15:00               ` Mike Snitzer
  2008-05-16 11:54                 ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-05-09 15:00 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Fri, May 9, 2008 at 2:01 AM, Neil Brown <neilb@suse.de> wrote:
>
> On Friday May 9, snitzer@gmail.com wrote:

>  > Unfortunately my testing with this patch results in a full resync.
>  >
>  > Here is the state of the array after shutdown:
>  > # mdadm -X /dev/nbd0 /dev/sdq
>  >         Filename : /dev/nbd0
>  >            Magic : 6d746962
>  >          Version : 4
>  >             UUID : 7140cc3c:8681416c:12c5668a:984ca55d
>  >           Events : 896
>  >   Events Cleared : 897
>
>  Events Cleared is *larger* than Events!!! Is that repeatable?  I can
>  only see it happening if a very small race were lost.  You don't have
>  any other patches in there, do you?

Yes, it is repeatable with your previous patch.  But with your most
recent patch I had the following after shutdown:

# mdadm -X /dev/nbd0 /dev/sdq
        Filename : /dev/nbd0
          Events : 1732
  Events Cleared : 1732
          Bitmap : 409600 bits (chunks), 1 dirty (0.0%)

        Filename : /dev/sdq
          Events : 1736
  Events Cleared : 1736
          Bitmap : 409600 bits (chunks), 1 dirty (0.0%)

Unfortunately sdq's events_cleared appears to have been updated
_after_ the array became degraded.
As such a full resync occurred because 1732 < 1736.

>  This patch should close the race, though I still find it hard to
>  believe that you lost the race.

Comments inlined below.

>  Signed-off-by: Neil Brown <neilb@suse.de>
>
>  ### Diffstat output
>   ./drivers/md/bitmap.c |   20 +++++++++++++++-----
>   1 file changed, 15 insertions(+), 5 deletions(-)
>
>
>  diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
>  --- .prev/drivers/md/bitmap.c   2008-05-09 11:02:13.000000000 +1000
>  +++ ./drivers/md/bitmap.c       2008-05-09 16:00:07.000000000 +1000
>
> @@ -465,8 +465,6 @@ void bitmap_update_sb(struct bitmap *bit
>         spin_unlock_irqrestore(&bitmap->lock, flags);
>         sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
>         sb->events = cpu_to_le64(bitmap->mddev->events);
>  -       if (!bitmap->mddev->degraded)
>  -               sb->events_cleared = cpu_to_le64(bitmap->mddev->events);

Before, events_cleared was _not_ updated if the array was degraded.
Your patch doesn't appear to maintain that design.

I tried adding the following degraded check to your below conditional
but that resulted in nbd0's events < mddev->bitmap->events_cleared
again, so I'm back to square one:

                        if (!bitmap->mddev->degraded &&
                            bitmap->events_cleared < bitmap->mddev->events) {

In addition no bits were set in sdq's bitmap:
# mdadm -X /dev/nbd0 /dev/sdq
        Filename : /dev/nbd0
          Events : 2616
  Events Cleared : 2617
          Bitmap : 409600 bits (chunks), 0 dirty (0.0%)

        Filename : /dev/sdq
          Events : 2618
  Events Cleared : 2617
          Bitmap : 409600 bits (chunks), 0 dirty (0.0%)

>  @@ -1094,9 +1092,21 @@ void bitmap_daemon_work(struct bitmap *b
>
>                         } else
>                                 spin_unlock_irqrestore(&bitmap->lock, flags);
>                         lastpage = page;
>  -/*
>  -                       printk("bitmap clean at page %lu\n", j);
>  -*/
>  +
>  +                       /* We are possibly going to clear some bits, so make
>  +                        * sure that events_cleared is up-to-date.
>  +                        */
>  +                       if (bitmap->events_cleared < bitmap->mddev->events) {
>  +                               bitmap_super_t *sb;
>  +                               bitmap->events_cleared = bitmap->mddev->events;
>  +                               wait_event(mddev->sb_wait,
>  +                                   !test_bit(MD_CHANGE_CLEAN, &mddev->flags));

I needed "bitmap->mddev->sb_wait" and "bitmap->mddev->flags" to get
the code to compile.

Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-09 15:00               ` Mike Snitzer
@ 2008-05-16 11:54                 ` Neil Brown
  2008-05-19  4:33                   ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-16 11:54 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Friday May 9, snitzer@gmail.com wrote:
...
> Before, events_cleared was _not_ updated if the array was degraded.
> Your patch doesn't appear to maintain that design.

It does, but it is well hidden.
Bits in the bitmap are only cleared when the array is not degraded.
The new code for updating events_cleared is only triggered when a bit
is about to be cleared.

> 
> I needed "bitmap->mddev->sb_wait" and "bitmap->mddev->flags" to get
> the code to compile.

Sorry about that...

I decided to bite the bullet and create a setup where I could test
this myself.  Using the faulty personality of md makes it fairly
straightforward.  This script:

------------------------------------------------------------
mdadm -Ss 
# md9: a single-member 'faulty' personality array, used for fault injection
mdadm -B /dev/md9 -l faulty -n 1 /dev/sdc
# raid1 with an internal bitmap; -d 1 asks for a 1-second bitmap update delay
mdadm -CR /dev/md0 -l1 -n2 -d 1 --bitmap internal --assume-clean /dev/sdb /dev/md9
mkfs /dev/md0 3000000
mount /dev/md0 /mnt
echo hello > /mnt/afile
sync
sleep 4
echo before grow
mdadm -E /dev/md9 | grep -E '(State|Event).*:'
mdadm -X /dev/md9 | grep Bitmap
# start injecting write faults on md9 (the faulty personality's 'wa' mode)
mdadm -G /dev/md9 -l faulty -p wa
umount /mnt
echo after umount
mdadm -X /dev/sdb | grep Event
mdadm -S /dev/md0
echo sdb
mdadm -E /dev/sdb | grep Event
mdadm -E /dev/sdb | grep State' :'
mdadm -X /dev/sdb | grep Event

echo mdp
mdadm -E /dev/md9 | grep Event
mdadm -E /dev/md9 | grep State' :'
mdadm -X /dev/md9 | grep Event

# stop injecting faults before re-assembling and re-adding md9
mdadm -G /dev/md9 -l faulty -p none
mdadm -A /dev/md0 /dev/sdb
mdadm /dev/md0 -a /dev/md9
sleep 1
cat /proc/mdstat
------------------------------------------------------------

reproduces exactly your problem (I think).

This helped me discover what was wrong with my patch.  It has to do
with the event counter going backwards sometimes.

This patch makes the above test work as expected, and should provide
happiness for you too.

NeilBrown


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c |   26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-05-16 20:27:49.000000000 +1000
+++ ./drivers/md/bitmap.c	2008-05-16 21:49:20.000000000 +1000
@@ -454,8 +454,11 @@ void bitmap_update_sb(struct bitmap *bit
 	spin_unlock_irqrestore(&bitmap->lock, flags);
 	sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
 	sb->events = cpu_to_le64(bitmap->mddev->events);
-	if (!bitmap->mddev->degraded)
-		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+	if (bitmap->mddev->events < bitmap->events_cleared){
+		/* rocking back to read-only */
+		bitmap->events_cleared = bitmap->mddev->events;
+		sb->events_cleared = cpu_to_le64(bitmap->events_cleared);
+	}
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1085,9 +1088,22 @@ void bitmap_daemon_work(struct bitmap *b
 			} else
 				spin_unlock_irqrestore(&bitmap->lock, flags);
 			lastpage = page;
-/*
-			printk("bitmap clean at page %lu\n", j);
-*/
+
+			/* We are possibly going to clear some bits, so make
+			 * sure that events_cleared is up-to-date.
+			 */
+			if (bitmap->events_cleared < bitmap->mddev->events) {
+				bitmap_super_t *sb;
+				bitmap->events_cleared = bitmap->mddev->events;
+				wait_event(bitmap->mddev->sb_wait,
+				    !test_bit(MD_CHANGE_CLEAN,
+					      &bitmap->mddev->flags));
+				sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+				sb->events_cleared =
+					cpu_to_le64(bitmap->events_cleared);
+				kunmap_atomic(sb, KM_USER0);
+				write_page(bitmap, bitmap->sb_page, 1);
+			}
 			spin_lock_irqsave(&bitmap->lock, flags);
 			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-16 11:54                 ` Neil Brown
@ 2008-05-19  4:33                   ` Mike Snitzer
  2008-05-19  5:27                     ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Snitzer @ 2008-05-19  4:33 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Fri, May 16, 2008 at 7:54 AM, Neil Brown <neilb@suse.de> wrote:
> On Friday May 9, snitzer@gmail.com wrote:
>> On Fri, May 9, 2008 at 2:01 AM, Neil Brown <neilb@suse.de> wrote:
>> >
>> > On Friday May 9, snitzer@gmail.com wrote:
>>
>> >  > Unfortunately my testing with this patch results in a full resync.
...
>> >  diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
>> >  --- .prev/drivers/md/bitmap.c   2008-05-09 11:02:13.000000000 +1000
>> >  +++ ./drivers/md/bitmap.c       2008-05-09 16:00:07.000000000 +1000
>> >
>> > @@ -465,8 +465,6 @@ void bitmap_update_sb(struct bitmap *bit
>> >         spin_unlock_irqrestore(&bitmap->lock, flags);
>> >         sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
>> >         sb->events = cpu_to_le64(bitmap->mddev->events);
>> >  -       if (!bitmap->mddev->degraded)
>> >  -               sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
>>
>> Before, events_cleared was _not_ updated if the array was degraded.
>> Your patch doesn't appear to maintain that design.
>
> It does, but it is well hidden.
> Bits in the bitmap are only cleared when the array is not degraded.
> The new code for updating events_cleared is only triggered when a bit
> is about to be cleared.

Hi Neil,

Sorry about not getting back with you sooner.  Thanks for putting
significant time into chasing this problem.

I tested your most recent patch and unfortunately still hit the case
where the nbd member becomes degraded yet the array continues to clear
bits (events_cleared of the non-degraded member ends up higher than the
degraded member's).  Is this behavior somehow expected/correct?

This was the state of the array after the nbd0 member became degraded
and the array was stopped:

# mdadm -X /dev/nbd0 /dev/sdq
        Filename : /dev/nbd0
           Magic : 6d746962
         Version : 4
            UUID : 7140cc3c:8681416c:12c5668a:984ca55d
          Events : 2642
  Events Cleared : 2642
           State : OK
       Chunksize : 128 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 52428736 (50.00 GiB 53.69 GB)
          Bitmap : 409600 bits (chunks), 1 dirty (0.0%)

        Filename : /dev/sdq
           Magic : 6d746962
         Version : 4
            UUID : 7140cc3c:8681416c:12c5668a:984ca55d
          Events : 2646
  Events Cleared : 2645
           State : OK
       Chunksize : 128 KB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 52428736 (50.00 GiB 53.69 GB)
          Bitmap : 409600 bits (chunks), 1 dirty (0.0%)


At the time the nbd0 member became degraded, events_cleared was 2642.
What I'm failing to understand is how sdq's events_cleared could be
allowed to increment higher than 2642.

I've not yet taken steps to understand/verify your test script.  As
such I'm not sure it models my test scenario yet.

Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-19  4:33                   ` Mike Snitzer
@ 2008-05-19  5:27                     ` Neil Brown
  2008-05-20 15:30                       ` Mike Snitzer
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-19  5:27 UTC
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Monday May 19, snitzer@gmail.com wrote:
> 
> Hi Neil,
> 
> Sorry about not getting back with you sooner.  Thanks for putting
> significant time into chasing this problem.
> 
> I tested your most recent patch and unfortunately still hit the case
> where the nbd member becomes degraded yet the array continues to clear
> bits (events_cleared of the non-degraded member ends up higher than the
> degraded member's).  Is this behavior somehow expected/correct?

It shouldn't be..... ahhh.
There is a delay between noting that the bit can be cleared, and
actually writing the zero to disk.  This is obviously intentional
in case the bit gets set again quickly.
I'm sampling the event count at the latter point instead of the
former, and there is time for it to change.
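
In other words (a sketch of the window, in terms of the two functions
the patches in this thread touch):

	/* two-phase bit clearing, and where the sample goes wrong:
	 *
	 *   bitmap_endwrite()      chunk counter drops to 0; the bit
	 *                          is only *noted* as clearable
	 *     ... delay; a member can become faulty in here ...
	 *   bitmap_daemon_work()   actually writes the zeroed bit out;
	 *                          sampling mddev->events at this point
	 *                          can pick up events from after the
	 *                          failure
	 */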

Maybe this patch on top of what I recently sent out?

Thanks,
NeilBrown


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c         |   10 ++++++++--
 ./include/linux/raid/bitmap.h |    1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-05-19 15:23:42.000000000 +1000
+++ ./drivers/md/bitmap.c	2008-05-19 15:24:56.000000000 +1000
@@ -1092,9 +1092,9 @@ void bitmap_daemon_work(struct bitmap *b
 			/* We are possibly going to clear some bits, so make
 			 * sure that events_cleared is up-to-date.
 			 */
-			if (bitmap->events_cleared < bitmap->mddev->events) {
+			if (bitmap->need_sync) {
 				bitmap_super_t *sb;
-				bitmap->events_cleared = bitmap->mddev->events;
+				bitmap->need_sync = 0;
 				wait_event(bitmap->mddev->sb_wait,
 				    !test_bit(MD_CHANGE_CLEAN,
 					      &bitmap->mddev->flags));
@@ -1273,6 +1273,12 @@ void bitmap_endwrite(struct bitmap *bitm
 			return;
 		}
 
+		if (success &&
+		    bitmap->events_cleared < bitmap->mddev->events) {
+			bitmap->events_cleared = bitmap->mddev->events;
+			bitmap->need_sync = 1;
+		}
+
 		if (!success && ! (*bmc & NEEDED_MASK))
 			*bmc |= NEEDED_MASK;
 

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h	2008-05-19 15:23:50.000000000 +1000
+++ ./include/linux/raid/bitmap.h	2008-05-19 15:24:56.000000000 +1000
@@ -221,6 +221,7 @@ struct bitmap {
 	unsigned long syncchunk;
 
 	__u64	events_cleared;
+	int need_sync;
 
 	/* bitmap spinlock */
 	spinlock_t lock;


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-19  5:27                     ` Neil Brown
@ 2008-05-20 15:30                       ` Mike Snitzer
  2008-05-20 15:33                         ` Mike Snitzer
  2008-05-27  6:56                         ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Snitzer @ 2008-05-20 15:30 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Mon, May 19, 2008 at 1:27 AM, Neil Brown <neilb@suse.de> wrote:
> On Monday May 19, snitzer@gmail.com wrote:
>  >
>  > Hi Neil,
>  >
>  > Sorry about not getting back with you sooner.  Thanks for putting
>  > significant time into chasing this problem.
>  >
>  > I tested your most recent patch and unfortunately still hit the case
>  > where the nbd member becomes degraded yet the array continues to clear
>  > bits (events_cleared of the non-degraded member ends up higher than the
>  > degraded member's).  Is this behavior somehow expected/correct?
>
>  It shouldn't be..... ahhh.
>  There is a delay between noting that the bit can be cleared, and
>  actually writing the zero to disk.  This is obviously intentional
>  in case the bit gets set again quickly.
>  I'm sampling the event count at the latter point instead of the
>  former, and there is time for it to change.
>
>  Maybe this patch on top of what I recently sent out?

Hi Neil,

We're much closer.  events_cleared is now symmetric on both the failed
and active members of the raid1.  But there have been some instances
where the md thread hits a deadlock during my testing.  What follows
is the backtrace and live crash info:

md0_raid1     D 000002c4b6483a7f     0 11249      2 (L-TLB)
 ffff81005747dce0 0000000000000046 0000000000000000 ffff8100454c53c0
 000000000000000a ffff810048fbd0c0 000000000000000a ffff810048fbd0c0
 ffff81007f853840 000000000000148e ffff810048fbd2b0 0000000362c10780
Call Trace:
 [<ffffffff88ba8503>] :md_mod:bitmap_daemon_work+0x249/0x4d3
 [<ffffffff802457a5>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88ba53b3>] :md_mod:md_check_recovery+0x20/0x4a5
 [<ffffffff8044cb5c>] thread_return+0x0/0xf1
 [<ffffffff88bbe0eb>] :raid1:raid1d+0x25/0xd09
 [<ffffffff8023bcd7>] lock_timer_base+0x26/0x4b
 [<ffffffff8023bd4d>] try_to_del_timer_sync+0x51/0x5a
 [<ffffffff8023bd62>] del_timer_sync+0xc/0x16
 [<ffffffff8044d38a>] schedule_timeout+0x92/0xad
 [<ffffffff88ba6c6c>] :md_mod:md_thread+0xeb/0x101
 [<ffffffff802457a5>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88ba6b81>] :md_mod:md_thread+0x0/0x101
 [<ffffffff8024564d>] kthread+0x47/0x76
 [<ffffffff8020aa38>] child_rip+0xa/0x12
 [<ffffffff80245606>] kthread+0x0/0x76
 [<ffffffff8020aa2e>] child_rip+0x0/0x12

crash> bt 11249
PID: 11249  TASK: ffff810048fbd0c0  CPU: 3   COMMAND: "md0_raid1"
 #0 [ffff81005747dbf0] schedule at ffffffff8044cb5c
 #1 [ffff81005747dce8] bitmap_daemon_work at ffffffff88ba8503
 #2 [ffff81005747dd68] md_check_recovery at ffffffff88ba53b3
 #3 [ffff81005747ddb8] raid1d at ffffffff88bbe0eb
 #4 [ffff81005747ded8] md_thread at ffffffff88ba6c6c
 #5 [ffff81005747df28] kthread at ffffffff8024564d
 #6 [ffff81005747df48] kernel_thread at ffffffff8020aa38

0xffffffff88ba84ee <bitmap_daemon_work+0x234>:  callq  0xffffffff802458ec <prepare_to_wait>
0xffffffff88ba84f3 <bitmap_daemon_work+0x239>:  mov    0x18(%rbx),%rax
0xffffffff88ba84f7 <bitmap_daemon_work+0x23d>:  mov    0x28(%rax),%eax
0xffffffff88ba84fa <bitmap_daemon_work+0x240>:  test   $0x2,%al
0xffffffff88ba84fc <bitmap_daemon_work+0x242>:  je     0xffffffff88ba8505 <bitmap_daemon_work+0x24b>
0xffffffff88ba84fe <bitmap_daemon_work+0x244>:  callq  0xffffffff8044c200 <__sched_text_start>
0xffffffff88ba8503 <bitmap_daemon_work+0x249>:  jmp    0xffffffff88ba84d6 <bitmap_daemon_work+0x21c>
0xffffffff88ba8505 <bitmap_daemon_work+0x24b>:  mov    0x18(%rbx),%rdi
0xffffffff88ba8509 <bitmap_daemon_work+0x24f>:  mov    %rbp,%rsi
0xffffffff88ba850c <bitmap_daemon_work+0x252>:  add    $0x200,%rdi
0xffffffff88ba8513 <bitmap_daemon_work+0x259>:  callq  0xffffffff802457f6 <finish_wait>

So running with your latest patches seems to introduce a race in
bitmap_daemon_work's if (unlikely((*bmc & COUNTER_MAX) ==
COUNTER_MAX)) { } block.

Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-20 15:30                       ` Mike Snitzer
@ 2008-05-20 15:33                         ` Mike Snitzer
  2008-05-27  6:56                         ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2008-05-20 15:33 UTC
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Tue, May 20, 2008 at 11:30 AM, Mike Snitzer <snitzer@gmail.com> wrote:
...
>  So running with your latest patches seems to introduce a race in
>  bitmap_daemon_work's if (unlikely((*bmc & COUNTER_MAX) ==
>  COUNTER_MAX)) { } block.

Err, that block is in bitmap_startwrite()...

Mike


* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-20 15:30                       ` Mike Snitzer
  2008-05-20 15:33                         ` Mike Snitzer
@ 2008-05-27  6:56                         ` Neil Brown
  2008-05-27 14:33                           ` Mike Snitzer
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2008-05-27  6:56 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-raid, linux-kernel, paul.clements

On Tuesday May 20, snitzer@gmail.com wrote:
> 
> Hi Neil,
> 
> We're much closer.  The events_cleared value is now symmetric across
> both the failed and active members of the raid1.  But there have been
> some instances where the md thread hits a deadlock during my testing.
> What follows is the backtrace and live crash info:
...
> 
> So running with your latest patches seems to introduce a race in
> bitmap_daemon_work's if (unlikely((*bmc & COUNTER_MAX) ==
> COUNTER_MAX)) { } block.

As you note, that block is in the wrong place.
It is actually locking up in 
				wait_event(bitmap->mddev->sb_wait,
				    !test_bit(MD_CHANGE_CLEAN,
					      &bitmap->mddev->flags));

which the patch adds.  However, with my last update that wait_event
isn't needed any more.  I was using it to ensure mddev->events matched
what was on disk.  But we now read mddev->events much earlier and it
will definitely be on disk by this time.
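
That fits the backtrace: bitmap_daemon_work() is running in the
array's own md thread (raid1d -> md_check_recovery), which is the same
thread that is responsible for getting the superblock written and
MD_CHANGE_CLEAN cleared, so sleeping there waits for work that only
the sleeping thread would do.  A minimal userspace sketch of that
deadlock pattern (illustration only; every name below is invented and
none of it is md code):

#include <pthread.h>

/* A thread that sleeps until a flag clears, when it is itself the
 * only thread that would ever clear that flag: the wait can never
 * finish. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int change_pending = 1;	/* stand-in for MD_CHANGE_CLEAN */

static void *worker(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&lock);
	while (change_pending)	/* nobody else will clear it ... */
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	/* ... because clearing it (the "superblock write") was also
	 * this thread's job, and we never get here to do it. */
	return NULL;
}

int main(void)
{
	pthread_t t;
	pthread_create(&t, NULL, worker, NULL);
	pthread_join(&t, NULL);	/* hangs forever */
	return 0;
}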

So: this combined patch should do it.

Thanks for all your testing.

NeilBrown


---------------------------
Improve setting of "events_cleared" for write-intent bitmaps.

When an array is degraded, bits in the write-intent bitmap are not
cleared, so that if the missing device is re-added, it can be synced
by updating only those parts of the device that have changed since
it was removed.

To enable this, an 'events_cleared' value is stored.  It is the event
counter for the array the last time that any bits were cleared.

Sometimes - if a device disappears from an array while it is 'clean' -
the events_cleared value gets updated incorrectly (there are subtle
ordering issues between updating events in the main metadata and the
bitmap metadata), resulting in the missing device appearing to require
a full resync when it is re-added.

With this patch, we update events_cleared precisely when we are about
to clear a bit in the bitmap.  We record events_cleared when we clear
the bit internally, and copy that to the superblock, which is written
out before the bit is cleared on storage.  This makes it more
"obviously correct".

We also need to update events_cleared when the event_count is going
backwards (as happens on a dirty->clean transition of a non-degraded
array).

Thanks to Mike Snitzer for identifying this problem and testing early
"fixes".


Cc:  "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c         |   29 ++++++++++++++++++++++++-----
 ./include/linux/raid/bitmap.h |    1 +
 2 files changed, 25 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-05-27 16:50:04.000000000 +1000
+++ ./drivers/md/bitmap.c	2008-05-27 16:50:53.000000000 +1000
@@ -454,8 +454,11 @@ void bitmap_update_sb(struct bitmap *bit
 	spin_unlock_irqrestore(&bitmap->lock, flags);
 	sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
 	sb->events = cpu_to_le64(bitmap->mddev->events);
-	if (!bitmap->mddev->degraded)
-		sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+	if (bitmap->mddev->events < bitmap->events_cleared) {
+		/* rocking back to read-only */
+		bitmap->events_cleared = bitmap->mddev->events;
+		sb->events_cleared = cpu_to_le64(bitmap->events_cleared);
+	}
 	kunmap_atomic(sb, KM_USER0);
 	write_page(bitmap, bitmap->sb_page, 1);
 }
@@ -1085,9 +1088,19 @@ void bitmap_daemon_work(struct bitmap *b
 			} else
 				spin_unlock_irqrestore(&bitmap->lock, flags);
 			lastpage = page;
-/*
-			printk("bitmap clean at page %lu\n", j);
-*/
+
+			/* We are possibly going to clear some bits, so make
+			 * sure that events_cleared is up-to-date.
+			 */
+			if (bitmap->need_sync) {
+				bitmap_super_t *sb;
+				bitmap->need_sync = 0;
+				sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+				sb->events_cleared =
+					cpu_to_le64(bitmap->events_cleared);
+				kunmap_atomic(sb, KM_USER0);
+				write_page(bitmap, bitmap->sb_page, 1);
+			}
 			spin_lock_irqsave(&bitmap->lock, flags);
 			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}
@@ -1257,6 +1270,12 @@ void bitmap_endwrite(struct bitmap *bitm
 			return;
 		}
 
+		if (success &&
+		    bitmap->events_cleared < bitmap->mddev->events) {
+			bitmap->events_cleared = bitmap->mddev->events;
+			bitmap->need_sync = 1;
+		}
+
 		if (!success && ! (*bmc & NEEDED_MASK))
 			*bmc |= NEEDED_MASK;
 

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h	2008-05-26 09:46:04.000000000 +1000
+++ ./include/linux/raid/bitmap.h	2008-05-27 16:50:19.000000000 +1000
@@ -221,6 +221,7 @@ struct bitmap {
 	unsigned long syncchunk;
 
 	__u64	events_cleared;
+	int need_sync;
 
 	/* bitmap spinlock */
 	spinlock_t lock;
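
To distill the handshake the patch implements, here is a
self-contained userspace sketch (the bitmap fields mirror the patch;
the mddev stand-in, the helper names and main() are invented for
illustration):

#include <stdint.h>
#include <stdio.h>

/* Minimal stand-ins for the md structures (illustration only). */
struct mddev  { uint64_t events; };
struct bitmap {
	struct mddev *mddev;
	uint64_t events_cleared;   /* last event at which bits were cleared */
	int need_sync;             /* sb copy of events_cleared is stale */
	uint64_t sb_events_cleared;/* models the on-disk superblock field */
};

/* bitmap_endwrite(): a write succeeded, so the bits covering it may
 * soon be cleared; let events_cleared catch up to the current event
 * count and note that the superblock copy is now stale. */
static void endwrite(struct bitmap *b, int success)
{
	if (success && b->events_cleared < b->mddev->events) {
		b->events_cleared = b->mddev->events;
		b->need_sync = 1;
	}
}

/* bitmap_daemon_work(): about to clear bits, so flush events_cleared
 * to the superblock first; on disk the superblock update always
 * precedes the cleared bit. */
static void daemon_work(struct bitmap *b)
{
	if (b->need_sync) {
		b->need_sync = 0;
		b->sb_events_cleared = b->events_cleared; /* write_page() */
	}
	/* ... now it is safe to clear and write out the bitmap pages. */
}

int main(void)
{
	struct mddev md = { .events = 42 };
	struct bitmap b = { .mddev = &md };

	endwrite(&b, 1);   /* a write completes at event count 42 */
	daemon_work(&b);   /* superblock is updated before bits clear */
	printf("on-disk events_cleared: %llu\n",
	       (unsigned long long)b.sb_events_cleared);
	return 0;
}

The point of need_sync is ordering: the superblock's events_cleared
must reach storage before any bit is actually cleared, so a re-added
device that merely missed a clean->dirty event bump still compares
>= events_cleared and can resync from the bitmap instead of doing a
full resync.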

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition
  2008-05-27  6:56                         ` Neil Brown
@ 2008-05-27 14:33                           ` Mike Snitzer
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Snitzer @ 2008-05-27 14:33 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-kernel, paul.clements

On Tue, May 27, 2008 at 2:56 AM, Neil Brown <neilb@suse.de> wrote:
> On Tuesday May 20, snitzer@gmail.com wrote:
>>
>> Hi Neil,
>>
>> We're much closer.  The events_cleared value is now symmetric across
>> both the failed and active members of the raid1.  But there have been
>> some instances where the md thread hits a deadlock during my testing.
>> What follows is the backtrace and live crash info:
> ...
>>
>> So running with your latest patches seems to introduce a race in
>> bitmap_daemon_work's if (unlikely((*bmc & COUNTER_MAX) ==
>> COUNTER_MAX)) { } block.
>
> As you note, that block is in the wrong place.
> It is actually locking up in
>                                wait_event(bitmap->mddev->sb_wait,
>                                    !test_bit(MD_CHANGE_CLEAN,
>                                              &bitmap->mddev->flags));
>
> which the patch adds.  However, with my last update that wait_event
> isn't needed any more.  I was using it to ensure mddev->events matched
> what was on disk.  But we now read mddev->events much earlier and it
> will definitely be on disk by this time.
>
> So: this combined patch should do it.
>
> Thanks for all your testing.
>
> NeilBrown
>
>
> ---------------------------
> Improve setting of "events_cleared" for write-intent bitmaps.
>
> When an array is degraded, bits in the write-intent bitmap are not
> cleared, so that if the missing device is re-added, it can be synced
> by updating only those parts of the device that have changed since
> it was removed.
>
> To enable this, an 'events_cleared' value is stored.  It is the event
> counter for the array the last time that any bits were cleared.
>
> Sometimes - if a device disappears from an array while it is 'clean' -
> the events_cleared value gets updated incorrectly (there are subtle
> ordering issues between updating events in the main metadata and the
> bitmap metadata), resulting in the missing device appearing to require
> a full resync when it is re-added.
>
> With this patch, we update events_cleared precisely when we are about
> to clear a bit in the bitmap.  We record events_cleared when we clear
> the bit internally, and copy that to the superblock, which is written
> out before the bit is cleared on storage.  This makes it more
> "obviously correct".
>
> We also need to update events_cleared when the event_count is going
> backwards (as happens on a dirty->clean transition of a non-degraded
> array).
>
> Thanks to Mike Snitzer for identifying this problem and testing early
> "fixes".
>
>
> Cc:  "Mike Snitzer" <snitzer@gmail.com>
> Signed-off-by: Neil Brown <neilb@suse.de>

Neil,

Works great now.  Thanks.

Tested-by: Mike Snitzer <snitzer@gmail.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2008-05-27 14:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2008-04-02 22:09 [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition Mike Snitzer
2008-05-06  6:53 ` Neil Brown
2008-05-06 11:58   ` Mike Snitzer
2008-05-08  6:13     ` Neil Brown
2008-05-08 20:11       ` Mike Snitzer
2008-05-09  1:40         ` Neil Brown
2008-05-09  4:42           ` Mike Snitzer
2008-05-09  5:08             ` Mike Snitzer
2008-05-09  5:26               ` Mike Snitzer
2008-05-09  6:01             ` Neil Brown
2008-05-09 15:00               ` Mike Snitzer
2008-05-16 11:54                 ` Neil Brown
2008-05-19  4:33                   ` Mike Snitzer
2008-05-19  5:27                     ` Neil Brown
2008-05-20 15:30                       ` Mike Snitzer
2008-05-20 15:33                         ` Mike Snitzer
2008-05-27  6:56                         ` Neil Brown
2008-05-27 14:33                           ` Mike Snitzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).