* [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc
@ 2008-05-19 1:10 NeilBrown
2008-05-19 1:10 ` [PATCH 001 of 10] md: Fix possible oops when removing a bitmap from an active array NeilBrown
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-raid, linux-kernel, Adrian Bunk, Bernd Schubert,
Bernd Schubert, Christoph Hellwig, Dan Williams, Eivind Sarto,
Fairbanks, David, Mike Snitzer
Following is a collection of 10 patches for md/raid that are suitable
for 2.6.26-rc. They are ordered roughly from simple to more complex,
with serious bugfixes possibly elevated in the sort order.
Thanks,
NeilBrown
[PATCH 000 of 10] md: Introduction
[PATCH 001 of 10] md: Fix possible oops when removing a bitmap from an active array
[PATCH 002 of 10] md: proper extern for mdp_major
[PATCH 003 of 10] md: kill file_path wrapper
[PATCH 004 of 10] md: raid5 rate limit error printk
[PATCH 005 of 10] md: raid1: Fix restoration of bio between failed read and write.
[PATCH 006 of 10] md: Notify userspace on 'write-pending' changes to array_state
[PATCH 007 of 10] md: notify userspace on 'stop' events
[PATCH 008 of 10] md: Improve setting of "events_cleared" for write-intent bitmaps.
[PATCH 009 of 10] md: Allow parallel resync of md-devices.
[PATCH 010 of 10] md: Restart recovery cleanly after device failure.
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 001 of 10] md: Fix possible oops when removing a bitmap from an active array
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 002 of 10] md: proper extern for mdp_major NeilBrown
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
It is possible to add a write-intent bitmap to an active array, or remove
the bitmap that is there.
When we do either of these, we 'quiesce' the array, which causes
make_request to block in "wait_barrier()".
However we are sampling the value of "mddev->bitmap" before the
wait_barrier call, and using it afterwards. This can result in
using a bitmap structure that has been freed.
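The ordering bug can be modelled in plain user-space C; everything below is an illustrative stand-in (none of these are the kernel's types or functions), sketched under the assumption that the barrier wait is the only point where the bitmap can be swapped out:

```c
#include <stddef.h>

struct mddev_model {
    void *bitmap;               /* may be freed/replaced while we block */
};

typedef void (*barrier_fn)(struct mddev_model *);

/* buggy ordering: the bitmap pointer is sampled before the barrier,
 * so it can point at a structure freed during the wait */
void *sample_before_barrier(struct mddev_model *m, barrier_fn wait_barrier)
{
    void *bitmap = m->bitmap;   /* stale if the wait replaces it */
    wait_barrier(m);
    return bitmap;
}

/* fixed ordering, as in the patch: sample only after wait_barrier() */
void *sample_after_barrier(struct mddev_model *m, barrier_fn wait_barrier)
{
    wait_barrier(m);
    return m->bitmap;           /* always the current bitmap */
}

/* stand-in for a concurrent bitmap removal completing during the wait */
void removal_during_wait(struct mddev_model *m)
{
    m->bitmap = NULL;
}
```

With a barrier callback that models the removal, the "before" variant hands back the freed pointer while the "after" variant sees the removal.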
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid1.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-05-19 11:02:04.000000000 +1000
+++ ./drivers/md/raid1.c 2008-05-19 11:02:15.000000000 +1000
@@ -773,7 +773,7 @@ static int make_request(struct request_q
r1bio_t *r1_bio;
struct bio *read_bio;
int i, targets = 0, disks;
- struct bitmap *bitmap = mddev->bitmap;
+ struct bitmap *bitmap;
unsigned long flags;
struct bio_list bl;
struct page **behind_pages = NULL;
@@ -802,6 +802,8 @@ static int make_request(struct request_q
wait_barrier(conf);
+ bitmap = mddev->bitmap;
+
disk_stat_inc(mddev->gendisk, ios[rw]);
disk_stat_add(mddev->gendisk, sectors[rw], bio_sectors(bio));
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 002 of 10] md: proper extern for mdp_major
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
2008-05-19 1:10 ` [PATCH 001 of 10] md: Fix possible oops when removing a bitmap from an active array NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 003 of 10] md: kill file_path wrapper NeilBrown
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Adrian Bunk
From: Adrian Bunk <bunk@kernel.org>
This patch adds a proper extern for mdp_major in include/linux/raid/md.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/raid/md.h | 2 ++
./init/do_mounts_md.c | 1 -
2 files changed, 2 insertions(+), 1 deletion(-)
diff .prev/include/linux/raid/md.h ./include/linux/raid/md.h
--- .prev/include/linux/raid/md.h 2008-05-19 11:02:06.000000000 +1000
+++ ./include/linux/raid/md.h 2008-05-19 11:02:24.000000000 +1000
@@ -72,6 +72,8 @@
*/
#define MD_PATCHLEVEL_VERSION 3
+extern int mdp_major;
+
extern int register_md_personality (struct mdk_personality *p);
extern int unregister_md_personality (struct mdk_personality *p);
extern mdk_thread_t * md_register_thread (void (*run) (mddev_t *mddev),
diff .prev/init/do_mounts_md.c ./init/do_mounts_md.c
--- .prev/init/do_mounts_md.c 2008-05-19 11:02:06.000000000 +1000
+++ ./init/do_mounts_md.c 2008-05-19 11:02:24.000000000 +1000
@@ -24,7 +24,6 @@ static struct {
static int md_setup_ents __initdata;
-extern int mdp_major;
/*
* Parse the command-line parameters given our kernel, but do not
* actually try to invoke the MD device now; that is handled by
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 003 of 10] md: kill file_path wrapper
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
2008-05-19 1:10 ` [PATCH 001 of 10] md: Fix possible oops when removing a bitmap from an active array NeilBrown
2008-05-19 1:10 ` [PATCH 002 of 10] md: proper extern for mdp_major NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 004 of 10] md: raid5 rate limit error printk NeilBrown
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Christoph Hellwig
From: Christoph Hellwig <hch@lst.de>
Kill the trivial and rather pointless file_path wrapper around d_path.
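With the wrapper gone, callers handle d_path()'s error convention directly. A rough user-space sketch of that convention (the constant and names are illustrative, not the kernel's headers):

```c
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095UL

/* User-space model of the ERR_PTR/IS_ERR convention that d_path()
 * uses: errors travel as pointer values in the top MODEL_MAX_ERRNO
 * bytes of the address space, so callers must test IS_ERR() rather
 * than comparing against NULL. */
void *model_err_ptr(long error)
{
    return (void *)(uintptr_t)error;
}

int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```

This is why the patch changes the `!ptr` checks to `IS_ERR(ptr)`: a failed d_path() returns an encoded errno pointer, never NULL.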
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/bitmap.c | 17 ++++-------------
./drivers/md/md.c | 4 ++--
./include/linux/raid/bitmap.h | 1 -
3 files changed, 6 insertions(+), 16 deletions(-)
diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c 2008-05-19 11:02:05.000000000 +1000
+++ ./drivers/md/bitmap.c 2008-05-19 11:02:35.000000000 +1000
@@ -203,17 +203,6 @@ static void bitmap_checkfree(struct bitm
* bitmap file handling - read and write the bitmap file and its superblock
*/
-/* copy the pathname of a file to a buffer */
-char *file_path(struct file *file, char *buf, int count)
-{
- if (!buf)
- return NULL;
-
- buf = d_path(&file->f_path, buf, count);
-
- return IS_ERR(buf) ? NULL : buf;
-}
-
/*
* basic page I/O operations
*/
@@ -721,11 +710,13 @@ static void bitmap_file_kick(struct bitm
if (bitmap->file) {
path = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (path)
- ptr = file_path(bitmap->file, path, PAGE_SIZE);
+ ptr = d_path(&bitmap->file->f_path, path,
+ PAGE_SIZE);
+
printk(KERN_ALERT
"%s: kicking failed bitmap file %s from array!\n",
- bmname(bitmap), ptr ? ptr : "");
+ bmname(bitmap), IS_ERR(ptr) ? "" : ptr);
kfree(path);
} else
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-19 11:02:08.000000000 +1000
+++ ./drivers/md/md.c 2008-05-19 11:02:35.000000000 +1000
@@ -3987,8 +3987,8 @@ static int get_bitmap_file(mddev_t * mdd
if (!buf)
goto out;
- ptr = file_path(mddev->bitmap->file, buf, sizeof(file->pathname));
- if (!ptr)
+ ptr = d_path(&mddev->bitmap->file->f_path, buf, sizeof(file->pathname));
+ if (IS_ERR(ptr))
goto out;
strcpy(file->pathname, ptr);
diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h 2008-05-19 11:02:05.000000000 +1000
+++ ./include/linux/raid/bitmap.h 2008-05-19 11:02:35.000000000 +1000
@@ -262,7 +262,6 @@ int bitmap_create(mddev_t *mddev);
void bitmap_flush(mddev_t *mddev);
void bitmap_destroy(mddev_t *mddev);
-char *file_path(struct file *file, char *buf, int count);
void bitmap_print_sb(struct bitmap *bitmap);
void bitmap_update_sb(struct bitmap *bitmap);
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 004 of 10] md: raid5 rate limit error printk
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (2 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 003 of 10] md: kill file_path wrapper NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 005 of 10] md: raid1: Fix restoration of bio between failed read and write NeilBrown
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Bernd Schubert, Dan Williams
From: Bernd Schubert <bernd-schubert@gmx.de>
Last night we had SCSI problems, and a hardware raid
unit was offlined during heavy I/O. While this happened we got, for
about 3 minutes, a huge number of messages like these:
Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).
I guess the high error rate is responsible for other events not being
scheduled - during this time the system was not pingable, and in the
end other devices also ran into SCSI command timeouts, causing
problems on these unrelated devices as well.
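The guard the patch introduces can be sketched in user space as follows; the fixed-budget limiter is purely illustrative (the kernel's printk_ratelimit() is time-based and refills), but the short-circuit shape is the same:

```c
#include <stdio.h>

/* Sketch of the printk_rl() pattern: the message is only formatted
 * when the limiter still has budget, so a flood of identical errors
 * cannot monopolise the console. */

static int rl_budget = 3;   /* illustrative fixed budget */

int model_ratelimit(void)
{
    if (rl_budget <= 0)
        return 0;           /* suppressed */
    rl_budget--;
    return 1;
}

/* same shape as the patch's:
 *   #define printk_rl(args...) ((void)(printk_ratelimit() && printk(args)))
 */
#define log_rl(args...) ((void)(model_ratelimit() && printf(args)))
```

The `(void)(cond && printf(...))` idiom keeps printk_rl usable anywhere a statement-like macro is expected, while evaluating the format arguments only when the message will actually be emitted.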
Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 34 ++++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-05-19 11:02:09.000000000 +1000
+++ ./drivers/md/raid5.c 2008-05-19 11:02:44.000000000 +1000
@@ -94,6 +94,8 @@
#define __inline__
#endif
+#define printk_rl(args...) ((void) (printk_ratelimit() && printk(args)))
+
#if !RAID6_USE_EMPTY_ZERO_PAGE
/* In .bss so it's zeroed */
const char raid6_empty_zero_page[PAGE_SIZE] __attribute__((aligned(256)));
@@ -1143,10 +1145,12 @@ static void raid5_end_read_request(struc
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ printk_rl(KERN_INFO "raid5:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1160,16 +1164,22 @@ static void raid5_end_read_request(struc
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
if (conf->mddev->degraded)
- printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdn);
+ printk_rl(KERN_WARNING
+ "raid5:%s: read error not correctable "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
/* Oh, no!!! */
- printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector + rdev->data_offset),
- bdn);
+ printk_rl(KERN_WARNING
+ "raid5:%s: read error NOT corrected!! "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (atomic_read(&rdev->read_errors)
> conf->max_nr_stripes)
printk(KERN_WARNING
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 005 of 10] md: raid1: Fix restoration of bio between failed read and write.
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (3 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 004 of 10] md: raid5 rate limit error printk NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 006 of 10] md: Notify userspace on 'write-pending' changes to array_state NeilBrown
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Fairbanks, David
When performing a "recovery" or "check" pass on a RAID1 array,
we read from each device and possibly, if there is a difference or a
read error, write back to some devices.
We use the same 'bio' for both read and write, resetting
various fields between the two operations.
We forgot to reset bv_offset and bv_len however.
These are often left unchanged, but in the case where there is an
IO error one or two sectors into a page, they are changed.
This results in correctable errors not being corrected properly.
It does not result in any data corruption.
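The reset loop the patch adds can be modelled in isolation; the struct and names below are illustrative stand-ins for struct bio_vec, under the assumption (as in sync_request_write) that vcnt covers exactly the total size:

```c
#define MODEL_PAGE_SIZE 4096

struct bvec_model {             /* illustrative model of struct bio_vec */
    unsigned int bv_offset;
    unsigned int bv_len;
};

/* Reset every vector entry the way the fixed loop does: offset back
 * to zero, and bv_len a full page for each entry except a possibly
 * short final page. */
void model_fixup_bvecs(struct bvec_model *vec, int vcnt, int total)
{
    int j, size = total;

    for (j = 0; j < vcnt; j++) {
        vec[j].bv_offset = 0;
        if (size > MODEL_PAGE_SIZE)
            vec[j].bv_len = MODEL_PAGE_SIZE;
        else
            vec[j].bv_len = size;
        size -= MODEL_PAGE_SIZE;
    }
}
```

Without this reset, bv_offset/bv_len left over from a partially-failed read describe only the tail of a page, so the rewrite would not cover the sectors that actually need correcting.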
Cc: "Fairbanks, David" <David.Fairbanks@stratus.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid1.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-05-19 11:03:05.000000000 +1000
+++ ./drivers/md/raid1.c 2008-05-19 11:02:55.000000000 +1000
@@ -1284,6 +1284,7 @@ static void sync_request_write(mddev_t *
rdev_dec_pending(conf->mirrors[i].rdev, mddev);
} else {
/* fixup the bio for reuse */
+ int size;
sbio->bi_vcnt = vcnt;
sbio->bi_size = r1_bio->sectors << 9;
sbio->bi_idx = 0;
@@ -1297,10 +1298,20 @@ static void sync_request_write(mddev_t *
sbio->bi_sector = r1_bio->sector +
conf->mirrors[i].rdev->data_offset;
sbio->bi_bdev = conf->mirrors[i].rdev->bdev;
- for (j = 0; j < vcnt ; j++)
- memcpy(page_address(sbio->bi_io_vec[j].bv_page),
+ size = sbio->bi_size;
+ for (j = 0; j < vcnt ; j++) {
+ struct bio_vec *bi;
+ bi = &sbio->bi_io_vec[j];
+ bi->bv_offset = 0;
+ if (size > PAGE_SIZE)
+ bi->bv_len = PAGE_SIZE;
+ else
+ bi->bv_len = size;
+ size -= PAGE_SIZE;
+ memcpy(page_address(bi->bv_page),
page_address(pbio->bi_io_vec[j].bv_page),
PAGE_SIZE);
+ }
}
}
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 006 of 10] md: Notify userspace on 'write-pending' changes to array_state
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (4 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 005 of 10] md: raid1: Fix restoration of bio between failed read and write NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 007 of 10] md: notify userspace on 'stop' events NeilBrown
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
When an array enters 'write-pending', 'array_state' changes, so we
must be sure to sysfs_notify.
Also, when waiting for user-space to acknowledge 'write-pending' by
marking the metadata as dirty, we don't want to wait for
MD_CHANGE_DEVS to be cleared, as that might not happen. So explicitly
test for the bits that we are really interested in.
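The change to the wait condition can be sketched as below (bit numbers and names are illustrative, not the kernel's): with `flags == 0`, a bit that legitimately stays set makes the waiter hang, while testing only the bits of interest does not.

```c
/* illustrative flag bits standing in for the MD_CHANGE_* bits */
enum { MODEL_CHANGE_DEVS, MODEL_CHANGE_CLEAN, MODEL_CHANGE_PENDING };

int model_test_bit(int nr, unsigned long flags)
{
    return (int)((flags >> nr) & 1UL);
}

/* old condition: wait until every flag bit is clear */
int old_wait_condition(unsigned long flags)
{
    return flags == 0;
}

/* new condition: wait only on the bits this path actually depends on */
int new_wait_condition(unsigned long flags)
{
    return !model_test_bit(MODEL_CHANGE_CLEAN, flags) &&
           !model_test_bit(MODEL_CHANGE_PENDING, flags);
}
```

If MODEL_CHANGE_DEVS remains set (as MD_CHANGE_DEVS can), the old condition never becomes true; the new one proceeds as soon as the clean/pending bits drop.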
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-19 11:02:35.000000000 +1000
+++ ./drivers/md/md.c 2008-05-19 11:03:43.000000000 +1000
@@ -5435,8 +5435,11 @@ void md_write_start(mddev_t *mddev, stru
md_wakeup_thread(mddev->thread);
}
spin_unlock_irq(&mddev->write_lock);
+ sysfs_notify(&mddev->kobj, NULL, "array_state");
}
- wait_event(mddev->sb_wait, mddev->flags==0);
+ wait_event(mddev->sb_wait,
+ !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
+ !test_bit(MD_CHANGE_PENDING, &mddev->flags));
}
void md_write_end(mddev_t *mddev)
@@ -5471,6 +5474,12 @@ void md_allow_write(mddev_t *mddev)
mddev->safemode = 1;
spin_unlock_irq(&mddev->write_lock);
md_update_sb(mddev, 0);
+
+ sysfs_notify(&mddev->kobj, NULL, "array_state");
+ /* wait for the dirty state to be recorded in the metadata */
+ wait_event(mddev->sb_wait,
+ !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
+ !test_bit(MD_CHANGE_PENDING, &mddev->flags));
} else
spin_unlock_irq(&mddev->write_lock);
}
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 007 of 10] md: notify userspace on 'stop' events
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (5 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 006 of 10] md: Notify userspace on 'write-pending' changes to array_state NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:10 ` [PATCH 008 of 10] md: Improve setting of "events_cleared" for write-intent bitmaps NeilBrown
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams
From: Dan Williams <dan.j.williams@intel.com>
This additional notification to 'array_state' is needed to allow the monitor
application to learn about stop events via sysfs. The
sysfs_notify("sync_action") call that comes at the end of do_md_stop() (via
md_new_event) is insufficient since the 'sync_action' attribute has been
removed by this point.
(Seems like a sysfs-notify-on-removal patch is a better fix. Currently removal
updates the event count but does not wake up waiters)
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 ++
1 file changed, 2 insertions(+)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-19 11:03:43.000000000 +1000
+++ ./drivers/md/md.c 2008-05-19 11:03:47.000000000 +1000
@@ -3691,6 +3691,8 @@ static int do_md_stop(mddev_t * mddev, i
module_put(mddev->pers->owner);
mddev->pers = NULL;
+ /* tell userspace to handle 'inactive' */
+ sysfs_notify(&mddev->kobj, NULL, "array_state");
set_capacity(disk, 0);
mddev->changed = 1;
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 008 of 10] md: Improve setting of "events_cleared" for write-intent bitmaps.
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (6 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 007 of 10] md: notify userspace on 'stop' events NeilBrown
@ 2008-05-19 1:10 ` NeilBrown
2008-05-19 1:11 ` [PATCH 009 of 10] md: Allow parallel resync of md-devices NeilBrown
2008-05-19 1:11 ` [PATCH 010 of 10] md: Restart recovery cleanly after device failure NeilBrown
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Mike Snitzer
When an array is degraded, bits in the write-intent bitmap are not
cleared, so that if the missing device is re-added, it can be synced
by only updating those parts of the device that have changed since
it was removed.
To enable this, an 'events_cleared' value is stored. It is the event
counter for the array the last time that any bits were cleared.
Sometimes - if a device disappears from an array while it is 'clean' -
the events_cleared value gets updated incorrectly (there are subtle
ordering issues between updating events in the main metadata and the
bitmap metadata), resulting in the missing device appearing to require
a full resync when it is re-added.
With this patch, we update events_cleared precisely when we are about
to clear a bit in the bitmap. This makes it more "obviously correct".
We also need to update events_cleared when the event_count is going
backwards (as happens on a dirty->clean transition of a non-degraded
array).
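The two update rules can be modelled as a pair of small functions; the struct and field names below are illustrative stand-ins for the bitmap superblock state:

```c
typedef unsigned long long model_u64;

struct bitmap_model {              /* field names are illustrative */
    model_u64 events;              /* array's current event counter */
    model_u64 events_cleared;      /* event count at last bit-clear */
};

/* run just before any bit is cleared, as the patch does in
 * bitmap_daemon_work(): events_cleared catches up to events */
void model_before_clearing_bits(struct bitmap_model *b)
{
    if (b->events_cleared < b->events)
        b->events_cleared = b->events;
}

/* run when updating the bitmap superblock: handle the event counter
 * rocking backwards on a dirty->clean transition */
void model_update_sb(struct bitmap_model *b)
{
    if (b->events < b->events_cleared)
        b->events_cleared = b->events;
}
```

Together these keep the invariant that events_cleared never claims bits were cleared at an event count the array has not actually reached.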
Thanks to Mike Snitzer for identifying this problem and testing early
"fixes".
Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/bitmap.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c 2008-05-19 11:02:35.000000000 +1000
+++ ./drivers/md/bitmap.c 2008-05-19 11:04:00.000000000 +1000
@@ -454,8 +454,11 @@ void bitmap_update_sb(struct bitmap *bit
spin_unlock_irqrestore(&bitmap->lock, flags);
sb = (bitmap_super_t *)kmap_atomic(bitmap->sb_page, KM_USER0);
sb->events = cpu_to_le64(bitmap->mddev->events);
- if (!bitmap->mddev->degraded)
- sb->events_cleared = cpu_to_le64(bitmap->mddev->events);
+ if (bitmap->mddev->events < bitmap->events_cleared) {
+ /* rocking back to read-only */
+ bitmap->events_cleared = bitmap->mddev->events;
+ sb->events_cleared = cpu_to_le64(bitmap->events_cleared);
+ }
kunmap_atomic(sb, KM_USER0);
write_page(bitmap, bitmap->sb_page, 1);
}
@@ -1085,9 +1088,22 @@ void bitmap_daemon_work(struct bitmap *b
} else
spin_unlock_irqrestore(&bitmap->lock, flags);
lastpage = page;
-/*
- printk("bitmap clean at page %lu\n", j);
-*/
+
+ /* We are possibly going to clear some bits, so make
+ * sure that events_cleared is up-to-date.
+ */
+ if (bitmap->events_cleared < bitmap->mddev->events) {
+ bitmap_super_t *sb;
+ bitmap->events_cleared = bitmap->mddev->events;
+ wait_event(bitmap->mddev->sb_wait,
+ !test_bit(MD_CHANGE_CLEAN,
+ &bitmap->mddev->flags));
+ sb = kmap_atomic(bitmap->sb_page, KM_USER0);
+ sb->events_cleared =
+ cpu_to_le64(bitmap->events_cleared);
+ kunmap_atomic(sb, KM_USER0);
+ write_page(bitmap, bitmap->sb_page, 1);
+ }
spin_lock_irqsave(&bitmap->lock, flags);
clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
}
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 009 of 10] md: Allow parallel resync of md-devices.
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (7 preceding siblings ...)
2008-05-19 1:10 ` [PATCH 008 of 10] md: Improve setting of "events_cleared" for write-intent bitmaps NeilBrown
@ 2008-05-19 1:11 ` NeilBrown
2008-05-19 1:11 ` [PATCH 010 of 10] md: Restart recovery cleanly after device failure NeilBrown
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Bernd Schubert
From: Bernd Schubert <bs@q-leap.de>
In some configurations, a raid6 resync can be limited by CPU speed
(Calculating P and Q and moving data) rather than by device speed.
In these cases there is nothing to be gained by serialising resync
of arrays that share a device, and doing the resync in parallel can
provide benefit.
So add a sysfs tunable to flag an array as being allowed to
resync in parallel with other arrays that use (a different part of)
the same device.
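The store side of the new attribute accepts only 0 or 1; a user-space sketch of that validation (using strtol in place of the kernel's strict_strtol, with an illustrative function name):

```c
#include <errno.h>
#include <stdlib.h>

/* Accept only the integers 0 and 1, rejecting everything else with
 * -EINVAL, mirroring the sync_force_parallel_store() checks. */
int model_store_force_parallel(const char *buf, int *parallel_resync)
{
    char *end;
    long n;

    errno = 0;
    n = strtol(buf, &end, 10);
    if (errno || end == buf || (*end != '\0' && *end != '\n'))
        return -EINVAL;              /* not a clean integer */
    if (n != 0 && n != 1)
        return -EINVAL;              /* only a boolean flag */
    *parallel_resync = (int)n;
    return 0;
}
```

Rejecting malformed input at the sysfs boundary keeps the resync loop's check a simple flag test.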
Signed-off-by: Bernd Schubert <bs@q-leap.de>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 40 ++++++++++++++++++++++++++++++++++++----
./include/linux/raid/md_k.h | 3 +++
2 files changed, 39 insertions(+), 4 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-19 11:03:47.000000000 +1000
+++ ./drivers/md/md.c 2008-05-19 11:04:07.000000000 +1000
@@ -74,6 +74,8 @@ static DEFINE_SPINLOCK(pers_lock);
static void md_print_devices(void);
+static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
+
#define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
/*
@@ -3013,6 +3015,36 @@ degraded_show(mddev_t *mddev, char *page
static struct md_sysfs_entry md_degraded = __ATTR_RO(degraded);
static ssize_t
+sync_force_parallel_show(mddev_t *mddev, char *page)
+{
+ return sprintf(page, "%d\n", mddev->parallel_resync);
+}
+
+static ssize_t
+sync_force_parallel_store(mddev_t *mddev, const char *buf, size_t len)
+{
+ long n;
+
+ if (strict_strtol(buf, 10, &n))
+ return -EINVAL;
+
+ if (n != 0 && n != 1)
+ return -EINVAL;
+
+ mddev->parallel_resync = n;
+
+ if (mddev->sync_thread)
+ wake_up(&resync_wait);
+
+ return len;
+}
+
+/* force parallel resync, even with shared block devices */
+static struct md_sysfs_entry md_sync_force_parallel =
+__ATTR(sync_force_parallel, S_IRUGO|S_IWUSR,
+ sync_force_parallel_show, sync_force_parallel_store);
+
+static ssize_t
sync_speed_show(mddev_t *mddev, char *page)
{
unsigned long resync, dt, db;
@@ -3187,6 +3219,7 @@ static struct attribute *md_redundancy_a
&md_sync_min.attr,
&md_sync_max.attr,
&md_sync_speed.attr,
+ &md_sync_force_parallel.attr,
&md_sync_completed.attr,
&md_max_sync.attr,
&md_suspend_lo.attr,
@@ -5487,8 +5520,6 @@ void md_allow_write(mddev_t *mddev)
}
EXPORT_SYMBOL_GPL(md_allow_write);
-static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
-
#define SYNC_MARKS 10
#define SYNC_MARK_STEP (3*HZ)
void md_do_sync(mddev_t *mddev)
@@ -5552,8 +5583,9 @@ void md_do_sync(mddev_t *mddev)
for_each_mddev(mddev2, tmp) {
if (mddev2 == mddev)
continue;
- if (mddev2->curr_resync &&
- match_mddev_units(mddev,mddev2)) {
+ if (!mddev->parallel_resync
+ && mddev2->curr_resync
+ && match_mddev_units(mddev, mddev2)) {
DEFINE_WAIT(wq);
if (mddev < mddev2 && mddev->curr_resync == 2) {
/* arbitrarily yield */
diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2008-05-19 11:02:06.000000000 +1000
+++ ./include/linux/raid/md_k.h 2008-05-19 11:04:07.000000000 +1000
@@ -180,6 +180,9 @@ struct mddev_s
int sync_speed_min;
int sync_speed_max;
+ /* resync even though the same disks are shared among md-devices */
+ int parallel_resync;
+
int ok_start_degraded;
/* recovery/resync flags
* NEEDED: we might need to start a resync/recover
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 010 of 10] md: Restart recovery cleanly after device failure.
2008-05-19 1:10 [PATCH 000 of 10] md: Various bug fixes and small improvements for md in 2.6.26-rc NeilBrown
` (8 preceding siblings ...)
2008-05-19 1:11 ` [PATCH 009 of 10] md: Allow parallel resync of md-devices NeilBrown
@ 2008-05-19 1:11 ` NeilBrown
9 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2008-05-19 1:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Eivind Sarto
When we get any IO error during a recovery (rebuilding a spare), we
abort the recovery and restart it.
For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may
be able to continue and re-doing all that has already been done doesn't
make sense.
We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.
The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really
isn't needed. If there is an error, the relevant drive will be marked
as Faulty, and that is enough to ensure correct handling of the error.
So we first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.
Then we cause the attempt to remove a non-faulty device from an array
to fail (unless recovery is impossible as the array is too degraded).
Then when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and recovery
will continue on them as desired.
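The new removal rule (here in its raid1 form) can be sketched as a predicate; the struct and names below are illustrative stand-ins for the conf/rdev state:

```c
/* Sketch of the check added to raid1_remove_disk(): a device that is
 * not Faulty may only be taken out when the array is already too
 * degraded for recovery to make use of it. */
struct conf_model {
    int raid_disks;   /* devices in the array */
    int degraded;     /* devices currently missing or failed */
};

int model_may_remove(const struct conf_model *conf, int faulty)
{
    if (!faulty && conf->degraded < conf->raid_disks)
        return 0;     /* -EBUSY in the patch: keep it for recovery */
    return 1;
}
```

Because remove_and_add_spares now fails to remove such devices, a partially-recovered spare stays in place and recovery resumes from its checkpoint rather than from sector zero.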
Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.
Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.
Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 22 +++++++++++-----------
./drivers/md/multipath.c | 3 ++-
./drivers/md/raid1.c | 10 +++++++++-
./drivers/md/raid10.c | 14 ++++++++++++--
./drivers/md/raid5.c | 10 +++++++++-
./include/linux/raid/md_k.h | 4 +---
6 files changed, 44 insertions(+), 19 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-05-19 11:04:07.000000000 +1000
+++ ./drivers/md/md.c 2008-05-19 11:04:11.000000000 +1000
@@ -5434,7 +5434,7 @@ void md_done_sync(mddev_t *mddev, int bl
atomic_sub(blocks, &mddev->recovery_active);
wake_up(&mddev->recovery_wait);
if (!ok) {
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_wakeup_thread(mddev->thread);
// stop recovery, signal do_sync ....
}
@@ -5690,7 +5690,7 @@ void md_do_sync(mddev_t *mddev)
sectors = mddev->pers->sync_request(mddev, j, &skipped,
currspeed < speed_min(mddev));
if (sectors == 0) {
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
goto out;
}
@@ -5713,8 +5713,7 @@ void md_do_sync(mddev_t *mddev)
last_check = io_sectors;
- if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
- test_bit(MD_RECOVERY_ERR, &mddev->recovery))
+ if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
break;
repeat:
@@ -5768,8 +5767,7 @@ void md_do_sync(mddev_t *mddev)
/* tell personality that we are finished */
mddev->pers->sync_request(mddev, max_sectors, &skipped, 1);
- if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
- !test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
+ if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
mddev->curr_resync > 2) {
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -5838,7 +5836,10 @@ static int remove_and_add_spares(mddev_t
}
if (mddev->degraded) {
- rdev_for_each(rdev, rtmp, mddev)
+ rdev_for_each(rdev, rtmp, mddev) {
+ if (rdev->raid_disk >= 0 &&
+ !test_bit(In_sync, &rdev->flags))
+ spares++;
if (rdev->raid_disk < 0
&& !test_bit(Faulty, &rdev->flags)) {
rdev->recovery_offset = 0;
@@ -5856,6 +5857,7 @@ static int remove_and_add_spares(mddev_t
} else
break;
}
+ }
}
return spares;
}
@@ -5869,7 +5871,7 @@ static int remove_and_add_spares(mddev_t
* to do that as needed.
* When it is determined that resync is needed, we set MD_RECOVERY_RUNNING in
* "->recovery" and create a thread at ->sync_thread.
- * When the thread finishes it sets MD_RECOVERY_DONE (and might set MD_RECOVERY_ERR)
+ * When the thread finishes it sets MD_RECOVERY_DONE
* and wakeups up this thread which will reap the thread and finish up.
* This thread also removes any faulty devices (with nr_pending == 0).
*
@@ -5944,8 +5946,7 @@ void md_check_recovery(mddev_t *mddev)
/* resync has finished, collect result */
md_unregister_thread(mddev->sync_thread);
mddev->sync_thread = NULL;
- if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
- !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+ if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
/* success...*/
/* activate any spares */
mddev->pers->spare_active(mddev);
@@ -5969,7 +5970,6 @@ void md_check_recovery(mddev_t *mddev)
* might be left set
*/
clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
- clear_bit(MD_RECOVERY_ERR, &mddev->recovery);
clear_bit(MD_RECOVERY_INTR, &mddev->recovery);
clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
diff .prev/drivers/md/multipath.c ./drivers/md/multipath.c
--- .prev/drivers/md/multipath.c 2008-05-19 11:02:01.000000000 +1000
+++ ./drivers/md/multipath.c 2008-05-19 11:04:11.000000000 +1000
@@ -327,7 +327,8 @@ static int multipath_remove_disk(mddev_t
if (rdev) {
if (test_bit(In_sync, &rdev->flags) ||
atomic_read(&rdev->nr_pending)) {
- printk(KERN_ERR "hot-remove-disk, slot %d is identified" " but is still operational!\n", number);
+ printk(KERN_ERR "hot-remove-disk, slot %d is identified"
+ " but is still operational!\n", number);
err = -EBUSY;
goto abort;
}
diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-05-19 11:02:01.000000000 +1000
+++ ./drivers/md/raid10.c 2008-05-19 11:04:11.000000000 +1000
@@ -1020,7 +1020,7 @@ static void error(mddev_t *mddev, mdk_rd
/*
* if recovery is running, make sure it aborts.
*/
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -1171,6 +1171,14 @@ static int raid10_remove_disk(mddev_t *m
err = -EBUSY;
goto abort;
}
+ /* Only remove non-faulty devices if recovery
+ * is not possible.
+ */
+ if (!test_bit(Faulty, &rdev->flags) &&
+ enough(conf)) {
+ err = -EBUSY;
+ goto abort;
+ }
p->rdev = NULL;
synchronize_rcu();
if (atomic_read(&rdev->nr_pending)) {
@@ -1237,6 +1245,7 @@ static void end_sync_write(struct bio *b
if (!uptodate)
md_error(mddev, conf->mirrors[d].rdev);
+
update_head_pos(i, r10_bio);
while (atomic_dec_and_test(&r10_bio->remaining)) {
@@ -1844,7 +1853,8 @@ static sector_t sync_request(mddev_t *md
if (rb2)
atomic_dec(&rb2->remaining);
r10_bio = rb2;
- if (!test_and_set_bit(MD_RECOVERY_ERR, &mddev->recovery))
+ if (!test_and_set_bit(MD_RECOVERY_INTR,
+ &mddev->recovery))
printk(KERN_INFO "raid10: %s: insufficient working devices for recovery.\n",
mdname(mddev));
break;
diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-05-19 11:02:55.000000000 +1000
+++ ./drivers/md/raid1.c 2008-05-19 11:04:11.000000000 +1000
@@ -1027,7 +1027,7 @@ static void error(mddev_t *mddev, mdk_rd
/*
* if recovery is running, make sure it aborts.
*/
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
} else
set_bit(Faulty, &rdev->flags);
set_bit(MD_CHANGE_DEVS, &mddev->flags);
@@ -1148,6 +1148,14 @@ static int raid1_remove_disk(mddev_t *md
err = -EBUSY;
goto abort;
}
+ /* Only remove non-faulty devices if recovery
+ * is not possible.
+ */
+ if (!test_bit(Faulty, &rdev->flags) &&
+ mddev->degraded < conf->raid_disks) {
+ err = -EBUSY;
+ goto abort;
+ }
p->rdev = NULL;
synchronize_rcu();
if (atomic_read(&rdev->nr_pending)) {
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-05-19 11:02:44.000000000 +1000
+++ ./drivers/md/raid5.c 2008-05-19 11:04:11.000000000 +1000
@@ -1268,7 +1268,7 @@ static void error(mddev_t *mddev, mdk_rd
/*
* if recovery was running, make sure it aborts.
*/
- set_bit(MD_RECOVERY_ERR, &mddev->recovery);
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
}
set_bit(Faulty, &rdev->flags);
printk (KERN_ALERT
@@ -4574,6 +4574,14 @@ static int raid5_remove_disk(mddev_t *md
err = -EBUSY;
goto abort;
}
+ /* Only remove non-faulty devices if recovery
+ * isn't possible.
+ */
+ if (!test_bit(Faulty, &rdev->flags) &&
+ mddev->degraded <= conf->max_degraded) {
+ err = -EBUSY;
+ goto abort;
+ }
p->rdev = NULL;
synchronize_rcu();
if (atomic_read(&rdev->nr_pending)) {
diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2008-05-19 11:04:07.000000000 +1000
+++ ./include/linux/raid/md_k.h 2008-05-19 11:04:11.000000000 +1000
@@ -188,8 +188,7 @@ struct mddev_s
* NEEDED: we might need to start a resync/recover
* RUNNING: a thread is running, or about to be started
* SYNC: actually doing a resync, not a recovery
- * ERR: and IO error was detected - abort the resync/recovery
- * INTR: someone requested a (clean) early abort.
+ * INTR: resync needs to be aborted for some reason
* DONE: thread is done and is waiting to be reaped
* REQUEST: user-space has requested a sync (used with SYNC)
* CHECK: user-space request for check-only, no repair
@@ -199,7 +198,6 @@ struct mddev_s
*/
#define MD_RECOVERY_RUNNING 0
#define MD_RECOVERY_SYNC 1
-#define MD_RECOVERY_ERR 2
#define MD_RECOVERY_INTR 3
#define MD_RECOVERY_DONE 4
#define MD_RECOVERY_NEEDED 5
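To illustrate the handshake described in the md_check_recovery() comment above as changed by this patch: the sync thread sets MD_RECOVERY_DONE when it finishes, and the monitor treats the run as a success only if MD_RECOVERY_INTR was never raised (MD_RECOVERY_ERR is gone; every abort reason now uses INTR). The sketch below is a simplified userspace model, not kernel code: the bit helpers stand in for the kernel's atomic bitops, and `reap_is_success` is a hypothetical name for the check done at reap time.

```c
#include <assert.h>

/* Bit numbers mirror md_k.h after this patch (ERR removed). */
#define MD_RECOVERY_RUNNING 0
#define MD_RECOVERY_SYNC    1
#define MD_RECOVERY_INTR    3
#define MD_RECOVERY_DONE    4
#define MD_RECOVERY_NEEDED  5

/* Simplified, non-atomic stand-ins for the kernel's set_bit/test_bit. */
static void set_bit_toy(int nr, unsigned long *addr)  { *addr |= (1UL << nr); }
static int  test_bit_toy(int nr, unsigned long *addr) { return (*addr >> nr) & 1; }

/* Model of the reap-time decision in md_check_recovery(): a finished
 * sync thread counts as a success (spares get activated) only when
 * INTR was never set during the run. */
static int reap_is_success(unsigned long *recovery)
{
	if (!test_bit_toy(MD_RECOVERY_DONE, recovery))
		return 0;	/* thread has not finished yet */
	return !test_bit_toy(MD_RECOVERY_INTR, recovery);
}
```

With ERR folded into INTR, error() paths in raid1/raid10/raid5 simply set INTR, and this single check covers both I/O errors and user-requested aborts.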
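The guard this patch adds to raid1_remove_disk() (and its raid10/raid5 counterparts) can also be sketched in isolation: a non-faulty device may only be hot-removed once the array is already too degraded for recovery to rebuild onto it. The structs and names below (`toy_array`, `toy_rdev`, `EBUSY_TOY`) are illustrative stand-ins, not the kernel's real types; only the shape of the condition matches the raid1 hunk.

```c
#include <assert.h>

#define EBUSY_TOY 16	/* stand-in for the kernel's EBUSY */

struct toy_rdev  { int faulty; int nr_pending; };
struct toy_array { int degraded; int raid_disks; };

/* Models the raid1_remove_disk() checks after this patch:
 * refuse removal while I/O is pending, and refuse to remove a
 * working device while recovery could still use it. */
static int toy_remove_disk(struct toy_array *a, struct toy_rdev *r)
{
	if (r->nr_pending)
		return -EBUSY_TOY;	/* device still has I/O in flight */
	if (!r->faulty && a->degraded < a->raid_disks)
		return -EBUSY_TOY;	/* recovery is still possible */
	return 0;			/* removal allowed */
}
```

The point of the new check is the oops fix: previously a non-faulty device could be yanked out from under a running recovery.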