* [PATCH 000 of 9] md: udev notification, raid5 read improvements etc
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
Following are 9 patches for md in 2.6.19-rc4-mm2.
The first two are suitable for 2.6.19.
The third might be. It seems straightforward, but is awkward to test.
Possibly safest to keep it for .20...
The rest should be held for .20.
4 is a minor tidyup
5 is a resend with a bug fixed.
6-9 are resends with attribution improved.
Thanks,
NeilBrown
[PATCH 001 of 9] md: Change ONLINE/OFFLINE events to a single CHANGE event
[PATCH 002 of 9] md: Fix sizing problem with raid5-reshape and CONFIG_LBD=n
[PATCH 003 of 9] md: Do not freeze md threads for suspend.
[PATCH 004 of 9] md: Tidy up device-change notification when an md array is stopped
[PATCH 005 of 9] md: Change lifetime rules for 'md' devices.
[PATCH 006 of 9] md: Define raid5_mergeable_bvec
[PATCH 007 of 9] md: Handle bypassing the read cache (assuming nothing fails).
[PATCH 008 of 9] md: Allow reads that have bypassed the cache to be retried on failure.
[PATCH 009 of 9] md: Enable bypassing cache for reads.
* [PATCH 001 of 9] md: Change ONLINE/OFFLINE events to a single CHANGE event
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Kay Sievers
It turns out that CHANGE is preferred to ONLINE/OFFLINE for various reasons
(not least of which being that udev understands it already).
So remove the recently added KOBJ_OFFLINE event (no-one is likely to care
anyway) and change the ONLINE event to a CHANGE event.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2006-11-06 11:21:25.000000000 +1100
+++ ./drivers/md/md.c 2006-11-06 11:22:14.000000000 +1100
@@ -3200,7 +3200,7 @@ static int do_md_run(mddev_t * mddev)
mddev->changed = 1;
md_new_event(mddev);
- kobject_uevent(&mddev->gendisk->kobj, KOBJ_ONLINE);
+ kobject_uevent(&mddev->gendisk->kobj, KOBJ_CHANGE);
return 0;
}
@@ -3314,7 +3314,6 @@ static int do_md_stop(mddev_t * mddev, i
module_put(mddev->pers->owner);
mddev->pers = NULL;
- kobject_uevent(&mddev->gendisk->kobj, KOBJ_OFFLINE);
if (mddev->ro)
mddev->ro = 0;
}
* [PATCH 002 of 9] md: Fix sizing problem with raid5-reshape and CONFIG_LBD=n
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
I forgot to cast the size-in-blocks to (loff_t) before shifting it up to a size-in-bytes.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2006-11-06 11:21:24.000000000 +1100
+++ ./drivers/md/raid5.c 2006-11-06 11:28:51.000000000 +1100
@@ -3659,7 +3659,7 @@ static void end_reshape(raid5_conf_t *co
bdev = bdget_disk(conf->mddev->gendisk, 0);
if (bdev) {
mutex_lock(&bdev->bd_inode->i_mutex);
- i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+ i_size_write(bdev->bd_inode, (loff_t)conf->mddev->array_size << 10);
mutex_unlock(&bdev->bd_inode->i_mutex);
bdput(bdev);
}
* [PATCH 003 of 9] md: Do not freeze md threads for suspend.
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Rafael J. Wysocki
From: "Rafael J. Wysocki" <rjw@sisk.pl>
If there's a swap file on a software RAID, it should be possible to use this
file for saving the swsusp's suspend image. Also, this file should be
available to the memory management subsystem when memory is being freed before
the suspend image is created.
For the above reasons it seems that md threads should not be frozen during
suspend, and the appended patch makes that happen. The remaining question is
whether they can cause any data to be written to disk after the suspend image
has been created, given that all filesystems are frozen at that time.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2006-11-06 11:28:44.000000000 +1100
+++ ./drivers/md/md.c 2006-11-06 11:29:00.000000000 +1100
@@ -4488,6 +4488,7 @@ static int md_thread(void * arg)
* many dirty RAID5 blocks.
*/
+ current->flags |= PF_NOFREEZE;
allow_signal(SIGKILL);
while (!kthread_should_stop()) {
@@ -4504,7 +4505,6 @@ static int md_thread(void * arg)
test_bit(THREAD_WAKEUP, &thread->flags)
|| kthread_should_stop(),
thread->timeout);
- try_to_freeze();
clear_bit(THREAD_WAKEUP, &thread->flags);
* [PATCH 004 of 9] md: Tidy up device-change notification when an md array is stopped
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
An md array can be stopped leaving all the settings still in place,
or it can be torn down and destroyed.
set_capacity and other change notifications only happen in the latter
case, but should happen in both.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2006-11-06 11:29:00.000000000 +1100
+++ ./drivers/md/md.c 2006-11-06 11:29:12.000000000 +1100
@@ -3314,6 +3314,10 @@ static int do_md_stop(mddev_t * mddev, i
module_put(mddev->pers->owner);
mddev->pers = NULL;
+
+ set_capacity(disk, 0);
+ mddev->changed = 1;
+
if (mddev->ro)
mddev->ro = 0;
}
@@ -3333,7 +3337,7 @@ static int do_md_stop(mddev_t * mddev, i
if (mode == 0) {
mdk_rdev_t *rdev;
struct list_head *tmp;
- struct gendisk *disk;
+
printk(KERN_INFO "md: %s stopped.\n", mdname(mddev));
bitmap_destroy(mddev);
@@ -3358,10 +3362,6 @@ static int do_md_stop(mddev_t * mddev, i
mddev->raid_disks = 0;
mddev->recovery_cp = 0;
- disk = mddev->gendisk;
- if (disk)
- set_capacity(disk, 0);
- mddev->changed = 1;
} else if (mddev->pers)
printk(KERN_INFO "md: %s switched to read-only mode.\n",
mdname(mddev));
* [PATCH 005 of 9] md: Change lifetime rules for 'md' devices.
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel
Currently md devices are created when first opened and remain in existence
until the module is unloaded.
This isn't a major problem, but it is somewhat ugly.
This patch changes the lifetime rules so that an md device will
disappear on the last close if it has no state.
Locking rules depend on bd_mutex being held in do_open and
__blkdev_put, and on setting bd_disk->private_data to 'mddev'.
There is room for a race because md_probe is called early in do_open
(get_gendisk) to create the mddev. As this isn't protected by
bd_mutex, a concurrent call to md_close can destroy that mddev before
do_open calls md_open to get a reference on it.
md_open and md_close are serialised by md_mutex so the worst that
can happen is that md_open finds that the mddev structure doesn't
exist after all. In this case bd_disk->private_data will be NULL,
and md_open chooses to exit with -EBUSY, which is arguably an
appropriate result.
The new 'dead' field in mddev is used to track whether it is time
to destroy the mddev (if a last-close happens). It is cleared when
any state is created (set_array_info) and set when the array is stopped
(do_md_stop).
mddev_put becomes simpler. It just destroys the mddev when the
refcount hits zero. This will normally be the reference held in
bd_disk->private_data.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/md.c | 35 +++++++++++++++++++++++++----------
./include/linux/raid/md_k.h | 3 +++
2 files changed, 28 insertions(+), 10 deletions(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2006-11-06 11:29:12.000000000 +1100
+++ ./drivers/md/md.c 2006-11-06 11:29:13.000000000 +1100
@@ -226,13 +226,14 @@ static void mddev_put(mddev_t *mddev)
{
if (!atomic_dec_and_lock(&mddev->active, &all_mddevs_lock))
return;
- if (!mddev->raid_disks && list_empty(&mddev->disks)) {
- list_del(&mddev->all_mddevs);
- spin_unlock(&all_mddevs_lock);
- blk_cleanup_queue(mddev->queue);
- kobject_unregister(&mddev->kobj);
- } else
- spin_unlock(&all_mddevs_lock);
+ list_del(&mddev->all_mddevs);
+ spin_unlock(&all_mddevs_lock);
+
+ del_gendisk(mddev->gendisk);
+ mddev->gendisk = NULL;
+ blk_cleanup_queue(mddev->queue);
+ mddev->queue = NULL;
+ kobject_unregister(&mddev->kobj);
}
static mddev_t * mddev_find(dev_t unit)
@@ -273,6 +274,7 @@ static mddev_t * mddev_find(dev_t unit)
atomic_set(&new->active, 1);
spin_lock_init(&new->write_lock);
init_waitqueue_head(&new->sb_wait);
+ new->dead = 1;
new->queue = blk_alloc_queue(GFP_KERNEL);
if (!new->queue) {
@@ -3360,6 +3362,8 @@ static int do_md_stop(mddev_t * mddev, i
mddev->array_size = 0;
mddev->size = 0;
mddev->raid_disks = 0;
+ mddev->dead = 1;
+
mddev->recovery_cp = 0;
} else if (mddev->pers)
@@ -4292,7 +4296,8 @@ static int md_ioctl(struct inode *inode,
printk(KERN_WARNING "md: couldn't set"
" array info. %d\n", err);
goto abort_unlock;
- }
+ } else
+ mddev->dead = 0;
}
goto done_unlock;
@@ -4376,6 +4381,8 @@ static int md_ioctl(struct inode *inode,
err = -EFAULT;
else
err = add_new_disk(mddev, &info);
+ if (!err)
+ mddev->dead = 0;
goto done_unlock;
}
@@ -4422,8 +4429,12 @@ static int md_open(struct inode *inode,
* Succeed if we can lock the mddev, which confirms that
* it isn't being stopped right now.
*/
- mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
- int err;
+ mddev_t *mddev;
+ int err = -EBUSY;
+
+ mddev = inode->i_bdev->bd_disk->private_data;
+ if (!mddev)
+ goto out;
if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
goto out;
@@ -4442,6 +4453,10 @@ static int md_release(struct inode *inod
mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
BUG_ON(!mddev);
+ if (inode->i_bdev->bd_openers == 0 && mddev->dead) {
+ inode->i_bdev->bd_disk->private_data = NULL;
+ mddev_put(mddev);
+ }
mddev_put(mddev);
return 0;
diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2006-11-06 11:21:24.000000000 +1100
+++ ./include/linux/raid/md_k.h 2006-11-06 11:29:13.000000000 +1100
@@ -119,6 +119,9 @@ struct mddev_s
#define MD_CHANGE_PENDING 2 /* superblock update in progress */
int ro;
+ int dead; /* array should be discarded on
+ * last close
+ */
struct gendisk *gendisk;
* [PATCH 006 of 9] md: Define raid5_mergeable_bvec
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)
From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
This will encourage read requests to be on only one device,
so we will often be able to bypass the cache for read
requests.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2006-11-06 11:28:51.000000000 +1100
+++ ./drivers/md/raid5.c 2006-11-06 11:29:13.000000000 +1100
@@ -2611,6 +2611,28 @@ static int raid5_congested(void *data, i
return 0;
}
+/* We want read requests to align with chunks where possible,
+ * but write requests don't need to.
+ */
+static int raid5_mergeable_bvec(request_queue_t *q, struct bio *bio, struct bio_vec *biovec)
+{
+ mddev_t *mddev = q->queuedata;
+ sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
+ int max;
+ unsigned int chunk_sectors = mddev->chunk_size >> 9;
+ unsigned int bio_sectors = bio->bi_size >> 9;
+
+ if (bio_data_dir(bio))
+ return biovec->bv_len; /* always allow writes to be mergeable */
+
+ max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
+ if (max < 0) max = 0;
+ if (max <= biovec->bv_len && bio_sectors == 0)
+ return biovec->bv_len;
+ else
+ return max;
+}
+
static int make_request(request_queue_t *q, struct bio * bi)
{
mddev_t *mddev = q->queuedata;
@@ -3320,6 +3342,8 @@ static int run(mddev_t *mddev)
mddev->array_size = mddev->size * (conf->previous_raid_disks -
conf->max_degraded);
+ blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+
return 0;
abort:
if (conf) {
* [PATCH 007 of 9] md: Handle bypassing the read cache (assuming nothing fails).
From: NeilBrown @ 2006-11-07 22:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)
From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2006-11-06 11:29:13.000000000 +1100
+++ ./drivers/md/raid5.c 2006-11-06 11:29:13.000000000 +1100
@@ -2633,6 +2633,84 @@ static int raid5_mergeable_bvec(request_
return max;
}
+
+static int in_chunk_boundary(mddev_t *mddev, struct bio *bio)
+{
+ sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
+ unsigned int chunk_sectors = mddev->chunk_size >> 9;
+ unsigned int bio_sectors = bio->bi_size >> 9;
+
+ return chunk_sectors >=
+ ((sector & (chunk_sectors - 1)) + bio_sectors);
+}
+
+/*
+ * The "raid5_align_endio" should check if the read succeeded and if it
+ * did, call bio_endio on the original bio (having bio_put the new bio
+ * first).
+ * If the read failed..
+ */
+int raid5_align_endio(struct bio *bi, unsigned int bytes , int error)
+{
+ struct bio* raid_bi = bi->bi_private;
+ if (bi->bi_size)
+ return 1;
+ bio_put(bi);
+ bio_endio(raid_bi, bytes, error);
+ return 0;
+}
+
+static int chunk_aligned_read(request_queue_t *q, struct bio * raid_bio)
+{
+ mddev_t *mddev = q->queuedata;
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ const unsigned int raid_disks = conf->raid_disks;
+ const unsigned int data_disks = raid_disks - 1;
+ unsigned int dd_idx, pd_idx;
+ struct bio* align_bi;
+ mdk_rdev_t *rdev;
+
+ if (!in_chunk_boundary(mddev, raid_bio)) {
+ printk("chunk_aligned_read : non aligned\n");
+ return 0;
+ }
+ /*
+ * use bio_clone to make a copy of the bio
+ */
+ align_bi = bio_clone(raid_bio, GFP_NOIO);
+ if (!align_bi)
+ return 0;
+ /*
+ * set bi_end_io to a new function, and set bi_private to the
+ * original bio.
+ */
+ align_bi->bi_end_io = raid5_align_endio;
+ align_bi->bi_private = raid_bio;
+ /*
+ * compute position
+ */
+ align_bi->bi_sector = raid5_compute_sector(raid_bio->bi_sector,
+ raid_disks,
+ data_disks,
+ &dd_idx,
+ &pd_idx,
+ conf);
+
+ rcu_read_lock();
+ rdev = rcu_dereference(conf->disks[dd_idx].rdev);
+ if (rdev && test_bit(In_sync, &rdev->flags)) {
+ align_bi->bi_bdev = rdev->bdev;
+ atomic_inc(&rdev->nr_pending);
+ rcu_read_unlock();
+ generic_make_request(align_bi);
+ return 1;
+ } else {
+ rcu_read_unlock();
+ return 0;
+ }
+}
+
+
static int make_request(request_queue_t *q, struct bio * bi)
{
mddev_t *mddev = q->queuedata;
* [PATCH 008 of 9] md: Allow reads that have bypassed the cache to be retried on failure.
From: NeilBrown @ 2006-11-07 22:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)
From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
If a bypass-the-cache read fails, we simply try again through
the cache. If it fails again it will trigger normal recovery
procedures.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 150 ++++++++++++++++++++++++++++++++++++++++++-
./include/linux/raid/raid5.h | 3
2 files changed, 150 insertions(+), 3 deletions(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2006-11-06 11:29:13.000000000 +1100
+++ ./drivers/md/raid5.c 2006-11-06 11:29:14.000000000 +1100
@@ -134,6 +134,8 @@ static void __release_stripe(raid5_conf_
if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
list_add_tail(&sh->lru, &conf->inactive_list);
wake_up(&conf->wait_for_stripe);
+ if (conf->retry_read_aligned)
+ md_wakeup_thread(conf->mddev->thread);
}
}
}
@@ -2645,18 +2647,74 @@ static int in_chunk_boundary(mddev_t *md
}
/*
+ * add bio to the retry LIFO ( in O(1) ... we are in interrupt )
+ * later sampled by raid5d.
+ */
+static void add_bio_to_retry(struct bio *bi,raid5_conf_t *conf)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&conf->device_lock, flags);
+
+ bi->bi_next = conf->retry_read_aligned;
+ conf->retry_read_aligned = bi;
+
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ md_wakeup_thread(conf->mddev->thread);
+}
+
+
+static struct bio *remove_bio_from_retry(raid5_conf_t *conf)
+{
+ struct bio *bi;
+
+ bi = conf->retry_read_aligned;
+ if (bi) {
+ conf->retry_read_aligned = NULL;
+ return bi;
+ }
+ bi = conf->retry_read_aligned_list;
+ if(bi) {
+ conf->retry_read_aligned = bi->bi_next;
+ bi->bi_next = NULL;
+ bi->bi_phys_segments = 1; /* biased count of active stripes */
+ bi->bi_hw_segments = 0; /* count of processed stripes */
+ }
+
+ return bi;
+}
+
+
+/*
* The "raid5_align_endio" should check if the read succeeded and if it
* did, call bio_endio on the original bio (having bio_put the new bio
* first).
* If the read failed..
*/
-int raid5_align_endio(struct bio *bi, unsigned int bytes , int error)
+int raid5_align_endio(struct bio *bi, unsigned int bytes, int error)
{
struct bio* raid_bi = bi->bi_private;
+ mddev_t *mddev;
+ raid5_conf_t *conf;
+
if (bi->bi_size)
return 1;
bio_put(bi);
- bio_endio(raid_bi, bytes, error);
+
+ mddev = raid_bi->bi_bdev->bd_disk->queue->queuedata;
+ conf = mddev_to_conf(mddev);
+
+ if (!error && test_bit(BIO_UPTODATE, &bi->bi_flags)) {
+ bio_endio(raid_bi, bytes, 0);
+ if (atomic_dec_and_test(&conf->active_aligned_reads))
+ wake_up(&conf->wait_for_stripe);
+ return 0;
+ }
+
+
+ PRINTK("raid5_align_endio : io error...handing IO for a retry\n");
+
+ add_bio_to_retry(raid_bi, conf);
return 0;
}
@@ -2702,6 +2760,14 @@ static int chunk_aligned_read(request_qu
align_bi->bi_bdev = rdev->bdev;
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
+
+ spin_lock_irq(&conf->device_lock);
+ wait_event_lock_irq(conf->wait_for_stripe,
+ conf->quiesce == 0,
+ conf->device_lock, /* nothing */);
+ atomic_inc(&conf->active_aligned_reads);
+ spin_unlock_irq(&conf->device_lock);
+
generic_make_request(align_bi);
return 1;
} else {
@@ -3050,6 +3116,71 @@ static inline sector_t sync_request(mdde
return STRIPE_SECTORS;
}
+static int retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
+{
+ /* We may not be able to submit a whole bio at once as there
+ * may not be enough stripe_heads available.
+ * We cannot pre-allocate enough stripe_heads as we may need
+ * more than exist in the cache (if we allow ever large chunks).
+ * So we do one stripe head at a time and record in
+ * ->bi_hw_segments how many have been done.
+ *
+ * We *know* that this entire raid_bio is in one chunk, so
+ * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
+ */
+ struct stripe_head *sh;
+ int dd_idx, pd_idx;
+ sector_t sector, logical_sector, last_sector;
+ int scnt = 0;
+ int remaining;
+ int handled = 0;
+
+ logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
+ sector = raid5_compute_sector( logical_sector,
+ conf->raid_disks,
+ conf->raid_disks-1,
+ &dd_idx,
+ &pd_idx,
+ conf);
+ last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);
+
+ for (; logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
+
+ if (scnt < raid_bio->bi_hw_segments)
+ /* already done this stripe */
+ continue;
+
+ sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1);
+
+ if (!sh) {
+ /* failed to get a stripe - must wait */
+ raid_bio->bi_hw_segments = scnt;
+ conf->retry_read_aligned = raid_bio;
+ return handled;
+ }
+
+ set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
+ add_stripe_bio(sh, raid_bio, dd_idx, 0);
+ handle_stripe(sh, NULL);
+ release_stripe(sh);
+ handled++;
+ }
+ spin_lock_irq(&conf->device_lock);
+ remaining = --raid_bio->bi_phys_segments;
+ spin_unlock_irq(&conf->device_lock);
+ if (remaining == 0) {
+ int bytes = raid_bio->bi_size;
+
+ raid_bio->bi_size = 0;
+ raid_bio->bi_end_io(raid_bio, bytes, 0);
+ }
+ if (atomic_dec_and_test(&conf->active_aligned_reads))
+ wake_up(&conf->wait_for_stripe);
+ return handled;
+}
+
+
+
/*
* This is our raid5 kernel thread.
*
@@ -3071,6 +3202,7 @@ static void raid5d (mddev_t *mddev)
spin_lock_irq(&conf->device_lock);
while (1) {
struct list_head *first;
+ struct bio *bio;
if (conf->seq_flush != conf->seq_write) {
int seq = conf->seq_flush;
@@ -3087,6 +3219,16 @@ static void raid5d (mddev_t *mddev)
!list_empty(&conf->delayed_list))
raid5_activate_delayed(conf);
+ while ((bio = remove_bio_from_retry(conf))) {
+ int ok;
+ spin_unlock_irq(&conf->device_lock);
+ ok = retry_aligned_read(conf, bio);
+ spin_lock_irq(&conf->device_lock);
+ if (!ok)
+ break;
+ handled++;
+ }
+
if (list_empty(&conf->handle_list))
break;
@@ -3274,6 +3416,7 @@ static int run(mddev_t *mddev)
INIT_LIST_HEAD(&conf->inactive_list);
atomic_set(&conf->active_stripes, 0);
atomic_set(&conf->preread_active_stripes, 0);
+ atomic_set(&conf->active_aligned_reads, 0);
PRINTK("raid5: run(%s) called.\n", mdname(mddev));
@@ -3796,7 +3939,8 @@ static void raid5_quiesce(mddev_t *mddev
spin_lock_irq(&conf->device_lock);
conf->quiesce = 1;
wait_event_lock_irq(conf->wait_for_stripe,
- atomic_read(&conf->active_stripes) == 0,
+ atomic_read(&conf->active_stripes) == 0 &&
+ atomic_read(&conf->active_aligned_reads) == 0,
conf->device_lock, /* nothing */);
spin_unlock_irq(&conf->device_lock);
break;
diff .prev/include/linux/raid/raid5.h ./include/linux/raid/raid5.h
--- .prev/include/linux/raid/raid5.h 2006-11-06 11:21:23.000000000 +1100
+++ ./include/linux/raid/raid5.h 2006-11-06 11:29:14.000000000 +1100
@@ -227,7 +227,10 @@ struct raid5_private_data {
struct list_head handle_list; /* stripes needing handling */
struct list_head delayed_list; /* stripes that have plugged requests */
struct list_head bitmap_list; /* stripes delaying awaiting bitmap update */
+ struct bio *retry_read_aligned; /* currently retrying aligned bios */
+ struct bio *retry_read_aligned_list; /* aligned bios retry list */
atomic_t preread_active_stripes; /* stripes with scheduled io */
+ atomic_t active_aligned_reads;
atomic_t reshape_stripes; /* stripes with pending writes for reshape */
/* unfortunately we need two cache names as we temporarily have
* [PATCH 009 of 9] md: Enable bypassing cache for reads.
From: NeilBrown @ 2006-11-07 22:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-raid, linux-kernel, Raz Ben-Jehuda(caro)
From: "Raz Ben-Jehuda(caro)" <raziebe@gmail.com>
Call chunk_aligned_read where appropriate.
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./drivers/md/raid5.c | 5 +++++
1 file changed, 5 insertions(+)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2006-11-06 11:29:14.000000000 +1100
+++ ./drivers/md/raid5.c 2006-11-06 11:29:14.000000000 +1100
@@ -2798,6 +2798,11 @@ static int make_request(request_queue_t
disk_stat_inc(mddev->gendisk, ios[rw]);
disk_stat_add(mddev->gendisk, sectors[rw], bio_sectors(bi));
+ if ( bio_data_dir(bi) == READ &&
+ mddev->reshape_position == MaxSector &&
+ chunk_aligned_read(q,bi))
+ return 0;
+
logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
last_sector = bi->bi_sector + (bi->bi_size>>9);
bi->bi_next = NULL;