* [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
@ 2007-10-06 17:06 Dan Williams
From: Dan Williams @ 2007-10-06 17:06 UTC
  To: neilb, akpm; +Cc: linux-raid

Neil,

Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
the raid6+bitmap testing done by Mr. James W. Laferriere, there have been
several cleanups and fixes since the last release.  Also, the changes
are now spread over 4 patches to isolate one conceptual change per
patch.  The most significant cleanup is removing the stripe_head back
pointer from stripe_queue, which makes the queuing layer independent of
the caching layer.
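
As a minimal illustration (sh_to_conf() is a hypothetical helper, not part
of the series), cache-side code now reaches shared raid5 state through the
queue object, and struct stripe_queue carries no pointer back to a
stripe_head:

	/* hypothetical helper, only to show the one-way sh->sq link */
	static inline raid5_conf_t *sh_to_conf(struct stripe_head *sh)
	{
		return sh->sq->raid_conf;
	}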

Expansion support needs more testing.

See the individual patch changelogs for details.  Patch 1 contains
updated performance numbers.

Andrew,

These are updated in the git-md-accel tree, but I will work the
finalized versions through Neil's 'Signed-off-by' path.

Dan Williams (4):
      raid5: add the stripe_queue object for tracking raid io requests (rev3)
      raid5: split allocation of stripe_heads and stripe_queues
      raid5: convert add_stripe_bio to add_queue_bio
      raid5: use stripe_queues to prioritize the "most deserving" requests (rev7)

 drivers/md/raid5.c         | 1560 ++++++++++++++++++++++++++++++++------------
 include/linux/raid/raid5.h |   88 ++-
 2 files changed, 1200 insertions(+), 448 deletions(-)

--
Dan


* [PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3)
@ 2007-10-06 17:06 ` Dan Williams
From: Dan Williams @ 2007-10-06 17:06 UTC
  To: neilb, akpm; +Cc: linux-raid

The raid5 stripe cache object, struct stripe_head, serves two purposes:
	1/ front-end: queuing incoming requests
	2/ back-end: transitioning requests through the cache state machine
	   to the backing devices
The problem with this model is that queuing decisions are directly tied to
cache availability.  There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given time.

This patch separates the object members needed for queuing from the object
members used for caching.  The stripe_queue object takes over the incoming
bio lists, the io completion bio lists, and the parameters needed for
expansion.

The following fields are moved from struct stripe_head to struct
stripe_queue:
	struct raid5_private_data *raid_conf
	int pd_idx
	spinlock_t lock
	int disks

The following fields are moved from struct r5dev to struct r5_queue_dev:
	sector_t sector
	struct bio *toread, *read, *towrite, *written

This (first) commit just moves fields around; subsequent commits take
advantage of the split for performance gains.
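
For reference, the queue-side objects added by this commit end up looking
like this (quoted from the include/linux/raid/raid5.h change below):

struct stripe_queue {
	sector_t sector;
	spinlock_t lock; /* protect bio lists and stripe_head state */
	struct raid5_private_data *raid_conf;
	int pd_idx; /* parity disk index */
	int disks; /* disks in stripe */
	struct r5_queue_dev {
		sector_t sector; /* hw starting sector for this block */
		struct bio *toread, *read, *towrite, *written;
	} dev[1];
};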

--- Performance Data ---
Platform: SMP x4 IA, sata_vsc, 7200RPM SATA Drives x4
Test1: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid
=====pre-patch=====
Sequential Writes:
File   Blk      Num     Avg
Size   Size     Thread  Rate (MiB/s)
----   -----    ---     ------
2048   131072   1        72.02
2048   131072   8        41.51

=====post-patch=====
Sequential Writes:
File   Blk      Num     Avg
Size   Size     Thr     Rate (MiB/s)
----   -----    ---     ------
2048   131072   1       140.86 (+96%)
2048   131072   8        50.18 (+21%)

Test2: blktrace of: dd if=/dev/zero of=/dev/md0 bs=1024k count=1024
=====pre-patch=====
Total (sdd):
 Reads Queued:       1,383,    5,532KiB  Writes Queued:      80,186, 320,744KiB
 Reads Completed:      276,    4,888KiB  Writes Completed:   12,677, 294,324KiB
 IO unplugs:             0               Timer unplugs:           0

=====post-patch=====
Total (sdd):
 Reads Queued:          61,      244KiB  Writes Queued:      66,330, 265,320KiB
 Reads Completed:        4,      112KiB  Writes Completed:    3,562, 285,912KiB
 IO unplugs:            16               Timer unplugs:          17

Platform: SMP x4 IA, mptsas, 15000RPM SAS Drives x4
Test: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid
=====pre-patch=====
Sequential Writes:
File   Blk      Num     Avg
Size   Size     Thr     Rate (MiB/s)
----   -----    ---     ------
2048   131072   1       132.51
2048   131072   8        86.92

=====post-patch=====
Sequential Writes:
File   Blk      Num     Avg
Size   Size     Thr     Rate (MiB/s)
----   -----    ---     ------
2048   131072   1       172.26 (+30%)
2048   131072   8       114.82 (+32%)

Changes in rev2:
* leave the flags with the buffers; this prevents a data corruption issue
  whereby stale buffer state flags were attached to newly initialized
  buffers

Changes in rev3:
* move bm_seq back into the stripe_head, since the bitmap sequencing
  matters at write-out time (after cache attach).  Thanks to Mr. James W.
  Laferriere for his bug reports and testing of bitmap support.
* move 'int disks' into stripe_queue, since expansion details are recorded
  at make_request() time (i.e. before a stripe_head is available)
* move dev->read and dev->written to dev_q->read and dev_q->written.  This
  allows the sq->sh back references to be removed and eliminates the need
  to handle sh details in add_queue_bio

Tested-by: Mr. James W. Laferriere <babydr@baby-dragons.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  564 +++++++++++++++++++++++++++-----------------
 include/linux/raid/raid5.h |   28 +-
 2 files changed, 364 insertions(+), 228 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f96dea9..a13de7d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,19 +252,20 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
-	sh->disks = disks;
+	sh->sq->disks = disks;
 
-	for (i = sh->disks; i--; ) {
+	for (i = disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev->toread || dev->read || dev->towrite || dev->written ||
-		    test_bit(R5_LOCKED, &dev->flags)) {
+		if (dev_q->toread || dev_q->read || dev_q->towrite ||
+		    dev_q->written || test_bit(R5_LOCKED, &dev->flags)) {
 			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->read, dev->towrite, dev->written,
+			       (unsigned long long)sh->sector, i, dev_q->toread,
+			       dev_q->read, dev_q->towrite, dev_q->written,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -282,12 +283,15 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	CHECK_DEVLOCK();
 	pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-		if (sh->sector == sector && sh->disks == disks)
+		if (sh->sector == sector && sh->sq->disks == disks)
 			return sh;
 	pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+	sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(struct request_queue *q);
 
@@ -389,12 +393,13 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, disks = sh->disks;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
+	int i, disks = sq->disks;
 
 	might_sleep();
 
-	for (i = disks; i--; ) {
+	for (i = disks; i--;) {
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
@@ -513,15 +518,17 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
 	/* clear completed biofills */
-	for (i = sh->disks; i--; ) {
+	for (i = sq->disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 
 		/* acknowledge completion of a biofill operation */
 		/* and check if we need to reply to a read request,
@@ -531,16 +538,16 @@ static void ops_complete_biofill(void *stripe_head_ref)
 		if (test_and_clear_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi, *rbi2;
 
-			/* The access to dev->read is outside of the
+			/* The access to dev_q->read is outside of the
 			 * spin_lock_irq(&conf->device_lock), but is protected
 			 * by the STRIPE_OP_BIOFILL pending bit
 			 */
-			BUG_ON(!dev->read);
-			rbi = dev->read;
-			dev->read = NULL;
+			BUG_ON(!dev_q->read);
+			rbi = dev_q->read;
+			dev_q->read = NULL;
 			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
-				rbi2 = r5_next_bio(rbi, dev->sector);
+				dev_q->sector + STRIPE_SECTORS) {
+				rbi2 = r5_next_bio(rbi, dev_q->sector);
 				spin_lock_irq(&conf->device_lock);
 				if (--rbi->bi_phys_segments == 0) {
 					rbi->bi_next = return_bi;
@@ -563,25 +570,27 @@ static void ops_complete_biofill(void *stripe_head_ref)
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
-	for (i = sh->disks; i--; ) {
+	for (i = sh->sq->disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
+
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
-			dev->read = rbi = dev->toread;
-			dev->toread = NULL;
+			dev_q->read = rbi = dev_q->toread;
+			dev_q->toread = NULL;
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
+				dev_q->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
-					dev->sector, tx);
-				rbi = r5_next_bio(rbi, dev->sector);
+					dev_q->sector, tx);
+				rbi = r5_next_bio(rbi, dev_q->sector);
 			}
 		}
 	}
@@ -612,7 +621,7 @@ static struct dma_async_tx_descriptor *
 ops_run_compute5(struct stripe_head *sh, unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->disks;
+	int disks = sh->sq->disks;
 	struct page *xor_srcs[disks];
 	int target = sh->ops.target;
 	struct r5dev *tgt = &sh->dev[target];
@@ -660,9 +669,10 @@ static struct dma_async_tx_descriptor *
 ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->disks;
+	int disks = sh->sq->disks;
 	struct page *xor_srcs[disks];
-	int count = 0, pd_idx = sh->pd_idx, i;
+	struct stripe_queue *sq = sh->sq;
+	int count = 0, pd_idx = sq->pd_idx, i;
 
 	/* existing parity data subtracted */
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
@@ -672,8 +682,9 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		/* Only process blocks that are known to be uptodate */
-		if (dev->towrite && test_bit(R5_Wantprexor, &dev->flags))
+		if (dev_q->towrite && test_bit(R5_Wantprexor, &dev->flags))
 			xor_srcs[count++] = dev->page;
 	}
 
@@ -687,8 +698,9 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 static struct dma_async_tx_descriptor *
 ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
-	int disks = sh->disks;
-	int pd_idx = sh->pd_idx, i;
+	int disks = sh->sq->disks;
+	struct stripe_queue *sq = sh->sq;
+	int pd_idx = sq->pd_idx, i;
 
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
@@ -700,16 +712,17 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		struct bio *chosen;
 		int towrite;
 
 		towrite = 0;
 		if (prexor) { /* rmw */
-			if (dev->towrite &&
+			if (dev_q->towrite &&
 			    test_bit(R5_Wantprexor, &dev->flags))
 				towrite = 1;
 		} else { /* rcw */
-			if (i != pd_idx && dev->towrite &&
+			if (i != pd_idx && dev_q->towrite &&
 				test_bit(R5_LOCKED, &dev->flags))
 				towrite = 1;
 		}
@@ -717,18 +730,18 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 		if (towrite) {
 			struct bio *wbi;
 
-			spin_lock(&sh->lock);
-			chosen = dev->towrite;
-			dev->towrite = NULL;
-			BUG_ON(dev->written);
-			wbi = dev->written = chosen;
-			spin_unlock(&sh->lock);
+			spin_lock(&sq->lock);
+			chosen = dev_q->towrite;
+			dev_q->towrite = NULL;
+			BUG_ON(dev_q->written);
+			wbi = dev_q->written = chosen;
+			spin_unlock(&sq->lock);
 
 			while (wbi && wbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
+				dev_q->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(1, wbi, dev->page,
-					dev->sector, tx);
-				wbi = r5_next_bio(wbi, dev->sector);
+					dev_q->sector, tx);
+				wbi = r5_next_bio(wbi, dev_q->sector);
 			}
 		}
 	}
@@ -751,14 +764,17 @@ static void ops_complete_postxor(void *stripe_head_ref)
 static void ops_complete_write(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	int disks = sh->disks, i, pd_idx = sh->pd_idx;
+	struct stripe_queue *sq = sh->sq;
+	int disks = sq->disks, i, pd_idx = sq->pd_idx;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
-		if (dev->written || i == pd_idx)
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
+		if (dev_q->written || i == pd_idx)
 			set_bit(R5_UPTODATE, &dev->flags);
 	}
 
@@ -773,10 +789,11 @@ static void
 ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->disks;
+	struct stripe_queue *sq = sh->sq;
+	int disks = sq->disks;
 	struct page *xor_srcs[disks];
 
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count = 0, pd_idx = sh->sq->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 	unsigned long flags;
@@ -792,7 +809,9 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (dev->written)
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+
+			if (dev_q->written)
 				xor_srcs[count++] = dev->page;
 		}
 	} else {
@@ -830,7 +849,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 static void ops_complete_check(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sh->sq->pd_idx;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -847,11 +866,11 @@ static void ops_complete_check(void *stripe_head_ref)
 static void ops_run_check(struct stripe_head *sh)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->disks;
+	int disks = sh->sq->disks;
 	struct page *xor_srcs[disks];
 	struct dma_async_tx_descriptor *tx;
 
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count = 0, pd_idx = sh->sq->pd_idx, i;
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -878,7 +897,7 @@ static void ops_run_check(struct stripe_head *sh)
 
 static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 {
-	int overlap_clear = 0, i, disks = sh->disks;
+	int overlap_clear = 0, i, disks = sh->sq->disks;
 	struct dma_async_tx_descriptor *tx = NULL;
 
 	if (test_bit(STRIPE_OP_BIOFILL, &pending)) {
@@ -906,35 +925,51 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 	if (test_bit(STRIPE_OP_IO, &pending))
 		ops_run_io(sh);
 
-	if (overlap_clear)
+	if (overlap_clear) {
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->raid_conf->wait_for_overlap);
+				wake_up(&sh->sq->raid_conf->wait_for_overlap);
 		}
+	}
 }
 
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
-	sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL);
+	struct stripe_queue *sq;
+
+	sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
+
+	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
+	if (!sq) {
+		kmem_cache_free(conf->sh_slab_cache, sh);
+		return 0;
+	}
+
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
-	sh->raid_conf = conf;
-	spin_lock_init(&sh->lock);
+	memset(sq, 0, sizeof(*sq) +
+		(conf->raid_disks-1) * sizeof(struct r5_queue_dev));
+	sh->sq = sq;
+	sq->raid_conf = conf;
+	spin_lock_init(&sq->lock);
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
-		kmem_cache_free(conf->slab_cache, sh);
+		kmem_cache_free(conf->sh_slab_cache, sh);
+		kmem_cache_free(conf->sq_slab_cache, sq);
 		return 0;
 	}
-	sh->disks = conf->raid_disks;
+	sq->disks = conf->raid_disks;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
-	release_stripe(sh);
+	spin_lock_irq(&conf->device_lock);
+	__release_stripe(conf, sh);
+	spin_unlock_irq(&conf->device_lock);
 	return 1;
 }
 
@@ -943,16 +978,28 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 	struct kmem_cache *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name[0], "raid5-%s", mdname(conf->mddev));
-	sprintf(conf->cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
+	sprintf(conf->sh_cache_name[0], "raid5-%s", mdname(conf->mddev));
+	sprintf(conf->sh_cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
+	sprintf(conf->sq_cache_name[0], "raid5q-%s", mdname(conf->mddev));
+	sprintf(conf->sq_cache_name[1], "raid5q-%s-alt", mdname(conf->mddev));
+
 	conf->active_name = 0;
-	sc = kmem_cache_create(conf->cache_name[conf->active_name],
+	sc = kmem_cache_create(conf->sh_cache_name[conf->active_name],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL);
 	if (!sc)
 		return 1;
-	conf->slab_cache = sc;
+	conf->sh_slab_cache = sc;
 	conf->pool_size = devs;
+
+	sc = kmem_cache_create(conf->sq_cache_name[conf->active_name],
+		sizeof(struct stripe_queue) +
+		(devs-1)*sizeof(struct r5_queue_dev), 0, 0, NULL);
+
+	if (!sc)
+		return 1;
+	conf->sq_slab_cache = sc;
+
 	while (num--)
 		if (!grow_one_stripe(conf))
 			return 1;
@@ -989,7 +1036,7 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	LIST_HEAD(newstripes);
 	struct disk_info *ndisks;
 	int err = 0;
-	struct kmem_cache *sc;
+	struct kmem_cache *sc, *sc_q;
 	int i;
 
 	if (newsize <= conf->pool_size)
@@ -998,21 +1045,40 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	md_allow_write(conf->mddev);
 
 	/* Step 1 */
-	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+	sc = kmem_cache_create(conf->sh_cache_name[1-conf->active_name],
 			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
 			       0, 0, NULL);
 	if (!sc)
 		return -ENOMEM;
 
+	sc_q = kmem_cache_create(conf->sq_cache_name[1-conf->active_name],
+		    sizeof(struct stripe_queue) +
+		    (newsize-1)*sizeof(struct r5_queue_dev), 0, 0, NULL);
+	if (!sc_q) {
+		kmem_cache_destroy(sc);
+		return -ENOMEM;
+	}
+
 	for (i = conf->max_nr_stripes; i; i--) {
+		struct stripe_queue *nsq;
+
 		nsh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!nsh)
 			break;
 
+		nsq = kmem_cache_alloc(sc_q, GFP_KERNEL);
+		if (!nsq) {
+			kmem_cache_free(sc, nsh);
+			break;
+		}
+
 		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
+		memset(nsq, 0, sizeof(*nsq) +
+			(newsize-1)*sizeof(struct r5_queue_dev));
 
-		nsh->raid_conf = conf;
-		spin_lock_init(&nsh->lock);
+		nsq->raid_conf = conf;
+		nsh->sq = nsq;
+		spin_lock_init(&nsq->lock);
 
 		list_add(&nsh->lru, &newstripes);
 	}
@@ -1021,8 +1087,10 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 		while (!list_empty(&newstripes)) {
 			nsh = list_entry(newstripes.next, struct stripe_head, lru);
 			list_del(&nsh->lru);
+			kmem_cache_free(sc_q, nsh->sq);
 			kmem_cache_free(sc, nsh);
 		}
+		kmem_cache_destroy(sc_q);
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
 	}
@@ -1044,9 +1112,11 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 			nsh->dev[i].page = osh->dev[i].page;
 		for( ; i<newsize; i++)
 			nsh->dev[i].page = NULL;
-		kmem_cache_free(conf->slab_cache, osh);
+		kmem_cache_free(conf->sq_slab_cache, osh->sq);
+		kmem_cache_free(conf->sh_slab_cache, osh);
 	}
-	kmem_cache_destroy(conf->slab_cache);
+	kmem_cache_destroy(conf->sh_slab_cache);
+	kmem_cache_destroy(conf->sq_slab_cache);
 
 	/* Step 3.
 	 * At this point, we are holding all the stripes so the array
@@ -1077,7 +1147,8 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
 
-	conf->slab_cache = sc;
+	conf->sh_slab_cache = sc;
+	conf->sq_slab_cache = sc_q;
 	conf->active_name = 1-conf->active_name;
 	conf->pool_size = newsize;
 	return err;
@@ -1095,7 +1166,9 @@ static int drop_one_stripe(raid5_conf_t *conf)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
 	shrink_buffers(sh, conf->pool_size);
-	kmem_cache_free(conf->slab_cache, sh);
+	if (sh->sq)
+		kmem_cache_free(conf->sq_slab_cache, sh->sq);
+	kmem_cache_free(conf->sh_slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;
 }
@@ -1105,17 +1178,21 @@ static void shrink_stripes(raid5_conf_t *conf)
 	while (drop_one_stripe(conf))
 		;
 
-	if (conf->slab_cache)
-		kmem_cache_destroy(conf->slab_cache);
-	conf->slab_cache = NULL;
+	if (conf->sh_slab_cache)
+		kmem_cache_destroy(conf->sh_slab_cache);
+	conf->sh_slab_cache = NULL;
+
+	if (conf->sq_slab_cache)
+		kmem_cache_destroy(conf->sq_slab_cache);
+	conf->sq_slab_cache = NULL;
 }
 
 static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 				   int error)
 {
  	struct stripe_head *sh = bi->bi_private;
-	raid5_conf_t *conf = sh->raid_conf;
-	int disks = sh->disks, i;
+	raid5_conf_t *conf = sh->sq->raid_conf;
+	int disks = sh->sq->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 	char b[BDEVNAME_SIZE];
 	mdk_rdev_t *rdev;
@@ -1192,8 +1269,9 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
-	raid5_conf_t *conf = sh->raid_conf;
-	int disks = sh->disks, i;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
+	int disks = sq->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -1222,12 +1300,10 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 	return 0;
 }
 
-
-static sector_t compute_blocknr(struct stripe_head *sh, int i);
-	
 static void raid5_build_block (struct stripe_head *sh, int i)
 {
 	struct r5dev *dev = &sh->dev[i];
+	struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
 	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
@@ -1241,7 +1317,8 @@ static void raid5_build_block (struct stripe_head *sh, int i)
 	dev->req.bi_private = sh;
 
 	dev->flags = 0;
-	dev->sector = compute_blocknr(sh, i);
+	dev_q->sector = compute_blocknr(sh->sq->raid_conf, sh->sq->disks,
+			sh->sector, sh->sq->pd_idx, i);
 }
 
 static void error(mddev_t *mddev, mdk_rdev_t *rdev)
@@ -1376,12 +1453,12 @@ static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
 }
 
 
-static sector_t compute_blocknr(struct stripe_head *sh, int i)
+static sector_t
+compute_blocknr(raid5_conf_t *conf, int raid_disks, sector_t sector,
+	int pd_idx, int i)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = sh->disks;
 	int data_disks = raid_disks - conf->max_degraded;
-	sector_t new_sector = sh->sector, check;
+	sector_t new_sector = sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
 	int chunk_offset;
@@ -1393,7 +1470,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	stripe = new_sector;
 	BUG_ON(new_sector != stripe);
 
-	if (i == sh->pd_idx)
+	if (i == pd_idx)
 		return 0;
 	switch(conf->level) {
 	case 4: break;
@@ -1401,14 +1478,14 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-			if (i > sh->pd_idx)
+			if (i > pd_idx)
 				i--;
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			if (i < sh->pd_idx)
+			if (i < pd_idx)
 				i += raid_disks;
-			i -= (sh->pd_idx + 1);
+			i -= (pd_idx + 1);
 			break;
 		default:
 			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
@@ -1416,25 +1493,25 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		}
 		break;
 	case 6:
-		if (i == raid6_next_disk(sh->pd_idx, raid_disks))
+		if (i == raid6_next_disk(pd_idx, raid_disks))
 			return 0; /* It is the Q disk */
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-		  	if (sh->pd_idx == raid_disks-1)
-				i--; 	/* Q D D D P */
-			else if (i > sh->pd_idx)
+			if (pd_idx == raid_disks-1)
+				i--;	/* Q D D D P */
+			else if (i > pd_idx)
 				i -= 2; /* D D P Q D */
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			if (sh->pd_idx == raid_disks-1)
+			if (pd_idx == raid_disks-1)
 				i--; /* Q D D D P */
 			else {
 				/* D D P Q D */
-				if (i < sh->pd_idx)
+				if (i < pd_idx)
 					i += raid_disks;
-				i -= (sh->pd_idx + 2);
+				i -= (pd_idx + 2);
 			}
 			break;
 		default:
@@ -1448,7 +1525,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset;
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
-	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
+	if (check != sector || dummy1 != dd_idx || dummy2 != pd_idx) {
 		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
 	}
@@ -1515,8 +1592,9 @@ static void copy_data(int frombio, struct bio *bio,
 
 static void compute_parity6(struct stripe_head *sh, int method)
 {
-	raid6_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
+	struct stripe_queue *sq = sh->sq;
+	raid6_conf_t *conf = sq->raid_conf;
+	int i, pd_idx = sq->pd_idx, qd_idx, d0_idx, disks = sq->disks, count;
 	struct bio *chosen;
 	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
 	void *ptrs[disks];
@@ -1532,15 +1610,15 @@ static void compute_parity6(struct stripe_head *sh, int method)
 		BUG();		/* READ_MODIFY_WRITE N/A for RAID-6 */
 	case RECONSTRUCT_WRITE:
 		for (i= disks; i-- ;)
-			if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
+			if (i != pd_idx && i != qd_idx && sq->dev[i].towrite) {
+				chosen = sq->dev[i].towrite;
+				sq->dev[i].towrite = NULL;
 
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 					wake_up(&conf->wait_for_overlap);
 
-				BUG_ON(sh->dev[i].written);
-				sh->dev[i].written = chosen;
+				BUG_ON(sq->dev[i].written);
+				sq->dev[i].written = chosen;
 			}
 		break;
 	case CHECK_PARITY:
@@ -1548,9 +1626,9 @@ static void compute_parity6(struct stripe_head *sh, int method)
 	}
 
 	for (i = disks; i--;)
-		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
-			struct bio *wbi = sh->dev[i].written;
+		if (sq->dev[i].written) {
+			sector_t sector = sq->dev[i].sector;
+			struct bio *wbi = sq->dev[i].written;
 			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
 				copy_data(1, wbi, sh->dev[i].page, sector);
 				wbi = r5_next_bio(wbi, sector);
@@ -1597,9 +1675,10 @@ static void compute_parity6(struct stripe_head *sh, int method)
 /* Compute one missing block */
 static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 {
-	int i, count, disks = sh->disks;
+	struct stripe_queue *sq = sh->sq;
+	int i, count, disks = sq->disks;
 	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sq->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 
 	pr_debug("compute_block_1, stripe %llu, idx %d\n",
@@ -1635,8 +1714,8 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 /* Compute two missing blocks */
 static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 {
-	int i, count, disks = sh->disks;
-	int pd_idx = sh->pd_idx;
+	int i, count, disks = sh->sq->disks;
+	int pd_idx = sh->sq->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 	int d0_idx = raid6_next_disk(qd_idx, disks);
 	int faila, failb;
@@ -1698,8 +1777,9 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 static int
 handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 {
-	int i, pd_idx = sh->pd_idx, disks = sh->disks;
 	int locked = 0;
+	struct stripe_queue *sq = sh->sq;
+	int i, pd_idx = sq->pd_idx, disks = sq->disks;
 
 	if (rcw) {
 		/* if we are not expanding this is a proper write request, and
@@ -1716,8 +1796,9 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
 
-			if (dev->towrite) {
+			if (dev_q->towrite) {
 				set_bit(R5_LOCKED, &dev->flags);
 				if (!expand)
 					clear_bit(R5_UPTODATE, &dev->flags);
@@ -1736,6 +1817,8 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+
 			if (i == pd_idx)
 				continue;
 
@@ -1744,7 +1827,7 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 			 * written so we distinguish these blocks by the
 			 * R5_Wantprexor bit
 			 */
-			if (dev->towrite &&
+			if (dev_q->towrite &&
 			    (test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags))) {
 				set_bit(R5_Wantprexor, &dev->flags);
@@ -1777,7 +1860,8 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
 {
 	struct bio **bip;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int firstwrite=0;
 
 	pr_debug("adding bh b#%llu to stripe s#%llu\n",
@@ -1785,14 +1869,14 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		(unsigned long long)sh->sector);
 
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	if (forwrite) {
-		bip = &sh->dev[dd_idx].towrite;
-		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
+		bip = &sq->dev[dd_idx].towrite;
+		if (*bip == NULL && sq->dev[dd_idx].written == NULL)
 			firstwrite = 1;
 	} else
-		bip = &sh->dev[dd_idx].toread;
+		bip = &sq->dev[dd_idx].toread;
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -1807,7 +1891,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 	*bip = bi;
 	bi->bi_phys_segments ++;
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
@@ -1822,15 +1906,15 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 
 	if (forwrite) {
 		/* check if page is covered */
-		sector_t sector = sh->dev[dd_idx].sector;
-		for (bi=sh->dev[dd_idx].towrite;
-		     sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
+		sector_t sector = sq->dev[dd_idx].sector;
+		for (bi = sq->dev[dd_idx].towrite;
+		     sector < sq->dev[dd_idx].sector + STRIPE_SECTORS &&
 			     bi && bi->bi_sector <= sector;
-		     bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
+		     bi = r5_next_bio(bi, sq->dev[dd_idx].sector)) {
 			if (bi->bi_sector + (bi->bi_size>>9) >= sector)
 				sector = bi->bi_sector + (bi->bi_size>>9);
 		}
-		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
+		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
 	return 1;
@@ -1838,7 +1922,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 	return 0;
 }
 
@@ -1870,6 +1954,8 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 				struct bio **return_bi)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
+
 	for (i = disks; i--; ) {
 		struct bio *bi;
 		int bitmap_end = 0;
@@ -1885,8 +1971,8 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		}
 		spin_lock_irq(&conf->device_lock);
 		/* fail all writes first */
-		bi = sh->dev[i].towrite;
-		sh->dev[i].towrite = NULL;
+		bi = sq->dev[i].towrite;
+		sq->dev[i].towrite = NULL;
 		if (bi) {
 			s->to_write--;
 			bitmap_end = 1;
@@ -1896,8 +1982,8 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 			wake_up(&conf->wait_for_overlap);
 
 		while (bi && bi->bi_sector <
-			sh->dev[i].sector + STRIPE_SECTORS) {
-			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+			sq->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *nextbi = r5_next_bio(bi, sq->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			if (--bi->bi_phys_segments == 0) {
 				md_write_end(conf->mddev);
@@ -1907,12 +1993,12 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 			bi = nextbi;
 		}
 		/* and fail all 'written' */
-		bi = sh->dev[i].written;
-		sh->dev[i].written = NULL;
+		bi = sq->dev[i].written;
+		sq->dev[i].written = NULL;
 		if (bi) bitmap_end = 1;
 		while (bi && bi->bi_sector <
-		       sh->dev[i].sector + STRIPE_SECTORS) {
-			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+		       sq->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *bi2 = r5_next_bio(bi, sq->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			if (--bi->bi_phys_segments == 0) {
 				md_write_end(conf->mddev);
@@ -1928,15 +2014,15 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
 		    (!test_bit(R5_Insync, &sh->dev[i].flags) ||
 		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
-			bi = sh->dev[i].toread;
-			sh->dev[i].toread = NULL;
+			bi = sq->dev[i].toread;
+			sq->dev[i].toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 				wake_up(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
 			while (bi && bi->bi_sector <
-			       sh->dev[i].sector + STRIPE_SECTORS) {
+			       sq->dev[i].sector + STRIPE_SECTORS) {
 				struct bio *nextbi =
-					r5_next_bio(bi, sh->dev[i].sector);
+					r5_next_bio(bi, sq->dev[i].sector);
 				clear_bit(BIO_UPTODATE, &bi->bi_flags);
 				if (--bi->bi_phys_segments == 0) {
 					bi->bi_next = *return_bi;
@@ -1959,22 +2045,25 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
 			struct stripe_head_state *s, int disk_idx, int disks)
 {
+	struct stripe_queue *sq = sh->sq;
 	struct r5dev *dev = &sh->dev[disk_idx];
+	struct r5_queue_dev *dev_q = &sq->dev[disk_idx];
 	struct r5dev *failed_dev = &sh->dev[s->failed_num];
+	struct r5_queue_dev *failed_dev_q = &sq->dev[s->failed_num];
 
 	/* don't schedule compute operations or reads on the parity block while
 	 * a check is in flight
 	 */
-	if ((disk_idx == sh->pd_idx) &&
+	if ((disk_idx == sq->pd_idx) &&
 	     test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
 		return ~0;
 
 	/* is the data in this block needed, and can we get it? */
 	if (!test_bit(R5_LOCKED, &dev->flags) &&
-	    !test_bit(R5_UPTODATE, &dev->flags) && (dev->toread ||
-	    (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+	    !test_bit(R5_UPTODATE, &dev->flags) && (dev_q->toread ||
+	    (dev_q->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
 	     s->syncing || s->expanding || (s->failed &&
-	     (failed_dev->toread || (failed_dev->towrite &&
+	     (failed_dev_q->toread || (failed_dev_q->towrite &&
 	     !test_bit(R5_OVERWRITE, &failed_dev->flags)
 	     ))))) {
 		/* 1/ We would like to get this block, possibly by computing it,
@@ -2057,18 +2146,22 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 			int disks)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
+
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
 		if (!test_bit(R5_LOCKED, &dev->flags) &&
 		    !test_bit(R5_UPTODATE, &dev->flags) &&
-		    (dev->toread || (dev->towrite &&
+		    (dev_q->toread || (dev_q->towrite &&
 		     !test_bit(R5_OVERWRITE, &dev->flags)) ||
 		     s->syncing || s->expanding ||
 		     (s->failed >= 1 &&
-		      (sh->dev[r6s->failed_num[0]].toread ||
+		      (sq->dev[r6s->failed_num[0]].toread ||
 		       s->to_write)) ||
 		     (s->failed >= 2 &&
-		      (sh->dev[r6s->failed_num[1]].toread ||
+		      (sq->dev[r6s->failed_num[1]].toread ||
 		       s->to_write)))) {
 			/* we would like to get this block, possibly
 			 * by computing it, but we might not be able to
@@ -2118,11 +2211,12 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 	struct stripe_head *sh, int disks, struct bio **return_bi)
 {
 	int i;
-	struct r5dev *dev;
+	struct stripe_queue *sq = sh->sq;
 
 	for (i = disks; i--; )
-		if (sh->dev[i].written) {
-			dev = &sh->dev[i];
+		if (sq->dev[i].written) {
+			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) &&
 				test_bit(R5_UPTODATE, &dev->flags)) {
 				/* We can return any write requests */
@@ -2130,11 +2224,11 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
 				spin_lock_irq(&conf->device_lock);
-				wbi = dev->written;
-				dev->written = NULL;
+				wbi = dev_q->written;
+				dev_q->written = NULL;
 				while (wbi && wbi->bi_sector <
-					dev->sector + STRIPE_SECTORS) {
-					wbi2 = r5_next_bio(wbi, dev->sector);
+					dev_q->sector + STRIPE_SECTORS) {
+					wbi2 = r5_next_bio(wbi, dev_q->sector);
 					if (--wbi->bi_phys_segments == 0) {
 						md_write_end(conf->mddev);
 						wbi->bi_next = *return_bi;
@@ -2142,7 +2236,7 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 					}
 					wbi = wbi2;
 				}
-				if (dev->towrite == NULL)
+				if (dev_q->towrite == NULL)
 					bitmap_end = 1;
 				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
@@ -2159,10 +2253,14 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 		struct stripe_head *sh,	struct stripe_head_state *s, int disks)
 {
 	int rmw = 0, rcw = 0, i;
+	struct stripe_queue *sq = sh->sq;
+
 	for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx) &&
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
+		if ((dev_q->towrite || i == sq->pd_idx) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
@@ -2172,7 +2270,7 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 				rmw += 2*disks;  /* cannot read it */
 		}
 		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sq->pd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		    test_bit(R5_Wantcompute, &dev->flags))) {
@@ -2188,7 +2286,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 		/* prefer read-modify-write, but need to get some data */
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx) &&
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+
+			if ((dev_q->towrite || i == sq->pd_idx) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
@@ -2213,8 +2313,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 		/* want reconstruct write, but need to get some data */
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
+
 			if (!test_bit(R5_OVERWRITE, &dev->flags) &&
-			    i != sh->pd_idx &&
+			    i != sq->pd_idx &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
@@ -2256,7 +2357,8 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 		struct stripe_head *sh,	struct stripe_head_state *s,
 		struct r6_state *r6s, int disks)
 {
-	int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
+	struct stripe_queue *sq = sh->sq;
+	int rcw = 0, must_compute = 0, pd_idx = sq->pd_idx, i;
 	int qd_idx = r6s->qd_idx;
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
@@ -2352,6 +2454,7 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 				struct stripe_head_state *s, int disks)
 {
+	struct stripe_queue *sq = sh->sq;
 	set_bit(STRIPE_HANDLE, &sh->state);
 	/* Take one of the following actions:
 	 * 1/ start a check parity operation if (uptodate == disks)
@@ -2363,7 +2466,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 	    !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
 		if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 			BUG_ON(s->uptodate != disks);
-			clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+			clear_bit(R5_UPTODATE, &sh->dev[sq->pd_idx].flags);
 			sh->ops.count++;
 			s->uptodate--;
 		} else if (
@@ -2389,8 +2492,8 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 					set_bit(STRIPE_OP_MOD_REPAIR_PD,
 						&sh->ops.pending);
 					set_bit(R5_Wantcompute,
-						&sh->dev[sh->pd_idx].flags);
-					sh->ops.target = sh->pd_idx;
+						&sh->dev[sq->pd_idx].flags);
+					sh->ops.target = sq->pd_idx;
 					sh->ops.count++;
 					s->uptodate++;
 				}
@@ -2415,9 +2518,10 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending) &&
 		!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
 		struct r5dev *dev;
+
 		/* either failed parity check, or recovery is happening */
 		if (s->failed == 0)
-			s->failed_num = sh->pd_idx;
+			s->failed_num = sq->pd_idx;
 		dev = &sh->dev[s->failed_num];
 		BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
 		BUG_ON(s->uptodate != disks);
@@ -2440,8 +2544,9 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
 				int disks)
 {
 	int update_p = 0, update_q = 0;
+	struct stripe_queue *sq = sh->sq;
 	struct r5dev *dev;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sq->pd_idx;
 	int qd_idx = r6s->qd_idx;
 
 	set_bit(STRIPE_HANDLE, &sh->state);
@@ -2531,18 +2636,21 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				struct r6_state *r6s)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
 
 	/* We have read all the blocks in this stripe and now we need to
 	 * copy some of them into a target stripe for expand.
 	 */
 	struct dma_async_tx_descriptor *tx = NULL;
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	for (i = 0; i < sh->disks; i++)
-		if (i != sh->pd_idx && (!r6s || i != r6s->qd_idx)) {
+	for (i = 0; i < sq->disks; i++)
+		if (i != sq->pd_idx && (!r6s || i != r6s->qd_idx)) {
 			int dd_idx, pd_idx, j;
 			struct stripe_head *sh2;
+			struct stripe_queue *sq2;
 
-			sector_t bn = compute_blocknr(sh, i);
+			sector_t bn = compute_blocknr(conf, sq->disks,
+						sh->sector, sq->pd_idx, i);
 			sector_t s = raid5_compute_sector(bn, conf->raid_disks,
 						conf->raid_disks -
 						conf->max_degraded, &dd_idx,
@@ -2567,12 +2675,13 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				sh->dev[i].page, 0, 0, STRIPE_SIZE,
 				ASYNC_TX_DEP_ACK, tx, NULL, NULL);
 
+			sq2 = sh2->sq;
 			set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
 			set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 			for (j = 0; j < conf->raid_disks; j++)
-				if (j != sh2->pd_idx &&
-				    (!r6s || j != raid6_next_disk(sh2->pd_idx,
-								 sh2->disks)) &&
+				if (j != sq2->pd_idx &&
+				    (!r6s || j != raid6_next_disk(sq2->pd_idx,
+								 sq2->disks)) &&
 				    !test_bit(R5_Expanded, &sh2->dev[j].flags))
 					break;
 			if (j == conf->raid_disks) {
@@ -2608,8 +2717,9 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 
 static void handle_stripe5(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int disks = sh->disks, i;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sh->sq->raid_conf;
+	int disks = sq->disks, i;
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
 	struct r5dev *dev;
@@ -2618,10 +2728,10 @@ static void handle_stripe5(struct stripe_head *sh)
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
-		atomic_read(&sh->count), sh->pd_idx,
+		atomic_read(&sh->count), sq->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
@@ -2634,18 +2744,19 @@ static void handle_stripe5(struct stripe_head *sh)
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
-			"written %p\n",	i, dev->flags, dev->toread, dev->read,
-			dev->towrite, dev->written);
+			"written %p\n",	i, dev->flags, dev_q->toread,
+			dev_q->read, dev_q->towrite, dev_q->written);
 
 		/* maybe we can request a biofill operation
 		 *
 		 * new wantfill requests are only permitted while
 		 * STRIPE_OP_BIOFILL is clear
 		 */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+		if (test_bit(R5_UPTODATE, &dev->flags) && dev_q->toread &&
 			!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 			set_bit(R5_Wantfill, &dev->flags);
 
@@ -2656,14 +2767,14 @@ static void handle_stripe5(struct stripe_head *sh)
 
 		if (test_bit(R5_Wantfill, &dev->flags))
 			s.to_fill++;
-		else if (dev->toread)
+		else if (dev_q->toread)
 			s.to_read++;
-		if (dev->towrite) {
+		if (dev_q->towrite) {
 			s.to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
 				s.non_overwrite++;
 		}
-		if (dev->written)
+		if (dev_q->written)
 			s.written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
@@ -2702,12 +2813,12 @@ static void handle_stripe5(struct stripe_head *sh)
 	/* might be able to return some write requests if the parity block
 	 * is safe, or on a failed drive
 	 */
-	dev = &sh->dev[sh->pd_idx];
+	dev = &sh->dev[sq->pd_idx];
 	if ( s.written &&
 	     ((test_bit(R5_Insync, &dev->flags) &&
 	       !test_bit(R5_LOCKED, &dev->flags) &&
 	       test_bit(R5_UPTODATE, &dev->flags)) ||
-	       (s.failed == 1 && s.failed_num == sh->pd_idx)))
+		(s.failed == 1 && s.failed_num == sq->pd_idx)))
 		handle_completed_write_requests(conf, sh, disks, &return_bi);
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -2752,18 +2863,20 @@ static void handle_stripe5(struct stripe_head *sh)
 		/* All the 'written' buffers and the parity block are ready to
 		 * be written back to disk
 		 */
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sq->pd_idx].flags));
 		for (i = disks; i--; ) {
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+
 			dev = &sh->dev[i];
 			if (test_bit(R5_LOCKED, &dev->flags) &&
-				(i == sh->pd_idx || dev->written)) {
+				(i == sq->pd_idx || dev_q->written)) {
 				pr_debug("Writing block %d\n", i);
 				set_bit(R5_Wantwrite, &dev->flags);
 				if (!test_and_set_bit(
 				    STRIPE_OP_IO, &sh->ops.pending))
 					sh->ops.count++;
 				if (!test_bit(R5_Insync, &dev->flags) ||
-				    (i == sh->pd_idx && s.failed == 0))
+				    (i == sq->pd_idx && s.failed == 0))
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
@@ -2850,8 +2963,8 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
 		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		/* Need to write out all blocks after computing parity */
-		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+		sq->disks = conf->raid_disks;
+		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
 			conf->raid_disks);
 		s.locked += handle_write_operations5(sh, 1, 1);
 	} else if (s.expanded &&
@@ -2868,7 +2981,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (sh->ops.count)
 		pending = get_stripe_work(sh);
 
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	if (pending)
 		raid5_run_ops(sh, pending);
@@ -2879,10 +2992,11 @@ static void handle_stripe5(struct stripe_head *sh)
 
 static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 {
-	raid6_conf_t *conf = sh->raid_conf;
-	int disks = sh->disks;
+	struct stripe_queue *sq = sh->sq;
+	raid6_conf_t *conf = sq->raid_conf;
+	int disks = sq->disks;
 	struct bio *return_bi = NULL;
-	int i, pd_idx = sh->pd_idx;
+	int i, pd_idx = sq->pd_idx;
 	struct stripe_head_state s;
 	struct r6_state r6s;
 	struct r5dev *dev, *pdev, *qdev;
@@ -2894,7 +3008,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
 	memset(&s, 0, sizeof(s));
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
@@ -2906,24 +3020,28 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
+			i, dev->flags, dev_q->toread, dev_q->towrite,
+			dev_q->written);
 		/* maybe we can reply to a read */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
+		if (test_bit(R5_UPTODATE, &dev->flags) && dev_q->toread) {
 			struct bio *rbi, *rbi2;
 			pr_debug("Return read for disc %d\n", i);
 			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
+			rbi = dev_q->toread;
+			dev_q->toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
 				wake_up(&conf->wait_for_overlap);
 			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
+			while (rbi && rbi->bi_sector <
+			       dev_q->sector + STRIPE_SECTORS) {
+				copy_data(0, rbi, dev->page, dev_q->sector);
+				rbi2 = r5_next_bio(rbi, dev_q->sector);
 				spin_lock_irq(&conf->device_lock);
 				if (--rbi->bi_phys_segments == 0) {
 					rbi->bi_next = return_bi;
@@ -2939,14 +3057,14 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
 
 
-		if (dev->toread)
+		if (dev_q->toread)
 			s.to_read++;
-		if (dev->towrite) {
+		if (dev_q->towrite) {
 			s.to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
 				s.non_overwrite++;
 		}
-		if (dev->written)
+		if (dev_q->written)
 			s.written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
@@ -3047,8 +3165,8 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
-		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+		sq->disks = conf->raid_disks;
+		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
 					     conf->raid_disks);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
@@ -3067,7 +3185,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	if (s.expanding && s.locked == 0)
 		handle_stripe_expansion(conf, sh, &r6s);
 
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	return_io(return_bi);
 
@@ -3133,7 +3251,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 {
-	if (sh->raid_conf->level == 6)
+	if (sh->sq->raid_conf->level == 6)
 		handle_stripe6(sh, tmp_page);
 	else
 		handle_stripe5(sh);
@@ -3677,14 +3795,17 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 		/* If any of this stripe is beyond the end of the old
 		 * array, then we need to zero those blocks
 		 */
-		for (j=sh->disks; j--;) {
+		for (j = sh->sq->disks; j--;) {
 			sector_t s;
-			if (j == sh->pd_idx)
+			int pd_idx = sh->sq->pd_idx;
+
+			if (j == pd_idx)
 				continue;
 			if (conf->level == 6 &&
-			    j == raid6_next_disk(sh->pd_idx, sh->disks))
+			    j == raid6_next_disk(pd_idx, sh->sq->disks))
 				continue;
-			s = compute_blocknr(sh, j);
+			s = compute_blocknr(conf, sh->sq->disks, sh->sector,
+					    pd_idx, j);
 			if (s < (mddev->array_size<<1)) {
 				skipped = 1;
 				continue;
@@ -3736,6 +3857,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int pd_idx;
 	int raid_disks = conf->raid_disks;
 	sector_t max_sector = mddev->size << 1;
@@ -3792,6 +3914,8 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 		 */
 		schedule_timeout_uninterruptible(1);
 	}
+	sq = sh->sq;
+
 	/* Need to check if array will still be degraded after recovery/resync
 	 * We don't need to check the 'failed' flag as when that gets set,
 	 * recovery aborts.
@@ -3802,10 +3926,10 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	set_bit(STRIPE_SYNCING, &sh->state);
 	clear_bit(STRIPE_INSYNC, &sh->state);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	handle_stripe(sh, NULL);
 	release_stripe(sh);
@@ -3826,6 +3950,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int dd_idx, pd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
@@ -3859,6 +3984,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 			return handled;
 		}
 
+		sq = sh->sq;
 		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
@@ -4325,16 +4451,16 @@ static int stop(mddev_t *mddev)
 static void print_sh (struct seq_file *seq, struct stripe_head *sh)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
 
 	seq_printf(seq, "sh %llu, pd_idx %d, state %ld.\n",
-		   (unsigned long long)sh->sector, sh->pd_idx, sh->state);
+		   (unsigned long long)sh->sector, sq->pd_idx, sh->state);
 	seq_printf(seq, "sh %llu,  count %d.\n",
 		   (unsigned long long)sh->sector, atomic_read(&sh->count));
 	seq_printf(seq, "sh %llu, ", (unsigned long long)sh->sector);
-	for (i = 0; i < sh->disks; i++) {
+	for (i = 0; i < sq->disks; i++)
 		seq_printf(seq, "(cache%d: %p %ld) ",
 			   i, sh->dev[i].page, sh->dev[i].flags);
-	}
 	seq_printf(seq, "\n");
 }
 
@@ -4347,7 +4473,7 @@ static void printall (struct seq_file *seq, raid5_conf_t *conf)
 	spin_lock_irq(&conf->device_lock);
 	for (i = 0; i < NR_HASH; i++) {
 		hlist_for_each_entry(sh, hn, &conf->stripe_hashtbl[i], hash) {
-			if (sh->raid_conf != conf)
+			if (sh->sq->raid_conf != conf)
 				continue;
 			print_sh(seq, sh);
 		}
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 93678f5..857e2bf 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -158,17 +158,14 @@
  *    the compute block completes.
  */
 
+struct stripe_queue;
 struct stripe_head {
 	struct hlist_node	hash;
 	struct list_head	lru;			/* inactive_list or handle_list */
-	struct raid5_private_data	*raid_conf;
 	sector_t		sector;			/* sector of this row */
-	int			pd_idx;			/* parity disk index */
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
-	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
-	int			disks;			/* disks in stripe */
 	/* stripe_operations
 	 * @pending - pending ops flags (set for request->issue->complete)
 	 * @ack - submitted ops flags (set for issue->complete)
@@ -184,13 +181,12 @@ struct stripe_head {
 		int		   count;
 		u32		   zero_sum_result;
 	} ops;
+	struct stripe_queue *sq;
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
 		struct page	*page;
-		struct bio	*toread, *read, *towrite, *written;
-		sector_t	sector;			/* sector of this page */
-		unsigned long	flags;
+		unsigned long flags;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
 
@@ -209,6 +205,18 @@ struct r6_state {
 	int p_failed, q_failed, qd_idx, failed_num[2];
 };
 
+struct stripe_queue {
+	sector_t sector;
+	spinlock_t lock; /* protect bio lists and stripe_head state */
+	struct raid5_private_data *raid_conf;
+	int pd_idx; /* parity disk index */
+	int disks; /* disks in stripe */
+	struct r5_queue_dev {
+		sector_t sector; /* hw starting sector for this block */
+		struct bio *toread, *read, *towrite, *written;
+	} dev[1];
+};
+
 /* Flags */
 #define	R5_UPTODATE	0	/* page contains current data */
 #define	R5_LOCKED	1	/* IO has been submitted on "req" */
@@ -328,8 +336,10 @@ struct raid5_private_data {
 	 * two caches.
 	 */
 	int			active_name;
-	char			cache_name[2][20];
-	struct kmem_cache		*slab_cache; /* for allocating stripes */
+	char			sh_cache_name[2][20];
+	char			sq_cache_name[2][20];
+	struct kmem_cache	*sh_slab_cache;
+	struct kmem_cache	*sq_slab_cache;
 
 	int			seq_flush, seq_write;
 	int			quiesce;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -mm 2/4] raid5: split allocation of stripe_heads and stripe_queues
  2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
  2007-10-06 17:06 ` [PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3) Dan Williams
@ 2007-10-06 17:06 ` Dan Williams
  2007-10-06 17:06 ` [PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio Dan Williams
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2007-10-06 17:06 UTC (permalink / raw)
  To: neilb, akpm; +Cc: linux-raid

Provide separate routines for allocating stripe_head and stripe_queue
objects and introduce 'io_weight' bitmaps to struct stripe_queue.

The io_weight bitmaps provide an efficient way to determine what is pending in
a stripe_queue: a per-word 'hweight' population count replaces a 'for' loop
over every device.
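
As an illustration only (not taken verbatim from the patch), the difference
is between walking every device and doing a population count per bitmap
word.  A minimal sketch, using the generic hweight_long()/BITS_TO_LONGS()
helpers rather than the hweight32/hweight64 wrapper the patch defines:

        /* sketch: count pending blocks in a one-bit-per-device bitmap */
        static unsigned long pending_blocks(const unsigned long *bitmap, int disks)
        {
                unsigned long weight = 0;
                int i;

                /* one population count per word instead of one test_bit() per disk */
                for (i = 0; i < BITS_TO_LONGS(disks); i++)
                        weight += hweight_long(bitmap[i]);

                return weight;
        }

A caller can then, for example, compare the weight of the 'overwrite' bitmap
against the number of data disks to recognize a full-stripe write, which is
what the later queue-prioritization patch relies on.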

Tested-by: Mr. James W. Laferriere <babydr@baby-dragons.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  316 ++++++++++++++++++++++++++++++++------------
 include/linux/raid/raid5.h |   11 +-
 2 files changed, 239 insertions(+), 88 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a13de7d..7bc206c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -65,6 +65,7 @@
 #define	IO_THRESHOLD		1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
+#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -78,6 +79,8 @@
  * of the current stripe+device
  */
 #define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
+#define r5_io_weight_size(devs) (sizeof(unsigned long) * \
+				  (ALIGN(devs, BITS_PER_LONG) / BITS_PER_LONG))
 /*
  * The following can be used to debug the driver
  */
@@ -120,6 +123,21 @@ static void return_io(struct bio *return_bi)
 	}
 }
 
+#if BITS_PER_LONG == 32
+#define hweight hweight32
+#else
+#define hweight hweight64
+#endif
+static unsigned long io_weight(unsigned long *bitmap, int disks)
+{
+	unsigned long weight = hweight(*bitmap);
+
+	for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++)
+		weight += hweight(*bitmap);
+
+	return weight;
+}
+
 static void print_raid5_conf (raid5_conf_t *conf);
 
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
@@ -236,36 +254,37 @@ static int grow_buffers(struct stripe_head *sh, int num)
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
+static void init_queue(struct stripe_queue *sq, sector_t sector,
+		int disks, int pd_idx);
+
+static void
+init_stripe(struct stripe_head *sh, struct stripe_queue *sq,
+	     sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i;
 
+	pr_debug("init_stripe called, stripe %llu\n",
+		(unsigned long long)sector);
+
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+	init_queue(sh->sq, sector, disks, pd_idx);
 
 	CHECK_DEVLOCK();
-	pr_debug("init_stripe called, stripe %llu\n",
-		(unsigned long long)sh->sector);
 
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
-	sh->sq->disks = disks;
-
 	for (i = disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
-		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev_q->toread || dev_q->read || dev_q->towrite ||
-		    dev_q->written || test_bit(R5_LOCKED, &dev->flags)) {
-			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev_q->toread,
-			       dev_q->read, dev_q->towrite, dev_q->written,
+		if (test_bit(R5_LOCKED, &dev->flags)) {
+			printk(KERN_ERR "sector=%llx i=%d %d\n",
+			       (unsigned long long)sector, i,
 			       test_bit(R5_LOCKED, &dev->flags));
 			BUG();
 		}
@@ -283,7 +302,7 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	CHECK_DEVLOCK();
 	pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-		if (sh->sector == sector && sh->sq->disks == disks)
+		if (sh->sector == sector && disks == disks)
 			return sh;
 	pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
@@ -326,7 +345,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 					);
 				conf->inactive_blocked = 0;
 			} else
-				init_stripe(sh, sector, pd_idx, disks);
+				init_stripe(sh, sh->sq, sector, pd_idx, disks);
 		} else {
 			if (atomic_read(&sh->count)) {
 			  BUG_ON(!list_empty(&sh->lru));
@@ -348,6 +367,39 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 	return sh;
 }
 
+static void init_queue(struct stripe_queue *sq, sector_t sector,
+		int disks, int pd_idx)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	int i;
+
+	pr_debug("%s: %llu -> %llu [%p]\n",
+		__FUNCTION__, (unsigned long long) sq->sector,
+		(unsigned long long) sector, sq);
+
+	BUG_ON(io_weight(sq->to_read, disks));
+	BUG_ON(io_weight(sq->to_write, disks));
+	BUG_ON(io_weight(sq->overwrite, disks));
+
+	sq->sector = sector;
+	sq->pd_idx = pd_idx;
+	sq->disks = disks;
+
+	for (i = disks; i--;) {
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
+		if (dev_q->toread || dev_q->read || dev_q->towrite ||
+		    dev_q->written) {
+			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p\n",
+			       (unsigned long long)sq->sector, i, dev_q->toread,
+			       dev_q->read, dev_q->towrite, dev_q->written);
+			BUG();
+		}
+		dev_q->sector = compute_blocknr(conf, disks, sector, pd_idx, i);
+	}
+}
+
+
 /* test_and_ack_op() ensures that we only dequeue an operation once */
 #define test_and_ack_op(op, pend) \
 do {							\
@@ -570,21 +622,23 @@ static void ops_complete_biofill(void *stripe_head_ref)
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
-	for (i = sh->sq->disks; i--;) {
+	for (i = sq->disks; i--;) {
 		struct r5dev *dev = &sh->dev[i];
-		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
 			dev_q->read = rbi = dev_q->toread;
 			dev_q->toread = NULL;
+			clear_bit(i, sq->to_read);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev_q->sector + STRIPE_SECTORS) {
@@ -669,9 +723,9 @@ static struct dma_async_tx_descriptor *
 ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->sq->disks;
-	struct page *xor_srcs[disks];
 	struct stripe_queue *sq = sh->sq;
+	int disks = sq->disks;
+	struct page *xor_srcs[disks];
 	int count = 0, pd_idx = sq->pd_idx, i;
 
 	/* existing parity data subtracted */
@@ -698,9 +752,10 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 static struct dma_async_tx_descriptor *
 ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
-	int disks = sh->sq->disks;
 	struct stripe_queue *sq = sh->sq;
-	int pd_idx = sq->pd_idx, i;
+	int disks = sq->disks;
+	int pd_idx = sq->pd_idx;
+	int i;
 
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
@@ -733,6 +788,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 			spin_lock(&sq->lock);
 			chosen = dev_q->towrite;
 			dev_q->towrite = NULL;
+			clear_bit(i, sq->to_write);
 			BUG_ON(dev_q->written);
 			wbi = dev_q->written = chosen;
 			spin_unlock(&sq->lock);
@@ -793,7 +849,9 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	int disks = sq->disks;
 	struct page *xor_srcs[disks];
 
-	int count = 0, pd_idx = sh->sq->pd_idx, i;
+	int count = 0;
+	int pd_idx = sq->pd_idx;
+	int i;
 	struct page *xor_dest;
 	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 	unsigned long flags;
@@ -866,11 +924,14 @@ static void ops_complete_check(void *stripe_head_ref)
 static void ops_run_check(struct stripe_head *sh)
 {
 	/* kernel stack size limits the total number of disks */
-	int disks = sh->sq->disks;
+	struct stripe_queue *sq = sh->sq;
+	int disks = sq->disks;
 	struct page *xor_srcs[disks];
 	struct dma_async_tx_descriptor *tx;
 
-	int count = 0, pd_idx = sh->sq->pd_idx, i;
+	int count = 0;
+	int pd_idx = sq->pd_idx;
+	int i;
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -897,7 +958,10 @@ static void ops_run_check(struct stripe_head *sh)
 
 static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 {
-	int overlap_clear = 0, i, disks = sh->sq->disks;
+	struct stripe_queue *sq = sh->sq;
+	int overlap_clear = 0;
+	int disks = sq->disks;
+	int i;
 	struct dma_async_tx_descriptor *tx = NULL;
 
 	if (test_bit(STRIPE_OP_BIOFILL, &pending)) {
@@ -926,43 +990,29 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		ops_run_io(sh);
 
 	if (overlap_clear) {
-		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->sq->raid_conf->wait_for_overlap);
-		}
+		for (i = disks; i--;)
+			if (test_and_clear_bit(i, sq->overlap))
+				wake_up(&sq->raid_conf->wait_for_overlap);
 	}
 }
 
+static struct stripe_queue *grow_one_queue(raid5_conf_t *conf);
+
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
-	struct stripe_queue *sq;
-
 	sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
-
-	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
-	if (!sq) {
-		kmem_cache_free(conf->sh_slab_cache, sh);
-		return 0;
-	}
-
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
-	memset(sq, 0, sizeof(*sq) +
-		(conf->raid_disks-1) * sizeof(struct r5_queue_dev));
-	sh->sq = sq;
-	sq->raid_conf = conf;
-	spin_lock_init(&sq->lock);
+	sh->sq = grow_one_queue(conf);
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
 		kmem_cache_free(conf->sh_slab_cache, sh);
-		kmem_cache_free(conf->sq_slab_cache, sq);
 		return 0;
 	}
-	sq->disks = conf->raid_disks;
+
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
@@ -973,6 +1023,37 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	return 1;
 }
 
+static struct stripe_queue *grow_one_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
+	int disks = conf->raid_disks;
+	void *weight_map;
+	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
+	if (!sq)
+		return 0;
+	memset(sq, 0, (sizeof(*sq)+(disks-1) * sizeof(struct r5_queue_dev)) +
+		r5_io_weight_size(disks) + r5_io_weight_size(disks) +
+		r5_io_weight_size(disks) + r5_io_weight_size(disks));
+
+	/* set the queue weight bitmaps to the free space at the end of sq */
+	weight_map = ((void *) sq) + offsetof(typeof(*sq), dev) +
+			sizeof(struct r5_queue_dev) * disks;
+	sq->to_read = weight_map;
+	weight_map += r5_io_weight_size(disks);
+	sq->to_write = weight_map;
+	weight_map += r5_io_weight_size(disks);
+	sq->overwrite = weight_map;
+	weight_map += r5_io_weight_size(disks);
+	sq->overlap = weight_map;
+
+	spin_lock_init(&sq->lock);
+	sq->sector = MaxSector;
+	sq->raid_conf = conf;
+	sq->disks = disks;
+
+	return sq;
+}
+
 static int grow_stripes(raid5_conf_t *conf, int num)
 {
 	struct kmem_cache *sc;
@@ -993,9 +1074,12 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 	conf->pool_size = devs;
 
 	sc = kmem_cache_create(conf->sq_cache_name[conf->active_name],
-		sizeof(struct stripe_queue) +
-		(devs-1)*sizeof(struct r5_queue_dev), 0, 0, NULL);
-
+			       (sizeof(struct stripe_queue)+(devs-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(devs) +
+				r5_io_weight_size(devs) +
+				r5_io_weight_size(devs) +
+				r5_io_weight_size(devs), 0, 0, NULL);
 	if (!sc)
 		return 1;
 	conf->sq_slab_cache = sc;
@@ -1003,6 +1087,7 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 	while (num--)
 		if (!grow_one_stripe(conf))
 			return 1;
+
 	return 0;
 }
 
@@ -1033,11 +1118,13 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	 * so we use GFP_NOIO allocations.
 	 */
 	struct stripe_head *osh, *nsh;
+	struct stripe_queue *nsq;
 	LIST_HEAD(newstripes);
+	LIST_HEAD(newqueues);
 	struct disk_info *ndisks;
 	int err = 0;
 	struct kmem_cache *sc, *sc_q;
-	int i;
+	int i, j;
 
 	if (newsize <= conf->pool_size)
 		return 0; /* never bother to shrink */
@@ -1051,45 +1138,88 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	if (!sc)
 		return -ENOMEM;
 
-	sc_q = kmem_cache_create(conf->sh_cache_name[1-conf->active_name],
-		    sizeof(struct stripe_queue) +
-		    (newsize-1)*sizeof(struct r5_queue_dev), 0, 0, NULL);
+	sc_q = kmem_cache_create(conf->sq_cache_name[conf->active_name],
+			       (sizeof(struct stripe_queue)+(newsize-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize),
+				0, 0, NULL);
+
 	if (!sc_q) {
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
 	}
 
 	for (i = conf->max_nr_stripes; i; i--) {
-		struct stripe_queue *nsq;
+		struct stripe_queue *nsq_per_sh[STRIPE_QUEUE_SIZE];
 
 		nsh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!nsh)
 			break;
 
-		nsq = kmem_cache_alloc(sc_q, GFP_KERNEL);
-		if (!nsq) {
+		/* allocate STRIPE_QUEUE_SIZE queues per stripe */
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++)
+			nsq_per_sh[j] = kmem_cache_alloc(sc_q, GFP_KERNEL);
+
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++)
+			if (!nsq_per_sh[j])
+				break;
+
+		if (j <= ARRAY_SIZE(nsq_per_sh)) {
 			kmem_cache_free(sc, nsh);
+			do
+				if (nsq_per_sh[j])
+					kmem_cache_free(sc_q, nsq_per_sh[j]);
+			while (--j >= 0);
 			break;
 		}
 
 		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
-		memset(nsq, 0, sizeof(*nsq) +
-			(newsize-1)*sizeof(struct r5_queue_dev));
-
-		nsq->raid_conf = conf;
-		nsh->sq = nsq;
-		spin_lock_init(&nsq->lock);
-
 		list_add(&nsh->lru, &newstripes);
+
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++) {
+			void *weight_map;
+			nsq = nsq_per_sh[j];
+			memset(nsq, 0, (sizeof(*nsq)+(newsize-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize));
+			/* set the queue weight bitmaps to the free space at
+			 * the end of nsq
+			 */
+			weight_map = ((void *) nsq) +
+					offsetof(typeof(*nsq), dev) +
+					sizeof(struct r5_queue_dev) * newsize;
+			nsq->to_read = weight_map;
+			weight_map += r5_io_weight_size(newsize);
+			nsq->to_write = weight_map;
+			weight_map += r5_io_weight_size(newsize);
+			nsq->overwrite = weight_map;
+			weight_map += r5_io_weight_size(newsize);
+			nsq->overlap = weight_map;
+			nsq->raid_conf = conf;
+			spin_lock_init(&nsq->lock);
+			list_add(&nsq->list_node, &newqueues);
+		}
 	}
 	if (i) {
 		/* didn't get enough, give up */
 		while (!list_empty(&newstripes)) {
 			nsh = list_entry(newstripes.next, struct stripe_head, lru);
 			list_del(&nsh->lru);
-			kmem_cache_free(sc_q, nsh->sq);
 			kmem_cache_free(sc, nsh);
 		}
+		while (!list_empty(&newqueues)) {
+			nsq = list_entry(newqueues.next,
+					 struct stripe_queue,
+					 list_node);
+			list_del(&nsh->lru);
+			kmem_cache_free(sc_q, nsq);
+		}
 		kmem_cache_destroy(sc_q);
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
@@ -1133,8 +1263,11 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 		err = -ENOMEM;
 
 	/* Step 4, return new stripes to service */
-	while(!list_empty(&newstripes)) {
+	while (!list_empty(&newstripes)) {
+		nsq = list_entry(newqueues.next, struct stripe_queue,
+					list_node);
 		nsh = list_entry(newstripes.next, struct stripe_head, lru);
+		list_del_init(&nsq->list_node);
 		list_del_init(&nsh->lru);
 		for (i=conf->raid_disks; i < newsize; i++)
 			if (nsh->dev[i].page == NULL) {
@@ -1143,6 +1276,7 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 				if (!p)
 					err = -ENOMEM;
 			}
+		nsh->sq = nsq;
 		release_stripe(nsh);
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
@@ -1191,9 +1325,11 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 				   int error)
 {
  	struct stripe_head *sh = bi->bi_private;
-	raid5_conf_t *conf = sh->sq->raid_conf;
-	int disks = sh->sq->disks, i;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
+	int disks = sq->disks;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	int i;
 	char b[BDEVNAME_SIZE];
 	mdk_rdev_t *rdev;
 
@@ -1271,8 +1407,9 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
  	struct stripe_head *sh = bi->bi_private;
 	struct stripe_queue *sq = sh->sq;
 	raid5_conf_t *conf = sq->raid_conf;
-	int disks = sq->disks, i;
+	int disks = sq->disks;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	int i;
 
 	if (bi->bi_size)
 		return 1;
@@ -1303,7 +1440,6 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 static void raid5_build_block (struct stripe_head *sh, int i)
 {
 	struct r5dev *dev = &sh->dev[i];
-	struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
 	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
@@ -1315,10 +1451,6 @@ static void raid5_build_block (struct stripe_head *sh, int i)
 
 	dev->req.bi_sector = sh->sector;
 	dev->req.bi_private = sh;
-
-	dev->flags = 0;
-	dev_q->sector = compute_blocknr(sh->sq->raid_conf, sh->sq->disks,
-			sh->sector, sh->sq->pd_idx, i);
 }
 
 static void error(mddev_t *mddev, mdk_rdev_t *rdev)
@@ -1613,8 +1745,9 @@ static void compute_parity6(struct stripe_head *sh, int method)
 			if (i != pd_idx && i != qd_idx && sq->dev[i].towrite) {
 				chosen = sq->dev[i].towrite;
 				sq->dev[i].towrite = NULL;
+				clear_bit(i, sq->to_write);
 
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				if (test_and_clear_bit(i, sq->overlap))
 					wake_up(&conf->wait_for_overlap);
 
 				BUG_ON(sq->dev[i].written);
@@ -1714,8 +1847,9 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 /* Compute two missing blocks */
 static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 {
-	int i, count, disks = sh->sq->disks;
-	int pd_idx = sh->sq->pd_idx;
+	struct stripe_queue *sq = sh->sq;
+	int i, count, disks = sq->disks;
+	int pd_idx = sq->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 	int d0_idx = raid6_next_disk(qd_idx, disks);
 	int faila, failb;
@@ -1917,10 +2051,11 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
+
 	return 1;
 
  overlap:
-	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	set_bit(dd_idx, sq->overlap);
 	spin_unlock_irq(&conf->device_lock);
 	spin_unlock(&sq->lock);
 	return 0;
@@ -1973,12 +2108,13 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		/* fail all writes first */
 		bi = sq->dev[i].towrite;
 		sq->dev[i].towrite = NULL;
+		clear_bit(i, sq->to_write);
 		if (bi) {
 			s->to_write--;
 			bitmap_end = 1;
 		}
 
-		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+		if (test_and_clear_bit(i, sq->overlap))
 			wake_up(&conf->wait_for_overlap);
 
 		while (bi && bi->bi_sector <
@@ -2016,7 +2152,8 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
 			bi = sq->dev[i].toread;
 			sq->dev[i].toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+			clear_bit(i, sq->to_read);
+			if (test_and_clear_bit(i, sq->overlap))
 				wake_up(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
 			while (bi && bi->bi_sector <
@@ -2718,7 +2855,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 static void handle_stripe5(struct stripe_head *sh)
 {
 	struct stripe_queue *sq = sh->sq;
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	raid5_conf_t *conf = sq->raid_conf;
 	int disks = sq->disks, i;
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
@@ -2746,6 +2883,8 @@ static void handle_stripe5(struct stripe_head *sh)
 		struct r5dev *dev = &sh->dev[i];
 		struct r5_queue_dev *dev_q = &sq->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
+		if (test_and_clear_bit(i, sq->overwrite))
+			set_bit(R5_OVERWRITE, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
 			"written %p\n",	i, dev->flags, dev_q->toread,
@@ -3024,6 +3163,8 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
+		if (test_and_clear_bit(i, sq->overwrite))
+			set_bit(R5_OVERWRITE, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev_q->toread, dev_q->towrite,
@@ -3035,7 +3176,8 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			spin_lock_irq(&conf->device_lock);
 			rbi = dev_q->toread;
 			dev_q->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
+			clear_bit(i, sq->to_read);
+			if (test_and_clear_bit(i, sq->overlap))
 				wake_up(&conf->wait_for_overlap);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
@@ -3735,6 +3877,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int pd_idx;
 	sector_t first_sector, last_sector;
 	int raid_disks = conf->previous_raid_disks;
@@ -3790,21 +3933,22 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 		pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
 		sh = get_active_stripe(conf, sector_nr+i,
 				       conf->raid_disks, pd_idx, 0);
+		sq = sh->sq;
 		set_bit(STRIPE_EXPANDING, &sh->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
 		 * array, then we need to zero those blocks
 		 */
-		for (j = sh->sq->disks; j--;) {
+		for (j = sq->disks; j--;) {
 			sector_t s;
 			int pd_idx = sh->sq->pd_idx;
 
 			if (j == pd_idx)
 				continue;
 			if (conf->level == 6 &&
-			    j == raid6_next_disk(pd_idx, sh->sq->disks))
+			    j == raid6_next_disk(pd_idx, sq->disks))
 				continue;
-			s = compute_blocknr(conf, sh->sq->disks, sh->sector,
+			s = compute_blocknr(conf, sq->disks, sh->sector,
 					    pd_idx, j);
 			if (s < (mddev->array_size<<1)) {
 				skipped = 1;
@@ -3950,7 +4094,6 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
 	struct stripe_head *sh;
-	struct stripe_queue *sq;
 	int dd_idx, pd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
@@ -3984,7 +4127,6 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 			return handled;
 		}
 
-		sq = sh->sq;
 		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 857e2bf..fbe622c 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -207,8 +207,18 @@ struct r6_state {
 
 struct stripe_queue {
 	sector_t sector;
+	/* stripe queues are allocated with extra space to hold the following
+	 * four bitmaps.  One bit for each block in the stripe_head.  These
+	 * bitmaps enable use of hweight to count the number of blocks
+	 * undergoing read, write, overwrite.
+	 */
+	unsigned long *to_read;
+	unsigned long *to_write;
+	unsigned long *overwrite;
+	unsigned long *overlap; /* There is a pending overlapping request */
 	spinlock_t lock; /* protect bio lists and stripe_head state */
 	struct raid5_private_data *raid_conf;
+	struct list_head list_node;
 	int pd_idx; /* parity disk index */
 	int disks; /* disks in stripe */
 	struct r5_queue_dev {
@@ -225,7 +235,6 @@ struct stripe_queue {
 #define	R5_Insync	3	/* rdev && rdev->in_sync at start */
 #define	R5_Wantread	4	/* want to schedule a read */
 #define	R5_Wantwrite	5
-#define	R5_Overlap	7	/* There is a pending overlapping request on this block */
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio
  2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
  2007-10-06 17:06 ` [PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3) Dan Williams
  2007-10-06 17:06 ` [PATCH -mm 2/4] raid5: split allocation of stripe_heads and stripe_queues Dan Williams
@ 2007-10-06 17:06 ` Dan Williams
  2007-10-06 17:06 ` [PATCH -mm 4/4] raid5: use stripe_queues to prioritize the "most deserving" requests (rev7) Dan Williams
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2007-10-06 17:06 UTC (permalink / raw)
  To: neilb, akpm; +Cc: linux-raid

The stripe_queue object collects i/o requests before they are handled by
the stripe-cache (via the stripe_head object).  add_stripe_bio currently
looks at the state of the stripe-cache to implement bitmap support;
reimplement this using stripe_queue attributes.

Introduce the STRIPE_QUEUE_FIRSTWRITE flag to track when a stripe is first
written.  When a stripe_head is available, record the bitmap batch sequence
number and set STRIPE_BIT_DELAY.  For now a stripe_head will always be
available at 'add_queue_bio' time; going forward, the 'sh' field of the
stripe_queue will indicate whether a stripe_head is attached.
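
Condensed from the hunks below (a sketch, not a literal copy), the handoff
is split in two: add_queue_bio only marks the queue, and the stripe_head
side of the bookkeeping happens once a stripe_head is actually attached in
get_active_stripe:

        /* in add_queue_bio(): first write queued against this stripe */
        if (*bip == NULL && sq->dev[dd_idx].written == NULL) {
                set_bit(STRIPE_QUEUE_FIRSTWRITE, &sq->state);
                firstwrite = 1;         /* bitmap_startwrite() follows below */
        }

        /* in get_active_stripe(): a stripe_head is now attached */
        if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE, &sh->sq->state)) {
                sh->bm_seq = conf->seq_flush + 1;       /* target bitmap batch */
                set_bit(STRIPE_BIT_DELAY, &sh->state);
        }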

Tested-by: Mr. James W. Laferriere <babydr@baby-dragons.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |   53 ++++++++++++++++++++++++++++----------------
 include/linux/raid/raid5.h |    6 +++++
 2 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7bc206c..d566fc9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,8 +31,10 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
- * the number of the batch it will be in. This is bm_flush+1.
+ * (in add_queue_bio) we update the in-memory bitmap and record in the
+ * stripe_queue that a bitmap write was started.  Then, in handle_stripe when
+ * we have a stripe_head available, we update sh->bm_seq to record the
+ * sequence number (target batch number) of this request.  This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
  * When an unplug happens, we increment bm_flush, thus closing the current
@@ -360,8 +362,14 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 		}
 	} while (sh == NULL);
 
-	if (sh)
+	if (sh) {
 		atomic_inc(&sh->count);
+		if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE,
+					&sh->sq->state)) {
+			sh->bm_seq = conf->seq_flush+1;
+			set_bit(STRIPE_BIT_DELAY, &sh->state);
+		}
+	}
 
 	spin_unlock_irq(&conf->device_lock);
 	return sh;
@@ -1991,26 +1999,34 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
+			  int forwrite)
 {
 	struct bio **bip;
-	struct stripe_queue *sq = sh->sq;
 	raid5_conf_t *conf = sq->raid_conf;
 	int firstwrite=0;
 
-	pr_debug("adding bh b#%llu to stripe s#%llu\n",
+	pr_debug("adding bio (%llu) to queue (%llu)\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector);
-
+		(unsigned long long)sq->sector);
 
 	spin_lock(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	if (forwrite) {
 		bip = &sq->dev[dd_idx].towrite;
-		if (*bip == NULL && sq->dev[dd_idx].written == NULL)
+		set_bit(dd_idx, sq->to_write);
+		if (*bip == NULL && sq->dev[dd_idx].written == NULL) {
+			/* flag the queue to be assigned a bitmap
+			 * sequence number
+			 */
+			set_bit(STRIPE_QUEUE_FIRSTWRITE, &sq->state);
 			firstwrite = 1;
-	} else
+		}
+	} else {
 		bip = &sq->dev[dd_idx].toread;
+		set_bit(dd_idx, sq->to_read);
+	}
+
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2024,19 +2040,17 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		bi->bi_next = *bip;
 	*bip = bi;
 	bi->bi_phys_segments ++;
+
 	spin_unlock_irq(&conf->device_lock);
 	spin_unlock(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+		(unsigned long long)sq->sector, dd_idx);
 
-	if (conf->mddev->bitmap && firstwrite) {
-		bitmap_startwrite(conf->mddev->bitmap, sh->sector,
+	if (conf->mddev->bitmap && firstwrite)
+		bitmap_startwrite(conf->mddev->bitmap, sq->sector,
 				  STRIPE_SECTORS, 0);
-		sh->bm_seq = conf->seq_flush+1;
-		set_bit(STRIPE_BIT_DELAY, &sh->state);
-	}
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -2049,7 +2063,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 				sector = bi->bi_sector + (bi->bi_size>>9);
 		}
 		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
-			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+			set_bit(dd_idx, sq->overwrite);
 	}
 
 	return 1;
@@ -3827,7 +3841,8 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			}
 
 			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-			    !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+			    !add_queue_bio(sh->sq, bi, dd_idx,
+					   bi->bi_rw & RW_MASK)) {
 				/* Stripe is busy expanding or
 				 * add failed due to overlap.  Flush everything
 				 * and wait a while
@@ -4128,7 +4143,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 		}
 
 		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
-		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
+		if (!add_queue_bio(sh->sq, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
 			raid_bio->bi_hw_segments = scnt;
 			conf->retry_read_aligned = raid_bio;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index fbe622c..3d4938c 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -218,6 +218,7 @@ struct stripe_queue {
 	unsigned long *overlap; /* There is a pending overlapping request */
 	spinlock_t lock; /* protect bio lists and stripe_head state */
 	struct raid5_private_data *raid_conf;
+	unsigned long state;
 	struct list_head list_node;
 	int pd_idx; /* parity disk index */
 	int disks; /* disks in stripe */
@@ -288,6 +289,11 @@ struct stripe_queue {
 #define STRIPE_OP_MOD_DMA_CHECK 8
 
 /*
+ * Stripe-queue state
+ */
+#define STRIPE_QUEUE_FIRSTWRITE 0
+
+/*
  * Plugging:
  *
  * To improve write throughput, we need to delay the handling of some

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH -mm 4/4] raid5: use stripe_queues to prioritize the "most deserving" requests (rev7)
  2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
                   ` (2 preceding siblings ...)
  2007-10-06 17:06 ` [PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio Dan Williams
@ 2007-10-06 17:06 ` Dan Williams
  2007-10-06 18:34 ` [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Justin Piszcz
  2007-10-09  6:21 ` Neil Brown
  5 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2007-10-06 17:06 UTC (permalink / raw)
  To: neilb, akpm; +Cc: linux-raid

Overview:
Taking advantage of the stripe_queue/stripe_head separation, this patch
implements a queue in front of the stripe cache.  A stripe_queue pool
accepts incoming requests.  As requests are attached, the weight of the
queue object is updated.  A workqueue (raid456_cache_arbiter) is introduced
to control the flow of requests to the stripe cache.  Pressure (weight of
the queue object) can push requests to be processed by the cache
(raid5d).  raid5d also pulls requests when its 'handle' list is empty.

The cache arbiter prioritizes reads and full-stripe-writes, as there is no
performance to be gained by delaying them.  Sub-stripe-width writes are
handled as before by a 'preread-active' mechanism.  The difference now is
that full-stripe-writes can pass delayed-writes waiting for the cache.
Previously there was no opportunity to make this decision: sub-width-writes
would occupy a stripe cache entry from the time they entered the delayed
list until they finished processing.
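
In terms of the io_weight bitmaps, the routing decision reduces to comparing
bit counts against the number of data disks.  A condensed sketch of the
priority check (the full version, including the plugging and workqueue kicks,
is __release_queue() in the patch below):

        int data_disks = sq->disks - conf->max_degraded;
        int to_write = io_weight(sq->to_write, sq->disks);

        if (to_write && io_weight(sq->overwrite, sq->disks) == data_disks)
                /* full-stripe write: no preread needed, highest priority */
                list_add_tail(&sq->list_node, &conf->io_hi_q_list);
        else if (io_weight(sq->to_read, sq->disks))
                list_add_tail(&sq->list_node, &conf->io_lo_q_list);
        else if (to_write && test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state))
                list_add_tail(&sq->list_node, &conf->io_lo_q_list);
        else if (to_write)
                /* sub-width write: hold back, hoping it grows to a full stripe */
                list_add_tail(&sq->list_node, &conf->delayed_q_list);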

Flow:
1/ make_request calls get_active_queue, add_queue_bio, and handle_queue
2/ handle_queue tries, opportunistically, to allocate a stripe_head.  If
   the allocation succeeds the stripe_head is processed immediately; otherwise
   the stripe_queue is placed on a list to be handled by the cache
   arbiter.  The queue routing options are io_hi (for
   full-stripe-writes), io_lo (for reads and preread-active stripes), delayed
   (for preread-inactive stripes), and finally inactive if no i/o is pending.
3/ raid456_cache_arbiter runs and attaches stripe_queues to stripe_heads in
   priority order, io-hi then io-lo.  If the raid device is not plugged and
   there is nothing else to do, it will transition delayed queues to the io-lo
   list.  Since there are more stripe_queues in the system than stripe_heads,
   we will end up sleeping in get_active_stripe.  While we sleep, requests can
   still enter the queue and hopefully promote sub-width-writes to
   full-stripe-writes.

Details:
* the number of stripe_queue objects in the pool is set at 2x the maximum
  number of stripes in the stripe_cache (STRIPE_QUEUE_SIZE).
* stripe_queues are tracked in a red-black-tree
* a stripe_queue is considered active while it has pending i/o
* get_active_stripe increments the count on the stripe_queue such that the
  stripe_queue will not be inactivated until after the stripe_head is
  inactivated
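
To make the sizing concrete (illustrative numbers, not from the patch text):
with a stripe cache of, say, 256 entries and STRIPE_QUEUE_SIZE == 2 there are
512 stripe_queues.  get_active_stripe blocks until fewer than 256 * 3/4 = 192
stripes are active, leaving 64 free; __wait_for_inactive_queue blocks until
fewer than 512 * 7/8 = 448 queues are active, again leaving 64 free.  This is
the 7/8's threshold discussed under 'Changes in rev2' below: both pools free
up the same absolute number of objects, the queue pool simply being twice as
deep.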

Changes in rev2:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename queue_weight -> io_weight
* fix r5_io_weight_size
* implement support for sysfs changes to stripe_cache_size
* delete and re-add stripe queues from their management lists rather than
  moving them.  This guarantees that when the count is non-zero the
  queue is not on a list (identical to stripe_head handling)
* __wait_for_inactive_queue was incorrectly using conf->inactive_blocked
  which is exclusively for the stripe_cache.  Added
  conf->inactive_queue_blocked and set the routine to wait until the number
  of active queues drops below 7/8's of the total before unblocking
  processing.  7/8's arises from the following: get_active_stripe waits for
  3/4's of the stripe cache i.e. 1/4 inactive. conf->max_nr_stripes / 4 ==
  conf->max_nr_stripes * STRIPE_QUEUE_SIZE / 8 iff STRIPE_QUEUE_SIZE == 2
* change raid5_congested to report whether the queue is congested and not
  the cache.

Changes in rev3:
* rename raid5qd => raid456_cache_arbiter
* make raid456_cache_arbiter the only thread that can block on a call to
  get_active_stripe; this ensures proper ordering of attachments
* added wait_for_cache_attach for routines outside the i/o path (like
  resync, and reshape) to request servicing from raid456_cache_arbiter
* change cache attachment priorities to io_hi (full-stripe-writes) and
  io_lo (reads and sub-width-stripe-writes)
* changed handle_queue to try to attempt a non-blocking cache attachment,
  this recovers some of the lost read throughput from rev2
* move flags back to r5dev to stay in sync with the buffers
* use sq->overwrite to set R5_OVERWRITE, fixes a data corruption issue with
  stale overwrite flags when attempting to use sq->overwrite in
  handle_stripe

Changes in rev4
* disconnect sq->sh and sh->sq at the end of __release_queue
* remove the implicit get_active_stripe from get_active_queue to ensure
  that writes that need to be delayed go through raid456_cache_arbiter.
  This fixes the performance regression caused by increasing the stripe
  cache size.
* kill __get_active_stripe, not needed

Changes in rev5
* fix retry_aligned_read... check for null returns from get_active_queue
* workqueue leak fix, dmonakhov@openvz.org

Changes in rev6
* place reads/writes to the sq->to_write bitmap under sq->lock to synch up
  with updates to sh->dev[i].towrite
* Fix 'wait_for_stripe'/'wait_for_queue' confusion when performing
  atomic_dec_and_test(&conf->active_aligned_reads)
* Fix: retry_aligned_read needs to call release_stripe on its stripe_head
  after handle_queue; otherwise we deadlock on drive removal.  To make this
  more obvious, handle_queue no longer implicitly releases the stripe_queue.
* kill wait_for_attach
* Fix up stripe_queue documentation

Changes in rev7
* split out the 'add_queue_bio' and object allocation changes into separate
  patches
* fix release_stripe/release_queue ordering
* refactor handle_queue and release_queue to remove STRIPE_QUEUE_HANDLE and
  sq->sh back references
* kill init_sh and allocate init_sq on the stack

Tested-by: Mr. James W. Laferriere <babydr@baby-dragons.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  843 +++++++++++++++++++++++++++++++++-----------
 include/linux/raid/raid5.h |   45 ++
 2 files changed, 666 insertions(+), 222 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d566fc9..eb7fd10 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -67,7 +67,7 @@
 #define	IO_THRESHOLD		1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
-#define STRIPE_QUEUE_SIZE 1 /* multiple of nr_stripes */
+#define STRIPE_QUEUE_SIZE 2 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -142,16 +142,66 @@ static unsigned long io_weight(unsigned long *bitmap, int disks)
 
 static void print_raid5_conf (raid5_conf_t *conf);
 
+/* __release_queue - route the stripe_queue based on pending i/o's.  The
+ * queue object is allowed to bounce around between 4 lists up until
+ * it is attached to a stripe_head.  The lists in order of priority are:
+ * 1/ overwrite: all data blocks are set to be overwritten, no prereads
+ * 2/ unaligned_read: read requests that get past chunk_aligned_read
+ * 3/ subwidth_write: write requests that require prereading
+ * 4/ delayed_q: write requests pending activation
+ */
+static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
+{
+	if (atomic_dec_and_test(&sq->count)) {
+		int disks = sq->disks;
+		int data_disks = disks - conf->max_degraded;
+		int to_write = io_weight(sq->to_write, disks);
+
+		BUG_ON(!list_empty(&sq->list_node));
+		BUG_ON(atomic_read(&conf->active_queues) == 0);
+
+		if (to_write &&
+		    io_weight(sq->overwrite, disks) == data_disks) {
+			list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (io_weight(sq->to_read, disks)) {
+			list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (to_write &&
+			   test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state)) {
+			list_add_tail(&sq->list_node, &conf->io_lo_q_list);
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
+		} else if (to_write) {
+			list_add_tail(&sq->list_node, &conf->delayed_q_list);
+			blk_plug_device(conf->mddev->queue);
+		} else {
+			atomic_dec(&conf->active_queues);
+			if (test_and_clear_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+					       &sq->state)) {
+				atomic_dec(&conf->preread_active_queues);
+				if (atomic_read(&conf->preread_active_queues) <
+				    IO_THRESHOLD)
+					queue_work(conf->workqueue,
+						   &conf->stripe_queue_work);
+			}
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
+				list_add_tail(&sq->list_node,
+					      &conf->inactive_q_list);
+				wake_up(&conf->wait_for_queue);
+			}
+		}
+	}
+}
+
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	struct stripe_queue *sq = sh->sq;
+
 	if (atomic_dec_and_test(&sh->count)) {
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->delayed_list);
-				blk_plug_device(conf->mddev->queue);
-			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 				blk_plug_device(conf->mddev->queue);
@@ -162,21 +212,29 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
 			BUG_ON(sh->ops.pending);
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-				atomic_dec(&conf->preread_active_stripes);
-				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			}
 			atomic_dec(&conf->active_stripes);
-			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
 				list_add_tail(&sh->lru, &conf->inactive_list);
 				wake_up(&conf->wait_for_stripe);
 				if (conf->retry_read_aligned)
 					md_wakeup_thread(conf->mddev->thread);
 			}
+			__release_queue(conf, sq);
+			sh->sq = NULL;
 		}
 	}
 }
+
+static void release_queue(struct stripe_queue *sq)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	unsigned long flags;
+
+	spin_lock_irqsave(&conf->device_lock, flags);
+	__release_queue(conf, sq);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+}
+
 static void release_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->sq->raid_conf;
@@ -221,10 +279,28 @@ static struct stripe_head *get_free_stripe(raid5_conf_t *conf)
 	list_del_init(first);
 	remove_hash(sh);
 	atomic_inc(&conf->active_stripes);
+	BUG_ON(sh->sq != NULL);
 out:
 	return sh;
 }
 
+static struct stripe_queue *get_free_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq = NULL;
+	struct list_head *first;
+
+	CHECK_DEVLOCK();
+	if (list_empty(&conf->inactive_q_list))
+		goto out;
+	first = conf->inactive_q_list.next;
+	sq = list_entry(first, struct stripe_queue, list_node);
+	list_del_init(first);
+	rb_erase(&sq->rb_node, &conf->stripe_queue_tree);
+	atomic_inc(&conf->active_queues);
+out:
+	return sq;
+}
+
 static void shrink_buffers(struct stripe_head *sh, int num)
 {
 	struct page *p;
@@ -256,14 +332,11 @@ static int grow_buffers(struct stripe_head *sh, int num)
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_queue(struct stripe_queue *sq, sector_t sector,
-		int disks, int pd_idx);
-
 static void
-init_stripe(struct stripe_head *sh, struct stripe_queue *sq,
-	     sector_t sector, int pd_idx, int disks)
+init_stripe(struct stripe_head *sh, struct stripe_queue *sq, int disks)
 {
 	raid5_conf_t *conf = sq->raid_conf;
+	sector_t sector = sq->sector;
 	int i;
 
 	pr_debug("init_stripe called, stripe %llu\n",
@@ -272,7 +345,6 @@ init_stripe(struct stripe_head *sh, struct stripe_queue *sq,
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
-	init_queue(sh->sq, sector, disks, pd_idx);
 
 	CHECK_DEVLOCK();
 
@@ -310,68 +382,151 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	return NULL;
 }
 
+static struct stripe_queue *__find_queue(raid5_conf_t *conf, sector_t sector)
+{
+	struct rb_node *n = conf->stripe_queue_tree.rb_node;
+	struct stripe_queue *sq;
+
+	pr_debug("%s, sector %llu\n", __FUNCTION__, (unsigned long long)sector);
+	while (n) {
+		sq = rb_entry(n, struct stripe_queue, rb_node);
+
+		if (sector < sq->sector)
+			n = n->rb_left;
+		else if (sector > sq->sector)
+			n = n->rb_right;
+		else
+			return sq;
+	}
+	pr_debug("__queue %llu not in tree\n", (unsigned long long)sector);
+	return NULL;
+}
+
+static struct stripe_queue *
+__insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node)
+{
+	struct rb_node **p = &conf->stripe_queue_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct stripe_queue *sq;
+
+	while (*p) {
+		parent = *p;
+		sq = rb_entry(parent, struct stripe_queue, rb_node);
+
+		if (sector < sq->sector)
+			p = &(*p)->rb_left;
+		else if (sector > sq->sector)
+			p = &(*p)->rb_right;
+		else
+			return sq;
+	}
+
+	rb_link_node(node, parent, p);
+
+	return NULL;
+}
+
+static struct stripe_queue *
+insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node)
+{
+	struct stripe_queue *sq = __insert_active_sq(conf, sector, node);
+
+	if (sq)
+		goto out;
+	rb_insert_color(node, &conf->stripe_queue_tree);
+ out:
+	return sq;
+}
+
 static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
 	sector_t sector, int pd_idx, int i);
 
+static void
+pickup_cached_stripe(struct stripe_head *sh, struct stripe_queue *sq)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+
+	if (atomic_read(&sh->count))
+		BUG_ON(!list_empty(&sh->lru));
+	else {
+		if (!test_bit(STRIPE_HANDLE, &sh->state)) {
+			atomic_inc(&conf->active_stripes);
+			BUG_ON(sh->sq != NULL);
+		}
+		if (list_empty(&sh->lru) &&
+		    !test_bit(STRIPE_QUEUE_EXPANDING, &sq->state))
+			BUG();
+		list_del_init(&sh->lru);
+	}
+}
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(struct request_queue *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
-					     int pd_idx, int noblock)
+static void
+__wait_for_inactive_stripe(raid5_conf_t *conf, struct stripe_queue *sq)
 {
-	struct stripe_head *sh;
+	conf->inactive_blocked = 1;
+	wait_event_lock_irq(conf->wait_for_stripe,
+			    (!list_empty(&conf->inactive_list) &&
+			     (atomic_read(&conf->active_stripes)
+			      < (conf->max_nr_stripes * 3/4) ||
+			      !conf->inactive_blocked)),
+			    conf->device_lock,
+			    raid5_unplug_device(conf->mddev->queue));
+	conf->inactive_blocked = 0;
+}
 
-	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
+static struct stripe_head *
+get_active_stripe(struct stripe_queue *sq, int disks, int noblock)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	sector_t sector = sq->sector;
+	struct stripe_head *sh;
 
 	spin_lock_irq(&conf->device_lock);
 
+	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sq->sector);
+
 	do {
-		wait_event_lock_irq(conf->wait_for_stripe,
-				    conf->quiesce == 0,
-				    conf->device_lock, /* nothing */);
+		/* try to get a cached stripe */
 		sh = __find_stripe(conf, sector, disks);
+
+		/* try to activate a new stripe */
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
 			if (noblock && sh == NULL)
 				break;
-			if (!sh) {
-				conf->inactive_blocked = 1;
-				wait_event_lock_irq(conf->wait_for_stripe,
-						    !list_empty(&conf->inactive_list) &&
-						    (atomic_read(&conf->active_stripes)
-						     < (conf->max_nr_stripes *3/4)
-						     || !conf->inactive_blocked),
-						    conf->device_lock,
-						    raid5_unplug_device(conf->mddev->queue)
-					);
-				conf->inactive_blocked = 0;
-			} else
-				init_stripe(sh, sh->sq, sector, pd_idx, disks);
-		} else {
-			if (atomic_read(&sh->count)) {
-			  BUG_ON(!list_empty(&sh->lru));
-			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru) &&
-				    !test_bit(STRIPE_EXPANDING, &sh->state))
-					BUG();
-				list_del_init(&sh->lru);
-			}
-		}
+			if (!sh)
+				__wait_for_inactive_stripe(conf, sq);
+			else
+				init_stripe(sh, sq, disks);
+		} else
+			pickup_cached_stripe(sh, sq);
 	} while (sh == NULL);
 
+	BUG_ON(sq->sector != sector);
+
 	if (sh) {
 		atomic_inc(&sh->count);
-		if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE,
-					&sh->sq->state)) {
+		if (test_and_clear_bit(STRIPE_QUEUE_FIRSTWRITE, &sq->state)) {
 			sh->bm_seq = conf->seq_flush+1;
 			set_bit(STRIPE_BIT_DELAY, &sh->state);
 		}
+
+		if (sh->sq)
+			BUG_ON(sh->sq != sq);
+		else {
+			sh->sq = sq;
+			atomic_inc(&sq->count);
+		}
+
+		BUG_ON(!list_empty(&sq->list_node));
 	}
 
 	spin_unlock_irq(&conf->device_lock);
+
 	return sh;
 }
 
@@ -385,6 +540,7 @@ static void init_queue(struct stripe_queue *sq, sector_t sector,
 		__FUNCTION__, (unsigned long long) sq->sector,
 		(unsigned long long) sector, sq);
 
+	BUG_ON(atomic_read(&sq->count) != 0);
 	BUG_ON(io_weight(sq->to_read, disks));
 	BUG_ON(io_weight(sq->to_write, disks));
 	BUG_ON(io_weight(sq->overwrite, disks));
@@ -392,6 +548,7 @@ static void init_queue(struct stripe_queue *sq, sector_t sector,
 	sq->sector = sector;
 	sq->pd_idx = pd_idx;
 	sq->disks = disks;
+	sq->state = 0;
 
 	for (i = disks; i--;) {
 		struct r5_queue_dev *dev_q = &sq->dev[i];
@@ -405,6 +562,74 @@ static void init_queue(struct stripe_queue *sq, sector_t sector,
 		}
 		dev_q->sector = compute_blocknr(conf, disks, sector, pd_idx, i);
 	}
+
+	sq = insert_active_sq(conf, sector, &sq->rb_node);
+	if (unlikely(sq)) {
+		printk(KERN_ERR "%s: sq: %p sector: %llu bounced off the "
+			"stripe_queue rb_tree\n", __FUNCTION__, sq,
+			(unsigned long long) sq->sector);
+		BUG();
+	}
+}
+
+static void __wait_for_inactive_queue(raid5_conf_t *conf)
+{
+	conf->inactive_queue_blocked = 1;
+	wait_event_lock_irq(conf->wait_for_queue,
+			    !list_empty(&conf->inactive_q_list) &&
+			    (atomic_read(&conf->active_queues)
+			     < conf->max_nr_stripes *
+			     STRIPE_QUEUE_SIZE * 7/8 ||
+			    !conf->inactive_queue_blocked),
+			    conf->device_lock,
+			    /* nothing */);
+	conf->inactive_queue_blocked = 0;
+}
+
+
+static struct stripe_queue *
+get_active_queue(raid5_conf_t *conf, sector_t sector, int disks, int pd_idx,
+		  int noblock)
+{
+	struct stripe_queue *sq;
+
+	pr_debug("%s, sector %llu\n", __FUNCTION__,
+		(unsigned long long)sector);
+
+	spin_lock_irq(&conf->device_lock);
+
+	do {
+		wait_event_lock_irq(conf->wait_for_queue,
+				    conf->quiesce == 0,
+				    conf->device_lock,
+				    /* nothing */);
+		sq = __find_queue(conf, sector);
+		if (!sq) {
+			if (!conf->inactive_queue_blocked)
+				sq = get_free_queue(conf);
+			if (noblock && sq == NULL)
+				break;
+			if (!sq)
+				__wait_for_inactive_queue(conf);
+			else
+				init_queue(sq, sector, disks, pd_idx);
+		} else {
+			if (atomic_read(&sq->count))
+				BUG_ON(!list_empty(&sq->list_node));
+			else if (io_weight(sq->to_write, disks) == 0 &&
+				 io_weight(sq->to_read, disks) == 0)
+				atomic_inc(&conf->active_queues);
+
+			list_del_init(&sq->list_node);
+		}
+	} while (sq == NULL);
+
+	if (sq)
+		atomic_inc(&sq->count);
+
+	spin_unlock_irq(&conf->device_lock);
+
+	return sq;
 }
 
 
@@ -1004,16 +1229,15 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 	}
 }
 
-static struct stripe_queue *grow_one_queue(raid5_conf_t *conf);
-
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
+	struct stripe_queue init_sq = { .raid_conf = conf };
+
 	sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
-	sh->sq = grow_one_queue(conf);
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
@@ -1025,13 +1249,14 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
-	spin_lock_irq(&conf->device_lock);
-	__release_stripe(conf, sh);
-	spin_unlock_irq(&conf->device_lock);
+	atomic_set(&init_sq.count, 2); /* bypass release_queue() */
+	sh->sq = &init_sq;
+	release_stripe(sh);
+
 	return 1;
 }
 
-static struct stripe_queue *grow_one_queue(raid5_conf_t *conf)
+static int grow_one_queue(raid5_conf_t *conf)
 {
 	struct stripe_queue *sq;
 	int disks = conf->raid_disks;
@@ -1059,13 +1284,20 @@ static struct stripe_queue *grow_one_queue(raid5_conf_t *conf)
 	sq->raid_conf = conf;
 	sq->disks = disks;
 
-	return sq;
+	/* we just created an active queue so... */
+	atomic_set(&sq->count, 1);
+	atomic_inc(&conf->active_queues);
+	INIT_LIST_HEAD(&sq->list_node);
+	RB_CLEAR_NODE(&sq->rb_node);
+	release_queue(sq);
+
+	return 1;
 }
 
 static int grow_stripes(raid5_conf_t *conf, int num)
 {
 	struct kmem_cache *sc;
-	int devs = conf->raid_disks;
+	int devs = conf->raid_disks, num_q = num * STRIPE_QUEUE_SIZE;
 
 	sprintf(conf->sh_cache_name[0], "raid5-%s", mdname(conf->mddev));
 	sprintf(conf->sh_cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
@@ -1080,6 +1312,9 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 		return 1;
 	conf->sh_slab_cache = sc;
 	conf->pool_size = devs;
+	while (num--)
+		if (!grow_one_stripe(conf))
+			return 1;
 
 	sc = kmem_cache_create(conf->sq_cache_name[conf->active_name],
 			       (sizeof(struct stripe_queue)+(devs-1) *
@@ -1090,10 +1325,10 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 				r5_io_weight_size(devs), 0, 0, NULL);
 	if (!sc)
 		return 1;
-	conf->sq_slab_cache = sc;
 
-	while (num--)
-		if (!grow_one_stripe(conf))
+	conf->sq_slab_cache = sc;
+	while (num_q--)
+		if (!grow_one_queue(conf))
 			return 1;
 
 	return 0;
@@ -1126,7 +1361,7 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	 * so we use GFP_NOIO allocations.
 	 */
 	struct stripe_head *osh, *nsh;
-	struct stripe_queue *nsq;
+	struct stripe_queue *osq, *nsq;
 	LIST_HEAD(newstripes);
 	LIST_HEAD(newqueues);
 	struct disk_info *ndisks;
@@ -1209,7 +1444,9 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 			nsq->overwrite = weight_map;
 			weight_map += r5_io_weight_size(newsize);
 			nsq->overlap = weight_map;
+
 			nsq->raid_conf = conf;
+			RB_CLEAR_NODE(&nsq->rb_node);
 			spin_lock_init(&nsq->lock);
 			list_add(&nsq->list_node, &newqueues);
 		}
@@ -1236,6 +1473,19 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	 * OK, we have enough stripes, start collecting inactive
 	 * stripes and copying them over
 	 */
+	list_for_each_entry(nsq, &newqueues, list_node) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_queue,
+				    !list_empty(&conf->inactive_q_list),
+				    conf->device_lock,
+				    unplug_slaves(conf->mddev));
+		osq = get_free_queue(conf);
+		spin_unlock_irq(&conf->device_lock);
+		atomic_set(&nsq->count, 1);
+		kmem_cache_free(conf->sq_slab_cache, osq);
+	}
+	kmem_cache_destroy(conf->sq_slab_cache);
+
 	list_for_each_entry(nsh, &newstripes, lru) {
 		spin_lock_irq(&conf->device_lock);
 		wait_event_lock_irq(conf->wait_for_stripe,
@@ -1250,11 +1500,9 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 			nsh->dev[i].page = osh->dev[i].page;
 		for( ; i<newsize; i++)
 			nsh->dev[i].page = NULL;
-		kmem_cache_free(conf->sq_slab_cache, osh->sq);
 		kmem_cache_free(conf->sh_slab_cache, osh);
 	}
 	kmem_cache_destroy(conf->sh_slab_cache);
-	kmem_cache_destroy(conf->sq_slab_cache);
 
 	/* Step 3.
 	 * At this point, we are holding all the stripes so the array
@@ -1272,10 +1520,8 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 
 	/* Step 4, return new stripes to service */
 	while (!list_empty(&newstripes)) {
-		nsq = list_entry(newqueues.next, struct stripe_queue,
-					list_node);
+		struct stripe_queue init_sq = { .raid_conf = conf };
 		nsh = list_entry(newstripes.next, struct stripe_head, lru);
-		list_del_init(&nsq->list_node);
 		list_del_init(&nsh->lru);
 		for (i=conf->raid_disks; i < newsize; i++)
 			if (nsh->dev[i].page == NULL) {
@@ -1284,9 +1530,19 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 				if (!p)
 					err = -ENOMEM;
 			}
-		nsh->sq = nsq;
+		atomic_set(&init_sq.count, 2); /* bypass release_queue() */
+		nsh->sq = &init_sq;
 		release_stripe(nsh);
 	}
+
+	/* Step 4a, return new queues to service */
+	while (!list_empty(&newqueues)) {
+		nsq = list_entry(newqueues.next, struct stripe_queue,
+				 list_node);
+		list_del_init(&nsq->list_node);
+		release_queue(nsq);
+	}
+
 	/* critical section pass, GFP_NOIO no longer needed */
 
 	conf->sh_slab_cache = sc;
@@ -1308,18 +1564,33 @@ static int drop_one_stripe(raid5_conf_t *conf)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
 	shrink_buffers(sh, conf->pool_size);
-	if (sh->sq)
-		kmem_cache_free(conf->sq_slab_cache, sh->sq);
 	kmem_cache_free(conf->sh_slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;
 }
 
+static int drop_one_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
+
+	spin_lock_irq(&conf->device_lock);
+	sq = get_free_queue(conf);
+	spin_unlock_irq(&conf->device_lock);
+	if (!sq)
+		return 0;
+	kmem_cache_free(conf->sq_slab_cache, sq);
+	atomic_dec(&conf->active_queues);
+	return 1;
+}
+
 static void shrink_stripes(raid5_conf_t *conf)
 {
 	while (drop_one_stripe(conf))
 		;
 
+	while (drop_one_queue(conf))
+		;
+
 	if (conf->sh_slab_cache)
 		kmem_cache_destroy(conf->sh_slab_cache);
 	conf->sh_slab_cache = NULL;
@@ -2055,6 +2326,7 @@ static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
 	if (forwrite) {
 		/* check if page is covered */
 		sector_t sector = sq->dev[dd_idx].sector;
+
 		for (bi = sq->dev[dd_idx].towrite;
 		     sector < sq->dev[dd_idx].sector + STRIPE_SECTORS &&
 			     bi && bi->bi_sector <= sector;
@@ -2444,20 +2716,13 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
 			    test_bit(R5_Insync, &dev->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old block "
-						"%d for r-m-w\n", i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					if (!test_and_set_bit(
-						STRIPE_OP_IO, &sh->ops.pending))
-						sh->ops.count++;
-					s->locked++;
-				} else {
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old block %d for r-m-w\n", i);
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantread, &dev->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO,
+				    &sh->ops.pending))
+					sh->ops.count++;
+				s->locked++;
 			}
 		}
 	if (rcw <= rmw && rcw > 0)
@@ -2471,20 +2736,14 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
 			    test_bit(R5_Insync, &dev->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old block "
-						"%d for Reconstruct\n", i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					if (!test_and_set_bit(
-						STRIPE_OP_IO, &sh->ops.pending))
-						sh->ops.count++;
-					s->locked++;
-				} else {
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old block "
+					 "%d for Reconstruct\n", i);
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantread, &dev->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO,
+				    &sh->ops.pending))
+					sh->ops.count++;
+				s->locked++;
 			}
 		}
 	/* now if nothing is locked, and if we have enough data,
@@ -2540,21 +2799,12 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 			    && !test_bit(R5_LOCKED, &dev->flags) &&
 			    !test_bit(R5_UPTODATE, &dev->flags) &&
 			    test_bit(R5_Insync, &dev->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					s->locked++;
-				} else {
-					pr_debug("Request delayed stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old stripe %llu "
+					"block %d for Reconstruct\n",
+				     (unsigned long long)sh->sector, i);
+				set_bit(R5_LOCKED, &dev->flags);
+				set_bit(R5_Wantread, &dev->flags);
+				s->locked++;
 			}
 		}
 	/* now if nothing is locked, and if we have enough data, we can start a
@@ -2592,13 +2842,6 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 			}
 		/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
 		set_bit(STRIPE_INSYNC, &sh->state);
-
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-			    IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
 	}
 }
 
@@ -2799,25 +3042,32 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 			int dd_idx, pd_idx, j;
 			struct stripe_head *sh2;
 			struct stripe_queue *sq2;
+			int disks = conf->raid_disks;
 
 			sector_t bn = compute_blocknr(conf, sq->disks,
 						sh->sector, sq->pd_idx, i);
 			sector_t s = raid5_compute_sector(bn, conf->raid_disks,
-						conf->raid_disks -
+						disks -
 						conf->max_degraded, &dd_idx,
 						&pd_idx, conf);
-			sh2 = get_active_stripe(conf, s, conf->raid_disks,
-						pd_idx, 1);
-			if (sh2 == NULL)
+			sq2 = get_active_queue(conf, s, disks, pd_idx, 1);
+			if (sq2)
+				sh2 = get_active_stripe(sq2, disks, 1);
+			if (!(sq2 && sh2)) {
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
 				 * get requested, we will try again
 				 */
+				if (sq2)
+					release_queue(sq2);
 				continue;
-			if (!test_bit(STRIPE_EXPANDING, &sh2->state) ||
-			   test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
+			}
+
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq2->state) ||
+			    test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
 				/* must have already done this block */
 				release_stripe(sh2);
+				release_queue(sq2);
 				continue;
 			}
 
@@ -2826,7 +3076,6 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				sh->dev[i].page, 0, 0, STRIPE_SIZE,
 				ASYNC_TX_DEP_ACK, tx, NULL, NULL);
 
-			sq2 = sh2->sq;
 			set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
 			set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 			for (j = 0; j < conf->raid_disks; j++)
@@ -2840,6 +3089,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				set_bit(STRIPE_HANDLE, &sh2->state);
 			}
 			release_stripe(sh2);
+			release_queue(sq2);
 
 		}
 	/* done submitting copies, wait for them to complete */
@@ -2884,7 +3134,6 @@ static void handle_stripe5(struct stripe_head *sh)
 
 	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
-	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
@@ -3033,12 +3282,6 @@ static void handle_stripe5(struct stripe_head *sh)
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-				IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
 	}
 
 	/* Now to consider new write requests and what else, if anything
@@ -3100,7 +3343,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
 		!test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
 
-		clear_bit(STRIPE_EXPANDING, &sh->state);
+		clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 
 		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
 		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
@@ -3113,7 +3356,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 	}
 
-	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) &&
 		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		/* Need to write out all blocks after computing parity */
 		sq->disks = conf->raid_disks;
@@ -3163,7 +3406,6 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
-	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
@@ -3319,7 +3561,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			}
 		}
 
-	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+	if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
 		/* Need to write out all blocks after computing P&Q */
 		sq->disks = conf->raid_disks;
 		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
@@ -3330,7 +3572,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			s.locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
 		}
-		clear_bit(STRIPE_EXPANDING, &sh->state);
+		clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 	} else if (s.expanded) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
@@ -3413,20 +3655,41 @@ static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 		handle_stripe5(sh);
 }
 
-
+static void handle_queue(struct stripe_queue *sq, int disks, int data_disks)
+{
+	int to_write = io_weight(sq->to_write, disks);
+
+	pr_debug("%s: sector %llu "
+		 "state: %#lx r: %lu w: %lu o: %lu\n", __FUNCTION__,
+		 (unsigned long long) sq->sector, sq->state,
+		 io_weight(sq->to_read, disks),
+		 io_weight(sq->to_write, disks),
+		 io_weight(sq->overwrite, disks));
+
+	/* continue to process i/o while the stripe is cached */
+	if (to_write == data_disks || io_weight(sq->to_read, disks) ||
+	    (to_write && test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state))) {
+		struct stripe_head *sh = get_active_stripe(sq, disks, 1);
+		if (sh) {
+			handle_stripe(sh, NULL);
+			release_stripe(sh);
+		}
+	}
+}
 
 static void raid5_activate_delayed(raid5_conf_t *conf)
 {
-	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
-		while (!list_empty(&conf->delayed_list)) {
-			struct list_head *l = conf->delayed_list.next;
-			struct stripe_head *sh;
-			sh = list_entry(l, struct stripe_head, lru);
-			list_del_init(l);
-			clear_bit(STRIPE_DELAYED, &sh->state);
-			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
-			list_add_tail(&sh->lru, &conf->handle_list);
+	if (atomic_read(&conf->preread_active_queues) < IO_THRESHOLD) {
+		struct stripe_queue *sq, *_sq;
+		pr_debug("%s\n", __FUNCTION__);
+		list_for_each_entry_safe(sq, _sq, &conf->delayed_q_list,
+					 list_node) {
+			list_del_init(&sq->list_node);
+			atomic_inc(&sq->count);
+			if (!test_and_set_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+						&sq->state))
+				atomic_inc(&conf->preread_active_queues);
+			__release_queue(conf, sq);
 		}
 	}
 }
@@ -3481,6 +3744,7 @@ static void raid5_unplug_device(struct request_queue *q)
 		conf->seq_flush++;
 		raid5_activate_delayed(conf);
 	}
+
 	md_wakeup_thread(mddev->thread);
 
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -3524,13 +3788,13 @@ static int raid5_congested(void *data, int bits)
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 
 	/* No difference between reads and writes.  Just check
-	 * how busy the stripe_cache is
+	 * how busy the stripe_queue is
 	 */
-	if (conf->inactive_blocked)
+	if (conf->inactive_queue_blocked)
 		return 1;
 	if (conf->quiesce)
 		return 1;
-	if (list_empty_careful(&conf->inactive_list))
+	if (list_empty_careful(&conf->inactive_q_list))
 		return 1;
 
 	return 0;
@@ -3636,7 +3900,7 @@ static int raid5_align_endio(struct bio *bi, unsigned int bytes, int error)
 	if (!error && uptodate) {
 		bio_endio(raid_bi, bytes, 0);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
-			wake_up(&conf->wait_for_stripe);
+			wake_up(&conf->wait_for_queue);
 		return 0;
 	}
 
@@ -3722,7 +3986,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 		}
 
 		spin_lock_irq(&conf->device_lock);
-		wait_event_lock_irq(conf->wait_for_stripe,
+		wait_event_lock_irq(conf->wait_for_queue,
 				    conf->quiesce == 0,
 				    conf->device_lock, /* nothing */);
 		atomic_inc(&conf->active_aligned_reads);
@@ -3745,7 +4009,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
-	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	const int rw = bio_data_dir(bi);
 	int remaining;
 
@@ -3807,16 +4071,18 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
-		if (sh) {
+		sq = get_active_queue(conf, new_sector, disks, pd_idx,
+					(bi->bi_rw & RWA_MASK));
+		if (sq) {
 			if (unlikely(conf->expand_progress != MaxSector)) {
 				/* expansion might have moved on while waiting for a
-				 * stripe, so we must do the range check again.
+				 * queue, so we must do the range check again.
 				 * Expansion could still move past after this
 				 * test, but as we are holding a reference to
-				 * 'sh', we know that if that happens,
-				 *  STRIPE_EXPANDING will get set and the expansion
-				 * won't proceed until we finish with the stripe.
+				 * 'sq', we know that if that happens,
+				 * STRIPE_QUEUE_EXPANDING will get set and the
+				 * expansion won't proceed until we finish
+				 * with the queue.
 				 */
 				int must_retry = 0;
 				spin_lock_irq(&conf->device_lock);
@@ -3826,7 +4092,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 					must_retry = 1;
 				spin_unlock_irq(&conf->device_lock);
 				if (must_retry) {
-					release_stripe(sh);
+					release_queue(sq);
 					goto retry;
 				}
 			}
@@ -3835,28 +4101,28 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			 */
 			if (logical_sector >= mddev->suspend_lo &&
 			    logical_sector < mddev->suspend_hi) {
-				release_stripe(sh);
+				release_queue(sq);
 				schedule();
 				goto retry;
 			}
 
-			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-			    !add_queue_bio(sh->sq, bi, dd_idx,
+			if (test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) ||
+			    !add_queue_bio(sq, bi, dd_idx,
 					   bi->bi_rw & RW_MASK)) {
 				/* Stripe is busy expanding or
 				 * add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
-				release_stripe(sh);
+				release_queue(sq);
 				schedule();
 				goto retry;
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
-			handle_stripe(sh, NULL);
-			release_stripe(sh);
+			handle_queue(sq, disks, data_disks);
+			release_queue(sq);
 		} else {
-			/* cannot get stripe for read-ahead, just give-up */
+			/* cannot get queue for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			finish_wait(&conf->wait_for_overlap, &w);
 			break;
@@ -3879,6 +4145,34 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	return 0;
 }
 
+static struct stripe_head *
+wait_for_inactive_cache(raid5_conf_t *conf, sector_t sector,
+			int disks, int pd_idx)
+{
+	struct stripe_head *sh;
+
+	do {
+		struct stripe_queue *sq;
+		wait_queue_t wait;
+		init_waitqueue_entry(&wait, current);
+		add_wait_queue(&conf->wait_for_stripe, &wait);
+		for (;;) {
+			sq = get_active_queue(conf, sector, disks, pd_idx, 0);
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			sh = get_active_stripe(sq, disks, 1);
+			if (sh)
+				break;
+			release_queue(sq);
+			schedule();
+		}
+		current->state = TASK_RUNNING;
+		remove_wait_queue(&conf->wait_for_stripe, &wait);
+	} while (0);
+
+	return sh;
+}
+
+
 static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped)
 {
 	/* reshaping is quite different to recovery/resync so it is
@@ -3946,17 +4240,18 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 		int j;
 		int skipped = 0;
 		pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
-		sh = get_active_stripe(conf, sector_nr+i,
-				       conf->raid_disks, pd_idx, 0);
+		sh = wait_for_inactive_cache(conf, sector_nr+i,
+					     conf->raid_disks, pd_idx);
 		sq = sh->sq;
-		set_bit(STRIPE_EXPANDING, &sh->state);
+
+		set_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
 		 * array, then we need to zero those blocks
 		 */
 		for (j = sq->disks; j--;) {
 			sector_t s;
-			int pd_idx = sh->sq->pd_idx;
+			int pd_idx = sq->pd_idx;
 
 			if (j == pd_idx)
 				continue;
@@ -3978,6 +4273,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
 		release_stripe(sh);
+		release_queue(sq);
 	}
 	spin_lock_irq(&conf->device_lock);
 	conf->expand_progress = (sector_nr + i) * new_data_disks;
@@ -4001,11 +4297,14 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	while (first_sector <= last_sector) {
 		pd_idx = stripe_to_pdidx(first_sector, conf,
 					 conf->previous_raid_disks);
-		sh = get_active_stripe(conf, first_sector,
-				       conf->previous_raid_disks, pd_idx, 0);
+		sh = wait_for_inactive_cache(conf, first_sector,
+					     conf->previous_raid_disks,
+					     pd_idx);
+		sq = sh->sq;
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		release_stripe(sh);
+		release_queue(sq);
 		first_sector += STRIPE_SECTORS;
 	}
 	return conf->chunk_size>>9;
@@ -4065,14 +4364,8 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 	}
 
 	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
-	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
-	if (sh == NULL) {
-		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);
-		/* make sure we don't swamp the stripe cache if someone else
-		 * is trying to get access
-		 */
-		schedule_timeout_uninterruptible(1);
-	}
+
+	sh = wait_for_inactive_cache(conf, sector_nr, raid_disks, pd_idx);
 	sq = sh->sq;
 
 	/* Need to check if array will still be degraded after recovery/resync
@@ -4092,6 +4385,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	handle_stripe(sh, NULL);
 	release_stripe(sh);
+	release_queue(sq);
 
 	return STRIPE_SECTORS;
 }
@@ -4108,17 +4402,19 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * We *know* that this entire raid_bio is in one chunk, so
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
-	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int dd_idx, pd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
 	int remaining;
 	int handled = 0;
+	int disks = conf->raid_disks;
+	int data_disks = disks - conf->max_degraded;
 
 	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	sector = raid5_compute_sector(	logical_sector,
-					conf->raid_disks,
-					conf->raid_disks - conf->max_degraded,
+					disks,
+					data_disks,
 					&dd_idx,
 					&pd_idx,
 					conf);
@@ -4128,30 +4424,36 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	     logical_sector += STRIPE_SECTORS,
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
+		struct stripe_head *sh;
 
 		if (scnt < raid_bio->bi_hw_segments)
 			/* already done this stripe */
 			continue;
 
-		sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1);
-
-		if (!sh) {
-			/* failed to get a stripe - must wait */
+		sq = get_active_queue(conf, sector, disks, pd_idx, 1);
+		if (sq)
+			sh = get_active_stripe(sq, disks, 1);
+		if (!(sq && sh)) {
+			/* failed to get a queue/stripe - must wait */
 			raid_bio->bi_hw_segments = scnt;
 			conf->retry_read_aligned = raid_bio;
+			if (sq)
+				release_queue(sq);
 			return handled;
 		}
 
 		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
-		if (!add_queue_bio(sh->sq, raid_bio, dd_idx, 0)) {
+		if (!add_queue_bio(sq, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
+			release_queue(sq);
 			raid_bio->bi_hw_segments = scnt;
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
 
-		handle_stripe(sh, NULL);
+		handle_queue(sq, disks, data_disks);
 		release_stripe(sh);
+		release_queue(sq);
 		handled++;
 	}
 	spin_lock_irq(&conf->device_lock);
@@ -4166,11 +4468,63 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 			        ? 0 : -EIO);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
-		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_queue);
 	return handled;
 }
 
+static void raid456_cache_arbiter(struct work_struct *work)
+{
+	raid5_conf_t *conf = container_of(work, raid5_conf_t,
+					  stripe_queue_work);
+	struct list_head *sq_entry;
+	int attach = 0;
+
+	/* attach queues to stripes in priority order */
+	pr_debug("+++ %s active\n", __FUNCTION__);
+	spin_lock_irq(&conf->device_lock);
+	do {
+		sq_entry = NULL;
+		if (!list_empty(&conf->io_hi_q_list))
+			sq_entry = conf->io_hi_q_list.next;
+		else if (!list_empty(&conf->io_lo_q_list))
+			sq_entry = conf->io_lo_q_list.next;
+
+		/* "these aren't the droids you're looking for..."
+		 * do not handle the delayed list while there are better
+		 * things to do
+		 */
+		if (!sq_entry &&
+		    atomic_read(&conf->preread_active_queues) <
+		    IO_THRESHOLD && !blk_queue_plugged(conf->mddev->queue) &&
+		    !list_empty(&conf->delayed_q_list)) {
+			raid5_activate_delayed(conf);
+			sq_entry = conf->io_lo_q_list.next;
+		}
+
+		if (sq_entry) {
+			struct stripe_queue *sq;
+			struct stripe_head *sh;
+			sq = list_entry(sq_entry, struct stripe_queue,
+					list_node);
+
+			list_del_init(sq_entry);
+			atomic_inc(&sq->count);
+			BUG_ON(atomic_read(&sq->count) != 1);
 
+			spin_unlock_irq(&conf->device_lock);
+			sh = get_active_stripe(sq, conf->raid_disks, 0);
+			spin_lock_irq(&conf->device_lock);
+
+			set_bit(STRIPE_HANDLE, &sh->state);
+			__release_stripe(conf, sh);
+			__release_queue(conf, sq);
+			attach++;
+		}
+	} while (sq_entry);
+	spin_unlock_irq(&conf->device_lock);
+	pr_debug("%d stripe(s) attached\n", attach);
+	pr_debug("--- %s inactive\n", __FUNCTION__);
+}
 
 /*
  * This is our raid5 kernel thread.
@@ -4204,12 +4558,6 @@ static void raid5d (mddev_t *mddev)
 			activate_bit_delay(conf);
 		}
 
-		if (list_empty(&conf->handle_list) &&
-		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
-		    !blk_queue_plugged(mddev->queue) &&
-		    !list_empty(&conf->delayed_list))
-			raid5_activate_delayed(conf);
-
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
@@ -4221,6 +4569,7 @@ static void raid5d (mddev_t *mddev)
 		}
 
 		if (list_empty(&conf->handle_list)) {
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
 			async_tx_issue_pending_all();
 			break;
 		}
@@ -4263,7 +4612,8 @@ raid5_store_stripe_cache_size(mddev_t *mddev, const char *page, size_t len)
 {
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	char *end;
-	int new;
+	int new, queue, i;
+
 	if (len >= PAGE_SIZE)
 		return -EINVAL;
 	if (!conf)
@@ -4279,9 +4629,21 @@ raid5_store_stripe_cache_size(mddev_t *mddev, const char *page, size_t len)
 			conf->max_nr_stripes--;
 		else
 			break;
+
+		for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++)
+			queue += drop_one_queue(conf);
+
+		if (queue < STRIPE_QUEUE_SIZE)
+			break;
 	}
 	md_allow_write(mddev);
 	while (new > conf->max_nr_stripes) {
+		for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++)
+			queue += grow_one_queue(conf);
+
+		if (queue < STRIPE_QUEUE_SIZE)
+			break;
+
 		if (grow_one_stripe(conf))
 			conf->max_nr_stripes++;
 		else break;
@@ -4307,9 +4669,23 @@ stripe_cache_active_show(mddev_t *mddev, char *page)
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+stripe_queue_active_show(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	if (conf)
+		return sprintf(page, "%d\n", atomic_read(&conf->active_queues));
+	else
+		return 0;
+}
+
+static struct md_sysfs_entry
+raid5_stripequeue_active = __ATTR_RO(stripe_queue_active);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
+	&raid5_stripequeue_active.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4410,16 +4786,29 @@ static int run(mddev_t *mddev)
 		if (!conf->spare_page)
 			goto abort;
 	}
+
+	sprintf(conf->workqueue_name, "%s_cache_arb",
+		mddev->gendisk->disk_name);
+	conf->workqueue = create_singlethread_workqueue(conf->workqueue_name);
+	if (!conf->workqueue)
+		goto abort;
+
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
+	init_waitqueue_head(&conf->wait_for_queue);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
-	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
 	INIT_LIST_HEAD(&conf->inactive_list);
+	INIT_LIST_HEAD(&conf->io_hi_q_list);
+	INIT_LIST_HEAD(&conf->io_lo_q_list);
+	INIT_LIST_HEAD(&conf->delayed_q_list);
+	INIT_LIST_HEAD(&conf->inactive_q_list);
 	atomic_set(&conf->active_stripes, 0);
-	atomic_set(&conf->preread_active_stripes, 0);
+	atomic_set(&conf->active_queues, 0);
+	atomic_set(&conf->preread_active_queues, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
+	INIT_WORK(&conf->stripe_queue_work, raid456_cache_arbiter);
 
 	pr_debug("raid5: run(%s) called.\n", mdname(mddev));
 
@@ -4519,6 +4908,8 @@ static int run(mddev_t *mddev)
 		printk(KERN_INFO "raid5: allocated %dkB for %s\n",
 			memory, mdname(mddev));
 
+	conf->stripe_queue_tree = RB_ROOT;
+
 	if (mddev->degraded == 0)
 		printk("raid5: raid level %d set %s active with %d out of %d"
 			" devices, algorithm %d\n", conf->level, mdname(mddev), 
@@ -4575,6 +4966,8 @@ static int run(mddev_t *mddev)
 abort:
 	if (conf) {
 		print_raid5_conf(conf);
+		if (conf->workqueue)
+			destroy_workqueue(conf->workqueue);
 		safe_put_page(conf->spare_page);
 		kfree(conf->disks);
 		kfree(conf->stripe_hashtbl);
@@ -4599,6 +4992,7 @@ static int stop(mddev_t *mddev)
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
 	kfree(conf->disks);
+	destroy_workqueue(conf->workqueue);
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;
@@ -4608,29 +5002,50 @@ static int stop(mddev_t *mddev)
 static void print_sh (struct seq_file *seq, struct stripe_head *sh)
 {
 	int i;
-	struct stripe_queue *sq = sh->sq;
 
-	seq_printf(seq, "sh %llu, pd_idx %d, state %ld.\n",
-		   (unsigned long long)sh->sector, sq->pd_idx, sh->state);
+	seq_printf(seq, "sh %llu, state %ld.\n",
+		   (unsigned long long)sh->sector, sh->state);
 	seq_printf(seq, "sh %llu,  count %d.\n",
 		   (unsigned long long)sh->sector, atomic_read(&sh->count));
 	seq_printf(seq, "sh %llu, ", (unsigned long long)sh->sector);
-	for (i = 0; i < sq->disks; i++)
+	for (i = 0; i < sh->sq->disks; i++)
 		seq_printf(seq, "(cache%d: %p %ld) ",
 			   i, sh->dev[i].page, sh->dev[i].flags);
 	seq_printf(seq, "\n");
 }
 
-static void printall (struct seq_file *seq, raid5_conf_t *conf)
+static void print_sq(struct seq_file *seq, struct stripe_queue *sq)
 {
+	int disks = sq->disks;
+
+	seq_printf(seq, "sq %llu, pd_idx %d, state %ld.\n",
+		   (unsigned long long)sq->sector, sq->pd_idx, sq->state);
+	seq_printf(seq, "sq %llu,  count %d to_write: %lu to_read: %lu "
+		   "overwrite: %lu\n", (unsigned long long)sq->sector,
+		   atomic_read(&sq->count), io_weight(sq->to_write, disks),
+		   io_weight(sq->to_read, disks),
+		   io_weight(sq->overwrite, disks));
+	seq_printf(seq, "sq %llu, ", (unsigned long long)sq->sector);
+}
+
+static void printall(struct seq_file *seq, raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
 	struct stripe_head *sh;
+	struct rb_node *rbn;
 	struct hlist_node *hn;
 	int i;
 
 	spin_lock_irq(&conf->device_lock);
+	rbn = rb_first(&conf->stripe_queue_tree);
+	while (rbn) {
+		sq = rb_entry(rbn, struct stripe_queue, rb_node);
+		print_sq(seq, sq);
+		rbn = rb_next(rbn);
+	}
 	for (i = 0; i < NR_HASH; i++) {
 		hlist_for_each_entry(sh, hn, &conf->stripe_hashtbl[i], hash) {
-			if (sh->sq->raid_conf != conf)
+			if (!sh->sq)
 				continue;
 			print_sh(seq, sh);
 		}
@@ -4952,8 +5367,8 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	case 1: /* stop all writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 1;
-		wait_event_lock_irq(conf->wait_for_stripe,
-				    atomic_read(&conf->active_stripes) == 0 &&
+		wait_event_lock_irq(conf->wait_for_queue,
+				    atomic_read(&conf->active_queues) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
 				    conf->device_lock, /* nothing */);
 		spin_unlock_irq(&conf->device_lock);
@@ -4962,7 +5377,7 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	case 0: /* re-enable writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 0;
-		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_queue);
 		wake_up(&conf->wait_for_overlap);
 		spin_unlock_irq(&conf->device_lock);
 		break;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 3d4938c..7ff0c9c 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -3,6 +3,7 @@
 
 #include <linux/raid/md.h>
 #include <linux/raid/xor.h>
+#include <linux/rbtree.h>
 
 /*
  *
@@ -181,7 +182,7 @@ struct stripe_head {
 		int		   count;
 		u32		   zero_sum_result;
 	} ops;
-	struct stripe_queue *sq;
+	struct stripe_queue *sq; /* list of pending bios for this stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -191,7 +192,7 @@ struct stripe_head {
 };
 
 /* stripe_head_state - collects and tracks the dynamic state of a stripe_head
- *     for handle_stripe.  It is only valid under spin_lock(sh->lock);
+ *     for handle_stripe.  It is only valid under spin_lock(sq->lock);
  */
 struct stripe_head_state {
 	int syncing, expanding, expanded;
@@ -205,7 +206,16 @@ struct r6_state {
 	int p_failed, q_failed, qd_idx, failed_num[2];
 };
 
+/* stripe_queue
+ * @rb_node - stripe_queue_tree linkage (keyed by @sector)
+ * @sector - rb_tree key
+ * @lock
+ * @list_node - once this queue object satisfies some constraint (like full
+ *  stripe write) it is placed on a list for processing by the cache
+ * @overwrite - bitmap of blocks that are set to be completely overwritten
+ */
 struct stripe_queue {
+	struct rb_node rb_node;
 	sector_t sector;
 	/* stripe queues are allocated with extra space to hold the following
 	 * four bitmaps.  One bit for each block in the stripe_head.  These
@@ -222,6 +232,7 @@ struct stripe_queue {
 	struct list_head list_node;
 	int pd_idx; /* parity disk index */
 	int disks; /* disks in stripe */
+	atomic_t count;
 	struct r5_queue_dev {
 		sector_t sector; /* hw starting sector for this block */
 		struct bio *toread, *read, *towrite, *written;
@@ -263,11 +274,8 @@ struct stripe_queue {
 #define STRIPE_HANDLE		2
 #define	STRIPE_SYNCING		3
 #define	STRIPE_INSYNC		4
-#define	STRIPE_PREREAD_ACTIVE	5
-#define	STRIPE_DELAYED		6
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
-#define	STRIPE_EXPANDING	9
 #define	STRIPE_EXPAND_SOURCE	10
 #define	STRIPE_EXPAND_READY	11
 /*
@@ -292,6 +300,8 @@ struct stripe_queue {
  * Stripe-queue state
  */
 #define STRIPE_QUEUE_FIRSTWRITE 0
+#define STRIPE_QUEUE_EXPANDING 1
+#define STRIPE_QUEUE_PREREAD_ACTIVE 2
 
 /*
  * Plugging:
@@ -324,6 +334,7 @@ struct disk_info {
 
 struct raid5_private_data {
 	struct hlist_head	*stripe_hashtbl;
+	struct rb_root		stripe_queue_tree;
 	mddev_t			*mddev;
 	struct disk_info	*spare;
 	int			chunk_size, level, algorithm;
@@ -339,12 +350,22 @@ struct raid5_private_data {
 	int			previous_raid_disks;
 
 	struct list_head	handle_list; /* stripes needing handling */
-	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
+	struct list_head	delayed_q_list; /* queues that have plugged
+						 * requests
+						 */
+	struct list_head	io_hi_q_list; /* reads and full stripe writes */
+	struct list_head	io_lo_q_list; /* sub-stripe-width writes */
+	struct workqueue_struct *workqueue; /* attaches sq's to sh's */
+	struct work_struct	stripe_queue_work;
+	char 			workqueue_name[20];
+
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
-	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;
+	atomic_t		preread_active_queues; /* queues with scheduled
+							* io
+							*/
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
@@ -367,12 +388,20 @@ struct raid5_private_data {
 	struct page 		*spare_page; /* Used when checking P/Q in raid6 */
 
 	/*
+	 * Free queue pool
+	 */
+	atomic_t		active_queues;
+	struct list_head	inactive_q_list;
+	wait_queue_head_t	wait_for_queue;
+	wait_queue_head_t	wait_for_overlap;
+	int			inactive_queue_blocked;
+
+	/*
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
-	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
  2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
                   ` (3 preceding siblings ...)
  2007-10-06 17:06 ` [PATCH -mm 4/4] raid5: use stripe_queues to prioritize the "most deserving" requests (rev7) Dan Williams
@ 2007-10-06 18:34 ` Justin Piszcz
  2007-10-07 17:30   ` Dan Williams
  2007-10-08  0:47   ` Neil Brown
  2007-10-09  6:21 ` Neil Brown
  5 siblings, 2 replies; 10+ messages in thread
From: Justin Piszcz @ 2007-10-06 18:34 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, akpm, linux-raid



On Sat, 6 Oct 2007, Dan Williams wrote:

> Neil,
>
> Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
> raid6+bitmap testing done by Mr. James W. Laferriere there have been
> several cleanups and fixes since the last release.  Also, the changes
> are now spread over 4 patches to isolate one conceptual change per
> patch.  The most significant cleanup is removing the stripe_head back
> pointer from stripe_queue.  This effectively makes the queuing layer
> independent from the caching layer.
>
> Expansion support needs more testing.
>
> See the individual patch changelogs for details.  Patch 1 contains
> updated performance numbers.
>
> Andrew,
>
> These are updated in the git-md-accel tree, but I will work the
> finalized versions through Neil's 'Signed-off-by' path.
>
> Dan Williams (4):
>      raid5: add the stripe_queue object for tracking raid io requests (rev3)
>      raid5: split allocation of stripe_heads and stripe_queues
>      raid5: convert add_stripe_bio to add_queue_bio
>      raid5: use stripe_queues to prioritize the "most deserving" requests (rev7)
>
> drivers/md/raid5.c         | 1560 ++++++++++++++++++++++++++++++++------------
> include/linux/raid/raid5.h |   88 ++-
> 2 files changed, 1200 insertions(+), 448 deletions(-)
>
> --
> Dan
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

These patches & data look very impressive; do we have an ETA for when they
will be merged into mainline?

Justin.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
  2007-10-06 18:34 ` [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Justin Piszcz
@ 2007-10-07 17:30   ` Dan Williams
  2007-10-08  0:47   ` Neil Brown
  1 sibling, 0 replies; 10+ messages in thread
From: Dan Williams @ 2007-10-07 17:30 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: neilb, akpm, linux-raid

On 10/6/07, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
>
>
> On Sat, 6 Oct 2007, Dan Williams wrote:
>
> > Neil,
> >
> > Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
> > raid6+bitmap testing done by Mr. James W. Laferriere there have been
> > several cleanups and fixes since the last release.  Also, the changes
> > are now spread over 4 patches to isolate one conceptual change per
> > patch.  The most significant cleanup is removing the stripe_head back
> > pointer from stripe_queue.  This effectively makes the queuing layer
> > independent from the caching layer.
> >
> > Expansion support needs more testing.
> >
> > See the individual patch changelogs for details.  Patch 1 contains
> > updated performance numbers.
> >
> > Andrew,
> >
> > These are updated in the git-md-accel tree, but I will work the
> > finalized versions through Neil's 'Signed-off-by' path.
> >
> > Dan Williams (4):
> >      raid5: add the stripe_queue object for tracking raid io requests (rev3)
> >      raid5: split allocation of stripe_heads and stripe_queues
> >      raid5: convert add_stripe_bio to add_queue_bio
> >      raid5: use stripe_queues to prioritize the "most deserving" requests (rev7)
> >
> > drivers/md/raid5.c         | 1560 ++++++++++++++++++++++++++++++++------------
> > include/linux/raid/raid5.h |   88 ++-
> > 2 files changed, 1200 insertions(+), 448 deletions(-)
> >
> > --
> > Dan
>
> These patches & data look very impressive; do we have an ETA for when they
> will be merged into mainline?
>
> Justin.

The short answer is "when they are ready."  Jim reported that he is
seeing bonnie++ get stuck in D state on his platform, so the debugging is
ongoing.  Additional testing is always welcome...

--
Dan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
  2007-10-06 18:34 ` [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Justin Piszcz
  2007-10-07 17:30   ` Dan Williams
@ 2007-10-08  0:47   ` Neil Brown
  1 sibling, 0 replies; 10+ messages in thread
From: Neil Brown @ 2007-10-08  0:47 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Dan Williams, akpm, linux-raid

On Saturday October 6, jpiszcz@lucidpixels.com wrote:
> 
> These patches & data look very impressive; do we have an ETA for when they
> will be merged into mainline?
> 
> Justin.

If you or others could test and confirm similar improvements, that
would be a big help.
But as always, the answer is "when they are ready".  I don't foresee
substantial problems, but review and testing doesn't happen instantly :-)

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
  2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
                   ` (4 preceding siblings ...)
  2007-10-06 18:34 ` [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Justin Piszcz
@ 2007-10-09  6:21 ` Neil Brown
  2007-10-09 22:56   ` Dan Williams
  5 siblings, 1 reply; 10+ messages in thread
From: Neil Brown @ 2007-10-09  6:21 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On Saturday October 6, dan.j.williams@intel.com wrote:
> Neil,
> 
> Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
> raid6+bitmap testing done by Mr. James W. Laferriere there have been
> several cleanups and fixes since the last release.  Also, the changes
> are now spread over 4 patches to isolate one conceptual change per
> patch.  The most significant cleanup is removing the stripe_head back
> pointer from stripe_queue.  This effectively makes the queuing layer
> independent from the caching layer.

Thanks Dan, and sorry that it has taken such a time for me to take a
serious look at this.
The results seem impressive.  I'll try to do some testing myself, but
firstly: some questions.


1/ Can you explain why this improves the performance more than simply
  doubling the size of the stripe cache?

  The core of what it is doing seems to be to give priority to writing
  full stripes.  We already do that by delaying incomplete stripes.
  Maybe we just need to tune that mechanism a bit?  Maybe release
  fewer partial stripes at a time?

  It seems that the whole point of the stripe_queue structure is to
  allow requests to gather before they are processed so the more
  "deserving" can be processed first, but I cannot see why you need a
  data structure separate from the list_head.

  You could argue that simply doubling the size of the stripe cache
  would be a waste of memory as we only want to use half of it to
  handle active requests - the other half is for requests being built
  up.
  In that case, I don't see a problem with having a pool of pages
  which is smaller than would be needed for the full stripe cache, and
  allocating them to stripe_heads as they become free.

2/ I thought I understood from your descriptions that
   raid456_cache_arbiter would normally be waiting for a free stripe,
   that during this time full stripes could get promoted to io_hi, and
   so when raid456_cache_arbiter finally got a free stripe, it would
   attach it to the most deserving stripe_queue.  However it doesn't
   quite do that.  It chooses the deserving stripe_queue *before*
   waiting for a free stripe_head.  This seems slightly less than
   optimal?

3/ Why create a new workqueue for raid456_cache_arbiter rather than
   use raid5d.  It should be possible to do a non-blocking wait for a
   free stripe_head, in which case the "find a stripe head and attach
   the most deserving stripe_queue" would fit well into raid5d.

4/ Why do you use an rbtree rather than a hash table to index the
  'stripe_queue' objects?  I seem to recall a discussion about this
  where it was useful to find adjacent requests or something like
  that, but I cannot see that in the current code.
  But maybe rbtrees are a better fit, in which case, should we use
  them for stripe_heads as well???

5/ Then again... It seems to me that a stripe_head will always have a
   stripe_queue pointing to it.  In that case we don't need to index
   the stripe_heads at all any more.  Would that be correct?

6/ What is the point of the do/while loop in
   wait_for_cache_attached_queue?  It seems totally superfluous.

That'll do for now.

NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)
  2007-10-09  6:21 ` Neil Brown
@ 2007-10-09 22:56   ` Dan Williams
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2007-10-09 22:56 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, babydr

On Mon, 2007-10-08 at 23:21 -0700, Neil Brown wrote:
> On Saturday October 6, dan.j.williams@intel.com wrote:
> > Neil,
> >
> > Here is the latest spin of the 'stripe_queue' implementation.
> Thanks to
> > raid6+bitmap testing done by Mr. James W. Laferriere there have been
> > several cleanups and fixes since the last release.  Also, the
> changes
> > are now spread over 4 patches to isolate one conceptual change per
> > patch.  The most significant cleanup is removing the stripe_head
> back
> > pointer from stripe_queue.  This effectively makes the queuing layer
> > independent from the caching layer.
> 
> Thanks Dan, and sorry that it has taken such a time for me to take a
> serious look at this.

Not a problem, I've actually only recently had some cycles to look at
these patches again myself.

> The results seem impressive.  I'll try to do some testing myself, but
> firstly: some questions.

> 1/ Can you explain why this improves the performance more than simply
>   doubling the size of the stripe cache?

Before I answer here are some quick numbers to quantify the difference
versus simply doubling the size of the stripe cache:

Test Configuration:
mdadm --create /dev/md0 /dev/sd[bcdefghi] -n 8 -l 5 --assume-clean
for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done

Average rate taken for 2.6.23-rc9 (1), 2.6.23-rc9 with stripe_cache_size
= 512 (2), 2.6.23-rc9+stripe_queue (3), 2.6.23-rc9+stripe_queue with
stripe_cache_size = 512 (4).

(1): 181MB/s
(2): 252MB/s (+41%)
(3): 330MB/s (+82%)
(4): 352MB/s (+94%)

>   The core of what it is doing seems to be to give priority to writing
>   full stripes.  We already do that by delaying incomplete stripes.
>   Maybe we just need to tune that mechanism a bit?  Maybe release
>   fewer partial stripes at a time?

>   It seems that the whole point of the stripe_queue structure is to
>   allow requests to gather before they are processed so the more
>   "deserving" can be processed first, but I cannot see why you need a
>   data structure separate from the list_head.

>   You could argue that simply doubling the size of the stripe cache
>   would be a waste of memory as we only want to use half of it to
>   handle active requests - the other half is for requests being built
>   up.
>   In that case, I don't see a problem with having a pool of pages
>   which is smaller than would be needed for the full stripe cache, and
>   allocating them to stripe_heads as they become free.

I believe the additional performance is coming from the fact that
delayed stripes are no longer consuming cache space while they wait for
their delay condition to clear, *and* that full stripe writes are
explicitly detected and moved to the front of the line.  This
effectively makes delayed stripes wait longer in some cases which is the
overall goal.
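
To make the "front of the line" part concrete, here is roughly what the
classification looks like.  This is a hand-wavy sketch, not the code from
patch 4/4, and the helper name is made up; it only shows the rule: a queue
with a write bio against every data block (a full stripe write), or with
reads pending, goes on io_hi, everything else is a sub-width write and
goes on io_lo:

	/* sketch only: helper name invented, see patch 4/4 for the
	 * real classification
	 */
	static struct list_head *sq_priority_list(raid5_conf_t *conf,
						  struct stripe_queue *sq)
	{
		int disks = sq->disks;
		int data_disks = disks - conf->max_degraded;

		/* full stripe write or pending reads: high priority */
		if (io_weight(sq->to_write, disks) == data_disks ||
		    io_weight(sq->to_read, disks))
			return &conf->io_hi_q_list;

		/* sub-stripe-width write: low priority, may be delayed */
		return &conf->io_lo_q_list;
	}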

> 2/ I thought I understood from your descriptions that
>    raid456_cache_arbiter would normally be waiting for a free stripe,
>    that during this time full stripes could get promoted to io_hi, and
>    so when raid456_cache_arbiter finally got a free stripe, it would
>    attach it to the most deserving stripe_queue.  However it doesn't
>    quite do that.  It chooses the deserving stripe_queue *before*
>    waiting for a free stripe_head.  This seems slightly less than
>    optimal?

I see, get the stripe first and then go look at io_hi versus io_lo.
Yes, that would prevent some unnecessary io_lo requests from sneaking
into the cache.
> 
> 3/ Why create a new workqueue for raid456_cache_arbiter rather than
>    use raid5d.  It should be possible to do a non-blocking wait for a
>    free stripe_head, in which case the "find a stripe head and attach
>    the most deserving stripe_queue" would fit well into raid5d.

It seemed necessary to have at least one thread doing a blocking wait on
the stripe cache... but moving this all under raid5d seems possible.
And, it might fix the deadlock condition that Jim is able to create in
his testing with bitmaps.  I have sent him a patch, off-list, to move
all bitmap handling to the stripe_queue which seems to improve
bitmap-write performance, but he still sees cases where raid5d() and
raid456_cache_arbiter() are staring blankly at each other while bonnie++
patiently waits in D state.  A kick
to /sys/block/md3/md/stripe_cache_size gets things going again.

> 4/ Why do you use an rbtree rather than a hash table to index the
>   'stripe_queue' objects?  I seem to recall a discussion about this
>   where it was useful to find adjacent requests or something like
>   that, but I cannot see that in the current code.
>   But maybe rbtrees are a better fit, in which case, should we use
>   them for stripe_heads as well???

If you are referring to the following:
http://marc.info/?l=linux-kernel&m=117740314031101&w=2
...then no, I am not caching the leftmost or rightmost entry to speed
lookups.

I initially did not know how many queues would need to be in play versus
stripe_heads to yield a performance advantage so I picked the rbtree
mainly because it was being used in other 'schedulers'.  As far as
implementing an rbtree for stripe_heads I don't have data around whether
it would be a performance win/loss.
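
For anyone following along, the lookup side of the tree is just the
textbook sector-keyed walk over conf->stripe_queue_tree.  A rough sketch
(not a verbatim copy of the __find_queue in patch 1, but the shape is the
same):

	static struct stripe_queue *find_queue_sketch(raid5_conf_t *conf,
						      sector_t sector)
	{
		struct rb_node *n = conf->stripe_queue_tree.rb_node;

		while (n) {
			struct stripe_queue *sq;

			sq = rb_entry(n, struct stripe_queue, rb_node);
			if (sector < sq->sector)
				n = n->rb_left;
			else if (sector > sq->sector)
				n = n->rb_right;
			else
				return sq;
		}
		return NULL;
	}

A hash table would obviously work here too; the tree mostly buys an
ordered traversal (rb_first/rb_next, as the new printall() does) for free.
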
> 
> 5/ Then again... It seems to me that a stripe_head will always have a
>    stripe_queue pointing to it.  In that case we don't need to index
>    the stripe_heads at all any more.  Would that be correct?

I actually cleaned up the code to remove that back reference (sq->sh),
but I could put it back for the purpose of not needing two lookup
mechanisms...  I'll investigate.
> 
> 6/ What is the point of the do/while loop in
>    wait_for_cache_attached_queue?  It seems totally superfluous.

Yes, totally superfluous copy and paste error from the
wait_event_lock_irq macro.

I guess you are looking at the latest state of the code in -mm since
wait_for_cache_attached_queue was renamed to wait_for_inactive_cache in
the current patches.
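
In other words the wrapper can simply go away; same body, minus the
pointless do/while (sketched from the code above, untested):

	static struct stripe_head *
	wait_for_inactive_cache(raid5_conf_t *conf, sector_t sector,
				int disks, int pd_idx)
	{
		struct stripe_queue *sq;
		struct stripe_head *sh;
		wait_queue_t wait;

		init_waitqueue_entry(&wait, current);
		add_wait_queue(&conf->wait_for_stripe, &wait);
		for (;;) {
			sq = get_active_queue(conf, sector, disks, pd_idx, 0);
			set_current_state(TASK_UNINTERRUPTIBLE);
			sh = get_active_stripe(sq, disks, 1);
			if (sh)
				break;
			release_queue(sq);
			schedule();
		}
		current->state = TASK_RUNNING;
		remove_wait_queue(&conf->wait_for_stripe, &wait);

		return sh;
	}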

> That'll do for now.
Yes, should be enough for me to chew on for a while.
> 
> NeilBrown
> 

Thanks,
Dan

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread

Thread overview: 10+ messages
2007-10-06 17:06 [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Dan Williams
2007-10-06 17:06 ` [PATCH -mm 1/4] raid5: add the stripe_queue object for tracking raid io requests (rev3) Dan Williams
2007-10-06 17:06 ` [PATCH -mm 2/4] raid5: split allocation of stripe_heads and stripe_queues Dan Williams
2007-10-06 17:06 ` [PATCH -mm 3/4] raid5: convert add_stripe_bio to add_queue_bio Dan Williams
2007-10-06 17:06 ` [PATCH -mm 4/4] raid5: use stripe_queues to prioritize the "most deserving" requests (rev7) Dan Williams
2007-10-06 18:34 ` [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance) Justin Piszcz
2007-10-07 17:30   ` Dan Williams
2007-10-08  0:47   ` Neil Brown
2007-10-09  6:21 ` Neil Brown
2007-10-09 22:56   ` Dan Williams
