* [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2
@ 2007-07-03 23:20 Dan Williams
From: Dan Williams @ 2007-07-03 23:20 UTC
  To: neilb; +Cc: raziebe, akpm, davidsen, linux-kernel, linux-raid

The first take of the stripe-queue implementation[1] had a
performance-limiting bug in __wait_for_inactive_queue.  Fixing that
issue drastically changed the performance characteristics.  The
following tiobench data shows the relative performance difference of
the stripe-queue patchset.

Unit information
================
File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%      = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU%, i.e. relative throughput per CPU load

Configuration
=============
Platform: 1200MHz iop348 with 4-disk sata_vsc array
mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
                File    Blk     Num     Avg             CPU
Identifier      Size    Size    Thr     Rate    CPU%    Eff
--------------- ------  -----   ---     ------  ------  -----
2.6.22-rc7-iop1 2048    4096    1       0%      4%      -3%
2.6.22-rc7-iop1 2048    4096    2       -38%    -33%    -8%
2.6.22-rc7-iop1 2048    4096    4       -35%    -30%    -8%
2.6.22-rc7-iop1 2048    4096    8       -14%    -11%    -3%
2.6.22-rc7-iop1 2048    131072  1       2%      1%      2%
2.6.22-rc7-iop1 2048    131072  2       -11%    -10%    -2%
2.6.22-rc7-iop1 2048    131072  4       -7%     -6%     -1%
2.6.22-rc7-iop1 2048    131072  8       -9%     -6%     -4%

Random Reads
                File    Blk     Num     Avg             CPU
Identifier      Size    Size    Thr     Rate    CPU%    Eff
--------------- ------  -----   ---     ------  ------  -----
2.6.22-rc7-iop1 2048    4096    1       -9%     15%     -21%
2.6.22-rc7-iop1 2048    4096    2       -1%     -30%    42%
2.6.22-rc7-iop1 2048    4096    4       -14%    -22%    10%
2.6.22-rc7-iop1 2048    4096    8       -21%    -28%    9%
2.6.22-rc7-iop1 2048    131072  1       -8%     -4%     -4%
2.6.22-rc7-iop1 2048    131072  2       -13%    -13%    0%
2.6.22-rc7-iop1 2048    131072  4       -15%    -15%    0%
2.6.22-rc7-iop1 2048    131072  8       -13%    -13%    0%

Sequential Writes
                File    Blk     Num     Avg             CPU
Identifier      Size    Size    Thr     Rate    CPU%    Eff
--------------- ------  -----   ---     ------  ------  -----
2.6.22-rc7-iop1 2048    4096    1       25%     11%     12%
2.6.22-rc7-iop1 2048    4096    2       41%     42%     -1%
2.6.22-rc7-iop1 2048    4096    4       40%     18%     19%
2.6.22-rc7-iop1 2048    4096    8       15%     -5%     21%
2.6.22-rc7-iop1 2048    131072  1       65%     57%     4%
2.6.22-rc7-iop1 2048    131072  2       46%     36%     8%
2.6.22-rc7-iop1 2048    131072  4       24%     -7%     34%
2.6.22-rc7-iop1 2048    131072  8       28%     -15%    51%

Random Writes
                File    Blk     Num     Avg             CPU
Identifier      Size    Size    Thr     Rate    CPU%    Eff
--------------- ------  -----   ---     ------  ------  -----
2.6.22-rc7-iop1 2048    4096    1       2%      -8%     11%
2.6.22-rc7-iop1 2048    4096    2       -1%     -19%    21%
2.6.22-rc7-iop1 2048    4096    4       2%      2%      0%
2.6.22-rc7-iop1 2048    4096    8       -1%     -28%    37%
2.6.22-rc7-iop1 2048    131072  1       2%      -3%     5%
2.6.22-rc7-iop1 2048    131072  2       3%      -4%     7%
2.6.22-rc7-iop1 2048    131072  4       4%      -3%     8%
2.6.22-rc7-iop1 2048    131072  8       5%      -9%     15%

The write performance numbers are better than I expected and would seem
to address the concerns raised in the thread "Odd (slow) RAID
performance"[2].  The read performance drop was not expected; however,
the numbers suggest some additional changes to be made to the queuing
model.  Where read performance drops there appears to be an equal drop
in CPU utilization, which suggests that pure read requests should be
handled immediately, without a trip to the stripe-queue workqueue.
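
A minimal sketch of that idea (hypothetical: sq_has_pending_writes() and
stripe_queue_work() are illustrative names, not functions from this
patchset):

	/* in the bio submission path, after the bio is attached to a queue */
	if (bio_data_dir(bi) == READ && !sq_has_pending_writes(sq)) {
		/* pure read: service it from this context, skipping
		 * the stripe-queue workqueue round trip */
		handle_stripe(sh);
	} else {
		/* writes and mixed stripes still pass through the
		 * queue for merging and prioritization */
		stripe_queue_work(conf, sq);
	}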

Although it is not shown in the above data, another noteworthy result is
that increasing the cache size past a certain point causes the write
performance gains to erode; in other words, negative returns rather than
diminishing returns.  The stripe-queue can only carry out optimizations
while the cache is busy.  When the cache is large, requests can be
handled without waiting, and performance approaches the original 1:1
(queue-to-stripe-head) model.  CPU speed dictates the maximum effective
cache size: once the CPU can no longer keep the stripe-queue saturated,
performance falls off from the peak.  This is a positive change because
it shows that the new queuing model can produce higher performance with
fewer resources, but it does require more care when changing
'stripe_cache_size.'  The above numbers were taken with the default
cache size of 256.
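
For reference, the cache size can be inspected and changed at run time
via the standard md sysfs attribute (this interface predates the
patchset; md0 is just an example device):

	# cat /sys/block/md0/md/stripe_cache_size
	256
	# echo 512 > /sys/block/md0/md/stripe_cache_size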

Changes since take1:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename queue_weight -> io_weight
* fix r5_io_weight_size
* implement support for sysfs changes to stripe_cache_size
* delete and re-add stripe queues to their management lists rather than
  moving them.  This guarantees that when sq->count is non-zero the
  queue is not on a list (identical to stripe_head handling)
* __wait_for_inactive_queue was incorrectly using conf->inactive_blocked,
  which is exclusively for the stripe cache.  Added
  conf->inactive_queue_blocked and set the routine to wait until the
  number of active queues drops below 7/8 of the total before unblocking
  processing.  The 7/8 figure arises as follows: get_active_stripe waits
  for 3/4 of the stripe cache to be active, i.e. 1/4 inactive, and
  conf->max_nr_stripes / 4 == conf->max_nr_stripes * STRIPE_QUEUE_SIZE / 8
  iff STRIPE_QUEUE_SIZE == 2 (worked numbers follow this list)
* change raid5_congested to report whether the *queue* is congested, not
  the cache.
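
Worked numbers for the 7/8 threshold, assuming the default cache size of
256 and STRIPE_QUEUE_SIZE == 2:

	stripe cache: 256 heads            -> unblocks at 256/4 = 64 inactive
	stripe queue: 256 * 2 = 512 queues -> unblocks at 512/8 = 64 inactive

Both sides unblock once the same 64 objects go inactive, so the two
thresholds stay aligned.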

As before, these patches apply on top of the md-accel series:

	git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

For those wanting to test these changes in isolation from other major
kernel changes, look for the upcoming 2.6.22-iop1 kernel release on
SourceForge[3].

--
Dan

[1] http://marc.info/?l=linux-raid&m=118297138908827&w=2
[2] http://marc.info/?l=linux-raid&m=116489600902178&w=2
[3] https://sourceforge.net/projects/xscaleiop/


* [RFC PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests
From: Dan Williams @ 2007-07-03 23:20 UTC
  To: neilb; +Cc: raziebe, akpm, davidsen, linux-kernel, linux-raid

The raid5 stripe cache object, struct stripe_head, serves two purposes:
	1/ frontend: queuing incoming requests
	2/ backend: transitioning requests through the cache state machine
	   to the backing devices
The problem with this model is that queuing decisions are directly tied to
cache availability.  There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given time.

This patch separates the object members needed for queuing from the object
members used for caching.  The stripe_queue object takes over the incoming
bio lists as well as the buffer state flags.

The following fields are moved from struct stripe_head to struct
stripe_queue:
	raid5_private_data *raid_conf
	int pd_idx
	spinlock_t lock
	int bm_seq

The following fields are moved from struct r5dev to struct r5_queue_dev:
	sector_t sector
	struct bio *toread, *towrite
	unsigned long flags

This patch lays the groundwork for, but does not implement, the facility
to have more queue objects in the system than available stripes;
currently the relationship remains 1:1.  In other words, this patch just
moves fields around and does not implement new logic.
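
For orientation, a rough sketch of the two new objects implied by the
field moves above (abbreviated; the real definitions are in the
include/linux/raid/raid5.h hunk of this patch):

	struct r5_queue_dev {
		sector_t sector;		/* starting sector of this block */
		struct bio *toread, *towrite;	/* pending request lists */
		unsigned long flags;		/* R5_* buffer state bits */
	};

	struct stripe_queue {
		raid5_private_data *raid_conf;
		int pd_idx;			/* parity disk index */
		spinlock_t lock;
		int bm_seq;			/* bitmap batch number */
		struct r5_queue_dev dev[1];	/* sized to raid_disks at alloc */
	};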

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  903 +++++++++++++++++++++++++-------------------
 include/linux/raid/raid5.h |   29 +
 2 files changed, 530 insertions(+), 402 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0b66afe..a4723c7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
  * When we discover that we will need to write to any block in a stripe
- * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq
+ * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq
  * the number of the batch it will be in. This is bm_flush+1.
  * When we are ready to do a write, if that batch hasn't been written yet,
  *   we plug the array and queue the stripe for later.
@@ -132,7 +132,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 				list_add_tail(&sh->lru, &conf->delayed_list);
 				blk_plug_device(conf->mddev->queue);
 			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				   sh->bm_seq - conf->seq_write > 0) {
+				   sh->sq->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 				blk_plug_device(conf->mddev->queue);
 			} else {
@@ -159,7 +159,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 }
 static void release_stripe(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
@@ -238,7 +238,7 @@ static void raid5_build_block (struct stripe_head *sh, int i);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
@@ -252,23 +252,24 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
 	sh->disks = disks;
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev->toread || dev->read || dev->towrite || dev->written ||
-		    test_bit(R5_LOCKED, &dev->flags)) {
+		if (dev_q->toread || dev->read || dev_q->towrite ||
+			dev->written || test_bit(R5_LOCKED, &dev_q->flags)) {
 			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->read, dev->towrite, dev->written,
-			       test_bit(R5_LOCKED, &dev->flags));
+			       (unsigned long long)sh->sector, i, dev_q->toread,
+			       dev->read, dev_q->towrite, dev->written,
+			       test_bit(R5_LOCKED, &dev_q->flags));
 			BUG();
 		}
-		dev->flags = 0;
+		dev_q->flags = 0;
 		raid5_build_block(sh, i);
 	}
 	insert_hash(conf, sh);
@@ -288,6 +289,9 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	return NULL;
 }
 
+static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
+	sector_t sector, int pd_idx, int i);
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
@@ -389,7 +393,8 @@ raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
 
 static void ops_run_io(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, disks = sh->disks;
 
 	might_sleep();
@@ -398,9 +403,9 @@ static void ops_run_io(struct stripe_head *sh)
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
-		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+		if (test_and_clear_bit(R5_Wantwrite, &sq->dev[i].flags))
 			rw = WRITE;
-		else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+		else if (test_and_clear_bit(R5_Wantread, &sq->dev[i].flags))
 			rw = READ;
 		else
 			continue;
@@ -443,7 +448,7 @@ static void ops_run_io(struct stripe_head *sh)
 			bi->bi_size = STRIPE_SIZE;
 			bi->bi_next = NULL;
 			if (rw == WRITE &&
-			    test_bit(R5_ReWrite, &sh->dev[i].flags))
+			    test_bit(R5_ReWrite, &sq->dev[i].flags))
 				atomic_add(STRIPE_SECTORS,
 					&rdev->corrected_errors);
 			generic_make_request(bi);
@@ -452,7 +457,7 @@ static void ops_run_io(struct stripe_head *sh)
 				set_bit(STRIPE_DEGRADED, &sh->state);
 			pr_debug("skip op %ld on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
-			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			clear_bit(R5_LOCKED, &sq->dev[i].flags);
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
 	}
@@ -513,7 +518,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i, more_to_read = 0;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -521,17 +527,19 @@ static void ops_complete_biofill(void *stripe_head_ref)
 
 	/* clear completed biofills */
 	for (i = sh->disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		/* check if this stripe has new incoming reads */
-		if (dev->toread)
+		if (dev_q->toread)
 			more_to_read++;
 
 		/* acknowledge completion of a biofill operation */
 		/* and check if we need to reply to a read request
 		*/
-		if (test_bit(R5_Wantfill, &dev->flags) && !dev->toread) {
+		if (test_bit(R5_Wantfill, &dev_q->flags) && !dev_q->toread) {
 			struct bio *rbi, *rbi2;
-			clear_bit(R5_Wantfill, &dev->flags);
+			struct r5dev *dev = &sh->dev[i];
+
+			clear_bit(R5_Wantfill, &dev_q->flags);
 
 			/* The access to dev->read is outside of the
 			 * spin_lock_irq(&conf->device_lock), but is protected
@@ -541,8 +549,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 			rbi = dev->read;
 			dev->read = NULL;
 			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
-				rbi2 = r5_next_bio(rbi, dev->sector);
+				dev_q->sector + STRIPE_SECTORS) {
+				rbi2 = r5_next_bio(rbi, dev_q->sector);
 				spin_lock_irq(&conf->device_lock);
 				if (--rbi->bi_phys_segments == 0) {
 					rbi->bi_next = return_bi;
@@ -566,7 +574,7 @@ static void ops_complete_biofill(void *stripe_head_ref)
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -574,17 +582,18 @@ static void ops_run_biofill(struct stripe_head *sh)
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
-		if (test_bit(R5_Wantfill, &dev->flags)) {
+		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
+		if (test_bit(R5_Wantfill, &dev_q->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
-			dev->read = rbi = dev->toread;
-			dev->toread = NULL;
+			dev->read = rbi = dev_q->toread;
+			dev_q->toread = NULL;
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
+				dev_q->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
-					dev->sector, tx);
-				rbi = r5_next_bio(rbi, dev->sector);
+					dev_q->sector, tx);
+				rbi = r5_next_bio(rbi, dev_q->sector);
 			}
 		}
 	}
@@ -597,8 +606,9 @@ static void ops_run_biofill(struct stripe_head *sh)
 static void ops_complete_compute5(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
+	struct stripe_queue *sq = sh->sq;
 	int target = sh->ops.target;
-	struct r5dev *tgt = &sh->dev[target];
+	struct r5_queue_dev *tgt = &sq->dev[target];
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -618,15 +628,14 @@ ops_run_compute5(struct stripe_head *sh, unsigned long pending)
 	int disks = sh->disks;
 	struct page *xor_srcs[disks];
 	int target = sh->ops.target;
-	struct r5dev *tgt = &sh->dev[target];
-	struct page *xor_dest = tgt->page;
+	struct page *xor_dest = sh->dev[target].page;
 	int count = 0;
 	struct dma_async_tx_descriptor *tx;
 	int i;
 
 	pr_debug("%s: stripe %llu block: %d\n",
 		__FUNCTION__, (unsigned long long)sh->sector, target);
-	BUG_ON(!test_bit(R5_Wantcompute, &tgt->flags));
+	BUG_ON(!test_bit(R5_Wantcompute, &sh->sq->dev[target].flags));
 
 	for (i = disks; i--; )
 		if (i != target)
@@ -665,7 +674,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
 	struct page *xor_srcs[disks];
-	int count = 0, pd_idx = sh->pd_idx, i;
+	struct stripe_queue *sq = sh->sq;
+	int count = 0, pd_idx = sq->pd_idx, i;
 
 	/* existing parity data subtracted */
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
@@ -675,8 +685,9 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		/* Only process blocks that are known to be uptodate */
-		if (dev->towrite && test_bit(R5_Wantprexor, &dev->flags))
+		if (dev_q->towrite && test_bit(R5_Wantprexor, &dev_q->flags))
 			xor_srcs[count++] = dev->page;
 	}
 
@@ -691,7 +702,8 @@ static struct dma_async_tx_descriptor *
 ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	int pd_idx = sh->pd_idx, i;
+	struct stripe_queue *sq = sh->sq;
+	int pd_idx = sq->pd_idx, i;
 
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
@@ -703,35 +715,36 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		struct bio *chosen;
 		int towrite;
 
 		towrite = 0;
 		if (prexor) { /* rmw */
-			if (dev->towrite &&
-			    test_bit(R5_Wantprexor, &dev->flags))
+			if (dev_q->towrite &&
+			    test_bit(R5_Wantprexor, &dev_q->flags))
 				towrite = 1;
 		} else { /* rcw */
-			if (i != pd_idx && dev->towrite &&
-				test_bit(R5_LOCKED, &dev->flags))
+			if (i != pd_idx && dev_q->towrite &&
+				test_bit(R5_LOCKED, &dev_q->flags))
 				towrite = 1;
 		}
 
 		if (towrite) {
 			struct bio *wbi;
 
-			spin_lock(&sh->lock);
-			chosen = dev->towrite;
-			dev->towrite = NULL;
+			spin_lock(&sq->lock);
+			chosen = dev_q->towrite;
+			dev_q->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
-			spin_unlock(&sh->lock);
+			spin_unlock(&sq->lock);
 
 			while (wbi && wbi->bi_sector <
-				dev->sector + STRIPE_SECTORS) {
+				dev_q->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(1, wbi, dev->page,
-					dev->sector, tx);
-				wbi = r5_next_bio(wbi, dev->sector);
+					dev_q->sector, tx);
+				wbi = r5_next_bio(wbi, dev_q->sector);
 			}
 		}
 	}
@@ -754,15 +767,17 @@ static void ops_complete_postxor(void *stripe_head_ref)
 static void ops_complete_write(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	int disks = sh->disks, i, pd_idx = sh->pd_idx;
+	struct stripe_queue *sq = sh->sq;
+	int disks = sh->disks, i, pd_idx = sq->pd_idx;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		if (dev->written || i == pd_idx)
-			set_bit(R5_UPTODATE, &dev->flags);
+			set_bit(R5_UPTODATE, &dev_q->flags);
 	}
 
 	set_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
@@ -779,7 +794,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	int disks = sh->disks;
 	struct page *xor_srcs[disks];
 
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count = 0, pd_idx = sh->sq->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 	unsigned long flags;
@@ -833,14 +848,15 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 static void ops_complete_check(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sh->sq->pd_idx;
+	struct stripe_queue *sq = sh->sq;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 
 	if (test_and_clear_bit(STRIPE_OP_MOD_DMA_CHECK, &sh->ops.pending) &&
 		sh->ops.zero_sum_result == 0)
-		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+		set_bit(R5_UPTODATE, &sq->dev[pd_idx].flags);
 
 	set_bit(STRIPE_OP_CHECK, &sh->ops.complete);
 	set_bit(STRIPE_HANDLE, &sh->state);
@@ -854,7 +870,7 @@ static void ops_run_check(struct stripe_head *sh)
 	struct page *xor_srcs[disks];
 	struct dma_async_tx_descriptor *tx;
 
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count = 0, pd_idx = sh->sq->pd_idx, i;
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -911,25 +927,39 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 
 	if (overlap_clear)
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->raid_conf->wait_for_overlap);
+			struct stripe_queue *sq = sh->sq;
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+			if (test_and_clear_bit(R5_Overlap, &dev_q->flags))
+				wake_up(&sh->sq->raid_conf->wait_for_overlap);
 		}
 }
 
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
-	sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL);
+	struct stripe_queue *sq;
+
+	sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
+
+	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
+	if (!sq) {
+		kmem_cache_free(conf->sh_slab_cache, sh);
+		return 0;
+	}
+
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
-	sh->raid_conf = conf;
-	spin_lock_init(&sh->lock);
+	memset(sq, 0, sizeof(*sq) +
+		(conf->raid_disks-1) * sizeof(struct r5_queue_dev));
+	sh->sq = sq;
+	sq->raid_conf = conf;
+	spin_lock_init(&sq->lock);
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
-		kmem_cache_free(conf->slab_cache, sh);
+		kmem_cache_free(conf->sh_slab_cache, sh);
+		kmem_cache_free(conf->sq_slab_cache, sq);
 		return 0;
 	}
 	sh->disks = conf->raid_disks;
@@ -937,7 +967,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
-	release_stripe(sh);
+	spin_lock_irq(&conf->device_lock);
+	__release_stripe(conf, sh);
+	spin_unlock_irq(&conf->device_lock);
 	return 1;
 }
 
@@ -946,16 +978,29 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 	struct kmem_cache *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name[0], "raid5-%s", mdname(conf->mddev));
-	sprintf(conf->cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
+	sprintf(conf->sh_cache_name[0], "raid5-%s", mdname(conf->mddev));
+	sprintf(conf->sh_cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
+	sprintf(conf->sq_cache_name[0], "raid5q-%s", mdname(conf->mddev));
+	sprintf(conf->sq_cache_name[1], "raid5q-%s-alt", mdname(conf->mddev));
+
 	conf->active_name = 0;
-	sc = kmem_cache_create(conf->cache_name[conf->active_name],
+	sc = kmem_cache_create(conf->sh_cache_name[conf->active_name],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
+
 	if (!sc)
 		return 1;
-	conf->slab_cache = sc;
+	conf->sh_slab_cache = sc;
 	conf->pool_size = devs;
+
+	sc = kmem_cache_create(conf->sq_cache_name[conf->active_name],
+		sizeof(struct stripe_queue) +
+		(devs-1)*sizeof(struct r5_queue_dev), 0, 0, NULL, NULL);
+
+	if (!sc)
+		return 1;
+	conf->sq_slab_cache = sc;
+
 	while (num--)
 		if (!grow_one_stripe(conf))
 			return 1;
@@ -992,7 +1037,7 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	LIST_HEAD(newstripes);
 	struct disk_info *ndisks;
 	int err = 0;
-	struct kmem_cache *sc;
+	struct kmem_cache *sc, *sc_q;
 	int i;
 
 	if (newsize <= conf->pool_size)
@@ -1001,21 +1046,40 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	md_allow_write(conf->mddev);
 
 	/* Step 1 */
-	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+	sc = kmem_cache_create(conf->sh_cache_name[1-conf->active_name],
 			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return -ENOMEM;
 
+	sc_q = kmem_cache_create(conf->sq_cache_name[1-conf->active_name],
+		    sizeof(struct stripe_queue) +
+		    (newsize-1)*sizeof(struct r5_queue_dev), 0, 0, NULL, NULL);
+	if (!sc_q) {
+		kmem_cache_destroy(sc);
+		return -ENOMEM;
+	}
+
 	for (i = conf->max_nr_stripes; i; i--) {
+		struct stripe_queue *nsq;
+
 		nsh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!nsh)
 			break;
 
+		nsq = kmem_cache_alloc(sc_q, GFP_KERNEL);
+		if (!nsq) {
+			kmem_cache_free(sc, nsh);
+			break;
+		}
+
 		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
+		memset(nsq, 0, sizeof(*nsq) +
+			(newsize-1)*sizeof(struct r5_queue_dev));
 
-		nsh->raid_conf = conf;
-		spin_lock_init(&nsh->lock);
+		nsq->raid_conf = conf;
+		nsh->sq = nsq;
+		spin_lock_init(&nsq->lock);
 
 		list_add(&nsh->lru, &newstripes);
 	}
@@ -1024,8 +1088,10 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 		while (!list_empty(&newstripes)) {
 			nsh = list_entry(newstripes.next, struct stripe_head, lru);
 			list_del(&nsh->lru);
+			kmem_cache_free(sc_q, nsh->sq);
 			kmem_cache_free(sc, nsh);
 		}
+		kmem_cache_destroy(sc_q);
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
 	}
@@ -1047,9 +1113,11 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 			nsh->dev[i].page = osh->dev[i].page;
 		for( ; i<newsize; i++)
 			nsh->dev[i].page = NULL;
-		kmem_cache_free(conf->slab_cache, osh);
+		kmem_cache_free(conf->sq_slab_cache, osh->sq);
+		kmem_cache_free(conf->sh_slab_cache, osh);
 	}
-	kmem_cache_destroy(conf->slab_cache);
+	kmem_cache_destroy(conf->sh_slab_cache);
+	kmem_cache_destroy(conf->sq_slab_cache);
 
 	/* Step 3.
 	 * At this point, we are holding all the stripes so the array
@@ -1080,7 +1148,8 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
 
-	conf->slab_cache = sc;
+	conf->sh_slab_cache = sc;
+	conf->sq_slab_cache = sc_q;
 	conf->active_name = 1-conf->active_name;
 	conf->pool_size = newsize;
 	return err;
@@ -1098,7 +1167,9 @@ static int drop_one_stripe(raid5_conf_t *conf)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
 	shrink_buffers(sh, conf->pool_size);
-	kmem_cache_free(conf->slab_cache, sh);
+	if (sh->sq)
+		kmem_cache_free(conf->sq_slab_cache, sh->sq);
+	kmem_cache_free(conf->sh_slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;
 }
@@ -1108,20 +1179,25 @@ static void shrink_stripes(raid5_conf_t *conf)
 	while (drop_one_stripe(conf))
 		;
 
-	if (conf->slab_cache)
-		kmem_cache_destroy(conf->slab_cache);
-	conf->slab_cache = NULL;
+	if (conf->sh_slab_cache)
+		kmem_cache_destroy(conf->sh_slab_cache);
+	conf->sh_slab_cache = NULL;
+
+	if (conf->sq_slab_cache)
+		kmem_cache_destroy(conf->sq_slab_cache);
+	conf->sq_slab_cache = NULL;
 }
 
 static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 				   int error)
 {
  	struct stripe_head *sh = bi->bi_private;
-	raid5_conf_t *conf = sh->raid_conf;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 	char b[BDEVNAME_SIZE];
 	mdk_rdev_t *rdev;
+	struct stripe_queue *sq = sh->sq;
 
 	if (bi->bi_size)
 		return 1;
@@ -1139,15 +1215,15 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 	}
 
 	if (uptodate) {
-		set_bit(R5_UPTODATE, &sh->dev[i].flags);
-		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+		set_bit(R5_UPTODATE, &sq->dev[i].flags);
+		if (test_bit(R5_ReadError, &sq->dev[i].flags)) {
 			rdev = conf->disks[i].rdev;
 			printk(KERN_INFO "raid5:%s: read error corrected (%lu sectors at %llu on %s)\n",
 			       mdname(conf->mddev), STRIPE_SECTORS,
 			       (unsigned long long)sh->sector + rdev->data_offset,
 			       bdevname(rdev->bdev, b));
-			clear_bit(R5_ReadError, &sh->dev[i].flags);
-			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			clear_bit(R5_ReadError, &sq->dev[i].flags);
+			clear_bit(R5_ReWrite, &sq->dev[i].flags);
 		}
 		if (atomic_read(&conf->disks[i].rdev->read_errors))
 			atomic_set(&conf->disks[i].rdev->read_errors, 0);
@@ -1156,14 +1232,14 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 		int retry = 0;
 		rdev = conf->disks[i].rdev;
 
-		clear_bit(R5_UPTODATE, &sh->dev[i].flags);
+		clear_bit(R5_UPTODATE, &sq->dev[i].flags);
 		atomic_inc(&rdev->read_errors);
 		if (conf->mddev->degraded)
 			printk(KERN_WARNING "raid5:%s: read error not correctable (sector %llu on %s).\n",
 			       mdname(conf->mddev),
 			       (unsigned long long)sh->sector + rdev->data_offset,
 			       bdn);
-		else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
+		else if (test_bit(R5_ReWrite, &sq->dev[i].flags))
 			/* Oh, no!!! */
 			printk(KERN_WARNING "raid5:%s: read error NOT corrected!! (sector %llu on %s).\n",
 			       mdname(conf->mddev),
@@ -1177,15 +1253,15 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done,
 		else
 			retry = 1;
 		if (retry)
-			set_bit(R5_ReadError, &sh->dev[i].flags);
+			set_bit(R5_ReadError, &sq->dev[i].flags);
 		else {
-			clear_bit(R5_ReadError, &sh->dev[i].flags);
-			clear_bit(R5_ReWrite, &sh->dev[i].flags);
+			clear_bit(R5_ReadError, &sq->dev[i].flags);
+			clear_bit(R5_ReWrite, &sq->dev[i].flags);
 			md_error(conf->mddev, rdev);
 		}
 	}
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
-	clear_bit(R5_LOCKED, &sh->dev[i].flags);
+	clear_bit(R5_LOCKED, &sq->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
 	return 0;
@@ -1195,7 +1271,8 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 				    int error)
 {
  	struct stripe_head *sh = bi->bi_private;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -1219,18 +1296,16 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 
 	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
 	
-	clear_bit(R5_LOCKED, &sh->dev[i].flags);
+	clear_bit(R5_LOCKED, &sq->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
 	return 0;
 }
 
-
-static sector_t compute_blocknr(struct stripe_head *sh, int i);
-	
 static void raid5_build_block (struct stripe_head *sh, int i)
 {
 	struct r5dev *dev = &sh->dev[i];
+	struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
 	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
@@ -1243,8 +1318,9 @@ static void raid5_build_block (struct stripe_head *sh, int i)
 	dev->req.bi_sector = sh->sector;
 	dev->req.bi_private = sh;
 
-	dev->flags = 0;
-	dev->sector = compute_blocknr(sh, i);
+	dev_q->flags = 0;
+	dev_q->sector = compute_blocknr(sh->sq->raid_conf, sh->disks,
+			sh->sector, sh->sq->pd_idx, i);
 }
 
 static void error(mddev_t *mddev, mdk_rdev_t *rdev)
@@ -1379,12 +1455,12 @@ static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
 }
 
 
-static sector_t compute_blocknr(struct stripe_head *sh, int i)
+static sector_t
+compute_blocknr(raid5_conf_t *conf, int raid_disks, sector_t sector,
+	int pd_idx, int i)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = sh->disks;
 	int data_disks = raid_disks - conf->max_degraded;
-	sector_t new_sector = sh->sector, check;
+	sector_t new_sector = sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
 	int chunk_offset;
@@ -1396,7 +1472,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	stripe = new_sector;
 	BUG_ON(new_sector != stripe);
 
-	if (i == sh->pd_idx)
+	if (i == pd_idx)
 		return 0;
 	switch(conf->level) {
 	case 4: break;
@@ -1404,14 +1480,14 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-			if (i > sh->pd_idx)
+			if (i > pd_idx)
 				i--;
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			if (i < sh->pd_idx)
+			if (i < pd_idx)
 				i += raid_disks;
-			i -= (sh->pd_idx + 1);
+			i -= (pd_idx + 1);
 			break;
 		default:
 			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
@@ -1419,25 +1495,25 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		}
 		break;
 	case 6:
-		if (i == raid6_next_disk(sh->pd_idx, raid_disks))
+		if (i == raid6_next_disk(pd_idx, raid_disks))
 			return 0; /* It is the Q disk */
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-		  	if (sh->pd_idx == raid_disks-1)
-				i--; 	/* Q D D D P */
-			else if (i > sh->pd_idx)
+			if (pd_idx == raid_disks-1)
+				i--;	/* Q D D D P */
+			else if (i > pd_idx)
 				i -= 2; /* D D P Q D */
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			if (sh->pd_idx == raid_disks-1)
+			if (pd_idx == raid_disks-1)
 				i--; /* Q D D D P */
 			else {
 				/* D D P Q D */
-				if (i < sh->pd_idx)
+				if (i < pd_idx)
 					i += raid_disks;
-				i -= (sh->pd_idx + 2);
+				i -= (pd_idx + 2);
 			}
 			break;
 		default:
@@ -1451,7 +1527,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset;
 
 	check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
-	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
+	if (check != sector || dummy1 != dd_idx || dummy2 != pd_idx) {
 		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
 	}
@@ -1518,8 +1594,9 @@ static void copy_data(int frombio, struct bio *bio,
 
 static void compute_parity6(struct stripe_head *sh, int method)
 {
-	raid6_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
+	struct stripe_queue *sq = sh->sq;
+	raid6_conf_t *conf = sq->raid_conf;
+	int i, pd_idx = sq->pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
 	struct bio *chosen;
 	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
 	void *ptrs[disks];
@@ -1535,11 +1612,12 @@ static void compute_parity6(struct stripe_head *sh, int method)
 		BUG();		/* READ_MODIFY_WRITE N/A for RAID-6 */
 	case RECONSTRUCT_WRITE:
 		for (i= disks; i-- ;)
-			if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) {
-				chosen = sh->dev[i].towrite;
-				sh->dev[i].towrite = NULL;
+			if (i != pd_idx && i != qd_idx && sq->dev[i].towrite) {
+				chosen = sq->dev[i].towrite;
+				sq->dev[i].towrite = NULL;
 
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				if (test_and_clear_bit(R5_Overlap,
+							&sq->dev[i].flags))
 					wake_up(&conf->wait_for_overlap);
 
 				BUG_ON(sh->dev[i].written);
@@ -1550,17 +1628,17 @@ static void compute_parity6(struct stripe_head *sh, int method)
 		BUG();		/* Not implemented yet */
 	}
 
-	for (i = disks; i--;)
+	for (i = disks; i--; )
 		if (sh->dev[i].written) {
-			sector_t sector = sh->dev[i].sector;
+			sector_t sector = sq->dev[i].sector;
 			struct bio *wbi = sh->dev[i].written;
 			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
 				copy_data(1, wbi, sh->dev[i].page, sector);
 				wbi = r5_next_bio(wbi, sector);
 			}
 
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
+			set_bit(R5_LOCKED, &sq->dev[i].flags);
+			set_bit(R5_UPTODATE, &sq->dev[i].flags);
 		}
 
 //	switch(method) {
@@ -1573,7 +1651,8 @@ static void compute_parity6(struct stripe_head *sh, int method)
 		i = d0_idx;
 		do {
 			ptrs[count++] = page_address(sh->dev[i].page);
-			if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			if (count <= disks-2 &&
+			    !test_bit(R5_UPTODATE, &sq->dev[i].flags))
 				printk("block %d/%d not uptodate on parity calc\n", i,count);
 			i = raid6_next_disk(i, disks);
 		} while ( i != d0_idx );
@@ -1584,14 +1663,14 @@ static void compute_parity6(struct stripe_head *sh, int method)
 
 	switch(method) {
 	case RECONSTRUCT_WRITE:
-		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
-		set_bit(R5_UPTODATE, &sh->dev[qd_idx].flags);
-		set_bit(R5_LOCKED,   &sh->dev[pd_idx].flags);
-		set_bit(R5_LOCKED,   &sh->dev[qd_idx].flags);
+		set_bit(R5_UPTODATE, &sq->dev[pd_idx].flags);
+		set_bit(R5_UPTODATE, &sq->dev[qd_idx].flags);
+		set_bit(R5_LOCKED,   &sq->dev[pd_idx].flags);
+		set_bit(R5_LOCKED,   &sq->dev[qd_idx].flags);
 		break;
 	case UPDATE_PARITY:
-		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
-		set_bit(R5_UPTODATE, &sh->dev[qd_idx].flags);
+		set_bit(R5_UPTODATE, &sq->dev[pd_idx].flags);
+		set_bit(R5_UPTODATE, &sq->dev[qd_idx].flags);
 		break;
 	}
 }
@@ -1600,9 +1679,10 @@ static void compute_parity6(struct stripe_head *sh, int method)
 /* Compute one missing block */
 static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 {
+	struct stripe_queue *sq = sh->sq;
 	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sq->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 
 	pr_debug("compute_block_1, stripe %llu, idx %d\n",
@@ -1619,7 +1699,7 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 			if (i == dd_idx || i == qd_idx)
 				continue;
 			p = page_address(sh->dev[i].page);
-			if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			if (test_bit(R5_UPTODATE, &sq->dev[i].flags))
 				ptr[count++] = p;
 			else
 				printk("compute_block() %d, stripe %llu, %d"
@@ -1630,8 +1710,10 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 		}
 		if (count)
 			xor_blocks(count, STRIPE_SIZE, dest, ptr);
-		if (!nozero) set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-		else clear_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
+		if (!nozero)
+			set_bit(R5_UPTODATE, &sq->dev[dd_idx].flags);
+		else
+			clear_bit(R5_UPTODATE, &sq->dev[dd_idx].flags);
 	}
 }
 
@@ -1639,7 +1721,7 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 {
 	int i, count, disks = sh->disks;
-	int pd_idx = sh->pd_idx;
+	int pd_idx = sh->sq->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 	int d0_idx = raid6_next_disk(qd_idx, disks);
 	int faila, failb;
@@ -1671,6 +1753,7 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 
 	/* We're missing D+P or D+D; build pointer table */
 	{
+		struct stripe_queue *sq = sh->sq;
 		/**** FIX THIS: This could be very bad if disks is close to 256 ****/
 		void *ptrs[disks];
 
@@ -1680,7 +1763,7 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 			ptrs[count++] = page_address(sh->dev[i].page);
 			i = raid6_next_disk(i, disks);
 			if (i != dd_idx1 && i != dd_idx2 &&
-			    !test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			    !test_bit(R5_UPTODATE, &sq->dev[i].flags))
 				printk("compute_2 with missing block %d/%d\n", count, i);
 		} while ( i != d0_idx );
 
@@ -1693,16 +1776,17 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 		}
 
 		/* Both the above update both missing blocks */
-		set_bit(R5_UPTODATE, &sh->dev[dd_idx1].flags);
-		set_bit(R5_UPTODATE, &sh->dev[dd_idx2].flags);
+		set_bit(R5_UPTODATE, &sq->dev[dd_idx1].flags);
+		set_bit(R5_UPTODATE, &sq->dev[dd_idx2].flags);
 	}
 }
 
 static int
 handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 {
-	int i, pd_idx = sh->pd_idx, disks = sh->disks;
 	int locked = 0;
+	struct stripe_queue *sq = sh->sq;
+	int i, pd_idx = sq->pd_idx, disks = sh->disks;
 
 	if (rcw) {
 		/* if we are not expanding this is a proper write request, and
@@ -1718,18 +1802,18 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 		sh->ops.count++;
 
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
 
-			if (dev->towrite) {
-				set_bit(R5_LOCKED, &dev->flags);
+			if (dev_q->towrite) {
+				set_bit(R5_LOCKED, &dev_q->flags);
 				if (!expand)
-					clear_bit(R5_UPTODATE, &dev->flags);
+					clear_bit(R5_UPTODATE, &dev_q->flags);
 				locked++;
 			}
 		}
 	} else {
-		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
-			test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+		BUG_ON(!(test_bit(R5_UPTODATE, &sq->dev[pd_idx].flags) ||
+			test_bit(R5_Wantcompute, &sq->dev[pd_idx].flags)));
 
 		set_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 		set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
@@ -1738,7 +1822,7 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 		sh->ops.count += 3;
 
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
 			if (i == pd_idx)
 				continue;
 
@@ -1747,12 +1831,12 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 			 * written so we distinguish these blocks by the
 			 * R5_Wantprexor bit
 			 */
-			if (dev->towrite &&
-			    (test_bit(R5_UPTODATE, &dev->flags) ||
-			    test_bit(R5_Wantcompute, &dev->flags))) {
-				set_bit(R5_Wantprexor, &dev->flags);
-				set_bit(R5_LOCKED, &dev->flags);
-				clear_bit(R5_UPTODATE, &dev->flags);
+			if (dev_q->towrite &&
+			    (test_bit(R5_UPTODATE, &dev_q->flags) ||
+			    test_bit(R5_Wantcompute, &dev_q->flags))) {
+				set_bit(R5_Wantprexor, &dev_q->flags);
+				set_bit(R5_LOCKED, &dev_q->flags);
+				clear_bit(R5_UPTODATE, &dev_q->flags);
 				locked++;
 			}
 		}
@@ -1761,8 +1845,8 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 	/* keep the parity disk locked while asynchronous operations
 	 * are in flight
 	 */
-	set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
-	clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+	set_bit(R5_LOCKED, &sq->dev[pd_idx].flags);
+	clear_bit(R5_UPTODATE, &sq->dev[pd_idx].flags);
 	locked++;
 
 	pr_debug("%s: stripe %llu locked: %d pending: %lx\n",
@@ -1780,7 +1864,8 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
 static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
 {
 	struct bio **bip;
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int firstwrite=0;
 
 	pr_debug("adding bh b#%llu to stripe s#%llu\n",
@@ -1788,14 +1873,14 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		(unsigned long long)sh->sector);
 
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
 	if (forwrite) {
-		bip = &sh->dev[dd_idx].towrite;
+		bip = &sq->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
 			firstwrite = 1;
 	} else
-		bip = &sh->dev[dd_idx].toread;
+		bip = &sq->dev[dd_idx].toread;
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -1810,7 +1895,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 	*bip = bi;
 	bi->bi_phys_segments ++;
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
@@ -1819,29 +1904,29 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 	if (conf->mddev->bitmap && firstwrite) {
 		bitmap_startwrite(conf->mddev->bitmap, sh->sector,
 				  STRIPE_SECTORS, 0);
-		sh->bm_seq = conf->seq_flush+1;
+		sq->bm_seq = conf->seq_flush+1;
 		set_bit(STRIPE_BIT_DELAY, &sh->state);
 	}
 
 	if (forwrite) {
 		/* check if page is covered */
-		sector_t sector = sh->dev[dd_idx].sector;
-		for (bi=sh->dev[dd_idx].towrite;
-		     sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
+		sector_t sector = sq->dev[dd_idx].sector;
+		for (bi = sq->dev[dd_idx].towrite;
+		     sector < sq->dev[dd_idx].sector + STRIPE_SECTORS &&
 			     bi && bi->bi_sector <= sector;
-		     bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
+		     bi = r5_next_bio(bi, sq->dev[dd_idx].sector)) {
 			if (bi->bi_sector + (bi->bi_size>>9) >= sector)
 				sector = bi->bi_sector + (bi->bi_size>>9);
 		}
-		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
-			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
+			set_bit(R5_OVERWRITE, &sq->dev[dd_idx].flags);
 	}
 	return 1;
 
  overlap:
-	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	set_bit(R5_Overlap, &sq->dev[dd_idx].flags);
 	spin_unlock_irq(&conf->device_lock);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 	return 0;
 }
 
@@ -1873,11 +1958,13 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 				struct bio **return_bi)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
+
 	for (i = disks; i--; ) {
 		struct bio *bi;
 		int bitmap_end = 0;
 
-		if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
+		if (test_bit(R5_ReadError, &sq->dev[i].flags)) {
 			mdk_rdev_t *rdev;
 			rcu_read_lock();
 			rdev = rcu_dereference(conf->disks[i].rdev);
@@ -1888,19 +1975,19 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		}
 		spin_lock_irq(&conf->device_lock);
 		/* fail all writes first */
-		bi = sh->dev[i].towrite;
-		sh->dev[i].towrite = NULL;
+		bi = sq->dev[i].towrite;
+		sq->dev[i].towrite = NULL;
 		if (bi) {
 			s->to_write--;
 			bitmap_end = 1;
 		}
 
-		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+		if (test_and_clear_bit(R5_Overlap, &sq->dev[i].flags))
 			wake_up(&conf->wait_for_overlap);
 
 		while (bi && bi->bi_sector <
-			sh->dev[i].sector + STRIPE_SECTORS) {
-			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
+			sq->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *nextbi = r5_next_bio(bi, sq->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			if (--bi->bi_phys_segments == 0) {
 				md_write_end(conf->mddev);
@@ -1914,8 +2001,8 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		sh->dev[i].written = NULL;
 		if (bi) bitmap_end = 1;
 		while (bi && bi->bi_sector <
-		       sh->dev[i].sector + STRIPE_SECTORS) {
-			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
+		       sq->dev[i].sector + STRIPE_SECTORS) {
+			struct bio *bi2 = r5_next_bio(bi, sq->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			if (--bi->bi_phys_segments == 0) {
 				md_write_end(conf->mddev);
@@ -1928,18 +2015,18 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		/* fail any reads if this device is non-operational and
 		 * the data has not reached the cache yet.
 		 */
-		if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
-		    (!test_bit(R5_Insync, &sh->dev[i].flags) ||
-		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
-			bi = sh->dev[i].toread;
-			sh->dev[i].toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+		if (!test_bit(R5_Wantfill, &sq->dev[i].flags) &&
+		    (!test_bit(R5_Insync, &sq->dev[i].flags) ||
+		      test_bit(R5_ReadError, &sq->dev[i].flags))) {
+			bi = sq->dev[i].toread;
+			sq->dev[i].toread = NULL;
+			if (test_and_clear_bit(R5_Overlap, &sq->dev[i].flags))
 				wake_up(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
 			while (bi && bi->bi_sector <
-			       sh->dev[i].sector + STRIPE_SECTORS) {
+			       sq->dev[i].sector + STRIPE_SECTORS) {
 				struct bio *nextbi =
-					r5_next_bio(bi, sh->dev[i].sector);
+					r5_next_bio(bi, sq->dev[i].sector);
 				clear_bit(BIO_UPTODATE, &bi->bi_flags);
 				if (--bi->bi_phys_segments == 0) {
 					bi->bi_next = *return_bi;
@@ -1962,20 +2049,21 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
 			struct stripe_head_state *s, int disk_idx, int disks)
 {
-	struct r5dev *dev = &sh->dev[disk_idx];
-	struct r5dev *failed_dev = &sh->dev[s->failed_num];
+	struct stripe_queue *sq = sh->sq;
+	struct r5_queue_dev *dev_q = &sq->dev[disk_idx];
+	struct r5_queue_dev *failed_dev = &sq->dev[s->failed_num];
 
 	/* don't schedule compute operations or reads on the parity block while
 	 * a check is in flight
 	 */
-	if ((disk_idx == sh->pd_idx) &&
+	if ((disk_idx == sq->pd_idx) &&
 	     test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
 		return ~0;
 
 	/* is the data in this block needed, and can we get it? */
-	if (!test_bit(R5_LOCKED, &dev->flags) &&
-	    !test_bit(R5_UPTODATE, &dev->flags) && (dev->toread ||
-	    (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
+	if (!test_bit(R5_LOCKED, &dev_q->flags) &&
+	    !test_bit(R5_UPTODATE, &dev_q->flags) && (dev_q->toread ||
+	    (dev_q->towrite && !test_bit(R5_OVERWRITE, &dev_q->flags)) ||
 	     s->syncing || s->expanding || (s->failed &&
 	     (failed_dev->toread || (failed_dev->towrite &&
 	     !test_bit(R5_OVERWRITE, &failed_dev->flags)
@@ -1993,7 +2081,7 @@ static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
 		if ((s->uptodate == disks - 1) &&
 		    !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 			set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
-			set_bit(R5_Wantcompute, &dev->flags);
+			set_bit(R5_Wantcompute, &dev_q->flags);
 			sh->ops.target = disk_idx;
 			s->req_compute = 1;
 			sh->ops.count++;
@@ -2006,13 +2094,13 @@ static int __handle_issuing_new_read_requests5(struct stripe_head *sh,
 			s->uptodate++;
 			return 0; /* uptodate + compute == disks */
 		} else if ((s->uptodate < disks - 1) &&
-			test_bit(R5_Insync, &dev->flags)) {
+			test_bit(R5_Insync, &dev_q->flags)) {
 			/* Note: we hold off compute operations while checks are
 			 * in flight, but we still prefer 'compute' over 'read'
 			 * hence we only read if (uptodate < * disks-1)
 			 */
-			set_bit(R5_LOCKED, &dev->flags);
-			set_bit(R5_Wantread, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
+			set_bit(R5_Wantread, &dev_q->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
 			s->locked++;
@@ -2060,18 +2148,19 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 			int disks)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
 	for (i = disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
-		if (!test_bit(R5_LOCKED, &dev->flags) &&
-		    !test_bit(R5_UPTODATE, &dev->flags) &&
-		    (dev->toread || (dev->towrite &&
-		     !test_bit(R5_OVERWRITE, &dev->flags)) ||
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+		if (!test_bit(R5_LOCKED, &dev_q->flags) &&
+		    !test_bit(R5_UPTODATE, &dev_q->flags) &&
+		    (dev_q->toread || (dev_q->towrite &&
+		     !test_bit(R5_OVERWRITE, &dev_q->flags)) ||
 		     s->syncing || s->expanding ||
 		     (s->failed >= 1 &&
-		      (sh->dev[r6s->failed_num[0]].toread ||
+		      (sq->dev[r6s->failed_num[0]].toread ||
 		       s->to_write)) ||
 		     (s->failed >= 2 &&
-		      (sh->dev[r6s->failed_num[1]].toread ||
+		      (sq->dev[r6s->failed_num[1]].toread ||
 		       s->to_write)))) {
 			/* we would like to get this block, possibly
 			 * by computing it, but we might not be able to
@@ -2090,7 +2179,7 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 					if (other == i)
 						continue;
 					if (!test_bit(R5_UPTODATE,
-					      &sh->dev[other].flags))
+					      &sq->dev[other].flags))
 						break;
 				}
 				BUG_ON(other < 0);
@@ -2099,9 +2188,9 @@ static void handle_issuing_new_read_requests6(struct stripe_head *sh,
 				       i, other);
 				compute_block_2(sh, i, other);
 				s->uptodate += 2;
-			} else if (test_bit(R5_Insync, &dev->flags)) {
-				set_bit(R5_LOCKED, &dev->flags);
-				set_bit(R5_Wantread, &dev->flags);
+			} else if (test_bit(R5_Insync, &dev_q->flags)) {
+				set_bit(R5_LOCKED, &dev_q->flags);
+				set_bit(R5_Wantread, &dev_q->flags);
 				s->locked++;
 				pr_debug("Reading block %d (sync=%d)\n",
 					i, s->syncing);
@@ -2121,13 +2210,14 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 	struct stripe_head *sh, int disks, struct bio **return_bi)
 {
 	int i;
-	struct r5dev *dev;
+	struct stripe_queue *sq = sh->sq;
 
 	for (i = disks; i--; )
 		if (sh->dev[i].written) {
-			dev = &sh->dev[i];
-			if (!test_bit(R5_LOCKED, &dev->flags) &&
-				test_bit(R5_UPTODATE, &dev->flags)) {
+			struct r5dev *dev = &sh->dev[i];
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+			if (!test_bit(R5_LOCKED, &dev_q->flags) &&
+				test_bit(R5_UPTODATE, &dev_q->flags)) {
 				/* We can return any write requests */
 				struct bio *wbi, *wbi2;
 				int bitmap_end = 0;
@@ -2136,8 +2226,8 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
-					dev->sector + STRIPE_SECTORS) {
-					wbi2 = r5_next_bio(wbi, dev->sector);
+					dev_q->sector + STRIPE_SECTORS) {
+					wbi2 = r5_next_bio(wbi, dev_q->sector);
 					if (--wbi->bi_phys_segments == 0) {
 						md_write_end(conf->mddev);
 						wbi->bi_next = *return_bi;
@@ -2145,7 +2235,7 @@ static void handle_completed_write_requests(raid5_conf_t *conf,
 					}
 					wbi = wbi2;
 				}
-				if (dev->towrite == NULL)
+				if (dev_q->towrite == NULL)
 					bitmap_end = 1;
 				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
@@ -2162,24 +2252,25 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 		struct stripe_head *sh,	struct stripe_head_state *s, int disks)
 {
 	int rmw = 0, rcw = 0, i;
+	struct stripe_queue *sq = sh->sq;
 	for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
-		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx) &&
-		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		      test_bit(R5_Wantcompute, &dev->flags))) {
-			if (test_bit(R5_Insync, &dev->flags))
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+		if ((dev_q->towrite || i == sq->pd_idx) &&
+		    !test_bit(R5_LOCKED, &dev_q->flags) &&
+		    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
+		      test_bit(R5_Wantcompute, &dev_q->flags))) {
+			if (test_bit(R5_Insync, &dev_q->flags))
 				rmw++;
 			else
 				rmw += 2*disks;  /* cannot read it */
 		}
 		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		    test_bit(R5_Wantcompute, &dev->flags))) {
-			if (test_bit(R5_Insync, &dev->flags)) rcw++;
+		if (!test_bit(R5_OVERWRITE, &dev_q->flags) && i != sq->pd_idx &&
+		    !test_bit(R5_LOCKED, &dev_q->flags) &&
+		    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
+		    test_bit(R5_Wantcompute, &dev_q->flags))) {
+			if (test_bit(R5_Insync, &dev_q->flags)) rcw++;
 			else
 				rcw += 2*disks;
 		}
@@ -2190,18 +2281,18 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 	if (rmw < rcw && rmw > 0)
 		/* prefer read-modify-write, but need to get some data */
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx) &&
-			    !test_bit(R5_LOCKED, &dev->flags) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) ||
-			    test_bit(R5_Wantcompute, &dev->flags)) &&
-			    test_bit(R5_Insync, &dev->flags)) {
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+			if ((dev_q->towrite || i == sq->pd_idx) &&
+			    !test_bit(R5_LOCKED, &dev_q->flags) &&
+			    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
+			    test_bit(R5_Wantcompute, &dev_q->flags)) &&
+			    test_bit(R5_Insync, &dev_q->flags)) {
 				if (
 				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 					pr_debug("Read_old block "
 						"%d for r-m-w\n", i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
+					set_bit(R5_LOCKED, &dev_q->flags);
+					set_bit(R5_Wantread, &dev_q->flags);
 					if (!test_and_set_bit(
 						STRIPE_OP_IO, &sh->ops.pending))
 						sh->ops.count++;
@@ -2215,19 +2306,19 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 	if (rcw <= rmw && rcw > 0)
 		/* want reconstruct write, but need to get some data */
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if (!test_bit(R5_OVERWRITE, &dev->flags) &&
-			    i != sh->pd_idx &&
-			    !test_bit(R5_LOCKED, &dev->flags) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) ||
-			    test_bit(R5_Wantcompute, &dev->flags)) &&
-			    test_bit(R5_Insync, &dev->flags)) {
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+			if (!test_bit(R5_OVERWRITE, &dev_q->flags) &&
+			    i != sq->pd_idx &&
+			    !test_bit(R5_LOCKED, &dev_q->flags) &&
+			    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
+			    test_bit(R5_Wantcompute, &dev_q->flags)) &&
+			    test_bit(R5_Insync, &dev_q->flags)) {
 				if (
 				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 					pr_debug("Read_old block "
 						"%d for Reconstruct\n", i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
+					set_bit(R5_LOCKED, &dev_q->flags);
+					set_bit(R5_Wantread, &dev_q->flags);
 					if (!test_and_set_bit(
 						STRIPE_OP_IO, &sh->ops.pending))
 						sh->ops.count++;
@@ -2259,20 +2350,22 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 		struct stripe_head *sh,	struct stripe_head_state *s,
 		struct r6_state *r6s, int disks)
 {
-	int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i;
+	struct stripe_queue *sq = sh->sq;
+	int rcw = 0, must_compute = 0, pd_idx = sq->pd_idx, i;
 	int qd_idx = r6s->qd_idx;
+
 	for (i = disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
+		struct r5_queue_dev *dev_q = &sq->dev[i];
 		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags)
+		if (!test_bit(R5_OVERWRITE, &dev_q->flags)
 		    && i != pd_idx && i != qd_idx
-		    && (!test_bit(R5_LOCKED, &dev->flags)
+		    && (!test_bit(R5_LOCKED, &dev_q->flags)
 			    ) &&
-		    !test_bit(R5_UPTODATE, &dev->flags)) {
-			if (test_bit(R5_Insync, &dev->flags)) rcw++;
+		    !test_bit(R5_UPTODATE, &dev_q->flags)) {
+			if (test_bit(R5_Insync, &dev_q->flags)) rcw++;
 			else {
 				pr_debug("raid6: must_compute: "
-					"disk %d flags=%#lx\n", i, dev->flags);
+				   "disk %d flags=%#lx\n", i, dev_q->flags);
 				must_compute++;
 			}
 		}
@@ -2284,19 +2377,19 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 	if (rcw > 0)
 		/* want reconstruct write, but need to get some data */
 		for (i = disks; i--; ) {
-			struct r5dev *dev = &sh->dev[i];
-			if (!test_bit(R5_OVERWRITE, &dev->flags)
+			struct r5_queue_dev *dev_q = &sq->dev[i];
+			if (!test_bit(R5_OVERWRITE, &dev_q->flags)
 			    && !(s->failed == 0 && (i == pd_idx || i == qd_idx))
-			    && !test_bit(R5_LOCKED, &dev->flags) &&
-			    !test_bit(R5_UPTODATE, &dev->flags) &&
-			    test_bit(R5_Insync, &dev->flags)) {
+			    && !test_bit(R5_LOCKED, &dev_q->flags) &&
+			    !test_bit(R5_UPTODATE, &dev_q->flags) &&
+			    test_bit(R5_Insync, &dev_q->flags)) {
 				if (
 				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 					pr_debug("Read_old stripe %llu "
 						"block %d for Reconstruct\n",
 					     (unsigned long long)sh->sector, i);
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
+					set_bit(R5_LOCKED, &dev_q->flags);
+					set_bit(R5_Wantread, &dev_q->flags);
 					s->locked++;
 				} else {
 					pr_debug("Request delayed stripe %llu "
@@ -2334,11 +2427,11 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		/* now every locked buffer is ready to be written */
 		for (i = disks; i--; )
-			if (test_bit(R5_LOCKED, &sh->dev[i].flags)) {
+			if (test_bit(R5_LOCKED, &sq->dev[i].flags)) {
 				pr_debug("Writing stripe %llu block %d\n",
 				       (unsigned long long)sh->sector, i);
 				s->locked++;
-				set_bit(R5_Wantwrite, &sh->dev[i].flags);
+				set_bit(R5_Wantwrite, &sq->dev[i].flags);
 			}
 		/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
 		set_bit(STRIPE_INSYNC, &sh->state);
@@ -2346,7 +2439,7 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 			atomic_dec(&conf->preread_active_stripes);
 			if (atomic_read(&conf->preread_active_stripes) <
-			    IO_THRESHOLD)
+				IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		}
 	}
@@ -2355,6 +2448,8 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 				struct stripe_head_state *s, int disks)
 {
+	struct stripe_queue *sq = sh->sq;
+
 	set_bit(STRIPE_HANDLE, &sh->state);
 	/* Take one of the following actions:
 	 * 1/ start a check parity operation if (uptodate == disks)
@@ -2366,7 +2461,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 	    !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
 		if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 			BUG_ON(s->uptodate != disks);
-			clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+			clear_bit(R5_UPTODATE, &sq->dev[sq->pd_idx].flags);
 			sh->ops.count++;
 			s->uptodate--;
 		} else if (
@@ -2392,8 +2487,8 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 					set_bit(STRIPE_OP_MOD_REPAIR_PD,
 						&sh->ops.pending);
 					set_bit(R5_Wantcompute,
-						&sh->dev[sh->pd_idx].flags);
-					sh->ops.target = sh->pd_idx;
+						&sq->dev[sq->pd_idx].flags);
+					sh->ops.target = sq->pd_idx;
 					sh->ops.count++;
 					s->uptodate++;
 				}
@@ -2417,16 +2512,16 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 	if (!test_bit(STRIPE_INSYNC, &sh->state) &&
 		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending) &&
 		!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
-		struct r5dev *dev;
+		struct r5_queue_dev *dev_q;
 		/* either failed parity check, or recovery is happening */
 		if (s->failed == 0)
-			s->failed_num = sh->pd_idx;
-		dev = &sh->dev[s->failed_num];
-		BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
+			s->failed_num = sq->pd_idx;
+		dev_q = &sq->dev[s->failed_num];
+		BUG_ON(!test_bit(R5_UPTODATE, &dev_q->flags));
 		BUG_ON(s->uptodate != disks);
 
-		set_bit(R5_LOCKED, &dev->flags);
-		set_bit(R5_Wantwrite, &dev->flags);
+		set_bit(R5_LOCKED, &dev_q->flags);
+		set_bit(R5_Wantwrite, &dev_q->flags);
 		if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 			sh->ops.count++;
 
@@ -2443,8 +2538,9 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
 				int disks)
 {
 	int update_p = 0, update_q = 0;
-	struct r5dev *dev;
-	int pd_idx = sh->pd_idx;
+	struct stripe_queue *sq = sh->sq;
+	struct r5_queue_dev *dev_q;
+	int pd_idx = sq->pd_idx;
 	int qd_idx = r6s->qd_idx;
 
 	set_bit(STRIPE_HANDLE, &sh->state);
@@ -2500,29 +2596,29 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
 		 */
 
 		if (s->failed == 2) {
-			dev = &sh->dev[r6s->failed_num[1]];
+			dev_q = &sq->dev[r6s->failed_num[1]];
 			s->locked++;
-			set_bit(R5_LOCKED, &dev->flags);
-			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
+			set_bit(R5_Wantwrite, &dev_q->flags);
 		}
 		if (s->failed >= 1) {
-			dev = &sh->dev[r6s->failed_num[0]];
+			dev_q = &sq->dev[r6s->failed_num[0]];
 			s->locked++;
-			set_bit(R5_LOCKED, &dev->flags);
-			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
+			set_bit(R5_Wantwrite, &dev_q->flags);
 		}
 
 		if (update_p) {
-			dev = &sh->dev[pd_idx];
+			dev_q = &sq->dev[pd_idx];
 			s->locked++;
-			set_bit(R5_LOCKED, &dev->flags);
-			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
+			set_bit(R5_Wantwrite, &dev_q->flags);
 		}
 		if (update_q) {
-			dev = &sh->dev[qd_idx];
+			dev_q = &sq->dev[qd_idx];
 			s->locked++;
-			set_bit(R5_LOCKED, &dev->flags);
-			set_bit(R5_Wantwrite, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
+			set_bit(R5_Wantwrite, &dev_q->flags);
 		}
 		clear_bit(STRIPE_DEGRADED, &sh->state);
 
@@ -2534,6 +2630,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				struct r6_state *r6s)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
 
 	/* We have read all the blocks in this stripe and now we need to
 	 * copy some of them into a target stripe for expand.
@@ -2541,11 +2638,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 	struct dma_async_tx_descriptor *tx = NULL;
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	for (i = 0; i < sh->disks; i++)
-		if (i != sh->pd_idx && (r6s && i != r6s->qd_idx)) {
+		if (i != sq->pd_idx && (r6s && i != r6s->qd_idx)) {
 			int dd_idx, pd_idx, j;
 			struct stripe_head *sh2;
 
-			sector_t bn = compute_blocknr(sh, i);
+			sector_t bn = compute_blocknr(conf, sh->disks,
+						sh->sector, sq->pd_idx, i);
 			sector_t s = raid5_compute_sector(bn, conf->raid_disks,
 						conf->raid_disks -
 						conf->max_degraded, &dd_idx,
@@ -2559,7 +2657,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				 */
 				continue;
 			if (!test_bit(STRIPE_EXPANDING, &sh2->state) ||
-			   test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
+			  test_bit(R5_Expanded, &sh2->sq->dev[dd_idx].flags)) {
 				/* must have already done this block */
 				release_stripe(sh2);
 				continue;
@@ -2570,12 +2668,13 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				sh->dev[i].page, 0, 0, STRIPE_SIZE,
 				ASYNC_TX_DEP_ACK, tx, NULL, NULL);
 
-			set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
-			set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
+			set_bit(R5_Expanded, &sh2->sq->dev[dd_idx].flags);
+			set_bit(R5_UPTODATE, &sh2->sq->dev[dd_idx].flags);
 			for (j = 0; j < conf->raid_disks; j++)
-				if (j != sh2->pd_idx &&
+				if (j != sh2->sq->pd_idx &&
 				    (r6s && j != r6s->qd_idx) &&
-				    !test_bit(R5_Expanded, &sh2->dev[j].flags))
+				    !test_bit(R5_Expanded,
+				     &sh2->sq->dev[j].flags))
 					break;
 			if (j == conf->raid_disks) {
 				set_bit(STRIPE_EXPAND_READY, &sh2->state);
@@ -2610,20 +2709,21 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 
 static void handle_stripe5(struct stripe_head *sh)
 {
-	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sh->sq->raid_conf;
 	int disks = sh->disks, i;
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
-	struct r5dev *dev;
+	struct r5_queue_dev *dev_q;
 	unsigned long pending = 0;
 
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
-		atomic_read(&sh->count), sh->pd_idx,
+		atomic_read(&sh->count), sq->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
@@ -2636,33 +2736,35 @@ static void handle_stripe5(struct stripe_head *sh)
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		struct r5dev *dev = &sh->dev[i];
-		clear_bit(R5_Insync, &dev->flags);
+
+		dev_q = &sq->dev[i];
+		clear_bit(R5_Insync, &dev_q->flags);
 
 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
-			"written %p\n",	i, dev->flags, dev->toread, dev->read,
-			dev->towrite, dev->written);
+			"written %p\n",	i, dev_q->flags, dev_q->toread,
+			dev->read, dev_q->towrite, dev->written);
 
 		/* maybe we can request a biofill operation
 		 *
 		 * new wantfill requests are only permitted while
 		 * STRIPE_OP_BIOFILL is clear
 		 */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
+		if (test_bit(R5_UPTODATE, &dev_q->flags) && dev_q->toread &&
 			!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
-			set_bit(R5_Wantfill, &dev->flags);
+			set_bit(R5_Wantfill, &dev_q->flags);
 
 		/* now count some things */
-		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
-		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
-		if (test_bit(R5_Wantcompute, &dev->flags)) s.compute++;
+		if (test_bit(R5_LOCKED, &dev_q->flags)) s.locked++;
+		if (test_bit(R5_UPTODATE, &dev_q->flags)) s.uptodate++;
+		if (test_bit(R5_Wantcompute, &dev_q->flags)) s.compute++;
 
-		if (test_bit(R5_Wantfill, &dev->flags))
+		if (test_bit(R5_Wantfill, &dev_q->flags))
 			s.to_fill++;
-		else if (dev->toread)
+		else if (dev_q->toread)
 			s.to_read++;
-		if (dev->towrite) {
+		if (dev_q->towrite) {
 			s.to_write++;
-			if (!test_bit(R5_OVERWRITE, &dev->flags))
+			if (!test_bit(R5_OVERWRITE, &dev_q->flags))
 				s.non_overwrite++;
 		}
 		if (dev->written)
@@ -2670,15 +2772,15 @@ static void handle_stripe5(struct stripe_head *sh)
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
-			clear_bit(R5_ReadError, &dev->flags);
-			clear_bit(R5_ReWrite, &dev->flags);
+			clear_bit(R5_ReadError, &dev_q->flags);
+			clear_bit(R5_ReWrite, &dev_q->flags);
 		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
-		    || test_bit(R5_ReadError, &dev->flags)) {
+		    || test_bit(R5_ReadError, &dev_q->flags)) {
 			s.failed++;
 			s.failed_num = i;
 		} else
-			set_bit(R5_Insync, &dev->flags);
+			set_bit(R5_Insync, &dev_q->flags);
 	}
 	rcu_read_unlock();
 
@@ -2704,12 +2806,12 @@ static void handle_stripe5(struct stripe_head *sh)
 	/* might be able to return some write requests if the parity block
 	 * is safe, or on a failed drive
 	 */
-	dev = &sh->dev[sh->pd_idx];
+	dev_q = &sq->dev[sq->pd_idx];
 	if ( s.written &&
-	     ((test_bit(R5_Insync, &dev->flags) &&
-	       !test_bit(R5_LOCKED, &dev->flags) &&
-	       test_bit(R5_UPTODATE, &dev->flags)) ||
-	       (s.failed == 1 && s.failed_num == sh->pd_idx)))
+	     ((test_bit(R5_Insync, &dev_q->flags) &&
+	       !test_bit(R5_LOCKED, &dev_q->flags) &&
+	       test_bit(R5_UPTODATE, &dev_q->flags)) ||
+		(s.failed == 1 && s.failed_num == sq->pd_idx)))
 		handle_completed_write_requests(conf, sh, disks, &return_bi);
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -2736,7 +2838,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 
 		for (i = disks; i--; )
-			clear_bit(R5_Wantprexor, &sh->dev[i].flags);
+			clear_bit(R5_Wantprexor, &sq->dev[i].flags);
 	}
 
 	/* if only POSTXOR is set then this is an 'expand' postxor */
@@ -2754,18 +2856,20 @@ static void handle_stripe5(struct stripe_head *sh)
 		/* All the 'written' buffers and the parity block are ready to
 		 * be written back to disk
 		 */
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+		BUG_ON(!test_bit(R5_UPTODATE, &sq->dev[sq->pd_idx].flags));
 		for (i = disks; i--; ) {
-			dev = &sh->dev[i];
-			if (test_bit(R5_LOCKED, &dev->flags) &&
-				(i == sh->pd_idx || dev->written)) {
+			struct r5dev *dev = &sh->dev[i];
+
+			dev_q = &sq->dev[i];
+			if (test_bit(R5_LOCKED, &dev_q->flags) &&
+				(i == sq->pd_idx || dev->written)) {
 				pr_debug("Writing block %d\n", i);
-				set_bit(R5_Wantwrite, &dev->flags);
+				set_bit(R5_Wantwrite, &dev_q->flags);
 				if (!test_and_set_bit(
 				    STRIPE_OP_IO, &sh->ops.pending))
 					sh->ops.count++;
-				if (!test_bit(R5_Insync, &dev->flags) ||
-				    (i == sh->pd_idx && s.failed == 0))
+				if (!test_bit(R5_Insync, &dev_q->flags) ||
+				    (i == sq->pd_idx && s.failed == 0))
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
@@ -2808,24 +2912,24 @@ static void handle_stripe5(struct stripe_head *sh)
 	 * the repair/check process
 	 */
 	if (s.failed == 1 && !conf->mddev->ro &&
-	    test_bit(R5_ReadError, &sh->dev[s.failed_num].flags)
-	    && !test_bit(R5_LOCKED, &sh->dev[s.failed_num].flags)
-	    && test_bit(R5_UPTODATE, &sh->dev[s.failed_num].flags)
+	    test_bit(R5_ReadError, &sq->dev[s.failed_num].flags)
+	    && !test_bit(R5_LOCKED, &sq->dev[s.failed_num].flags)
+	    && test_bit(R5_UPTODATE, &sq->dev[s.failed_num].flags)
 		) {
-		dev = &sh->dev[s.failed_num];
-		if (!test_bit(R5_ReWrite, &dev->flags)) {
-			set_bit(R5_Wantwrite, &dev->flags);
+		dev_q = &sq->dev[s.failed_num];
+		if (!test_bit(R5_ReWrite, &dev_q->flags)) {
+			set_bit(R5_Wantwrite, &dev_q->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
-			set_bit(R5_ReWrite, &dev->flags);
-			set_bit(R5_LOCKED, &dev->flags);
+			set_bit(R5_ReWrite, &dev_q->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
 			s.locked++;
 		} else {
 			/* let's read it back */
-			set_bit(R5_Wantread, &dev->flags);
+			set_bit(R5_Wantread, &dev_q->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
-			set_bit(R5_LOCKED, &dev->flags);
+			set_bit(R5_LOCKED, &dev_q->flags);
 			s.locked++;
 		}
 	}
@@ -2843,7 +2947,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
 
 		for (i = conf->raid_disks; i--; ) {
-			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+			set_bit(R5_Wantwrite, &sq->dev[i].flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
 		}
@@ -2853,7 +2957,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
 			conf->raid_disks);
 		s.locked += handle_write_operations5(sh, 0, 1);
 	} else if (s.expanded &&
@@ -2870,7 +2974,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (sh->ops.count)
 		pending = get_stripe_work(sh);
 
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	if (pending)
 		raid5_run_ops(sh, pending);
@@ -2881,13 +2985,14 @@ static void handle_stripe5(struct stripe_head *sh)
 
 static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 {
-	raid6_conf_t *conf = sh->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid6_conf_t *conf = sq->raid_conf;
 	int disks = sh->disks;
 	struct bio *return_bi = NULL;
-	int i, pd_idx = sh->pd_idx;
+	int i, pd_idx = sq->pd_idx;
 	struct stripe_head_state s;
 	struct r6_state r6s;
-	struct r5dev *dev, *pdev, *qdev;
+	struct r5_queue_dev *dev_q, *pdev_q, *qdev_q;
 
 	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
@@ -2896,7 +3001,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
 	memset(&s, 0, sizeof(s));
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
@@ -2908,24 +3013,28 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
-		dev = &sh->dev[i];
-		clear_bit(R5_Insync, &dev->flags);
+		struct r5dev *dev = &sh->dev[i];
+
+		dev_q = &sq->dev[i];
+		clear_bit(R5_Insync, &dev_q->flags);
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
-			i, dev->flags, dev->toread, dev->towrite, dev->written);
+			i, dev_q->flags, dev_q->toread, dev_q->towrite,
+			dev->written);
 		/* maybe we can reply to a read */
-		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
+		if (test_bit(R5_UPTODATE, &dev_q->flags) && dev_q->toread) {
 			struct bio *rbi, *rbi2;
 			pr_debug("Return read for disc %d\n", i);
 			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
+			rbi = dev_q->toread;
+			dev_q->toread = NULL;
+			if (test_and_clear_bit(R5_Overlap, &dev_q->flags))
 				wake_up(&conf->wait_for_overlap);
 			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
+			while (rbi && rbi->bi_sector <
+			       dev_q->sector + STRIPE_SECTORS) {
+				copy_data(0, rbi, dev->page, dev_q->sector);
+				rbi2 = r5_next_bio(rbi, dev_q->sector);
 				spin_lock_irq(&conf->device_lock);
 				if (--rbi->bi_phys_segments == 0) {
 					rbi->bi_next = return_bi;
@@ -2937,15 +3046,15 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		}
 
 		/* now count some things */
-		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
-		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
+		if (test_bit(R5_LOCKED, &dev_q->flags)) s.locked++;
+		if (test_bit(R5_UPTODATE, &dev_q->flags)) s.uptodate++;
 
 
-		if (dev->toread)
+		if (dev_q->toread)
 			s.to_read++;
-		if (dev->towrite) {
+		if (dev_q->towrite) {
 			s.to_write++;
-			if (!test_bit(R5_OVERWRITE, &dev->flags))
+			if (!test_bit(R5_OVERWRITE, &dev_q->flags))
 				s.non_overwrite++;
 		}
 		if (dev->written)
@@ -2953,16 +3062,16 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
-			clear_bit(R5_ReadError, &dev->flags);
-			clear_bit(R5_ReWrite, &dev->flags);
+			clear_bit(R5_ReadError, &dev_q->flags);
+			clear_bit(R5_ReWrite, &dev_q->flags);
 		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
-		    || test_bit(R5_ReadError, &dev->flags)) {
+		    || test_bit(R5_ReadError, &dev_q->flags)) {
 			if (s.failed < 2)
 				r6s.failed_num[s.failed] = i;
 			s.failed++;
 		} else
-			set_bit(R5_Insync, &dev->flags);
+			set_bit(R5_Insync, &dev_q->flags);
 	}
 	rcu_read_unlock();
 	pr_debug("locked=%d uptodate=%d to_read=%d"
@@ -2985,20 +3094,20 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 * might be able to return some write requests if the parity blocks
 	 * are safe, or on a failed drive
 	 */
-	pdev = &sh->dev[pd_idx];
+	pdev_q = &sq->dev[pd_idx];
 	r6s.p_failed = (s.failed >= 1 && r6s.failed_num[0] == pd_idx)
 		|| (s.failed >= 2 && r6s.failed_num[1] == pd_idx);
-	qdev = &sh->dev[r6s.qd_idx];
+	qdev_q = &sq->dev[r6s.qd_idx];
 	r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == r6s.qd_idx)
 		|| (s.failed >= 2 && r6s.failed_num[1] == r6s.qd_idx);
 
 	if ( s.written &&
-	     ( r6s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
-			     && !test_bit(R5_LOCKED, &pdev->flags)
-			     && test_bit(R5_UPTODATE, &pdev->flags)))) &&
-	     ( r6s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
-			     && !test_bit(R5_LOCKED, &qdev->flags)
-			     && test_bit(R5_UPTODATE, &qdev->flags)))))
+	     ( r6s.p_failed || ((test_bit(R5_Insync, &pdev_q->flags)
+			     && !test_bit(R5_LOCKED, &pdev_q->flags)
+			     && test_bit(R5_UPTODATE, &pdev_q->flags)))) &&
+	     ( r6s.q_failed || ((test_bit(R5_Insync, &qdev_q->flags)
+			     && !test_bit(R5_LOCKED, &qdev_q->flags)
+			     && test_bit(R5_UPTODATE, &qdev_q->flags)))))
 		handle_completed_write_requests(conf, sh, disks, &return_bi);
 
 	/* Now we might consider reading some blocks, either to check/generate
@@ -3030,19 +3139,19 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 */
 	if (s.failed <= 2 && !conf->mddev->ro)
 		for (i = 0; i < s.failed; i++) {
-			dev = &sh->dev[r6s.failed_num[i]];
-			if (test_bit(R5_ReadError, &dev->flags)
-			    && !test_bit(R5_LOCKED, &dev->flags)
-			    && test_bit(R5_UPTODATE, &dev->flags)
+			dev_q = &sq->dev[r6s.failed_num[i]];
+			if (test_bit(R5_ReadError, &dev_q->flags)
+			    && !test_bit(R5_LOCKED, &dev_q->flags)
+			    && test_bit(R5_UPTODATE, &dev_q->flags)
 				) {
-				if (!test_bit(R5_ReWrite, &dev->flags)) {
-					set_bit(R5_Wantwrite, &dev->flags);
-					set_bit(R5_ReWrite, &dev->flags);
-					set_bit(R5_LOCKED, &dev->flags);
+				if (!test_bit(R5_ReWrite, &dev_q->flags)) {
+					set_bit(R5_Wantwrite, &dev_q->flags);
+					set_bit(R5_ReWrite, &dev_q->flags);
+					set_bit(R5_LOCKED, &dev_q->flags);
 				} else {
 					/* let's read it back */
-					set_bit(R5_Wantread, &dev->flags);
-					set_bit(R5_LOCKED, &dev->flags);
+					set_bit(R5_Wantread, &dev_q->flags);
+					set_bit(R5_LOCKED, &dev_q->flags);
 				}
 			}
 		}
@@ -3050,13 +3159,13 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
+		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
 					     conf->raid_disks);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
+			set_bit(R5_LOCKED, &sq->dev[i].flags);
 			s.locked++;
-			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+			set_bit(R5_Wantwrite, &sq->dev[i].flags);
 		}
 		clear_bit(STRIPE_EXPANDING, &sh->state);
 	} else if (s.expanded) {
@@ -3069,7 +3178,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	if (s.expanding && s.locked == 0)
 		handle_stripe_expansion(conf, sh, &r6s);
 
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	return_io(return_bi);
 
@@ -3077,9 +3186,9 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		int rw;
 		struct bio *bi;
 		mdk_rdev_t *rdev;
-		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+		if (test_and_clear_bit(R5_Wantwrite, &sq->dev[i].flags))
 			rw = WRITE;
-		else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+		else if (test_and_clear_bit(R5_Wantread, &sq->dev[i].flags))
 			rw = READ;
 		else
 			continue;
@@ -3119,7 +3228,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			bi->bi_size = STRIPE_SIZE;
 			bi->bi_next = NULL;
 			if (rw == WRITE &&
-			    test_bit(R5_ReWrite, &sh->dev[i].flags))
+			    test_bit(R5_ReWrite, &sq->dev[i].flags))
 				atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
 			generic_make_request(bi);
 		} else {
@@ -3127,7 +3236,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 				set_bit(STRIPE_DEGRADED, &sh->state);
 			pr_debug("skip op %ld on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
-			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			clear_bit(R5_LOCKED, &sq->dev[i].flags);
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
 	}
@@ -3135,7 +3244,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 {
-	if (sh->raid_conf->level == 6)
+	if (sh->sq->raid_conf->level == 6)
 		handle_stripe6(sh, tmp_page);
 	else
 		handle_stripe5(sh);
@@ -3465,7 +3574,6 @@ static int chunk_aligned_read(request_queue_t *q, struct bio * raid_bio)
 	}
 }
 
-
 static int make_request(request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
@@ -3681,19 +3789,22 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 		 */
 		for (j=sh->disks; j--;) {
 			sector_t s;
-			if (j == sh->pd_idx)
+			int pd_idx = sh->sq->pd_idx;
+
+			if (j == pd_idx)
 				continue;
 			if (conf->level == 6 &&
-			    j == raid6_next_disk(sh->pd_idx, sh->disks))
+			    j == raid6_next_disk(pd_idx, sh->disks))
 				continue;
-			s = compute_blocknr(sh, j);
+			s = compute_blocknr(conf, sh->disks, sh->sector,
+					pd_idx, j);
 			if (s < (mddev->array_size<<1)) {
 				skipped = 1;
 				continue;
 			}
 			memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
-			set_bit(R5_Expanded, &sh->dev[j].flags);
-			set_bit(R5_UPTODATE, &sh->dev[j].flags);
+			set_bit(R5_Expanded, &sh->sq->dev[j].flags);
+			set_bit(R5_UPTODATE, &sh->sq->dev[j].flags);
 		}
 		if (!skipped) {
 			set_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3738,6 +3849,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int pd_idx;
 	int raid_disks = conf->raid_disks;
 	sector_t max_sector = mddev->size << 1;
@@ -3794,6 +3906,8 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 		 */
 		schedule_timeout_uninterruptible(1);
 	}
+	sq = sh->sq;
+
 	/* Need to check if array will still be degraded after recovery/resync
 	 * We don't need to check the 'failed' flag as when that gets set,
 	 * recovery aborts.
@@ -3804,10 +3918,10 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
 
-	spin_lock(&sh->lock);
+	spin_lock(&sq->lock);
 	set_bit(STRIPE_SYNCING, &sh->state);
 	clear_bit(STRIPE_INSYNC, &sh->state);
-	spin_unlock(&sh->lock);
+	spin_unlock(&sq->lock);
 
 	handle_stripe(sh, NULL);
 	release_stripe(sh);
@@ -3828,6 +3942,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int dd_idx, pd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
@@ -3861,7 +3976,8 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 			return handled;
 		}
 
-		set_bit(R5_ReadError, &sh->dev[dd_idx].flags);
+		sq = sh->sq;
+		set_bit(R5_ReadError, &sq->dev[dd_idx].flags);
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
 			raid_bio->bi_hw_segments = scnt;
@@ -4327,15 +4443,16 @@ static int stop(mddev_t *mddev)
 static void print_sh (struct seq_file *seq, struct stripe_head *sh)
 {
 	int i;
+	struct stripe_queue *sq = sh->sq;
 
 	seq_printf(seq, "sh %llu, pd_idx %d, state %ld.\n",
-		   (unsigned long long)sh->sector, sh->pd_idx, sh->state);
+		   (unsigned long long)sh->sector, sq->pd_idx, sh->state);
 	seq_printf(seq, "sh %llu,  count %d.\n",
 		   (unsigned long long)sh->sector, atomic_read(&sh->count));
 	seq_printf(seq, "sh %llu, ", (unsigned long long)sh->sector);
 	for (i = 0; i < sh->disks; i++) {
 		seq_printf(seq, "(cache%d: %p %ld) ",
-			   i, sh->dev[i].page, sh->dev[i].flags);
+			   i, sh->dev[i].page, sq->dev[i].flags);
 	}
 	seq_printf(seq, "\n");
 }
@@ -4349,7 +4466,7 @@ static void printall (struct seq_file *seq, raid5_conf_t *conf)
 	spin_lock_irq(&conf->device_lock);
 	for (i = 0; i < NR_HASH; i++) {
 		hlist_for_each_entry(sh, hn, &conf->stripe_hashtbl[i], hash) {
-			if (sh->raid_conf != conf)
+			if (sh->sq->raid_conf != conf)
 				continue;
 			print_sh(seq, sh);
 		}
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 93678f5..33e26e0 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -158,16 +158,13 @@
  *    the compute block completes.
  */
 
+struct stripe_queue;
 struct stripe_head {
 	struct hlist_node	hash;
 	struct list_head	lru;			/* inactive_list or handle_list */
-	struct raid5_private_data	*raid_conf;
 	sector_t		sector;			/* sector of this row */
-	int			pd_idx;			/* parity disk index */
 	unsigned long		state;			/* state flags */
 	atomic_t		count;			/* nr of active thread/requests */
-	spinlock_t		lock;
-	int			bm_seq;	/* sequence number for bitmap flushes */
 	int			disks;			/* disks in stripe */
 	/* stripe_operations
 	 * @pending - pending ops flags (set for request->issue->complete)
@@ -184,13 +181,12 @@ struct stripe_head {
 		int		   count;
 		u32		   zero_sum_result;
 	} ops;
+	struct stripe_queue *sq;
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
 		struct page	*page;
-		struct bio	*toread, *read, *towrite, *written;
-		sector_t	sector;			/* sector of this page */
-		unsigned long	flags;
+		struct bio	*read, *written;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
 
@@ -209,6 +205,19 @@ struct r6_state {
 	int p_failed, q_failed, qd_idx, failed_num[2];
 };
 
+struct stripe_queue {
+	sector_t sector;
+	int pd_idx; /* parity disk index */
+	int bm_seq; /* sequence number for bitmap flushes */
+	spinlock_t lock;
+	struct raid5_private_data *raid_conf;
+	struct r5_queue_dev {
+		sector_t sector; /* hw starting sector for this block */
+		struct bio *toread, *towrite;
+		unsigned long flags;
+	} dev[1];
+};
+
 /* Flags */
 #define	R5_UPTODATE	0	/* page contains current data */
 #define	R5_LOCKED	1	/* IO has been submitted on "req" */
@@ -328,8 +337,10 @@ struct raid5_private_data {
 	 * two caches.
 	 */
 	int			active_name;
-	char			cache_name[2][20];
-	struct kmem_cache		*slab_cache; /* for allocating stripes */
+	char			sh_cache_name[2][20];
+	char			sq_cache_name[2][20];
+	struct kmem_cache	*sh_slab_cache;
+	struct kmem_cache	*sq_slab_cache;
 
 	int			seq_flush, seq_write;
 	int			quiesce;

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC PATCH 2/2] raid5: use stripe_queues to prioritize the "most deserving" requests, take2
  2007-07-03 23:20 [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2 Dan Williams
  2007-07-03 23:20 ` [RFC PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests Dan Williams
@ 2007-07-03 23:20 ` Dan Williams
  2007-07-04 11:41 ` [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2 Andi Kleen
  2 siblings, 0 replies; 5+ messages in thread
From: Dan Williams @ 2007-07-03 23:20 UTC (permalink / raw)
  To: neilb; +Cc: raziebe, akpm, davidsen, linux-kernel, linux-raid

Overview:
Taking advantage of the stripe_queue/stripe_head separation, this patch
implements a queue in front of the stripe cache.  A stripe_queue pool
accepts incoming requests.  As requests are attached, the weight of the
queue object is updated.  A workqueue is introduced to control the flow of
requests to the stripe cache.  Pressure (weight of the queue object) can
push requests to be processed by the cache (raid5d).  raid5d also pulls
requests when its 'handle' list is empty.

The workqueue, raid5qd, prioritizes reads and full-stripe-writes, as there
is no performance to be gained by delaying them.  Sub-stripe-width writes
are handled by the existing PREREAD_ACTIVE infrastructure, but can now be
passed by full-stripe-writes on their way to the cache.  Previously there
was no opportunity to make this decision, sub-width-writes would occupy a
stripe cache entry from the time they entered the delayed list until they
finished processing.
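
To make the ordering concrete, the selection step in raid5qd reduces to
something like the following sketch (the list names match the patch, but
the helper itself is illustrative and all locking is omitted):

	/* pick the "most deserving" queue: full-stripe-writes first,
	 * then reads, then sub-width writes; delayed queues are only
	 * promoted once the device is unplugged
	 */
	static struct stripe_queue *next_queue(raid5_conf_t *conf)
	{
		struct list_head *prio[] = {
			&conf->stripe_overwrite_list,
			&conf->unaligned_read_list,
			&conf->subwidth_write_list,
		};
		int i;

		for (i = 0; i < ARRAY_SIZE(prio); i++)
			if (!list_empty(prio[i]))
				return list_entry(prio[i]->next,
						  struct stripe_queue,
						  list_node);
		return NULL;
	}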

Flow:
1/ make_request calls get_active_queue, add_queue_bio, and handle_queue
   (sketched after this list)
2/ handle_queue checks whether this stripe_queue is already attached to a
   stripe_head; if so, the queue is bypassed and the stripe is handled
   immediately.  Otherwise, handle_queue examines the incoming requests
   and flags the queue as overwrite, read, sub-width-write, or delayed (a
   sketch of this classification follows the Details list).
3/ __release_queue is called and, depending on the determination made by
   handle_queue, the stripe_queue is placed on one of four lists.  Then
   raid5qd is woken up.
4/ raid5qd runs and attaches stripe_queues to stripe_heads in priority
   order (full-stripe-writes, reads, sub-width-writes).  If the raid device is
   not plugged and there is nothing else to do it will transition delayed
   queues to the sub-width-write list.  Since there are more stripe_queues in
   the system than stripe_heads we will end up sleeping in get_active_stripe.
   While we sleep, new requests can still enter the queue and hopefully
   promote sub-width-writes to full-stripe-writes.
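
A rough sketch of steps 1 and 2 (only get_active_queue's signature is
taken verbatim from the patch; the other argument lists are abbreviated
and error/overlap handling is dropped):

	sq = get_active_queue(conf, new_sector, disks, pd_idx, 0, &sh);
	add_queue_bio(sq, bi, dd_idx, rw);
	if (sh) {
		/* already attached to a stripe_head: bypass the queue
		 * and handle the stripe immediately
		 */
		handle_stripe(sh, NULL);
		release_stripe(sh);
	} else
		handle_queue(sq, disks, data_disks);
	release_queue(sq);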

Details:
* the number of stripe_queue objects in the pool is set at 2x the maximum
  number of stripes in the stripe_cache (STRIPE_QUEUE_SIZE).
* stripe_queues are tracked in a red-black-tree
* a stripe_queue is considered active while it has STRIPE_QUEUE_HANDLE
  set, or it is attached to a stripe_head
* once a stripe_queue is activated it is not placed on the inactive list
  until it has been serviced by the stripe cache
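
The classification in step 2 can be pictured via the io_weight bitmaps.
A minimal sketch, assuming a queue counts as a full-stripe-write exactly
when every data block is marked in ->overwrite (the helper name is made
up; the delayed case is left to the existing preread logic):

	static void classify_queue(struct stripe_queue *sq, int disks,
				   int data_disks)
	{
		/* priority order mirrors __release_queue:
		 * overwrite, then read, then sub-width write
		 */
		if (io_weight(sq->overwrite, disks) == data_disks)
			set_bit(STRIPE_QUEUE_OVERWRITE, &sq->state);
		else if (io_weight(sq->to_read, disks))
			set_bit(STRIPE_QUEUE_READ, &sq->state);
		else if (io_weight(sq->to_write, disks))
			set_bit(STRIPE_QUEUE_WRITE, &sq->state);
	}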

Changes in take2:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename queue_weight -> io_weight
* fix r5_io_weight_size
* implement support for sysfs changes to stripe_cache_size
* delete and re-add stripe queues from their management lists rather than
  moving them.  This guarantees that when the count is non-zero the
  queue is not on a list (identical to stripe_head handling)
* __wait_for_inactive_queue was incorrectly using conf->inactive_blocked,
  which is exclusively for the stripe_cache.  Added
  conf->inactive_queue_blocked and set the routine to wait until the number
  of active queues drops below 7/8 of the total before unblocking
  processing.  The 7/8 figure arises from the following: get_active_stripe
  waits for 3/4 of the stripe cache to be active, i.e. 1/4 inactive, and
  conf->max_nr_stripes / 4 == conf->max_nr_stripes * STRIPE_QUEUE_SIZE / 8
  iff STRIPE_QUEUE_SIZE == 2 (worked through after this list).
* change raid5_congested to report whether the queue is congested and not
  the cache.
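
Worked through with the default stripe cache size (taking 256 stripes,
the driver default, as an example):

	total queues            = 256 * STRIPE_QUEUE_SIZE = 512
	queue unblock threshold = 512 * 7/8 = 448 active, 64 inactive
	cache unblock threshold = 256 * 3/4 = 192 active, 64 inactive

i.e. the queue pool and the stripe cache both unblock once the same
number of objects (64) go inactive.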

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         | 1053 +++++++++++++++++++++++++++++++++-----------
 include/linux/raid/raid5.h |   63 ++-
 2 files changed, 855 insertions(+), 261 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a4723c7..e2062de 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -65,6 +65,7 @@
 #define	IO_THRESHOLD		1
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
+#define STRIPE_QUEUE_SIZE 2 /* multiple of nr_stripes */
 
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
@@ -78,6 +79,8 @@
  * of the current stripe+device
  */
 #define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL)
+#define r5_io_weight_size(devs) (sizeof(unsigned long) * \
+				  (ALIGN(devs, BITS_PER_LONG) / BITS_PER_LONG))
 /*
  * The following can be used to debug the driver
  */
@@ -122,41 +125,105 @@ static void return_io(struct bio *return_bi)
 
 static void print_raid5_conf (raid5_conf_t *conf);
 
+/* __release_queue - route the stripe_queue based on pending i/o's.  The
+ * queue object is allowed to bounce around between 4 lists up until
+ * it is attached to a stripe_head.  The lists in order of priority are:
+ * 1/ overwrite: all data blocks are set to be overwritten, no prereads
+ * 2/ unaligned_read: read requests that get past chunk_aligned_read
+ * 3/ subwidth_write: write requests that require prereading
+ * 4/ delayed_q: write requests pending activation
+ */
+static struct stripe_queue init_sq; /* sq for newborn stripe_heads */
+static struct stripe_head init_sh; /* sh for newborn stripe_queues */
+static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
+{
+	if (atomic_dec_and_test(&sq->count)) {
+		BUG_ON(!list_empty(&sq->list_node));
+		BUG_ON(atomic_read(&conf->active_queues) == 0);
+		if (test_bit(STRIPE_QUEUE_HANDLE, &sq->state)) {
+			int queue = 1;
+			if (test_bit(STRIPE_QUEUE_OVERWRITE, &sq->state))
+				list_add_tail(&sq->list_node,
+					       &conf->stripe_overwrite_list);
+			else if (test_bit(STRIPE_QUEUE_READ, &sq->state))
+				list_add_tail(&sq->list_node,
+					       &conf->unaligned_read_list);
+			else if (test_bit(STRIPE_QUEUE_WRITE, &sq->state))
+				list_add_tail(&sq->list_node,
+					       &conf->subwidth_write_list);
+			else if (test_bit(STRIPE_QUEUE_DELAYED, &sq->state)) {
+				list_add_tail(&sq->list_node,
+					       &conf->delayed_q_list);
+				blk_plug_device(conf->mddev->queue);
+				queue = 0;
+			}
+			if (queue)
+				queue_work(conf->workqueue,
+					   &conf->stripe_queue_work);
+		} else {
+			atomic_dec(&conf->active_queues);
+			if (test_and_clear_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+						&sq->state)) {
+				atomic_dec(&conf->preread_active_queues);
+				if (atomic_read(&conf->preread_active_queues) <
+					IO_THRESHOLD)
+					queue_work(conf->workqueue,
+						   &conf->stripe_queue_work);
+			}
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
+				BUG_ON(sq->sh == NULL);
+				sq->sh = NULL;
+				list_add_tail(&sq->list_node,
+					      &conf->inactive_queue_list);
+				wake_up(&conf->wait_for_queue);
+				if (conf->retry_read_aligned)
+					md_wakeup_thread(conf->mddev->thread);
+			}
+		}
+	}
+}
+
 static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 {
+	struct stripe_queue *sq = sh->sq;
+
 	if (atomic_dec_and_test(&sh->count)) {
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->delayed_list);
-				blk_plug_device(conf->mddev->queue);
-			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				   sh->sq->bm_seq - conf->seq_write > 0) {
+			if (test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state) &&
+				   sq->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 				blk_plug_device(conf->mddev->queue);
 			} else {
-				clear_bit(STRIPE_BIT_DELAY, &sh->state);
+				clear_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state);
 				list_add_tail(&sh->lru, &conf->handle_list);
 			}
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
 			BUG_ON(sh->ops.pending);
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-				atomic_dec(&conf->preread_active_stripes);
-				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			}
 			atomic_dec(&conf->active_stripes);
-			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
+				BUG_ON(sh->sq == NULL);
+				sh->sq = NULL;
+				__release_queue(conf, sq);
 				list_add_tail(&sh->lru, &conf->inactive_list);
 				wake_up(&conf->wait_for_stripe);
-				if (conf->retry_read_aligned)
-					md_wakeup_thread(conf->mddev->thread);
 			}
 		}
 	}
 }
+
+static void release_queue(struct stripe_queue *sq)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	unsigned long flags;
+
+	spin_lock_irqsave(&conf->device_lock, flags);
+	__release_queue(conf, sq);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
+}
+
 static void release_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->sq->raid_conf;
@@ -205,6 +272,23 @@ out:
 	return sh;
 }
 
+static struct stripe_queue *get_free_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq = NULL;
+	struct list_head *first;
+
+	CHECK_DEVLOCK();
+	if (list_empty(&conf->inactive_queue_list))
+		goto out;
+	first = conf->inactive_queue_list.next;
+	sq = list_entry(first, struct stripe_queue, list_node);
+	list_del_init(first);
+	rb_erase(&sq->rb_node, &conf->stripe_queue_tree);
+	atomic_inc(&conf->active_queues);
+out:
+	return sq;
+}
+
 static void shrink_buffers(struct stripe_head *sh, int num)
 {
 	struct page *p;
@@ -236,40 +320,72 @@ static int grow_buffers(struct stripe_head *sh, int num)
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
+#if BITS_PER_LONG == 32
+#define hweight hweight32
+#else
+#define hweight hweight64
+#endif
+static unsigned long io_weight(unsigned long *bitmap, int disks)
 {
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	unsigned long weight = hweight(*bitmap);
+
+	for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++)
+		weight += hweight(*bitmap);
+
+	return weight;
+}
+static void __zero_io_weight(unsigned long *bitmap, int disks)
+{
+	*bitmap = 0;
+	for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++)
+		*bitmap = 0;
+}
+
+static void
+attach_queue_to_stripe_head(struct stripe_head *sh, struct stripe_queue *sq)
+{
+	BUG_ON(sh->sq);
+	sh->sq = sq;
+	list_del_init(&sq->list_node);
+	clear_bit(STRIPE_QUEUE_HANDLE, &sq->state);
+	atomic_inc(&sq->count);
+}
+
+static void
+init_stripe(struct stripe_head *sh, struct stripe_queue *sq, int disks)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	sector_t sector = sq->sector;
 	int i;
 
+	pr_debug("init_stripe called, stripe %llu\n",
+		(unsigned long long)sector);
+
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
 	BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+	BUG_ON(sq->sh);
+	sq->sh = sh;
+	attach_queue_to_stripe_head(sh, sq);
 
 	CHECK_DEVLOCK();
-	pr_debug("init_stripe called, stripe %llu\n",
-		(unsigned long long)sh->sector);
 
 	remove_hash(sh);
 
 	sh->sector = sector;
-	sh->sq->pd_idx = pd_idx;
 	sh->state = 0;
 
 	sh->disks = disks;
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
-		struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
-		if (dev_q->toread || dev->read || dev_q->towrite ||
-			dev->written || test_bit(R5_LOCKED, &dev_q->flags)) {
-			printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n",
-			       (unsigned long long)sh->sector, i, dev_q->toread,
-			       dev->read, dev_q->towrite, dev->written,
-			       test_bit(R5_LOCKED, &dev_q->flags));
+		if (dev->read || dev->written) {
+			printk(KERN_ERR "sector=%llx i=%d %p %p\n",
+			       (unsigned long long)sector, i, dev->read,
+			       dev->written);
 			BUG();
 		}
-		dev_q->flags = 0;
 		raid5_build_block(sh, i);
 	}
 	insert_hash(conf, sh);
@@ -289,26 +405,216 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 	return NULL;
 }
 
+static struct stripe_queue *__find_queue(raid5_conf_t *conf, sector_t sector)
+{
+	struct rb_node *n = conf->stripe_queue_tree.rb_node;
+	struct stripe_queue *sq;
+
+	pr_debug("%s, sector %llu\n", __FUNCTION__, (unsigned long long)sector);
+	while (n) {
+		sq = rb_entry(n, struct stripe_queue, rb_node);
+
+		if (sector < sq->sector)
+			n = n->rb_left;
+		else if (sector > sq->sector)
+			n = n->rb_right;
+		else
+			return sq;
+	}
+	pr_debug("__queue %llu not in tree\n", (unsigned long long)sector);
+	return NULL;
+}
+
+static struct stripe_queue *
+__insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node)
+{
+	struct rb_node **p = &conf->stripe_queue_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct stripe_queue *sq;
+
+	while (*p) {
+		parent = *p;
+		sq = rb_entry(parent, struct stripe_queue, rb_node);
+
+		if (sector < sq->sector)
+			p = &(*p)->rb_left;
+		else if (sector > sq->sector)
+			p = &(*p)->rb_right;
+		else
+			return sq;
+	}
+
+	rb_link_node(node, parent, p);
+
+	return NULL;
+}
+
+static inline struct stripe_queue *
+insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node)
+{
+	struct stripe_queue *sq = __insert_active_sq(conf, sector, node);
+
+	if (sq)
+		goto out;
+	rb_insert_color(node, &conf->stripe_queue_tree);
+ out:
+	return sq;
+}
+
 static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks,
 	sector_t sector, int pd_idx, int i);
 
+static void init_queue(struct stripe_queue *sq, sector_t sector,
+		int disks, int pd_idx)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+	int i;
+
+	pr_debug("%s: %llu -> %llu [%p]\n",
+		__FUNCTION__, (unsigned long long) sq->sector,
+		(unsigned long long) sector, sq);
+
+	BUG_ON(atomic_read(&sq->count) != 0);
+	BUG_ON(sq->sh != NULL);
+	BUG_ON(io_weight(sq->to_read, disks));
+	BUG_ON(io_weight(sq->to_write, disks));
+	BUG_ON(test_bit(STRIPE_QUEUE_HANDLE, &sq->state));
+	__zero_io_weight(sq->overwrite, disks);
+
+	sq->state = (1 << STRIPE_QUEUE_HANDLE);
+	sq->sector = sector;
+	sq->pd_idx = pd_idx;
+
+	for (i = disks; i--; ) {
+		struct r5_queue_dev *dev_q = &sq->dev[i];
+
+		if (dev_q->toread || dev_q->towrite ||
+			test_bit(R5_LOCKED, &dev_q->flags)) {
+			printk(KERN_ERR "sector=%llx i=%d %p %p %d\n",
+			       (unsigned long long)sector, i, dev_q->toread,
+			       dev_q->towrite,
+			       test_bit(R5_LOCKED, &dev_q->flags));
+			BUG();
+		}
+		dev_q->flags = 0;
+		dev_q->sector = compute_blocknr(conf, disks, sector, pd_idx, i);
+	}
+
+	sq = insert_active_sq(conf, sector, &sq->rb_node);
+	if (unlikely(sq)) {
+		printk(KERN_ERR "%s: sq: %p sector: %llu bounced off the "
+			"stripe_queue rb_tree\n", __FUNCTION__, sq,
+			(unsigned long long) sq->sector);
+		BUG();
+	}
+}
+
+static void __wait_for_inactive_queue(raid5_conf_t *conf)
+{
+	conf->inactive_queue_blocked = 1;
+	wait_event_lock_irq(conf->wait_for_queue,
+			    !list_empty(&conf->inactive_queue_list) &&
+			    (atomic_read(&conf->active_queues)
+			     < conf->max_nr_stripes *
+			     STRIPE_QUEUE_SIZE * 7/8 ||
+			    !conf->inactive_queue_blocked),
+			    conf->device_lock,
+			    /* nothing */
+		);
+	conf->inactive_queue_blocked = 0;
+}
+
+static void
+pickup_cached_stripe(struct stripe_head *sh, struct stripe_queue *sq,
+			int from_stripe_cache)
+{
+	raid5_conf_t *conf = sq->raid_conf;
+
+	if (atomic_read(&sh->count))
+		BUG_ON(!list_empty(&sh->lru));
+	else {
+		if (!test_bit(STRIPE_HANDLE, &sh->state)) {
+			atomic_inc(&conf->active_stripes);
+			attach_queue_to_stripe_head(sh, sq);
+			if (from_stripe_cache) {
+				BUG_ON(sq->sh);
+				sq->sh = sh;
+			} else if (unlikely(sq->sector != sh->sector))
+				BUG();
+		} else
+			BUG_ON(!sh->sq);
+		if (list_empty(&sh->lru) &&
+		    !test_bit(STRIPE_QUEUE_EXPANDING, &sq->state))
+			BUG();
+		list_del_init(&sh->lru);
+	}
+}
+
+static struct stripe_queue *
+get_active_queue(raid5_conf_t *conf, sector_t sector, int disks,
+		int pd_idx, int noblock, struct stripe_head **sh)
+{
+	struct stripe_queue *sq;
+
+	pr_debug("%s, sector %llu\n", __FUNCTION__,
+		(unsigned long long)sector);
+
+	spin_lock_irq(&conf->device_lock);
+
+	do {
+		wait_event_lock_irq(conf->wait_for_queue,
+				    conf->quiesce == 0,
+				    conf->device_lock, /* nothing */);
+		sq = __find_queue(conf, sector);
+		if (!sq) {
+			if (!conf->inactive_queue_blocked)
+				sq = get_free_queue(conf);
+			if (noblock && sq == NULL)
+				break;
+			if (!sq)
+				__wait_for_inactive_queue(conf);
+			else
+				init_queue(sq, sector, disks, pd_idx);
+		} else {
+			if (atomic_read(&sq->count)) {
+				BUG_ON(!list_empty(&sq->list_node));
+			} else if (!test_and_set_bit(STRIPE_QUEUE_HANDLE,
+						     &sq->state))
+				atomic_inc(&conf->active_queues);
+			list_del_init(&sq->list_node);
+		}
+	} while (sq == NULL);
+
+	if (sq)
+		atomic_inc(&sq->count);
+	if (sq && sq->sh) {
+		/* since we are bypassing get_active_stripe to get this
+		 * sh, we need to do some housekeeping
+		 */
+		pickup_cached_stripe(sq->sh, sq, 0);
+		atomic_inc(&sq->sh->count);
+		*sh = sq->sh;
+	} else
+		*sh = NULL;
+
+	spin_unlock_irq(&conf->device_lock);
+	return sq;
+}
+
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
-					     int pd_idx, int noblock)
+static struct stripe_head *
+get_active_stripe(raid5_conf_t *conf, struct stripe_queue *sq, int disks,
+			int noblock)
 {
 	struct stripe_head *sh;
 
-	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
+	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sq->sector);
 
 	spin_lock_irq(&conf->device_lock);
 
 	do {
-		wait_event_lock_irq(conf->wait_for_stripe,
-				    conf->quiesce == 0,
-				    conf->device_lock, /* nothing */);
-		sh = __find_stripe(conf, sector, disks);
+		sh = __find_stripe(conf, sq->sector, disks);
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
@@ -326,19 +632,9 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 					);
 				conf->inactive_blocked = 0;
 			} else
-				init_stripe(sh, sector, pd_idx, disks);
-		} else {
-			if (atomic_read(&sh->count)) {
-			  BUG_ON(!list_empty(&sh->lru));
-			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru) &&
-				    !test_bit(STRIPE_EXPANDING, &sh->state))
-					BUG();
-				list_del_init(&sh->lru);
-			}
-		}
+				init_stripe(sh, sq, disks);
+		} else
+			pickup_cached_stripe(sh, sq, 1);
 	} while (sh == NULL);
 
 	if (sh)
@@ -574,7 +870,8 @@ static void ops_complete_biofill(void *stripe_head_ref)
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	raid5_conf_t *conf = sh->sq->raid_conf;
+	struct stripe_queue *sq = sh->sq;
+	raid5_conf_t *conf = sq->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
@@ -588,6 +885,7 @@ static void ops_run_biofill(struct stripe_head *sh)
 			spin_lock_irq(&conf->device_lock);
 			dev->read = rbi = dev_q->toread;
 			dev_q->toread = NULL;
+			clear_bit(i, sq->to_read);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev_q->sector + STRIPE_SECTORS) {
@@ -736,6 +1034,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 			spin_lock(&sq->lock);
 			chosen = dev_q->towrite;
 			dev_q->towrite = NULL;
+			clear_bit(i, sq->to_write);
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
 			spin_unlock(&sq->lock);
@@ -937,29 +1236,14 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
-	struct stripe_queue *sq;
-
 	sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL);
 	if (!sh)
 		return 0;
-
-	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
-	if (!sq) {
-		kmem_cache_free(conf->sh_slab_cache, sh);
-		return 0;
-	}
-
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
-	memset(sq, 0, sizeof(*sq) +
-		(conf->raid_disks-1) * sizeof(struct r5_queue_dev));
-	sh->sq = sq;
-	sq->raid_conf = conf;
-	spin_lock_init(&sq->lock);
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
 		kmem_cache_free(conf->sh_slab_cache, sh);
-		kmem_cache_free(conf->sq_slab_cache, sq);
 		return 0;
 	}
 	sh->disks = conf->raid_disks;
@@ -967,16 +1251,55 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
+	sh->sq = &init_sq;
+	atomic_set(&init_sq.count, 2); /* 2, so it does not get picked up in
+					 * __release_queue
+					 */
 	spin_lock_irq(&conf->device_lock);
 	__release_stripe(conf, sh);
 	spin_unlock_irq(&conf->device_lock);
 	return 1;
 }
 
+static int grow_one_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
+	int disks = conf->raid_disks;
+	void *weight_map;
+	sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL);
+	if (!sq)
+		return 0;
+	memset(sq, 0, (sizeof(*sq)+(disks-1) * sizeof(struct r5_queue_dev)) +
+		r5_io_weight_size(disks) + r5_io_weight_size(disks) +
+		r5_io_weight_size(disks));
+
+	/* set the queue weight bitmaps to the free space at the end of sq */
+	weight_map = ((void *) sq) + offsetof(typeof(*sq), dev) +
+			sizeof(struct r5_queue_dev) * disks;
+	sq->to_read = weight_map;
+	weight_map += r5_io_weight_size(disks);
+	sq->to_write = weight_map;
+	weight_map += r5_io_weight_size(disks);
+	sq->overwrite = weight_map;
+
+	spin_lock_init(&sq->lock);
+	sq->sector = MaxSector;
+	sq->raid_conf = conf;
+	sq->sh = &init_sh;
+	/* we just created an active queue so... */
+	atomic_set(&sq->count, 1);
+	atomic_inc(&conf->active_queues);
+	INIT_LIST_HEAD(&sq->list_node);
+	RB_CLEAR_NODE(&sq->rb_node);
+	release_queue(sq);
+
+	return 1;
+}
+
 static int grow_stripes(raid5_conf_t *conf, int num)
 {
 	struct kmem_cache *sc;
-	int devs = conf->raid_disks;
+	int devs = conf->raid_disks, num_q = num * STRIPE_QUEUE_SIZE;
 
 	sprintf(conf->sh_cache_name[0], "raid5-%s", mdname(conf->mddev));
 	sprintf(conf->sh_cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
@@ -992,18 +1315,24 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 		return 1;
 	conf->sh_slab_cache = sc;
 	conf->pool_size = devs;
+	while (num--)
+		if (!grow_one_stripe(conf))
+			return 1;
 
 	sc = kmem_cache_create(conf->sq_cache_name[conf->active_name],
-		sizeof(struct stripe_queue) +
-		(devs-1)*sizeof(struct r5_queue_dev), 0, 0, NULL, NULL);
-
+			       (sizeof(struct stripe_queue)+(devs-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(devs) +
+				r5_io_weight_size(devs) +
+				r5_io_weight_size(devs), 0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
-	conf->sq_slab_cache = sc;
 
-	while (num--)
-		if (!grow_one_stripe(conf))
+	conf->sq_slab_cache = sc;
+	while (num_q--)
+		if (!grow_one_queue(conf))
 			return 1;
+
 	return 0;
 }
 
@@ -1034,11 +1363,13 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	 * so we use GFP_NOIO allocations.
 	 */
 	struct stripe_head *osh, *nsh;
+	struct stripe_queue *osq, *nsq;
 	LIST_HEAD(newstripes);
+	LIST_HEAD(newqueues);
 	struct disk_info *ndisks;
 	int err = 0;
 	struct kmem_cache *sc, *sc_q;
-	int i;
+	int i, j;
 
 	if (newsize <= conf->pool_size)
 		return 0; /* never bother to shrink */
@@ -1052,45 +1383,84 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	if (!sc)
 		return -ENOMEM;
 
-	sc_q = kmem_cache_create(conf->sh_cache_name[1-conf->active_name],
-		    sizeof(struct stripe_queue) +
-		    (newsize-1)*sizeof(struct r5_queue_dev), 0, 0, NULL, NULL);
+	sc_q = kmem_cache_create(conf->sq_cache_name[1-conf->active_name],
+			       (sizeof(struct stripe_queue)+(newsize-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize),
+				0, 0, NULL, NULL);
+
 	if (!sc_q) {
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
 	}
 
 	for (i = conf->max_nr_stripes; i; i--) {
-		struct stripe_queue *nsq;
+		struct stripe_queue *nsq_per_sh[STRIPE_QUEUE_SIZE];
 
 		nsh = kmem_cache_alloc(sc, GFP_KERNEL);
 		if (!nsh)
 			break;
 
-		nsq = kmem_cache_alloc(sc_q, GFP_KERNEL);
-		if (!nsq) {
+		/* allocate STRIPE_QUEUE_SIZE queues per stripe */
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++)
+			nsq_per_sh[j] = kmem_cache_alloc(sc_q, GFP_KERNEL);
+
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++)
+			if (!nsq_per_sh[j])
+				break;
+
+		if (j < ARRAY_SIZE(nsq_per_sh)) {
 			kmem_cache_free(sc, nsh);
+			do
+				if (nsq_per_sh[j])
+					kmem_cache_free(sc_q, nsq_per_sh[j]);
+			while (--j >= 0);
 			break;
 		}
 
 		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
-		memset(nsq, 0, sizeof(*nsq) +
-			(newsize-1)*sizeof(struct r5_queue_dev));
-
-		nsq->raid_conf = conf;
-		nsh->sq = nsq;
-		spin_lock_init(&nsq->lock);
-
 		list_add(&nsh->lru, &newstripes);
+
+		for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++) {
+			void *weight_map;
+			nsq = nsq_per_sh[j];
+			memset(nsq, 0, (sizeof(*nsq)+(newsize-1) *
+				sizeof(struct r5_queue_dev)) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize) +
+				r5_io_weight_size(newsize));
+			/* set the queue weight bitmaps to the free space at
+			 * the end of nsq
+			 */
+			weight_map = ((void *) nsq) +
+					offsetof(typeof(*nsq), dev) +
+					sizeof(struct r5_queue_dev) * newsize;
+			nsq->to_read = weight_map;
+			weight_map += r5_io_weight_size(newsize);
+			nsq->to_write = weight_map;
+			weight_map += r5_io_weight_size(newsize);
+			nsq->overwrite = weight_map;
+			nsq->raid_conf = conf;
+			spin_lock_init(&nsq->lock);
+			list_add(&nsq->list_node, &newqueues);
+		}
 	}
 	if (i) {
 		/* didn't get enough, give up */
 		while (!list_empty(&newstripes)) {
 			nsh = list_entry(newstripes.next, struct stripe_head, lru);
 			list_del(&nsh->lru);
-			kmem_cache_free(sc_q, nsh->sq);
 			kmem_cache_free(sc, nsh);
 		}
+		while (!list_empty(&newqueues)) {
+			nsq = list_entry(newqueues.next,
+					 struct stripe_queue,
+					 list_node);
+			list_del(&nsq->list_node);
+			kmem_cache_free(sc_q, nsq);
+		}
 		kmem_cache_destroy(sc_q);
 		kmem_cache_destroy(sc);
 		return -ENOMEM;
@@ -1099,6 +1469,19 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 	 * OK, we have enough stripes, start collecting inactive
 	 * stripes and copying them over
 	 */
+	list_for_each_entry(nsq, &newqueues, list_node) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_queue,
+				    !list_empty(&conf->inactive_queue_list),
+				    conf->device_lock,
+				    unplug_slaves(conf->mddev)
+			);
+		osq = get_free_queue(conf);
+		spin_unlock_irq(&conf->device_lock);
+		atomic_set(&nsq->count, 1);
+		kmem_cache_free(conf->sq_slab_cache, osq);
+	}
+
 	list_for_each_entry(nsh, &newstripes, lru) {
 		spin_lock_irq(&conf->device_lock);
 		wait_event_lock_irq(conf->wait_for_stripe,
@@ -1113,7 +1496,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 			nsh->dev[i].page = osh->dev[i].page;
 		for( ; i<newsize; i++)
 			nsh->dev[i].page = NULL;
-		kmem_cache_free(conf->sq_slab_cache, osh->sq);
 		kmem_cache_free(conf->sh_slab_cache, osh);
 	}
 	kmem_cache_destroy(conf->sh_slab_cache);
@@ -1134,6 +1516,13 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 		err = -ENOMEM;
 
 	/* Step 4, return new stripes to service */
+	while (!list_empty(&newqueues)) {
+		nsq = list_entry(newqueues.next, struct stripe_queue,
+					list_node);
+		list_del_init(&nsq->list_node);
+		release_queue(nsq);
+	}
+
 	while(!list_empty(&newstripes)) {
 		nsh = list_entry(newstripes.next, struct stripe_head, lru);
 		list_del_init(&nsh->lru);
@@ -1144,7 +1533,9 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
 				if (!p)
 					err = -ENOMEM;
 			}
-		release_stripe(nsh);
+		spin_lock_irq(&conf->device_lock);
+		__release_stripe(conf, nsh);
+		spin_unlock_irq(&conf->device_lock);
 	}
 	/* critical section pass, GFP_NOIO no longer needed */
 
@@ -1167,18 +1558,33 @@ static int drop_one_stripe(raid5_conf_t *conf)
 		return 0;
 	BUG_ON(atomic_read(&sh->count));
 	shrink_buffers(sh, conf->pool_size);
-	if (sh->sq)
-		kmem_cache_free(conf->sq_slab_cache, sh->sq);
 	kmem_cache_free(conf->sh_slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;
 }
 
+static int drop_one_queue(raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
+
+	spin_lock_irq(&conf->device_lock);
+	sq = get_free_queue(conf);
+	spin_unlock_irq(&conf->device_lock);
+	if (!sq)
+		return 0;
+	kmem_cache_free(conf->sq_slab_cache, sq);
+	atomic_dec(&conf->active_queues);
+	return 1;
+}
+
 static void shrink_stripes(raid5_conf_t *conf)
 {
 	while (drop_one_stripe(conf))
 		;
 
+	while (drop_one_queue(conf))
+		;
+
 	if (conf->sh_slab_cache)
 		kmem_cache_destroy(conf->sh_slab_cache);
 	conf->sh_slab_cache = NULL;
@@ -1305,7 +1711,6 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 static void raid5_build_block (struct stripe_head *sh, int i)
 {
 	struct r5dev *dev = &sh->dev[i];
-	struct r5_queue_dev *dev_q = &sh->sq->dev[i];
 
 	bio_init(&dev->req);
 	dev->req.bi_io_vec = &dev->vec;
@@ -1317,10 +1722,6 @@ static void raid5_build_block (struct stripe_head *sh, int i)
 
 	dev->req.bi_sector = sh->sector;
 	dev->req.bi_private = sh;
-
-	dev_q->flags = 0;
-	dev_q->sector = compute_blocknr(sh->sq->raid_conf, sh->disks,
-			sh->sector, sh->sq->pd_idx, i);
 }
 
 static void error(mddev_t *mddev, mdk_rdev_t *rdev)
@@ -1615,6 +2016,7 @@ static void compute_parity6(struct stripe_head *sh, int method)
 			if (i != pd_idx && i != qd_idx && sq->dev[i].towrite) {
 				chosen = sq->dev[i].towrite;
 				sq->dev[i].towrite = NULL;
+				clear_bit(i, sq->to_write);
 
 				if (test_and_clear_bit(R5_Overlap,
 							&sq->dev[i].flags))
@@ -1861,23 +2263,24 @@ handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx,
+			  int forwrite)
 {
 	struct bio **bip;
-	struct stripe_queue *sq = sh->sq;
 	raid5_conf_t *conf = sq->raid_conf;
 	int firstwrite=0;
+	struct stripe_head *sh;
 
-	pr_debug("adding bh b#%llu to stripe s#%llu\n",
+	pr_debug("adding bio (%llu) to queue (%llu)\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector);
-
+		(unsigned long long)sq->sector);
 
 	spin_lock(&sq->lock);
 	spin_lock_irq(&conf->device_lock);
+	sh = sq->sh;
 	if (forwrite) {
 		bip = &sq->dev[dd_idx].towrite;
-		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
+		if (*bip == NULL && (!sh || !sh->dev[dd_idx].written))
 			firstwrite = 1;
 	} else
 		bip = &sq->dev[dd_idx].toread;
@@ -1899,13 +2302,13 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)bi->bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+		(unsigned long long)sq->sector, dd_idx);
 
 	if (conf->mddev->bitmap && firstwrite) {
-		bitmap_startwrite(conf->mddev->bitmap, sh->sector,
+		bitmap_startwrite(conf->mddev->bitmap, sq->sector,
 				  STRIPE_SECTORS, 0);
 		sq->bm_seq = conf->seq_flush+1;
-		set_bit(STRIPE_BIT_DELAY, &sh->state);
+		set_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state);
 	}
 
 	if (forwrite) {
@@ -1918,9 +2321,14 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 			if (bi->bi_sector + (bi->bi_size>>9) >= sector)
 				sector = bi->bi_sector + (bi->bi_size>>9);
 		}
-		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS)
+		if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS) {
 			set_bit(R5_OVERWRITE, &sq->dev[dd_idx].flags);
-	}
+			set_bit(dd_idx, sq->overwrite);
+		} else
+			set_bit(dd_idx, sq->to_write);
+	} else
+		set_bit(dd_idx, sq->to_read);
+
 	return 1;
 
  overlap:
@@ -1977,6 +2385,7 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		/* fail all writes first */
 		bi = sq->dev[i].towrite;
 		sq->dev[i].towrite = NULL;
+		clear_bit(i, sq->to_write);
 		if (bi) {
 			s->to_write--;
 			bitmap_end = 1;
@@ -2020,6 +2429,7 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh,
 		      test_bit(R5_ReadError, &sq->dev[i].flags))) {
 			bi = sq->dev[i].toread;
 			sq->dev[i].toread = NULL;
+			clear_bit(i, sq->to_read);
 			if (test_and_clear_bit(R5_Overlap, &sq->dev[i].flags))
 				wake_up(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
@@ -2287,20 +2697,13 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 			    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
 			    test_bit(R5_Wantcompute, &dev_q->flags)) &&
 			    test_bit(R5_Insync, &dev_q->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old block "
-						"%d for r-m-w\n", i);
-					set_bit(R5_LOCKED, &dev_q->flags);
-					set_bit(R5_Wantread, &dev_q->flags);
-					if (!test_and_set_bit(
-						STRIPE_OP_IO, &sh->ops.pending))
-						sh->ops.count++;
-					s->locked++;
-				} else {
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old block %d for r-m-w\n", i);
+				set_bit(R5_LOCKED, &dev_q->flags);
+				set_bit(R5_Wantread, &dev_q->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO,
+				    &sh->ops.pending))
+					sh->ops.count++;
+				s->locked++;
 			}
 		}
 	if (rcw <= rmw && rcw > 0)
@@ -2313,20 +2716,14 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 			    !(test_bit(R5_UPTODATE, &dev_q->flags) ||
 			    test_bit(R5_Wantcompute, &dev_q->flags)) &&
 			    test_bit(R5_Insync, &dev_q->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old block "
-						"%d for Reconstruct\n", i);
-					set_bit(R5_LOCKED, &dev_q->flags);
-					set_bit(R5_Wantread, &dev_q->flags);
-					if (!test_and_set_bit(
-						STRIPE_OP_IO, &sh->ops.pending))
-						sh->ops.count++;
-					s->locked++;
-				} else {
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old block "
+					 "%d for Reconstruct\n", i);
+				set_bit(R5_LOCKED, &dev_q->flags);
+				set_bit(R5_Wantread, &dev_q->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO,
+				    &sh->ops.pending))
+					sh->ops.count++;
+				s->locked++;
 			}
 		}
 	/* now if nothing is locked, and if we have enough data,
@@ -2342,7 +2739,7 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf,
 	if ((s->req_compute ||
 	    !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) &&
 		(s->locked == 0 && (rcw == 0 || rmw == 0) &&
-		!test_bit(STRIPE_BIT_DELAY, &sh->state)))
+		!test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state)))
 		s->locked += handle_write_operations5(sh, rcw == 0, 0);
 }
 
@@ -2383,28 +2780,19 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 			    && !test_bit(R5_LOCKED, &dev_q->flags) &&
 			    !test_bit(R5_UPTODATE, &dev_q->flags) &&
 			    test_bit(R5_Insync, &dev_q->flags)) {
-				if (
-				  test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-					pr_debug("Read_old stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(R5_LOCKED, &dev_q->flags);
-					set_bit(R5_Wantread, &dev_q->flags);
-					s->locked++;
-				} else {
-					pr_debug("Request delayed stripe %llu "
-						"block %d for Reconstruct\n",
-					     (unsigned long long)sh->sector, i);
-					set_bit(STRIPE_DELAYED, &sh->state);
-					set_bit(STRIPE_HANDLE, &sh->state);
-				}
+				pr_debug("Read_old stripe %llu "
+					"block %d for Reconstruct\n",
+				     (unsigned long long)sh->sector, i);
+				set_bit(R5_LOCKED, &dev_q->flags);
+				set_bit(R5_Wantread, &dev_q->flags);
+				s->locked++;
 			}
 		}
 	/* now if nothing is locked, and if we have enough data, we can start a
 	 * write request
 	 */
 	if (s->locked == 0 && rcw == 0 &&
-	    !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
+	    !test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state)) {
 		if (must_compute > 0) {
 			/* We have failed blocks and need to compute them */
 			switch (s->failed) {
@@ -2435,13 +2823,6 @@ static void handle_issuing_new_write_requests6(raid5_conf_t *conf,
 			}
 		/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
 		set_bit(STRIPE_INSYNC, &sh->state);
-
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-				IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
 	}
 }
 
@@ -2641,6 +3022,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 		if (i != sq->pd_idx && (r6s && i != r6s->qd_idx)) {
 			int dd_idx, pd_idx, j;
 			struct stripe_head *sh2;
+			struct stripe_queue *sq2;
 
 			sector_t bn = compute_blocknr(conf, sh->disks,
 						sh->sector, sq->pd_idx, i);
@@ -2648,18 +3030,27 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 						conf->raid_disks -
 						conf->max_degraded, &dd_idx,
 						&pd_idx, conf);
-			sh2 = get_active_stripe(conf, s, conf->raid_disks,
-						pd_idx, 1);
-			if (sh2 == NULL)
+			sq2 = get_active_queue(conf, s, conf->raid_disks,
+						pd_idx, 1, &sh2);
+			if (sq2 == NULL)
+				continue;
+
+			if (!sh2)
+				sh2 = get_active_stripe(conf, sq2,
+							conf->raid_disks, 1);
+			if (sh2 == NULL) {
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
 				 * get requested, we will try again
 				 */
+				release_queue(sq2);
 				continue;
-			if (!test_bit(STRIPE_EXPANDING, &sh2->state) ||
-			  test_bit(R5_Expanded, &sh2->sq->dev[dd_idx].flags)) {
+			}
+			if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq2->state) ||
+			    test_bit(R5_Expanded, &sq2->dev[dd_idx].flags)) {
 				/* must have already done this block */
 				release_stripe(sh2);
+				release_queue(sq2);
 				continue;
 			}
 
@@ -2668,19 +3059,20 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 				sh->dev[i].page, 0, 0, STRIPE_SIZE,
 				ASYNC_TX_DEP_ACK, tx, NULL, NULL);
 
-			set_bit(R5_Expanded, &sh2->sq->dev[dd_idx].flags);
-			set_bit(R5_UPTODATE, &sh2->sq->dev[dd_idx].flags);
+			set_bit(R5_Expanded, &sq2->dev[dd_idx].flags);
+			set_bit(R5_UPTODATE, &sq2->dev[dd_idx].flags);
 			for (j = 0; j < conf->raid_disks; j++)
 				if (j != sh2->sq->pd_idx &&
 				    (r6s && j != r6s->qd_idx) &&
 				    !test_bit(R5_Expanded,
-				     &sh2->sq->dev[j].flags))
+				     &sq2->dev[j].flags))
 					break;
 			if (j == conf->raid_disks) {
 				set_bit(STRIPE_EXPAND_READY, &sh2->state);
 				set_bit(STRIPE_HANDLE, &sh2->state);
 			}
 			release_stripe(sh2);
+			release_queue(sq2);
 
 			/* done submitting copies, wait for them to complete */
 			if (i + 1 >= sh->disks) {
@@ -2725,7 +3117,6 @@ static void handle_stripe5(struct stripe_head *sh)
 
 	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
-	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
@@ -2873,12 +3264,6 @@ static void handle_stripe5(struct stripe_head *sh)
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) <
-				IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
 	}
 
 	/* Now to consider new write requests and what else, if anything
@@ -2940,7 +3325,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
 		!test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
 
-		clear_bit(STRIPE_EXPANDING, &sh->state);
+		clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 
 		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
 		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
@@ -2953,7 +3338,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 	}
 
-	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) &&
 		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
@@ -3003,7 +3388,6 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	spin_lock(&sq->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
-	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
@@ -3028,6 +3412,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			spin_lock_irq(&conf->device_lock);
 			rbi = dev_q->toread;
 			dev_q->toread = NULL;
+			clear_bit(i, sq->to_read);
 			if (test_and_clear_bit(R5_Overlap, &dev_q->flags))
 				wake_up(&conf->wait_for_overlap);
 			spin_unlock_irq(&conf->device_lock);
@@ -3156,7 +3541,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			}
 		}
 
-	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+	if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) {
 		/* Need to write out all blocks after computing P&Q */
 		sh->disks = conf->raid_disks;
 		sq->pd_idx = stripe_to_pdidx(sh->sector, conf,
@@ -3167,7 +3552,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			s.locked++;
 			set_bit(R5_Wantwrite, &sq->dev[i].flags);
 		}
-		clear_bit(STRIPE_EXPANDING, &sh->state);
+		clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 	} else if (s.expanded) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
@@ -3250,20 +3635,58 @@ static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 		handle_stripe5(sh);
 }
 
-
+static void handle_queue(struct stripe_queue *sq, struct stripe_head *sh,
+			  int disks, int data_disks)
+{
+	/* once a stripe_queue is attached to the cache,
+	 * bypass the queue optimization logic
+	 */
+	if (sh) {
+		pr_debug("%s: start request to cached stripe %llu\n",
+			__FUNCTION__, (unsigned long long) sh->sector);
+		handle_stripe(sh, NULL);
+		release_stripe(sh);
+		return;
+	} else {
+		unsigned long overwrite = io_weight(sq->overwrite, disks);
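+		/* classify the queue so raid5qd attaches stripes in
+		 * priority order: full-stripe overwrites, then reads,
+		 * then sub-width writes
+		 */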
+		if (overwrite == data_disks)
+			set_bit(STRIPE_QUEUE_OVERWRITE, &sq->state);
+		else if (io_weight(sq->to_read, disks))
+			set_bit(STRIPE_QUEUE_READ, &sq->state);
+		else if (overwrite || io_weight(sq->to_write, disks)) {
+			if (!test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state))
+				set_bit(STRIPE_QUEUE_DELAYED, &sq->state);
+			else
+				set_bit(STRIPE_QUEUE_WRITE, &sq->state);
+		}
+		pr_debug("%s: update queue %llu "
+			 "state: %#lx r: %lu w: %lu o: %lu\n", __FUNCTION__,
+			 (unsigned long long) sq->sector, sq->state,
+			 io_weight(sq->to_read, disks),
+			 io_weight(sq->to_write, disks),
+			 io_weight(sq->overwrite, disks));
+	}
+}
 
 static void raid5_activate_delayed(raid5_conf_t *conf)
 {
-	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
-		while (!list_empty(&conf->delayed_list)) {
-			struct list_head *l = conf->delayed_list.next;
-			struct stripe_head *sh;
-			sh = list_entry(l, struct stripe_head, lru);
+	if (atomic_read(&conf->preread_active_queues) < IO_THRESHOLD) {
+		pr_debug("%s\n", __FUNCTION__);
+		while (!list_empty(&conf->delayed_q_list)) {
+			struct list_head *l = conf->delayed_q_list.next;
+			struct stripe_queue *sq;
+			sq = list_entry(l, struct stripe_queue, list_node);
 			list_del_init(l);
-			clear_bit(STRIPE_DELAYED, &sh->state);
-			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
-			list_add_tail(&sh->lru, &conf->handle_list);
+			clear_bit(STRIPE_QUEUE_DELAYED, &sq->state);
+			if (!test_and_set_bit(STRIPE_QUEUE_PREREAD_ACTIVE,
+						&sq->state))
+				atomic_inc(&conf->preread_active_queues);
+			list_add_tail(&sq->list_node,
+					&conf->subwidth_write_list);
 		}
 	}
 }
@@ -3318,6 +3737,7 @@ static void raid5_unplug_device(request_queue_t *q)
 		conf->seq_flush++;
 		raid5_activate_delayed(conf);
 	}
+
 	md_wakeup_thread(mddev->thread);
 
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -3361,13 +3781,13 @@ static int raid5_congested(void *data, int bits)
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 
 	/* No difference between reads and writes.  Just check
-	 * how busy the stripe_cache is
+	 * how busy the stripe_queue is
 	 */
-	if (conf->inactive_blocked)
+	if (conf->inactive_queue_blocked)
 		return 1;
 	if (conf->quiesce)
 		return 1;
-	if (list_empty_careful(&conf->inactive_list))
+	if (list_empty_careful(&conf->inactive_queue_list))
 		return 1;
 
 	return 0;
@@ -3559,7 +3979,7 @@ static int chunk_aligned_read(request_queue_t *q, struct bio * raid_bio)
 		}
 
 		spin_lock_irq(&conf->device_lock);
-		wait_event_lock_irq(conf->wait_for_stripe,
+		wait_event_lock_irq(conf->wait_for_queue,
 				    conf->quiesce == 0,
 				    conf->device_lock, /* nothing */);
 		atomic_inc(&conf->active_aligned_reads);
@@ -3582,6 +4002,7 @@ static int make_request(request_queue_t *q, struct bio * bi)
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	const int rw = bio_data_dir(bi);
 	int remaining;
 
@@ -3643,16 +4064,18 @@ static int make_request(request_queue_t *q, struct bio * bi)
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
-		if (sh) {
+		sq = get_active_queue(conf, new_sector, disks, pd_idx,
+					(bi->bi_rw & RWA_MASK), &sh);
+		if (sq) {
 			if (unlikely(conf->expand_progress != MaxSector)) {
 				/* expansion might have moved on while waiting for a
-				 * stripe, so we must do the range check again.
+				 * queue, so we must do the range check again.
 				 * Expansion could still move past after this
 				 * test, but as we are holding a reference to
-				 * 'sh', we know that if that happens,
-				 *  STRIPE_EXPANDING will get set and the expansion
-				 * won't proceed until we finish with the stripe.
+				 * 'sq', we know that if that happens,
+				 * STRIPE_QUEUE_EXPANDING will get set and the
+				 * expansion won't proceed until we finish
+				 * with the queue.
 				 */
 				int must_retry = 0;
 				spin_lock_irq(&conf->device_lock);
@@ -3662,7 +4085,9 @@ static int make_request(request_queue_t *q, struct bio * bi)
 					must_retry = 1;
 				spin_unlock_irq(&conf->device_lock);
 				if (must_retry) {
-					release_stripe(sh);
+					release_queue(sq);
+					if (sh)
+						release_stripe(sh);
 					goto retry;
 				}
 			}
@@ -3671,27 +4096,32 @@ static int make_request(request_queue_t *q, struct bio * bi)
 			 */
 			if (logical_sector >= mddev->suspend_lo &&
 			    logical_sector < mddev->suspend_hi) {
-				release_stripe(sh);
+				release_queue(sq);
+				if (sh)
+					release_stripe(sh);
 				schedule();
 				goto retry;
 			}
 
-			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-			    !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+			if (test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) ||
+			    !add_queue_bio(sq, bi, dd_idx,
+						bi->bi_rw & RW_MASK)) {
 				/* Stripe is busy expanding or
 				 * add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
-				release_stripe(sh);
+				release_queue(sq);
+				if (sh)
+					release_stripe(sh);
 				schedule();
 				goto retry;
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
-			handle_stripe(sh, NULL);
-			release_stripe(sh);
+			handle_queue(sq, sh, disks, data_disks);
+			release_queue(sq);
 		} else {
-			/* cannot get stripe for read-ahead, just give-up */
+			/* cannot get queue for read-ahead, just give up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
 			finish_wait(&conf->wait_for_overlap, &w);
 			break;
@@ -3727,6 +4157,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
+	struct stripe_queue *sq;
 	int pd_idx;
 	sector_t first_sector, last_sector;
 	int raid_disks = conf->previous_raid_disks;
@@ -3780,9 +4211,12 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 		int j;
 		int skipped = 0;
 		pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
-		sh = get_active_stripe(conf, sector_nr+i,
-				       conf->raid_disks, pd_idx, 0);
-		set_bit(STRIPE_EXPANDING, &sh->state);
+		sq = get_active_queue(conf, sector_nr+i,
+					conf->raid_disks, pd_idx, 0, &sh);
+		if (!sh)
+			sh = get_active_stripe(conf, sq,
+					       conf->raid_disks, 0);
+		set_bit(STRIPE_QUEUE_EXPANDING, &sq->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
 		 * array, then we need to zero those blocks
@@ -3811,6 +4245,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
 		release_stripe(sh);
+		release_queue(sq);
 	}
 	spin_lock_irq(&conf->device_lock);
 	conf->expand_progress = (sector_nr + i) * new_data_disks;
@@ -3834,11 +4269,16 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	while (first_sector <= last_sector) {
 		pd_idx = stripe_to_pdidx(first_sector, conf,
 					 conf->previous_raid_disks);
-		sh = get_active_stripe(conf, first_sector,
-				       conf->previous_raid_disks, pd_idx, 0);
+		sq = get_active_queue(conf, first_sector,
+				       conf->previous_raid_disks, pd_idx, 0,
+				       &sh);
+		if (!sh)
+			sh = get_active_stripe(conf, sq,
+					       conf->previous_raid_disks, 0);
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		release_stripe(sh);
+		release_queue(sq);
 		first_sector += STRIPE_SECTORS;
 	}
 	return conf->chunk_size>>9;
@@ -3898,16 +4338,16 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 	}
 
 	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
-	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
+	sq = get_active_queue(conf, sector_nr, raid_disks, pd_idx, 0, &sh);
+	if (!sh)
+		sh = get_active_stripe(conf, sq, raid_disks, 1);
 	if (sh == NULL) {
-		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);
+		sh = get_active_stripe(conf, sq, raid_disks, 0);
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access
 		 */
 		schedule_timeout_uninterruptible(1);
 	}
-	sq = sh->sq;
-
 	/* Need to check if array will still be degraded after recovery/resync
 	 * We don't need to check the 'failed' flag as when that gets set,
 	 * recovery aborts.
@@ -3925,6 +4365,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	handle_stripe(sh, NULL);
 	release_stripe(sh);
+	release_queue(sq);
 
 	return STRIPE_SECTORS;
 }
@@ -3941,18 +4382,19 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * We *know* that this entire raid_bio is in one chunk, so
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
-	struct stripe_head *sh;
 	struct stripe_queue *sq;
 	int dd_idx, pd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
 	int remaining;
 	int handled = 0;
+	int disks = conf->raid_disks;
+	int data_disks = disks - conf->max_degraded;
 
 	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	sector = raid5_compute_sector(	logical_sector,
-					conf->raid_disks,
-					conf->raid_disks - conf->max_degraded,
+					disks,
+					data_disks,
 					&dd_idx,
 					&pd_idx,
 					conf);
@@ -3962,31 +4404,32 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	     logical_sector += STRIPE_SECTORS,
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
+		struct stripe_head *sh;
 
 		if (scnt < raid_bio->bi_hw_segments)
 			/* already done this stripe */
 			continue;
 
-		sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1);
-
-		if (!sh) {
-			/* failed to get a stripe - must wait */
+		sq = get_active_queue(conf, sector, disks, pd_idx, 1, &sh);
+		if (!sq) {
+			/* failed to get a queue - must wait */
 			raid_bio->bi_hw_segments = scnt;
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
 
-		sq = sh->sq;
 		set_bit(R5_ReadError, &sq->dev[dd_idx].flags);
-		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
-			release_stripe(sh);
+		if (!add_queue_bio(sq, raid_bio, dd_idx, 0)) {
+			release_queue(sq);
+			if (sh)
+				release_stripe(sh);
 			raid_bio->bi_hw_segments = scnt;
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
 
-		handle_stripe(sh, NULL);
-		release_stripe(sh);
+		handle_queue(sq, sh, disks, data_disks);
+		release_queue(sq);
 		handled++;
 	}
 	spin_lock_irq(&conf->device_lock);
@@ -4005,7 +4448,59 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	return handled;
 }
 
+static void raid5qd(struct work_struct *work)
+{
+	raid5_conf_t *conf = container_of(work, raid5_conf_t,
+					  stripe_queue_work);
+	struct list_head *sq_entry;
+	int attach = 0;
 
+	/* attach queues to stripes in priority order */
+	pr_debug("+++ raid5qd active\n");
+	spin_lock_irq(&conf->device_lock);
+	do {
+		sq_entry = NULL;
+		if (!list_empty(&conf->stripe_overwrite_list))
+			sq_entry = conf->stripe_overwrite_list.next;
+		else if (!list_empty(&conf->unaligned_read_list))
+			sq_entry = conf->unaligned_read_list.next;
+		else if (!list_empty(&conf->subwidth_write_list))
+			sq_entry = conf->subwidth_write_list.next;
+
+		/* "these aren't the droids you're looking for..."
+		 * do not handle the delayed list while there are better
+		 * things to do
+		 */
+		if (!sq_entry &&
+		    atomic_read(&conf->preread_active_queues) <
+		    IO_THRESHOLD && !blk_queue_plugged(conf->mddev->queue) &&
+		    !list_empty(&conf->delayed_q_list)) {
+			raid5_activate_delayed(conf);
+			BUG_ON(list_empty(&conf->subwidth_write_list));
+			sq_entry = conf->subwidth_write_list.next;
+		}
+
+		if (sq_entry) {
+			struct stripe_queue *sq;
+			struct stripe_head *sh;
+			sq = list_entry(sq_entry, struct stripe_queue,
+					list_node);
+			BUG_ON(sq->sh);
+
+			spin_unlock_irq(&conf->device_lock);
+			sh = get_active_stripe(conf, sq, conf->raid_disks, 0);
+			spin_lock_irq(&conf->device_lock);
+
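+			/* queue now has a stripe; let raid5d handle it */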
+			set_bit(STRIPE_HANDLE, &sh->state);
+			__release_stripe(conf, sh);
+			attach++;
+		}
+	} while (sq_entry);
+	spin_unlock_irq(&conf->device_lock);
+	pr_debug("%d stripe(s) attached\n", attach);
+	pr_debug("--- raid5qd inactive\n");
+}
 
 /*
  * This is our raid5 kernel thread.
@@ -4039,12 +4533,6 @@ static void raid5d (mddev_t *mddev)
 			activate_bit_delay(conf);
 		}
 
-		if (list_empty(&conf->handle_list) &&
-		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
-		    !blk_queue_plugged(mddev->queue) &&
-		    !list_empty(&conf->delayed_list))
-			raid5_activate_delayed(conf);
-
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
@@ -4056,6 +4544,7 @@ static void raid5d (mddev_t *mddev)
 		}
 
 		if (list_empty(&conf->handle_list)) {
+			queue_work(conf->workqueue, &conf->stripe_queue_work);
 			async_tx_issue_pending_all();
 			break;
 		}
@@ -4098,7 +4587,8 @@ raid5_store_stripe_cache_size(mddev_t *mddev, const char *page, size_t len)
 {
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	char *end;
-	int new;
+	int new, queue, i;
+
 	if (len >= PAGE_SIZE)
 		return -EINVAL;
 	if (!conf)
@@ -4114,9 +4604,21 @@ raid5_store_stripe_cache_size(mddev_t *mddev, const char *page, size_t len)
 			conf->max_nr_stripes--;
 		else
 			break;
+
+		for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++)
+			queue += drop_one_queue(conf);
+
+		if (queue < STRIPE_QUEUE_SIZE)
+			break;
 	}
 	md_allow_write(mddev);
 	while (new > conf->max_nr_stripes) {
+		for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++)
+			queue += grow_one_queue(conf);
+
+		if (queue < STRIPE_QUEUE_SIZE)
+			break;
+
 		if (grow_one_stripe(conf))
 			conf->max_nr_stripes++;
 		else break;
@@ -4142,9 +4644,23 @@ stripe_cache_active_show(mddev_t *mddev, char *page)
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+stripe_queue_active_show(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	if (conf)
+		return sprintf(page, "%d\n", atomic_read(&conf->active_queues));
+	else
+		return 0;
+}
+
+static struct md_sysfs_entry
+raid5_stripequeue_active = __ATTR_RO(stripe_queue_active);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
+	&raid5_stripequeue_active.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4245,16 +4761,29 @@ static int run(mddev_t *mddev)
 		if (!conf->spare_page)
 			goto abort;
 	}
+
+	sprintf(conf->workqueue_name, "%s_q", mddev->gendisk->disk_name);
+	conf->workqueue = create_singlethread_workqueue(conf->workqueue_name);
+	if (!conf->workqueue)
+		goto abort;
+
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
+	init_waitqueue_head(&conf->wait_for_queue);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
-	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
 	INIT_LIST_HEAD(&conf->inactive_list);
+	INIT_LIST_HEAD(&conf->delayed_q_list);
+	INIT_LIST_HEAD(&conf->stripe_overwrite_list);
+	INIT_LIST_HEAD(&conf->unaligned_read_list);
+	INIT_LIST_HEAD(&conf->subwidth_write_list);
+	INIT_LIST_HEAD(&conf->inactive_queue_list);
 	atomic_set(&conf->active_stripes, 0);
-	atomic_set(&conf->preread_active_stripes, 0);
+	atomic_set(&conf->active_queues, 0);
+	atomic_set(&conf->preread_active_queues, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
+	INIT_WORK(&conf->stripe_queue_work, raid5qd);
 
 	pr_debug("raid5: run(%s) called.\n", mdname(mddev));
 
@@ -4354,6 +4883,8 @@ static int run(mddev_t *mddev)
 		printk(KERN_INFO "raid5: allocated %dkB for %s\n",
 			memory, mdname(mddev));
 
+	conf->stripe_queue_tree = RB_ROOT;
+
 	if (mddev->degraded == 0)
 		printk("raid5: raid level %d set %s active with %d out of %d"
 			" devices, algorithm %d\n", conf->level, mdname(mddev), 
@@ -4434,6 +4965,7 @@ static int stop(mddev_t *mddev)
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
 	kfree(conf->disks);
+	destroy_workqueue(conf->workqueue);
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;
@@ -4443,33 +4975,50 @@ static int stop(mddev_t *mddev)
 static void print_sh (struct seq_file *seq, struct stripe_head *sh)
 {
 	int i;
-	struct stripe_queue *sq = sh->sq;
 
-	seq_printf(seq, "sh %llu, pd_idx %d, state %ld.\n",
-		   (unsigned long long)sh->sector, sq->pd_idx, sh->state);
+	seq_printf(seq, "sh %llu, state %ld.\n",
+		   (unsigned long long)sh->sector, sh->state);
 	seq_printf(seq, "sh %llu,  count %d.\n",
 		   (unsigned long long)sh->sector, atomic_read(&sh->count));
 	seq_printf(seq, "sh %llu, ", (unsigned long long)sh->sector);
 	for (i = 0; i < sh->disks; i++) {
-		seq_printf(seq, "(cache%d: %p %ld) ",
-			   i, sh->dev[i].page, sq->dev[i].flags);
+		seq_printf(seq, "(cache%d: %p) ",
+			   i, sh->dev[i].page);
 	}
 	seq_printf(seq, "\n");
 }
 
-static void printall (struct seq_file *seq, raid5_conf_t *conf)
+static void print_sq(struct seq_file *seq, struct stripe_queue *sq)
 {
-	struct stripe_head *sh;
-	struct hlist_node *hn;
 	int i;
 
+	seq_printf(seq, "sq %llu, pd_idx %d, state %ld.\n",
+		   (unsigned long long)sq->sector, sq->pd_idx, sq->state);
+	seq_printf(seq, "sq %llu,  count %d.\n",
+		   (unsigned long long)sq->sector, atomic_read(&sq->count));
+	seq_printf(seq, "sq %llu, ", (unsigned long long)sq->sector);
+	for (i = 0; i < sq->raid_conf->raid_disks; i++) {
+		seq_printf(seq, "(cache%d: %ld) ",
+			   i, sq->dev[i].flags);
+	}
+	seq_printf(seq, "\n");
+	seq_printf(seq, "sq %llu,  sh %p.\n",
+		   (unsigned long long) sq->sector, sq->sh);
+	if (sq->sh)
+		print_sh(seq, sq->sh);
+}
+
+static void printall (struct seq_file *seq, raid5_conf_t *conf)
+{
+	struct stripe_queue *sq;
+	struct rb_node *rbn;
+
 	spin_lock_irq(&conf->device_lock);
-	for (i = 0; i < NR_HASH; i++) {
-		hlist_for_each_entry(sh, hn, &conf->stripe_hashtbl[i], hash) {
-			if (sh->sq->raid_conf != conf)
-				continue;
-			print_sh(seq, sh);
-		}
+	rbn = rb_first(&conf->stripe_queue_tree);
+	while (rbn) {
+		sq = rb_entry(rbn, struct stripe_queue, rb_node);
+		print_sq(seq, sq);
+		rbn = rb_next(rbn);
 	}
 	spin_unlock_irq(&conf->device_lock);
 }
@@ -4788,8 +5337,8 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	case 1: /* stop all writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 1;
-		wait_event_lock_irq(conf->wait_for_stripe,
-				    atomic_read(&conf->active_stripes) == 0 &&
+		wait_event_lock_irq(conf->wait_for_queue,
+				    atomic_read(&conf->active_queues) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
 				    conf->device_lock, /* nothing */);
 		spin_unlock_irq(&conf->device_lock);
@@ -4798,7 +5347,7 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	case 0: /* re-enable writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 0;
-		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_queue);
 		wake_up(&conf->wait_for_overlap);
 		spin_unlock_irq(&conf->device_lock);
 		break;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 33e26e0..1bb859f 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -3,6 +3,7 @@
 
 #include <linux/raid/md.h>
 #include <linux/raid/xor.h>
+#include <linux/rbtree.h>
 
 /*
  *
@@ -181,7 +182,7 @@ struct stripe_head {
 		int		   count;
 		u32		   zero_sum_result;
 	} ops;
-	struct stripe_queue *sq;
+	struct stripe_queue *sq; /* tracks pending bios for this stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -205,12 +206,28 @@ struct r6_state {
 	int p_failed, q_failed, qd_idx, failed_num[2];
 };
 
+/* stripe_queue
+ * @sector - rb_tree key
+ * @lock
+ * @sh - our stripe_head in the cache
+ * @list_node - once this queue object satisfies some constraint (like full
+ *  stripe write) it is placed on a list for processing by the cache
+ * @to_read - bitmap of blocks with pending reads
+ * @to_write - bitmap of blocks with pending writes
+ * @overwrite - bitmap of blocks that will be completely overwritten
+ */
 struct stripe_queue {
+	struct rb_node rb_node;
 	sector_t sector;
 	int pd_idx; /* parity disk index */
 	int bm_seq; /* sequence number for bitmap flushes */
 	spinlock_t lock;
 	struct raid5_private_data *raid_conf;
+	unsigned long state;
+	struct stripe_head *sh;
+	struct list_head list_node;
+	unsigned long *to_read;
+	unsigned long *to_write;
+	unsigned long *overwrite;
+	atomic_t count;
 	struct r5_queue_dev {
 		sector_t sector; /* hw starting sector for this block */
 		struct bio *toread, *towrite;
@@ -254,11 +271,7 @@ struct stripe_queue {
 #define STRIPE_HANDLE		2
 #define	STRIPE_SYNCING		3
 #define	STRIPE_INSYNC		4
-#define	STRIPE_PREREAD_ACTIVE	5
-#define	STRIPE_DELAYED		6
 #define	STRIPE_DEGRADED		7
-#define	STRIPE_BIT_DELAY	8
-#define	STRIPE_EXPANDING	9
 #define	STRIPE_EXPAND_SOURCE	10
 #define	STRIPE_EXPAND_READY	11
 /*
@@ -280,6 +293,18 @@ struct stripe_queue {
 #define STRIPE_OP_MOD_DMA_CHECK 8
 
 /*
+ * Stripe-queue state
+ */
+#define STRIPE_QUEUE_HANDLE	0
+#define STRIPE_QUEUE_OVERWRITE	1
+#define STRIPE_QUEUE_READ	2
+#define STRIPE_QUEUE_DELAYED	3
+#define STRIPE_QUEUE_WRITE	4
+#define STRIPE_QUEUE_EXPANDING	5
+#define STRIPE_QUEUE_PREREAD_ACTIVE 6
+#define STRIPE_QUEUE_BIT_DELAY	7
+
+/*
  * Plugging:
  *
  * To improve write throughput, we need to delay the handling of some
@@ -310,6 +335,7 @@ struct disk_info {
 
 struct raid5_private_data {
 	struct hlist_head	*stripe_hashtbl;
+	struct rb_root		stripe_queue_tree;
 	mddev_t			*mddev;
 	struct disk_info	*spare;
 	int			chunk_size, level, algorithm;
@@ -325,12 +351,23 @@ struct raid5_private_data {
 	int			previous_raid_disks;
 
 	struct list_head	handle_list; /* stripes needing handling */
-	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
+	struct list_head	delayed_q_list; /* queues that have plugged
+						 * requests
+						 */
+	struct list_head	stripe_overwrite_list; /* stripe-wide writes */
+	struct list_head	unaligned_read_list; /* dev_q->toread is set */
+	struct list_head	subwidth_write_list; /* dev_q->towrite is set */
+	struct workqueue_struct *workqueue; /* attaches sq's to sh's */
+	struct work_struct	stripe_queue_work;
+	char			workqueue_name[20];
+
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
-	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;
+	atomic_t		preread_active_queues; /* queues with scheduled
+							* io
+							*/
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
@@ -338,7 +375,7 @@ struct raid5_private_data {
 	 */
 	int			active_name;
 	char			sh_cache_name[2][20];
 	char			sq_cache_name[2][20];
 	struct kmem_cache	*sh_slab_cache;
 	struct kmem_cache	*sq_slab_cache;
 
@@ -353,12 +390,20 @@ struct raid5_private_data {
 	struct page 		*spare_page; /* Used when checking P/Q in raid6 */
 
 	/*
+	 * Free queue pool
+	 */
+	atomic_t		active_queues;
+	struct list_head	inactive_queue_list;
+	wait_queue_head_t	wait_for_queue;
+	wait_queue_head_t	wait_for_overlap;
+	int			inactive_queue_blocked;
+
+	/*
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
-	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */


* Re: [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2
  2007-07-03 23:20 [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2 Dan Williams
  2007-07-03 23:20 ` [RFC PATCH 1/2] raid5: add the stripe_queue object for tracking raid io requests Dan Williams
  2007-07-03 23:20 ` [RFC PATCH 2/2] raid5: use stripe_queues to prioritize the "most deserving" requests, take2 Dan Williams
@ 2007-07-04 11:41 ` Andi Kleen
  2007-07-06  1:56   ` Dan Williams
  2 siblings, 1 reply; 5+ messages in thread
From: Andi Kleen @ 2007-07-04 11:41 UTC (permalink / raw)
  To: Dan Williams; +Cc: neilb, raziebe, akpm, davidsen, linux-kernel, linux-raid

Dan Williams <dan.j.williams@intel.com> writes:

> The write performance numbers are better than I expected and would seem
> to address the concerns raised in the thread "Odd (slow) RAID
> performance"[2].  The read performance drop was not expected.  However,
> the numbers suggest some additional changes to be made to the queuing
> model.

Have you considered supporting copy-xor in MD for non-accelerated
RAID? I've been looking at fixing the old, dubious, slow, crufty x86
SSE XOR functions. One thing I discovered is that it seems fairly
pointless to make them slower with cache avoidance when most of the
data has just been copied anyway. I think much more could be gained
by supporting copy-xor, because XORing during a copy should be
nearly free.
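
The core idea, as a rough plain-C sketch (made-up names; the real
version would use SSE with non-temporal stores):

	/* fuse the copy we do anyway with the parity update; the source
	 * word is already in a register, so the xor costs almost nothing
	 */
	static void copy_and_xor(unsigned long *dst, unsigned long *parity,
				 const unsigned long *src, size_t words)
	{
		size_t i;

		for (i = 0; i < words; i++) {
			unsigned long v = src[i];
			dst[i] = v;
			parity[i] ^= v;
		}
	}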

On the other hand, ext3 write() also uses a cache-avoiding copy now,
so for the XOR it would need to load the data from memory again.
Perhaps this could also be optimized somehow (e.g. setting a flag
somewhere and using a normal copy for the RAID-5 case).

-Andi



* Re: [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2
  2007-07-04 11:41 ` [RFC PATCH 0/2] raid5: 65% sequential-write performance improvement, stripe-queue take2 Andi Kleen
@ 2007-07-06  1:56   ` Dan Williams
  0 siblings, 0 replies; 5+ messages in thread
From: Dan Williams @ 2007-07-06  1:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: neilb, raziebe, akpm, davidsen, linux-kernel, linux-raid

On 04 Jul 2007 13:41:26 +0200, Andi Kleen <andi@firstfloor.org> wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > The write performance numbers are better than I expected and would seem
> > to address the concerns raised in the thread "Odd (slow) RAID
> > performance"[2].  The read performance drop was not expected.  However,
> > the numbers suggest some additional changes to be made to the queuing
> > model.
>
> Have you considered supporting copy-xor in MD for non-accelerated
> RAID? I've been looking at fixing the old, dubious, slow, crufty x86
> SSE XOR functions.
Copy-xor is something that Neil suggested at the beginning of the
acceleration work.  It was put on the back-burner, but now that the
implementation has settled it can be revisited.

> One thing I discovered is that it seems fairly
> pointless to make them slower with cache avoidance when most of the
> data has just been copied anyway. I think much more could be gained
> by supporting copy-xor, because XORing during a copy should be
> nearly free.
>
Yes, it does not make sense for MD to mismatch the cache-avoidance of
its copy and xor operations.  However, I think the memcpy should be
changed to a cache-avoiding memcpy rather than caching the xor data.
Then a copy-xor implementation will have a greater effect, or do you
see it differently?

> On the other hand, ext3 write() also uses a cache-avoiding copy now,
> so for the XOR it would need to load the data from memory again.
> Perhaps this could also be optimized somehow (e.g. setting a flag
> somewhere and using a normal copy for the RAID-5 case).
>
The incoming async_memcpy call has a flags parameter where this could go...
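
For instance (ASYNC_TX_SRC_CACHED is a made-up flag name and the
operands are illustrative; the call form mirrors the existing
async_memcpy users in this patchset):

	tx = async_memcpy(dev->page, bio_page, 0, 0, STRIPE_SIZE,
			  ASYNC_TX_DEP_ACK | ASYNC_TX_SRC_CACHED,
			  tx, NULL, NULL);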

One possible way to implement support for copy-xor (and xor-copy-xor
for that matter) would be to write a soft-dmaengine driver.  When a
memcpy is submitted, it can hold off processing to see whether an xor
operation gets attached to the chain.  Once the xor descriptor is
attached, the implementation knows the location of all the incoming
data, all the existing stripe data, and the destination for the xor.
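
In rough pseudo-C (all names invented):

	/* at issue_pending time: if an xor descriptor chains off the
	 * copy's destination, run both as one fused copy-xor pass
	 */
	if (next_desc_reads(copy->dest))
		fused_copy_xor(copy, next_desc);
	else
		plain_memcpy(copy);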

> -Andi

Dan

