* [PATCH RFC 0/4] raid5: write-back caching policy and write performance
From: Dan Williams @ 2007-04-11  6:00 UTC
  To: linux-raid

These patches are presented to further the discussion of raid5 write
performance; they are not yet meant for mainline or -mm inclusion.
Raz's delayed-activation patch showed interesting results, so it has
been ported and included in this series.  The question to be answered
is whether the sequential write performance of a raid5 array can, out
of the box, approach that of a similarly configured raid0 array (minus
one disk).  Currently, on an iop13xx platform, tiobench reports a 2x
advantage for the N-1 raid0 array, so there appears to be room for
improvement.

The third patch in the series adds a write-back caching capability to
md in order to investigate the raw throughput to the stripe cache.
Since battery-backed memory is not being used, this patch makes the
system markedly less safe, so only use it with data that can be thrown
away.  Initial testing with dd shows that the throughput of this policy
can be ~1.8x that of the default write-through policy, provided the
data set is smaller than the cache size.  Once cache pressure begins to
force writes to disk, performance drops well below the write-through
case, so work remains to understand how the write-through policy
achieves better sustained throughput.
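
For intuition, the policy difference at write-acknowledgement time
amounts to something like the sketch below.  This is conceptual only --
patch 3's actual code is not reproduced in this excerpt, and
copy_into_stripe_cache()/mark_stripe_dirty() are hypothetical helpers:

	/* write-through (default): a bio completes only after parity has
	 * been recomputed and the blocks are on the member disks.
	 * write-back (patch 3): acknowledge once the data reaches the
	 * stripe cache and flush to disk later; this is why the policy
	 * is unsafe without battery-backed memory.
	 */
	static void writeback_ack_write(struct stripe_head *sh, struct bio *bi)
	{
		copy_into_stripe_cache(sh, bi);	/* hypothetical */
		bio_endio(bi, bi->bi_size, 0);	/* done before any disk I/O */
		mark_stripe_dirty(sh);		/* hypothetical: flushed later
						 * under cache pressure */
	}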

I am interested in how these patches perform on other platforms and
configurations, and in comments on the implementation.

[ based on 2.6.21-rc6 + git-md-accel.patch from -mm ]
      md: introduce struct stripe_head_state
      md: refactor raid5 cache policy code using 'struct stripe_cache_policy'
      md: writeback caching policy for raid5 [experimental]
      md: delayed stripe activation

The patches can also be pulled via git:
git pull git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

--
Dan


* [PATCH RFC 1/4] md: introduce struct stripe_head_state
From: Dan Williams @ 2007-04-11  6:00 UTC
  To: linux-raid

struct stripe_head_state collects all the dynamic stripe-state
information that is calculated and tracked during calls to
handle_stripe.  Gathering this state into a single structure makes it
possible to break handle_stripe functionality out into subroutines.
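
As an illustration of the refactoring this enables (a sketch, not part
of this patch): once the counters live in one struct, a block such as
the "failed > 1 && syncing" handling in handle_stripe5 can be lifted
into a helper that takes the state by pointer:

	/* hypothetical helper mirroring the md_done_sync block below */
	static void handle_failed_sync(raid5_conf_t *conf, struct stripe_head *sh,
				       struct stripe_head_state *s)
	{
		if (s->failed > 1 && s->syncing) {
			md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
			clear_bit(STRIPE_SYNCING, &sh->state);
			s->syncing = 0;
		}
	}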

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  280 ++++++++++++++++++++++----------------------
 include/linux/raid/raid5.h |   11 ++
 2 files changed, 153 insertions(+), 138 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 74ce354..684552a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1872,12 +1872,14 @@ static void handle_stripe5(struct stripe_head *sh)
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
-	int syncing, expanding, expanded;
-	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-	int to_fill=0, compute=0, req_compute=0, non_overwrite=0;
-	int failed_num=0;
+	struct stripe_head_state s = {
+		.locked=0, .uptodate=0, .to_read=0, .to_write=0, .failed=0,
+		.written=0, .to_fill=0, .compute=0, .req_compute=0,
+		.non_overwrite=0,
+	};
 	struct r5dev *dev;
 	unsigned long pending=0;
+	s.failed_num=0;
 
 	PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d ops=%lx:%lx:%lx\n",
 	       (unsigned long long)sh->sector, sh->state, atomic_read(&sh->count),
@@ -1887,9 +1889,9 @@ static void handle_stripe5(struct stripe_head *sh)
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
-	syncing = test_bit(STRIPE_SYNCING, &sh->state);
-	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
+	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
 	rcu_read_lock();
@@ -1911,22 +1913,22 @@ static void handle_stripe5(struct stripe_head *sh)
 			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
-		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
-		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
+		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
 
 		if (test_bit(R5_Wantfill, &dev->flags))
-			to_fill++;
+			s.to_fill++;
 		else if (dev->toread)
-			to_read++;
+			s.to_read++;
 
-		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
+		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++s.compute > 1);
 
 		if (dev->towrite) {
-			to_write++;
+			s.to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
-				non_overwrite++;
+				s.non_overwrite++;
 		}
-		if (dev->written) written++;
+		if (dev->written) s.written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
@@ -1935,23 +1937,24 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
 		    || test_bit(R5_ReadError, &dev->flags)) {
-			failed++;
-			failed_num = i;
+			s.failed++;
+			s.failed_num = i;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
 	rcu_read_unlock();
 
-	if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+	if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 		sh->ops.count++;
 
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
-		locked, uptodate, to_read, to_write, to_fill, failed, failed_num);
+		s.locked, s.uptodate, s.to_read, s.to_write, s.to_fill,
+		s.failed, s.failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
 	 */
-	if (failed > 1 && to_read+to_write+written) {
+	if (s.failed > 1 && s.to_read+s.to_write+s.written) {
 		for (i=disks; i--; ) {
 			int bitmap_end = 0;
 
@@ -1969,7 +1972,7 @@ static void handle_stripe5(struct stripe_head *sh)
 			/* fail all writes first */
 			bi = sh->dev[i].towrite;
 			sh->dev[i].towrite = NULL;
-			if (bi) { to_write--; bitmap_end = 1; }
+			if (bi) { s.to_write--; bitmap_end = 1; }
 
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 				wake_up(&conf->wait_for_overlap);
@@ -2009,7 +2012,7 @@ static void handle_stripe5(struct stripe_head *sh)
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 					wake_up(&conf->wait_for_overlap);
-				if (bi) to_read--;
+				if (bi) s.to_read--;
 				while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS){
 					struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 					clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -2026,20 +2029,20 @@ static void handle_stripe5(struct stripe_head *sh)
 						STRIPE_SECTORS, 0, 0);
 		}
 	}
-	if (failed > 1 && syncing) {
+	if (s.failed > 1 && s.syncing) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
-		syncing = 0;
+		s.syncing = 0;
 	}
 
 	/* might be able to return some write requests if the parity block
 	 * is safe, or on a failed drive
 	 */
 	dev = &sh->dev[sh->pd_idx];
-	if ( written &&
+	if ( s.written &&
 	     ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
 		test_bit(R5_UPTODATE, &dev->flags))
-	       || (failed == 1 && failed_num == sh->pd_idx))
+	       || (s.failed == 1 && s.failed_num == sh->pd_idx))
 	    ) {
 	    /* any written block on an uptodate or failed drive can be returned.
 	     * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but 
@@ -2081,8 +2084,8 @@ static void handle_stripe5(struct stripe_head *sh)
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) || expanding ||
-		test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
+	if (s.to_read || s.non_overwrite || (s.syncing && (s.uptodate + s.compute < disks)) ||
+		s.expanding || test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
 
 		/* Clear completed compute operations.  Parity recovery
 		 * (STRIPE_OP_MOD_REPAIR_PD) implies a write-back which is handled
@@ -2114,11 +2117,11 @@ static void handle_stripe5(struct stripe_head *sh)
 				if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 				     (dev->toread ||
 				     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-				     syncing ||
-				     expanding ||
-				     (failed && (sh->dev[failed_num].toread ||
-						 (sh->dev[failed_num].towrite &&
-						 	!test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
+				     s.syncing ||
+				     s.expanding ||
+				     (s.failed && (sh->dev[s.failed_num].toread ||
+						 (sh->dev[s.failed_num].towrite &&
+						 	!test_bit(R5_OVERWRITE, &sh->dev[s.failed_num].flags))))
 					    )
 					) {
 					/* 1/ We would like to get this block, possibly
@@ -2132,20 +2135,20 @@ static void handle_stripe5(struct stripe_head *sh)
 					 * 3/ We hold off parity block re-reads until check
 					 * operations have quiesced.
 					 */
-					if ((uptodate == disks-1) && !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+					if ((s.uptodate == disks-1) && !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 						set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
 						set_bit(R5_Wantcompute, &dev->flags);
 						sh->ops.target = i;
-						req_compute = 1;
+						s.req_compute = 1;
 						sh->ops.count++;
 						/* Careful: from this point on 'uptodate' is in the eye of
 						 * raid5_run_ops which services 'compute' operations before
 						 * writes. R5_Wantcompute flags a block that will be R5_UPTODATE
 						 * by the time it is needed for a subsequent operation.
 						 */
-						uptodate++;
+						s.uptodate++;
 						break; /* uptodate + compute == disks */
-					} else if ((uptodate < disks-1) && test_bit(R5_Insync, &dev->flags)) {
+					} else if ((s.uptodate < disks-1) && test_bit(R5_Insync, &dev->flags)) {
 						/* Note: we hold off compute operations while checks are in flight,
 						 * but we still prefer 'compute' over 'read' hence we only read if
 						 * (uptodate < disks-1)
@@ -2154,9 +2157,9 @@ static void handle_stripe5(struct stripe_head *sh)
 						set_bit(R5_Wantread, &dev->flags);
 						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 							sh->ops.count++;
-						locked++;
+						s.locked++;
 						PRINTK("Reading block %d (sync=%d)\n",
-							i, syncing);
+							i, s.syncing);
 					}
 				}
 			}
@@ -2207,7 +2210,7 @@ static void handle_stripe5(struct stripe_head *sh)
 				if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 					sh->ops.count++;
 				if (!test_bit(R5_Insync, &dev->flags)
-				    || (i==sh->pd_idx && failed == 0))
+				    || (i==sh->pd_idx && s.failed == 0))
 					set_bit(STRIPE_INSYNC, &sh->state);
 			}
 		}
@@ -2223,7 +2226,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	 *    a check is in flight
 	 * 3/ Write operations do not stack
 	 */
-	if (to_write && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending) &&
+	if (s.to_write && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending) &&
 		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 		int rmw=0, rcw=0;
 		for (i=disks ; i--;) {
@@ -2266,7 +2269,7 @@ static void handle_stripe5(struct stripe_head *sh)
 						set_bit(R5_Wantread, &dev->flags);
 						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 							sh->ops.count++;
-						locked++;
+						s.locked++;
 					} else {
 						set_bit(STRIPE_DELAYED, &sh->state);
 						set_bit(STRIPE_HANDLE, &sh->state);
@@ -2288,7 +2291,7 @@ static void handle_stripe5(struct stripe_head *sh)
 						set_bit(R5_Wantread, &dev->flags);
 						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 							sh->ops.count++;
-						locked++;
+						s.locked++;
 					} else {
 						set_bit(STRIPE_DELAYED, &sh->state);
 						set_bit(STRIPE_HANDLE, &sh->state);
@@ -2303,10 +2306,10 @@ static void handle_stripe5(struct stripe_head *sh)
 		 * is not the case then new writes need to be held off until the compute
 		 * completes.
 		 */
-		if ((req_compute || !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) &&
-			(locked == 0 && (rcw == 0 ||rmw == 0) &&
+		if ((s.req_compute || !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) &&
+			(s.locked == 0 && (rcw == 0 ||rmw == 0) &&
 			!test_bit(STRIPE_BIT_DELAY, &sh->state)))
-			locked += handle_write_operations5(sh, rcw == 0, 0);
+			s.locked += handle_write_operations5(sh, rcw == 0, 0);
 	}
 
 	/* 1/ Maybe we need to check and possibly fix the parity for this stripe.
@@ -2315,7 +2318,7 @@ static void handle_stripe5(struct stripe_head *sh)
 	 * 2/ Hold off parity checks while parity dependent operations are in flight
 	 *    (conflicting writes are protected by the 'locked' variable)
 	 */
-	if ((syncing && locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+	if ((s.syncing && s.locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
 		!test_bit(STRIPE_INSYNC, &sh->state)) ||
 	    	test_bit(STRIPE_OP_CHECK, &sh->ops.pending) ||
 	    	test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
@@ -2327,12 +2330,12 @@ static void handle_stripe5(struct stripe_head *sh)
 		 * 3/ skip to the writeback section if we previously
 		 *    initiated a recovery operation
 		 */
-		if (failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+		if (s.failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
 			if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
-				BUG_ON(uptodate != disks);
+				BUG_ON(s.uptodate != disks);
 				clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
 				sh->ops.count++;
-				uptodate--;
+				s.uptodate--;
 			} else if (test_and_clear_bit(STRIPE_OP_CHECK, &sh->ops.complete)) {
 				clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
 				clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
@@ -2354,7 +2357,7 @@ static void handle_stripe5(struct stripe_head *sh)
 							&sh->dev[sh->pd_idx].flags);
 						sh->ops.target = sh->pd_idx;
 						sh->ops.count++;
-						uptodate++;
+						s.uptodate++;
 					}
 				}
 			}
@@ -2378,22 +2381,22 @@ static void handle_stripe5(struct stripe_head *sh)
 			!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
 
 			/* either failed parity check, or recovery is happening */
-			if (failed==0)
-				failed_num = sh->pd_idx;
-			dev = &sh->dev[failed_num];
+			if (s.failed==0)
+				s.failed_num = sh->pd_idx;
+			dev = &sh->dev[s.failed_num];
 			BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
-			BUG_ON(uptodate != disks);
+			BUG_ON(s.uptodate != disks);
 
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
 			clear_bit(STRIPE_DEGRADED, &sh->state);
-			locked++;
+			s.locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
 		}
 	}
-	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+	if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
@@ -2401,26 +2404,26 @@ static void handle_stripe5(struct stripe_head *sh)
 	/* If the failed drive is just a ReadError, then we might need to progress
 	 * the repair/check process
 	 */
-	if (failed == 1 && ! conf->mddev->ro &&
-	    test_bit(R5_ReadError, &sh->dev[failed_num].flags)
-	    && !test_bit(R5_LOCKED, &sh->dev[failed_num].flags)
-	    && test_bit(R5_UPTODATE, &sh->dev[failed_num].flags)
+	if (s.failed == 1 && ! conf->mddev->ro &&
+	    test_bit(R5_ReadError, &sh->dev[s.failed_num].flags)
+	    && !test_bit(R5_LOCKED, &sh->dev[s.failed_num].flags)
+	    && test_bit(R5_UPTODATE, &sh->dev[s.failed_num].flags)
 		) {
-		dev = &sh->dev[failed_num];
+		dev = &sh->dev[s.failed_num];
 		if (!test_bit(R5_ReWrite, &dev->flags)) {
 			set_bit(R5_Wantwrite, &dev->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
 			set_bit(R5_ReWrite, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
-			locked++;
+			s.locked++;
 		} else {
 			/* let's read it back */
 			set_bit(R5_Wantread, &dev->flags);
 			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
 				sh->ops.count++;
 			set_bit(R5_LOCKED, &dev->flags);
-			locked++;
+			s.locked++;
 		}
 	}
 
@@ -2443,20 +2446,20 @@ static void handle_stripe5(struct stripe_head *sh)
 		}
 	}
 
-	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
 		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
 		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
-		locked += handle_write_operations5(sh, 0, 1);
-	} else if (expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+		s.locked += handle_write_operations5(sh, 0, 1);
+	} else if (s.expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);
 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (expanding && locked == 0) {
+	if (s.expanding && s.locked == 0) {
 		/* We have read all the blocks in this stripe and now we need to
 		 * copy some of them into a target stripe for expand.
 		 */
@@ -2537,14 +2540,15 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
-	int syncing, expanding, expanded;
-	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-	int non_overwrite = 0;
-	int failed_num[2] = {0, 0};
+	struct stripe_head_state s = {
+		.locked=0, .uptodate=0, .to_read=0, .to_write=0, .failed=0,
+		.written=0, .non_overwrite = 0,
+	};
 	struct r5dev *dev, *pdev, *qdev;
 	int pd_idx = sh->pd_idx;
 	int qd_idx = raid6_next_disk(pd_idx, disks);
 	int p_failed, q_failed;
+	s.r6_failed_num[0] = s.r6_failed_num[1] = 0;
 
 	PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d, qd_idx=%d\n",
 	       (unsigned long long)sh->sector, sh->state, atomic_read(&sh->count),
@@ -2554,9 +2558,9 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
-	syncing = test_bit(STRIPE_SYNCING, &sh->state);
-	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
+	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
 	rcu_read_lock();
@@ -2591,17 +2595,17 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		}
 
 		/* now count some things */
-		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
-		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
+		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
 
 
-		if (dev->toread) to_read++;
+		if (dev->toread) s.to_read++;
 		if (dev->towrite) {
-			to_write++;
+			s.to_write++;
 			if (!test_bit(R5_OVERWRITE, &dev->flags))
-				non_overwrite++;
+				s.non_overwrite++;
 		}
-		if (dev->written) written++;
+		if (dev->written) s.written++;
 		rdev = rcu_dereference(conf->disks[i].rdev);
 		if (!rdev || !test_bit(In_sync, &rdev->flags)) {
 			/* The ReadError flag will just be confusing now */
@@ -2610,21 +2614,21 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		}
 		if (!rdev || !test_bit(In_sync, &rdev->flags)
 		    || test_bit(R5_ReadError, &dev->flags)) {
-			if ( failed < 2 )
-				failed_num[failed] = i;
-			failed++;
+			if ( s.failed < 2 )
+				s.r6_failed_num[s.failed] = i;
+			s.failed++;
 		} else
 			set_bit(R5_Insync, &dev->flags);
 	}
 	rcu_read_unlock();
 	PRINTK("locked=%d uptodate=%d to_read=%d"
 	       " to_write=%d failed=%d failed_num=%d,%d\n",
-	       locked, uptodate, to_read, to_write, failed,
-	       failed_num[0], failed_num[1]);
+	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
+	       s.r6_failed_num[0], s.r6_failed_num[1]);
 	/* check if the array has lost >2 devices and, if so, some requests might
 	 * need to be failed
 	 */
-	if (failed > 2 && to_read+to_write+written) {
+	if (s.failed > 2 && s.to_read+s.to_write+s.written) {
 		for (i=disks; i--; ) {
 			int bitmap_end = 0;
 
@@ -2642,7 +2646,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			/* fail all writes first */
 			bi = sh->dev[i].towrite;
 			sh->dev[i].towrite = NULL;
-			if (bi) { to_write--; bitmap_end = 1; }
+			if (bi) { s.to_write--; bitmap_end = 1; }
 
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 				wake_up(&conf->wait_for_overlap);
@@ -2679,7 +2683,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 				sh->dev[i].toread = NULL;
 				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 					wake_up(&conf->wait_for_overlap);
-				if (bi) to_read--;
+				if (bi) s.to_read--;
 				while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS){
 					struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 					clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -2696,10 +2700,10 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 						STRIPE_SECTORS, 0, 0);
 		}
 	}
-	if (failed > 2 && syncing) {
+	if (s.failed > 2 && s.syncing) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
-		syncing = 0;
+		s.syncing = 0;
 	}
 
 	/*
@@ -2707,13 +2711,13 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 * are safe, or on a failed drive
 	 */
 	pdev = &sh->dev[pd_idx];
-	p_failed = (failed >= 1 && failed_num[0] == pd_idx)
-		|| (failed >= 2 && failed_num[1] == pd_idx);
+	p_failed = (s.failed >= 1 && s.r6_failed_num[0] == pd_idx)
+		|| (s.failed >= 2 && s.r6_failed_num[1] == pd_idx);
 	qdev = &sh->dev[qd_idx];
-	q_failed = (failed >= 1 && failed_num[0] == qd_idx)
-		|| (failed >= 2 && failed_num[1] == qd_idx);
+	q_failed = (s.failed >= 1 && s.r6_failed_num[0] == qd_idx)
+		|| (s.failed >= 2 && s.r6_failed_num[1] == qd_idx);
 
-	if ( written &&
+	if ( s.written &&
 	     ( p_failed || ((test_bit(R5_Insync, &pdev->flags)
 			     && !test_bit(R5_LOCKED, &pdev->flags)
 			     && test_bit(R5_UPTODATE, &pdev->flags))) ) &&
@@ -2762,28 +2766,28 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (to_write && failed) ||
-	    (syncing && (uptodate < disks)) || expanding) {
+	if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
+	    (s.syncing && (s.uptodate < disks)) || s.expanding) {
 		for (i=disks; i--;) {
 			dev = &sh->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 			    (dev->toread ||
 			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-			     syncing ||
-			     expanding ||
-			     (failed >= 1 && (sh->dev[failed_num[0]].toread || to_write)) ||
-			     (failed >= 2 && (sh->dev[failed_num[1]].toread || to_write))
+			     s.syncing ||
+			     s.expanding ||
+			     (s.failed >= 1 && (sh->dev[s.r6_failed_num[0]].toread || s.to_write)) ||
+			     (s.failed >= 2 && (sh->dev[s.r6_failed_num[1]].toread || s.to_write))
 				    )
 				) {
 				/* we would like to get this block, possibly
 				 * by computing it, but we might not be able to
 				 */
-				if (uptodate == disks-1) {
+				if (s.uptodate == disks-1) {
 					PRINTK("Computing stripe %llu block %d\n",
 					       (unsigned long long)sh->sector, i);
 					compute_block_1(sh, i, 0);
-					uptodate++;
-				} else if ( uptodate == disks-2 && failed >= 2 ) {
+					s.uptodate++;
+				} else if ( s.uptodate == disks-2 && s.failed >= 2 ) {
 					/* Computing 2-failure is *very* expensive; only do it if failed >= 2 */
 					int other;
 					for (other=disks; other--;) {
@@ -2796,13 +2800,13 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 					PRINTK("Computing stripe %llu blocks %d,%d\n",
 					       (unsigned long long)sh->sector, i, other);
 					compute_block_2(sh, i, other);
-					uptodate += 2;
+					s.uptodate += 2;
 				} else if (test_bit(R5_Insync, &dev->flags)) {
 					set_bit(R5_LOCKED, &dev->flags);
 					set_bit(R5_Wantread, &dev->flags);
-					locked++;
+					s.locked++;
 					PRINTK("Reading block %d (sync=%d)\n",
-						i, syncing);
+						i, s.syncing);
 				}
 			}
 		}
@@ -2810,7 +2814,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	}
 
 	/* now to consider writing and what else, if anything should be read */
-	if (to_write) {
+	if (s.to_write) {
 		int rcw=0, must_compute=0;
 		for (i=disks ; i--;) {
 			dev = &sh->dev[i];
@@ -2836,7 +2840,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			for (i=disks; i--;) {
 				dev = &sh->dev[i];
 				if (!test_bit(R5_OVERWRITE, &dev->flags)
-				    && !(failed == 0 && (i == pd_idx || i == qd_idx))
+				    && !(s.failed == 0 && (i == pd_idx || i == qd_idx))
 				    && !test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 				    test_bit(R5_Insync, &dev->flags)) {
 					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2845,7 +2849,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 						       (unsigned long long)sh->sector, i);
 						set_bit(R5_LOCKED, &dev->flags);
 						set_bit(R5_Wantread, &dev->flags);
-						locked++;
+						s.locked++;
 					} else {
 						PRINTK("Request delayed stripe %llu block %d for Reconstruct\n",
 						       (unsigned long long)sh->sector, i);
@@ -2855,14 +2859,14 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 				}
 			}
 		/* now if nothing is locked, and if we have enough data, we can start a write request */
-		if (locked == 0 && rcw == 0 &&
+		if (s.locked == 0 && rcw == 0 &&
 		    !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
 			if ( must_compute > 0 ) {
 				/* We have failed blocks and need to compute them */
-				switch ( failed ) {
+				switch ( s.failed ) {
 				case 0:	BUG();
-				case 1: compute_block_1(sh, failed_num[0], 0); break;
-				case 2: compute_block_2(sh, failed_num[0], failed_num[1]); break;
+				case 1: compute_block_1(sh, s.r6_failed_num[0], 0); break;
+				case 2: compute_block_2(sh, s.r6_failed_num[0], s.r6_failed_num[1]); break;
 				default: BUG();	/* This request should have been failed? */
 				}
 			}
@@ -2874,7 +2878,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 				if (test_bit(R5_LOCKED, &sh->dev[i].flags)) {
 					PRINTK("Writing stripe %llu block %d\n",
 					       (unsigned long long)sh->sector, i);
-					locked++;
+					s.locked++;
 					set_bit(R5_Wantwrite, &sh->dev[i].flags);
 				}
 			/* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */
@@ -2892,14 +2896,14 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	 * Any reads will already have been scheduled, so we just see if enough data
 	 * is available
 	 */
-	if (syncing && locked == 0 && !test_bit(STRIPE_INSYNC, &sh->state)) {
+	if (s.syncing && s.locked == 0 && !test_bit(STRIPE_INSYNC, &sh->state)) {
 		int update_p = 0, update_q = 0;
 		struct r5dev *dev;
 
 		set_bit(STRIPE_HANDLE, &sh->state);
 
-		BUG_ON(failed>2);
-		BUG_ON(uptodate < disks);
+		BUG_ON(s.failed>2);
+		BUG_ON(s.uptodate < disks);
 		/* Want to check and possibly repair P and Q.
 		 * However there could be one 'failed' device, in which
 		 * case we can only check one of them, possibly using the
@@ -2911,7 +2915,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		 * by stripe_handle with a tmp_page - just wait until then.
 		 */
 		if (tmp_page) {
-			if (failed == q_failed) {
+			if (s.failed == q_failed) {
 				/* The only possible failed device holds 'Q', so it makes
 				 * sense to check P (If anything else were failed, we would
 				 * have used P to recreate it).
@@ -2922,7 +2926,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 					update_p = 1;
 				}
 			}
-			if (!q_failed && failed < 2) {
+			if (!q_failed && s.failed < 2) {
 				/* q is not failed, and we didn't use it to generate
 				 * anything, so it makes sense to check it
 				 */
@@ -2948,28 +2952,28 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			 * or P or Q if they need it
 			 */
 
-			if (failed == 2) {
-				dev = &sh->dev[failed_num[1]];
-				locked++;
+			if (s.failed == 2) {
+				dev = &sh->dev[s.r6_failed_num[1]];
+				s.locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
 			}
-			if (failed >= 1) {
-				dev = &sh->dev[failed_num[0]];
-				locked++;
+			if (s.failed >= 1) {
+				dev = &sh->dev[s.r6_failed_num[0]];
+				s.locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
 			}
 
 			if (update_p) {
 				dev = &sh->dev[pd_idx];
-				locked ++;
+				s.locked ++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
 			}
 			if (update_q) {
 				dev = &sh->dev[qd_idx];
-				locked++;
+				s.locked++;
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantwrite, &dev->flags);
 			}
@@ -2979,7 +2983,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		}
 	}
 
-	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+	if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
 		md_done_sync(conf->mddev, STRIPE_SECTORS,1);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
@@ -2987,9 +2991,9 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	/* If the failed drives are just a ReadError, then we might need
 	 * to progress the repair/check process
 	 */
-	if (failed <= 2 && ! conf->mddev->ro)
-		for (i=0; i<failed;i++) {
-			dev = &sh->dev[failed_num[i]];
+	if (s.failed <= 2 && ! conf->mddev->ro)
+		for (i=0; i<s.failed;i++) {
+			dev = &sh->dev[s.r6_failed_num[i]];
 			if (test_bit(R5_ReadError, &dev->flags)
 			    && !test_bit(R5_LOCKED, &dev->flags)
 			    && test_bit(R5_UPTODATE, &dev->flags)
@@ -3006,7 +3010,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			}
 		}
 
-	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
+	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
 		sh->disks = conf->raid_disks;
 		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
@@ -3014,18 +3018,18 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			locked++;
+			s.locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
 		}
 		clear_bit(STRIPE_EXPANDING, &sh->state);
-	} else if (expanded) {
+	} else if (s.expanded) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);
 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
-	if (expanding && locked == 0) {
+	if (s.expanding && s.locked == 0) {
 		/* We have read all the blocks in this stripe and now we need to
 		 * copy some of them into a target stripe for expand.
 		 */
@@ -3118,7 +3122,7 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 		rcu_read_unlock();
 
 		if (rdev) {
-			if (syncing || expanding || expanded)
+			if (s.syncing || s.expanding || s.expanded)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 3541d2c..54e2aa2 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -182,6 +182,17 @@ struct stripe_head {
 		unsigned long	flags;
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
+
+struct stripe_head_state {
+	int syncing, expanding, expanded;
+	int locked, uptodate, to_read, to_write, failed, written;
+	int to_fill, compute, req_compute, non_overwrite, dirty;
+	union {
+		int failed_num;
+		int r6_failed_num[2];
+	};
+};
+
 /* Flags */
 #define	R5_UPTODATE	0	/* page contains current data */
 #define	R5_LOCKED	1	/* IO has been submitted on "req" */


* [PATCH RFC 2/4] md: refactor raid5 cache policy code using 'struct stripe_cache_policy'
From: Dan Williams @ 2007-04-11  6:00 UTC
  To: linux-raid

struct stripe_cache_policy is introduced as an interface for supporting
multiple caching policies.  It defines several methods that are called
when cache events occur; see the definition of stripe_cache_policy in
include/linux/raid/raid5.h.  This patch does not add any new caching
policies; it moves the current write-through code to a new location and
invokes it through struct stripe_cache_policy methods.
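
For reference, the shape of the new interface as reconstructed from the
call sites and the write-through implementations in the diff below (the
raid5.h hunk is not shown in this excerpt, so treat this as a sketch
rather than the authoritative definition):

	struct stripe_cache_policy {
		/* returns non-zero if the stripe was queued on a
		 * policy-private list */
		int (*release_stripe)(raid5_conf_t *conf,
				      struct stripe_head *sh, int handle);
		/* completion callback for the parity (postxor) operation */
		void (*complete_postxor_action)(void *stripe_head_ref);
		void (*submit_pending_writes)(struct stripe_head *sh,
					      struct stripe_head_state *s);
		void (*handle_new_writes)(struct stripe_head *sh,
					  struct stripe_head_state *s);
		struct bio *(*handle_completed_writes)(struct stripe_head *sh,
						       struct stripe_head_state *s);
		void (*raid5d)(mddev_t *mddev, raid5_conf_t *conf);
		void (*init)(raid5_conf_t *conf);
		void (*unplug_device)(raid5_conf_t *conf);

		/* write-through policy private state */
		atomic_t preread_active_stripes;
		struct list_head delayed_list;
	};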

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/raid5.c         |  644 +++++++++++++++++++++++++-------------------
 include/linux/raid/raid5.h |   82 +++++-
 2 files changed, 446 insertions(+), 280 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 684552a..3b32a19 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -112,11 +112,12 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 	if (atomic_dec_and_test(&sh->count)) {
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
+		if (conf->cache_policy->release_stripe(conf, sh,
+						test_bit(STRIPE_HANDLE, &sh->state)))
+			return; /* stripe was moved to a cache policy specific queue */
+
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->delayed_list);
-				blk_plug_device(conf->mddev->queue);
-			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 				blk_plug_device(conf->mddev->queue);
@@ -125,23 +126,11 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 				list_add_tail(&sh->lru, &conf->handle_list);
 			}
 			md_wakeup_thread(conf->mddev->thread);
-		} else {
-			BUG_ON(sh->ops.pending);
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-				atomic_dec(&conf->preread_active_stripes);
-				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			}
-			atomic_dec(&conf->active_stripes);
-			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->inactive_list);
-				wake_up(&conf->wait_for_stripe);
-				if (conf->retry_read_aligned)
-					md_wakeup_thread(conf->mddev->thread);
-			}
-		}
+		} else
+			BUG();
 	}
 }
+
 static void release_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
@@ -724,39 +713,6 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	return tx;
 }
 
-static void ops_complete_postxor(void *stripe_head_ref)
-{
-	struct stripe_head *sh = stripe_head_ref;
-
-	PRINTK("%s: stripe %llu\n", __FUNCTION__,
-		(unsigned long long)sh->sector);
-
-	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
-	set_bit(STRIPE_HANDLE, &sh->state);
-	release_stripe(sh);
-}
-
-static void ops_complete_write(void *stripe_head_ref)
-{
-	struct stripe_head *sh = stripe_head_ref;
-	int disks = sh->disks, i, pd_idx = sh->pd_idx;
-
-	PRINTK("%s: stripe %llu\n", __FUNCTION__,
-		(unsigned long long)sh->sector);
-
-	for (i=disks ; i-- ;) {
-		struct r5dev *dev = &sh->dev[i];
-		if (dev->written || i == pd_idx)
-			set_bit(R5_UPTODATE, &dev->flags);
-	}
-
-	set_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
-	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
-
-	set_bit(STRIPE_HANDLE, &sh->state);
-	release_stripe(sh);
-}
-
 static void
 ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
@@ -764,6 +720,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	int disks = sh->disks;
 	struct page *xor_srcs[disks];
 
+	raid5_conf_t *conf = sh->raid_conf;
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
@@ -792,9 +749,8 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 		}
 	}
 
-	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
-		ops_complete_write : ops_complete_postxor;
+	/* take cache policy specific action upon completion of the postxor */
+	callback = conf->cache_policy->complete_postxor_action;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
 	 * 2/ if we did not prexor then we are redoing the parity
@@ -1683,7 +1639,8 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 	}
 }
 
-static int handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
+static int 
+raid5_wt_cache_handle_parity_updates(struct stripe_head *sh, int rcw, int expand)
 {
 	int i, pd_idx = sh->pd_idx, disks = sh->disks;
 	int locked=0;
@@ -1847,6 +1804,327 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 	return pd_idx;
 }
 
+static int
+raid5_wt_cache_release_stripe(raid5_conf_t *conf, struct stripe_head *sh,
+	int handle)
+{
+	struct stripe_cache_policy *cp = conf->cache_policy;
+
+	PRINTK("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	if (handle && test_bit(STRIPE_DELAYED, &sh->state)) {
+		list_add_tail(&sh->lru, &cp->delayed_list);
+		blk_plug_device(conf->mddev->queue);
+		return 1;
+	} else if (!handle) {
+		BUG_ON(sh->ops.pending);
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+			atomic_dec(&cp->preread_active_stripes);
+			if (atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD)
+				md_wakeup_thread(conf->mddev->thread);
+		}
+		atomic_dec(&conf->active_stripes);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			list_add_tail(&sh->lru, &conf->inactive_list);
+			wake_up(&conf->wait_for_stripe);
+			if (conf->retry_read_aligned)
+				md_wakeup_thread(conf->mddev->thread);
+		}
+		return 1;
+	}
+
+	return 0;
+}
+
+static void raid5_wt_cache_complete_postxor_action(void *stripe_head_ref)
+{
+	struct stripe_head *sh = stripe_head_ref;
+
+	PRINTK("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
+	/* leaving prexor set until postxor is done allows us to distinguish
+	 * a rmw from a rcw during biodrain
+	 */
+	if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete)) {
+		int i;
+		for (i=sh->disks; i--;)
+			clear_bit(R5_Wantprexor, &sh->dev[i].flags);
+
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	}
+
+	if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+		int disks = sh->disks, i, pd_idx = sh->pd_idx;
+
+		for (i=disks ; i-- ;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (dev->written || i == pd_idx)
+				set_bit(R5_UPTODATE, &dev->flags);
+		}
+
+		set_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
+	}
+
+	set_bit(STRIPE_HANDLE, &sh->state);
+	release_stripe(sh);
+}
+
+static struct bio *
+raid5_wt_cache_handle_completed_writes(struct stripe_head *sh,
+	struct stripe_head_state *s)
+{
+	struct bio *return_bi = NULL;
+
+	/* might be able to return some write requests if the parity block
+	 * is safe, or on a failed drive
+	 */
+	struct r5dev *dev = &sh->dev[sh->pd_idx];
+	if ( s->written &&
+	     ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
+		test_bit(R5_UPTODATE, &dev->flags))
+	       || (s->failed == 1 && s->failed_num == sh->pd_idx))
+	    ) {
+	    raid5_conf_t *conf = sh->raid_conf;
+	    int i;
+	    /* any written block on an uptodate or failed drive can be returned.
+	     * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but 
+	     * never LOCKED, so we don't need to test 'failed' directly.
+	     */
+	    for (i=sh->disks; i--; )
+		if (sh->dev[i].written) {
+		    dev = &sh->dev[i];
+		    if (!test_bit(R5_LOCKED, &dev->flags) &&
+			 test_bit(R5_UPTODATE, &dev->flags) ) {
+			/* We can return any write requests */
+			    struct bio *wbi, *wbi2;
+			    int bitmap_end = 0;
+			    PRINTK("%s: Return write for disc %d\n",
+			    	__FUNCTION__, i);
+			    spin_lock_irq(&conf->device_lock);
+			    wbi = dev->written;
+			    dev->written = NULL;
+			    while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+				    wbi2 = r5_next_bio(wbi, dev->sector);
+				    if (--wbi->bi_phys_segments == 0) {
+					    md_write_end(conf->mddev);
+					    wbi->bi_next = return_bi;
+					    return_bi = wbi;
+				    }
+				    wbi = wbi2;
+			    }
+			    if (dev->towrite == NULL)
+				    bitmap_end = 1;
+			    spin_unlock_irq(&conf->device_lock);
+			    if (bitmap_end)
+				    bitmap_endwrite(conf->mddev->bitmap, sh->sector,
+						    STRIPE_SECTORS,
+						    !test_bit(STRIPE_DEGRADED, &sh->state), 0);
+		    }
+		}
+	}
+
+	return return_bi;
+}
+
+static void
+raid5_wt_cache_submit_pending_writes(struct stripe_head *sh,
+	struct stripe_head_state *s)
+{
+	/* if only POSTXOR is set then this is an 'expand' postxor */
+	if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) &&
+		test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+		raid5_conf_t *conf = sh->raid_conf;
+		struct stripe_cache_policy *cp = conf->cache_policy;
+		int i;
+
+		PRINTK("%s: stripe %llu\n", __FUNCTION__,
+			(unsigned long long)sh->sector);
+
+		/* All the 'written' buffers and the parity block are ready to be
+		 * written back to disk
+		 */
+		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+		for (i=sh->disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_LOCKED, &dev->flags) &&
+				(i == sh->pd_idx || dev->written)) {
+				PRINTK("Writing block %d\n", i);
+				set_bit(R5_Wantwrite, &dev->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+					sh->ops.count++;
+				if (!test_bit(R5_Insync, &dev->flags)
+				    || (i==sh->pd_idx && s->failed == 0))
+					set_bit(STRIPE_INSYNC, &sh->state);
+			}
+		}
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
+			atomic_dec(&cp->preread_active_stripes);
+			if (atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD)
+				md_wakeup_thread(conf->mddev->thread);
+		}
+
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+	}
+
+}
+
+static void
+raid5_wt_cache_handle_new_writes(struct stripe_head *sh, struct stripe_head_state *s)
+{
+	/* 1/ Check operations clobber the parity block so do not start new writes while
+	 *    a check is in flight
+	 * 2/ Write operations do not stack
+	 */
+	if (s->to_write && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending) &&
+		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+		int rmw=0, rcw=0, disks = sh->disks, i;
+		struct r5dev *dev;
+		for (i=disks ; i--;) {
+			/* would I have to read this buffer for read_modify_write */
+			dev = &sh->dev[i];
+			if ((dev->towrite || i == sh->pd_idx) &&
+			    (!test_bit(R5_LOCKED, &dev->flags) 
+				    ) &&
+			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) {
+				if (test_bit(R5_Insync, &dev->flags)
+/*				    && !(!mddev->insync && i == sh->pd_idx) */
+					)
+					rmw++;
+				else rmw += 2*disks;  /* cannot read it */
+			}
+			/* Would I have to read this buffer for reconstruct_write */
+			if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+			    (!test_bit(R5_LOCKED, &dev->flags) 
+				    ) &&
+			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) {
+				if (test_bit(R5_Insync, &dev->flags)) rcw++;
+				else rcw += 2*disks;
+			}
+		}
+		PRINTK("for sector %llu, rmw=%d rcw=%d\n", 
+			(unsigned long long)sh->sector, rmw, rcw);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		if (rmw < rcw && rmw > 0)
+			/* prefer read-modify-write, but need to get some data */
+			for (i=disks; i--;) {
+				dev = &sh->dev[i];
+				if ((dev->towrite || i == sh->pd_idx) &&
+				    !test_bit(R5_LOCKED, &dev->flags) &&
+				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) &&
+				    test_bit(R5_Insync, &dev->flags)) {
+					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+					{
+						PRINTK("Read_old block %d for r-m-w\n", i);
+						set_bit(R5_LOCKED, &dev->flags);
+						set_bit(R5_Wantread, &dev->flags);
+						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+							sh->ops.count++;
+						s->locked++;
+					} else {
+						set_bit(STRIPE_DELAYED, &sh->state);
+						set_bit(STRIPE_HANDLE, &sh->state);
+					}
+				}
+			}
+		if (rcw <= rmw && rcw > 0)
+			/* want reconstruct write, but need to get some data */
+			for (i=disks; i--;) {
+				dev = &sh->dev[i];
+				if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+				    !test_bit(R5_LOCKED, &dev->flags) &&
+				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) &&
+				    test_bit(R5_Insync, &dev->flags)) {
+					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+					{
+						PRINTK("Read_old block %d for Reconstruct\n", i);
+						set_bit(R5_LOCKED, &dev->flags);
+						set_bit(R5_Wantread, &dev->flags);
+						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+							sh->ops.count++;
+						s->locked++;
+					} else {
+						set_bit(STRIPE_DELAYED, &sh->state);
+						set_bit(STRIPE_HANDLE, &sh->state);
+					}
+				}
+			}
+		/* now if nothing is locked, and if we have enough data, we can start a write request */
+		/* since handle_stripe can be called at any time we need to handle the case
+		 * where a compute block operation has been submitted and then a subsequent
+		 * call wants to start a write request.  raid5_run_ops only handles the case where
+		 * compute block and postxor are requested simultaneously.  If this
+		 * is not the case then new writes need to be held off until the compute
+		 * completes.
+		 */
+		if ((s->req_compute || !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) &&
+			(s->locked == 0 && (rcw == 0 ||rmw == 0) &&
+			!test_bit(STRIPE_BIT_DELAY, &sh->state)))
+			s->locked += raid5_wt_cache_handle_parity_updates(sh, rcw == 0, 0);
+			
+	}
+}
+
+static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
+{
+	struct stripe_cache_policy *cp = conf->cache_policy;
+	if (atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD) {
+		while (!list_empty(&cp->delayed_list)) {
+			struct list_head *l = cp->delayed_list.next;
+			struct stripe_head *sh;
+			sh = list_entry(l, struct stripe_head, lru);
+			list_del_init(l);
+			clear_bit(STRIPE_DELAYED, &sh->state);
+			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+				atomic_inc(&cp->preread_active_stripes);
+			list_add_tail(&sh->lru, &conf->handle_list);
+		}
+	}
+}
+
+static void raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
+{
+	struct stripe_cache_policy *cp = conf->cache_policy;
+
+	if (list_empty(&conf->handle_list) &&
+	    atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD &&
+	    !blk_queue_plugged(mddev->queue) &&
+	    !list_empty(&cp->delayed_list))
+		raid5_wt_cache_activate_delayed(conf);
+}
+
+static void raid5_wt_cache_init(raid5_conf_t *conf)
+{
+	atomic_set(&conf->cache_policy->preread_active_stripes, 0);
+	INIT_LIST_HEAD(&conf->cache_policy->delayed_list);
+}
+
+static void raid5_wt_cache_unplug_device(raid5_conf_t *conf)
+{
+	raid5_wt_cache_activate_delayed(conf);
+}
+
+static struct stripe_cache_policy raid5_cache_policy_write_through = {
+	.release_stripe = raid5_wt_cache_release_stripe,
+	.complete_postxor_action = raid5_wt_cache_complete_postxor_action,
+	.submit_pending_writes = raid5_wt_cache_submit_pending_writes,
+	.handle_new_writes = raid5_wt_cache_handle_new_writes,
+	.handle_completed_writes = raid5_wt_cache_handle_completed_writes,
+	.raid5d = raid5_wt_cache_raid5d,
+	.init = raid5_wt_cache_init,
+	.unplug_device = raid5_wt_cache_unplug_device,
+};
 
 /*
  * handle_stripe - do things to a stripe.
@@ -1944,12 +2222,13 @@ static void handle_stripe5(struct stripe_head *sh)
 	}
 	rcu_read_unlock();
 
+	/* do we need to request a biofill operation? */
 	if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 		sh->ops.count++;
 
-	PRINTK("locked=%d uptodate=%d to_read=%d"
+	PRINTK("locked=%d dirty=%d uptodate=%d to_read=%d"
 		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
-		s.locked, s.uptodate, s.to_read, s.to_write, s.to_fill,
+		s.locked, s.dirty, s.uptodate, s.to_read, s.to_write, s.to_fill,
 		s.failed, s.failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 	 * need to be failed
@@ -2035,50 +2314,8 @@ static void handle_stripe5(struct stripe_head *sh)
 		s.syncing = 0;
 	}
 
-	/* might be able to return some write requests if the parity block
-	 * is safe, or on a failed drive
-	 */
-	dev = &sh->dev[sh->pd_idx];
-	if ( s.written &&
-	     ( (test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) &&
-		test_bit(R5_UPTODATE, &dev->flags))
-	       || (s.failed == 1 && s.failed_num == sh->pd_idx))
-	    ) {
-	    /* any written block on an uptodate or failed drive can be returned.
-	     * Note that if we 'wrote' to a failed drive, it will be UPTODATE, but 
-	     * never LOCKED, so we don't need to test 'failed' directly.
-	     */
-	    for (i=disks; i--; )
-		if (sh->dev[i].written) {
-		    dev = &sh->dev[i];
-		    if (!test_bit(R5_LOCKED, &dev->flags) &&
-			 test_bit(R5_UPTODATE, &dev->flags) ) {
-			/* We can return any write requests */
-			    struct bio *wbi, *wbi2;
-			    int bitmap_end = 0;
-			    PRINTK("Return write for disc %d\n", i);
-			    spin_lock_irq(&conf->device_lock);
-			    wbi = dev->written;
-			    dev->written = NULL;
-			    while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				    wbi2 = r5_next_bio(wbi, dev->sector);
-				    if (--wbi->bi_phys_segments == 0) {
-					    md_write_end(conf->mddev);
-					    wbi->bi_next = return_bi;
-					    return_bi = wbi;
-				    }
-				    wbi = wbi2;
-			    }
-			    if (dev->towrite == NULL)
-				    bitmap_end = 1;
-			    spin_unlock_irq(&conf->device_lock);
-			    if (bitmap_end)
-				    bitmap_endwrite(conf->mddev->bitmap, sh->sector,
-						    STRIPE_SECTORS,
-						    !test_bit(STRIPE_DEGRADED, &sh->state), 0);
-		    }
-		}
-	}
+	/* handle the completion of writes to the backing disks */
+	return_bi = conf->cache_policy->handle_completed_writes(sh, &s);
 
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
@@ -2135,7 +2372,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					 * 3/ We hold off parity block re-reads until check
 					 * operations have quiesced.
 					 */
-					if ((s.uptodate == disks-1) && !test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+					if (((s.uptodate == disks-1) && !s.dirty) &&
+						!test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
 						set_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
 						set_bit(R5_Wantcompute, &dev->flags);
 						sh->ops.target = i;
@@ -2148,7 +2386,8 @@ static void handle_stripe5(struct stripe_head *sh)
 						 */
 						s.uptodate++;
 						break; /* uptodate + compute == disks */
-					} else if ((s.uptodate < disks-1) && test_bit(R5_Insync, &dev->flags)) {
+					} else if (((s.uptodate < disks-1) || s.dirty) &&
+							test_bit(R5_Insync, &dev->flags)) {
 						/* Note: we hold off compute operations while checks are in flight,
 						 * but we still prefer 'compute' over 'read' hence we only read if
 						 * (uptodate < disks-1)
@@ -2167,158 +2406,20 @@ static void handle_stripe5(struct stripe_head *sh)
 		set_bit(STRIPE_HANDLE, &sh->state);
 	}
 
-	/* Now we check to see if any write operations have recently
-	 * completed
-	 */
-
-	/* leave prexor set until postxor is done, allows us to distinguish
-	 * a rmw from a rcw during biodrain
-	 */
-	if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) &&
-		test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
-
-		clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
-		clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
-		clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
-
-		for (i=disks; i--;)
-			clear_bit(R5_Wantprexor, &sh->dev[i].flags);
-	}
-
-	/* if only POSTXOR is set then this is an 'expand' postxor */
-	if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) &&
-		test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
-
-		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
-		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
-		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+	/* Now we check to see if any blocks are ready to be written to disk */
+	conf->cache_policy->submit_pending_writes(sh, &s);
 
-		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
-		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
-		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
-
-		/* All the 'written' buffers and the parity block are ready to be
-		 * written back to disk
-		 */
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
-		for (i=disks; i--;) {
-			dev = &sh->dev[i];
-			if (test_bit(R5_LOCKED, &dev->flags) &&
-				(i == sh->pd_idx || dev->written)) {
-				PRINTK("Writing block %d\n", i);
-				set_bit(R5_Wantwrite, &dev->flags);
-				if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-					sh->ops.count++;
-				if (!test_bit(R5_Insync, &dev->flags)
-				    || (i==sh->pd_idx && s.failed == 0))
-					set_bit(STRIPE_INSYNC, &sh->state);
-			}
-		}
-		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-			atomic_dec(&conf->preread_active_stripes);
-			if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
-				md_wakeup_thread(conf->mddev->thread);
-		}
-	}
-
-	/* 1/ Now to consider new write requests and what else, if anything should be read
-	 * 2/ Check operations clobber the parity block so do not start new writes while
-	 *    a check is in flight
-	 * 3/ Write operations do not stack
-	 */
-	if (s.to_write && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending) &&
-		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
-		int rmw=0, rcw=0;
-		for (i=disks ; i--;) {
-			/* would I have to read this buffer for read_modify_write */
-			dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx) &&
-			    (!test_bit(R5_LOCKED, &dev->flags) 
-				    ) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) {
-				if (test_bit(R5_Insync, &dev->flags)
-/*				    && !(!mddev->insync && i == sh->pd_idx) */
-					)
-					rmw++;
-				else rmw += 2*disks;  /* cannot read it */
-			}
-			/* Would I have to read this buffer for reconstruct_write */
-			if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-			    (!test_bit(R5_LOCKED, &dev->flags) 
-				    ) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) {
-				if (test_bit(R5_Insync, &dev->flags)) rcw++;
-				else rcw += 2*disks;
-			}
-		}
-		PRINTK("for sector %llu, rmw=%d rcw=%d\n", 
-			(unsigned long long)sh->sector, rmw, rcw);
-		set_bit(STRIPE_HANDLE, &sh->state);
-		if (rmw < rcw && rmw > 0)
-			/* prefer read-modify-write, but need to get some data */
-			for (i=disks; i--;) {
-				dev = &sh->dev[i];
-				if ((dev->towrite || i == sh->pd_idx) &&
-				    !test_bit(R5_LOCKED, &dev->flags) &&
-				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) &&
-				    test_bit(R5_Insync, &dev->flags)) {
-					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-					{
-						PRINTK("Read_old block %d for r-m-w\n", i);
-						set_bit(R5_LOCKED, &dev->flags);
-						set_bit(R5_Wantread, &dev->flags);
-						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-							sh->ops.count++;
-						s.locked++;
-					} else {
-						set_bit(STRIPE_DELAYED, &sh->state);
-						set_bit(STRIPE_HANDLE, &sh->state);
-					}
-				}
-			}
-		if (rcw <= rmw && rcw > 0)
-			/* want reconstruct write, but need to get some data */
-			for (i=disks; i--;) {
-				dev = &sh->dev[i];
-				if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
-				    !test_bit(R5_LOCKED, &dev->flags) &&
-				    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) &&
-				    test_bit(R5_Insync, &dev->flags)) {
-					if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-					{
-						PRINTK("Read_old block %d for Reconstruct\n", i);
-						set_bit(R5_LOCKED, &dev->flags);
-						set_bit(R5_Wantread, &dev->flags);
-						if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
-							sh->ops.count++;
-						s.locked++;
-					} else {
-						set_bit(STRIPE_DELAYED, &sh->state);
-						set_bit(STRIPE_HANDLE, &sh->state);
-					}
-				}
-			}
-		/* now if nothing is locked, and if we have enough data, we can start a write request */
-		/* since handle_stripe can be called at any time we need to handle the case
-		 * where a compute block operation has been submitted and then a subsequent
-		 * call wants to start a write request.  raid5_run_ops only handles the case where
-		 * compute block and postxor are requested simultaneously.  If this
-		 * is not the case then new writes need to be held off until the compute
-		 * completes.
-		 */
-		if ((s.req_compute || !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) &&
-			(s.locked == 0 && (rcw == 0 ||rmw == 0) &&
-			!test_bit(STRIPE_BIT_DELAY, &sh->state)))
-			s.locked += handle_write_operations5(sh, rcw == 0, 0);
-	}
+	/* Now to consider new write requests and what else, if anything should be read */
+	conf->cache_policy->handle_new_writes(sh, &s);
 
 	/* 1/ Maybe we need to check and possibly fix the parity for this stripe.
 	 *    Any reads will already have been scheduled, so we just see if enough data
 	 *    is available.
 	 * 2/ Hold off parity checks while parity dependent operations are in flight
-	 *    (conflicting writes are protected by the 'locked' variable)
+	 *    (conflicting writes are protected by the 'locked' and 'dirty' variables)
 	 */
-	if ((s.syncing && s.locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+	if ((s.syncing && s.locked == 0 && s.dirty == 0 &&
+		!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
 		!test_bit(STRIPE_INSYNC, &sh->state)) ||
 	    	test_bit(STRIPE_OP_CHECK, &sh->ops.pending) ||
 	    	test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
@@ -2451,7 +2552,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
 		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
-		s.locked += handle_write_operations5(sh, 0, 1);
+		s.locked += raid5_wt_cache_handle_parity_updates(sh, 0, 1);
 	} else if (s.expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
@@ -2885,8 +2986,9 @@ static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 			set_bit(STRIPE_INSYNC, &sh->state);
 
 			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
-				atomic_dec(&conf->preread_active_stripes);
-				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
+				atomic_dec(&conf->cache_policy->preread_active_stripes);
+				if (atomic_read(&conf->cache_policy->preread_active_stripes)
+					< IO_THRESHOLD)
 					md_wakeup_thread(conf->mddev->thread);
 			}
 		}
@@ -3164,22 +3266,6 @@ static void handle_stripe(struct stripe_head *sh, struct page *tmp_page)
 
 
 
-static void raid5_activate_delayed(raid5_conf_t *conf)
-{
-	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
-		while (!list_empty(&conf->delayed_list)) {
-			struct list_head *l = conf->delayed_list.next;
-			struct stripe_head *sh;
-			sh = list_entry(l, struct stripe_head, lru);
-			list_del_init(l);
-			clear_bit(STRIPE_DELAYED, &sh->state);
-			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
-			list_add_tail(&sh->lru, &conf->handle_list);
-		}
-	}
-}
-
 static void activate_bit_delay(raid5_conf_t *conf)
 {
 	/* device_lock is held */
@@ -3222,14 +3308,17 @@ static void raid5_unplug_device(request_queue_t *q)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
+	struct stripe_cache_policy *cp = conf->cache_policy;
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 
 	if (blk_remove_plug(q)) {
 		conf->seq_flush++;
-		raid5_activate_delayed(conf);
+		if (cp->unplug_device)
+			cp->unplug_device(conf);
 	}
+
 	md_wakeup_thread(mddev->thread);
 
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -3944,11 +4033,8 @@ static void raid5d (mddev_t *mddev)
 			activate_bit_delay(conf);
 		}
 
-		if (list_empty(&conf->handle_list) &&
-		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
-		    !blk_queue_plugged(mddev->queue) &&
-		    !list_empty(&conf->delayed_list))
-			raid5_activate_delayed(conf);
+		if (conf->cache_policy->raid5d)
+			conf->cache_policy->raid5d(mddev, conf);
 
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
@@ -4150,16 +4236,22 @@ static int run(mddev_t *mddev)
 		if (!conf->spare_page)
 			goto abort;
 	}
+
+	#ifdef CONFIG_RAID5_CACHE_POLICY_WRITE_BACK
+	conf->cache_policy = &raid5_cache_policy_write_back;
+	#else
+	conf->cache_policy = &raid5_cache_policy_write_through;
+	#endif
+	
 	spin_lock_init(&conf->device_lock);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
-	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
 	INIT_LIST_HEAD(&conf->inactive_list);
 	atomic_set(&conf->active_stripes, 0);
-	atomic_set(&conf->preread_active_stripes, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
+	conf->cache_policy->init(conf);
 
 	PRINTK("raid5: run(%s) called.\n", mdname(mddev));
 
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 54e2aa2..f00da23 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -224,8 +224,8 @@ struct stripe_head_state {
 #define STRIPE_HANDLE		2
 #define	STRIPE_SYNCING		3
 #define	STRIPE_INSYNC		4
-#define	STRIPE_PREREAD_ACTIVE	5
-#define	STRIPE_DELAYED		6
+#define	STRIPE_PREREAD_ACTIVE	5 /* wt cache state */
+#define	STRIPE_DELAYED		6 /* wt cache state */
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
 #define	STRIPE_EXPANDING	9
@@ -276,6 +276,81 @@ struct disk_info {
 	mdk_rdev_t	*rdev;
 };
 
+/**
+ * struct stripe_cache_policy - handle writethrough/writeback caching
+ * @complete_biodrain_action:
+ *  wb: allows writes to be signalled complete once
+ *      they are in the stripe cache
+ *  wt: NULL
+ * @release_stripe:
+ *  wb: transition inactive stripes with pending data to a dirty list
+ *      rather than the inactive list
+ *  wt: handle delayed stripes and issue pre-read actions.
+ * @submit_pending_writes:
+ *  wb: only write back when STRIPE_EVICT is set
+ *  wt: always write through after postxor completes
+ */
+
+/* wt = write through
+ * wb = write back
+ */
+struct stripe_cache_policy {
+	/* release_stripe - returns '1' if stripe was moved to cache-private list
+	 *  else '0'
+	 * [ called from __release_stripe under spin_lock_irq(&conf->device_lock) ]
+	 * wt: catch 'delayed' stripes and poke the 'preread' state machine
+	 * if necessary
+	 */
+	int (*release_stripe)(struct raid5_private_data *conf,
+		struct stripe_head *sh,	int handle);
+	/* complete_postxor_action
+	 * wt: check if this is the end of a rcw/rmw write request and set
+	 * the state bits accordingly.  set 'handle' and release.
+	 */
+	void (*complete_postxor_action)(void *stripe_head_ref);
+	/* submit_pending_writes
+	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
+	 * wt: check if 'biodrain' and 'postxor' are complete and schedule writes
+	 * to the backing disks
+	 */
+	void (*submit_pending_writes)(struct stripe_head *sh,
+		struct stripe_head_state *s);
+	/* handle_new_writes
+	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
+	 * wt: schedule reads to prepare for a rcw or rmw operation.  once preread
+	 * data is available lock the blocks and schedule '[prexor]+biodrain+postxor'
+	 */
+	void (*handle_new_writes)(struct stripe_head *sh,
+		struct stripe_head_state *s);
+	/* handle_completed_writes
+	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
+	 * wt: call bi_end_io on all written blocks and perform general md/bitmap
+	 * post write housekeeping.
+	 */
+	struct bio *(*handle_completed_writes)(struct stripe_head *sh,
+		struct stripe_head_state *s);
+	/* raid5d
+	 * wt: check for stripes that can be taken off the delayed list
+	 */
+	void (*raid5d)(mddev_t *mddev, struct raid5_private_data *conf);
+	/* init
+	 * wt: initialize 'delayed_list' and 'preread_active_stripes'
+	 * wb: initialize 'dirty_list' and 'dirty_stripes'
+	 */
+	void (*init)(struct raid5_private_data *conf);
+	/* unplug_device
+	 * [ called from raid5_unplug_device under spin_lock_irqsave(&conf->device_lock) ]
+	 * wt: activate stripes on the delayed list
+	 */
+	void (*unplug_device)(struct raid5_private_data *conf);
+	union {
+		struct list_head delayed_list; /* wt: stripes that have plugged requests */
+	};
+	union {
+		atomic_t preread_active_stripes;
+	};
+};
+
 struct raid5_private_data {
 	struct hlist_head	*stripe_hashtbl;
 	mddev_t			*mddev;
@@ -284,6 +359,7 @@ struct raid5_private_data {
 	int			max_degraded;
 	int			raid_disks;
 	int			max_nr_stripes;
+	struct stripe_cache_policy *cache_policy;
 
 	/* used during an expand */
 	sector_t		expand_progress;	/* MaxSector when no expand happening */
@@ -293,11 +369,9 @@ struct raid5_private_data {
 	int			previous_raid_disks;
 
 	struct list_head	handle_list; /* stripes needing handling */
-	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	struct bio		*retry_read_aligned; /* currently retrying aligned bios   */
 	struct bio		*retry_read_aligned_list; /* aligned bios retry list  */
-	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	atomic_t		active_aligned_reads;
 
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
  2007-04-11  6:00 [PATCH RFC 0/4] raid5: write-back caching policy and write performance Dan Williams
  2007-04-11  6:00 ` [PATCH RFC 1/4] md: introduce struct stripe_head_state Dan Williams
  2007-04-11  6:00 ` [PATCH RFC 2/4] md: refactor raid5 cache policy code using 'struct stripe_cache_policy' Dan Williams
@ 2007-04-11  6:00 ` Dan Williams
  2007-04-11 22:40   ` Mark Hahn
  2007-04-12  5:37   ` Al Boldi
  2007-04-11  6:00 ` [PATCH RFC 4/4] md: delayed stripe activation Dan Williams
  3 siblings, 2 replies; 9+ messages in thread
From: Dan Williams @ 2007-04-11  6:00 UTC (permalink / raw)
  To: linux-raid

In write-through mode bi_end_io is called once writes to the data disk(s)
and the parity disk have completed.

In write-back mode bi_end_io is called immediately after data has been
copied into the stripe cache, which also causes the stripe to be marked
dirty.  The STRIPE_DIRTY state implies that parity will need to be
reconstructed at eviction time.  In other words, the read-modify-write case
implemented for write-through mode is not supported; all writes are
reconstruct-writes.  An eviction brings the backing disks up to date with
data in the cache.  A dirty stripe is set for eviction when a new stripe
needs to be activated and there are no stripes on the inactive list.  All
dirty stripes are evicted when the array is shut down.

In its current implementation write-back mode acknowledges writes before
they have reached non-volatile media.  Unclean shutdowns will result in
filesystem corruption.
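
In handle_stripe terms the policy reduces to roughly the following
(a condensed sketch of the hooks added below; the bare comments stand
in for code that is in the patch, they are not new helpers):

	/* new write: copy the data into the cache and acknowledge it */
	if (s.to_write) {	/* and no parity operations are pending */
		set_bit(R5_DIRTY, &dev->flags);	/* per 'towrite' block */
		set_bit(STRIPE_DIRTY, &sh->state);
		/* biodrain completion calls bi_end_io on the writes */
	}

	/* eviction: bring the backing disks up to date with the cache */
	if (test_bit(STRIPE_EVICT, &sh->state) &&
	    test_bit(STRIPE_DIRTY, &sh->state)) {
		/* preread the non-overwritten blocks (rcw only, no rmw),
		 * run postxor to reconstruct parity, then write the
		 * R5_DIRTY blocks plus parity back to disk
		 */
	}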

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

 drivers/md/Kconfig         |   13 ++
 drivers/md/md.c            |    2 
 drivers/md/raid5.c         |  354 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/raid/md_k.h  |    2 
 include/linux/raid/raid5.h |   31 ++++
 5 files changed, 400 insertions(+), 2 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 79a361e..7ab6c55 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -138,6 +138,19 @@ config MD_RAID456
 
 	  If unsure, say Y.
 
+config RAID5_CACHE_POLICY_WRITE_BACK
+	bool "EXPERIMENTAL: Set the raid cache policy to write-back"
+	default n
+	depends on EXPERIMENTAL && MD_RAID456
+	---help---
+	  Enable this feature if you want to test this experimental
+	  caching policy instead of the default write-through.
+	  Do not enable this on a system with data that you care
+	  about.  Filesystem corruption will occur if an array in
+	  write-back mode is not shut down cleanly.
+
+	  If unsure, say N.
+
 config MD_RAID5_RESHAPE
 	bool "Support adding drives to a raid-5 array"
 	depends on MD_RAID456
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 509171c..b83f434 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3344,6 +3344,8 @@ static int do_md_stop(mddev_t * mddev, int mode)
 			break;
 		case 0: /* disassemble */
 		case 2: /* stop */
+			if (mddev->pers->cache_flush)
+				mddev->pers->cache_flush(mddev);
 			bitmap_flush(mddev);
 			md_super_wait(mddev);
 			if (mddev->ro)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3b32a19..1a2d6b5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -267,6 +267,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 					     int pd_idx, int noblock)
 {
 	struct stripe_head *sh;
+	struct stripe_cache_policy *cp = conf->cache_policy;
 
 	PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector);
 
@@ -280,6 +281,8 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
+			if (!sh && cp->try_to_free_stripe)
+				cp->try_to_free_stripe(conf, 0);
 			if (noblock && sh == NULL)
 				break;
 			if (!sh) {
@@ -299,7 +302,8 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 			if (atomic_read(&sh->count)) {
 			  BUG_ON(!list_empty(&sh->lru));
 			} else {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
+				if (!test_bit(STRIPE_HANDLE, &sh->state) &&
+					!test_bit(STRIPE_EVICT, &sh->state))
 					atomic_inc(&conf->active_stripes);
 				if (list_empty(&sh->lru) &&
 				    !test_bit(STRIPE_EXPANDING, &sh->state))
@@ -668,6 +672,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
+	raid5_conf_t *conf = sh->raid_conf;
+	struct stripe_cache_policy *cp = conf->cache_policy;
 
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
@@ -688,7 +694,8 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 				towrite = 1;
 		} else { /* rcw */
 			if (i!=pd_idx && dev->towrite &&
-				test_bit(R5_LOCKED, &dev->flags))
+				(test_bit(R5_LOCKED, &dev->flags) ||
+				test_bit(R5_DIRTY, &dev->flags)))
 				towrite = 1;
 		}
 
@@ -710,6 +717,9 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 		}
 	}
 
+	if (cp->complete_biodrain_action)
+		tx = cp->complete_biodrain_action(sh, tx);
+
 	return tx;
 }
 
@@ -1805,6 +1815,39 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 }
 
 static int
+raid5_wb_cache_release_stripe(raid5_conf_t *conf, struct stripe_head *sh,
+	int handle)
+{
+	/* EVICT==HANDLE */
+	if (test_bit(STRIPE_EVICT, &sh->state) && !handle) {
+		set_bit(STRIPE_HANDLE, &sh->state);
+		return 0;
+	}
+
+	if (!handle) {
+		BUG_ON(sh->ops.pending);	
+		atomic_dec(&conf->active_stripes);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (test_bit(STRIPE_DIRTY, &sh->state)) {
+				PRINTK("adding stripe %llu to dirty_list\n",
+					(unsigned long long)sh->sector);
+				list_add_tail(&sh->lru, &conf->cache_policy->dirty_list);
+			} else {
+				BUG_ON(test_bit(STRIPE_EVICT, &sh->state));
+				list_add_tail(&sh->lru, &conf->inactive_list);
+				wake_up(&conf->wait_for_stripe);
+			}
+			if (conf->retry_read_aligned)
+				md_wakeup_thread(conf->mddev->thread);
+		}
+		
+		return 1;
+	}
+
+	return 0;
+}
+
+static int
 raid5_wt_cache_release_stripe(raid5_conf_t *conf, struct stripe_head *sh,
 	int handle)
 {
@@ -1875,6 +1918,19 @@ static void raid5_wt_cache_complete_postxor_action(void *stripe_head_ref)
 	release_stripe(sh);
 }
 
+static void raid5_wb_cache_complete_postxor_action(void *stripe_head_ref)
+{
+	struct stripe_head *sh = stripe_head_ref;
+
+	PRINTK("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	set_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+	set_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+	set_bit(STRIPE_HANDLE, &sh->state);
+	release_stripe(sh);
+}
+
 static struct bio *
 raid5_wt_cache_handle_completed_writes(struct stripe_head *sh,
 	struct stripe_head_state *s)
@@ -1932,6 +1988,34 @@ raid5_wt_cache_handle_completed_writes(struct stripe_head *sh,
 	return return_bi;
 }
 
+static struct bio *
+raid5_wb_cache_handle_completed_writes(struct stripe_head *sh,
+	struct stripe_head_state *s)
+{
+	int i;
+	raid5_conf_t *conf = sh->raid_conf;
+
+	/* the stripe is consistent with the disks when STRIPE_EVICT is
+	 * sampled as set and there are no locked or dirty blocks
+	 */
+	if (test_bit(STRIPE_EVICT, &sh->state) &&
+		s->locked == 0 && s->dirty == 0) {
+
+		PRINTK("%s: stripe %llu\n", __FUNCTION__,
+			(unsigned long long)sh->sector);
+
+		for (i = sh->write_requests_pending; i--; )
+			md_write_end(conf->mddev);
+		bitmap_endwrite(conf->mddev->bitmap, sh->sector, STRIPE_SECTORS,
+			!test_bit(STRIPE_DEGRADED, &sh->state), 0);
+		clear_bit(STRIPE_EVICT, &sh->state);
+		atomic_dec(&conf->cache_policy->evict_active_stripes);
+	}
+
+	return NULL;
+}
+
+
 static void
 raid5_wt_cache_submit_pending_writes(struct stripe_head *sh,
 	struct stripe_head_state *s)
@@ -2115,7 +2199,230 @@ static void raid5_wt_cache_unplug_device(raid5_conf_t *conf)
 	raid5_wt_cache_activate_delayed(conf);
 }
 
+static void raid5_wb_cache_init(raid5_conf_t *conf)
+{
+	atomic_set(&conf->cache_policy->evict_active_stripes, 0);
+	INIT_LIST_HEAD(&conf->cache_policy->dirty_list);
+}
+
+static void
+raid5_wb_cache_submit_pending_writes(struct stripe_head *sh,
+	struct stripe_head_state *s)
+{
+	int pd_idx = sh->pd_idx;
+
+	if (test_bit(STRIPE_EVICT, &sh->state) &&
+		s->dirty && test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+		int i;
+
+		PRINTK("%s: stripe %llu\n", __FUNCTION__,
+			(unsigned long long)sh->sector);
+
+		/* All the 'dirty' buffers and the parity block are ready to be
+		 * written back to disk
+		 */
+		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
+		for (i=sh->disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			/* transition write-back 'dirty' blocks to
+			 * write-through 'dirty' blocks
+			 */
+			if (test_bit(R5_LOCKED, &dev->flags) &&
+				(i == pd_idx || test_bit(R5_DIRTY, &dev->flags))) {
+				PRINTK("Writing block %d\n", i);
+				set_bit(R5_Wantwrite, &dev->flags);
+				if (test_and_clear_bit(R5_DIRTY, &dev->flags))
+					s->dirty--;
+				if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+					sh->ops.count++;
+				if (!test_bit(R5_Insync, &dev->flags) ||
+					(i==pd_idx && s->failed == 0))
+					set_bit(STRIPE_INSYNC, &sh->state);
+			}
+		}
+
+		BUG_ON(s->dirty);
+		clear_bit(STRIPE_DIRTY, &sh->state);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+	}
+
+}
+
+static void
+raid5_wb_cache_handle_new_writes(struct stripe_head *sh, struct stripe_head_state *s)
+{
+	int i, disks = sh->disks;
+	int pd_idx = sh->pd_idx;
+	struct r5dev *dev;
+
+	/* allow new data into the cache once dependent operations are clear */
+	if (s->to_write && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+		!test_bit(STRIPE_OP_CHECK, &sh->ops.pending) &&
+		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+
+		PRINTK("%s: stripe %llu schedule biodrain\n", __FUNCTION__,
+			(unsigned long long)sh->sector);
+
+		for (i=disks; i--;) {
+			dev = &sh->dev[i];
+			if (dev->towrite && !test_bit(R5_LOCKED, &dev->flags)) {
+				set_bit(R5_DIRTY, &dev->flags);
+				s->dirty++;
+				BUG_ON(!test_bit(R5_UPTODATE, &dev->flags) &&
+					!test_bit(R5_OVERWRITE, &dev->flags));
+			}
+		}
+
+		clear_bit(STRIPE_INSYNC, &sh->state);
+		set_bit(STRIPE_DIRTY, &sh->state);
+		set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+		sh->ops.count++;
+	}
+
+	/* check if we need to preread data to satisfy an eviction */
+	if (!s->to_write && test_bit(STRIPE_EVICT, &sh->state) &&
+		test_bit(STRIPE_DIRTY, &sh->state))
+		for (i=disks; i--;) {
+			dev = &sh->dev[i];
+			if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+			    !test_bit(R5_LOCKED, &dev->flags) &&
+			    !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) &&
+			    test_bit(R5_Insync, &dev->flags)) {
+				PRINTK("Read_old block %d for eviction\n", i);
+				set_bit(R5_LOCKED, &dev->flags);
+				s->locked++;
+				set_bit(R5_Wantread, &dev->flags);
+				if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+					sh->ops.count++;
+			}
+		}
+
+	/* now if nothing is locked we can start a stripe-clean write request */
+	if (s->locked == 0 && !s->to_write &&
+		test_bit(STRIPE_EVICT, &sh->state) &&
+		test_bit(STRIPE_DIRTY, &sh->state) &&
+		!test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+		for (i=disks ; i-- ;) {
+			dev = &sh->dev[i];
+			/* only the dirty blocks and parity will be written back */
+			if (test_bit(R5_DIRTY, &dev->flags) || i == pd_idx) {
+				set_bit(R5_LOCKED, &sh->dev[i].flags);
+				s->locked++;
+			}
+		}
+
+		set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+		sh->ops.count++;
+	}
+}
+
+static void raid5_wb_cache_complete_biodrain(void *stripe_head_ref)
+{
+	struct stripe_head *sh = stripe_head_ref;
+	struct bio *return_bi = NULL, *bi;
+	int written = 0, i;
+	raid5_conf_t *conf = sh->raid_conf;
+
+	PRINTK("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	/* clear complete biodrain operations */
+	for (i=sh->disks; i--; )
+		if (sh->dev[i].written) {
+			struct r5dev *dev = &sh->dev[i];
+			struct bio *wbi, *wbi2;
+			written++;
+			PRINTK("%s: Return write for disc %d\n",
+				__FUNCTION__, i);
+			spin_lock_irq(&conf->device_lock);
+			set_bit(R5_UPTODATE, &dev->flags);
+			wbi = dev->written;
+			dev->written = NULL;
+			while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
+				wbi2 = r5_next_bio(wbi, dev->sector);
+				if (--wbi->bi_phys_segments == 0) {
+					sh->write_requests_pending++;
+					wbi->bi_next = return_bi;
+					return_bi = wbi;
+				}
+				wbi = wbi2;
+			}
+			spin_unlock_irq(&conf->device_lock);
+		}
+
+	if (likely(written)) {
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+		/* no need to clear 'complete' it was never set */
+	} else
+		BUG();
+
+	while ((bi=return_bi)) {
+		int bytes = bi->bi_size;
+
+		return_bi = bi->bi_next;
+		bi->bi_next = NULL;
+		bi->bi_size = 0;
+		bi->bi_end_io(bi, bytes,
+			      test_bit(BIO_UPTODATE, &bi->bi_flags)
+			        ? 0 : -EIO);
+	}
+
+	release_stripe(sh);
+}
+
+static struct dma_async_tx_descriptor *
+raid5_wb_cache_complete_biodrain_action(struct stripe_head *sh,
+	struct dma_async_tx_descriptor *tx)
+{
+	PRINTK("%s: stripe %llu\n", __FUNCTION__,
+		(unsigned long long)sh->sector);
+
+	atomic_inc(&sh->count);
+	tx = async_trigger_callback(ASYNC_TX_DEP_ACK | ASYNC_TX_ACK, tx,
+		raid5_wb_cache_complete_biodrain, sh);
+	return tx;
+}
+
+static struct stripe_head *
+raid5_wb_cache_try_to_free_stripe(raid5_conf_t *conf, int flush)
+{
+	struct stripe_head *sh = NULL;
+	struct list_head *first;
+	struct stripe_cache_policy *cp = conf->cache_policy;
+
+	CHECK_DEVLOCK();
+	if (list_empty(&cp->dirty_list))
+		goto out;
+
+	/* if we are not flushing, only evict one stripe at a time
+	 * and plug the device to wait for more writers
+	 */
+	if (!flush && atomic_read(&cp->evict_active_stripes)) {
+		blk_plug_device(conf->mddev->queue);
+		goto out;
+	}
+
+	first = cp->dirty_list.next;
+	sh = list_entry(first, struct stripe_head, lru);
+	list_del_init(first);
+	atomic_inc(&conf->active_stripes);
+	set_bit(STRIPE_EVICT, &sh->state);
+	atomic_inc(&cp->evict_active_stripes);
+	set_bit(STRIPE_HANDLE, &sh->state);
+	atomic_inc(&sh->count);
+	BUG_ON(atomic_read(&sh->count)!= 1);
+	__release_stripe(conf, sh);
+	PRINTK("stripe %llu queued for eviction\n",
+		(unsigned long long)sh->sector);
+out:
+	return sh;
+}
+
 static struct stripe_cache_policy raid5_cache_policy_write_through = {
+	.complete_biodrain_action = NULL,
 	.release_stripe = raid5_wt_cache_release_stripe,
 	.complete_postxor_action = raid5_wt_cache_complete_postxor_action,
 	.submit_pending_writes = raid5_wt_cache_submit_pending_writes,
@@ -2124,6 +2431,21 @@ static struct stripe_cache_policy raid5_cache_policy_write_through = {
 	.raid5d = raid5_wt_cache_raid5d,
 	.init = raid5_wt_cache_init,
 	.unplug_device = raid5_wt_cache_unplug_device,
+	.try_to_free_stripe = NULL,
+};
+
+static struct stripe_cache_policy raid5_cache_policy_write_back = {
+	.complete_biodrain_action = raid5_wb_cache_complete_biodrain_action,
+	.release_stripe = raid5_wb_cache_release_stripe,
+	.complete_postxor_action = raid5_wb_cache_complete_postxor_action,
+	.submit_pending_writes = raid5_wb_cache_submit_pending_writes,
+	.handle_new_writes = raid5_wb_cache_handle_new_writes,
+	.handle_completed_writes = raid5_wb_cache_handle_completed_writes,
+	.raid5d = NULL,
+	.init = raid5_wb_cache_init,
+	.unplug_device = NULL,
+	.try_to_free_stripe = raid5_wb_cache_try_to_free_stripe,
+
 };
 
 /*
@@ -2193,6 +2515,7 @@ static void handle_stripe5(struct stripe_head *sh)
 		/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
+		if (test_bit(R5_DIRTY, &dev->flags)) s.dirty++;
 
 		if (test_bit(R5_Wantfill, &dev->flags))
 			s.to_fill++;
@@ -2226,6 +2549,12 @@ static void handle_stripe5(struct stripe_head *sh)
 	if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 		sh->ops.count++;
 
+	/* do we need to evict the stripe for writeback caching? */
+	if (s.dirty && s.syncing) {
+		set_bit(STRIPE_EVICT, &sh->state);
+		set_bit(STRIPE_HANDLE, &sh->state);
+	}
+		
 	PRINTK("locked=%d dirty=%d uptodate=%d to_read=%d"
 		" to_write=%d to_fill=%d failed=%d failed_num=%d\n",
 		s.locked, s.dirty, s.uptodate, s.to_read, s.to_write, s.to_fill,
@@ -4801,6 +5130,24 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	}
 }
 
+static void raid5_cache_flush(mddev_t *mddev)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	struct stripe_cache_policy *cp = conf->cache_policy;
+	struct stripe_head *sh;
+	unsigned long flags;
+
+	if (cp->try_to_free_stripe) {
+		spin_lock_irqsave(&conf->device_lock, flags);
+		do {
+			sh = cp->try_to_free_stripe(conf, 1);
+		} while (sh != NULL);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+		raid5_quiesce(mddev, 1);
+		raid5_quiesce(mddev, 0);
+	}
+}
+
 static struct mdk_personality raid6_personality =
 {
 	.name		= "raid6",
@@ -4821,6 +5168,7 @@ static struct mdk_personality raid6_personality =
 	.start_reshape  = raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
+	.cache_flush	= raid5_cache_flush
 };
 static struct mdk_personality raid5_personality =
 {
@@ -4842,6 +5190,7 @@ static struct mdk_personality raid5_personality =
 	.start_reshape  = raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
+	.cache_flush	= raid5_cache_flush
 };
 
 static struct mdk_personality raid4_personality =
@@ -4864,6 +5213,7 @@ static struct mdk_personality raid4_personality =
 	.start_reshape  = raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
+	.cache_flush	= raid5_cache_flush
 };
 
 static int __init raid5_init(void)
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index de72c49..5455de9 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -287,6 +287,8 @@ struct mdk_personality
 	 * others - reserved
 	 */
 	void (*quiesce) (mddev_t *mddev, int state);
+	/* notifies a writeback cache to dump its dirty blocks */
+	void (*cache_flush)(mddev_t *mddev);
 };
 
 
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index f00da23..560d460 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -165,6 +165,7 @@ struct stripe_head {
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
 	int			disks;			/* disks in stripe */
+	int			write_requests_pending;
 	struct stripe_operations {
 		unsigned long	   pending;  /* pending operations (set for request->issue->complete) */
 		unsigned long	   ack;	     /* submitted operations (set for issue->complete */
@@ -209,6 +210,7 @@ struct stripe_head_state {
 #define	R5_Wantcompute	11	/* compute_block in progress treat as uptodate */
 #define	R5_Wantfill	12	/* dev->toread contains a bio that needs filling */
 #define	R5_Wantprexor	13	/* distinguish blocks ready for rmw from other "towrites" */
+#define	R5_DIRTY	14	/* data entered the cache without a parity calculation */
 
 /*
  * Write method
@@ -231,6 +233,8 @@ struct stripe_head_state {
 #define	STRIPE_EXPANDING	9
 #define	STRIPE_EXPAND_SOURCE	10
 #define	STRIPE_EXPAND_READY	11
+#define STRIPE_DIRTY		12 /* wb cache state */
+#define STRIPE_EVICT		13 /* wb cache action */
 
 /*
  * Operations flags (in issue order)
@@ -295,23 +299,35 @@ struct disk_info {
  * wb = write back
  */
 struct stripe_cache_policy {
+	/* complete_biodrain_action
+	 * wt: n/a
+	 * wb: inject bi_end_io calls once data is copied into the cache
+	 */
+	struct dma_async_tx_descriptor *(*complete_biodrain_action)
+		(struct stripe_head *sh, struct dma_async_tx_descriptor *tx);
 	/* release_stripe - returns '1' if stripe was moved to cache-private list
 	 *  else '0'
 	 * [ called from __release_stripe under spin_lock_irq(&conf->device_lock) ]
 	 * wt: catch 'delayed' stripes and poke the 'preread' state machine
 	 * if necessary
+	 * wb: store inactive+dirty stripes on a private list, to be flushed
+	 * by get_active_stripe pressure or a sync-request
 	 */
 	int (*release_stripe)(struct raid5_private_data *conf,
 		struct stripe_head *sh,	int handle);
 	/* complete_postxor_action
 	 * wt: check if this is the end of a rcw/rmw write request and set
 	 * the state bits accordingly.  set 'handle' and release.
+	 * wb: simply record the completion of 'postxor', set 'handle' and release
 	 */
 	void (*complete_postxor_action)(void *stripe_head_ref);
 	/* submit_pending_writes
 	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
 	 * wt: check if 'biodrain' and 'postxor' are complete and schedule writes
 	 * to the backing disks
+	 * wb: if the stripe is set to be evicted and parity is uptodate transition
+	 * wb: if the stripe is set to be evicted and parity is uptodate, transition
+	 * 'uptodate+dirty' blocks to 'uptodate+locked' blocks (i.e. wt dirty) and
 	 */
 	void (*submit_pending_writes)(struct stripe_head *sh,
 		struct stripe_head_state *s);
@@ -319,6 +335,9 @@ struct stripe_cache_policy {
 	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
 	 * wt: schedule reads to prepare for a rcw or rmw operation.  once preread
 	 * data is available lock the blocks and schedule '[prexor]+biodrain+postxor'
+	 * wb: if the stripe is set to be evicted schedule reads to prepare a rcw.
+	 * once preread data is available schedule a 'postxor' to update parity.
+	 * if the stripe is not set to be evicted just schedule a 'biodrain'
 	 */
 	void (*handle_new_writes)(struct stripe_head *sh,
 		struct stripe_head_state *s);
@@ -326,11 +345,13 @@ struct stripe_cache_policy {
 	 * [ called from handle_stripe under spin_lock(&sh->lock) ]
 	 * wt: call bi_end_io on all written blocks and perform general md/bitmap
 	 * post write housekeeping.
+	 * wb: perform general md/bitmap post write housekeeping
 	 */
 	struct bio *(*handle_completed_writes)(struct stripe_head *sh,
 		struct stripe_head_state *s);
 	/* raid5d
 	 * wt: check for stripes that can be taken off the delayed list
+	 * wb: n/a
 	 */
 	void (*raid5d)(mddev_t *mddev, struct raid5_private_data *conf);
 	/* init
@@ -341,13 +362,23 @@ struct stripe_cache_policy {
 	/* unplug_device
 	 * [ called from raid5_unplug_device under spin_lock_irqsave(&conf->device_lock) ]
 	 * wt: activate stripes on the delayed list
+	 * wb: n/a
 	 */
 	void (*unplug_device)(struct raid5_private_data *conf);
+	/* try_to_free_stripe
+	 * [ called from get_active_stripe and raid5_cache_flush ]
+	 * wt: n/a
+	 * wb: evict the oldest dirty stripe to refill the inactive list
+	 */
+	struct stripe_head *(*try_to_free_stripe)(struct raid5_private_data *conf,
+		int flush);
 	union {
 		struct list_head delayed_list; /* wt: stripes that have plugged requests */
+		struct list_head dirty_list; /* wb: inactive stripes with dirty data */
 	};
 	union {
 		atomic_t preread_active_stripes;
+		atomic_t evict_active_stripes;
 	};
 };
 

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 4/4] md: delayed stripe activation
  2007-04-11  6:00 [PATCH RFC 0/4] raid5: write-back caching policy and write performance Dan Williams
                   ` (2 preceding siblings ...)
  2007-04-11  6:00 ` [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental] Dan Williams
@ 2007-04-11  6:00 ` Dan Williams
  3 siblings, 0 replies; 9+ messages in thread
From: Dan Williams @ 2007-04-11  6:00 UTC (permalink / raw)
  To: linux-raid

based on a patch by: Raz Ben-Jehuda(caro) <raziebe@gmail.com>
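
The idea is to let a stripe sit on the delayed list until a per-stripe
deadline expires, giving later writes a chance to coalesce before
pre-reads are issued.  The deadline is configurable in milliseconds
(0-10000, default 0 which preserves the current behaviour) through a
new per-array sysfs attribute, e.g. /sys/block/md0/md/stripe_deadline.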
---

 drivers/md/raid5.c         |   92 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/raid/raid5.h |    5 ++
 2 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1a2d6b5..1b3db16 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -226,6 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	sh->active_preread_jiffies = msecs_to_jiffies(
+		atomic_read(&conf->cache_policy->deadline_ms))+jiffies;
 
 	sh->disks = disks;
 
@@ -1172,6 +1174,7 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done,
 	
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
+	sh->active_preread_jiffies = jiffies;
 	release_stripe(sh);
 	return 0;
 }
@@ -1741,8 +1744,10 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
 			firstwrite = 1;
-	} else
+	} else {
 		bip = &sh->dev[dd_idx].toread;
+		sh->active_preread_jiffies = jiffies;
+	}
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2160,7 +2165,7 @@ raid5_wt_cache_handle_new_writes(struct stripe_head *sh, struct stripe_head_stat
 	}
 }
 
-static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head *raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 {
 	struct stripe_cache_policy *cp = conf->cache_policy;
 	if (atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD) {
@@ -2168,6 +2173,20 @@ static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 			struct list_head *l = cp->delayed_list.next;
 			struct stripe_head *sh;
 			sh = list_entry(l, struct stripe_head, lru);
+
+			if (time_before(jiffies,sh->active_preread_jiffies)) {
+				PRINTK("deadline: no expire sec=%lld %8u %8u\n",
+		               		(unsigned long long) sh->sector,
+               			jiffies_to_msecs(sh->active_preread_jiffies),
+               			jiffies_to_msecs(jiffies));
+				return sh;
+      			} else {
+			      PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+	               			(unsigned long long)sh->sector,
+        	       			jiffies_to_msecs(sh->active_preread_jiffies),
+               			jiffies_to_msecs(jiffies));
+			}
+
 			list_del_init(l);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2175,9 +2194,11 @@ static void raid5_wt_cache_activate_delayed(raid5_conf_t *conf)
 			list_add_tail(&sh->lru, &conf->handle_list);
 		}
 	}
+
+	return NULL;
 }
 
-static void raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
+static struct stripe_head *raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
 {
 	struct stripe_cache_policy *cp = conf->cache_policy;
 
@@ -2185,7 +2206,9 @@ static void raid5_wt_cache_raid5d(mddev_t *mddev, raid5_conf_t *conf)
 	    atomic_read(&cp->preread_active_stripes) < IO_THRESHOLD &&
 	    !blk_queue_plugged(mddev->queue) &&
 	    !list_empty(&cp->delayed_list))
-		raid5_wt_cache_activate_delayed(conf);
+		return raid5_wt_cache_activate_delayed(conf);
+
+	return NULL;
 }
 
 static void raid5_wt_cache_init(raid5_conf_t *conf)
@@ -4339,7 +4362,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
  */
 static void raid5d (mddev_t *mddev)
 {
-	struct stripe_head *sh;
+	struct stripe_head *sh,*delayed_sh=NULL;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	int handled;
 
@@ -4363,7 +4386,10 @@ static void raid5d (mddev_t *mddev)
 		}
 
 		if (conf->cache_policy->raid5d)
-			conf->cache_policy->raid5d(mddev, conf);
+			delayed_sh = conf->cache_policy->raid5d(mddev, conf);
+
+		if (delayed_sh)
+			break;
 
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
@@ -4401,8 +4427,60 @@ static void raid5d (mddev_t *mddev)
 	unplug_slaves(mddev);
 
 	PRINTK("--- raid5d inactive\n");
+
+ 	if (delayed_sh) {
+		unsigned long local_jiffies = jiffies, wakeup;
+		if (delayed_sh->active_preread_jiffies > local_jiffies) {
+	   		wakeup = delayed_sh->active_preread_jiffies - local_jiffies;
+	   		PRINTK("--- raid5d inactive sleep for %d\n",
+	            		jiffies_to_msecs(wakeup) );
+	     		mddev->thread->timeout = wakeup;
+		}
+  	}
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	if (conf)
+		return sprintf(page, "%d\n",
+			atomic_read(&conf->cache_policy->deadline_ms));
+	else
+		return 0;
+}
+ 
+static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	char *end;
+	int new;
+	
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+	
+	new = simple_strtoul(page, &end, 10);
+	
+	if (!*page || (*end && *end != '\n') )
+		return -EINVAL;
+	
+	if (new < 0 || new > 10000)
+		return -EINVAL;
+	
+	atomic_set(&conf->cache_policy->deadline_ms,new);
+	
+	return len;
 }
 
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+                                raid5_show_stripe_deadline,
+                               raid5_store_stripe_deadline);
+
+
 static ssize_t
 raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
 {
@@ -4465,6 +4543,7 @@ raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
+	&raid5_stripe_deadline.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4581,6 +4660,7 @@ static int run(mddev_t *mddev)
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
 	conf->cache_policy->init(conf);
+	atomic_set(&conf->cache_policy->deadline_ms, 0);
 
 	PRINTK("raid5: run(%s) called.\n", mdname(mddev));
 
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 560d460..d447807 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -166,6 +166,7 @@ struct stripe_head {
 	int			bm_seq;	/* sequence number for bitmap flushes */
 	int			disks;			/* disks in stripe */
 	int			write_requests_pending;
+	unsigned long   	active_preread_jiffies;
 	struct stripe_operations {
 		unsigned long	   pending;  /* pending operations (set for request->issue->complete) */
 		unsigned long	   ack;	     /* submitted operations (set for issue->complete */
@@ -353,7 +354,8 @@ struct stripe_cache_policy {
 	 * wt: check for stripes that can be taken off the delayed list
 	 * wb: n/a
 	 */
-	void (*raid5d)(mddev_t *mddev, struct raid5_private_data *conf);
+	struct stripe_head *(*raid5d)(mddev_t *mddev,
+		struct raid5_private_data *conf);
 	/* init
 	 * wt: initialize 'delayed_list' and 'preread_active_stripes'
 	 * wb: initialize 'dirty_list' and 'dirty_stripes'
@@ -380,6 +382,7 @@ struct stripe_cache_policy {
 		atomic_t preread_active_stripes;
 		atomic_t evict_active_stripes;
 	};
+	atomic_t deadline_ms;
 };
 
 struct raid5_private_data {

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
  2007-04-11  6:00 ` [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental] Dan Williams
@ 2007-04-11 22:40   ` Mark Hahn
  2007-04-12  0:08     ` Williams, Dan J
  2007-04-12  5:37   ` Al Boldi
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Hahn @ 2007-04-11 22:40 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

> In its current implementation write-back mode acknowledges writes before
> they have reached non-volatile media.

which is basically normal for unix, no?

are you planning to support barriers?  (which are the block system's way 
of supporting filesystem atomicity).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
  2007-04-11 22:40   ` Mark Hahn
@ 2007-04-12  0:08     ` Williams, Dan J
  2007-04-12  6:21       ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Williams, Dan J @ 2007-04-12  0:08 UTC (permalink / raw)
  To: Mark Hahn; +Cc: linux-raid

> From: Mark Hahn [mailto:hahn@mcmaster.ca]
> 
> > In its current implementation write-back mode acknowledges writes
> > before they have reached non-volatile media.
> 
> which is basically normal for unix, no?
I am referring to when bi_end_io is called on the bio submitted to MD.
Normally it is not called until after the bi_end_io event for the bio
submitted to the backing disk.
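
That is, roughly (illustrative timelines, not code from the patch):

	write-through: bio -> stripe cache -> disk write completes -> bi_end_io
	write-back:    bio -> stripe cache -> bi_end_io ... disk write at eviction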

> 
> are you planning to support barriers?  (which are the block system's
> way of supporting filesystem atomicity).
Not as a part of these performance experiments.  But, I have wondered
what the underlying issues are behind raid5 not supporting barriers.
Currently in raid5.c:make_request:

	if (unlikely(bio_barrier(bi))) {
		bio_endio(bi, bi->bi_size, -EOPNOTSUPP);
		return 0;
	}
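
So a filesystem that wants ordering on raid5 today has to notice that
failure and fall back itself, along these lines (illustrative only,
against the 2.6-era bio interface; wait_on_dependent_writes() stands
in for filesystem-specific logic):

	if (err == -EOPNOTSUPP) {
		/* the device cannot order the request for us: wait
		 * for the dependent writes ourselves, resubmit the
		 * bio without BIO_RW_BARRIER, then force it to media
		 */
		wait_on_dependent_writes();
		submit_bio(WRITE, bio);
		blkdev_issue_flush(bdev, NULL);
	}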

--
Dan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
  2007-04-11  6:00 ` [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental] Dan Williams
  2007-04-11 22:40   ` Mark Hahn
@ 2007-04-12  5:37   ` Al Boldi
  1 sibling, 0 replies; 9+ messages in thread
From: Al Boldi @ 2007-04-12  5:37 UTC (permalink / raw)
  To: linux-raid

Dan Williams wrote:
> In write-through mode bi_end_io is called once writes to the data disk(s)
> and the parity disk have completed.
>
> In write-back mode bi_end_io is called immediately after data has been
> copied into the stripe cache, which also causes the stripe to be marked
> dirty.

This is not really meaningful, as this is exactly what the page-cache already 
does before being synced.

It may be more reasonable to sync the data-stripe as usual, and only delay 
the parity.  This way you shouldn't have to worry about unclean shutdowns.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
  2007-04-12  0:08     ` Williams, Dan J
@ 2007-04-12  6:21       ` Neil Brown
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2007-04-12  6:21 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: Mark Hahn, linux-raid

On Wednesday April 11, dan.j.williams@intel.com wrote:
> > From: Mark Hahn [mailto:hahn@mcmaster.ca]
> > 
> > > In its current implementation write-back mode acknowledges writes
> > > before they have reached non-volatile media.
> > 
> > which is basically normal for unix, no?
> I am referring to when bi_end_io is called on the bio submitted to MD.
> Normally it is not called until after the bi_end_io event for the bio
> submitted to the backing disk.
> 
> > 
> > are you planning to support barriers?  (which are the block system's
> > way of supporting filesystem atomicity).
> Not as a part of these performance experiments.  But, I have wondered
> what the underlying issues are behind raid5 not supporting barriers.
> Currently in raid5.c:make_request:
> 
> 	if (unlikely(bio_barrier(bi))) {
> 		bio_endio(bi, bi->bi_size, -EOPNOTSUPP);
> 		return 0;
> 	}

I should be getting this explanation down to a fine art.  I seem to be
delivering it in multiple forums.

My position is that for a virtual device that stores some blocks on
some devices and other blocks on other devices (e.g. raid0, raid5,
linear, LVM, but not raid1) barrier support in the individual devices
is unusable, and that to achieve the goal it is just as easy for the
filesystem to order requests and to use blkdev_issue_flush to force
sync-to-disk. 

The semantics of a barrier (as I understand it) is that all writes
prior to the barrier are safe before the barrier write is commenced,
and that write itself is safe before any subsequent write is
commenced. (I think those semantics are stronger than we should be
exporting - just the first half should be enough - but such is life).

On a single drive, this is achieved by not re-ordering requests around
a barrier, and asking the device to not re-order requests either.
When you have multiple devices, you cannot ask them not to re-order
requests with respect to each other, so the same mechanism cannot be
used.

Instead, you would have to plug the meta-device, unplug all the lower
level queues, wait for all writes to complete, call blkdev_issue_flush
to make sure the data is safe, issue the barrier write and wait for it
to complete, call blkdev_issue_flush again (Well, maybe the barrier
write could have been sent with BIO_RW_BARRIER for the same effect)
then unplug the queue.
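
In pseudo-C, with the iteration helpers left abstract (none of these
helpers are a real md interface, it is just the sequence above):

	blk_plug_device(meta_queue);
	unplug_lower_level_queues();		/* each component queue */
	wait_for_all_prior_writes();
	for_each_component_rdev(rdev)
		blkdev_issue_flush(rdev->bdev, NULL);
	issue_and_wait(barrier_write_bio);	/* the barrier write itself */
	for_each_component_rdev(rdev)
		blkdev_issue_flush(rdev->bdev, NULL);
	blk_remove_plug(meta_queue);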

And the thing is that all of that complexity ALREADY needs to be in the
filesystem.  Because if a device doesn't support barriers, the
filesystem should wait for all dependent writes to complete and then
issue the 'barrier' write (and probably call blkdev_issue_flush as
well).

And the filesystem is positioned to do this BETTER because it can know
which writes are really dependent and which might be incidental.

Ext3 gets this right except that it never bothers with
blkdev_issue_flush.  XFS doesn't even bother trying (it's designed to
be used with reliable drives).  reiserfs might actually get it
completely right as it does have a call to blkdev_issue_flush in what
looks like the right place, but I cannot be sure without lots of code
review. 

dm/stripe currently gets this wrong.  If it gets a barrier request it
just passes it down to the one target drive, thus failing to ensure any
ordering wrt other drives.

All that said:  raid5 is probably in a better position than most to
implement a barrier as it keeps careful track of everything that is
happening, and could easily wait for all prior writes to complete.
This might mesh well with the write-back approach to caching.  But I
would still rather that the filesystem just got it right for us.
With a single drive, the drive can implement a barrier more
efficiently than the filesystem.  With multiple drives, the
meta-device can at best be as efficient as the filesystem.

NeilBrown

^ permalink raw reply	[flat|nested] 9+ messages in thread
