* [md PATCH 00/34] md patches for 3.1 - part 1
@ 2011-07-21 2:32 NeilBrown
2011-07-21 2:32 ` [md PATCH 02/34] md/raid10: factor out common bio handling code NeilBrown
` (33 more replies)
0 siblings, 34 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Cc: Namhyung Kim, Akinobu Mita, Christian Dietrich, Jonathan Brassow
As 3.0 is fast approaching, it is long past time to publish my patch
queue for 3.1. Much of it has been in -next for a while, and while
that enabled testing it doesn't really encourage review.
So this is the first of 2 patch-bombs I will be sending, which together
comprise all of the changes I (currently) plan to submit for 3.1.
This set is mostly ad-hoc bits and pieces: code clean-up, minor bug
fixes and so forth. Probably the strongest theme is some
refactoring in raid5 to reduce code duplication between RAID5 and
RAID6. (Almost) all similar code has been factored into common
routines.
All "Reviewed-by" lines gratefully accepted ... until early-ish next
week when I will submit that code to Linus. If anyone is keen to do
some review, the RAID5 changes are where I would be most happy to have
it focused. Remember: if something doesn't look right, it probably
isn't.
Some of these patches have been seen on the list already, but I
thought it best to simply include them all for completeness.
NeilBrown
---
Akinobu Mita (1):
md: use proper little-endian bitops
Christian Dietrich (1):
md/raid: use printk_ratelimited instead of printk_ratelimit
Jonathan Brassow (2):
MD bitmap: Revert DM dirty log hooks
MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe
Namhyung Kim (11):
md/raid10: move rdev->corrected_errors counting
md/raid5: move rdev->corrected_errors counting
md/raid1: move rdev->corrected_errors counting
md: get rid of unnecessary casts on page_address()
md: remove ro check in md_check_recovery()
md: introduce link/unlink_rdev() helpers
md/raid5: get rid of duplicated call to bio_data_dir()
md/raid5: use kmem_cache_zalloc()
md/raid10: share pages between read and write bio's during recovery
md/raid10: factor out common bio handling code
md/raid10: get rid of duplicated conditional expression
NeilBrown (19):
md/raid5: Avoid BUG caused by multiple failures.
md/raid10: Improve decision on whether to fail a device with a read error.
md/raid10: Make use of new recovery_disabled handling
md: change managed of recovery_disabled.
md/raid5: finalise new merged handle_stripe.
md/raid5: move some more common code into handle_stripe
md/raid5: move more common code into handle_stripe
md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
md/raid5: unite fetch_block5 and fetch_block6
md/raid5: rearrange a test in fetch_block6.
md/raid5: move more code into common handle_stripe
md/raid5: Move code for finishing a reconstruction into handle_stripe.
md/raid5: move stripe_head_state and more code into handle_stripe.
md/raid5: add some more fields to stripe_head_state
md/raid5: unify stripe_head_state and r6_state
md/raid5: move common code into handle_stripe
md/raid5: replace sh->lock with an 'active' flag.
md/raid5: Protect some more code with ->device_lock.
md/raid5: Remove use of sh->lock in sync_request
drivers/md/bitmap.c | 137 +++-----
drivers/md/bitmap.h | 5
drivers/md/md.c | 77 ++---
drivers/md/md.h | 28 +-
drivers/md/raid1.c | 66 ++--
drivers/md/raid1.h | 6
drivers/md/raid10.c | 214 +++++++------
drivers/md/raid10.h | 5
drivers/md/raid5.c | 832 ++++++++++++++-------------------------------------
drivers/md/raid5.h | 45 +--
10 files changed, 505 insertions(+), 910 deletions(-)
--
Signature
* [md PATCH 03/34] md/raid10: share pages between read and write bio's during recovery
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
2011-07-21 2:32 ` [md PATCH 02/34] md/raid10: factor out common bio handling code NeilBrown
2011-07-21 2:32 ` [md PATCH 01/34] md/raid10: get rid of duplicated conditional expression NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request NeilBrown
` (30 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
When performing a recovery, only the first 2 slots in r10_bio are in use,
for read and write respectively. However the pages in the write bio
are never actually used: they are simply swapped for the read bio's
pages when the read completes. Get rid of those unused pages and share
the read bio's pages instead.
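In outline, the allocation path becomes something like this (a condensed
sketch of the idea rather than the exact driver code; 'sharing_is_possible'
stands in for the real recovery-vs-resync test):

	struct page *page;

	if (sharing_is_possible) {
		/* Recovery (not resync): reuse the page already allocated
		 * for the read bio.  get_page() takes an extra reference,
		 * so the page survives until every holder has dropped its
		 * reference again when the buffers are freed.
		 */
		page = read_bio->bi_io_vec[i].bv_page;
		get_page(page);
	} else {
		/* Resync still needs separate pages for each bio. */
		page = alloc_page(gfp_flags);
	}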
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 23 ++++++++++++-----------
1 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e434f1e..3715e22 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -123,7 +123,14 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
for (j = 0 ; j < nalloc; j++) {
bio = r10_bio->devs[j].bio;
for (i = 0; i < RESYNC_PAGES; i++) {
- page = alloc_page(gfp_flags);
+ if (j == 1 && !test_bit(MD_RECOVERY_SYNC,
+ &conf->mddev->recovery)) {
+ /* we can share bv_page's during recovery */
+ struct bio *rbio = r10_bio->devs[0].bio;
+ page = rbio->bi_io_vec[i].bv_page;
+ get_page(page);
+ } else
+ page = alloc_page(gfp_flags);
if (unlikely(!page))
goto out_free_pages;
@@ -1360,20 +1367,14 @@ done:
static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
{
conf_t *conf = mddev->private;
- int i, d;
- struct bio *bio, *wbio;
-
+ int d;
+ struct bio *wbio;
- /* move the pages across to the second bio
+ /*
+ * share the pages with the first bio
* and submit the write request
*/
- bio = r10_bio->devs[0].bio;
wbio = r10_bio->devs[1].bio;
- for (i=0; i < wbio->bi_vcnt; i++) {
- struct page *p = bio->bi_io_vec[i].bv_page;
- bio->bi_io_vec[i].bv_page = wbio->bi_io_vec[i].bv_page;
- wbio->bi_io_vec[i].bv_page = p;
- }
d = r10_bio->devs[1].devnum;
atomic_inc(&conf->mirrors[d].rdev->nr_pending);
* [md PATCH 02/34] md/raid10: factor out common bio handling code
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 01/34] md/raid10: get rid of duplicated conditional expression NeilBrown
` (32 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
When a normal-write or sync-read/write bio completes, we need to
find out the disk number the bio belongs to. Factor that common
code out into a separate function.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 44 +++++++++++++++++++++++---------------------
1 files changed, 23 insertions(+), 21 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d55ae12..e434f1e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -244,6 +244,23 @@ static inline void update_head_pos(int slot, r10bio_t *r10_bio)
r10_bio->devs[slot].addr + (r10_bio->sectors);
}
+/*
+ * Find the disk number which triggered given bio
+ */
+static int find_bio_disk(conf_t *conf, r10bio_t *r10_bio, struct bio *bio)
+{
+ int slot;
+
+ for (slot = 0; slot < conf->copies; slot++)
+ if (r10_bio->devs[slot].bio == bio)
+ break;
+
+ BUG_ON(slot == conf->copies);
+ update_head_pos(slot, r10_bio);
+
+ return r10_bio->devs[slot].devnum;
+}
+
static void raid10_end_read_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -289,13 +306,10 @@ static void raid10_end_write_request(struct bio *bio, int error)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
r10bio_t *r10_bio = bio->bi_private;
- int slot, dev;
+ int dev;
conf_t *conf = r10_bio->mddev->private;
- for (slot = 0; slot < conf->copies; slot++)
- if (r10_bio->devs[slot].bio == bio)
- break;
- dev = r10_bio->devs[slot].devnum;
+ dev = find_bio_disk(conf, r10_bio, bio);
/*
* this branch is our 'one mirror IO has finished' event handler:
@@ -316,8 +330,6 @@ static void raid10_end_write_request(struct bio *bio, int error)
*/
set_bit(R10BIO_Uptodate, &r10_bio->state);
- update_head_pos(slot, r10_bio);
-
/*
*
* Let's see if all mirrored write operations have finished
@@ -1173,14 +1185,9 @@ static void end_sync_read(struct bio *bio, int error)
{
r10bio_t *r10_bio = bio->bi_private;
conf_t *conf = r10_bio->mddev->private;
- int i,d;
+ int d;
- for (i=0; i<conf->copies; i++)
- if (r10_bio->devs[i].bio == bio)
- break;
- BUG_ON(i == conf->copies);
- update_head_pos(i, r10_bio);
- d = r10_bio->devs[i].devnum;
+ d = find_bio_disk(conf, r10_bio, bio);
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
set_bit(R10BIO_Uptodate, &r10_bio->state);
@@ -1211,18 +1218,13 @@ static void end_sync_write(struct bio *bio, int error)
r10bio_t *r10_bio = bio->bi_private;
mddev_t *mddev = r10_bio->mddev;
conf_t *conf = mddev->private;
- int i,d;
+ int d;
- for (i = 0; i < conf->copies; i++)
- if (r10_bio->devs[i].bio == bio)
- break;
- d = r10_bio->devs[i].devnum;
+ d = find_bio_disk(conf, r10_bio, bio);
if (!uptodate)
md_error(mddev, conf->mirrors[d].rdev);
- update_head_pos(i, r10_bio);
-
rdev_dec_pending(conf->mirrors[d].rdev, mddev);
while (atomic_dec_and_test(&r10_bio->remaining)) {
if (r10_bio->master_bio == NULL) {
* [md PATCH 01/34] md/raid10: get rid of duplicated conditional expression
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
2011-07-21 2:32 ` [md PATCH 02/34] md/raid10: factor out common bio handling code NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 03/34] md/raid10: share pages between read and write bio's during recovery NeilBrown
` (31 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if that is non-negative. Thus the condition '>= first' always implies
'>= 0', so the latter check is not needed.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6e84668..d55ae12 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1093,8 +1093,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
- if (rdev->saved_raid_disk >= 0 &&
- rdev->saved_raid_disk >= first &&
+ if (rdev->saved_raid_disk >= first &&
conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
mirror = rdev->saved_raid_disk;
else
* [md PATCH 04/34] md/raid5: use kmem_cache_zalloc()
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (4 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag NeilBrown
` (27 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Replace kmem_cache_alloc + memset(,0,) with kmem_cache_zalloc().
This should not be harmful since @conf->slab_cache already knows the
actual size of struct stripe_head (including the trailing r5dev array),
so the whole object is still zeroed.
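For reference, the reason this is equivalent: the cache is created with
the full (variable) object size, so kmem_cache_zalloc() zeroes exactly the
region the old memset() covered. Roughly:

	/* grow_stripes() creates the cache with the complete object size:
	 *
	 *	sc = kmem_cache_create(name,
	 *			       sizeof(struct stripe_head) +
	 *			       (devs - 1) * sizeof(struct r5dev),
	 *			       0, 0, NULL);
	 *
	 * so a zeroed allocation from it covers the r5dev array too.
	 */
	sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
	if (!sh)
		return 0;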
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b72edf3..0f71aa9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1315,10 +1315,10 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
static int grow_one_stripe(raid5_conf_t *conf)
{
struct stripe_head *sh;
- sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL);
+ sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
if (!sh)
return 0;
- memset(sh, 0, sizeof(*sh) + (conf->pool_size-1)*sizeof(struct r5dev));
+
sh->raid_conf = conf;
spin_lock_init(&sh->lock);
#ifdef CONFIG_MULTICORE_RAID456
@@ -1435,12 +1435,10 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
return -ENOMEM;
for (i = conf->max_nr_stripes; i; i--) {
- nsh = kmem_cache_alloc(sc, GFP_KERNEL);
+ nsh = kmem_cache_zalloc(sc, GFP_KERNEL);
if (!nsh)
break;
- memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
-
nsh->raid_conf = conf;
spin_lock_init(&nsh->lock);
#ifdef CONFIG_MULTICORE_RAID456
* [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (8 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 05/34] md/raid5: get rid of duplicated call to bio_data_dir() NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 4:49 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 09/34] md/raid5: move common code into handle_stripe NeilBrown
` (23 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.
This means that all code now needs to treat s->failed_num as a small
array, but this is a small cost.
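After the merge the state structure looks roughly like this (abridged;
the full change is in the raid5.h hunk at the end of the patch):

	struct stripe_head_state {
		int syncing, expanding, expanded;
		int locked, uptodate, to_read, to_write, failed, written;
		int to_fill, compute, req_compute, non_overwrite;
		int failed_num[2];	/* was a plain int; [1] only used by RAID6 */
		unsigned long ops_request;
		int p_failed, q_failed;	/* moved in from the old r6_state */
	};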
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 79 ++++++++++++++++++++++++++--------------------------
drivers/md/raid5.h | 8 +----
2 files changed, 41 insertions(+), 46 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dfb3d9f..c32ffb5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2318,7 +2318,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
int disk_idx, int disks)
{
struct r5dev *dev = &sh->dev[disk_idx];
- struct r5dev *failed_dev = &sh->dev[s->failed_num];
+ struct r5dev *failed_dev = &sh->dev[s->failed_num[0]];
/* is the data in this block needed, and can we get it? */
if (!test_bit(R5_LOCKED, &dev->flags) &&
@@ -2334,7 +2334,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
* otherwise read it if the backing disk is insync
*/
if ((s->uptodate == disks - 1) &&
- (s->failed && disk_idx == s->failed_num)) {
+ (s->failed && disk_idx == s->failed_num[0])) {
set_bit(STRIPE_COMPUTE_RUN, &sh->state);
set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
set_bit(R5_Wantcompute, &dev->flags);
@@ -2388,11 +2388,11 @@ static void handle_stripe_fill5(struct stripe_head *sh,
* 0 to tell the loop in handle_stripe_fill6 to continue
*/
static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
- struct r6_state *r6s, int disk_idx, int disks)
+ int disk_idx, int disks)
{
struct r5dev *dev = &sh->dev[disk_idx];
- struct r5dev *fdev[2] = { &sh->dev[r6s->failed_num[0]],
- &sh->dev[r6s->failed_num[1]] };
+ struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
+ &sh->dev[s->failed_num[1]] };
if (!test_bit(R5_LOCKED, &dev->flags) &&
!test_bit(R5_UPTODATE, &dev->flags) &&
@@ -2409,8 +2409,8 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
BUG_ON(test_bit(R5_Wantread, &dev->flags));
if ((s->uptodate == disks - 1) &&
- (s->failed && (disk_idx == r6s->failed_num[0] ||
- disk_idx == r6s->failed_num[1]))) {
+ (s->failed && (disk_idx == s->failed_num[0] ||
+ disk_idx == s->failed_num[1]))) {
/* have disk failed, and we're requested to fetch it;
* do compute it
*/
@@ -2465,7 +2465,7 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
* handle_stripe_fill6 - read or compute data to satisfy pending requests.
*/
static void handle_stripe_fill6(struct stripe_head *sh,
- struct stripe_head_state *s, struct r6_state *r6s,
+ struct stripe_head_state *s,
int disks)
{
int i;
@@ -2477,7 +2477,7 @@ static void handle_stripe_fill6(struct stripe_head *sh,
if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
!sh->reconstruct_state)
for (i = disks; i--; )
- if (fetch_block6(sh, s, r6s, i, disks))
+ if (fetch_block6(sh, s, i, disks))
break;
set_bit(STRIPE_HANDLE, &sh->state);
}
@@ -2625,7 +2625,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
static void handle_stripe_dirtying6(raid5_conf_t *conf,
struct stripe_head *sh, struct stripe_head_state *s,
- struct r6_state *r6s, int disks)
+ int disks)
{
int rcw = 0, pd_idx = sh->pd_idx, i;
int qd_idx = sh->qd_idx;
@@ -2688,7 +2688,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
s->uptodate--;
break;
}
- dev = &sh->dev[s->failed_num];
+ dev = &sh->dev[s->failed_num[0]];
/* fall through */
case check_state_compute_result:
sh->check_state = check_state_idle;
@@ -2760,7 +2760,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
struct stripe_head_state *s,
- struct r6_state *r6s, int disks)
+ int disks)
{
int pd_idx = sh->pd_idx;
int qd_idx = sh->qd_idx;
@@ -2779,14 +2779,14 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
switch (sh->check_state) {
case check_state_idle:
/* start a new check operation if there are < 2 failures */
- if (s->failed == r6s->q_failed) {
+ if (s->failed == s->q_failed) {
/* The only possible failed device holds Q, so it
* makes sense to check P (If anything else were failed,
* we would have used P to recreate it).
*/
sh->check_state = check_state_run;
}
- if (!r6s->q_failed && s->failed < 2) {
+ if (!s->q_failed && s->failed < 2) {
/* Q is not failed, and we didn't use it to generate
* anything, so it makes sense to check it
*/
@@ -2828,13 +2828,13 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
*/
BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
if (s->failed == 2) {
- dev = &sh->dev[r6s->failed_num[1]];
+ dev = &sh->dev[s->failed_num[1]];
s->locked++;
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
}
if (s->failed >= 1) {
- dev = &sh->dev[r6s->failed_num[0]];
+ dev = &sh->dev[s->failed_num[0]];
s->locked++;
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
@@ -2922,7 +2922,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
}
static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
- struct r6_state *r6s)
+ struct stripe_head_state *r6s)
{
int i;
@@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
for (j = 0; j < conf->raid_disks; j++)
if (j != sh2->pd_idx &&
- (!r6s || j != sh2->qd_idx) &&
+ (r6s || j != sh2->qd_idx) &&
!test_bit(R5_Expanded, &sh2->dev[j].flags))
break;
if (j == conf->raid_disks) {
@@ -3082,7 +3082,7 @@ static void handle_stripe5(struct stripe_head *sh)
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
s.failed++;
- s.failed_num = i;
+ s.failed_num[0] = i;
}
}
spin_unlock_irq(&conf->device_lock);
@@ -3107,7 +3107,7 @@ static void handle_stripe5(struct stripe_head *sh)
pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d\n",
s.locked, s.uptodate, s.to_read, s.to_write,
- s.failed, s.failed_num);
+ s.failed, s.failed_num[0]);
/* check if the array has lost two devices and, if so, some requests might
* need to be failed
*/
@@ -3127,7 +3127,7 @@ static void handle_stripe5(struct stripe_head *sh)
((test_bit(R5_Insync, &dev->flags) &&
!test_bit(R5_LOCKED, &dev->flags) &&
test_bit(R5_UPTODATE, &dev->flags)) ||
- (s.failed == 1 && s.failed_num == sh->pd_idx)))
+ (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
handle_stripe_clean_event(conf, sh, disks, &return_bi);
/* Now we might consider reading some blocks, either to check/generate
@@ -3198,11 +3198,11 @@ static void handle_stripe5(struct stripe_head *sh)
* the repair/check process
*/
if (s.failed == 1 && !conf->mddev->ro &&
- test_bit(R5_ReadError, &sh->dev[s.failed_num].flags)
- && !test_bit(R5_LOCKED, &sh->dev[s.failed_num].flags)
- && test_bit(R5_UPTODATE, &sh->dev[s.failed_num].flags)
+ test_bit(R5_ReadError, &sh->dev[s.failed_num[0]].flags)
+ && !test_bit(R5_LOCKED, &sh->dev[s.failed_num[0]].flags)
+ && test_bit(R5_UPTODATE, &sh->dev[s.failed_num[0]].flags)
) {
- dev = &sh->dev[s.failed_num];
+ dev = &sh->dev[s.failed_num[0]];
if (!test_bit(R5_ReWrite, &dev->flags)) {
set_bit(R5_Wantwrite, &dev->flags);
set_bit(R5_ReWrite, &dev->flags);
@@ -3292,7 +3292,6 @@ static void handle_stripe6(struct stripe_head *sh)
struct bio *return_bi = NULL;
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
struct stripe_head_state s;
- struct r6_state r6s;
struct r5dev *dev, *pdev, *qdev;
mdk_rdev_t *blocked_rdev = NULL;
int dec_preread_active = 0;
@@ -3370,7 +3369,7 @@ static void handle_stripe6(struct stripe_head *sh)
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
if (s.failed < 2)
- r6s.failed_num[s.failed] = i;
+ s.failed_num[s.failed] = i;
s.failed++;
}
}
@@ -3396,7 +3395,7 @@ static void handle_stripe6(struct stripe_head *sh)
pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d,%d\n",
s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
- r6s.failed_num[0], r6s.failed_num[1]);
+ s.failed_num[0], s.failed_num[1]);
/* check if the array has lost >2 devices and, if so, some requests
* might need to be failed
*/
@@ -3413,17 +3412,17 @@ static void handle_stripe6(struct stripe_head *sh)
* are safe, or on a failed drive
*/
pdev = &sh->dev[pd_idx];
- r6s.p_failed = (s.failed >= 1 && r6s.failed_num[0] == pd_idx)
- || (s.failed >= 2 && r6s.failed_num[1] == pd_idx);
+ s.p_failed = (s.failed >= 1 && s.failed_num[0] == pd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == pd_idx);
qdev = &sh->dev[qd_idx];
- r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == qd_idx)
- || (s.failed >= 2 && r6s.failed_num[1] == qd_idx);
+ s.q_failed = (s.failed >= 1 && s.failed_num[0] == qd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == qd_idx);
- if ( s.written &&
- ( r6s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
+ if (s.written &&
+ (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
&& !test_bit(R5_LOCKED, &pdev->flags)
&& test_bit(R5_UPTODATE, &pdev->flags)))) &&
- ( r6s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
+ (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
&& !test_bit(R5_LOCKED, &qdev->flags)
&& test_bit(R5_UPTODATE, &qdev->flags)))))
handle_stripe_clean_event(conf, sh, disks, &return_bi);
@@ -3434,7 +3433,7 @@ static void handle_stripe6(struct stripe_head *sh)
*/
if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
(s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
- handle_stripe_fill6(sh, &s, &r6s, disks);
+ handle_stripe_fill6(sh, &s, disks);
/* Now we check to see if any write operations have recently
* completed
@@ -3472,7 +3471,7 @@ static void handle_stripe6(struct stripe_head *sh)
* block.
*/
if (s.to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying6(conf, sh, &s, &r6s, disks);
+ handle_stripe_dirtying6(conf, sh, &s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3483,7 +3482,7 @@ static void handle_stripe6(struct stripe_head *sh)
(s.syncing && s.locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks6(conf, sh, &s, &r6s, disks);
+ handle_parity_checks6(conf, sh, &s, disks);
if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS,1);
@@ -3495,7 +3494,7 @@ static void handle_stripe6(struct stripe_head *sh)
*/
if (s.failed <= 2 && !conf->mddev->ro)
for (i = 0; i < s.failed; i++) {
- dev = &sh->dev[r6s.failed_num[i]];
+ dev = &sh->dev[s.failed_num[i]];
if (test_bit(R5_ReadError, &dev->flags)
&& !test_bit(R5_LOCKED, &dev->flags)
&& test_bit(R5_UPTODATE, &dev->flags)
@@ -3557,7 +3556,7 @@ static void handle_stripe6(struct stripe_head *sh)
if (s.expanding && s.locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, &r6s);
+ handle_stripe_expansion(conf, sh, &s);
unlock:
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 217a9d4..d3c61d3 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -245,13 +245,9 @@ struct stripe_head_state {
int syncing, expanding, expanded;
int locked, uptodate, to_read, to_write, failed, written;
int to_fill, compute, req_compute, non_overwrite;
- int failed_num;
+ int failed_num[2];
unsigned long ops_request;
-};
-
-/* r6_state - extra state data only relevant to r6 */
-struct r6_state {
- int p_failed, q_failed, failed_num[2];
+ int p_failed, q_failed;
};
/* Flags */
* [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (2 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 03/34] md/raid10: share pages between read and write bio's during recovery NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 3:39 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state NeilBrown
` (29 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
This is the start of a series of patches to remove sh->lock.
sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].
Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.
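The handoff then relies only on atomic bitops on sh->state; roughly
(a sketch of the pattern, not a verbatim copy of the hunks below):

	/* sync_request(): no need for sh->lock just to request a sync pass */
	set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
	handle_stripe(sh);

	/* handle_stripe[56]() (still under sh->lock at this point in the
	 * series): test_and_clear_bit() is atomic, so exactly one pass
	 * consumes the request and performs the transition that the old
	 * locked region in sync_request used to do.
	 */
	if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
		set_bit(STRIPE_SYNCING, &sh->state);
		clear_bit(STRIPE_INSYNC, &sh->state);
	}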
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 13 +++++++++----
drivers/md/raid5.h | 33 ++++++++++++++++++---------------
2 files changed, 27 insertions(+), 19 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7148064..f2f2ab3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3022,6 +3022,10 @@ static void handle_stripe5(struct stripe_head *sh)
sh->reconstruct_state);
spin_lock(&sh->lock);
+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
@@ -3313,6 +3317,10 @@ static void handle_stripe6(struct stripe_head *sh)
memset(&s, 0, sizeof(s));
spin_lock(&sh->lock);
+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
@@ -4373,10 +4381,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
- spin_lock(&sh->lock);
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- spin_unlock(&sh->lock);
+ set_bit(STRIPE_SYNC_REQUESTED, &sh->state);
handle_stripe(sh);
release_stripe(sh);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 3ca77a2..a330011 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -289,21 +289,24 @@ struct r6_state {
/*
* Stripe state
*/
-#define STRIPE_HANDLE 2
-#define STRIPE_SYNCING 3
-#define STRIPE_INSYNC 4
-#define STRIPE_PREREAD_ACTIVE 5
-#define STRIPE_DELAYED 6
-#define STRIPE_DEGRADED 7
-#define STRIPE_BIT_DELAY 8
-#define STRIPE_EXPANDING 9
-#define STRIPE_EXPAND_SOURCE 10
-#define STRIPE_EXPAND_READY 11
-#define STRIPE_IO_STARTED 12 /* do not count towards 'bypass_count' */
-#define STRIPE_FULL_WRITE 13 /* all blocks are set to be overwritten */
-#define STRIPE_BIOFILL_RUN 14
-#define STRIPE_COMPUTE_RUN 15
-#define STRIPE_OPS_REQ_PENDING 16
+enum {
+ STRIPE_HANDLE,
+ STRIPE_SYNC_REQUESTED,
+ STRIPE_SYNCING,
+ STRIPE_INSYNC,
+ STRIPE_PREREAD_ACTIVE,
+ STRIPE_DELAYED,
+ STRIPE_DEGRADED,
+ STRIPE_BIT_DELAY,
+ STRIPE_EXPANDING,
+ STRIPE_EXPAND_SOURCE,
+ STRIPE_EXPAND_READY,
+ STRIPE_IO_STARTED, /* do not count towards 'bypass_count' */
+ STRIPE_FULL_WRITE, /* all blocks are set to be overwritten */
+ STRIPE_BIOFILL_RUN,
+ STRIPE_COMPUTE_RUN,
+ STRIPE_OPS_REQ_PENDING,
+};
/*
* Operation request flags
* [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (6 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 3:54 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 05/34] md/raid5: get rid of duplicated call to bio_data_dir() NeilBrown
` (25 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.
with merged fixes by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 30 ++++++++++++++++--------------
1 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f2f2ab3..9985138 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1021,10 +1021,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
struct bio *wbi;
spin_lock(&sh->lock);
+ spin_lock_irq(&sh->raid_conf->device_lock);
chosen = dev->towrite;
dev->towrite = NULL;
BUG_ON(dev->written);
wbi = dev->written = chosen;
+ spin_unlock_irq(&sh->raid_conf->device_lock);
spin_unlock(&sh->lock);
while (wbi && wbi->bi_sector <
@@ -2141,7 +2143,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
raid5_conf_t *conf = sh->raid_conf;
int firstwrite=0;
- pr_debug("adding bh b#%llu to stripe s#%llu\n",
+ pr_debug("adding bi b#%llu to stripe s#%llu\n",
(unsigned long long)bi->bi_sector,
(unsigned long long)sh->sector);
@@ -2167,19 +2169,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
bi->bi_next = *bip;
*bip = bi;
bi->bi_phys_segments++;
- spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);
-
- pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
- (unsigned long long)bi->bi_sector,
- (unsigned long long)sh->sector, dd_idx);
-
- if (conf->mddev->bitmap && firstwrite) {
- bitmap_startwrite(conf->mddev->bitmap, sh->sector,
- STRIPE_SECTORS, 0);
- sh->bm_seq = conf->seq_flush+1;
- set_bit(STRIPE_BIT_DELAY, &sh->state);
- }
if (forwrite) {
/* check if page is covered */
@@ -2194,6 +2183,19 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
}
+ spin_unlock_irq(&conf->device_lock);
+ spin_unlock(&sh->lock);
+
+ pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
+ (unsigned long long)(*bip)->bi_sector,
+ (unsigned long long)sh->sector, dd_idx);
+
+ if (conf->mddev->bitmap && firstwrite) {
+ bitmap_startwrite(conf->mddev->bitmap, sh->sector,
+ STRIPE_SECTORS, 0);
+ sh->bm_seq = conf->seq_flush+1;
+ set_bit(STRIPE_BIT_DELAY, &sh->state);
+ }
return 1;
overlap:
* [md PATCH 05/34] md/raid5: get rid of duplicated call to bio_data_dir()
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (7 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state NeilBrown
` (24 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
In raid5::make_request(), once bio_data_dir(@bi) has been determined
it never changes (and couldn't). Save the result and use it throughout.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0f71aa9..7148064 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4014,7 +4014,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
}
}
- if (bio_data_dir(bi) == WRITE &&
+ if (rw == WRITE &&
logical_sector >= mddev->suspend_lo &&
logical_sector < mddev->suspend_hi) {
release_stripe(sh);
@@ -4032,7 +4032,7 @@ static int make_request(mddev_t *mddev, struct bio * bi)
}
if (test_bit(STRIPE_EXPANDING, &sh->state) ||
- !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+ !add_stripe_bio(sh, bi, dd_idx, rw)) {
/* Stripe is busy expanding or
* add failed due to overlap. Flush everything
* and wait a while
* [md PATCH 09/34] md/raid5: move common code into handle_stripe
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (9 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 4:30 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 14/34] md/raid5: move more code into common handle_stripe NeilBrown
` (22 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
There is common code at the start of handle_stripe5 and
handle_stripe6. Move it into handle_stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 18 ++++++------------
1 files changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8275b5..dfb3d9f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3016,12 +3016,6 @@ static void handle_stripe5(struct stripe_head *sh)
atomic_read(&sh->count), sh->pd_idx, sh->check_state,
sh->reconstruct_state);
- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3310,12 +3304,6 @@ static void handle_stripe6(struct stripe_head *sh)
sh->check_state, sh->reconstruct_state);
memset(&s, 0, sizeof(s));
- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3607,6 +3595,12 @@ static void handle_stripe(struct stripe_head *sh)
return;
}
+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
+ clear_bit(STRIPE_DELAYED, &sh->state);
+
if (sh->raid_conf->level == 6)
handle_stripe6(sh);
else
* [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (5 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 04/34] md/raid5: use kmem_cache_zalloc() NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 4:27 ` Namhyung Kim
2011-08-03 22:47 ` Dan Williams
2011-07-21 2:32 ` [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock NeilBrown
` (26 subsequent siblings)
33 siblings, 2 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.
That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe. If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.
For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed, but as shared fields like ->toread and
->towrite are examined there it is safer for now at least.
We leave the label after the old 'unlock' still called "unlock" because it
will disappear in a few patches, so renaming it seems pointless.
This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 26 +++++++++++++-------------
drivers/md/raid5.h | 2 +-
2 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9985138..f8275b5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1020,14 +1020,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
struct bio *wbi;
- spin_lock(&sh->lock);
spin_lock_irq(&sh->raid_conf->device_lock);
chosen = dev->towrite;
dev->towrite = NULL;
BUG_ON(dev->written);
wbi = dev->written = chosen;
spin_unlock_irq(&sh->raid_conf->device_lock);
- spin_unlock(&sh->lock);
while (wbi && wbi->bi_sector <
dev->sector + STRIPE_SECTORS) {
@@ -1322,7 +1320,6 @@ static int grow_one_stripe(raid5_conf_t *conf)
return 0;
sh->raid_conf = conf;
- spin_lock_init(&sh->lock);
#ifdef CONFIG_MULTICORE_RAID456
init_waitqueue_head(&sh->ops.wait_for_ops);
#endif
@@ -1442,7 +1439,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
break;
nsh->raid_conf = conf;
- spin_lock_init(&nsh->lock);
#ifdef CONFIG_MULTICORE_RAID456
init_waitqueue_head(&nsh->ops.wait_for_ops);
#endif
@@ -2148,7 +2144,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
(unsigned long long)sh->sector);
- spin_lock(&sh->lock);
spin_lock_irq(&conf->device_lock);
if (forwrite) {
bip = &sh->dev[dd_idx].towrite;
@@ -2184,7 +2179,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
}
spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);
pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
(unsigned long long)(*bip)->bi_sector,
@@ -2201,7 +2195,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
overlap:
set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
spin_unlock_irq(&conf->device_lock);
- spin_unlock(&sh->lock);
return 0;
}
@@ -3023,12 +3016,10 @@ static void handle_stripe5(struct stripe_head *sh)
atomic_read(&sh->count), sh->pd_idx, sh->check_state,
sh->reconstruct_state);
- spin_lock(&sh->lock);
if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
set_bit(STRIPE_SYNCING, &sh->state);
clear_bit(STRIPE_INSYNC, &sh->state);
}
- clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
/* Now to look around and see what can be done */
rcu_read_lock();
+ spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
@@ -3099,6 +3091,7 @@ static void handle_stripe5(struct stripe_head *sh)
s.failed_num = i;
}
}
+ spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
if (unlikely(blocked_rdev)) {
@@ -3275,7 +3268,6 @@ static void handle_stripe5(struct stripe_head *sh)
handle_stripe_expansion(conf, sh, NULL);
unlock:
- spin_unlock(&sh->lock);
/* wait for this device to become unblocked */
if (unlikely(blocked_rdev))
@@ -3318,12 +3310,10 @@ static void handle_stripe6(struct stripe_head *sh)
sh->check_state, sh->reconstruct_state);
memset(&s, 0, sizeof(s));
- spin_lock(&sh->lock);
if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
set_bit(STRIPE_SYNCING, &sh->state);
clear_bit(STRIPE_INSYNC, &sh->state);
}
- clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
@@ -3332,6 +3322,7 @@ static void handle_stripe6(struct stripe_head *sh)
/* Now to look around and see what can be done */
rcu_read_lock();
+ spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
dev = &sh->dev[i];
@@ -3395,6 +3386,7 @@ static void handle_stripe6(struct stripe_head *sh)
s.failed++;
}
}
+ spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
if (unlikely(blocked_rdev)) {
@@ -3580,7 +3572,6 @@ static void handle_stripe6(struct stripe_head *sh)
handle_stripe_expansion(conf, sh, &r6s);
unlock:
- spin_unlock(&sh->lock);
/* wait for this device to become unblocked */
if (unlikely(blocked_rdev))
@@ -3608,10 +3599,19 @@ static void handle_stripe6(struct stripe_head *sh)
static void handle_stripe(struct stripe_head *sh)
{
+ clear_bit(STRIPE_HANDLE, &sh->state);
+ if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
+ /* already being handled, ensure it gets handled
+ * again when current action finishes */
+ set_bit(STRIPE_HANDLE, &sh->state);
+ return;
+ }
+
if (sh->raid_conf->level == 6)
handle_stripe6(sh);
else
handle_stripe5(sh);
+ clear_bit(STRIPE_ACTIVE, &sh->state);
}
static void raid5_activate_delayed(raid5_conf_t *conf)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a330011..217a9d4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -209,7 +209,6 @@ struct stripe_head {
short ddf_layout;/* use DDF ordering to calculate Q */
unsigned long state; /* state flags */
atomic_t count; /* nr of active thread/requests */
- spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
enum check_states check_state;
@@ -290,6 +289,7 @@ struct r6_state {
* Stripe state
*/
enum {
+ STRIPE_ACTIVE,
STRIPE_HANDLE,
STRIPE_SYNC_REQUESTED,
STRIPE_SYNCING,
* [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (3 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 5:31 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 04/34] md/raid5: use kmem_cache_zalloc() NeilBrown
` (28 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Adding these three fields will allow more common code to be moved
to handle_stripe()
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
drivers/md/raid5.h | 4 ++++
2 files changed, 28 insertions(+), 30 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c32ffb5..3327e82 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
- struct bio *return_bi = NULL;
struct stripe_head_state s;
struct r5dev *dev;
- mdk_rdev_t *blocked_rdev = NULL;
int prexor;
- int dec_preread_active = 0;
memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
@@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (blocked_rdev == NULL &&
+ if (s.blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- blocked_rdev = rdev;
+ s.blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(blocked_rdev)) {
+ if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
s.to_write || s.written) {
set_bit(STRIPE_HANDLE, &sh->state);
goto unlock;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(blocked_rdev, conf->mddev);
- blocked_rdev = NULL;
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
}
if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
@@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
* need to be failed
*/
if (s.failed > 1 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &return_bi);
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
if (s.failed > 1 && s.syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
@@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
!test_bit(R5_LOCKED, &dev->flags) &&
test_bit(R5_UPTODATE, &dev->flags)) ||
(s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
@@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- dec_preread_active = 1;
+ s.dec_preread_active = 1;
}
/* Now to consider new write requests and what else, if anything
@@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
unlock:
/* wait for this device to become unblocked */
- if (unlikely(blocked_rdev))
- md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+ if (unlikely(s.blocked_rdev))
+ md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
if (s.ops_request)
raid_run_ops(sh, s.ops_request);
ops_run_io(sh, &s);
- if (dec_preread_active) {
+ if (s.dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
* is waiting on a flush, it won't continue until the writes
* have actually been submitted.
@@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
}
- return_io(return_bi);
+ return_io(s.return_bi);
}
static void handle_stripe6(struct stripe_head *sh)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
- struct bio *return_bi = NULL;
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
struct stripe_head_state s;
struct r5dev *dev, *pdev, *qdev;
- mdk_rdev_t *blocked_rdev = NULL;
- int dec_preread_active = 0;
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
"pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
@@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
if (dev->written)
s.written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (blocked_rdev == NULL &&
+ if (s.blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- blocked_rdev = rdev;
+ s.blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(blocked_rdev)) {
+ if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
s.to_write || s.written) {
set_bit(STRIPE_HANDLE, &sh->state);
goto unlock;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(blocked_rdev, conf->mddev);
- blocked_rdev = NULL;
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
}
if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
@@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
* might need to be failed
*/
if (s.failed > 2 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &return_bi);
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
if (s.failed > 2 && s.syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
@@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
(s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
&& !test_bit(R5_LOCKED, &qdev->flags)
&& test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
@@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- dec_preread_active = 1;
+ s.dec_preread_active = 1;
}
/* Now to consider new write requests and what else, if anything
@@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
unlock:
/* wait for this device to become unblocked */
- if (unlikely(blocked_rdev))
- md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
+ if (unlikely(s.blocked_rdev))
+ md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
if (s.ops_request)
raid_run_ops(sh, s.ops_request);
@@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
ops_run_io(sh, &s);
- if (dec_preread_active) {
+ if (s.dec_preread_active) {
/* We delay this until after ops_run_io so that if make_request
* is waiting on a flush, it won't continue until the writes
* have actually been submitted.
@@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
md_wakeup_thread(conf->mddev->thread);
}
- return_io(return_bi);
+ return_io(s.return_bi);
}
static void handle_stripe(struct stripe_head *sh)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d3c61d3..9ceb574 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -248,6 +248,10 @@ struct stripe_head_state {
int failed_num[2];
unsigned long ops_request;
int p_failed, q_failed;
+
+ struct bio *return_bi;
+ mdk_rdev_t *blocked_rdev;
+ int dec_preread_active;
};
/* Flags */
* [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (14 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 9:10 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 18/34] md/raid5: move more common code into handle_stripe NeilBrown
` (17 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allowed to use 'read-modify-write'.
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical, so resolve the difference and create just one function.
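The only RAID6-specific part is how the rmw/rcw decision is seeded; a
sketch of the trick (the real counting loop follows in the patch):

	int rmw = 0, rcw = 0;

	if (conf->max_degraded == 2) {
		/* RAID6 can only do reconstruct-write at present, so seed
		 * the counters so that 'rcw <= rmw' already holds and the
		 * read-modify-write branch can never win; the true rcw
		 * count is recalculated inside the reconstruct-write branch.
		 */
		rcw = 1;
		rmw = 2;
	}
	/* otherwise (RAID5) count, per block, how many reads each strategy
	 * would need and choose whichever is cheaper.
	 */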
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 77 ++++++++++++++--------------------------------------
1 files changed, 21 insertions(+), 56 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ea796a9..b1eba98 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2468,11 +2468,19 @@ static void handle_stripe_clean_event(raid5_conf_t *conf,
md_wakeup_thread(conf->mddev->thread);
}
-static void handle_stripe_dirtying5(raid5_conf_t *conf,
- struct stripe_head *sh, struct stripe_head_state *s, int disks)
+static void handle_stripe_dirtying(raid5_conf_t *conf,
+ struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
{
int rmw = 0, rcw = 0, i;
- for (i = disks; i--; ) {
+ if (conf->max_degraded == 2) {
+ /* RAID6 requires 'rcw' in current implementation
+ * Calculate the real rcw later - for now fake it
+ * look like rcw is cheaper
+ */
+ rcw = 1; rmw = 2;
+ } else for (i = disks; i--; ) {
/* would I have to read this buffer for read_modify_write */
struct r5dev *dev = &sh->dev[i];
if ((dev->towrite || i == sh->pd_idx) &&
@@ -2519,16 +2527,19 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
}
}
}
- if (rcw <= rmw && rcw > 0)
+ if (rcw <= rmw && rcw > 0) {
/* want reconstruct write, but need to get some data */
+ rcw = 0;
for (i = disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
if (!test_bit(R5_OVERWRITE, &dev->flags) &&
- i != sh->pd_idx &&
+ i != sh->pd_idx && i != sh->qd_idx &&
!test_bit(R5_LOCKED, &dev->flags) &&
!(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_Wantcompute, &dev->flags)) &&
- test_bit(R5_Insync, &dev->flags)) {
+ test_bit(R5_Wantcompute, &dev->flags))) {
+ rcw++;
+ if (!test_bit(R5_Insync, &dev->flags))
+ continue; /* it's a failed drive */
if (
test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
pr_debug("Read_old block "
@@ -2542,6 +2553,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
}
}
}
+ }
/* now if nothing is locked, and if we have enough data,
* we can start a write request
*/
@@ -2558,53 +2570,6 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
schedule_reconstruction(sh, s, rcw == 0, 0);
}
-static void handle_stripe_dirtying6(raid5_conf_t *conf,
- struct stripe_head *sh, struct stripe_head_state *s,
- int disks)
-{
- int rcw = 0, pd_idx = sh->pd_idx, i;
- int qd_idx = sh->qd_idx;
-
- set_bit(STRIPE_HANDLE, &sh->state);
- for (i = disks; i--; ) {
- struct r5dev *dev = &sh->dev[i];
- /* check if we haven't enough data */
- if (!test_bit(R5_OVERWRITE, &dev->flags) &&
- i != pd_idx && i != qd_idx &&
- !test_bit(R5_LOCKED, &dev->flags) &&
- !(test_bit(R5_UPTODATE, &dev->flags) ||
- test_bit(R5_Wantcompute, &dev->flags))) {
- rcw++;
- if (!test_bit(R5_Insync, &dev->flags))
- continue; /* it's a failed drive */
-
- if (
- test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
- pr_debug("Read_old stripe %llu "
- "block %d for Reconstruct\n",
- (unsigned long long)sh->sector, i);
- set_bit(R5_LOCKED, &dev->flags);
- set_bit(R5_Wantread, &dev->flags);
- s->locked++;
- } else {
- pr_debug("Request delayed stripe %llu "
- "block %d for Reconstruct\n",
- (unsigned long long)sh->sector, i);
- set_bit(STRIPE_DELAYED, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- }
- }
- }
- /* now if nothing is locked, and if we have enough data, we can start a
- * write request
- */
- if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
- s->locked == 0 && rcw == 0 &&
- !test_bit(STRIPE_BIT_DELAY, &sh->state)) {
- schedule_reconstruction(sh, s, 1, 0);
- }
-}
-
static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
struct stripe_head_state *s, int disks)
{
@@ -3100,7 +3065,7 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
* block.
*/
if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying5(conf, sh, s, disks);
+ handle_stripe_dirtying(conf, sh, s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3289,7 +3254,7 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
* block.
*/
if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying6(conf, sh, s, disks);
+ handle_stripe_dirtying(conf, sh, s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
* [md PATCH 19/34] md/raid5: move some more common code into handle_stripe
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (17 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction " NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 9:29 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 23/34] md: introduce link/unlink_rdev() helpers NeilBrown
` (14 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
The RAID6 version of this code is usable for RAID5 providing:
- we test "conf->max_degraded" rather than "2" as appropriate
- we make sure s->failed_num[1] is meaningful (and not '-1')
when s->failed > 1
The 'return 1' must become 'goto finish' in the new location.
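Roughly, the shared path ends up as below (a condensed sketch of the hunks
that follow, assuming the usual raid5.c definitions; not the literal diff):

        /* per-device accounting now records up to two failures;
         * RAID5 simply never looks at failed_num[1]
         */
        if (!test_bit(R5_Insync, &dev->flags)) {
                if (s->failed < 2)
                        s->failed_num[s->failed] = i;
                s->failed++;
        }

        /* in handle_stripe(): max_degraded is 1 for RAID5, 2 for RAID6,
         * and q_failed is forced true when there is no Q block at all
         */
        if (s.failed > conf->max_degraded && s.syncing) {
                md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
                clear_bit(STRIPE_SYNCING, &sh->state);
                s.syncing = 0;
        }
        s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
                || (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
                || conf->level < 6;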
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 178 +++++++++++++++++++---------------------------------
1 files changed, 66 insertions(+), 112 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ba6f892..f23848f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2969,63 +2969,14 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
+ if (s->failed < 2)
+ s->failed_num[s->failed] = i;
s->failed++;
- s->failed_num[0] = i;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(s->blocked_rdev)) {
- if (s->syncing || s->expanding || s->expanded ||
- s->to_write || s->written) {
- set_bit(STRIPE_HANDLE, &sh->state);
- return 1;
- }
- /* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s->blocked_rdev, conf->mddev);
- s->blocked_rdev = NULL;
- }
-
- if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
- set_bit(STRIPE_BIOFILL_RUN, &sh->state);
- }
-
- pr_debug("locked=%d uptodate=%d to_read=%d"
- " to_write=%d failed=%d failed_num=%d\n",
- s->locked, s->uptodate, s->to_read, s->to_write,
- s->failed, s->failed_num[0]);
- /* check if the array has lost two devices and, if so, some requests might
- * need to be failed
- */
- if (s->failed > 1 && s->to_read+s->to_write+s->written)
- handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
- if (s->failed > 1 && s->syncing) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,0);
- clear_bit(STRIPE_SYNCING, &sh->state);
- s->syncing = 0;
- }
-
- /* might be able to return some write requests if the parity block
- * is safe, or on a failed drive
- */
- dev = &sh->dev[sh->pd_idx];
- if (s->written &&
- ((test_bit(R5_Insync, &dev->flags) &&
- !test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags)) ||
- (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
-
- /* Now we might consider reading some blocks, either to check/generate
- * parity, or to satisfy requests
- * or to load a block that is being partially written.
- */
- if (s->to_read || s->non_overwrite ||
- (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill(sh, s, disks);
-
return 0;
}
@@ -3033,8 +2984,8 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
- int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
- struct r5dev *dev, *pdev, *qdev;
+ struct r5dev *dev;
+ int i;
/* Now to look around and see what can be done */
@@ -3108,65 +3059,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(s->blocked_rdev)) {
- if (s->syncing || s->expanding || s->expanded ||
- s->to_write || s->written) {
- set_bit(STRIPE_HANDLE, &sh->state);
- return 1;
- }
- /* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s->blocked_rdev, conf->mddev);
- s->blocked_rdev = NULL;
- }
-
- if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
- set_bit(STRIPE_BIOFILL_RUN, &sh->state);
- }
-
- pr_debug("locked=%d uptodate=%d to_read=%d"
- " to_write=%d failed=%d failed_num=%d,%d\n",
- s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
- s->failed_num[0], s->failed_num[1]);
- /* check if the array has lost >2 devices and, if so, some requests
- * might need to be failed
- */
- if (s->failed > 2 && s->to_read+s->to_write+s->written)
- handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
- if (s->failed > 2 && s->syncing) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,0);
- clear_bit(STRIPE_SYNCING, &sh->state);
- s->syncing = 0;
- }
-
- /*
- * might be able to return some write requests if the parity blocks
- * are safe, or on a failed drive
- */
- pdev = &sh->dev[pd_idx];
- s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
- || (s->failed >= 2 && s->failed_num[1] == pd_idx);
- qdev = &sh->dev[qd_idx];
- s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
- || (s->failed >= 2 && s->failed_num[1] == qd_idx);
-
- if (s->written &&
- (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
- && !test_bit(R5_LOCKED, &pdev->flags)
- && test_bit(R5_UPTODATE, &pdev->flags)))) &&
- (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
- && !test_bit(R5_LOCKED, &qdev->flags)
- && test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
-
- /* Now we might consider reading some blocks, either to check/generate
- * parity, or to satisfy requests
- * or to load a block that is being partially written.
- */
- if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
- (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill(sh, s, disks);
-
return 0;
}
@@ -3178,6 +3070,7 @@ static void handle_stripe(struct stripe_head *sh)
int i;
int prexor;
int disks = sh->disks;
+ struct r5dev *pdev, *qdev;
clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3214,6 +3107,67 @@ static void handle_stripe(struct stripe_head *sh)
if (done)
goto finish;
+ if (unlikely(s.blocked_rdev)) {
+ if (s.syncing || s.expanding || s.expanded ||
+ s.to_write || s.written) {
+ set_bit(STRIPE_HANDLE, &sh->state);
+ goto finish;
+ }
+ /* There is nothing for the blocked_rdev to block */
+ rdev_dec_pending(s.blocked_rdev, conf->mddev);
+ s.blocked_rdev = NULL;
+ }
+
+ if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ set_bit(STRIPE_BIOFILL_RUN, &sh->state);
+ }
+
+ pr_debug("locked=%d uptodate=%d to_read=%d"
+ " to_write=%d failed=%d failed_num=%d,%d\n",
+ s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
+ s.failed_num[0], s.failed_num[1]);
+ /* check if the array has lost >2 devices and, if so, some requests
+ * might need to be failed
+ */
+ if (s.failed > conf->max_degraded && s.to_read+s.to_write+s.written)
+ handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
+ if (s.failed > conf->max_degraded && s.syncing) {
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
+ clear_bit(STRIPE_SYNCING, &sh->state);
+ s.syncing = 0;
+ }
+
+ /*
+ * might be able to return some write requests if the parity blocks
+ * are safe, or on a failed drive
+ */
+ pdev = &sh->dev[sh->pd_idx];
+ s.p_failed = (s.failed >= 1 && s.failed_num[0] == sh->pd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == sh->pd_idx);
+ qdev = &sh->dev[sh->qd_idx];
+ s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
+ || (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
+ || conf->level < 6;
+
+ if (s.written &&
+ (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
+ && !test_bit(R5_LOCKED, &pdev->flags)
+ && test_bit(R5_UPTODATE, &pdev->flags)))) &&
+ (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
+ && !test_bit(R5_LOCKED, &qdev->flags)
+ && test_bit(R5_UPTODATE, &qdev->flags)))))
+ handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+
+ /* Now we might consider reading some blocks, either to check/generate
+ * parity, or to satisfy requests
+ * or to load a block that is being partially written.
+ */
+ if (s.to_read || s.non_overwrite
+ || (conf->level == 6 && s.to_write && s.failed)
+ || (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
+ handle_stripe_fill(sh, &s, disks);
+
/* Now we check to see if any write operations have recently
* completed
*/
* [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (11 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 14/34] md/raid5: move more code into common handle_stripe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 7:37 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6 NeilBrown
` (20 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
The next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little clearer.
For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.
Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.
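After the rearrangement the RAID6 test reads roughly as follows (a
simplified sketch of the hunk below; the final clause is the
reconstruct-write case that RAID5 does not need):

        if (!test_bit(R5_LOCKED, &dev->flags) &&
            !test_bit(R5_UPTODATE, &dev->flags) &&
            (dev->toread ||
             (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
             s->syncing || s->expanding ||
             (s->failed >= 1 && fdev[0]->toread) ||
             (s->failed >= 2 && fdev[1]->toread) ||
             /* RAID6 only: a write plus a failed device forces a
              * reconstruct-write, so every block must be read or computed
              */
             (s->failed && s->to_write))) {
                /* read or compute this block */
        }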
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e41b622..100ac1f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2326,9 +2326,8 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
(dev->toread ||
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
s->syncing || s->expanding ||
- (s->failed &&
- (failed_dev->toread ||
- (failed_dev->towrite &&
+ (s->failed && failed_dev->toread) ||
+ (s->failed && failed_dev->towrite &&
!test_bit(R5_OVERWRITE, &failed_dev->flags)))))) {
/* We would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
@@ -2399,10 +2398,9 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
(dev->toread ||
(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
s->syncing || s->expanding ||
- (s->failed >= 1 &&
- (fdev[0]->toread || s->to_write)) ||
- (s->failed >= 2 &&
- (fdev[1]->toread || s->to_write)))) {
+ (s->failed >= 1 && fdev[0]->toread) ||
+ (s->failed >= 2 && fdev[1]->toread) ||
+ (s->failed && s->to_write)) {
/* we would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
*/
* [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (12 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6 NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 8:24 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe NeilBrown
` (19 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.
So remove the latter and rename the former to simply "fetch_block".
Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.
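In practice that just means seeding failed_num[] with an impossible index
before the analysis; a minimal sketch of the shape, using the names from
the diff below:

        /* in handle_stripe(), before calling the per-level analysis */
        s.failed_num[0] = -1;           /* not a valid device number */
        s.failed_num[1] = -1;

        /* in fetch_block(): both pointers are formed unconditionally, but
         * fdev[n] is only dereferenced behind an s->failed test, so a
         * '-1' slot is never actually followed
         */
        struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
                                  &sh->dev[s->failed_num[1]] };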
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 107 +++++++++++-----------------------------------------
1 files changed, 23 insertions(+), 84 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 100ac1f..ea796a9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2308,91 +2308,20 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
md_wakeup_thread(conf->mddev->thread);
}
-/* fetch_block5 - checks the given member device to see if its data needs
+/* fetch_block - checks the given member device to see if its data needs
* to be read or computed to satisfy a request.
*
* Returns 1 when no more member devices need to be checked, otherwise returns
- * 0 to tell the loop in handle_stripe_fill5 to continue
+ * 0 to tell the loop in handle_stripe_fill to continue
*/
-static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
- int disk_idx, int disks)
-{
- struct r5dev *dev = &sh->dev[disk_idx];
- struct r5dev *failed_dev = &sh->dev[s->failed_num[0]];
-
- /* is the data in this block needed, and can we get it? */
- if (!test_bit(R5_LOCKED, &dev->flags) &&
- !test_bit(R5_UPTODATE, &dev->flags) &&
- (dev->toread ||
- (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
- s->syncing || s->expanding ||
- (s->failed && failed_dev->toread) ||
- (s->failed && failed_dev->towrite &&
- !test_bit(R5_OVERWRITE, &failed_dev->flags)))))) {
- /* We would like to get this block, possibly by computing it,
- * otherwise read it if the backing disk is insync
- */
- if ((s->uptodate == disks - 1) &&
- (s->failed && disk_idx == s->failed_num[0])) {
- set_bit(STRIPE_COMPUTE_RUN, &sh->state);
- set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
- set_bit(R5_Wantcompute, &dev->flags);
- sh->ops.target = disk_idx;
- sh->ops.target2 = -1;
- s->req_compute = 1;
- /* Careful: from this point on 'uptodate' is in the eye
- * of raid_run_ops which services 'compute' operations
- * before writes. R5_Wantcompute flags a block that will
- * be R5_UPTODATE by the time it is needed for a
- * subsequent operation.
- */
- s->uptodate++;
- return 1; /* uptodate + compute == disks */
- } else if (test_bit(R5_Insync, &dev->flags)) {
- set_bit(R5_LOCKED, &dev->flags);
- set_bit(R5_Wantread, &dev->flags);
- s->locked++;
- pr_debug("Reading block %d (sync=%d)\n", disk_idx,
- s->syncing);
- }
- }
-
- return 0;
-}
-
-/**
- * handle_stripe_fill5 - read or compute data to satisfy pending requests.
- */
-static void handle_stripe_fill5(struct stripe_head *sh,
- struct stripe_head_state *s, int disks)
-{
- int i;
-
- /* look for blocks to read/compute, skip this if a compute
- * is already in flight, or if the stripe contents are in the
- * midst of changing due to a write
- */
- if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
- !sh->reconstruct_state)
- for (i = disks; i--; )
- if (fetch_block5(sh, s, i, disks))
- break;
- set_bit(STRIPE_HANDLE, &sh->state);
-}
-
-/* fetch_block6 - checks the given member device to see if its data needs
- * to be read or computed to satisfy a request.
- *
- * Returns 1 when no more member devices need to be checked, otherwise returns
- * 0 to tell the loop in handle_stripe_fill6 to continue
- */
-static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
- int disk_idx, int disks)
+static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
+ int disk_idx, int disks)
{
struct r5dev *dev = &sh->dev[disk_idx];
struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
&sh->dev[s->failed_num[1]] };
+ /* is the data in this block needed, and can we get it? */
if (!test_bit(R5_LOCKED, &dev->flags) &&
!test_bit(R5_UPTODATE, &dev->flags) &&
(dev->toread ||
@@ -2400,7 +2329,9 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
s->syncing || s->expanding ||
(s->failed >= 1 && fdev[0]->toread) ||
(s->failed >= 2 && fdev[1]->toread) ||
- (s->failed && s->to_write)) {
+ (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
+ !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
+ (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
/* we would like to get this block, possibly by computing it,
* otherwise read it if the backing disk is insync
*/
@@ -2420,6 +2351,12 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
sh->ops.target = disk_idx;
sh->ops.target2 = -1; /* no 2nd target */
s->req_compute = 1;
+ /* Careful: from this point on 'uptodate' is in the eye
+ * of raid_run_ops which services 'compute' operations
+ * before writes. R5_Wantcompute flags a block that will
+ * be R5_UPTODATE by the time it is needed for a
+ * subsequent operation.
+ */
s->uptodate++;
return 1;
} else if (s->uptodate == disks-2 && s->failed >= 2) {
@@ -2460,11 +2397,11 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
}
/**
- * handle_stripe_fill6 - read or compute data to satisfy pending requests.
+ * handle_stripe_fill - read or compute data to satisfy pending requests.
*/
-static void handle_stripe_fill6(struct stripe_head *sh,
- struct stripe_head_state *s,
- int disks)
+static void handle_stripe_fill(struct stripe_head *sh,
+ struct stripe_head_state *s,
+ int disks)
{
int i;
@@ -2475,7 +2412,7 @@ static void handle_stripe_fill6(struct stripe_head *sh,
if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
!sh->reconstruct_state)
for (i = disks; i--; )
- if (fetch_block6(sh, s, i, disks))
+ if (fetch_block(sh, s, i, disks))
break;
set_bit(STRIPE_HANDLE, &sh->state);
}
@@ -3123,7 +3060,7 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
*/
if (s->to_read || s->non_overwrite ||
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill5(sh, s, disks);
+ handle_stripe_fill(sh, s, disks);
/* Now we check to see if any write operations have recently
* completed
@@ -3314,7 +3251,7 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
*/
if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
- handle_stripe_fill6(sh, s, disks);
+ handle_stripe_fill(sh, s, disks);
/* Now we check to see if any write operations have recently
* completed
@@ -3398,6 +3335,8 @@ static void handle_stripe(struct stripe_head *sh)
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+ s.failed_num[0] = -1;
+ s.failed_num[1] = -1;
if (conf->level == 6)
done = handle_stripe6(sh, &s);
* [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction into handle_stripe.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (16 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 18/34] md/raid5: move more common code into handle_stripe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 7:09 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 19/34] md/raid5: move some more common code " NeilBrown
` (15 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places. That was careless.
The raid5 code was correct, so move it out into handle_stripe and
remove the raid6 version.
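The resulting shape of handle_stripe(), sketched from the hunks below: the
per-level helpers now return non-zero when the stripe needs no further
handling, and the formerly duplicated tail runs once in the caller.

        static void handle_stripe(struct stripe_head *sh)
        {
                struct stripe_head_state s;
                raid5_conf_t *conf = sh->raid_conf;
                int done;
                ...
                if (conf->level == 6)
                        done = handle_stripe6(sh, &s);
                else
                        done = handle_stripe5(sh, &s);
                if (done)
                        goto finish;

                /* common tail: finish reconstruct operations initiated by
                 * the expansion process, then schedule any new
                 * reconstruction or expansion work
                 */
                ...
        finish:
                /* wait for blocked rdevs, run ops, submit I/O, return bios */
                ...
        }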
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 153 +++++++++++++++++++---------------------------------
1 files changed, 57 insertions(+), 96 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ce1c291..7b562fc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2999,7 +2999,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
*
*/
-static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
+static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
@@ -3081,7 +3081,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
if (s->syncing || s->expanding || s->expanded ||
s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- return;
+ return 1;
}
/* There is nothing for the blocked_rdev to block */
rdev_dec_pending(s->blocked_rdev, conf->mddev);
@@ -3205,54 +3205,10 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
s->locked++;
}
}
-
- /* Finish reconstruct operations initiated by the expansion process */
- if (sh->reconstruct_state == reconstruct_state_result) {
- struct stripe_head *sh2
- = get_active_stripe(conf, sh->sector, 1, 1, 1);
- if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
- /* sh cannot be written until sh2 has been read.
- * so arrange for sh to be delayed a little
- */
- set_bit(STRIPE_DELAYED, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
- &sh2->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe(sh2);
- return;
- }
- if (sh2)
- release_stripe(sh2);
-
- sh->reconstruct_state = reconstruct_state_idle;
- clear_bit(STRIPE_EXPANDING, &sh->state);
- for (i = conf->raid_disks; i--; ) {
- set_bit(R5_Wantwrite, &sh->dev[i].flags);
- set_bit(R5_LOCKED, &sh->dev[i].flags);
- s->locked++;
- }
- }
-
- if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
- !sh->reconstruct_state) {
- /* Need to write out all blocks after computing parity */
- sh->disks = conf->raid_disks;
- stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, s, 1, 1);
- } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
- clear_bit(STRIPE_EXPAND_READY, &sh->state);
- atomic_dec(&conf->reshape_stripes);
- wake_up(&conf->wait_for_overlap);
- md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
- }
-
- if (s->expanding && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, NULL);
+ return 0;
}
-static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
+static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
@@ -3335,7 +3291,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
if (s->syncing || s->expanding || s->expanded ||
s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- return;
+ return 1;
}
/* There is nothing for the blocked_rdev to block */
rdev_dec_pending(s->blocked_rdev, conf->mddev);
@@ -3468,57 +3424,15 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
}
}
}
-
- /* Finish reconstruct operations initiated by the expansion process */
- if (sh->reconstruct_state == reconstruct_state_result) {
- sh->reconstruct_state = reconstruct_state_idle;
- clear_bit(STRIPE_EXPANDING, &sh->state);
- for (i = conf->raid_disks; i--; ) {
- set_bit(R5_Wantwrite, &sh->dev[i].flags);
- set_bit(R5_LOCKED, &sh->dev[i].flags);
- s->locked++;
- }
- }
-
- if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
- !sh->reconstruct_state) {
- struct stripe_head *sh2
- = get_active_stripe(conf, sh->sector, 1, 1, 1);
- if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
- /* sh cannot be written until sh2 has been read.
- * so arrange for sh to be delayed a little
- */
- set_bit(STRIPE_DELAYED, &sh->state);
- set_bit(STRIPE_HANDLE, &sh->state);
- if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
- &sh2->state))
- atomic_inc(&conf->preread_active_stripes);
- release_stripe(sh2);
- return;
- }
- if (sh2)
- release_stripe(sh2);
-
- /* Need to write out all blocks after computing P&Q */
- sh->disks = conf->raid_disks;
- stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, s, 1, 1);
- } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
- clear_bit(STRIPE_EXPAND_READY, &sh->state);
- atomic_dec(&conf->reshape_stripes);
- wake_up(&conf->wait_for_overlap);
- md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
- }
-
- if (s->expanding && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, s);
+ return 0;
}
static void handle_stripe(struct stripe_head *sh)
{
struct stripe_head_state s;
+ int done;
raid5_conf_t *conf = sh->raid_conf;
+ int i;
clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3546,11 +3460,58 @@ static void handle_stripe(struct stripe_head *sh)
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
if (conf->level == 6)
- handle_stripe6(sh, &s);
+ done = handle_stripe6(sh, &s);
else
- handle_stripe5(sh, &s);
+ done = handle_stripe5(sh, &s);
+
+ if (done)
+ goto finish;
+ /* Finish reconstruct operations initiated by the expansion process */
+ if (sh->reconstruct_state == reconstruct_state_result) {
+ struct stripe_head *sh2
+ = get_active_stripe(conf, sh->sector, 1, 1, 1);
+ if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
+ /* sh cannot be written until sh2 has been read.
+ * so arrange for sh to be delayed a little
+ */
+ set_bit(STRIPE_DELAYED, &sh->state);
+ set_bit(STRIPE_HANDLE, &sh->state);
+ if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
+ &sh2->state))
+ atomic_inc(&conf->preread_active_stripes);
+ release_stripe(sh2);
+ goto finish;
+ }
+ if (sh2)
+ release_stripe(sh2);
+
+ sh->reconstruct_state = reconstruct_state_idle;
+ clear_bit(STRIPE_EXPANDING, &sh->state);
+ for (i = conf->raid_disks; i--; ) {
+ set_bit(R5_Wantwrite, &sh->dev[i].flags);
+ set_bit(R5_LOCKED, &sh->dev[i].flags);
+ s.locked++;
+ }
+ }
+ if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ !sh->reconstruct_state) {
+ /* Need to write out all blocks after computing parity */
+ sh->disks = conf->raid_disks;
+ stripe_set_idx(sh->sector, conf, 0, sh);
+ schedule_reconstruction(sh, &s, 1, 1);
+ } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ clear_bit(STRIPE_EXPAND_READY, &sh->state);
+ atomic_dec(&conf->reshape_stripes);
+ wake_up(&conf->wait_for_overlap);
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+ }
+
+ if (s.expanding && s.locked == 0 &&
+ !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
+ handle_stripe_expansion(conf, sh, NULL);
+finish:
/* wait for this device to become unblocked */
if (unlikely(s.blocked_rdev))
md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
* [md PATCH 14/34] md/raid5: move more code into common handle_stripe
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (10 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 09/34] md/raid5: move common code into handle_stripe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 7:32 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6 NeilBrown
` (21 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
The difference between the RAID5 and RAID6 code here is easily
resolved using conf->max_degraded.
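Concretely, the RAID5 single-failure special case and the RAID6 loop over
up to two failures collapse into one loop bounded by conf->max_degraded
(1 for RAID5, 2 for RAID6), roughly:

        if (s.failed <= conf->max_degraded && !conf->mddev->ro)
                for (i = 0; i < s.failed; i++) {
                        struct r5dev *dev = &sh->dev[s.failed_num[i]];
                        if (test_bit(R5_ReadError, &dev->flags) &&
                            !test_bit(R5_LOCKED, &dev->flags) &&
                            test_bit(R5_UPTODATE, &dev->flags)) {
                                /* schedule a rewrite of the bad block, or
                                 * read it back if the rewrite is done
                                 */
                        }
                }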
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 90 ++++++++++++++++++----------------------------------
1 files changed, 32 insertions(+), 58 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b562fc..e41b622 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3177,34 +3177,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
handle_parity_checks5(conf, sh, s, disks);
-
- if (s->syncing && s->locked == 0
- && test_bit(STRIPE_INSYNC, &sh->state)) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,1);
- clear_bit(STRIPE_SYNCING, &sh->state);
- }
-
- /* If the failed drive is just a ReadError, then we might need to progress
- * the repair/check process
- */
- if (s->failed == 1 && !conf->mddev->ro &&
- test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
- && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
- && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
- ) {
- dev = &sh->dev[s->failed_num[0]];
- if (!test_bit(R5_ReWrite, &dev->flags)) {
- set_bit(R5_Wantwrite, &dev->flags);
- set_bit(R5_ReWrite, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- } else {
- /* let's read it back */
- set_bit(R5_Wantread, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- }
- }
return 0;
}
@@ -3394,36 +3366,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
handle_parity_checks6(conf, sh, s, disks);
-
- if (s->syncing && s->locked == 0
- && test_bit(STRIPE_INSYNC, &sh->state)) {
- md_done_sync(conf->mddev, STRIPE_SECTORS,1);
- clear_bit(STRIPE_SYNCING, &sh->state);
- }
-
- /* If the failed drives are just a ReadError, then we might need
- * to progress the repair/check process
- */
- if (s->failed <= 2 && !conf->mddev->ro)
- for (i = 0; i < s->failed; i++) {
- dev = &sh->dev[s->failed_num[i]];
- if (test_bit(R5_ReadError, &dev->flags)
- && !test_bit(R5_LOCKED, &dev->flags)
- && test_bit(R5_UPTODATE, &dev->flags)
- ) {
- if (!test_bit(R5_ReWrite, &dev->flags)) {
- set_bit(R5_Wantwrite, &dev->flags);
- set_bit(R5_ReWrite, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- } else {
- /* let's read it back */
- set_bit(R5_Wantread, &dev->flags);
- set_bit(R5_LOCKED, &dev->flags);
- s->locked++;
- }
- }
- }
return 0;
}
@@ -3466,6 +3408,38 @@ static void handle_stripe(struct stripe_head *sh)
if (done)
goto finish;
+
+
+ if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+ clear_bit(STRIPE_SYNCING, &sh->state);
+ }
+
+ /* If the failed drives are just a ReadError, then we might need
+ * to progress the repair/check process
+ */
+ if (s.failed <= conf->max_degraded && !conf->mddev->ro)
+ for (i = 0; i < s.failed; i++) {
+ struct r5dev *dev = &sh->dev[s.failed_num[i]];
+ if (test_bit(R5_ReadError, &dev->flags)
+ && !test_bit(R5_LOCKED, &dev->flags)
+ && test_bit(R5_UPTODATE, &dev->flags)
+ ) {
+ if (!test_bit(R5_ReWrite, &dev->flags)) {
+ set_bit(R5_Wantwrite, &dev->flags);
+ set_bit(R5_ReWrite, &dev->flags);
+ set_bit(R5_LOCKED, &dev->flags);
+ s.locked++;
+ } else {
+ /* let's read it back */
+ set_bit(R5_Wantread, &dev->flags);
+ set_bit(R5_LOCKED, &dev->flags);
+ s.locked++;
+ }
+ }
+ }
+
+
/* Finish reconstruct operations initiated by the expansion process */
if (sh->reconstruct_state == reconstruct_state_result) {
struct stripe_head *sh2
* [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (13 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6 NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 5:41 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 NeilBrown
` (18 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() into handle_stripe().
This means that all accesses to the stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches will hoist most of this code up into handle_stripe(),
so we will revert to "s.".
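So after this patch handle_stripe() owns the state and the per-level
functions merely fill it in and act on it; a simplified sketch of the new
caller (details elided):

        static void handle_stripe(struct stripe_head *sh)
        {
                struct stripe_head_state s;
                raid5_conf_t *conf = sh->raid_conf;

                memset(&s, 0, sizeof(s));
                s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
                s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
                s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);

                if (conf->level == 6)
                        handle_stripe6(sh, &s);
                else
                        handle_stripe5(sh, &s);

                /* the blocked_rdev wait, raid_run_ops, ops_run_io and
                 * return_io also move here, out of each helper
                 */
        }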
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 340 ++++++++++++++++++++++++----------------------------
1 files changed, 158 insertions(+), 182 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3327e82..ce1c291 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2999,24 +2999,13 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
*
*/
-static void handle_stripe5(struct stripe_head *sh)
+static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
- struct stripe_head_state s;
struct r5dev *dev;
int prexor;
- memset(&s, 0, sizeof(s));
- pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
- "reconstruct:%d\n", (unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), sh->pd_idx, sh->check_state,
- sh->reconstruct_state);
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
-
/* Now to look around and see what can be done */
rcu_read_lock();
spin_lock_irq(&conf->device_lock);
@@ -3039,25 +3028,28 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(R5_Wantfill, &dev->flags);
/* now count some things */
- if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
- if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
- if (test_bit(R5_Wantcompute, &dev->flags)) s.compute++;
+ if (test_bit(R5_LOCKED, &dev->flags))
+ s->locked++;
+ if (test_bit(R5_UPTODATE, &dev->flags))
+ s->uptodate++;
+ if (test_bit(R5_Wantcompute, &dev->flags))
+ s->compute++;
if (test_bit(R5_Wantfill, &dev->flags))
- s.to_fill++;
+ s->to_fill++;
else if (dev->toread)
- s.to_read++;
+ s->to_read++;
if (dev->towrite) {
- s.to_write++;
+ s->to_write++;
if (!test_bit(R5_OVERWRITE, &dev->flags))
- s.non_overwrite++;
+ s->non_overwrite++;
}
if (dev->written)
- s.written++;
+ s->written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (s.blocked_rdev == NULL &&
+ if (s->blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s.blocked_rdev = rdev;
+ s->blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3078,62 +3070,62 @@ static void handle_stripe5(struct stripe_head *sh)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
- s.failed++;
- s.failed_num[0] = i;
+ s->failed++;
+ s->failed_num[0] = i;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(s.blocked_rdev)) {
- if (s.syncing || s.expanding || s.expanded ||
- s.to_write || s.written) {
+ if (unlikely(s->blocked_rdev)) {
+ if (s->syncing || s->expanding || s->expanded ||
+ s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- goto unlock;
+ return;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s.blocked_rdev, conf->mddev);
- s.blocked_rdev = NULL;
+ rdev_dec_pending(s->blocked_rdev, conf->mddev);
+ s->blocked_rdev = NULL;
}
- if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
set_bit(STRIPE_BIOFILL_RUN, &sh->state);
}
pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d\n",
- s.locked, s.uptodate, s.to_read, s.to_write,
- s.failed, s.failed_num[0]);
+ s->locked, s->uptodate, s->to_read, s->to_write,
+ s->failed, s->failed_num[0]);
/* check if the array has lost two devices and, if so, some requests might
* need to be failed
*/
- if (s.failed > 1 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
- if (s.failed > 1 && s.syncing) {
+ if (s->failed > 1 && s->to_read+s->to_write+s->written)
+ handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
+ if (s->failed > 1 && s->syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
- s.syncing = 0;
+ s->syncing = 0;
}
/* might be able to return some write requests if the parity block
* is safe, or on a failed drive
*/
dev = &sh->dev[sh->pd_idx];
- if ( s.written &&
- ((test_bit(R5_Insync, &dev->flags) &&
- !test_bit(R5_LOCKED, &dev->flags) &&
- test_bit(R5_UPTODATE, &dev->flags)) ||
- (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
- handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+ if (s->written &&
+ ((test_bit(R5_Insync, &dev->flags) &&
+ !test_bit(R5_LOCKED, &dev->flags) &&
+ test_bit(R5_UPTODATE, &dev->flags)) ||
+ (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
+ handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
* or to load a block that is being partially written.
*/
- if (s.to_read || s.non_overwrite ||
- (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
- handle_stripe_fill5(sh, &s, disks);
+ if (s->to_read || s->non_overwrite ||
+ (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
+ handle_stripe_fill5(sh, s, disks);
/* Now we check to see if any write operations have recently
* completed
@@ -3158,12 +3150,12 @@ static void handle_stripe5(struct stripe_head *sh)
if (prexor)
continue;
if (!test_bit(R5_Insync, &dev->flags) ||
- (i == sh->pd_idx && s.failed == 0))
+ (i == sh->pd_idx && s->failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s.dec_preread_active = 1;
+ s->dec_preread_active = 1;
}
/* Now to consider new write requests and what else, if anything
@@ -3172,8 +3164,8 @@ static void handle_stripe5(struct stripe_head *sh)
* 2/ A 'check' operation is in flight, as it may clobber the parity
* block.
*/
- if (s.to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying5(conf, sh, &s, disks);
+ if (s->to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying5(conf, sh, s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3181,12 +3173,13 @@ static void handle_stripe5(struct stripe_head *sh)
* dependent operations are in flight.
*/
if (sh->check_state ||
- (s.syncing && s.locked == 0 &&
+ (s->syncing && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks5(conf, sh, &s, disks);
+ handle_parity_checks5(conf, sh, s, disks);
- if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ if (s->syncing && s->locked == 0
+ && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS,1);
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -3194,22 +3187,22 @@ static void handle_stripe5(struct stripe_head *sh)
/* If the failed drive is just a ReadError, then we might need to progress
* the repair/check process
*/
- if (s.failed == 1 && !conf->mddev->ro &&
- test_bit(R5_ReadError, &sh->dev[s.failed_num[0]].flags)
- && !test_bit(R5_LOCKED, &sh->dev[s.failed_num[0]].flags)
- && test_bit(R5_UPTODATE, &sh->dev[s.failed_num[0]].flags)
+ if (s->failed == 1 && !conf->mddev->ro &&
+ test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
+ && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
+ && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
) {
- dev = &sh->dev[s.failed_num[0]];
+ dev = &sh->dev[s->failed_num[0]];
if (!test_bit(R5_ReWrite, &dev->flags)) {
set_bit(R5_Wantwrite, &dev->flags);
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
}
}
@@ -3227,7 +3220,7 @@ static void handle_stripe5(struct stripe_head *sh)
&sh2->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh2);
- goto unlock;
+ return;
}
if (sh2)
release_stripe(sh2);
@@ -3237,69 +3230,35 @@ static void handle_stripe5(struct stripe_head *sh)
for (i = conf->raid_disks; i--; ) {
set_bit(R5_Wantwrite, &sh->dev[i].flags);
set_bit(R5_LOCKED, &sh->dev[i].flags);
- s.locked++;
+ s->locked++;
}
}
- if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
!sh->reconstruct_state) {
/* Need to write out all blocks after computing parity */
sh->disks = conf->raid_disks;
stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, &s, 1, 1);
- } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ schedule_reconstruction(sh, s, 1, 1);
+ } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
clear_bit(STRIPE_EXPAND_READY, &sh->state);
atomic_dec(&conf->reshape_stripes);
wake_up(&conf->wait_for_overlap);
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
}
- if (s.expanding && s.locked == 0 &&
+ if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
handle_stripe_expansion(conf, sh, NULL);
-
- unlock:
-
- /* wait for this device to become unblocked */
- if (unlikely(s.blocked_rdev))
- md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
-
- if (s.ops_request)
- raid_run_ops(sh, s.ops_request);
-
- ops_run_io(sh, &s);
-
- if (s.dec_preread_active) {
- /* We delay this until after ops_run_io so that if make_request
- * is waiting on a flush, it won't continue until the writes
- * have actually been submitted.
- */
- atomic_dec(&conf->preread_active_stripes);
- if (atomic_read(&conf->preread_active_stripes) <
- IO_THRESHOLD)
- md_wakeup_thread(conf->mddev->thread);
- }
- return_io(s.return_bi);
}
-static void handle_stripe6(struct stripe_head *sh)
+static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
- struct stripe_head_state s;
struct r5dev *dev, *pdev, *qdev;
- pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
- "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
- (unsigned long long)sh->sector, sh->state,
- atomic_read(&sh->count), pd_idx, qd_idx,
- sh->check_state, sh->reconstruct_state);
- memset(&s, 0, sizeof(s));
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
rcu_read_lock();
@@ -3320,28 +3279,30 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantfill, &dev->flags);
/* now count some things */
- if (test_bit(R5_LOCKED, &dev->flags)) s.locked++;
- if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++;
+ if (test_bit(R5_LOCKED, &dev->flags))
+ s->locked++;
+ if (test_bit(R5_UPTODATE, &dev->flags))
+ s->uptodate++;
if (test_bit(R5_Wantcompute, &dev->flags)) {
- s.compute++;
- BUG_ON(s.compute > 2);
+ s->compute++;
+ BUG_ON(s->compute > 2);
}
if (test_bit(R5_Wantfill, &dev->flags)) {
- s.to_fill++;
+ s->to_fill++;
} else if (dev->toread)
- s.to_read++;
+ s->to_read++;
if (dev->towrite) {
- s.to_write++;
+ s->to_write++;
if (!test_bit(R5_OVERWRITE, &dev->flags))
- s.non_overwrite++;
+ s->non_overwrite++;
}
if (dev->written)
- s.written++;
+ s->written++;
rdev = rcu_dereference(conf->disks[i].rdev);
- if (s.blocked_rdev == NULL &&
+ if (s->blocked_rdev == NULL &&
rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s.blocked_rdev = rdev;
+ s->blocked_rdev = rdev;
atomic_inc(&rdev->nr_pending);
}
clear_bit(R5_Insync, &dev->flags);
@@ -3362,43 +3323,43 @@ static void handle_stripe6(struct stripe_head *sh)
if (test_bit(R5_ReadError, &dev->flags))
clear_bit(R5_Insync, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags)) {
- if (s.failed < 2)
- s.failed_num[s.failed] = i;
- s.failed++;
+ if (s->failed < 2)
+ s->failed_num[s->failed] = i;
+ s->failed++;
}
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
- if (unlikely(s.blocked_rdev)) {
- if (s.syncing || s.expanding || s.expanded ||
- s.to_write || s.written) {
+ if (unlikely(s->blocked_rdev)) {
+ if (s->syncing || s->expanding || s->expanded ||
+ s->to_write || s->written) {
set_bit(STRIPE_HANDLE, &sh->state);
- goto unlock;
+ return;
}
/* There is nothing for the blocked_rdev to block */
- rdev_dec_pending(s.blocked_rdev, conf->mddev);
- s.blocked_rdev = NULL;
+ rdev_dec_pending(s->blocked_rdev, conf->mddev);
+ s->blocked_rdev = NULL;
}
- if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
- set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
+ if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
+ set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
set_bit(STRIPE_BIOFILL_RUN, &sh->state);
}
pr_debug("locked=%d uptodate=%d to_read=%d"
" to_write=%d failed=%d failed_num=%d,%d\n",
- s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
- s.failed_num[0], s.failed_num[1]);
+ s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
+ s->failed_num[0], s->failed_num[1]);
/* check if the array has lost >2 devices and, if so, some requests
* might need to be failed
*/
- if (s.failed > 2 && s.to_read+s.to_write+s.written)
- handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
- if (s.failed > 2 && s.syncing) {
+ if (s->failed > 2 && s->to_read+s->to_write+s->written)
+ handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
+ if (s->failed > 2 && s->syncing) {
md_done_sync(conf->mddev, STRIPE_SECTORS,0);
clear_bit(STRIPE_SYNCING, &sh->state);
- s.syncing = 0;
+ s->syncing = 0;
}
/*
@@ -3406,28 +3367,28 @@ static void handle_stripe6(struct stripe_head *sh)
* are safe, or on a failed drive
*/
pdev = &sh->dev[pd_idx];
- s.p_failed = (s.failed >= 1 && s.failed_num[0] == pd_idx)
- || (s.failed >= 2 && s.failed_num[1] == pd_idx);
+ s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
+ || (s->failed >= 2 && s->failed_num[1] == pd_idx);
qdev = &sh->dev[qd_idx];
- s.q_failed = (s.failed >= 1 && s.failed_num[0] == qd_idx)
- || (s.failed >= 2 && s.failed_num[1] == qd_idx);
+ s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
+ || (s->failed >= 2 && s->failed_num[1] == qd_idx);
- if (s.written &&
- (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
+ if (s->written &&
+ (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
&& !test_bit(R5_LOCKED, &pdev->flags)
&& test_bit(R5_UPTODATE, &pdev->flags)))) &&
- (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
+ (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
&& !test_bit(R5_LOCKED, &qdev->flags)
&& test_bit(R5_UPTODATE, &qdev->flags)))))
- handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
+ handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
/* Now we might consider reading some blocks, either to check/generate
* parity, or to satisfy requests
* or to load a block that is being partially written.
*/
- if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
- (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
- handle_stripe_fill6(sh, &s, disks);
+ if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
+ (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
+ handle_stripe_fill6(sh, s, disks);
/* Now we check to see if any write operations have recently
* completed
@@ -3450,12 +3411,12 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantwrite, &dev->flags);
if (!test_bit(R5_Insync, &dev->flags) ||
((i == sh->pd_idx || i == qd_idx) &&
- s.failed == 0))
+ s->failed == 0))
set_bit(STRIPE_INSYNC, &sh->state);
}
}
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s.dec_preread_active = 1;
+ s->dec_preread_active = 1;
}
/* Now to consider new write requests and what else, if anything
@@ -3464,8 +3425,8 @@ static void handle_stripe6(struct stripe_head *sh)
* 2/ A 'check' operation is in flight, as it may clobber the parity
* block.
*/
- if (s.to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying6(conf, sh, &s, disks);
+ if (s->to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying6(conf, sh, s, disks);
/* maybe we need to check and possibly fix the parity for this stripe
* Any reads will already have been scheduled, so we just see if enough
@@ -3473,12 +3434,13 @@ static void handle_stripe6(struct stripe_head *sh)
* dependent operations are in flight.
*/
if (sh->check_state ||
- (s.syncing && s.locked == 0 &&
+ (s->syncing && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
!test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks6(conf, sh, &s, disks);
+ handle_parity_checks6(conf, sh, s, disks);
- if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
+ if (s->syncing && s->locked == 0
+ && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS,1);
clear_bit(STRIPE_SYNCING, &sh->state);
}
@@ -3486,9 +3448,9 @@ static void handle_stripe6(struct stripe_head *sh)
/* If the failed drives are just a ReadError, then we might need
* to progress the repair/check process
*/
- if (s.failed <= 2 && !conf->mddev->ro)
- for (i = 0; i < s.failed; i++) {
- dev = &sh->dev[s.failed_num[i]];
+ if (s->failed <= 2 && !conf->mddev->ro)
+ for (i = 0; i < s->failed; i++) {
+ dev = &sh->dev[s->failed_num[i]];
if (test_bit(R5_ReadError, &dev->flags)
&& !test_bit(R5_LOCKED, &dev->flags)
&& test_bit(R5_UPTODATE, &dev->flags)
@@ -3497,12 +3459,12 @@ static void handle_stripe6(struct stripe_head *sh)
set_bit(R5_Wantwrite, &dev->flags);
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
- s.locked++;
+ s->locked++;
}
}
}
@@ -3514,11 +3476,11 @@ static void handle_stripe6(struct stripe_head *sh)
for (i = conf->raid_disks; i--; ) {
set_bit(R5_Wantwrite, &sh->dev[i].flags);
set_bit(R5_LOCKED, &sh->dev[i].flags);
- s.locked++;
+ s->locked++;
}
}
- if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+ if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
!sh->reconstruct_state) {
struct stripe_head *sh2
= get_active_stripe(conf, sh->sector, 1, 1, 1);
@@ -3532,7 +3494,7 @@ static void handle_stripe6(struct stripe_head *sh)
&sh2->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe(sh2);
- goto unlock;
+ return;
}
if (sh2)
release_stripe(sh2);
@@ -3540,19 +3502,54 @@ static void handle_stripe6(struct stripe_head *sh)
/* Need to write out all blocks after computing P&Q */
sh->disks = conf->raid_disks;
stripe_set_idx(sh->sector, conf, 0, sh);
- schedule_reconstruction(sh, &s, 1, 1);
- } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
+ schedule_reconstruction(sh, s, 1, 1);
+ } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
clear_bit(STRIPE_EXPAND_READY, &sh->state);
atomic_dec(&conf->reshape_stripes);
wake_up(&conf->wait_for_overlap);
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
}
- if (s.expanding && s.locked == 0 &&
+ if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, &s);
+ handle_stripe_expansion(conf, sh, s);
+}
+
+static void handle_stripe(struct stripe_head *sh)
+{
+ struct stripe_head_state s;
+ raid5_conf_t *conf = sh->raid_conf;
+
+ clear_bit(STRIPE_HANDLE, &sh->state);
+ if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
+ /* already being handled, ensure it gets handled
+ * again when current action finishes */
+ set_bit(STRIPE_HANDLE, &sh->state);
+ return;
+ }
+
+ if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+ set_bit(STRIPE_SYNCING, &sh->state);
+ clear_bit(STRIPE_INSYNC, &sh->state);
+ }
+ clear_bit(STRIPE_DELAYED, &sh->state);
+
+ pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
+ "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
+ (unsigned long long)sh->sector, sh->state,
+ atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
+ sh->check_state, sh->reconstruct_state);
+ memset(&s, 0, sizeof(s));
+
+ s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+ s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+
+ if (conf->level == 6)
+ handle_stripe6(sh, &s);
+ else
+ handle_stripe5(sh, &s);
- unlock:
/* wait for this device to become unblocked */
if (unlikely(s.blocked_rdev))
@@ -3576,28 +3573,7 @@ static void handle_stripe6(struct stripe_head *sh)
}
return_io(s.return_bi);
-}
-static void handle_stripe(struct stripe_head *sh)
-{
- clear_bit(STRIPE_HANDLE, &sh->state);
- if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
- /* already being handled, ensure it gets handled
- * again when current action finishes */
- set_bit(STRIPE_HANDLE, &sh->state);
- return;
- }
-
- if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- set_bit(STRIPE_SYNCING, &sh->state);
- clear_bit(STRIPE_INSYNC, &sh->state);
- }
- clear_bit(STRIPE_DELAYED, &sh->state);
-
- if (sh->raid_conf->level == 6)
- handle_stripe6(sh);
- else
- handle_stripe5(sh);
clear_bit(STRIPE_ACTIVE, &sh->state);
}
* [md PATCH 18/34] md/raid5: move more common code into handle_stripe
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (15 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 9:20 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction " NeilBrown
` (16 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.
So combine them into one, adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.
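The combined check in handle_stripe() then dispatches on the level,
presumably along these lines (the diff shown here stops just before the
call, so the body is inferred from the description above):

        if (sh->check_state ||
            (s.syncing && s.locked == 0 &&
             !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
             !test_bit(STRIPE_INSYNC, &sh->state))) {
                if (conf->level == 6)
                        handle_parity_checks6(conf, sh, &s, disks);
                else
                        handle_parity_checks5(conf, sh, &s, disks);
        }

Note also the small hunk changing the "no parity disk" sentinel from ~0 to
-1, which lets the new common BUG_ON use 'sh->qd_idx >= 0' as a cheap
"this array has a Q block" test.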
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 161 ++++++++++++++++++++--------------------------------
1 files changed, 61 insertions(+), 100 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b1eba98..ba6f892 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1756,7 +1756,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
/*
* Select the parity disk based on the user selected algorithm.
*/
- pd_idx = qd_idx = ~0;
+ pd_idx = qd_idx = -1;
switch(conf->level) {
case 4:
pd_idx = data_disks;
@@ -2904,7 +2904,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks, i;
struct r5dev *dev;
- int prexor;
/* Now to look around and see what can be done */
rcu_read_lock();
@@ -3027,56 +3026,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
handle_stripe_fill(sh, s, disks);
- /* Now we check to see if any write operations have recently
- * completed
- */
- prexor = 0;
- if (sh->reconstruct_state == reconstruct_state_prexor_drain_result)
- prexor = 1;
- if (sh->reconstruct_state == reconstruct_state_drain_result ||
- sh->reconstruct_state == reconstruct_state_prexor_drain_result) {
- sh->reconstruct_state = reconstruct_state_idle;
-
- /* All the 'written' buffers and the parity block are ready to
- * be written back to disk
- */
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
- for (i = disks; i--; ) {
- dev = &sh->dev[i];
- if (test_bit(R5_LOCKED, &dev->flags) &&
- (i == sh->pd_idx || dev->written)) {
- pr_debug("Writing block %d\n", i);
- set_bit(R5_Wantwrite, &dev->flags);
- if (prexor)
- continue;
- if (!test_bit(R5_Insync, &dev->flags) ||
- (i == sh->pd_idx && s->failed == 0))
- set_bit(STRIPE_INSYNC, &sh->state);
- }
- }
- if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s->dec_preread_active = 1;
- }
-
- /* Now to consider new write requests and what else, if anything
- * should be read. We do not handle new writes when:
- * 1/ A 'write' operation (copy+xor) is already in flight.
- * 2/ A 'check' operation is in flight, as it may clobber the parity
- * block.
- */
- if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying(conf, sh, s, disks);
-
- /* maybe we need to check and possibly fix the parity for this stripe
- * Any reads will already have been scheduled, so we just see if enough
- * data is available. The parity check is held off while parity
- * dependent operations are in flight.
- */
- if (sh->check_state ||
- (s->syncing && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
- !test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks5(conf, sh, s, disks);
return 0;
}
@@ -3218,54 +3167,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
(s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
handle_stripe_fill(sh, s, disks);
- /* Now we check to see if any write operations have recently
- * completed
- */
- if (sh->reconstruct_state == reconstruct_state_drain_result) {
-
- sh->reconstruct_state = reconstruct_state_idle;
- /* All the 'written' buffers and the parity blocks are ready to
- * be written back to disk
- */
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
- BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags));
- for (i = disks; i--; ) {
- dev = &sh->dev[i];
- if (test_bit(R5_LOCKED, &dev->flags) &&
- (i == sh->pd_idx || i == qd_idx ||
- dev->written)) {
- pr_debug("Writing block %d\n", i);
- BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
- set_bit(R5_Wantwrite, &dev->flags);
- if (!test_bit(R5_Insync, &dev->flags) ||
- ((i == sh->pd_idx || i == qd_idx) &&
- s->failed == 0))
- set_bit(STRIPE_INSYNC, &sh->state);
- }
- }
- if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
- s->dec_preread_active = 1;
- }
-
- /* Now to consider new write requests and what else, if anything
- * should be read. We do not handle new writes when:
- * 1/ A 'write' operation (copy+gen_syndrome) is already in flight.
- * 2/ A 'check' operation is in flight, as it may clobber the parity
- * block.
- */
- if (s->to_write && !sh->reconstruct_state && !sh->check_state)
- handle_stripe_dirtying(conf, sh, s, disks);
-
- /* maybe we need to check and possibly fix the parity for this stripe
- * Any reads will already have been scheduled, so we just see if enough
- * data is available. The parity check is held off while parity
- * dependent operations are in flight.
- */
- if (sh->check_state ||
- (s->syncing && s->locked == 0 &&
- !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
- !test_bit(STRIPE_INSYNC, &sh->state)))
- handle_parity_checks6(conf, sh, s, disks);
return 0;
}
@@ -3275,6 +3176,8 @@ static void handle_stripe(struct stripe_head *sh)
int done;
raid5_conf_t *conf = sh->raid_conf;
int i;
+ int prexor;
+ int disks = sh->disks;
clear_bit(STRIPE_HANDLE, &sh->state);
if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
@@ -3311,6 +3214,64 @@ static void handle_stripe(struct stripe_head *sh)
if (done)
goto finish;
+ /* Now we check to see if any write operations have recently
+ * completed
+ */
+ prexor = 0;
+ if (sh->reconstruct_state == reconstruct_state_prexor_drain_result)
+ prexor = 1;
+ if (sh->reconstruct_state == reconstruct_state_drain_result ||
+ sh->reconstruct_state == reconstruct_state_prexor_drain_result) {
+ sh->reconstruct_state = reconstruct_state_idle;
+
+ /* All the 'written' buffers and the parity block are ready to
+ * be written back to disk
+ */
+ BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags));
+ BUG_ON(sh->qd_idx >= 0 &&
+ !test_bit(R5_UPTODATE, &sh->dev[sh->qd_idx].flags));
+ for (i = disks; i--; ) {
+ struct r5dev *dev = &sh->dev[i];
+ if (test_bit(R5_LOCKED, &dev->flags) &&
+ (i == sh->pd_idx || i == sh->qd_idx ||
+ dev->written)) {
+ pr_debug("Writing block %d\n", i);
+ set_bit(R5_Wantwrite, &dev->flags);
+ if (prexor)
+ continue;
+ if (!test_bit(R5_Insync, &dev->flags) ||
+ ((i == sh->pd_idx || i == sh->qd_idx) &&
+ s.failed == 0))
+ set_bit(STRIPE_INSYNC, &sh->state);
+ }
+ }
+ if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ s.dec_preread_active = 1;
+ }
+
+ /* Now to consider new write requests and what else, if anything
+ * should be read. We do not handle new writes when:
+ * 1/ A 'write' operation (copy+xor) is already in flight.
+ * 2/ A 'check' operation is in flight, as it may clobber the parity
+ * block.
+ */
+ if (s.to_write && !sh->reconstruct_state && !sh->check_state)
+ handle_stripe_dirtying(conf, sh, &s, disks);
+
+ /* maybe we need to check and possibly fix the parity for this stripe
+ * Any reads will already have been scheduled, so we just see if enough
+ * data is available. The parity check is held off while parity
+ * dependent operations are in flight.
+ */
+ if (sh->check_state ||
+ (s.syncing && s.locked == 0 &&
+ !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
+ !test_bit(STRIPE_INSYNC, &sh->state))) {
+ if (conf->level == 6)
+ handle_parity_checks6(conf, sh, &s, disks);
+ else
+ handle_parity_checks5(conf, sh, &s, disks);
+ }
if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
* [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (22 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 26/34] md/raid10: Make use of new recovery_disabled handling NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-22 9:36 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 25/34] md: change managed of recovery_disabled NeilBrown
` (9 subsequent siblings)
33 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
handle_stripe5() and handle_stripe6() are now virtually identical.
So discard one and rename the other to 'analyse_stripe()'.
It always returns 0, so change it to 'void' and remove the 'done'
variable in handle_stripe().
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 100 +++-------------------------------------------------
1 files changed, 5 insertions(+), 95 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f23848f..93ee26c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2899,88 +2899,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
*
*/
-static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
-{
- raid5_conf_t *conf = sh->raid_conf;
- int disks = sh->disks, i;
- struct r5dev *dev;
-
- /* Now to look around and see what can be done */
- rcu_read_lock();
- spin_lock_irq(&conf->device_lock);
- for (i=disks; i--; ) {
- mdk_rdev_t *rdev;
-
- dev = &sh->dev[i];
-
- pr_debug("check %d: state 0x%lx toread %p read %p write %p "
- "written %p\n", i, dev->flags, dev->toread, dev->read,
- dev->towrite, dev->written);
-
- /* maybe we can request a biofill operation
- *
- * new wantfill requests are only permitted while
- * ops_complete_biofill is guaranteed to be inactive
- */
- if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
- !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
- set_bit(R5_Wantfill, &dev->flags);
-
- /* now count some things */
- if (test_bit(R5_LOCKED, &dev->flags))
- s->locked++;
- if (test_bit(R5_UPTODATE, &dev->flags))
- s->uptodate++;
- if (test_bit(R5_Wantcompute, &dev->flags))
- s->compute++;
-
- if (test_bit(R5_Wantfill, &dev->flags))
- s->to_fill++;
- else if (dev->toread)
- s->to_read++;
- if (dev->towrite) {
- s->to_write++;
- if (!test_bit(R5_OVERWRITE, &dev->flags))
- s->non_overwrite++;
- }
- if (dev->written)
- s->written++;
- rdev = rcu_dereference(conf->disks[i].rdev);
- if (s->blocked_rdev == NULL &&
- rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
- s->blocked_rdev = rdev;
- atomic_inc(&rdev->nr_pending);
- }
- clear_bit(R5_Insync, &dev->flags);
- if (!rdev)
- /* Not in-sync */;
- else if (test_bit(In_sync, &rdev->flags))
- set_bit(R5_Insync, &dev->flags);
- else {
- /* could be in-sync depending on recovery/reshape status */
- if (sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
- set_bit(R5_Insync, &dev->flags);
- }
- if (!test_bit(R5_Insync, &dev->flags)) {
- /* The ReadError flag will just be confusing now */
- clear_bit(R5_ReadError, &dev->flags);
- clear_bit(R5_ReWrite, &dev->flags);
- }
- if (test_bit(R5_ReadError, &dev->flags))
- clear_bit(R5_Insync, &dev->flags);
- if (!test_bit(R5_Insync, &dev->flags)) {
- if (s->failed < 2)
- s->failed_num[s->failed] = i;
- s->failed++;
- }
- }
- spin_unlock_irq(&conf->device_lock);
- rcu_read_unlock();
-
- return 0;
-}
-
-static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
+static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
{
raid5_conf_t *conf = sh->raid_conf;
int disks = sh->disks;
@@ -2988,11 +2907,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
int i;
/* Now to look around and see what can be done */
-
rcu_read_lock();
spin_lock_irq(&conf->device_lock);
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
+
dev = &sh->dev[i];
pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
@@ -3016,9 +2935,9 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
BUG_ON(s->compute > 2);
}
- if (test_bit(R5_Wantfill, &dev->flags)) {
+ if (test_bit(R5_Wantfill, &dev->flags))
s->to_fill++;
- } else if (dev->toread)
+ else if (dev->toread)
s->to_read++;
if (dev->towrite) {
s->to_write++;
@@ -3058,14 +2977,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
}
spin_unlock_irq(&conf->device_lock);
rcu_read_unlock();
-
- return 0;
}
static void handle_stripe(struct stripe_head *sh)
{
struct stripe_head_state s;
- int done;
raid5_conf_t *conf = sh->raid_conf;
int i;
int prexor;
@@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
s.failed_num[0] = -1;
s.failed_num[1] = -1;
- if (conf->level == 6)
- done = handle_stripe6(sh, &s);
- else
- done = handle_stripe5(sh, &s);
-
- if (done)
- goto finish;
+ analyse_stripe(sh, &s);
if (unlikely(s.blocked_rdev)) {
if (s.syncing || s.expanding || s.expanded ||
* [md PATCH 25/34] md: change managed of recovery_disabled.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (23 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 20/34] md/raid5: finalise new merged handle_stripe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 22/34] md/raid: use printk_ratelimited instead of printk_ratelimit NeilBrown
` (8 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk with
a read error is better than not having an array at all.
Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.
So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality wants to disable recovery, and is changed when a device is
added to the array, as that is used as a trigger to 'try recovery
again'.
This will allow RAID10 to get the control that it needs.
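The pattern boils down to three pieces (a sketch drawn from the hunks
below, not new code):

	/* personality, on a persistent read error during recovery: */
	conf->recovery_disabled = mddev->recovery_disabled;

	/* md core, when a device is added (bind_rdev_to_array): */
	mddev->recovery_disabled++;

	/* personality, before accepting a device for recovery again: */
	if (mddev->recovery_disabled == conf->recovery_disabled)
		return -EBUSY;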
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 4 ++--
drivers/md/md.h | 9 ++++++---
drivers/md/raid1.c | 7 +++++--
drivers/md/raid1.h | 6 ++++++
4 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 77bd8d8..c7d9c6a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1922,7 +1922,7 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
bd_link_disk_holder(rdev->bdev, mddev->gendisk);
/* May as well allow recovery to be retried once */
- mddev->recovery_disabled = 0;
+ mddev->recovery_disabled++;
return 0;
@@ -7070,7 +7070,7 @@ static int remove_and_add_spares(mddev_t *mddev)
}
}
- if (mddev->degraded && !mddev->recovery_disabled) {
+ if (mddev->degraded) {
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->raid_disk >= 0 &&
!test_bit(In_sync, &rdev->flags) &&
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6863f72..de5455d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -239,9 +239,12 @@ struct mddev_s
#define MD_RECOVERY_FROZEN 9
unsigned long recovery;
- int recovery_disabled; /* if we detect that recovery
- * will always fail, set this
- * so we don't loop trying */
+ /* If a RAID personality determines that recovery (of a particular
+ * device) will fail due to a read error on the source device, it
+ * takes a copy of this number and does not attempt recovery again
+ * until this number changes.
+ */
+ int recovery_disabled;
int in_sync; /* know to not need resync */
/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1d79a04..44069b3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -956,7 +956,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
* However don't try a recovery from this drive as
* it is very likely to fail.
*/
- mddev->recovery_disabled = 1;
+ conf->recovery_disabled = mddev->recovery_disabled;
return;
}
if (test_and_clear_bit(In_sync, &rdev->flags)) {
@@ -1052,6 +1052,9 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
int first = 0;
int last = mddev->raid_disks - 1;
+ if (mddev->recovery_disabled == conf->recovery_disabled)
+ return -EBUSY;
+
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
@@ -1107,7 +1110,7 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
* is not possible.
*/
if (!test_bit(Faulty, &rdev->flags) &&
- !mddev->recovery_disabled &&
+ mddev->recovery_disabled != conf->recovery_disabled &&
mddev->degraded < conf->raid_disks) {
err = -EBUSY;
goto abort;
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index e743a64..3cd18cf 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -48,6 +48,12 @@ struct r1_private_data_s {
* (fresh device added).
* Cleared when a sync completes.
*/
+ int recovery_disabled; /* when the same as
+ * mddev->recovery_disabled
+ * we don't allow recovery
+ * to be attempted as we
+ * expect a read error
+ */
wait_queue_head_t wait_barrier;
* [md PATCH 24/34] md: remove ro check in md_check_recovery()
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (25 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 22/34] md/raid: use printk_ratelimited instead of printk_ratelimit NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 30/34] md/raid5: move rdev->corrected_errors counting NeilBrown
` (6 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Commit c89a8eee6154 ("Allow faulty devices to be removed from a
readonly array.") added some handling for read-only arrays in this
function, but it was never reached because read-only arrays returned
early at the top of md_check_recovery(). Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0398dc4..77bd8d8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7163,9 +7163,6 @@ void md_check_recovery(mddev_t *mddev)
if (mddev->bitmap)
bitmap_daemon_work(mddev);
- if (mddev->ro)
- return;
-
if (signal_pending(current)) {
if (mddev->pers->sync_request && !mddev->external) {
printk(KERN_INFO "md: %s in immediate safe mode\n",
* [md PATCH 26/34] md/raid10: Make use of new recovery_disabled handling
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (21 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 27/34] md/raid10: Improve decision on whether to fail a device with a read error NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 20/34] md/raid5: finalise new merged handle_stripe NeilBrown
` (10 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.
Instead, make use of the new recovery_disabled handling and mark
the target device as having recovery disabled.
Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 62 ++++++++++++++++++++++++++++++---------------------
drivers/md/raid10.h | 5 ++++
2 files changed, 41 insertions(+), 26 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1725ec1..5583201 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1099,7 +1099,6 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
conf_t *conf = mddev->private;
int err = -EEXIST;
int mirror;
- mirror_info_t *p;
int first = 0;
int last = conf->raid_disks - 1;
@@ -1119,32 +1118,36 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
mirror = rdev->saved_raid_disk;
else
mirror = first;
- for ( ; mirror <= last ; mirror++)
- if ( !(p=conf->mirrors+mirror)->rdev) {
-
- disk_stack_limits(mddev->gendisk, rdev->bdev,
- rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must
- * never risk violating it, so limit
- * ->max_segments to one lying with a single
- * page, as a one page request is never in
- * violation.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
+ for ( ; mirror <= last ; mirror++) {
+ mirror_info_t *p = &conf->mirrors[mirror];
+ if (p->recovery_disabled == mddev->recovery_disabled)
+ continue;
+ if (!p->rdev)
+ continue;
- p->head_position = 0;
- rdev->raid_disk = mirror;
- err = 0;
- if (rdev->saved_raid_disk != mirror)
- conf->fullsync = 1;
- rcu_assign_pointer(p->rdev, rdev);
- break;
+ disk_stack_limits(mddev->gendisk, rdev->bdev,
+ rdev->data_offset << 9);
+ /* as we don't honour merge_bvec_fn, we must
+ * never risk violating it, so limit
+ * ->max_segments to one lying with a single
+ * page, as a one page request is never in
+ * violation.
+ */
+ if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
+ blk_queue_max_segments(mddev->queue, 1);
+ blk_queue_segment_boundary(mddev->queue,
+ PAGE_CACHE_SIZE - 1);
}
+ p->head_position = 0;
+ rdev->raid_disk = mirror;
+ err = 0;
+ if (rdev->saved_raid_disk != mirror)
+ conf->fullsync = 1;
+ rcu_assign_pointer(p->rdev, rdev);
+ break;
+ }
+
md_integrity_add_rdev(rdev, mddev);
print_conf(conf);
return err;
@@ -1169,6 +1172,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
* is not possible.
*/
if (!test_bit(Faulty, &rdev->flags) &&
+ mddev->recovery_disabled != p->recovery_disabled &&
enough(conf)) {
err = -EBUSY;
goto abort;
@@ -1383,8 +1387,14 @@ static void recovery_request_write(mddev_t *mddev, r10bio_t *r10_bio)
md_sync_acct(conf->mirrors[d].rdev->bdev, wbio->bi_size >> 9);
if (test_bit(R10BIO_Uptodate, &r10_bio->state))
generic_make_request(wbio);
- else
- bio_endio(wbio, -EIO);
+ else {
+ printk(KERN_NOTICE
+ "md/raid10:%s: recovery aborted due to read error\n",
+ mdname(mddev));
+ conf->mirrors[d].recovery_disabled = mddev->recovery_disabled;
+ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ bio_endio(wbio, 0);
+ }
}
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 944b110..a485914 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -6,6 +6,11 @@ typedef struct mirror_info mirror_info_t;
struct mirror_info {
mdk_rdev_t *rdev;
sector_t head_position;
+ int recovery_disabled; /* matches
+ * mddev->recovery_disabled
+ * when we shouldn't try
+ * recovering this device.
+ */
};
typedef struct r10bio_s r10bio_t;
* [md PATCH 21/34] md: use proper little-endian bitops
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (19 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 23/34] md: introduce link/unlink_rdev() helpers NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 27/34] md/raid10: Improve decision on whether to fail a device with a read error NeilBrown
` (12 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Akinobu Mita
From: Akinobu Mita <akinobu.mita@gmail.com>
Calls to __test_and_{set,clear}_bit_le() that ignore the return value
can be replaced with __{set,clear}_bit_le().
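For example, a call such as

	__test_and_set_bit_le(bit, kaddr);	/* return value unused */

becomes simply

	__set_bit_le(bit, kaddr);

and likewise for the clear variant, as the hunks below show.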
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 574b09a..39ff0ef 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -932,7 +932,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
if (bitmap->flags & BITMAP_HOSTENDIAN)
set_bit(bit, kaddr);
else
- __test_and_set_bit_le(bit, kaddr);
+ __set_bit_le(bit, kaddr);
kunmap_atomic(kaddr, KM_USER0);
PRINTK("set file bit %lu page %lu\n", bit, page->index);
}
@@ -1304,8 +1304,10 @@ void bitmap_daemon_work(mddev_t *mddev)
clear_bit(file_page_offset(bitmap, j),
paddr);
else
- __test_and_clear_bit_le(file_page_offset(bitmap, j),
- paddr);
+ __clear_bit_le(
+ file_page_offset(bitmap,
+ j),
+ paddr);
kunmap_atomic(paddr, KM_USER0);
} else
log->type->clear_region(log, j);
* [md PATCH 23/34] md: introduce link/unlink_rdev() helpers
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (18 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 19/34] md/raid5: move some more common code " NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 21/34] md: use proper little-endian bitops NeilBrown
` (13 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
There are several places where sysfs links to an rdev are handled
in the same way. Add helper functions to consolidate
them.
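A typical call site goes from

	char nm[20];
	sprintf(nm, "rd%d", rdev->raid_disk);
	sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);

to just

	sysfs_link_rdev(mddev, rdev);

with sysfs_unlink_rdev() covering the sysfs_remove_link() case in the
same way.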
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 47 ++++++++++++++---------------------------------
drivers/md/md.h | 14 ++++++++++++++
drivers/md/raid1.c | 15 +++++----------
drivers/md/raid5.c | 10 +++-------
4 files changed, 36 insertions(+), 50 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 91e31e2..0398dc4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2459,7 +2459,6 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
{
char *e;
int err;
- char nm[20];
int slot = simple_strtoul(buf, &e, 10);
if (strncmp(buf, "none", 4)==0)
slot = -1;
@@ -2482,8 +2481,7 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
hot_remove_disk(rdev->mddev, rdev->raid_disk);
if (err)
return err;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&rdev->mddev->kobj, nm);
+ sysfs_unlink_rdev(rdev->mddev, rdev);
rdev->raid_disk = -1;
set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
md_wakeup_thread(rdev->mddev->thread);
@@ -2522,8 +2520,7 @@ slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
return err;
} else
sysfs_notify_dirent_safe(rdev->sysfs_state);
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&rdev->mddev->kobj, &rdev->kobj, nm))
+ if (sysfs_link_rdev(rdev->mddev, rdev))
/* failure here is OK */;
/* don't wakeup anyone, leave that to userspace. */
} else {
@@ -3149,15 +3146,13 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
}
list_for_each_entry(rdev, &mddev->disks, same_set) {
- char nm[20];
if (rdev->raid_disk < 0)
continue;
if (rdev->new_raid_disk >= mddev->raid_disks)
rdev->new_raid_disk = -1;
if (rdev->new_raid_disk == rdev->raid_disk)
continue;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
}
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->raid_disk < 0)
@@ -3168,11 +3163,10 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
if (rdev->raid_disk < 0)
clear_bit(In_sync, &rdev->flags);
else {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if(sysfs_create_link(&mddev->kobj, &rdev->kobj, nm))
- printk("md: cannot register %s for %s after level change\n",
- nm, mdname(mddev));
+ if (sysfs_link_rdev(mddev, rdev))
+ printk(KERN_WARNING "md: cannot register rd%d"
+ " for %s after level change\n",
+ rdev->raid_disk, mdname(mddev));
}
}
@@ -4621,12 +4615,9 @@ int md_run(mddev_t *mddev)
smp_wmb();
mddev->ready = 1;
list_for_each_entry(rdev, &mddev->disks, same_set)
- if (rdev->raid_disk >= 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj, &rdev->kobj, nm))
+ if (rdev->raid_disk >= 0)
+ if (sysfs_link_rdev(mddev, rdev))
/* failure here is OK */;
- }
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -4854,11 +4845,8 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
sysfs_notify_dirent_safe(mddev->sysfs_state);
list_for_each_entry(rdev, &mddev->disks, same_set)
- if (rdev->raid_disk >= 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
- }
+ if (rdev->raid_disk >= 0)
+ sysfs_unlink_rdev(mddev, rdev);
set_capacity(disk, 0);
mutex_unlock(&mddev->open_mutex);
@@ -7077,9 +7065,7 @@ static int remove_and_add_spares(mddev_t *mddev)
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
mddev, rdev->raid_disk)==0) {
- char nm[20];
- sprintf(nm,"rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}
@@ -7096,10 +7082,7 @@ static int remove_and_add_spares(mddev_t *mddev)
rdev->recovery_offset = 0;
if (mddev->pers->
hot_add_disk(mddev, rdev) == 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+ if (sysfs_link_rdev(mddev, rdev))
/* failure here is OK */;
spares++;
md_new_event(mddev);
@@ -7219,9 +7202,7 @@ void md_check_recovery(mddev_t *mddev)
atomic_read(&rdev->nr_pending)==0) {
if (mddev->pers->hot_remove_disk(
mddev, rdev->raid_disk)==0) {
- char nm[20];
- sprintf(nm,"rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1c26c7a..6863f72 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -413,6 +413,20 @@ static inline char * mdname (mddev_t * mddev)
return mddev->gendisk ? mddev->gendisk->disk_name : "mdX";
}
+static inline int sysfs_link_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+ char nm[20];
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+}
+
+static inline void sysfs_unlink_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
+{
+ char nm[20];
+ sprintf(nm, "rd%d", rdev->raid_disk);
+ sysfs_remove_link(&mddev->kobj, nm);
+}
+
/*
* iterates through some rdev ringlist. It's safe to remove the
* current 'rdev'. Dont touch 'tmp' though.
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d3a8f4b..1d79a04 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2159,18 +2159,13 @@ static int raid1_reshape(mddev_t *mddev)
for (d = d2 = 0; d < conf->raid_disks; d++) {
mdk_rdev_t *rdev = conf->mirrors[d].rdev;
if (rdev && rdev->raid_disk != d2) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = d2;
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+ sysfs_unlink_rdev(mddev, rdev);
+ if (sysfs_link_rdev(mddev, rdev))
printk(KERN_WARNING
- "md/raid1:%s: cannot register "
- "%s\n",
- mdname(mddev), nm);
+ "md/raid1:%s: cannot register rd%d\n",
+ mdname(mddev), rdev->raid_disk);
}
if (rdev)
newmirrors[d2++].rdev = rdev;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b14183e..3c67942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5152,16 +5152,14 @@ static int raid5_start_reshape(mddev_t *mddev)
if (rdev->raid_disk < 0 &&
!test_bit(Faulty, &rdev->flags)) {
if (raid5_add_disk(mddev, rdev) == 0) {
- char nm[20];
if (rdev->raid_disk
>= conf->previous_raid_disks) {
set_bit(In_sync, &rdev->flags);
added_devices++;
} else
rdev->recovery_offset = 0;
- sprintf(nm, "rd%d", rdev->raid_disk);
- if (sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm))
+
+ if (sysfs_link_rdev(mddev, rdev))
/* Failure here is OK */;
}
} else if (rdev->raid_disk >= conf->previous_raid_disks
@@ -5257,9 +5255,7 @@ static void raid5_finish_reshape(mddev_t *mddev)
d++) {
mdk_rdev_t *rdev = conf->disks[d].rdev;
if (rdev && raid5_remove_disk(mddev, d) == 0) {
- char nm[20];
- sprintf(nm, "rd%d", rdev->raid_disk);
- sysfs_remove_link(&mddev->kobj, nm);
+ sysfs_unlink_rdev(mddev, rdev);
rdev->raid_disk = -1;
}
}
* [md PATCH 22/34] md/raid: use printk_ratelimited instead of printk_ratelimit
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (24 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 25/34] md: change managed of recovery_disabled NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 24/34] md: remove ro check in md_check_recovery() NeilBrown
` (7 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Christian Dietrich
From: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
As the comment above printk_ratelimit() says, it should not be used.
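That is, the open-coded pattern (shown here schematically, with "..."
standing for the real format string and arguments)

	if (printk_ratelimit())
		printk(KERN_ERR "...", ...);

becomes

	printk_ratelimited(KERN_ERR "...", ...);

as in the hunks below.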
Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 25 +++++++++++++++----------
drivers/md/raid10.c | 23 +++++++++++++----------
drivers/md/raid5.c | 47 +++++++++++++++++++++++++----------------------
3 files changed, 53 insertions(+), 42 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f7431b6..d3a8f4b 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -35,6 +35,7 @@
#include <linux/delay.h>
#include <linux/blkdev.h>
#include <linux/seq_file.h>
+#include <linux/ratelimit.h>
#include "md.h"
#include "raid1.h"
#include "bitmap.h"
@@ -287,10 +288,13 @@ static void raid1_end_read_request(struct bio *bio, int error)
* oops, read error:
*/
char b[BDEVNAME_SIZE];
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid1:%s: %s: rescheduling sector %llu\n",
- mdname(conf->mddev),
- bdevname(conf->mirrors[mirror].rdev->bdev,b), (unsigned long long)r1_bio->sector);
+ printk_ratelimited(
+ KERN_ERR "md/raid1:%s: %s: "
+ "rescheduling sector %llu\n",
+ mdname(conf->mddev),
+ bdevname(conf->mirrors[mirror].rdev->bdev,
+ b),
+ (unsigned long long)r1_bio->sector);
reschedule_retry(r1_bio);
}
@@ -1580,12 +1584,13 @@ static void raid1d(mddev_t *mddev)
GFP_NOIO, mddev);
r1_bio->bios[r1_bio->read_disk] = bio;
rdev = conf->mirrors[disk].rdev;
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid1:%s: redirecting sector %llu to"
- " other mirror: %s\n",
- mdname(mddev),
- (unsigned long long)r1_bio->sector,
- bdevname(rdev->bdev,b));
+ printk_ratelimited(
+ KERN_ERR
+ "md/raid1:%s: redirecting sector %llu"
+ " to other mirror: %s\n",
+ mdname(mddev),
+ (unsigned long long)r1_bio->sector,
+ bdevname(rdev->bdev, b));
bio->bi_sector = r1_bio->sector + rdev->data_offset;
bio->bi_bdev = rdev->bdev;
bio->bi_end_io = raid1_end_read_request;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3715e22..1725ec1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -22,6 +22,7 @@
#include <linux/delay.h>
#include <linux/blkdev.h>
#include <linux/seq_file.h>
+#include <linux/ratelimit.h>
#include "md.h"
#include "raid10.h"
#include "raid0.h"
@@ -301,10 +302,11 @@ static void raid10_end_read_request(struct bio *bio, int error)
* oops, read error - keep the refcount on the rdev
*/
char b[BDEVNAME_SIZE];
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid10:%s: %s: rescheduling sector %llu\n",
- mdname(conf->mddev),
- bdevname(conf->mirrors[dev].rdev->bdev,b), (unsigned long long)r10_bio->sector);
+ printk_ratelimited(KERN_ERR
+ "md/raid10:%s: %s: rescheduling sector %llu\n",
+ mdname(conf->mddev),
+ bdevname(conf->mirrors[dev].rdev->bdev, b),
+ (unsigned long long)r10_bio->sector);
reschedule_retry(r10_bio);
}
}
@@ -1669,12 +1671,13 @@ static void raid10d(mddev_t *mddev)
bio_put(bio);
slot = r10_bio->read_slot;
rdev = conf->mirrors[mirror].rdev;
- if (printk_ratelimit())
- printk(KERN_ERR "md/raid10:%s: %s: redirecting sector %llu to"
- " another mirror\n",
- mdname(mddev),
- bdevname(rdev->bdev,b),
- (unsigned long long)r10_bio->sector);
+ printk_ratelimited(
+ KERN_ERR
+ "md/raid10:%s: %s: redirecting"
+ "sector %llu to another mirror\n",
+ mdname(mddev),
+ bdevname(rdev->bdev, b),
+ (unsigned long long)r10_bio->sector);
bio = bio_clone_mddev(r10_bio->master_bio,
GFP_NOIO, mddev);
r10_bio->devs[slot].bio = bio;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 93ee26c..b14183e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -51,6 +51,7 @@
#include <linux/seq_file.h>
#include <linux/cpu.h>
#include <linux/slab.h>
+#include <linux/ratelimit.h>
#include "md.h"
#include "raid5.h"
#include "raid0.h"
@@ -96,8 +97,6 @@
#define __inline__
#endif
-#define printk_rl(args...) ((void) (printk_ratelimit() && printk(args)))
-
/*
* We maintain a biased count of active stripes in the bottom 16 bits of
* bi_phys_segments, and a count of processed stripes in the upper 16 bits
@@ -1583,12 +1582,14 @@ static void raid5_end_read_request(struct bio * bi, int error)
set_bit(R5_UPTODATE, &sh->dev[i].flags);
if (test_bit(R5_ReadError, &sh->dev[i].flags)) {
rdev = conf->disks[i].rdev;
- printk_rl(KERN_INFO "md/raid:%s: read error corrected"
- " (%lu sectors at %llu on %s)\n",
- mdname(conf->mddev), STRIPE_SECTORS,
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdevname(rdev->bdev, b));
+ printk_ratelimited(
+ KERN_INFO
+ "md/raid:%s: read error corrected"
+ " (%lu sectors at %llu on %s)\n",
+ mdname(conf->mddev), STRIPE_SECTORS,
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdevname(rdev->bdev, b));
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
@@ -1602,22 +1603,24 @@ static void raid5_end_read_request(struct bio * bi, int error)
clear_bit(R5_UPTODATE, &sh->dev[i].flags);
atomic_inc(&rdev->read_errors);
if (conf->mddev->degraded >= conf->max_degraded)
- printk_rl(KERN_WARNING
- "md/raid:%s: read error not correctable "
- "(sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdn);
+ printk_ratelimited(
+ KERN_WARNING
+ "md/raid:%s: read error not correctable "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (test_bit(R5_ReWrite, &sh->dev[i].flags))
/* Oh, no!!! */
- printk_rl(KERN_WARNING
- "md/raid:%s: read error NOT corrected!! "
- "(sector %llu on %s).\n",
- mdname(conf->mddev),
- (unsigned long long)(sh->sector
- + rdev->data_offset),
- bdn);
+ printk_ratelimited(
+ KERN_WARNING
+ "md/raid:%s: read error NOT corrected!! "
+ "(sector %llu on %s).\n",
+ mdname(conf->mddev),
+ (unsigned long long)(sh->sector
+ + rdev->data_offset),
+ bdn);
else if (atomic_read(&rdev->read_errors)
> conf->max_nr_stripes)
printk(KERN_WARNING
* [md PATCH 27/34] md/raid10: Improve decision on whether to fail a device with a read error.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (20 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 21/34] md: use proper little-endian bitops NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 26/34] md/raid10: Make use of new recovery_disabled handling NeilBrown
` (11 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.
The current test to decide whether the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.
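As a worked example (hypothetical near-2 layout over four devices):
with conf->copies == 2, enough() checks the device groups {0,1} and
{2,3} and requires a working rdev in each, so losing devices 0 and 2
leaves the array viable while losing 0 and 1 together does not.
Passing the device that just hit the error as 'ignore' asks whether
the array would still be viable without it.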
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 57 +++++++++++++++++++++++++--------------------------
1 files changed, 28 insertions(+), 29 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 5583201..6721cb0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -970,6 +970,30 @@ static void status(struct seq_file *seq, mddev_t *mddev)
seq_printf(seq, "]");
}
+/* check if there are enough drives for
+ * every block to appear on atleast one.
+ * Don't consider the device numbered 'ignore'
+ * as we might be about to remove it.
+ */
+static int enough(conf_t *conf, int ignore)
+{
+ int first = 0;
+
+ do {
+ int n = conf->copies;
+ int cnt = 0;
+ while (n--) {
+ if (conf->mirrors[first].rdev &&
+ first != ignore)
+ cnt++;
+ first = (first+1) % conf->raid_disks;
+ }
+ if (cnt == 0)
+ return 0;
+ } while (first != 0);
+ return 1;
+}
+
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
{
char b[BDEVNAME_SIZE];
@@ -982,13 +1006,9 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
* else mark the drive as failed
*/
if (test_bit(In_sync, &rdev->flags)
- && conf->raid_disks-mddev->degraded == 1)
+ && !enough(conf, rdev->raid_disk))
/*
* Don't fail the drive, just return an IO error.
- * The test should really be more sophisticated than
- * "working_disks == 1", but it isn't critical, and
- * can wait until we do more sophisticated "is the drive
- * really dead" tests...
*/
return;
if (test_and_clear_bit(In_sync, &rdev->flags)) {
@@ -1043,27 +1063,6 @@ static void close_sync(conf_t *conf)
conf->r10buf_pool = NULL;
}
-/* check if there are enough drives for
- * every block to appear on atleast one
- */
-static int enough(conf_t *conf)
-{
- int first = 0;
-
- do {
- int n = conf->copies;
- int cnt = 0;
- while (n--) {
- if (conf->mirrors[first].rdev)
- cnt++;
- first = (first+1) % conf->raid_disks;
- }
- if (cnt == 0)
- return 0;
- } while (first != 0);
- return 1;
-}
-
static int raid10_spare_active(mddev_t *mddev)
{
int i;
@@ -1107,7 +1106,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
* very different from resync
*/
return -EBUSY;
- if (!enough(conf))
+ if (!enough(conf, -1))
return -EINVAL;
if (rdev->raid_disk >= 0)
@@ -1173,7 +1172,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
*/
if (!test_bit(Faulty, &rdev->flags) &&
mddev->recovery_disabled != p->recovery_disabled &&
- enough(conf)) {
+ enough(conf, -1)) {
err = -EBUSY;
goto abort;
}
@@ -2286,7 +2285,7 @@ static int run(mddev_t *mddev)
disk->head_position = 0;
}
/* need to check that every block has at least one working mirror */
- if (!enough(conf)) {
+ if (!enough(conf, -1)) {
printk(KERN_ERR "md/raid10:%s: not enough operational mirrors.\n",
mdname(mddev));
goto out_free_conf;
* [md PATCH 33/34] MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (29 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 29/34] md/raid1: move rdev->corrected_errors counting NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 32/34] md/raid5: Avoid BUG caused by multiple failures NeilBrown
` (2 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Jonathan Brassow
From: Jonathan Brassow <jbrassow@redhat.com>
If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will dereference a NULL pointer when the rebuild
finishes, because sysfs is not used by device-mapper instantiated
RAID devices.
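For reference, the _safe variant already in md.h is essentially a
NULL-guarded wrapper, roughly:

	static inline void sysfs_notify_dirent_safe(struct sysfs_dirent *sd)
	{
		if (sd)
			sysfs_notify_dirent(sd);
	}

so dm-created arrays, which never set up rdev->sysfs_state, no longer
oops here.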
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a7e6908..3cbf0ac 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1031,7 +1031,7 @@ static int raid1_spare_active(mddev_t *mddev)
&& !test_bit(Faulty, &rdev->flags)
&& !test_and_set_bit(In_sync, &rdev->flags)) {
count++;
- sysfs_notify_dirent(rdev->sysfs_state);
+ sysfs_notify_dirent_safe(rdev->sysfs_state);
}
}
spin_lock_irqsave(&conf->device_lock, flags);
* [md PATCH 29/34] md/raid1: move rdev->corrected_errors counting
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (28 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 28/34] md: get rid of unnecessary casts on page_address() NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 33/34] MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe NeilBrown
` (3 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Read errors are considered to be corrected if the write-back and
re-read cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting after the re-read looks more
reasonable IMHO. Also included a couple of whitespace fixes on
sync_page_io().
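The resulting flow in fix_sync_read_error() is roughly the following
sketch of the hunks below, with 'page' standing in for the bio page
used in the real code:

	/* write the good data back ... */
	if (sync_page_io(rdev, sect, s<<9, page, WRITE, false) == 0)
		md_error(mddev, rdev);
	/* ... then re-read it; only a successful re-read counts */
	if (sync_page_io(rdev, sect, s<<9, page, READ, false) == 0)
		md_error(mddev, rdev);
	else
		atomic_add(s, &rdev->corrected_errors);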
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 44069b3..a7e6908 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1224,9 +1224,7 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
* active, and resync is currently active
*/
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
READ, false)) {
success = 1;
@@ -1261,16 +1259,13 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
WRITE, false) == 0) {
r1_bio->bios[d]->bi_end_io = NULL;
rdev_dec_pending(rdev, mddev);
md_error(mddev, rdev);
- } else
- atomic_add(s, &rdev->corrected_errors);
+ }
}
d = start;
while (d != r1_bio->read_disk) {
@@ -1280,12 +1275,12 @@ static int fix_sync_read_error(r1bio_t *r1_bio)
if (r1_bio->bios[d]->bi_end_io != end_sync_read)
continue;
rdev = conf->mirrors[d].rdev;
- if (sync_page_io(rdev,
- sect,
- s<<9,
+ if (sync_page_io(rdev, sect, s<<9,
bio->bi_io_vec[idx].bv_page,
READ, false) == 0)
md_error(mddev, rdev);
+ else
+ atomic_add(s, &rdev->corrected_errors);
}
sectors -= s;
sect += s;
* [md PATCH 34/34] MD bitmap: Revert DM dirty log hooks
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (31 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 32/34] md/raid5: Avoid BUG caused by multiple failures NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 31/34] md/raid10: move rdev->corrected_errors counting NeilBrown
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Jonathan Brassow
From: Jonathan Brassow <jbrassow@redhat.com>
Revert most of commit e384e58549a2e9a83071ad80280c1a9053cfd84c
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
MD should not need to use DM's dirty log - we decided to use md's
bitmaps instead.
Keeping the DIV_ROUND_UP clean-ups that were part of commit
e384e58549a2e9a83071ad80280c1a9053cfd84c, however.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 133 ++++++++++++++++-----------------------------------
drivers/md/bitmap.h | 5 --
drivers/md/md.h | 5 --
3 files changed, 43 insertions(+), 100 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 39ff0ef..0dc6546 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -29,7 +29,6 @@
#include "md.h"
#include "bitmap.h"
-#include <linux/dm-dirty-log.h>
/* debug macros */
#define DEBUG 0
@@ -775,10 +774,8 @@ static inline unsigned long file_page_offset(struct bitmap *bitmap, unsigned lon
* 0 or page 1
*/
static inline struct page *filemap_get_page(struct bitmap *bitmap,
- unsigned long chunk)
+ unsigned long chunk)
{
- if (bitmap->filemap == NULL)
- return NULL;
if (file_page_index(bitmap, chunk) >= bitmap->file_pages)
return NULL;
return bitmap->filemap[file_page_index(bitmap, chunk)
@@ -878,28 +875,19 @@ enum bitmap_page_attr {
static inline void set_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- __set_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- __set_bit(attr, &bitmap->logattrs);
+ __set_bit((page->index<<2) + attr, bitmap->filemap_attr);
}
static inline void clear_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- __clear_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- __clear_bit(attr, &bitmap->logattrs);
+ __clear_bit((page->index<<2) + attr, bitmap->filemap_attr);
}
static inline unsigned long test_page_attr(struct bitmap *bitmap, struct page *page,
enum bitmap_page_attr attr)
{
- if (page)
- return test_bit((page->index<<2) + attr, bitmap->filemap_attr);
- else
- return test_bit(attr, &bitmap->logattrs);
+ return test_bit((page->index<<2) + attr, bitmap->filemap_attr);
}
/*
@@ -912,30 +900,26 @@ static inline unsigned long test_page_attr(struct bitmap *bitmap, struct page *p
static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
{
unsigned long bit;
- struct page *page = NULL;
+ struct page *page;
void *kaddr;
unsigned long chunk = block >> CHUNK_BLOCK_SHIFT(bitmap);
- if (!bitmap->filemap) {
- struct dm_dirty_log *log = bitmap->mddev->bitmap_info.log;
- if (log)
- log->type->mark_region(log, chunk);
- } else {
+ if (!bitmap->filemap)
+ return;
- page = filemap_get_page(bitmap, chunk);
- if (!page)
- return;
- bit = file_page_offset(bitmap, chunk);
+ page = filemap_get_page(bitmap, chunk);
+ if (!page)
+ return;
+ bit = file_page_offset(bitmap, chunk);
- /* set the bit */
- kaddr = kmap_atomic(page, KM_USER0);
- if (bitmap->flags & BITMAP_HOSTENDIAN)
- set_bit(bit, kaddr);
- else
- __set_bit_le(bit, kaddr);
- kunmap_atomic(kaddr, KM_USER0);
- PRINTK("set file bit %lu page %lu\n", bit, page->index);
- }
+ /* set the bit */
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (bitmap->flags & BITMAP_HOSTENDIAN)
+ set_bit(bit, kaddr);
+ else
+ __set_bit_le(bit, kaddr);
+ kunmap_atomic(kaddr, KM_USER0);
+ PRINTK("set file bit %lu page %lu\n", bit, page->index);
/* record page number so it gets flushed to disk when unplug occurs */
set_page_attr(bitmap, page, BITMAP_PAGE_DIRTY);
}
@@ -952,16 +936,6 @@ void bitmap_unplug(struct bitmap *bitmap)
if (!bitmap)
return;
- if (!bitmap->filemap) {
- /* Must be using a dirty_log */
- struct dm_dirty_log *log = bitmap->mddev->bitmap_info.log;
- dirty = test_and_clear_bit(BITMAP_PAGE_DIRTY, &bitmap->logattrs);
- need_write = test_and_clear_bit(BITMAP_PAGE_NEEDWRITE, &bitmap->logattrs);
- if (dirty || need_write)
- if (log->type->flush(log))
- bitmap->flags |= BITMAP_WRITE_ERROR;
- goto out;
- }
/* look at each page to see if there are any set bits that need to be
* flushed out to disk */
@@ -990,7 +964,6 @@ void bitmap_unplug(struct bitmap *bitmap)
else
md_super_wait(bitmap->mddev);
}
-out:
if (bitmap->flags & BITMAP_WRITE_ERROR)
bitmap_file_kick(bitmap);
}
@@ -1199,7 +1172,6 @@ void bitmap_daemon_work(mddev_t *mddev)
struct page *page = NULL, *lastpage = NULL;
sector_t blocks;
void *paddr;
- struct dm_dirty_log *log = mddev->bitmap_info.log;
/* Use a mutex to guard daemon_work against
* bitmap_destroy.
@@ -1224,12 +1196,11 @@ void bitmap_daemon_work(mddev_t *mddev)
spin_lock_irqsave(&bitmap->lock, flags);
for (j = 0; j < bitmap->chunks; j++) {
bitmap_counter_t *bmc;
- if (!bitmap->filemap) {
- if (!log)
- /* error or shutdown */
- break;
- } else
- page = filemap_get_page(bitmap, j);
+ if (!bitmap->filemap)
+ /* error or shutdown */
+ break;
+
+ page = filemap_get_page(bitmap, j);
if (page != lastpage) {
/* skip this page unless it's marked as needing cleaning */
@@ -1298,19 +1269,16 @@ void bitmap_daemon_work(mddev_t *mddev)
-1);
/* clear the bit */
- if (page) {
- paddr = kmap_atomic(page, KM_USER0);
- if (bitmap->flags & BITMAP_HOSTENDIAN)
- clear_bit(file_page_offset(bitmap, j),
- paddr);
- else
- __clear_bit_le(
+ paddr = kmap_atomic(page, KM_USER0);
+ if (bitmap->flags & BITMAP_HOSTENDIAN)
+ clear_bit(file_page_offset(bitmap, j),
+ paddr);
+ else
+ __clear_bit_le(
file_page_offset(bitmap,
j),
paddr);
- kunmap_atomic(paddr, KM_USER0);
- } else
- log->type->clear_region(log, j);
+ kunmap_atomic(paddr, KM_USER0);
}
} else
j |= PAGE_COUNTER_MASK;
@@ -1318,16 +1286,12 @@ void bitmap_daemon_work(mddev_t *mddev)
spin_unlock_irqrestore(&bitmap->lock, flags);
/* now sync the final page */
- if (lastpage != NULL || log != NULL) {
+ if (lastpage != NULL) {
spin_lock_irqsave(&bitmap->lock, flags);
if (test_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE)) {
clear_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
- if (lastpage)
- write_page(bitmap, lastpage, 0);
- else
- if (log->type->flush(log))
- bitmap->flags |= BITMAP_WRITE_ERROR;
+ write_page(bitmap, lastpage, 0);
} else {
set_page_attr(bitmap, lastpage, BITMAP_PAGE_NEEDWRITE);
spin_unlock_irqrestore(&bitmap->lock, flags);
@@ -1769,12 +1733,10 @@ int bitmap_create(mddev_t *mddev)
BUILD_BUG_ON(sizeof(bitmap_super_t) != 256);
if (!file
- && !mddev->bitmap_info.offset
- && !mddev->bitmap_info.log) /* bitmap disabled, nothing to do */
+ && !mddev->bitmap_info.offset) /* bitmap disabled, nothing to do */
return 0;
BUG_ON(file && mddev->bitmap_info.offset);
- BUG_ON(mddev->bitmap_info.offset && mddev->bitmap_info.log);
bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
if (!bitmap)
@@ -1865,6 +1827,7 @@ int bitmap_create(mddev_t *mddev)
int bitmap_load(mddev_t *mddev)
{
int err = 0;
+ sector_t start = 0;
sector_t sector = 0;
struct bitmap *bitmap = mddev->bitmap;
@@ -1883,24 +1846,14 @@ int bitmap_load(mddev_t *mddev)
}
bitmap_close_sync(bitmap);
- if (mddev->bitmap_info.log) {
- unsigned long i;
- struct dm_dirty_log *log = mddev->bitmap_info.log;
- for (i = 0; i < bitmap->chunks; i++)
- if (!log->type->in_sync(log, i, 1))
- bitmap_set_memory_bits(bitmap,
- (sector_t)i << CHUNK_BLOCK_SHIFT(bitmap),
- 1);
- } else {
- sector_t start = 0;
- if (mddev->degraded == 0
- || bitmap->events_cleared == mddev->events)
- /* no need to keep dirty bits to optimise a
- * re-add of a missing device */
- start = mddev->recovery_cp;
-
- err = bitmap_init_from_disk(bitmap, start);
- }
+ if (mddev->degraded == 0
+ || bitmap->events_cleared == mddev->events)
+ /* no need to keep dirty bits to optimise a
+ * re-add of a missing device */
+ start = mddev->recovery_cp;
+
+ err = bitmap_init_from_disk(bitmap, start);
+
if (err)
goto out;
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index b2a127e..a28f2e5 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -212,10 +212,6 @@ struct bitmap {
unsigned long file_pages; /* number of pages in the file */
int last_page_size; /* bytes in the last page */
- unsigned long logattrs; /* used when filemap_attr doesn't exist
- * because we are working with a dirty_log
- */
-
unsigned long flags;
int allclean;
@@ -237,7 +233,6 @@ struct bitmap {
wait_queue_head_t behind_wait;
struct sysfs_dirent *sysfs_can_clear;
-
};
/* the bitmap API */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index de5455d..7d906a9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -307,11 +307,6 @@ struct mddev_s
* hot-adding a bitmap. It should
* eventually be settable by sysfs.
*/
- /* When md is serving under dm, it might use a
- * dirty_log to store the bits.
- */
- struct dm_dirty_log *log;
-
struct mutex mutex;
unsigned long chunksize;
unsigned long daemon_sleep; /* how many jiffies between updates? */
* [md PATCH 32/34] md/raid5: Avoid BUG caused by multiple failures.
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (30 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 33/34] MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 34/34] MD bitmap: Revert DM dirty log hooks NeilBrown
2011-07-21 2:32 ` [md PATCH 31/34] md/raid10: move rdev->corrected_errors counting NeilBrown
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid
While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.
If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.
So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ee26f50..6337768 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -340,7 +340,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
(unsigned long long)sh->sector, i, dev->toread,
dev->read, dev->towrite, dev->written,
test_bit(R5_LOCKED, &dev->flags));
- BUG();
+ WARN_ON(1);
}
dev->flags = 0;
raid5_build_block(sh, i, previous);
@@ -2301,6 +2301,10 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh,
if (bitmap_end)
bitmap_endwrite(conf->mddev->bitmap, sh->sector,
STRIPE_SECTORS, 0, 0);
+ /* If we were in the middle of a write the parity block might
+ * still be locked - so just clear all R5_LOCKED flags
+ */
+ clear_bit(R5_LOCKED, &sh->dev[i].flags);
}
if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
* [md PATCH 30/34] md/raid5: move rdev->corrected_errors counting
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (26 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 24/34] md: remove ro check in md_check_recovery() NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 28/34] md: get rid of unnecessary casts on page_address() NeilBrown
` (5 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Read errors are considered to be corrected if the write-back and
re-read cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting after the re-read looks more
reasonable IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3c67942..ee26f50 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -547,10 +547,6 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
bi->bi_io_vec[0].bv_offset = 0;
bi->bi_size = STRIPE_SIZE;
bi->bi_next = NULL;
- if ((rw & WRITE) &&
- test_bit(R5_ReWrite, &sh->dev[i].flags))
- atomic_add(STRIPE_SECTORS,
- &rdev->corrected_errors);
generic_make_request(bi);
} else {
if (rw & WRITE)
@@ -1590,6 +1586,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
(unsigned long long)(sh->sector
+ rdev->data_offset),
bdevname(rdev->bdev, b));
+ atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
}
* [md PATCH 31/34] md/raid10: move rdev->corrected_errors counting
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (32 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 34/34] MD bitmap: Revert DM dirty log hooks NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
Read errors are considered to be corrected if the write-back and
re-read cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting after the re-read looks more
reasonable IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6721cb0..5def27c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1533,7 +1533,6 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
test_bit(In_sync, &rdev->flags)) {
atomic_inc(&rdev->nr_pending);
rcu_read_unlock();
- atomic_add(s, &rdev->corrected_errors);
if (sync_page_io(rdev,
r10_bio->devs[sl].addr +
sect,
@@ -1598,6 +1597,7 @@ static void fix_read_error(conf_t *conf, mddev_t *mddev, r10bio_t *r10_bio)
(unsigned long long)(
sect + rdev->data_offset),
bdevname(rdev->bdev, b));
+ atomic_add(s, &rdev->corrected_errors);
}
rdev_dec_pending(rdev, mddev);
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [md PATCH 28/34] md: get rid of unnecessary casts on page_address()
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
` (27 preceding siblings ...)
2011-07-21 2:32 ` [md PATCH 30/34] md/raid5: move rdev->corrected_errors counting NeilBrown
@ 2011-07-21 2:32 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 29/34] md/raid1: move rdev->corrected_errors counting NeilBrown
` (4 subsequent siblings)
33 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-21 2:32 UTC (permalink / raw)
To: linux-raid; +Cc: Namhyung Kim
From: Namhyung Kim <namhyung@gmail.com>
page_address() returns a void pointer, so the casts can be removed.
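For example (a generic illustration, not a hunk from this patch):

	mdp_super_t *sb;

	sb = (mdp_super_t *)page_address(rdev->sb_page);  /* cast is redundant */
	sb = page_address(rdev->sb_page);                  /* void * converts implicitly */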
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 23 +++++++++++------------
1 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c7d9c6a..2a32050 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1025,7 +1025,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
ret = -EINVAL;
bdevname(rdev->bdev, b);
- sb = (mdp_super_t*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
if (sb->md_magic != MD_SB_MAGIC) {
printk(KERN_ERR "md: invalid raid superblock magic on %s\n",
@@ -1064,7 +1064,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
ret = 1;
} else {
__u64 ev1, ev2;
- mdp_super_t *refsb = (mdp_super_t*)page_address(refdev->sb_page);
+ mdp_super_t *refsb = page_address(refdev->sb_page);
if (!uuid_equal(refsb, sb)) {
printk(KERN_WARNING "md: %s has different UUID to %s\n",
b, bdevname(refdev->bdev,b2));
@@ -1099,7 +1099,7 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
{
mdp_disk_t *desc;
- mdp_super_t *sb = (mdp_super_t *)page_address(rdev->sb_page);
+ mdp_super_t *sb = page_address(rdev->sb_page);
__u64 ev1 = md_event(sb);
rdev->raid_disk = -1;
@@ -1230,7 +1230,7 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)
rdev->sb_size = MD_SB_BYTES;
- sb = (mdp_super_t*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
memset(sb, 0, sizeof(*sb));
@@ -1435,7 +1435,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
if (ret) return ret;
- sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
if (sb->magic != cpu_to_le32(MD_SB_MAGIC) ||
sb->major_version != cpu_to_le32(1) ||
@@ -1477,8 +1477,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
ret = 1;
} else {
__u64 ev1, ev2;
- struct mdp_superblock_1 *refsb =
- (struct mdp_superblock_1*)page_address(refdev->sb_page);
+ struct mdp_superblock_1 *refsb = page_address(refdev->sb_page);
if (memcmp(sb->set_uuid, refsb->set_uuid, 16) != 0 ||
sb->level != refsb->level ||
@@ -1513,7 +1512,7 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
{
- struct mdp_superblock_1 *sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ struct mdp_superblock_1 *sb = page_address(rdev->sb_page);
__u64 ev1 = le64_to_cpu(sb->events);
rdev->raid_disk = -1;
@@ -1619,7 +1618,7 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
int max_dev, i;
/* make rdev->sb match mddev and rdev data. */
- sb = (struct mdp_superblock_1*)page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
sb->feature_map = 0;
sb->pad0 = 0;
@@ -1724,7 +1723,7 @@ super_1_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
num_sectors = max_sectors;
rdev->sb_start = sb_start;
}
- sb = (struct mdp_superblock_1 *) page_address(rdev->sb_page);
+ sb = page_address(rdev->sb_page);
sb->data_size = cpu_to_le64(num_sectors);
sb->super_offset = rdev->sb_start;
sb->sb_csum = calc_sb_1_csum(sb);
@@ -2127,10 +2126,10 @@ static void print_rdev(mdk_rdev_t *rdev, int major_version)
printk(KERN_INFO "md: rdev superblock (MJ:%d):\n", major_version);
switch (major_version) {
case 0:
- print_sb_90((mdp_super_t*)page_address(rdev->sb_page));
+ print_sb_90(page_address(rdev->sb_page));
break;
case 1:
- print_sb_1((struct mdp_superblock_1 *)page_address(rdev->sb_page));
+ print_sb_1(page_address(rdev->sb_page));
break;
}
} else
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request
2011-07-21 2:32 ` [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request NeilBrown
@ 2011-07-22 3:39 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 3:39 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> This is the start of a series of patches to remove sh->lock.
>
> sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
> there is no race with testing it in handle_stripe[56].
>
> Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
> in handle_stripe[56] (after getting the same lock) and perform the
> same set/clear operations if it was set.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
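For anyone following the series, the handoff described above amounts to
roughly this (a sketch, not the exact hunks):

	/* sync_request(): instead of taking sh->lock ... */
	set_bit(STRIPE_SYNC_REQUESTED, &sh->state);

	/* handle_stripe[56](), early on and under the same lock: */
	if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
		set_bit(STRIPE_SYNCING, &sh->state);
		clear_bit(STRIPE_INSYNC, &sh->state);
	}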
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock.
2011-07-21 2:32 ` [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock NeilBrown
@ 2011-07-22 3:54 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 3:54 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Other places that change or follow dev->towrite and dev->written take
> the device_lock as well as the sh->lock.
> So it should really be held in these places too.
> Also, doing so will allow sh->lock to be discarded.
>
> with merged fixes by: Namhyung Kim <namhyung@gmail.com>
>
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-07-21 2:32 ` [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag NeilBrown
@ 2011-07-22 4:27 ` Namhyung Kim
2011-07-22 4:49 ` NeilBrown
2011-08-03 22:47 ` Dan Williams
1 sibling, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 4:27 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> sh->lock is now mainly used to ensure that two threads aren't running
> in the locked part of handle_stripe[56] at the same time.
>
> That can more neatly be achieved with an 'active' flag which we set
> while running handle_stripe. If we find the flag is set, we simply
> requeue the stripe for later by setting STRIPE_HANDLE.
>
> For safety we take ->device_lock while examining the state of the
> stripe and creating a summary in 'stripe_head_state / r6_state'.
> This possibly isn't needed but as shared fields like ->toread,
> ->towrite are checked it is safer for now at least.
>
> We leave the label after the old 'unlock' called "unlock" because it
> will disappear in a few patches, so renaming seems pointless.
>
> This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> later, but that is not a problem.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
But I have a question, please see below.
> ---
>
> drivers/md/raid5.c | 26 +++++++++++++-------------
> drivers/md/raid5.h | 2 +-
> 2 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9985138..f8275b5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -1020,14 +1020,12 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
> if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
> struct bio *wbi;
>
> - spin_lock(&sh->lock);
> spin_lock_irq(&sh->raid_conf->device_lock);
> chosen = dev->towrite;
> dev->towrite = NULL;
> BUG_ON(dev->written);
> wbi = dev->written = chosen;
> spin_unlock_irq(&sh->raid_conf->device_lock);
> - spin_unlock(&sh->lock);
>
> while (wbi && wbi->bi_sector <
> dev->sector + STRIPE_SECTORS) {
> @@ -1322,7 +1320,6 @@ static int grow_one_stripe(raid5_conf_t *conf)
> return 0;
>
> sh->raid_conf = conf;
> - spin_lock_init(&sh->lock);
> #ifdef CONFIG_MULTICORE_RAID456
> init_waitqueue_head(&sh->ops.wait_for_ops);
> #endif
> @@ -1442,7 +1439,6 @@ static int resize_stripes(raid5_conf_t *conf, int newsize)
> break;
>
> nsh->raid_conf = conf;
> - spin_lock_init(&nsh->lock);
> #ifdef CONFIG_MULTICORE_RAID456
> init_waitqueue_head(&nsh->ops.wait_for_ops);
> #endif
> @@ -2148,7 +2144,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> (unsigned long long)sh->sector);
>
>
> - spin_lock(&sh->lock);
> spin_lock_irq(&conf->device_lock);
> if (forwrite) {
> bip = &sh->dev[dd_idx].towrite;
> @@ -2184,7 +2179,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
> }
> spin_unlock_irq(&conf->device_lock);
> - spin_unlock(&sh->lock);
>
> pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> (unsigned long long)(*bip)->bi_sector,
> @@ -2201,7 +2195,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> overlap:
> set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
> spin_unlock_irq(&conf->device_lock);
> - spin_unlock(&sh->lock);
> return 0;
> }
>
> @@ -3023,12 +3016,10 @@ static void handle_stripe5(struct stripe_head *sh)
> atomic_read(&sh->count), sh->pd_idx, sh->check_state,
> sh->reconstruct_state);
>
> - spin_lock(&sh->lock);
> if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
> set_bit(STRIPE_SYNCING, &sh->state);
> clear_bit(STRIPE_INSYNC, &sh->state);
> }
> - clear_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
>
> s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
> @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
>
> /* Now to look around and see what can be done */
> rcu_read_lock();
> + spin_lock_irq(&conf->device_lock);
Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only prevents
the task from preemption, but spin_lock does the same thing as well.
I know it was already there under sh->lock before this patch, and
it doesn't hurt anything, but I'm not sure it is really needed.
> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
>
> @@ -3099,6 +3091,7 @@ static void handle_stripe5(struct stripe_head *sh)
> s.failed_num = i;
> }
> }
> + spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> if (unlikely(blocked_rdev)) {
> @@ -3275,7 +3268,6 @@ static void handle_stripe5(struct stripe_head *sh)
> handle_stripe_expansion(conf, sh, NULL);
>
> unlock:
> - spin_unlock(&sh->lock);
>
> /* wait for this device to become unblocked */
> if (unlikely(blocked_rdev))
> @@ -3318,12 +3310,10 @@ static void handle_stripe6(struct stripe_head *sh)
> sh->check_state, sh->reconstruct_state);
> memset(&s, 0, sizeof(s));
>
> - spin_lock(&sh->lock);
> if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
> set_bit(STRIPE_SYNCING, &sh->state);
> clear_bit(STRIPE_INSYNC, &sh->state);
> }
> - clear_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
>
> s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
> @@ -3332,6 +3322,7 @@ static void handle_stripe6(struct stripe_head *sh)
> /* Now to look around and see what can be done */
>
> rcu_read_lock();
> + spin_lock_irq(&conf->device_lock);
Same here.
> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
> dev = &sh->dev[i];
> @@ -3395,6 +3386,7 @@ static void handle_stripe6(struct stripe_head *sh)
> s.failed++;
> }
> }
> + spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> if (unlikely(blocked_rdev)) {
> @@ -3580,7 +3572,6 @@ static void handle_stripe6(struct stripe_head *sh)
> handle_stripe_expansion(conf, sh, &r6s);
>
> unlock:
> - spin_unlock(&sh->lock);
>
> /* wait for this device to become unblocked */
> if (unlikely(blocked_rdev))
> @@ -3608,10 +3599,19 @@ static void handle_stripe6(struct stripe_head *sh)
>
> static void handle_stripe(struct stripe_head *sh)
> {
> + clear_bit(STRIPE_HANDLE, &sh->state);
> + if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> + /* already being handled, ensure it gets handled
> + * again when current action finishes */
> + set_bit(STRIPE_HANDLE, &sh->state);
> + return;
> + }
> +
> if (sh->raid_conf->level == 6)
> handle_stripe6(sh);
> else
> handle_stripe5(sh);
> + clear_bit(STRIPE_ACTIVE, &sh->state);
> }
>
> static void raid5_activate_delayed(raid5_conf_t *conf)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index a330011..217a9d4 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -209,7 +209,6 @@ struct stripe_head {
> short ddf_layout;/* use DDF ordering to calculate Q */
> unsigned long state; /* state flags */
> atomic_t count; /* nr of active thread/requests */
> - spinlock_t lock;
> int bm_seq; /* sequence number for bitmap flushes */
> int disks; /* disks in stripe */
> enum check_states check_state;
> @@ -290,6 +289,7 @@ struct r6_state {
> * Stripe state
> */
> enum {
> + STRIPE_ACTIVE,
> STRIPE_HANDLE,
> STRIPE_SYNC_REQUESTED,
> STRIPE_SYNCING,
>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 09/34] md/raid5: move common code into handle_stripe
2011-07-21 2:32 ` [md PATCH 09/34] md/raid5: move common code into handle_stripe NeilBrown
@ 2011-07-22 4:30 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 4:30 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> There is common code at the start of handle_stripe5 and
> handle_stripe6. Move it into handle_stripe.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-07-22 4:27 ` Namhyung Kim
@ 2011-07-22 4:49 ` NeilBrown
2011-07-22 5:03 ` Namhyung Kim
0 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-22 4:49 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 13:27:36 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > sh->lock is now mainly used to ensure that two threads aren't running
> > in the locked part of handle_stripe[56] at the same time.
> >
> > That can more neatly be achieved with an 'active' flag which we set
> > while running handle_stripe. If we find the flag is set, we simply
> > requeue the stripe for later by setting STRIPE_HANDLE.
> >
> > For safety we take ->device_lock while examining the state of the
> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > This possibly isn't needed but as shared fields like ->toread,
> > ->towrite are checked it is safer for now at least.
> >
> > We leave the label after the old 'unlock' called "unlock" because it
> > will disappear in a few patches, so renaming seems pointless.
> >
> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> > later, but that is not a problem.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Thanks.
>
> But I have a question, please see below.
I like questions...
> > @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
> >
> > /* Now to look around and see what can be done */
> > rcu_read_lock();
> > + spin_lock_irq(&conf->device_lock);
>
> Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only prevents
> the task from preemption, but spin_lock does the same thing as well.
>
> I know it was already there under sh->lock before this patch, and
> it doesn't hurt anything, but I'm not sure it is really needed.
I see your point, but I think there are two reasons why it really is needed.
1/ I think rcu_read_lock does more than prevent preemption. What it does
exactly I don't know and I think it depends on which RCU scheme was chosen
when configuring the kernel. But what it conceptually does is stop
a subsequent synchronize_rcu() from completing, and that is what I need.
Disabling preemption might stop that as well, but I really don't know and
so cannot trust it.
2/ It is useful documentation. A reader seeing rcu_read_lock() knows that
there must be some rcu-protected data here which needs to be handled with
a little care and so they can be on the lookout for it.
So I think it is really needed.
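If it helps, the generic shape of the pattern being relied on is roughly
this (do_something_with() and free_it() are just placeholders):

	/* reader */
	rcu_read_lock();
	rdev = rcu_dereference(conf->disks[i].rdev);
	if (rdev)
		do_something_with(rdev);	/* rdev stays valid until the unlock */
	rcu_read_unlock();

	/* updater, e.g. when hot-removing a device */
	rcu_assign_pointer(conf->disks[i].rdev, NULL);
	synchronize_rcu();	/* waits for all current rcu_read_lock() sections */
	free_it(rdev);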
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-21 2:32 ` [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state NeilBrown
@ 2011-07-22 4:49 ` Namhyung Kim
2011-07-22 5:15 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 4:49 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> 'struct stripe_head_state' stores state about the 'current' stripe
> that is passed around while handling the stripe.
> For RAID6 there is an extension structure: r6_state, which is also
> passed around.
> There is no value in keeping these separate, so move the fields from
> the latter into the former.
>
> This means that all code now needs to treat s->failed_num as a small
> array, but this is a small cost.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> drivers/md/raid5.c | 79 ++++++++++++++++++++++++++--------------------------
> drivers/md/raid5.h | 8 +----
> 2 files changed, 41 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index dfb3d9f..c32ffb5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2318,7 +2318,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
> int disk_idx, int disks)
> {
> struct r5dev *dev = &sh->dev[disk_idx];
> - struct r5dev *failed_dev = &sh->dev[s->failed_num];
> + struct r5dev *failed_dev = &sh->dev[s->failed_num[0]];
>
> /* is the data in this block needed, and can we get it? */
> if (!test_bit(R5_LOCKED, &dev->flags) &&
> @@ -2334,7 +2334,7 @@ static int fetch_block5(struct stripe_head *sh, struct stripe_head_state *s,
> * otherwise read it if the backing disk is insync
> */
> if ((s->uptodate == disks - 1) &&
> - (s->failed && disk_idx == s->failed_num)) {
> + (s->failed && disk_idx == s->failed_num[0])) {
> set_bit(STRIPE_COMPUTE_RUN, &sh->state);
> set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
> set_bit(R5_Wantcompute, &dev->flags);
> @@ -2388,11 +2388,11 @@ static void handle_stripe_fill5(struct stripe_head *sh,
> * 0 to tell the loop in handle_stripe_fill6 to continue
> */
> static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> - struct r6_state *r6s, int disk_idx, int disks)
> + int disk_idx, int disks)
> {
> struct r5dev *dev = &sh->dev[disk_idx];
> - struct r5dev *fdev[2] = { &sh->dev[r6s->failed_num[0]],
> - &sh->dev[r6s->failed_num[1]] };
> + struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
> + &sh->dev[s->failed_num[1]] };
>
> if (!test_bit(R5_LOCKED, &dev->flags) &&
> !test_bit(R5_UPTODATE, &dev->flags) &&
> @@ -2409,8 +2409,8 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
> BUG_ON(test_bit(R5_Wantread, &dev->flags));
> if ((s->uptodate == disks - 1) &&
> - (s->failed && (disk_idx == r6s->failed_num[0] ||
> - disk_idx == r6s->failed_num[1]))) {
> + (s->failed && (disk_idx == s->failed_num[0] ||
> + disk_idx == s->failed_num[1]))) {
> /* have disk failed, and we're requested to fetch it;
> * do compute it
> */
> @@ -2465,7 +2465,7 @@ static int fetch_block6(struct stripe_head *sh, struct stripe_head_state *s,
> * handle_stripe_fill6 - read or compute data to satisfy pending requests.
> */
> static void handle_stripe_fill6(struct stripe_head *sh,
> - struct stripe_head_state *s, struct r6_state *r6s,
> + struct stripe_head_state *s,
> int disks)
> {
> int i;
> @@ -2477,7 +2477,7 @@ static void handle_stripe_fill6(struct stripe_head *sh,
> if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
> !sh->reconstruct_state)
> for (i = disks; i--; )
> - if (fetch_block6(sh, s, r6s, i, disks))
> + if (fetch_block6(sh, s, i, disks))
> break;
> set_bit(STRIPE_HANDLE, &sh->state);
> }
> @@ -2625,7 +2625,7 @@ static void handle_stripe_dirtying5(raid5_conf_t *conf,
>
> static void handle_stripe_dirtying6(raid5_conf_t *conf,
> struct stripe_head *sh, struct stripe_head_state *s,
> - struct r6_state *r6s, int disks)
> + int disks)
> {
> int rcw = 0, pd_idx = sh->pd_idx, i;
> int qd_idx = sh->qd_idx;
> @@ -2688,7 +2688,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
> s->uptodate--;
> break;
> }
> - dev = &sh->dev[s->failed_num];
> + dev = &sh->dev[s->failed_num[0]];
> /* fall through */
> case check_state_compute_result:
> sh->check_state = check_state_idle;
> @@ -2760,7 +2760,7 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
>
> static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> struct stripe_head_state *s,
> - struct r6_state *r6s, int disks)
> + int disks)
> {
> int pd_idx = sh->pd_idx;
> int qd_idx = sh->qd_idx;
> @@ -2779,14 +2779,14 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> switch (sh->check_state) {
> case check_state_idle:
> /* start a new check operation if there are < 2 failures */
> - if (s->failed == r6s->q_failed) {
> + if (s->failed == s->q_failed) {
> /* The only possible failed device holds Q, so it
> * makes sense to check P (If anything else were failed,
> * we would have used P to recreate it).
> */
> sh->check_state = check_state_run;
> }
> - if (!r6s->q_failed && s->failed < 2) {
> + if (!s->q_failed && s->failed < 2) {
> /* Q is not failed, and we didn't use it to generate
> * anything, so it makes sense to check it
> */
> @@ -2828,13 +2828,13 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> */
> BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
> if (s->failed == 2) {
> - dev = &sh->dev[r6s->failed_num[1]];
> + dev = &sh->dev[s->failed_num[1]];
> s->locked++;
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantwrite, &dev->flags);
> }
> if (s->failed >= 1) {
> - dev = &sh->dev[r6s->failed_num[0]];
> + dev = &sh->dev[s->failed_num[0]];
> s->locked++;
> set_bit(R5_LOCKED, &dev->flags);
> set_bit(R5_Wantwrite, &dev->flags);
> @@ -2922,7 +2922,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
> }
>
> static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> - struct r6_state *r6s)
> + struct stripe_head_state *r6s)
> {
> int i;
>
> @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> for (j = 0; j < conf->raid_disks; j++)
> if (j != sh2->pd_idx &&
> - (!r6s || j != sh2->qd_idx) &&
> + (r6s || j != sh2->qd_idx) &&
Why is this changed?
> !test_bit(R5_Expanded, &sh2->dev[j].flags))
> break;
> if (j == conf->raid_disks) {
> @@ -3082,7 +3082,7 @@ static void handle_stripe5(struct stripe_head *sh)
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> s.failed++;
> - s.failed_num = i;
> + s.failed_num[0] = i;
> }
> }
> spin_unlock_irq(&conf->device_lock);
> @@ -3107,7 +3107,7 @@ static void handle_stripe5(struct stripe_head *sh)
> pr_debug("locked=%d uptodate=%d to_read=%d"
> " to_write=%d failed=%d failed_num=%d\n",
> s.locked, s.uptodate, s.to_read, s.to_write,
> - s.failed, s.failed_num);
> + s.failed, s.failed_num[0]);
> /* check if the array has lost two devices and, if so, some requests might
> * need to be failed
> */
> @@ -3127,7 +3127,7 @@ static void handle_stripe5(struct stripe_head *sh)
> ((test_bit(R5_Insync, &dev->flags) &&
> !test_bit(R5_LOCKED, &dev->flags) &&
> test_bit(R5_UPTODATE, &dev->flags)) ||
> - (s.failed == 1 && s.failed_num == sh->pd_idx)))
> + (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> handle_stripe_clean_event(conf, sh, disks, &return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> @@ -3198,11 +3198,11 @@ static void handle_stripe5(struct stripe_head *sh)
> * the repair/check process
> */
> if (s.failed == 1 && !conf->mddev->ro &&
> - test_bit(R5_ReadError, &sh->dev[s.failed_num].flags)
> - && !test_bit(R5_LOCKED, &sh->dev[s.failed_num].flags)
> - && test_bit(R5_UPTODATE, &sh->dev[s.failed_num].flags)
> + test_bit(R5_ReadError, &sh->dev[s.failed_num[0]].flags)
> + && !test_bit(R5_LOCKED, &sh->dev[s.failed_num[0]].flags)
> + && test_bit(R5_UPTODATE, &sh->dev[s.failed_num[0]].flags)
> ) {
> - dev = &sh->dev[s.failed_num];
> + dev = &sh->dev[s.failed_num[0]];
> if (!test_bit(R5_ReWrite, &dev->flags)) {
> set_bit(R5_Wantwrite, &dev->flags);
> set_bit(R5_ReWrite, &dev->flags);
> @@ -3292,7 +3292,6 @@ static void handle_stripe6(struct stripe_head *sh)
> struct bio *return_bi = NULL;
> int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> struct stripe_head_state s;
> - struct r6_state r6s;
> struct r5dev *dev, *pdev, *qdev;
> mdk_rdev_t *blocked_rdev = NULL;
> int dec_preread_active = 0;
> @@ -3370,7 +3369,7 @@ static void handle_stripe6(struct stripe_head *sh)
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> if (s.failed < 2)
> - r6s.failed_num[s.failed] = i;
> + s.failed_num[s.failed] = i;
> s.failed++;
> }
> }
> @@ -3396,7 +3395,7 @@ static void handle_stripe6(struct stripe_head *sh)
> pr_debug("locked=%d uptodate=%d to_read=%d"
> " to_write=%d failed=%d failed_num=%d,%d\n",
> s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> - r6s.failed_num[0], r6s.failed_num[1]);
> + s.failed_num[0], s.failed_num[1]);
> /* check if the array has lost >2 devices and, if so, some requests
> * might need to be failed
> */
> @@ -3413,17 +3412,17 @@ static void handle_stripe6(struct stripe_head *sh)
> * are safe, or on a failed drive
> */
> pdev = &sh->dev[pd_idx];
> - r6s.p_failed = (s.failed >= 1 && r6s.failed_num[0] == pd_idx)
> - || (s.failed >= 2 && r6s.failed_num[1] == pd_idx);
> + s.p_failed = (s.failed >= 1 && s.failed_num[0] == pd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == pd_idx);
> qdev = &sh->dev[qd_idx];
> - r6s.q_failed = (s.failed >= 1 && r6s.failed_num[0] == qd_idx)
> - || (s.failed >= 2 && r6s.failed_num[1] == qd_idx);
> + s.q_failed = (s.failed >= 1 && s.failed_num[0] == qd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == qd_idx);
>
> - if ( s.written &&
> - ( r6s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> + if (s.written &&
> + (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> && !test_bit(R5_LOCKED, &pdev->flags)
> && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> - ( r6s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> + (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> && !test_bit(R5_LOCKED, &qdev->flags)
> && test_bit(R5_UPTODATE, &qdev->flags)))))
> handle_stripe_clean_event(conf, sh, disks, &return_bi);
> @@ -3434,7 +3433,7 @@ static void handle_stripe6(struct stripe_head *sh)
> */
> if (s.to_read || s.non_overwrite || (s.to_write && s.failed) ||
> (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
> - handle_stripe_fill6(sh, &s, &r6s, disks);
> + handle_stripe_fill6(sh, &s, disks);
>
> /* Now we check to see if any write operations have recently
> * completed
> @@ -3472,7 +3471,7 @@ static void handle_stripe6(struct stripe_head *sh)
> * block.
> */
> if (s.to_write && !sh->reconstruct_state && !sh->check_state)
> - handle_stripe_dirtying6(conf, sh, &s, &r6s, disks);
> + handle_stripe_dirtying6(conf, sh, &s, disks);
>
> /* maybe we need to check and possibly fix the parity for this stripe
> * Any reads will already have been scheduled, so we just see if enough
> @@ -3483,7 +3482,7 @@ static void handle_stripe6(struct stripe_head *sh)
> (s.syncing && s.locked == 0 &&
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> - handle_parity_checks6(conf, sh, &s, &r6s, disks);
> + handle_parity_checks6(conf, sh, &s, disks);
>
> if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> @@ -3495,7 +3494,7 @@ static void handle_stripe6(struct stripe_head *sh)
> */
> if (s.failed <= 2 && !conf->mddev->ro)
> for (i = 0; i < s.failed; i++) {
> - dev = &sh->dev[r6s.failed_num[i]];
> + dev = &sh->dev[s.failed_num[i]];
> if (test_bit(R5_ReadError, &dev->flags)
> && !test_bit(R5_LOCKED, &dev->flags)
> && test_bit(R5_UPTODATE, &dev->flags)
> @@ -3557,7 +3556,7 @@ static void handle_stripe6(struct stripe_head *sh)
>
> if (s.expanding && s.locked == 0 &&
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, &r6s);
> + handle_stripe_expansion(conf, sh, &s);
>
> unlock:
>
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 217a9d4..d3c61d3 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -245,13 +245,9 @@ struct stripe_head_state {
> int syncing, expanding, expanded;
> int locked, uptodate, to_read, to_write, failed, written;
> int to_fill, compute, req_compute, non_overwrite;
> - int failed_num;
> + int failed_num[2];
> unsigned long ops_request;
> -};
> -
> -/* r6_state - extra state data only relevant to r6 */
> -struct r6_state {
> - int p_failed, q_failed, failed_num[2];
> + int p_failed, q_failed;
> };
>
> /* Flags */
>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-07-22 4:49 ` NeilBrown
@ 2011-07-22 5:03 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 5:03 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
2011-07-22 (금), 14:49 +1000, NeilBrown:
> On Fri, 22 Jul 2011 13:27:36 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
>
> > NeilBrown <neilb@suse.de> writes:
> >
> > > sh->lock is now mainly used to ensure that two threads aren't running
> > > in the locked part of handle_stripe[56] at the same time.
> > >
> > > That can more neatly be achieved with an 'active' flag which we set
> > > while running handle_stripe. If we find the flag is set, we simply
> > > requeue the stripe for later by setting STRIPE_HANDLE.
> > >
> > > For safety we take ->device_lock while examining the state of the
> > > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > > This possibly isn't needed but as shared fields like ->toread,
> > > ->towrite are checked it is safer for now at least.
> > >
> > > We leave the label after the old 'unlock' called "unlock" because it
> > > will disappear in a few patches, so renaming seems pointless.
> > >
> > > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> > > later, but that is not a problem.
> > >
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> >
> > Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> Thanks.
>
> >
> > But I have a question, please see below.
>
> I like questions...
>
> > > @@ -3037,6 +3028,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > >
> > > /* Now to look around and see what can be done */
> > > rcu_read_lock();
> > > + spin_lock_irq(&conf->device_lock);
> >
> > Do we still need rcu_read_lock()? AFAIK rcu_read_lock() only prevents
> > the task from preemption, but spin_lock does the same thing as well.
> >
> > I know it was already there under sh->lock before this patch, and
> > it doesn't hurt anything, but I'm not sure it is really needed.
>
> I see your point, but I think there are two reasons why it really is needed.
>
> 1/ I think rcu_read_lock does more than prevent preemption. What it does
> exactly I don't know and I think it depends on which RCU scheme was chosen
> when configuring the kernel. But what it conceptually does is stop
> a subsequent synchronize_rcu() from completing, and that is what I need.
> Disabling preemption might stop that as well, but I really don't know and
> so cannot trust it.
>
> 2/ It is useful documentation. A reader seeing rcu_read_lock() knows that
> there must be some rcu-protected data here which needs to be handled with
> a little care and so they can be on the lookout for it.
>
> So I think it is really needed.
>
> Thanks,
>
> NeilBrown
>
OK, looks reasonable. Thanks for the explanation. :)
--
Regards,
Namhyung Kim
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-22 4:49 ` Namhyung Kim
@ 2011-07-22 5:15 ` NeilBrown
2011-07-22 5:37 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-22 5:15 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 13:49:37 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> > @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> > set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> > for (j = 0; j < conf->raid_disks; j++)
> > if (j != sh2->pd_idx &&
> > - (!r6s || j != sh2->qd_idx) &&
> > + (r6s || j != sh2->qd_idx) &&
>
> Why is this changed?
>
No good reason that I can think of ... so must be a bug. I'll fix it.
In the current code, r6s is always NULL! which is also a bug.
When you get to
md/raid5: Move code for finishing a reconstruction into handle_stripe.
You'll find that the call which passed 's' in was changed to pass 'NULL' in
when merging the RAID5 and RAID6 code. I'll fix that too.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state
2011-07-21 2:32 ` [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state NeilBrown
@ 2011-07-22 5:31 ` Namhyung Kim
2011-07-26 1:35 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 5:31 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Adding these three fields will allow more common code to be moved
> to handle_stripe()
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
and nitpick below.
> ---
>
> drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
> drivers/md/raid5.h | 4 ++++
> 2 files changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c32ffb5..3327e82 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks, i;
> - struct bio *return_bi = NULL;
> struct stripe_head_state s;
> struct r5dev *dev;
> - mdk_rdev_t *blocked_rdev = NULL;
> int prexor;
> - int dec_preread_active = 0;
>
> memset(&s, 0, sizeof(s));
> pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
> @@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
> if (dev->written)
> s.written++;
> rdev = rcu_dereference(conf->disks[i].rdev);
> - if (blocked_rdev == NULL &&
> + if (s.blocked_rdev == NULL &&
> rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - blocked_rdev = rdev;
> + s.blocked_rdev = rdev;
> atomic_inc(&rdev->nr_pending);
> }
> clear_bit(R5_Insync, &dev->flags);
> @@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(blocked_rdev)) {
> + if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
> s.to_write || s.written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> goto unlock;
> }
> /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(blocked_rdev, conf->mddev);
> - blocked_rdev = NULL;
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> }
>
> if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> @@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
> * need to be failed
> */
> if (s.failed > 1 && s.to_read+s.to_write+s.written)
> - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> if (s.failed > 1 && s.syncing) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> clear_bit(STRIPE_SYNCING, &sh->state);
> @@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
> !test_bit(R5_LOCKED, &dev->flags) &&
> test_bit(R5_UPTODATE, &dev->flags)) ||
> (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> * parity, or to satisfy requests
> @@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
> }
> }
> if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> - dec_preread_active = 1;
> + s.dec_preread_active = 1;
> }
>
> /* Now to consider new write requests and what else, if anything
> @@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
> unlock:
>
> /* wait for this device to become unblocked */
> - if (unlikely(blocked_rdev))
> - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> + if (unlikely(s.blocked_rdev))
> + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
> if (s.ops_request)
> raid_run_ops(sh, s.ops_request);
>
> ops_run_io(sh, &s);
>
> - if (dec_preread_active) {
> + if (s.dec_preread_active) {
> /* We delay this until after ops_run_io so that if make_request
> * is waiting on a flush, it won't continue until the writes
> * have actually been submitted.
> @@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
> IO_THRESHOLD)
> md_wakeup_thread(conf->mddev->thread);
> }
> - return_io(return_bi);
> + return_io(s.return_bi);
> }
>
> static void handle_stripe6(struct stripe_head *sh)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> - struct bio *return_bi = NULL;
> int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> struct stripe_head_state s;
> struct r5dev *dev, *pdev, *qdev;
> - mdk_rdev_t *blocked_rdev = NULL;
> - int dec_preread_active = 0;
>
> pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
> @@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
> if (dev->written)
> s.written++;
> rdev = rcu_dereference(conf->disks[i].rdev);
> - if (blocked_rdev == NULL &&
> + if (s.blocked_rdev == NULL &&
> rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - blocked_rdev = rdev;
> + s.blocked_rdev = rdev;
> atomic_inc(&rdev->nr_pending);
> }
> clear_bit(R5_Insync, &dev->flags);
> @@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(blocked_rdev)) {
> + if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
> s.to_write || s.written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> goto unlock;
> }
> /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(blocked_rdev, conf->mddev);
> - blocked_rdev = NULL;
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> }
>
> if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> @@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
> * might need to be failed
> */
> if (s.failed > 2 && s.to_read+s.to_write+s.written)
> - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> if (s.failed > 2 && s.syncing) {
> md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> clear_bit(STRIPE_SYNCING, &sh->state);
> @@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
> (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> && !test_bit(R5_LOCKED, &qdev->flags)
> && test_bit(R5_UPTODATE, &qdev->flags)))))
> - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>
> /* Now we might consider reading some blocks, either to check/generate
> * parity, or to satisfy requests
> @@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
> }
> }
> if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> - dec_preread_active = 1;
> + s.dec_preread_active = 1;
> }
>
> /* Now to consider new write requests and what else, if anything
> @@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
> unlock:
>
> /* wait for this device to become unblocked */
> - if (unlikely(blocked_rdev))
> - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> + if (unlikely(s.blocked_rdev))
> + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
> if (s.ops_request)
> raid_run_ops(sh, s.ops_request);
> @@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
> ops_run_io(sh, &s);
>
>
> - if (dec_preread_active) {
> + if (s.dec_preread_active) {
> /* We delay this until after ops_run_io so that if make_request
> * is waiting on a flush, it won't continue until the writes
> * have actually been submitted.
> @@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
> md_wakeup_thread(conf->mddev->thread);
> }
>
> - return_io(return_bi);
> + return_io(s.return_bi);
> }
>
> static void handle_stripe(struct stripe_head *sh)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index d3c61d3..9ceb574 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -248,6 +248,10 @@ struct stripe_head_state {
> int failed_num[2];
> unsigned long ops_request;
> int p_failed, q_failed;
> +
> + struct bio *return_bi;
> + mdk_rdev_t *blocked_rdev;
> + int dec_preread_active;
> };
I'd rather rearrange the struct to reduce padding on 64-bit:
/* stripe_head_state - collects and tracks the dynamic state of a stripe_head
* for handle_stripe. It is only valid under spin_lock(sh->lock);
*/
struct stripe_head_state {
int syncing, expanding, expanded;
int locked, uptodate, to_read, to_write, failed, written;
int to_fill, compute, req_compute, non_overwrite;
int failed_num[2];
int p_failed, q_failed;
int dec_preread_active;
unsigned long ops_request;
struct bio *return_bi;
mdk_rdev_t *blocked_rdev;
};
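Rough numbers, assuming the usual LP64 layout (4-byte int, 8-byte long
and pointers): the layout in the patch is 15 ints (60) + 4 padding +
ops_request (8) + p_failed/q_failed (8) + return_bi (8) + blocked_rdev
(8) + dec_preread_active (4) + 4 tail padding = 104 bytes, whereas
grouping all the ints first gives 18 ints (72) + 8 + 8 + 8 = 96 bytes,
i.e. 8 bytes less per stripe_head_state.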
BTW, the comment above mentions sh->lock, which disappeared in a previous
patch. It should be fixed as well (in patch 08/34?).
Thanks.
>
> /* Flags */
>
>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-22 5:15 ` NeilBrown
@ 2011-07-22 5:37 ` NeilBrown
2011-07-22 5:53 ` Namhyung Kim
0 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-22 5:37 UTC (permalink / raw)
To: NeilBrown; +Cc: Namhyung Kim, linux-raid
On Fri, 22 Jul 2011 15:15:06 +1000 NeilBrown <neilb@suse.de> wrote:
> On Fri, 22 Jul 2011 13:49:37 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
>
>
> > > @@ -2964,7 +2964,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> > > set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
> > > for (j = 0; j < conf->raid_disks; j++)
> > > if (j != sh2->pd_idx &&
> > > - (!r6s || j != sh2->qd_idx) &&
> > > + (r6s || j != sh2->qd_idx) &&
> >
> > Why is this changed?
> >
>
> No good reason that I can think of ... so must be a bug. I'll fix it.
>
> In the current code, r6s is always NULL! which is also a bug.
> When you get to
>
> md/raid5: Move code for finishing a reconstruction into handle_stripe.
>
> You'll find that the call which passed 's' in was changed to pass 'NULL' in
> when merging the RAID5 and RAID6 code. I'll fix that too.
>
To be specific:
1/ I removed the hunk that you noticed above.
2/ I added the following patch.
3/ I made the obvious changes to the subsequent patch so that it would still
apply.
This is all now in my for-next branch (I haven't added the Reviewed-by's -
I'll do them all together later.)
Thanks,
NeilBrown
From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Fri, 22 Jul 2011 15:29:38 +1000
Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.
This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed. For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j with certainly not equals qd_idx, so there is no need
for a guard on that condition.
So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.
Signed-off-by: NeilBrown <neilb@suse.de>
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 36873cf..793dd76 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2921,8 +2921,7 @@ static void handle_parity_checks6(raid5_conf_t *conf, struct stripe_head *sh,
}
}
-static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
- struct stripe_head_state *r6s)
+static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh)
{
int i;
@@ -2964,7 +2963,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
for (j = 0; j < conf->raid_disks; j++)
if (j != sh2->pd_idx &&
- (!r6s || j != sh2->qd_idx) &&
+ j != sh2->qd_idx &&
!test_bit(R5_Expanded, &sh2->dev[j].flags))
break;
if (j == conf->raid_disks) {
@@ -3249,7 +3248,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, NULL);
+ handle_stripe_expansion(conf, sh);
}
static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
@@ -3512,7 +3511,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
if (s->expanding && s->locked == 0 &&
!test_bit(STRIPE_COMPUTE_RUN, &sh->state))
- handle_stripe_expansion(conf, sh, s);
+ handle_stripe_expansion(conf, sh);
}
static void handle_stripe(struct stripe_head *sh)
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe.
2011-07-21 2:32 ` [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe NeilBrown
@ 2011-07-22 5:41 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 5:41 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> By defining the 'stripe_head_state' in 'handle_stripe', we can move
> some common code out of handle_stripe[56]() and into handle_stripe.
>
> This means that all accesses to stripe_head_state in handle_stripe[56]
> need to be 's->' instead of 's.', but the compiler should inline
> those functions and just use a direct stack reference, and future
> patches will hoist most of this code up into handle_stripe()
> so we will revert to "s.".
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-22 5:37 ` NeilBrown
@ 2011-07-22 5:53 ` Namhyung Kim
2011-07-26 6:44 ` Namhyung Kim
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 5:53 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
2011-07-22 (금), 15:37 +1000, NeilBrown:
> To be specific:
>
> 1/ I removed the hunk that you noticed above.
> 2/ I added the following patch.
> 3/ I made the obvious changes to the subsequent patch so that it would still
> apply.
>
> This is all now in my for-next branch (I haven't added the Reviewed-by's -
> I'll do them all together later.
>
> Thanks,
> NeilBrown
>
> From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Fri, 22 Jul 2011 15:29:38 +1000
> Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.
>
> This arg is only used to differentiate between RAID5 and RAID6 but
> that is not needed. For RAID5, raid5_compute_sector will set qd_idx
> to "~0" so j with certainly not equals qd_idx, so there is no need
> for a guard on that condition.
>
> So remove the guard and remove the arg from the declaration and
> callers of handle_stripe_expansion.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Looks good to me, thanks!
--
Regards,
Namhyung Kim
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction into handle_stripe.
2011-07-21 2:32 ` [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction " NeilBrown
@ 2011-07-22 7:09 ` Namhyung Kim
2011-07-26 1:44 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 7:09 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
> handle_stripe6 to "Finish reconstruct operations initiated by the
> expansion process" was identical.
> That commit added an identical stanza of code to each function, but in
> different places. That was careless.
>
> The raid5 code was correct, so move that out into handle_stripe and
> remove raid6 version.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
and nitpick below..
> ---
>
> drivers/md/raid5.c | 153 +++++++++++++++++++---------------------------------
> 1 files changed, 57 insertions(+), 96 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ce1c291..7b562fc 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2999,7 +2999,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> *
> */
>
> -static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> +static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks, i;
> @@ -3081,7 +3081,7 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> if (s->syncing || s->expanding || s->expanded ||
> s->to_write || s->written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> - return;
> + return 1;
> }
> /* There is nothing for the blocked_rdev to block */
> rdev_dec_pending(s->blocked_rdev, conf->mddev);
> @@ -3205,54 +3205,10 @@ static void handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> s->locked++;
> }
> }
> -
> - /* Finish reconstruct operations initiated by the expansion process */
> - if (sh->reconstruct_state == reconstruct_state_result) {
> - struct stripe_head *sh2
> - = get_active_stripe(conf, sh->sector, 1, 1, 1);
> - if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> - /* sh cannot be written until sh2 has been read.
> - * so arrange for sh to be delayed a little
> - */
> - set_bit(STRIPE_DELAYED, &sh->state);
> - set_bit(STRIPE_HANDLE, &sh->state);
> - if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> - &sh2->state))
> - atomic_inc(&conf->preread_active_stripes);
> - release_stripe(sh2);
> - return;
> - }
> - if (sh2)
> - release_stripe(sh2);
> -
> - sh->reconstruct_state = reconstruct_state_idle;
> - clear_bit(STRIPE_EXPANDING, &sh->state);
> - for (i = conf->raid_disks; i--; ) {
> - set_bit(R5_Wantwrite, &sh->dev[i].flags);
> - set_bit(R5_LOCKED, &sh->dev[i].flags);
> - s->locked++;
> - }
> - }
> -
> - if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> - !sh->reconstruct_state) {
> - /* Need to write out all blocks after computing parity */
> - sh->disks = conf->raid_disks;
> - stripe_set_idx(sh->sector, conf, 0, sh);
> - schedule_reconstruction(sh, s, 1, 1);
> - } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
> - clear_bit(STRIPE_EXPAND_READY, &sh->state);
> - atomic_dec(&conf->reshape_stripes);
> - wake_up(&conf->wait_for_overlap);
> - md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> - }
> -
> - if (s->expanding && s->locked == 0 &&
> - !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, NULL);
> + return 0;
> }
>
> -static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> +static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> @@ -3335,7 +3291,7 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> if (s->syncing || s->expanding || s->expanded ||
> s->to_write || s->written) {
> set_bit(STRIPE_HANDLE, &sh->state);
> - return;
> + return 1;
> }
> /* There is nothing for the blocked_rdev to block */
> rdev_dec_pending(s->blocked_rdev, conf->mddev);
> @@ -3468,57 +3424,15 @@ static void handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> }
> }
> }
> -
> - /* Finish reconstruct operations initiated by the expansion process */
> - if (sh->reconstruct_state == reconstruct_state_result) {
> - sh->reconstruct_state = reconstruct_state_idle;
> - clear_bit(STRIPE_EXPANDING, &sh->state);
> - for (i = conf->raid_disks; i--; ) {
> - set_bit(R5_Wantwrite, &sh->dev[i].flags);
> - set_bit(R5_LOCKED, &sh->dev[i].flags);
> - s->locked++;
> - }
> - }
> -
> - if (s->expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> - !sh->reconstruct_state) {
> - struct stripe_head *sh2
> - = get_active_stripe(conf, sh->sector, 1, 1, 1);
> - if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> - /* sh cannot be written until sh2 has been read.
> - * so arrange for sh to be delayed a little
> - */
> - set_bit(STRIPE_DELAYED, &sh->state);
> - set_bit(STRIPE_HANDLE, &sh->state);
> - if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> - &sh2->state))
> - atomic_inc(&conf->preread_active_stripes);
> - release_stripe(sh2);
> - return;
> - }
> - if (sh2)
> - release_stripe(sh2);
> -
> - /* Need to write out all blocks after computing P&Q */
> - sh->disks = conf->raid_disks;
> - stripe_set_idx(sh->sector, conf, 0, sh);
> - schedule_reconstruction(sh, s, 1, 1);
> - } else if (s->expanded && !sh->reconstruct_state && s->locked == 0) {
> - clear_bit(STRIPE_EXPAND_READY, &sh->state);
> - atomic_dec(&conf->reshape_stripes);
> - wake_up(&conf->wait_for_overlap);
> - md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> - }
> -
> - if (s->expanding && s->locked == 0 &&
> - !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> - handle_stripe_expansion(conf, sh, s);
> + return 0;
> }
>
> static void handle_stripe(struct stripe_head *sh)
> {
> struct stripe_head_state s;
> + int done;
> raid5_conf_t *conf = sh->raid_conf;
> + int i;
These could be declared on one line.
>
> clear_bit(STRIPE_HANDLE, &sh->state);
> if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> @@ -3546,11 +3460,58 @@ static void handle_stripe(struct stripe_head *sh)
> s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
>
> if (conf->level == 6)
> - handle_stripe6(sh, &s);
> + done = handle_stripe6(sh, &s);
> else
> - handle_stripe5(sh, &s);
> + done = handle_stripe5(sh, &s);
> +
> + if (done)
> + goto finish;
> + /* Finish reconstruct operations initiated by the expansion process */
> + if (sh->reconstruct_state == reconstruct_state_result) {
> + struct stripe_head *sh2
> + = get_active_stripe(conf, sh->sector, 1, 1, 1);
Wouldn't it be better to use a more descriptive name rather than sh2,
something like sh_prev, sh_src or whatever?
Thanks
> + if (sh2 && test_bit(STRIPE_EXPAND_SOURCE, &sh2->state)) {
> + /* sh cannot be written until sh2 has been read.
> + * so arrange for sh to be delayed a little
> + */
> + set_bit(STRIPE_DELAYED, &sh->state);
> + set_bit(STRIPE_HANDLE, &sh->state);
> + if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
> + &sh2->state))
> + atomic_inc(&conf->preread_active_stripes);
> + release_stripe(sh2);
> + goto finish;
> + }
> + if (sh2)
> + release_stripe(sh2);
> +
> + sh->reconstruct_state = reconstruct_state_idle;
> + clear_bit(STRIPE_EXPANDING, &sh->state);
> + for (i = conf->raid_disks; i--; ) {
> + set_bit(R5_Wantwrite, &sh->dev[i].flags);
> + set_bit(R5_LOCKED, &sh->dev[i].flags);
> + s.locked++;
> + }
> + }
>
> + if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
> + !sh->reconstruct_state) {
> + /* Need to write out all blocks after computing parity */
> + sh->disks = conf->raid_disks;
> + stripe_set_idx(sh->sector, conf, 0, sh);
> + schedule_reconstruction(sh, &s, 1, 1);
> + } else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
> + clear_bit(STRIPE_EXPAND_READY, &sh->state);
> + atomic_dec(&conf->reshape_stripes);
> + wake_up(&conf->wait_for_overlap);
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> + }
> +
> + if (s.expanding && s.locked == 0 &&
> + !test_bit(STRIPE_COMPUTE_RUN, &sh->state))
> + handle_stripe_expansion(conf, sh, NULL);
>
> +finish:
> /* wait for this device to become unblocked */
> if (unlikely(s.blocked_rdev))
> md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
>
>
* Re: [md PATCH 14/34] md/raid5: move more code into common handle_stripe
2011-07-21 2:32 ` [md PATCH 14/34] md/raid5: move more code into common handle_stripe NeilBrown
@ 2011-07-22 7:32 ` Namhyung Kim
2011-07-26 1:48 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 7:32 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> The difference between the RAID5 and RAID6 code here is easily
> resolved using conf->max_degraded.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
and a coding style issue.
> ---
>
> drivers/md/raid5.c | 90 ++++++++++++++++++----------------------------------
> 1 files changed, 32 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7b562fc..e41b622 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3177,34 +3177,6 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> handle_parity_checks5(conf, sh, s, disks);
> -
> - if (s->syncing && s->locked == 0
> - && test_bit(STRIPE_INSYNC, &sh->state)) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - }
> -
> - /* If the failed drive is just a ReadError, then we might need to progress
> - * the repair/check process
> - */
> - if (s->failed == 1 && !conf->mddev->ro &&
> - test_bit(R5_ReadError, &sh->dev[s->failed_num[0]].flags)
> - && !test_bit(R5_LOCKED, &sh->dev[s->failed_num[0]].flags)
> - && test_bit(R5_UPTODATE, &sh->dev[s->failed_num[0]].flags)
> - ) {
> - dev = &sh->dev[s->failed_num[0]];
> - if (!test_bit(R5_ReWrite, &dev->flags)) {
> - set_bit(R5_Wantwrite, &dev->flags);
> - set_bit(R5_ReWrite, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - } else {
> - /* let's read it back */
> - set_bit(R5_Wantread, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - }
> - }
> return 0;
> }
>
> @@ -3394,36 +3366,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
> !test_bit(STRIPE_INSYNC, &sh->state)))
> handle_parity_checks6(conf, sh, s, disks);
> -
> - if (s->syncing && s->locked == 0
> - && test_bit(STRIPE_INSYNC, &sh->state)) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,1);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - }
> -
> - /* If the failed drives are just a ReadError, then we might need
> - * to progress the repair/check process
> - */
> - if (s->failed <= 2 && !conf->mddev->ro)
> - for (i = 0; i < s->failed; i++) {
> - dev = &sh->dev[s->failed_num[i]];
> - if (test_bit(R5_ReadError, &dev->flags)
> - && !test_bit(R5_LOCKED, &dev->flags)
> - && test_bit(R5_UPTODATE, &dev->flags)
> - ) {
> - if (!test_bit(R5_ReWrite, &dev->flags)) {
> - set_bit(R5_Wantwrite, &dev->flags);
> - set_bit(R5_ReWrite, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - } else {
> - /* let's read it back */
> - set_bit(R5_Wantread, &dev->flags);
> - set_bit(R5_LOCKED, &dev->flags);
> - s->locked++;
> - }
> - }
> - }
> return 0;
> }
>
> @@ -3466,6 +3408,38 @@ static void handle_stripe(struct stripe_head *sh)
>
> if (done)
> goto finish;
> +
> +
> + if (s.syncing && s.locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
> + clear_bit(STRIPE_SYNCING, &sh->state);
> + }
> +
> + /* If the failed drives are just a ReadError, then we might need
> + * to progress the repair/check process
> + */
> + if (s.failed <= conf->max_degraded && !conf->mddev->ro)
> + for (i = 0; i < s.failed; i++) {
> + struct r5dev *dev = &sh->dev[s.failed_num[i]];
> + if (test_bit(R5_ReadError, &dev->flags)
> + && !test_bit(R5_LOCKED, &dev->flags)
> + && test_bit(R5_UPTODATE, &dev->flags)
> + ) {
This line looks weird and should be combined with the previous line.
And I slightly prefer that the logical operators be put on the
same line as the previous conditional, so that the next one starts in
the same column and is more readable, but ...
Thanks.
> + if (!test_bit(R5_ReWrite, &dev->flags)) {
> + set_bit(R5_Wantwrite, &dev->flags);
> + set_bit(R5_ReWrite, &dev->flags);
> + set_bit(R5_LOCKED, &dev->flags);
> + s.locked++;
> + } else {
> + /* let's read it back */
> + set_bit(R5_Wantread, &dev->flags);
> + set_bit(R5_LOCKED, &dev->flags);
> + s.locked++;
> + }
> + }
> + }
> +
> +
> /* Finish reconstruct operations initiated by the expansion process */
> if (sh->reconstruct_state == reconstruct_state_result) {
> struct stripe_head *sh2
>
>
* Re: [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6.
2011-07-21 2:32 ` [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6 NeilBrown
@ 2011-07-22 7:37 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 7:37 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Next patch will unite fetch_block5 and fetch_block6.
> First I want to make the differences a little more clear.
>
> For RAID6 if we are writing at all and there is a failed device, then
> we need to load or compute every block so we can do a
> reconstruct-write.
> This case isn't needed for RAID5 - we will do a read-modify-write in
> that case.
> So make that test a separate test in fetch_block6 rather than merged
> with two other tests.
>
> Make a similar change in fetch_block5 so the one bit that is not
> needed for RAID6 is clearly separate.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
* Re: [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6
2011-07-21 2:32 ` [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6 NeilBrown
@ 2011-07-22 8:24 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 8:24 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Provided that ->failed_num[1] is not a valid device number (which is
> easily achieved) fetch_block6 provides all the functionality of
> fetch_block5.
>
> So remove the latter and rename the former to simply "fetch_block".
>
> Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
> can similarly be united.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-21 2:32 ` [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 NeilBrown
@ 2011-07-22 9:10 ` Namhyung Kim
2011-07-26 1:52 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 9:10 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
> also allow 'read-modify-write'
> Apart from this difference, handle_stripe_dirtying[56] are nearly
> identical. So resolve these differences and create just one function.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
BTW, here is a question:
Why doesn't RAID6 allow read-modify-write? I don't think it is
impossible, so what prevents doing that? Performance? Complexity?
Or is it really not possible? Why? :)
* Re: [md PATCH 18/34] md/raid5: move more common code into handle_stripe
2011-07-21 2:32 ` [md PATCH 18/34] md/raid5: move more common code into handle_stripe NeilBrown
@ 2011-07-22 9:20 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 9:20 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> Apart from 'prexor' which can only be set for RAID5, and
> 'qd_idx' which can only be meaningful for RAID6, these two
> chunks of code are nearly the same.
>
> So combine them into one adding a test to call either
> handle_parity_checks5 or handle_parity_checks6 as appropriate.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
* Re: [md PATCH 19/34] md/raid5: move some more common code into handle_stripe
2011-07-21 2:32 ` [md PATCH 19/34] md/raid5: move some more common code " NeilBrown
@ 2011-07-22 9:29 ` Namhyung Kim
2011-07-26 1:59 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 9:29 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> The RAID6 version of this code is usable for RAID5 providing:
> - we test "conf->max_degraded" rather than "2" as appropriate
> - we make sure s->failed_num[1] is meaningful (and not '-1')
> when s->failed > 1
>
> The 'return 1' must become 'goto finish' in the new location.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
and nitpick below.
> ---
>
> drivers/md/raid5.c | 178 +++++++++++++++++++---------------------------------
> 1 files changed, 66 insertions(+), 112 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ba6f892..f23848f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2969,63 +2969,14 @@ static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> if (test_bit(R5_ReadError, &dev->flags))
> clear_bit(R5_Insync, &dev->flags);
> if (!test_bit(R5_Insync, &dev->flags)) {
> + if (s->failed < 2)
> + s->failed_num[s->failed] = i;
> s->failed++;
> - s->failed_num[0] = i;
> }
> }
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(s->blocked_rdev)) {
> - if (s->syncing || s->expanding || s->expanded ||
> - s->to_write || s->written) {
> - set_bit(STRIPE_HANDLE, &sh->state);
> - return 1;
> - }
> - /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(s->blocked_rdev, conf->mddev);
> - s->blocked_rdev = NULL;
> - }
> -
> - if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> - set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
> - set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> - }
> -
> - pr_debug("locked=%d uptodate=%d to_read=%d"
> - " to_write=%d failed=%d failed_num=%d\n",
> - s->locked, s->uptodate, s->to_read, s->to_write,
> - s->failed, s->failed_num[0]);
> - /* check if the array has lost two devices and, if so, some requests might
> - * need to be failed
> - */
> - if (s->failed > 1 && s->to_read+s->to_write+s->written)
> - handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
> - if (s->failed > 1 && s->syncing) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - s->syncing = 0;
> - }
> -
> - /* might be able to return some write requests if the parity block
> - * is safe, or on a failed drive
> - */
> - dev = &sh->dev[sh->pd_idx];
> - if (s->written &&
> - ((test_bit(R5_Insync, &dev->flags) &&
> - !test_bit(R5_LOCKED, &dev->flags) &&
> - test_bit(R5_UPTODATE, &dev->flags)) ||
> - (s->failed == 1 && s->failed_num[0] == sh->pd_idx)))
> - handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
> -
> - /* Now we might consider reading some blocks, either to check/generate
> - * parity, or to satisfy requests
> - * or to load a block that is being partially written.
> - */
> - if (s->to_read || s->non_overwrite ||
> - (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
> - handle_stripe_fill(sh, s, disks);
> -
> return 0;
> }
>
> @@ -3033,8 +2984,8 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> - int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> - struct r5dev *dev, *pdev, *qdev;
> + struct r5dev *dev;
> + int i;
>
> /* Now to look around and see what can be done */
>
> @@ -3108,65 +3059,6 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
>
> - if (unlikely(s->blocked_rdev)) {
> - if (s->syncing || s->expanding || s->expanded ||
> - s->to_write || s->written) {
> - set_bit(STRIPE_HANDLE, &sh->state);
> - return 1;
> - }
> - /* There is nothing for the blocked_rdev to block */
> - rdev_dec_pending(s->blocked_rdev, conf->mddev);
> - s->blocked_rdev = NULL;
> - }
> -
> - if (s->to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> - set_bit(STRIPE_OP_BIOFILL, &s->ops_request);
> - set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> - }
> -
> - pr_debug("locked=%d uptodate=%d to_read=%d"
> - " to_write=%d failed=%d failed_num=%d,%d\n",
> - s->locked, s->uptodate, s->to_read, s->to_write, s->failed,
> - s->failed_num[0], s->failed_num[1]);
> - /* check if the array has lost >2 devices and, if so, some requests
> - * might need to be failed
> - */
> - if (s->failed > 2 && s->to_read+s->to_write+s->written)
> - handle_failed_stripe(conf, sh, s, disks, &s->return_bi);
> - if (s->failed > 2 && s->syncing) {
> - md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> - clear_bit(STRIPE_SYNCING, &sh->state);
> - s->syncing = 0;
> - }
> -
> - /*
> - * might be able to return some write requests if the parity blocks
> - * are safe, or on a failed drive
> - */
> - pdev = &sh->dev[pd_idx];
> - s->p_failed = (s->failed >= 1 && s->failed_num[0] == pd_idx)
> - || (s->failed >= 2 && s->failed_num[1] == pd_idx);
> - qdev = &sh->dev[qd_idx];
> - s->q_failed = (s->failed >= 1 && s->failed_num[0] == qd_idx)
> - || (s->failed >= 2 && s->failed_num[1] == qd_idx);
> -
> - if (s->written &&
> - (s->p_failed || ((test_bit(R5_Insync, &pdev->flags)
> - && !test_bit(R5_LOCKED, &pdev->flags)
> - && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> - (s->q_failed || ((test_bit(R5_Insync, &qdev->flags)
> - && !test_bit(R5_LOCKED, &qdev->flags)
> - && test_bit(R5_UPTODATE, &qdev->flags)))))
> - handle_stripe_clean_event(conf, sh, disks, &s->return_bi);
> -
> - /* Now we might consider reading some blocks, either to check/generate
> - * parity, or to satisfy requests
> - * or to load a block that is being partially written.
> - */
> - if (s->to_read || s->non_overwrite || (s->to_write && s->failed) ||
> - (s->syncing && (s->uptodate + s->compute < disks)) || s->expanding)
> - handle_stripe_fill(sh, s, disks);
> -
> return 0;
> }
>
> @@ -3178,6 +3070,7 @@ static void handle_stripe(struct stripe_head *sh)
> int i;
> int prexor;
> int disks = sh->disks;
> + struct r5dev *pdev, *qdev;
>
> clear_bit(STRIPE_HANDLE, &sh->state);
> if (test_and_set_bit(STRIPE_ACTIVE, &sh->state)) {
> @@ -3214,6 +3107,67 @@ static void handle_stripe(struct stripe_head *sh)
> if (done)
> goto finish;
>
> + if (unlikely(s.blocked_rdev)) {
> + if (s.syncing || s.expanding || s.expanded ||
> + s.to_write || s.written) {
> + set_bit(STRIPE_HANDLE, &sh->state);
> + goto finish;
> + }
> + /* There is nothing for the blocked_rdev to block */
> + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> + s.blocked_rdev = NULL;
> + }
> +
> + if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> + set_bit(STRIPE_OP_BIOFILL, &s.ops_request);
> + set_bit(STRIPE_BIOFILL_RUN, &sh->state);
> + }
> +
> + pr_debug("locked=%d uptodate=%d to_read=%d"
> + " to_write=%d failed=%d failed_num=%d,%d\n",
> + s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> + s.failed_num[0], s.failed_num[1]);
> + /* check if the array has lost >2 devices and, if so, some requests
> + * might need to be failed
/* check if the array has lost more than max_degraded devices and,
if so, some requests might need to be failed
Thanks.
> + */
> + if (s.failed > conf->max_degraded && s.to_read+s.to_write+s.written)
> + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> + if (s.failed > conf->max_degraded && s.syncing) {
> + md_done_sync(conf->mddev, STRIPE_SECTORS, 0);
> + clear_bit(STRIPE_SYNCING, &sh->state);
> + s.syncing = 0;
> + }
> +
> + /*
> + * might be able to return some write requests if the parity blocks
> + * are safe, or on a failed drive
> + */
> + pdev = &sh->dev[sh->pd_idx];
> + s.p_failed = (s.failed >= 1 && s.failed_num[0] == sh->pd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == sh->pd_idx);
> + qdev = &sh->dev[sh->qd_idx];
> + s.q_failed = (s.failed >= 1 && s.failed_num[0] == sh->qd_idx)
> + || (s.failed >= 2 && s.failed_num[1] == sh->qd_idx)
> + || conf->level < 6;
> +
> + if (s.written &&
> + (s.p_failed || ((test_bit(R5_Insync, &pdev->flags)
> + && !test_bit(R5_LOCKED, &pdev->flags)
> + && test_bit(R5_UPTODATE, &pdev->flags)))) &&
> + (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> + && !test_bit(R5_LOCKED, &qdev->flags)
> + && test_bit(R5_UPTODATE, &qdev->flags)))))
> + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> +
> + /* Now we might consider reading some blocks, either to check/generate
> + * parity, or to satisfy requests
> + * or to load a block that is being partially written.
> + */
> + if (s.to_read || s.non_overwrite
> + || (conf->level == 6 && s.to_write && s.failed)
> + || (s.syncing && (s.uptodate + s.compute < disks)) || s.expanding)
> + handle_stripe_fill(sh, &s, disks);
> +
> /* Now we check to see if any write operations have recently
> * completed
> */
>
>
* Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.
2011-07-21 2:32 ` [md PATCH 20/34] md/raid5: finalise new merged handle_stripe NeilBrown
@ 2011-07-22 9:36 ` Namhyung Kim
2011-07-26 2:02 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-22 9:36 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
NeilBrown <neilb@suse.de> writes:
> handle_stripe5() and handle_stripe6() are now virtually identical.
> So discard one and rename the other to 'analyse_stripe()'.
>
> It always returns 0, so change it to 'void' and remove the 'done'
> variable in handle_stripe().
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> drivers/md/raid5.c | 100 +++-------------------------------------------------
> 1 files changed, 5 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index f23848f..93ee26c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2899,88 +2899,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
> *
> */
>
> -static int handle_stripe5(struct stripe_head *sh, struct stripe_head_state *s)
> -{
> - raid5_conf_t *conf = sh->raid_conf;
> - int disks = sh->disks, i;
> - struct r5dev *dev;
> -
> - /* Now to look around and see what can be done */
> - rcu_read_lock();
> - spin_lock_irq(&conf->device_lock);
> - for (i=disks; i--; ) {
> - mdk_rdev_t *rdev;
> -
> - dev = &sh->dev[i];
> -
> - pr_debug("check %d: state 0x%lx toread %p read %p write %p "
> - "written %p\n", i, dev->flags, dev->toread, dev->read,
> - dev->towrite, dev->written);
> -
> - /* maybe we can request a biofill operation
> - *
> - * new wantfill requests are only permitted while
> - * ops_complete_biofill is guaranteed to be inactive
> - */
> - if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
> - !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
> - set_bit(R5_Wantfill, &dev->flags);
> -
> - /* now count some things */
> - if (test_bit(R5_LOCKED, &dev->flags))
> - s->locked++;
> - if (test_bit(R5_UPTODATE, &dev->flags))
> - s->uptodate++;
> - if (test_bit(R5_Wantcompute, &dev->flags))
> - s->compute++;
> -
> - if (test_bit(R5_Wantfill, &dev->flags))
> - s->to_fill++;
> - else if (dev->toread)
> - s->to_read++;
> - if (dev->towrite) {
> - s->to_write++;
> - if (!test_bit(R5_OVERWRITE, &dev->flags))
> - s->non_overwrite++;
> - }
> - if (dev->written)
> - s->written++;
> - rdev = rcu_dereference(conf->disks[i].rdev);
> - if (s->blocked_rdev == NULL &&
> - rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> - s->blocked_rdev = rdev;
> - atomic_inc(&rdev->nr_pending);
> - }
> - clear_bit(R5_Insync, &dev->flags);
> - if (!rdev)
> - /* Not in-sync */;
> - else if (test_bit(In_sync, &rdev->flags))
> - set_bit(R5_Insync, &dev->flags);
> - else {
> - /* could be in-sync depending on recovery/reshape status */
> - if (sh->sector + STRIPE_SECTORS <= rdev->recovery_offset)
> - set_bit(R5_Insync, &dev->flags);
> - }
> - if (!test_bit(R5_Insync, &dev->flags)) {
> - /* The ReadError flag will just be confusing now */
> - clear_bit(R5_ReadError, &dev->flags);
> - clear_bit(R5_ReWrite, &dev->flags);
> - }
> - if (test_bit(R5_ReadError, &dev->flags))
> - clear_bit(R5_Insync, &dev->flags);
> - if (!test_bit(R5_Insync, &dev->flags)) {
> - if (s->failed < 2)
> - s->failed_num[s->failed] = i;
> - s->failed++;
> - }
> - }
> - spin_unlock_irq(&conf->device_lock);
> - rcu_read_unlock();
> -
> - return 0;
> -}
> -
> -static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> +static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> {
> raid5_conf_t *conf = sh->raid_conf;
> int disks = sh->disks;
> @@ -2988,11 +2907,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> int i;
>
> /* Now to look around and see what can be done */
> -
> rcu_read_lock();
> spin_lock_irq(&conf->device_lock);
> for (i=disks; i--; ) {
> mdk_rdev_t *rdev;
> +
> dev = &sh->dev[i];
>
> pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
> @@ -3016,9 +2935,9 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> BUG_ON(s->compute > 2);
> }
>
> - if (test_bit(R5_Wantfill, &dev->flags)) {
> + if (test_bit(R5_Wantfill, &dev->flags))
> s->to_fill++;
> - } else if (dev->toread)
> + else if (dev->toread)
> s->to_read++;
> if (dev->towrite) {
> s->to_write++;
> @@ -3058,14 +2977,11 @@ static int handle_stripe6(struct stripe_head *sh, struct stripe_head_state *s)
> }
> spin_unlock_irq(&conf->device_lock);
> rcu_read_unlock();
> -
> - return 0;
> }
>
> static void handle_stripe(struct stripe_head *sh)
> {
> struct stripe_head_state s;
> - int done;
> raid5_conf_t *conf = sh->raid_conf;
> int i;
> int prexor;
> @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
> s.failed_num[0] = -1;
> s.failed_num[1] = -1;
There is code that sets/initializes some fields of stripe_head_state
outside of analyse_stripe(), and it would be better to move it into
that function, IMHO.
Thanks.
>
> - if (conf->level == 6)
> - done = handle_stripe6(sh, &s);
> - else
> - done = handle_stripe5(sh, &s);
> -
> - if (done)
> - goto finish;
> + analyse_stripe(sh, &s);
>
> if (unlikely(s.blocked_rdev)) {
> if (s.syncing || s.expanding || s.expanded ||
>
>
* Re: [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state
2011-07-22 5:31 ` Namhyung Kim
@ 2011-07-26 1:35 ` NeilBrown
0 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-26 1:35 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 14:31:51 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > Adding these three fields will allow more common code to be moved
> > to handle_stripe()
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> and nitpick below.
>
>
> > ---
> >
> > drivers/md/raid5.c | 54 +++++++++++++++++++++++-----------------------------
> > drivers/md/raid5.h | 4 ++++
> > 2 files changed, 28 insertions(+), 30 deletions(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c32ffb5..3327e82 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -3003,12 +3003,9 @@ static void handle_stripe5(struct stripe_head *sh)
> > {
> > raid5_conf_t *conf = sh->raid_conf;
> > int disks = sh->disks, i;
> > - struct bio *return_bi = NULL;
> > struct stripe_head_state s;
> > struct r5dev *dev;
> > - mdk_rdev_t *blocked_rdev = NULL;
> > int prexor;
> > - int dec_preread_active = 0;
> >
> > memset(&s, 0, sizeof(s));
> > pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d check:%d "
> > @@ -3058,9 +3055,9 @@ static void handle_stripe5(struct stripe_head *sh)
> > if (dev->written)
> > s.written++;
> > rdev = rcu_dereference(conf->disks[i].rdev);
> > - if (blocked_rdev == NULL &&
> > + if (s.blocked_rdev == NULL &&
> > rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> > - blocked_rdev = rdev;
> > + s.blocked_rdev = rdev;
> > atomic_inc(&rdev->nr_pending);
> > }
> > clear_bit(R5_Insync, &dev->flags);
> > @@ -3088,15 +3085,15 @@ static void handle_stripe5(struct stripe_head *sh)
> > spin_unlock_irq(&conf->device_lock);
> > rcu_read_unlock();
> >
> > - if (unlikely(blocked_rdev)) {
> > + if (unlikely(s.blocked_rdev)) {
> > if (s.syncing || s.expanding || s.expanded ||
> > s.to_write || s.written) {
> > set_bit(STRIPE_HANDLE, &sh->state);
> > goto unlock;
> > }
> > /* There is nothing for the blocked_rdev to block */
> > - rdev_dec_pending(blocked_rdev, conf->mddev);
> > - blocked_rdev = NULL;
> > + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> > + s.blocked_rdev = NULL;
> > }
> >
> > if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> > @@ -3112,7 +3109,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > * need to be failed
> > */
> > if (s.failed > 1 && s.to_read+s.to_write+s.written)
> > - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> > + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> > if (s.failed > 1 && s.syncing) {
> > md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> > clear_bit(STRIPE_SYNCING, &sh->state);
> > @@ -3128,7 +3125,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > !test_bit(R5_LOCKED, &dev->flags) &&
> > test_bit(R5_UPTODATE, &dev->flags)) ||
> > (s.failed == 1 && s.failed_num[0] == sh->pd_idx)))
> > - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> > + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> >
> > /* Now we might consider reading some blocks, either to check/generate
> > * parity, or to satisfy requests
> > @@ -3166,7 +3163,7 @@ static void handle_stripe5(struct stripe_head *sh)
> > }
> > }
> > if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > - dec_preread_active = 1;
> > + s.dec_preread_active = 1;
> > }
> >
> > /* Now to consider new write requests and what else, if anything
> > @@ -3264,15 +3261,15 @@ static void handle_stripe5(struct stripe_head *sh)
> > unlock:
> >
> > /* wait for this device to become unblocked */
> > - if (unlikely(blocked_rdev))
> > - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> > + if (unlikely(s.blocked_rdev))
> > + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
> >
> > if (s.ops_request)
> > raid_run_ops(sh, s.ops_request);
> >
> > ops_run_io(sh, &s);
> >
> > - if (dec_preread_active) {
> > + if (s.dec_preread_active) {
> > /* We delay this until after ops_run_io so that if make_request
> > * is waiting on a flush, it won't continue until the writes
> > * have actually been submitted.
> > @@ -3282,19 +3279,16 @@ static void handle_stripe5(struct stripe_head *sh)
> > IO_THRESHOLD)
> > md_wakeup_thread(conf->mddev->thread);
> > }
> > - return_io(return_bi);
> > + return_io(s.return_bi);
> > }
> >
> > static void handle_stripe6(struct stripe_head *sh)
> > {
> > raid5_conf_t *conf = sh->raid_conf;
> > int disks = sh->disks;
> > - struct bio *return_bi = NULL;
> > int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx;
> > struct stripe_head_state s;
> > struct r5dev *dev, *pdev, *qdev;
> > - mdk_rdev_t *blocked_rdev = NULL;
> > - int dec_preread_active = 0;
> >
> > pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
> > "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
> > @@ -3345,9 +3339,9 @@ static void handle_stripe6(struct stripe_head *sh)
> > if (dev->written)
> > s.written++;
> > rdev = rcu_dereference(conf->disks[i].rdev);
> > - if (blocked_rdev == NULL &&
> > + if (s.blocked_rdev == NULL &&
> > rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> > - blocked_rdev = rdev;
> > + s.blocked_rdev = rdev;
> > atomic_inc(&rdev->nr_pending);
> > }
> > clear_bit(R5_Insync, &dev->flags);
> > @@ -3376,15 +3370,15 @@ static void handle_stripe6(struct stripe_head *sh)
> > spin_unlock_irq(&conf->device_lock);
> > rcu_read_unlock();
> >
> > - if (unlikely(blocked_rdev)) {
> > + if (unlikely(s.blocked_rdev)) {
> > if (s.syncing || s.expanding || s.expanded ||
> > s.to_write || s.written) {
> > set_bit(STRIPE_HANDLE, &sh->state);
> > goto unlock;
> > }
> > /* There is nothing for the blocked_rdev to block */
> > - rdev_dec_pending(blocked_rdev, conf->mddev);
> > - blocked_rdev = NULL;
> > + rdev_dec_pending(s.blocked_rdev, conf->mddev);
> > + s.blocked_rdev = NULL;
> > }
> >
> > if (s.to_fill && !test_bit(STRIPE_BIOFILL_RUN, &sh->state)) {
> > @@ -3400,7 +3394,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > * might need to be failed
> > */
> > if (s.failed > 2 && s.to_read+s.to_write+s.written)
> > - handle_failed_stripe(conf, sh, &s, disks, &return_bi);
> > + handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
> > if (s.failed > 2 && s.syncing) {
> > md_done_sync(conf->mddev, STRIPE_SECTORS,0);
> > clear_bit(STRIPE_SYNCING, &sh->state);
> > @@ -3425,7 +3419,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > (s.q_failed || ((test_bit(R5_Insync, &qdev->flags)
> > && !test_bit(R5_LOCKED, &qdev->flags)
> > && test_bit(R5_UPTODATE, &qdev->flags)))))
> > - handle_stripe_clean_event(conf, sh, disks, &return_bi);
> > + handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
> >
> > /* Now we might consider reading some blocks, either to check/generate
> > * parity, or to satisfy requests
> > @@ -3461,7 +3455,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > }
> > }
> > if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > - dec_preread_active = 1;
> > + s.dec_preread_active = 1;
> > }
> >
> > /* Now to consider new write requests and what else, if anything
> > @@ -3561,8 +3555,8 @@ static void handle_stripe6(struct stripe_head *sh)
> > unlock:
> >
> > /* wait for this device to become unblocked */
> > - if (unlikely(blocked_rdev))
> > - md_wait_for_blocked_rdev(blocked_rdev, conf->mddev);
> > + if (unlikely(s.blocked_rdev))
> > + md_wait_for_blocked_rdev(s.blocked_rdev, conf->mddev);
> >
> > if (s.ops_request)
> > raid_run_ops(sh, s.ops_request);
> > @@ -3570,7 +3564,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > ops_run_io(sh, &s);
> >
> >
> > - if (dec_preread_active) {
> > + if (s.dec_preread_active) {
> > /* We delay this until after ops_run_io so that if make_request
> > * is waiting on a flush, it won't continue until the writes
> > * have actually been submitted.
> > @@ -3581,7 +3575,7 @@ static void handle_stripe6(struct stripe_head *sh)
> > md_wakeup_thread(conf->mddev->thread);
> > }
> >
> > - return_io(return_bi);
> > + return_io(s.return_bi);
> > }
> >
> > static void handle_stripe(struct stripe_head *sh)
> > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> > index d3c61d3..9ceb574 100644
> > --- a/drivers/md/raid5.h
> > +++ b/drivers/md/raid5.h
> > @@ -248,6 +248,10 @@ struct stripe_head_state {
> > int failed_num[2];
> > unsigned long ops_request;
> > int p_failed, q_failed;
> > +
> > + struct bio *return_bi;
> > + mdk_rdev_t *blocked_rdev;
> > + int dec_preread_active;
> > };
>
> I'd rather rearrange the struct to reduce padding on 64-bit:
>
> /* stripe_head_state - collects and tracks the dynamic state of a stripe_head
> * for handle_stripe. It is only valid under spin_lock(sh->lock);
> */
> struct stripe_head_state {
> int syncing, expanding, expanded;
> int locked, uptodate, to_read, to_write, failed, written;
> int to_fill, compute, req_compute, non_overwrite;
> int failed_num[2];
> int p_failed, q_failed;
> int dec_preread_active;
> unsigned long ops_request;
>
> struct bio *return_bi;
> mdk_rdev_t *blocked_rdev;
> };
I've made that change, thanks.
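(As an aside on the padding point: on LP64 targets a pointer or unsigned long needs 8-byte alignment, so interleaving 4-byte ints with 8-byte members leaves holes. A contrived illustration, not the md structures themselves:

    struct mixed   { int a; unsigned long b; int c; };   /* 4 + 4 hole + 8 + 4 + 4 tail = 24 bytes */
    struct grouped { int a; int c; unsigned long b; };   /* 4 + 4 + 8                   = 16 bytes */

Grouping the ints ahead of the 8-byte members, as suggested above, avoids the holes.)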
>
> BTW, the comment above mentions sh->lock, which disappeared in a previous
> patch. It should be fixed as well (in patch 08/34?).
True ... and there are lots of other mentions made of the stripe lock in all
those comments.
I have fixed some of them up but I really need to review all of that
commentary.
Thanks,
NeilBrown
>
> Thanks.
>
>
> >
> > /* Flags */
> >
> >
* Re: [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction into handle_stripe.
2011-07-22 7:09 ` Namhyung Kim
@ 2011-07-26 1:44 ` NeilBrown
0 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-26 1:44 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 16:09:10 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > Prior to commit ab69ae12ceef7 the code in handle_stripe5 and
> > handle_stripe6 to "Finish reconstruct operations initiated by the
> > expansion process" was identical.
> > That commit added an identical stanza of code to each function, but in
> > different places. That was careless.
> >
> > The raid5 code was correct, so move that out into handle_stripe and
> > remove raid6 version.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> and nitpick below..
>
>
> > static void handle_stripe(struct stripe_head *sh)
> > {
> > struct stripe_head_state s;
> > + int done;
> > raid5_conf_t *conf = sh->raid_conf;
> > + int i;
>
> Can be declared on one line.
>
True.... There are differing opinions on whether that is a good idea or not.
I tend to like to keep each variable in its own declaration, though closely
related variables might get put together.
I have moved 'int i' up one line so the addition is in just one place, but
I've kept it separate from 'done'.
> > + /* Finish reconstruct operations initiated by the expansion process */
> > + if (sh->reconstruct_state == reconstruct_state_result) {
> > + struct stripe_head *sh2
> > + = get_active_stripe(conf, sh->sector, 1, 1, 1);
>
> Wouldn't it be better to use a more descriptive name rather than sh2,
> something like sh_prev, sh_src or whatever?
good point. 'sh_src' it is.
Thanks,
NeilBrown
* Re: [md PATCH 14/34] md/raid5: move more code into common handle_stripe
2011-07-22 7:32 ` Namhyung Kim
@ 2011-07-26 1:48 ` NeilBrown
0 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-26 1:48 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 16:32:03 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > The difference between the RAID5 and RAID6 code here is easily
> > resolved using conf->max_degraded.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> and a coding style issue.
>
> > + if (s.failed <= conf->max_degraded && !conf->mddev->ro)
> > + for (i = 0; i < s.failed; i++) {
> > + struct r5dev *dev = &sh->dev[s.failed_num[i]];
> > + if (test_bit(R5_ReadError, &dev->flags)
> > + && !test_bit(R5_LOCKED, &dev->flags)
> > + && test_bit(R5_UPTODATE, &dev->flags)
> > + ) {
>
> This line looks weird and should be combined with the previous line.
>
> And I slightly prefer that the logical operators be put on the
> same line as the previous conditional, so that the next one starts in
> the same column and is more readable, but ...
>
I agree that it is nice if they line up. But I also think the "&&" is
important and want it to stand out. In this case the '!' would stop them
from lining up very nicely, so I'll leave it as it is.
The thing I like about having ") {" on a separate line is that I can later
add a condition by just adding a line - it makes the patch easier to read.
Not a big thing maybe, and I don't always structure conditions like this, but
I think I'll leave it this time.
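(A contrived fragment, not a hunk from this series, showing the diff argument: with the closing ") {" on its own line,

    if (test_bit(R5_ReadError, &dev->flags)
        && !test_bit(R5_LOCKED, &dev->flags)
        ) {

a later patch that adds another "&& ..." condition is a one-line addition, rather than also having to rewrite whichever line carries the closing ") {".)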
Thanks,
NeilBrown
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-22 9:10 ` Namhyung Kim
@ 2011-07-26 1:52 ` NeilBrown
2011-07-26 2:41 ` H. Peter Anvin
2011-07-26 9:40 ` David Brown
0 siblings, 2 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-26 1:52 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid, H. Peter Anvin
On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
> > also allow 'read-modify-write'
> > Apart from this difference, handle_stripe_dirtying[56] are nearly
> > identical. So resolve these differences and create just one function.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> BTW, here is a question:
> Why doesn't RAID6 allow read-modify-write? I don't think it is
> impossible, so what prevents doing that? Performance? Complexity?
> Or is it really not possible? Why? :)
The primary reason in my mind is that when Peter Anvin wrote this code he
didn't implement read-modify-write and I have never seen a need to consider
changing that.
You would need a largish array - at least 7 devices - before RMW could
possibly be a win, but some people do have arrays larger than that.
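(Counting I/Os for a single changed data block makes the threshold visible, ignoring CPU cost and caching: RMW reads the old data block plus P and Q and writes all three back, so 3 reads + 3 writes = 6 I/Os whatever the array size; reconstruct-write instead reads the other n - 3 data blocks and writes the new data block plus P and Q, so (n - 3) + 3 = n I/Os on an n-device array. RMW only comes out ahead once n exceeds 6, hence the 7-device figure.)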
The computation to "subtract" from the Q-syndrome might be a bit complex - I
don't know.
Peter: maybe you have a reason that I haven't mentioned?
Thanks,
NeilBrown
* Re: [md PATCH 19/34] md/raid5: move some more common code into handle_stripe
2011-07-22 9:29 ` Namhyung Kim
@ 2011-07-26 1:59 ` NeilBrown
0 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-07-26 1:59 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 18:29:16 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > The RAID6 version of this code is usable for RAID5 providing:
> > - we test "conf->max_degraded" rather than "2" as appropriate
> > - we make sure s->failed_num[1] is meaningful (and not '-1')
> > when s->failed > 1
> >
> > The 'return 1' must become 'goto finish' in the new location.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>
> and nitpick below.
>
>
> > + pr_debug("locked=%d uptodate=%d to_read=%d"
> > + " to_write=%d failed=%d failed_num=%d,%d\n",
> > + s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
> > + s.failed_num[0], s.failed_num[1]);
> > + /* check if the array has lost >2 devices and, if so, some requests
> > + * might need to be failed
> /* check if the array has lost more than max_degraded devices and,
> if so, some requests might need to be failed
>
> Thanks.
>
>
OK, I made that change - thanks.
NeilBrown
* Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.
2011-07-22 9:36 ` Namhyung Kim
@ 2011-07-26 2:02 ` NeilBrown
2011-07-26 4:50 ` Namhyung Kim
0 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-07-26 2:02 UTC (permalink / raw)
To: Namhyung Kim; +Cc: linux-raid
On Fri, 22 Jul 2011 18:36:13 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
> NeilBrown <neilb@suse.de> writes:
>
> > handle_stripe5() and handle_stripe6() are now virtually identical.
> > So discard one and rename the other to 'analyse_stripe()'.
> >
> > It always returns 0, so change it to 'void' and remove the 'done'
> > variable in handle_stripe().
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > static void handle_stripe(struct stripe_head *sh)
> > {
> > struct stripe_head_state s;
> > - int done;
> > raid5_conf_t *conf = sh->raid_conf;
> > int i;
> > int prexor;
> > @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
> > s.failed_num[0] = -1;
> > s.failed_num[1] = -1;
>
> There is code that sets/initializes some fields of stripe_head_state
> outside of analyse_stripe(), and it would be better to move it into
> that function, IMHO.
>
> Thanks.
Thanks. I have merged this change.
NeilBrown
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ed6729d..b321d6c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2905,6 +2905,14 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
struct r5dev *dev;
int i;
+ memset(s, 0, sizeof(*s));
+
+ s->syncing = test_bit(STRIPE_SYNCING, &sh->state);
+ s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+ s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+ s->failed_num[0] = -1;
+ s->failed_num[1] = -1;
+
/* Now to look around and see what can be done */
rcu_read_lock();
spin_lock_irq(&conf->device_lock);
@@ -3006,13 +3014,6 @@ static void handle_stripe(struct stripe_head *sh)
(unsigned long long)sh->sector, sh->state,
atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
sh->check_state, sh->reconstruct_state);
- memset(&s, 0, sizeof(s));
-
- s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
- s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
- s.failed_num[0] = -1;
- s.failed_num[1] = -1;
analyse_stripe(sh, &s);
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-26 1:52 ` NeilBrown
@ 2011-07-26 2:41 ` H. Peter Anvin
2011-07-26 9:40 ` David Brown
1 sibling, 0 replies; 71+ messages in thread
From: H. Peter Anvin @ 2011-07-26 2:41 UTC (permalink / raw)
To: NeilBrown, Namhyung Kim; +Cc: linux-raid
NeilBrown <neilb@suse.de> wrote:
>On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim <namhyung@gmail.com>
>wrote:
>
>> NeilBrown <neilb@suse.de> writes:
>>
>> > RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>> > also allow 'read-modify-write'
>> > Apart from this difference, handle_stripe_dirtying[56] are nearly
>> > identical. So resolve these differences and create just one
>function.
>> >
>> > Signed-off-by: NeilBrown <neilb@suse.de>
>>
>> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
>>
>> BTW, here is a question:
>> Why doesn't RAID6 allow read-modify-write? I don't think it is
>> impossible, so what prevents doing that? Performance? Complexity?
>> Or is it really not possible? Why? :)
>
>
>The primary reason in my mind is that when Peter Anvin wrote this code
>he
>didn't implement read-modify-write and I have never seen a need to
>consider
>changing that.
>
>You would need a largish array - at least 7 devices - before RMW could
>possibly be a win, but some people do have arrays larger than that.
>
>The computation to "subtract" from the Q-syndrome might be a bit
>complex - I
>don't know.
>
>Peter: maybe you have a reason that I haven't mentioned?
>
>Thanks,
>NeilBrown
When I measured it, it wasn't a win on my system, and it created a lot of extra complexity. It probably would be a win with SSSE3 and lots of drives.
--
Sent from my mobile phone. Please excuse my brevity and lack of formatting.
* Re: [md PATCH 20/34] md/raid5: finalise new merged handle_stripe.
2011-07-26 2:02 ` NeilBrown
@ 2011-07-26 4:50 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-26 4:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
2011-07-26 (Tue), 12:02 +1000, NeilBrown:
> On Fri, 22 Jul 2011 18:36:13 +0900 Namhyung Kim <namhyung@gmail.com> wrote:
>
> > NeilBrown <neilb@suse.de> writes:
> >
> > > handle_stripe5() and handle_stripe6() are now virtually identical.
> > > So discard one and rename the other to 'analyse_stripe()'.
> > >
> > > It always returns 0, so change it to 'void' and remove the 'done'
> > > variable in handle_stripe().
> > >
> > > Signed-off-by: NeilBrown <neilb@suse.de>
>
>
> > > static void handle_stripe(struct stripe_head *sh)
> > > {
> > > struct stripe_head_state s;
> > > - int done;
> > > raid5_conf_t *conf = sh->raid_conf;
> > > int i;
> > > int prexor;
> > > @@ -3099,13 +3015,7 @@ static void handle_stripe(struct stripe_head *sh)
> > > s.failed_num[0] = -1;
> > > s.failed_num[1] = -1;
> >
> > There is code that sets/initializes some fields of stripe_head_state
> > outside of analyse_stripe(), and it would be better to move it into
> > that function, IMHO.
> >
> > Thanks.
>
> Thanks. I have merged this change.
>
> NeilBrown
>
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index ed6729d..b321d6c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2905,6 +2905,14 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
> struct r5dev *dev;
> int i;
>
> + memset(s, 0, sizeof(*s));
> +
> + s->syncing = test_bit(STRIPE_SYNCING, &sh->state);
> + s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> + s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
> + s->failed_num[0] = -1;
> + s->failed_num[1] = -1;
> +
> /* Now to look around and see what can be done */
> rcu_read_lock();
> spin_lock_irq(&conf->device_lock);
> @@ -3006,13 +3014,6 @@ static void handle_stripe(struct stripe_head *sh)
> (unsigned long long)sh->sector, sh->state,
> atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
> sh->check_state, sh->reconstruct_state);
> - memset(&s, 0, sizeof(s));
> -
> - s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
> - s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
> - s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
> - s.failed_num[0] = -1;
> - s.failed_num[1] = -1;
>
> analyse_stripe(sh, &s);
>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Looks great to me, thanks.
--
Regards,
Namhyung Kim
* Re: [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state
2011-07-22 5:53 ` Namhyung Kim
@ 2011-07-26 6:44 ` Namhyung Kim
0 siblings, 0 replies; 71+ messages in thread
From: Namhyung Kim @ 2011-07-26 6:44 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
2011-07-22 (Fri), 14:54 +0900, Namhyung Kim:
2011-07-22 (Fri), 15:37 +1000, NeilBrown:
> From 74721b68b4cf8d4845aa9efa3daf99247a611b70 Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.de>
> > Date: Fri, 22 Jul 2011 15:29:38 +1000
> > Subject: [PATCH] md/raid5: Remove stripe_head_state arg from handle_stripe_expansion.
> >
> > This arg is only used to differentiate between RAID5 and RAID6 but
> > that is not needed. For RAID5, raid5_compute_sector will set qd_idx
> > to "~0" so j with certainly not equals qd_idx, so there is no need
> > for a guard on that condition.
> >
> > So remove the guard and remove the arg from the declaration and
> > callers of handle_stripe_expansion.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
>
> Looks good to me, thanks!
>
And please feel free to add this in case you need it :)
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
--
Regards,
Namhyung Kim
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-26 1:52 ` NeilBrown
2011-07-26 2:41 ` H. Peter Anvin
@ 2011-07-26 9:40 ` David Brown
2011-07-26 13:23 ` Namhyung Kim
1 sibling, 1 reply; 71+ messages in thread
From: David Brown @ 2011-07-26 9:40 UTC (permalink / raw)
To: linux-raid
On 26/07/11 03:52, NeilBrown wrote:
> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim<namhyung@gmail.com> wrote:
>
>> NeilBrown<neilb@suse.de> writes:
>>
>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>> also allow 'read-modify-write'
>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>> identical. So resolve these differences and create just one function.
>>>
>>> Signed-off-by: NeilBrown<neilb@suse.de>
>>
>> Reviewed-by: Namhyung Kim<namhyung@gmail.com>
>>
>> BTW, here is a question:
>> Why doesn't RAID6 allow read-modify-write? I don't think it is
>> impossible, so what prevents doing that? Performance? Complexity?
>> Or is it really not possible? Why? :)
>
>
> The primary reason in my mind is that when Peter Anvin wrote this code he
> didn't implement read-modify-write and I have never seen a need to consider
> changing that.
>
> You would need a largish array - at least 7 devices - before RMW could
> possibly be a win, but some people do have arrays larger than that.
>
> The computation to "subtract" from the Q-syndrome might be a bit complex - I
> don't know.
>
The "subtract from Q" isn't difficult in theory, but it involves more
cpu time than the "subtract from P". It should be an overall win in the
same way as for RAID5. However, the code would be more complex if you
allow for RMW on more than one block.
With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
parity block P, and you want to change Di_old to Di_new, you can do this:
Read Di_old and P_old
Calculate P_new = P_old xor Di_old xor Di_new
Write Di_new and P_new
You can easily extend this to changing several data blocks without
reading in the whole old stripe. I don't know if the Linux raid5
implementation does that or not (I understand the theory here, but I am
far from an expert on the implementation). There is no doubt a balance
between the speed gains of multiple-block RMW and the code complexity.
As long as you are changing less than half the blocks in the stripe, the
cpu time will be less than it would for whole stripe write. For RMW
writes that are more than half a stripe, the "best" choice depends on
the balance between cpu time and disk bandwidth - a balance that is very
different on today's servers compared to those when Linux raid5 was
first written.
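A byte-at-a-time sketch of that single-block RAID5 update in C (purely illustrative: the real driver works on whole pages and uses optimised xor routines, but the arithmetic is just this):

    #include <stddef.h>

    /* P_new = P_old xor Di_old xor Di_new, applied byte by byte over one block. */
    static void raid5_rmw_parity(unsigned char *p,
                                 const unsigned char *d_old,
                                 const unsigned char *d_new,
                                 size_t len)
    {
            size_t k;

            for (k = 0; k < len; k++)
                    p[k] ^= d_old[k] ^ d_new[k];
    }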
With raid6, the procedure is similar:
Read Di_old, P_old and Q_old.
Calculate P_new = P_old xor Di_old xor Di_new
Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
= Q_old xor (2^i . (Di_old xor Di_new))
Write Di_new, P_new and Q_new
The difference is simply that (Di_old xor Di_new) needs to be multiplied
(over the GF field - not normal multiplication) by 2^i. You would do
this by repeatedly applying the multiply-by-two function that already
exists in the raid6 implementation.
I don't know whether or not it is worth using the common subexpression
(Di_old xor Di_new), which turns up twice.
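A matching sketch of the single-block RAID6 update, assuming the usual RAID-6 field GF(2^8) with the 0x11d generator polynomial, so "multiply by 2" is a left shift plus a conditional xor with 0x1d. This only illustrates the arithmetic, not how md would actually organise the computation:

    #include <stddef.h>
    #include <stdint.h>

    /* Multiply by 2 in GF(2^8) with the RAID-6 polynomial 0x11d. */
    static uint8_t gf_mul2(uint8_t v)
    {
            return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
    }

    /* Multiply by 2^e by doubling e times. */
    static uint8_t gf_mul2exp(uint8_t v, unsigned int e)
    {
            while (e--)
                    v = gf_mul2(v);
            return v;
    }

    /* RMW update of P and Q when data block number i changes. */
    static void raid6_rmw_pq(unsigned char *p, unsigned char *q,
                             const unsigned char *d_old,
                             const unsigned char *d_new,
                             unsigned int i, size_t len)
    {
            size_t k;

            for (k = 0; k < len; k++) {
                    uint8_t delta = d_old[k] ^ d_new[k];  /* Di_old xor Di_new */

                    p[k] ^= delta;                        /* P_new = P_old xor delta */
                    q[k] ^= gf_mul2exp(delta, i);         /* Q_new = Q_old xor 2^i . delta */
            }
    }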
Multi-block raid6 RMW is similar, but you have to keep track of how many
times you should multiply by 2. For a general case, the code would
probably be too messy - it's easier to simply handle the whole stripe.
But the case of consecutive blocks is easier, and likely to be far more
common in practice. If you want to change blocks Di through Dj, you can do:
Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
Calculate P_new = P_old xor Di_old xor Di_new
                        xor D(i+1)_old xor D(i+1)_new
                        ...
                        xor Dj_old xor Dj_new

Calculate Q_new = Q_old xor (2^i     . (Di_old xor Di_new))
                        xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
                        ...
                        xor (2^j     . (Dj_old xor Dj_new))

                = Q_old xor 2^i . (    (Di_old xor Di_new)
                                   xor 2^1 . (D(i+1)_old xor D(i+1)_new)
                                   xor 2^2 . (D(i+2)_old xor D(i+2)_new)
                                   ...
                                   xor 2^(j-i) . (Dj_old xor Dj_new) )
Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
The algorithm above looks a little messy (ASCII emails are not the best
medium for mathematics), but it's not hard to see the pattern, and the
loops needed. It should also be possible to merge such routines with
the main raid6 parity calculation functions.
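Reusing gf_mul2()/gf_mul2exp() from the sketch above, a rough shape for the consecutive-block case, where deltas[b] is assumed to already hold (D(i+b)_old xor D(i+b)_new) for the changed blocks i..j: walking from the highest changed block downwards and doubling once per step builds the nested expression, so only the final accumulator needs multiplying by 2^i:

    /* RMW update of P and Q for nblocks consecutive data blocks starting at block i. */
    static void raid6_rmw_pq_run(unsigned char *p, unsigned char *q,
                                 unsigned char *const deltas[],
                                 unsigned int i, unsigned int nblocks,
                                 size_t len)
    {
            size_t k;

            for (k = 0; k < len; k++) {
                    unsigned int b = nblocks;
                    uint8_t acc = 0, pxor = 0;

                    /* Horner form: delta_i xor 2.(delta_(i+1) xor 2.(... xor 2.delta_j)) */
                    while (b--) {
                            acc = gf_mul2(acc) ^ deltas[b][k];
                            pxor ^= deltas[b][k];
                    }
                    p[k] ^= pxor;                  /* P: plain xor of all the deltas */
                    q[k] ^= gf_mul2exp(acc, i);    /* Q: one multiply by 2^i at the end */
            }
    }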
mvh.,
David
> Peter: maybe you have a reason that I haven't mentioned?
>
> Thanks,
> NeilBrown
>
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-26 9:40 ` David Brown
@ 2011-07-26 13:23 ` Namhyung Kim
2011-07-26 15:01 ` David Brown
0 siblings, 1 reply; 71+ messages in thread
From: Namhyung Kim @ 2011-07-26 13:23 UTC (permalink / raw)
To: David Brown; +Cc: linux-raid
David Brown <david.brown@hesbynett.no> writes:
> On 26/07/11 03:52, NeilBrown wrote:
>> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim<namhyung@gmail.com> wrote:
>>
>>> NeilBrown<neilb@suse.de> writes:
>>>
>>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>>> also allow 'read-modify-write'
>>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>>> identical. So resolve these differences and create just one function.
>>>>
>>>> Signed-off-by: NeilBrown<neilb@suse.de>
>>>
>>> Reviewed-by: Namhyung Kim<namhyung@gmail.com>
>>>
>>> BTW, here is a question:
>>> Why doesn't RAID6 allow read-modify-write? I don't think it is
>>> impossible, so what prevents doing that? Performance? Complexity?
>>> Or is it really not possible? Why? :)
>>
>>
>> The primary reason in my mind is that when Peter Anvin wrote this code he
>> didn't implement read-modify-write and I have never seen a need to consider
>> changing that.
>>
>> You would need a largish array - at least 7 devices - before RMW could
>> possibly be a win, but some people do have arrays larger than that.
>>
>> The computation to "subtract" from the Q-syndrome might be a bit complex - I
>> don't know.
>>
>
> The "subtract from Q" isn't difficult in theory, but it involves more
> cpu time than the "subtract from P". It should be an overall win in
> the same way as for RAID5. However, the code would be more complex if
> you allow for RMW on more than one block.
>
> With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
> parity block P, and you want to change Di_old to Di_new, you can do
> this:
>
> Read Di_old and P_old
> Calculate P_new = P_old xor Di_old xor Di_new
> Write Di_new and P_new
>
> You can easily extend this to changing several data blocks without
> reading in the whole old stripe. I don't know if the Linux raid5
> implementation does that or not (I understand the theory here, but I
> am far from an expert on the implementation). There is no doubt a
> balance between the speed gains of multiple-block RMW and the code
> complexity. As long as you are changing less than half the blocks in
> the stripe, the cpu time will be less than it would for whole stripe
> write. For RMW writes that are more than half a stripe, the "best"
> choice depends on the balance between cpu time and disk bandwidth - a
> balance that is very different on today's servers compared to those
> when Linux raid5 was first written.
>
>
>
> With raid6, the procedure is similar:
>
> Read Di_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old xor Di_new
> Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
> = Q_old xor (2^i . (Di_old xor Di_new))
> Write Di_new, P_new and Q_new
>
> The difference is simply that (Di_old xor Di_new) needs to be
> multiplied (over the GF field - not normal multiplication) by 2^i.
> You would do this by repeatedly applying the multiply-by-two function
> that already exists in the raid6 implementation.
>
> I don't know whether or not it is worth using the common subexpression
> (Di_old xor Di_new), which turns up twice.
>
>
>
> Multi-block raid6 RMW is similar, but you have to keep track of how
> many times you should multiply by 2. For a general case, the code
> would probably be too messy - it's easier to simply handle the whole
> stripe. But the case of consecutive blocks is easier, and likely to be
> far more common in practice. If you want to change blocks Di through
> Dj, you can do:
>
> Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old     xor Di_new
>                         xor D(i+1)_old xor D(i+1)_new
>                         ...
>                         xor Dj_old     xor Dj_new
>
>
> Calculate Q_new = Q_old xor (2^i     . (Di_old     xor Di_new))
>                         xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
>                         ...
>                         xor (2^j     . (Dj_old     xor Dj_new))
>                 = Q_old xor 2^i . (    (Di_old     xor Di_new)
>                                    xor (2^1     . (D(i+1)_old xor D(i+1)_new))
>                                    xor (2^2     . (D(i+2)_old xor D(i+2)_new))
>                                    ...
>                                    xor (2^(j-i) . (Dj_old     xor Dj_new))
>                                   )
>
> Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
>
>
> The algorithm above looks a little messy (ASCII emails are not the
> best medium for mathematics), but it's not hard to see the pattern,
> and the loops needed. It should also be possible to merge such
> routines with the main raid6 parity calculation functions.
>
>
> mvh.,
>
> David
>
Thanks a lot for your detailed explanation, David!
I think it'd be good if we implemented RMW for small (e.g. single-disk)
writes on RAID6 too.
--
Regards,
Namhyung Kim
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
2011-07-26 13:23 ` Namhyung Kim
@ 2011-07-26 15:01 ` David Brown
0 siblings, 0 replies; 71+ messages in thread
From: David Brown @ 2011-07-26 15:01 UTC (permalink / raw)
To: linux-raid
On 26/07/11 15:23, Namhyung Kim wrote:
> David Brown<david.brown@hesbynett.no> writes:
>
>> On 26/07/11 03:52, NeilBrown wrote:
>>> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim<namhyung@gmail.com> wrote:
>>>
>>>> NeilBrown<neilb@suse.de> writes:
>>>>
>>>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>>>> also allow 'read-modify-write'
>>>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>>>> identical. So resolve these differences and create just one function.
>>>>>
>>>>> Signed-off-by: NeilBrown<neilb@suse.de>
>>>>
>>>> Reviewed-by: Namhyung Kim<namhyung@gmail.com>
>>>>
>>>> BTW, here is a question:
>>>> Why RAID6 doesn't allow the read-modify-write? I don't think it is
>>>> not possible, so what prevents doing that? performance? complexity?
>>>> or it's not possible? Why? :)
>>>
>>>
>>> The primary reason in my mind is that when Peter Anvin wrote this code he
>>> didn't implement read-modify-write and I have never seen a need to consider
>>> changing that.
>>>
>>> You would need a largish array - at least 7 devices - before RMW could
>>> possibly be a win, but some people do have arrays larger than that.
>>>
>>> The computation to "subtract" from the Q-syndrome might be a bit complex - I
>>> don't know.
>>>
>>
>> The "subtract from Q" isn't difficult in theory, but it involves more
>> cpu time than the "subtract from P". It should be an overall win in
>> the same way as for RAID5. However, the code would be more complex if
>> you allow for RMW on more than one block.
>>
>> With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
>> parity block P, and you want to change Di_old to Di_new, you can do
>> this:
>>
>> Read Di_old and P_old
>> Calculate P_new = P_old xor Di_old xor Di_new
>> Write Di_new and P_new
>>
>> You can easily extend this to changing several data blocks without
>> reading in the whole old stripe. I don't know if the Linux raid5
>> implementation does that or not (I understand the theory here, but I
>> am far from an expert on the implementation). There is no doubt a
>> balance between the speed gains of multiple-block RMW and the code
>> complexity. As long as you are changing less than half the blocks in
>> the stripe, the cpu time will be less than it would for whole stripe
>> write. For RMW writes that are more than half a stripe, the "best"
>> choice depends on the balance between cpu time and disk bandwidth - a
>> balance that is very different on today's servers compared to those
>> when Linux raid5 was first written.
>>
>>
>>
>> With raid6, the procedure is similar:
>>
>> Read Di_old, P_old and Q_old.
>> Calculate P_new = P_old xor Di_old xor Di_new
>> Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
>> = Q_old xor (2^i . (Di_old xor Di_new))
>> Write Di_new, P_new and Q_new
>>
>> The difference is simply that (Di_old xor Di_new) needs to be
>> multiplied (over the GF field - not normal multiplication) by 2^i.
>> You would do this by repeatedly applying the multiply-by-two function
>> that already exists in the raid6 implementation.
>>
>> I don't know whether or not it is worth using the common subexpression
>> (Di_old xor Di_new), which turns up twice.
>>
>>
>>
>> Multi-block raid6 RMW is similar, but you have to keep track of how
>> many times you should multiply by 2. For a general case, the code
>> would probably be too messy - it's easier to simply handle the whole
>> stripe. But the case of consecutive blocks is easier, and likely to be
>> far more common in practice. If you want to change blocks Di through
>> Dj, you can do:
>>
>> Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
>> Calculate P_new = P_old xor Di_old     xor Di_new
>>                         xor D(i+1)_old xor D(i+1)_new
>>                         ...
>>                         xor Dj_old     xor Dj_new
>>
>>
>> Calculate Q_new = Q_old xor (2^i     . (Di_old     xor Di_new))
>>                         xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
>>                         ...
>>                         xor (2^j     . (Dj_old     xor Dj_new))
>>                 = Q_old xor 2^i . (    (Di_old     xor Di_new)
>>                                    xor (2^1     . (D(i+1)_old xor D(i+1)_new))
>>                                    xor (2^2     . (D(i+2)_old xor D(i+2)_new))
>>                                    ...
>>                                    xor (2^(j-i) . (Dj_old     xor Dj_new))
>>                                   )
>>
>> Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
>>
>>
>> The algorithm above looks a little messy (ASCII emails are not the
>> best medium for mathematics), but it's not hard to see the pattern,
>> and the loops needed. It should also be possible to merge such
>> routines with the main raid6 parity calculation functions.
>>
>>
>> mvh.,
>>
>> David
>>
>
> Thanks a lot for your detailed explanation, David!
>
> I think it'd be good if we implement RMW for small (eg. single disk)
> write on RAID6 too.
>
Without having looked much at the details of the implementation, I
believe it should be practical to use RMW for single block writes on
RAID6.
I also think it should be reasonably practical to do consecutive RMW
blocks for both raid5 and raid6. But to do that efficiently would
probably mean generalising the existing raid5 and raid6 functions a
little rather than writing a new set of functions - I'm not sure how
much effort that would take within the existing structure of multiple
implementations that run at different speeds on different cpus.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-07-21 2:32 ` [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag NeilBrown
2011-07-22 4:27 ` Namhyung Kim
@ 2011-08-03 22:47 ` Dan Williams
2011-08-03 23:35 ` NeilBrown
1 sibling, 1 reply; 71+ messages in thread
From: Dan Williams @ 2011-08-03 22:47 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
> sh->lock is now mainly used to ensure that two threads aren't running
> in the locked part of handle_stripe[56] at the same time.
>
> That can more neatly be achieved with an 'active' flag which we set
> while running handle_stripe. If we find the flag is set, we simply
> requeue the stripe for later by setting STRIPE_HANDLE.
>
> For safety we take ->device_lock while examining the state of the
> stripe and creating a summary in 'stripe_head_state / r6_state'.
> This possibly isn't needed but as shared fields like ->toread,
> ->towrite are checked it is safer for now at least.
>
> We leave the label after the old 'unlock' called "unlock" because it
> will disappear in a few patches, so renaming seems pointless.
>
> This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> later, but that is not a problem.
>
This removal reminds me of one thing I have wondered about but to date
have not found the time to measure (maybe someone will beat me to it if
the idea is out there): what is the overhead of all the atomic
operations that raid5.c generates? If we can guarantee that certain
updates only happen under sh->lock (now STRIPE_ACTIVE), can we
downgrade set_bit and clear_bit to their non-atomic __set_bit and
__clear_bit versions and recover some cpu cycles?
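Just to illustrate the kind of downgrade I mean - a sketch, not a proposed
patch, using the existing R5_Wantwrite flag and struct r5dev from
drivers/md/raid5.h (the function names are made up):

    #include <linux/bitops.h>
    #include "raid5.h"

    /* Today: atomic read-modify-write (a LOCK-prefixed op on x86). */
    static void mark_wantwrite(struct r5dev *dev)
    {
        set_bit(R5_Wantwrite, &dev->flags);
    }

    /* Candidate: plain load/modify/store, valid only if nothing else
     * can touch dev->flags while the stripe is being handled. */
    static void mark_wantwrite_unlocked(struct r5dev *dev)
    {
        __set_bit(R5_Wantwrite, &dev->flags);
    }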
--
Dan
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-08-03 22:47 ` Dan Williams
@ 2011-08-03 23:35 ` NeilBrown
2011-08-03 23:45 ` Williams, Dan J
0 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2011-08-03 23:35 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid
On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams <dan.j.williams@intel.com>
wrote:
> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
> > sh->lock is now mainly used to ensure that two threads aren't running
> > in the locked part of handle_stripe[56] at the same time.
> >
> > That can more neatly be achieved with an 'active' flag which we set
> > while running handle_stripe. If we find the flag is set, we simply
> > requeue the stripe for later by setting STRIPE_HANDLE.
> >
> > For safety we take ->device_lock while examining the state of the
> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> > This possibly isn't needed but as shared fields like ->toread,
> > ->towrite are checked it is safer for now at least.
> >
> > We leave the label after the old 'unlock' called "unlock" because it
> > will disappear in a few patches, so renaming seems pointless.
> >
> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> > later, but that is not a problem.
> >
>
> This removal reminds me of one thing I have wondered about, but to
> date have not found the time to measure (maybe someone might beat me
> to it if the idea is out there), is what is the overhead of all the
> atomic operations that raid5.c generates? If we can guarantee that
> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can we
> downgrade set_bit and clear_bit to their non-atomic __set_bit and
> __clear_bit versions and recover some cpu cycles?
>
You can only use the unlocked version if you know that no other CPU will
change any of the bits in the 'unsigned long'. As STRIPE_HANDLE can be set
at any time, I think all accesses to sh->state must be atomic.
However 'struct stripe_head_state' is thread-local, so setting/clearing flags
in ops_request can probably benefit from non-atomic ops.
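In other words (sketch only, with the field and flag names from
drivers/md/raid5.h; example() is just for illustration):

    #include <linux/bitops.h>
    #include "raid5.h"

    static void example(struct stripe_head *sh, struct stripe_head_state *s)
    {
        /* sh->state is shared - STRIPE_HANDLE may be set from any
         * context - so this must stay atomic. */
        set_bit(STRIPE_HANDLE, &sh->state);

        /* 's' is on handle_stripe()'s stack, so nothing else can see
         * it and the plain bitop is sufficient. */
        __set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
    }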
NeilBrown
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-08-03 23:35 ` NeilBrown
@ 2011-08-03 23:45 ` Williams, Dan J
2011-08-04 0:18 ` NeilBrown
0 siblings, 1 reply; 71+ messages in thread
From: Williams, Dan J @ 2011-08-03 23:45 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
On Wed, Aug 3, 2011 at 4:35 PM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams <dan.j.williams@intel.com>
> wrote:
>
>> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
>> > sh->lock is now mainly used to ensure that two threads aren't running
>> > in the locked part of handle_stripe[56] at the same time.
>> >
>> > That can more neatly be achieved with an 'active' flag which we set
>> > while running handle_stripe. If we find the flag is set, we simply
>> > requeue the stripe for later by setting STRIPE_HANDLE.
>> >
>> > For safety we take ->device_lock while examining the state of the
>> > stripe and creating a summary in 'stripe_head_state / r6_state'.
>> > This possibly isn't needed but as shared fields like ->toread,
>> > ->towrite are checked it is safer for now at least.
>> >
>> > We leave the label after the old 'unlock' called "unlock" because it
>> > will disappear in a few patches, so renaming seems pointless.
>> >
>> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
>> > later, but that is not a problem.
>> >
>>
>> This removal reminds me of one thing I have wondered about, but to
>> date have not found the time to measure (maybe someone might beat me
>> to it if the idea is out there), is what is the overhead of all the
>> atomic operations that raid5.c generates? If we can guarantee that
>> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can we
>> downgrade set_bit and clear_bit to their non-atomic __set_bit and
>> __clear_bit versions and recover some cpu cycles?
>>
>
> You can only use the unlocked version if you know that no other CPU will
> change any of the bits in the 'unsigned long'. As STRIPE_HANDLE can be set
> at any time, I think all accesses to sh->state must be atomic.
>
> However 'struct stripe_head_state' is thread-local so setting/clearing flags
> in ops_request can probably benefit from non-atomic ops.
I was particularly eyeing 'flags' in struct r5dev...
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag.
2011-08-03 23:45 ` Williams, Dan J
@ 2011-08-04 0:18 ` NeilBrown
0 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2011-08-04 0:18 UTC (permalink / raw)
To: Williams, Dan J; +Cc: linux-raid
On Wed, 3 Aug 2011 16:45:35 -0700 "Williams, Dan J"
<dan.j.williams@intel.com> wrote:
> On Wed, Aug 3, 2011 at 4:35 PM, NeilBrown <neilb@suse.de> wrote:
> > On Wed, 3 Aug 2011 15:47:47 -0700 Dan Williams <dan.j.williams@intel.com>
> > wrote:
> >
> >> On Wed, Jul 20, 2011 at 7:32 PM, NeilBrown <neilb@suse.de> wrote:
> >> > sh->lock is now mainly used to ensure that two threads aren't running
> >> > in the locked part of handle_stripe[56] at the same time.
> >> >
> >> > That can more neatly be achieved with an 'active' flag which we set
> >> > while running handle_stripe. If we find the flag is set, we simply
> >> > requeue the stripe for later by setting STRIPE_HANDLE.
> >> >
> >> > For safety we take ->device_lock while examining the state of the
> >> > stripe and creating a summary in 'stripe_head_state / r6_state'.
> >> > This possibly isn't needed but as shared fields like ->toread,
> >> > ->towrite are checked it is safer for now at least.
> >> >
> >> > We leave the label after the old 'unlock' called "unlock" because it
> >> > will disappear in a few patches, so renaming seems pointless.
> >> >
> >> > This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
> >> > later, but that is not a problem.
> >> >
> >>
> >> This removal reminds me of one thing I have wondered about, but to
> >> date have not found the time to measure (maybe someone might beat me
> >> to it if the idea is out there), is what is the overhead of all the
> >> atomic operations that raid5.c generates? If we can guarantee that
> >> certain updates only happen under sh->lock (now STRIPE_ACTIVE) can we
> >> downgrade set_bit and clear_bit to their non-atomic __set_bit and
> >> __clear_bit versions and recover some cpu cycles?
> >>
> >
> > You can only use the unlocked version if you know that no other CPU will
> > change any of the bits in the 'unsigned long'. As STRIPE_HANDLE can be set
> > at any time, I think all accesses to sh->state must be atomic.
> >
> > However 'struct stripe_head_state' is thread-local so setting/clearing flags
> > in ops_request can probably benefit from non-atomic ops.
>
> I was particularly eyeing 'flags' in struct r5dev...
Good point... yes, they would probably be safe.
Async updates only happen when the refcount on the stripe head is != 0,
and handle_stripe only updates things when it owns the sole reference.
So the updates in handle_stripe and functions directly called from there
should be safe to use the unlocked versions...
NeilBrown
^ permalink raw reply [flat|nested] 71+ messages in thread
end of thread
Thread overview: 71+ messages
2011-07-21 2:32 [md PATCH 00/34] md patches for 3.1 - part 1 NeilBrown
2011-07-21 2:32 ` [md PATCH 02/34] md/raid10: factor out common bio handling code NeilBrown
2011-07-21 2:32 ` [md PATCH 01/34] md/raid10: get rid of duplicated conditional expression NeilBrown
2011-07-21 2:32 ` [md PATCH 03/34] md/raid10: share pages between read and write bio's during recovery NeilBrown
2011-07-21 2:32 ` [md PATCH 06/34] md/raid5: Remove use of sh->lock in sync_request NeilBrown
2011-07-22 3:39 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 11/34] md/raid5: add some more fields to stripe_head_state NeilBrown
2011-07-22 5:31 ` Namhyung Kim
2011-07-26 1:35 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 04/34] md/raid5: use kmem_cache_zalloc() NeilBrown
2011-07-21 2:32 ` [md PATCH 08/34] md/raid5: replace sh->lock with an 'active' flag NeilBrown
2011-07-22 4:27 ` Namhyung Kim
2011-07-22 4:49 ` NeilBrown
2011-07-22 5:03 ` Namhyung Kim
2011-08-03 22:47 ` Dan Williams
2011-08-03 23:35 ` NeilBrown
2011-08-03 23:45 ` Williams, Dan J
2011-08-04 0:18 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 07/34] md/raid5: Protect some more code with ->device_lock NeilBrown
2011-07-22 3:54 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 05/34] md/raid5: get rid of duplicated call to bio_data_dir() NeilBrown
2011-07-21 2:32 ` [md PATCH 10/34] md/raid5: unify stripe_head_state and r6_state NeilBrown
2011-07-22 4:49 ` Namhyung Kim
2011-07-22 5:15 ` NeilBrown
2011-07-22 5:37 ` NeilBrown
2011-07-22 5:53 ` Namhyung Kim
2011-07-26 6:44 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 09/34] md/raid5: move common code into handle_stripe NeilBrown
2011-07-22 4:30 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 14/34] md/raid5: move more code into common handle_stripe NeilBrown
2011-07-22 7:32 ` Namhyung Kim
2011-07-26 1:48 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 15/34] md/raid5: rearrange a test in fetch_block6 NeilBrown
2011-07-22 7:37 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 16/34] md/raid5: unite fetch_block5 and fetch_block6 NeilBrown
2011-07-22 8:24 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 12/34] md/raid5: move stripe_head_state and more code into handle_stripe NeilBrown
2011-07-22 5:41 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 NeilBrown
2011-07-22 9:10 ` Namhyung Kim
2011-07-26 1:52 ` NeilBrown
2011-07-26 2:41 ` H. Peter Anvin
2011-07-26 9:40 ` David Brown
2011-07-26 13:23 ` Namhyung Kim
2011-07-26 15:01 ` David Brown
2011-07-21 2:32 ` [md PATCH 18/34] md/raid5: move more common code into handle_stripe NeilBrown
2011-07-22 9:20 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 13/34] md/raid5: Move code for finishing a reconstruction " NeilBrown
2011-07-22 7:09 ` Namhyung Kim
2011-07-26 1:44 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 19/34] md/raid5: move some more common code " NeilBrown
2011-07-22 9:29 ` Namhyung Kim
2011-07-26 1:59 ` NeilBrown
2011-07-21 2:32 ` [md PATCH 23/34] md: introduce link/unlink_rdev() helpers NeilBrown
2011-07-21 2:32 ` [md PATCH 21/34] md: use proper little-endian bitops NeilBrown
2011-07-21 2:32 ` [md PATCH 27/34] md/raid10: Improve decision on whether to fail a device with a read error NeilBrown
2011-07-21 2:32 ` [md PATCH 26/34] md/raid10: Make use of new recovery_disabled handling NeilBrown
2011-07-21 2:32 ` [md PATCH 20/34] md/raid5: finalise new merged handle_stripe NeilBrown
2011-07-22 9:36 ` Namhyung Kim
2011-07-26 2:02 ` NeilBrown
2011-07-26 4:50 ` Namhyung Kim
2011-07-21 2:32 ` [md PATCH 25/34] md: change managed of recovery_disabled NeilBrown
2011-07-21 2:32 ` [md PATCH 22/34] md/raid: use printk_ratelimited instead of printk_ratelimit NeilBrown
2011-07-21 2:32 ` [md PATCH 24/34] md: remove ro check in md_check_recovery() NeilBrown
2011-07-21 2:32 ` [md PATCH 30/34] md/raid5: move rdev->corrected_errors counting NeilBrown
2011-07-21 2:32 ` [md PATCH 28/34] md: get rid of unnecessary casts on page_address() NeilBrown
2011-07-21 2:32 ` [md PATCH 29/34] md/raid1: move rdev->corrected_errors counting NeilBrown
2011-07-21 2:32 ` [md PATCH 33/34] MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe NeilBrown
2011-07-21 2:32 ` [md PATCH 32/34] md/raid5: Avoid BUG caused by multiple failures NeilBrown
2011-07-21 2:32 ` [md PATCH 34/34] MD bitmap: Revert DM dirty log hooks NeilBrown
2011-07-21 2:32 ` [md PATCH 31/34] md/raid10: move rdev->corrected_errors counting NeilBrown