* [md PATCH 00/23] md patches heading for 3.4
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Following are a bunch of patches that I'm planning to submit in the
next merge window (which I expect to open just after LWN publishes
this week :-).
There is nothing really exciting - mostly clean-up patches that are
the product of developing some other features that I'm still
working on (like reshaping some RAID10 arrays to more devices).
The features here are:
- RAID10 can grow or shrink to match changes in the underlying
devices.
- linear, RAID0, RAID1 and RAID10 now call the merge_bvec_fn of
member devices, so that if you stack one of these atop
LVM or RAID0 or similar, it won't insist on breaking all
requests up into single-page requests on the path through the
multiple layers (a rough sketch of the idea is just below).
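Here is that sketch (not code from the patches below; map_to_member()
is a hypothetical helper standing in for each personality's sector
mapping):

	static int stacked_mergeable_bvec(struct request_queue *q,
					  struct bvec_merge_data *bvm,
					  struct bio_vec *biovec)
	{
		struct mddev *mddev = q->queuedata;
		sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
		int max = biovec->bv_len;
		struct md_rdev *rdev = map_to_member(mddev, &sector);
		struct request_queue *subq = bdev_get_queue(rdev->bdev);

		if (subq->merge_bvec_fn) {
			/* ask the member device how much it will accept here */
			bvm->bi_bdev = rdev->bdev;
			bvm->bi_sector = sector + rdev->data_offset;
			max = min(max, subq->merge_bvec_fn(subq, bvm, biovec));
		}
		return max;
	}

The real implementations (in the linear, RAID0, RAID1 and RAID10
patches later in this series) also deal with chunk and device
boundaries.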
Review, as always, is most welcome.
NeilBrown
---
NeilBrown (20):
md: fix clearing of the 'changed' flags for the bad blocks list.
md/bitmap: discard CHUNK_BLOCK_SHIFT macro
md/bitmap: remove unnecessary indirection when allocating.
md/bitmap: remove some pointless locking.
md/bitmap: change a 'goto' to a normal 'if' construct.
md/bitmap: move printing of bitmap status to bitmap.c
md/bitmap: remove some unused noise from bitmap.h
md/raid10 - support resizing some RAID10 arrays.
md/raid1: handle merge_bvec_fn in member devices.
md/raid10: handle merge_bvec_fn in member devices.
md: add proper merge_bvec handling to RAID0 and Linear.
md: tidy up rdev_for_each usage.
md/raid1,raid10: avoid deadlock during resync/recovery.
md/bitmap: ensure to load bitmap when creating via sysfs.
md: don't set md arrays to readonly on shutdown.
md: allow re-add to failed arrays.
md: allow last device to be forcibly removed from RAID1/RAID10.
md/raid5: removed unused 'added_devices' variable.
md/raid10: remove unnecessary smp_mb() from end_sync_write
md/raid5: make sure reshape_position is cleared on error path.
majianpeng (3):
md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
md: Use existing macros instead of numbers
drivers/md/bitmap.c | 152 ++++++++++++++++-----------------
drivers/md/bitmap.h | 22 -----
drivers/md/dm-raid.c | 16 ++-
drivers/md/faulty.c | 2
drivers/md/linear.c | 32 +++----
drivers/md/md.c | 156 ++++++++++++++--------------------
drivers/md/md.h | 17 +++-
drivers/md/multipath.c | 8 +-
drivers/md/raid0.c | 164 ++++++++++++++++++++----------------
drivers/md/raid0.h | 11 ++
drivers/md/raid1.c | 111 +++++++++++++++++-------
drivers/md/raid10.c | 206 ++++++++++++++++++++++++++++++++-------------
drivers/md/raid5.c | 35 +++-----
include/linux/raid/md_p.h | 6 +
14 files changed, 531 insertions(+), 407 deletions(-)
--
Signature
* [md PATCH 01/23] md/raid5: make sure reshape_position is cleared on error path.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Leaving a valid reshape_position value in place could be confusing.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 360f2b9..8b3eb41 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5592,6 +5592,7 @@ static int raid5_start_reshape(struct mddev *mddev)
spin_lock_irq(&conf->device_lock);
mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
conf->reshape_progress = MaxSector;
+ mddev->reshape_position = MaxSector;
spin_unlock_irq(&conf->device_lock);
return -EAGAIN;
}
* [md PATCH 02/23] md/raid10: remove unnecessary smp_mb() from end_sync_write
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Recent commit 4ca40c2ce099e4f1ce3 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race, as we never attempt to remove a
disk while resync (or recovery) is happening,
so the smp_mb() is just noise.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 58c44d6..1a19c96 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1682,10 +1682,8 @@ static void end_sync_write(struct bio *bio, int error)
d = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
if (repl)
rdev = conf->mirrors[d].replacement;
- if (!rdev) {
- smp_mb();
+ else
rdev = conf->mirrors[d].rdev;
- }
if (!uptodate) {
if (repl)
* [md PATCH 03/23] md/raid5: removed unused 'added_devices' variable.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
commit 908f4fbd265733 removed the last user of this variable,
so we should discard it completely.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8b3eb41..3f55145 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5547,16 +5547,14 @@ static int raid5_start_reshape(struct mddev *mddev)
* such devices during the reshape and confusion could result.
*/
if (mddev->delta_disks >= 0) {
- int added_devices = 0;
list_for_each_entry(rdev, &mddev->disks, same_set)
if (rdev->raid_disk < 0 &&
!test_bit(Faulty, &rdev->flags)) {
if (raid5_add_disk(mddev, rdev) == 0) {
if (rdev->raid_disk
- >= conf->previous_raid_disks) {
+ >= conf->previous_raid_disks)
set_bit(In_sync, &rdev->flags);
- added_devices++;
- } else
+ else
rdev->recovery_offset = 0;
if (sysfs_link_rdev(mddev, rdev))
@@ -5566,7 +5564,6 @@ static int raid5_start_reshape(struct mddev *mddev)
&& !test_bit(Faulty, &rdev->flags)) {
/* This is a spare that was manually added */
set_bit(In_sync, &rdev->flags);
- added_devices++;
}
/* When a reshape changes the number of devices,
* [md PATCH 04/23] md: Use existing macros instead of numbers
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
From: majianpeng <majianpeng@gmail.com>
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
include/linux/raid/md_p.h | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/include/linux/raid/md_p.h b/include/linux/raid/md_p.h
index 6f6df86..8c0a3ad 100644
--- a/include/linux/raid/md_p.h
+++ b/include/linux/raid/md_p.h
@@ -281,6 +281,10 @@ struct mdp_superblock_1 {
* active device with same 'role'.
* 'recovery_offset' is also set.
*/
-#define MD_FEATURE_ALL (1|2|4|8|16)
+#define MD_FEATURE_ALL (MD_FEATURE_BITMAP_OFFSET \
+ |MD_FEATURE_RECOVERY_OFFSET \
+ |MD_FEATURE_RESHAPE_ACTIVE \
+ |MD_FEATURE_BAD_BLOCKS \
+ |MD_FEATURE_REPLACEMENT)
#endif
* [md PATCH 05/23] md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read().
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
From: majianpeng <majianpeng@gmail.com>
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid5.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3f55145..99b2bbf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -208,11 +208,10 @@ static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
md_wakeup_thread(conf->mddev->thread);
} else {
BUG_ON(stripe_operations_active(sh));
- if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
- atomic_dec(&conf->preread_active_stripes);
- if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
+ if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+ if (atomic_dec_return(&conf->preread_active_stripes)
+ < IO_THRESHOLD)
md_wakeup_thread(conf->mddev->thread);
- }
atomic_dec(&conf->active_stripes);
if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
list_add_tail(&sh->lru, &conf->inactive_list);
* [md PATCH 06/23] md: allow last device to be forcibly removed from RAID1/RAID10.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This, in general, maximises access to data.
However the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
cases it is not harmful and can make automation easier by removing
special cases.
Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.
So add a 'force' option to 'md_error()' and '->error_handler()' which
bypasses the 'last disk' checks for RAID1 and RAID10.
Set it when the removal is explicitly requested by user-space, or
when it is the result of a failed metadata write.
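For example, the kind of sequence this is intended to allow (device
names are illustrative; per the patch, both the sysfs 'state' attribute
and the SET_DISK_FAULTY ioctl now pass force=1 to md_error()):

	echo faulty > /sys/block/md0/md/dev-sdb1/state
	echo remove > /sys/block/md0/md/dev-sdb1/state

or equivalently

	mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1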
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 16 ++++++++--------
drivers/md/md.h | 4 ++--
drivers/md/multipath.c | 6 +++---
drivers/md/raid1.c | 13 +++++++------
drivers/md/raid10.c | 19 ++++++++++---------
drivers/md/raid5.c | 10 +++++-----
6 files changed, 35 insertions(+), 33 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ce88755..3ca53c6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -825,7 +825,7 @@ static void super_written(struct bio *bio, int error)
printk("md: super_written gets error=%d, uptodate=%d\n",
error, test_bit(BIO_UPTODATE, &bio->bi_flags));
WARN_ON(test_bit(BIO_UPTODATE, &bio->bi_flags));
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 1);
}
if (atomic_dec_and_test(&mddev->pending_writes))
@@ -1785,7 +1785,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
/* Nothing to do for bad blocks*/ ;
else if (sb->bblog_offset == 0)
/* Cannot record bad blocks on this device */
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 0);
else {
struct badblocks *bb = &rdev->badblocks;
u64 *bbp = (u64 *)page_address(rdev->bb_page);
@@ -2367,7 +2367,7 @@ repeat:
list_for_each_entry(rdev, &mddev->disks, same_set) {
if (rdev->badblocks.changed) {
md_ack_all_badblocks(&rdev->badblocks);
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 0);
}
clear_bit(Blocked, &rdev->flags);
clear_bit(BlockedBadBlocks, &rdev->flags);
@@ -2592,7 +2592,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
*/
int err = -EINVAL;
if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
- md_error(rdev->mddev, rdev);
+ md_error(rdev->mddev, rdev, 1);
if (test_bit(Faulty, &rdev->flags))
err = 0;
else
@@ -2623,7 +2623,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
/* metadata handler doesn't understand badblocks,
* so we need to fail the device
*/
- md_error(rdev->mddev, rdev);
+ md_error(rdev->mddev, rdev, 1);
}
clear_bit(Blocked, &rdev->flags);
clear_bit(BlockedBadBlocks, &rdev->flags);
@@ -6069,7 +6069,7 @@ static int set_disk_faulty(struct mddev *mddev, dev_t dev)
if (!rdev)
return -ENODEV;
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 1);
if (!test_bit(Faulty, &rdev->flags))
return -EBUSY;
return 0;
@@ -6524,7 +6524,7 @@ void md_unregister_thread(struct md_thread **threadp)
kfree(thread);
}
-void md_error(struct mddev *mddev, struct md_rdev *rdev)
+void md_error(struct mddev *mddev, struct md_rdev *rdev, int force)
{
if (!mddev) {
MD_BUG();
@@ -6536,7 +6536,7 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
if (!mddev->pers || !mddev->pers->error_handler)
return;
- mddev->pers->error_handler(mddev,rdev);
+ mddev->pers->error_handler(mddev, rdev, force);
if (mddev->degraded)
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
sysfs_notify_dirent_safe(rdev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 44c63df..457885a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -437,7 +437,7 @@ struct md_personality
/* error_handler must set ->faulty and clear ->in_sync
* if appropriate, and should abort recovery if needed
*/
- void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
+ void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, int force);
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*spare_active) (struct mddev *mddev);
@@ -579,7 +579,7 @@ extern void md_check_recovery(struct mddev *mddev);
extern void md_write_start(struct mddev *mddev, struct bio *bi);
extern void md_write_end(struct mddev *mddev);
extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
-extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
+extern void md_error(struct mddev *mddev, struct md_rdev *rdev, int force);
extern int mddev_congested(struct mddev *mddev, int bits);
extern void md_flush_request(struct mddev *mddev, struct bio *bio);
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index a222f51..e626567 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -97,7 +97,7 @@ static void multipath_end_request(struct bio *bio, int error)
* oops, IO error:
*/
char b[BDEVNAME_SIZE];
- md_error (mp_bh->mddev, rdev);
+ md_error (mp_bh->mddev, rdev, 0);
printk(KERN_ERR "multipath: %s: rescheduling sector %llu\n",
bdevname(rdev->bdev,b),
(unsigned long long)bio->bi_sector);
@@ -184,12 +184,12 @@ static int multipath_congested(void *data, int bits)
/*
* Careful, this can execute in IRQ contexts as well!
*/
-static void multipath_error (struct mddev *mddev, struct md_rdev *rdev)
+static void multipath_error(struct mddev *mddev, struct md_rdev *rdev, int force)
{
struct mpconf *conf = mddev->private;
char b[BDEVNAME_SIZE];
- if (conf->raid_disks - mddev->degraded <= 1) {
+ if (conf->raid_disks - mddev->degraded <= 1 && !force) {
/*
* Uh oh, we can do nothing if this is our last path, but
* first check if this is a queued request for a device
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a0b225e..cb04d56 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1188,7 +1188,7 @@ static void status(struct seq_file *seq, struct mddev *mddev)
}
-static void error(struct mddev *mddev, struct md_rdev *rdev)
+static void error(struct mddev *mddev, struct md_rdev *rdev, int force)
{
char b[BDEVNAME_SIZE];
struct r1conf *conf = mddev->private;
@@ -1200,6 +1200,7 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
* else mark the drive as failed
*/
if (test_bit(In_sync, &rdev->flags)
+ && !force
&& (conf->raid_disks - mddev->degraded) == 1) {
/*
* Don't fail the drive, act as though we were just a
@@ -1518,7 +1519,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
}
/* need to record an error - either for the block or the device */
if (!rdev_set_badblocks(rdev, sector, sectors, 0))
- md_error(rdev->mddev, rdev);
+ md_error(rdev->mddev, rdev, 0);
return 0;
}
@@ -1819,7 +1820,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
/* Cannot read from anywhere - mark it bad */
struct md_rdev *rdev = conf->mirrors[read_disk].rdev;
if (!rdev_set_badblocks(rdev, sect, s, 0))
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 0);
break;
}
/* write it back and re-read */
@@ -1972,7 +1973,7 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
test_bit(R1BIO_WriteError, &r1_bio->state)) {
if (!rdev_set_badblocks(rdev, r1_bio->sector, s, 0))
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
}
}
put_buf(r1_bio);
@@ -1996,7 +1997,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
*/
if (!narrow_write_error(r1_bio, m)) {
md_error(conf->mddev,
- conf->mirrors[m].rdev);
+ conf->mirrors[m].rdev, 0);
/* an I/O failed, we can't clear the bitmap */
set_bit(R1BIO_Degraded, &r1_bio->state);
}
@@ -2032,7 +2033,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
r1_bio->sector, r1_bio->sectors);
unfreeze_array(conf);
} else
- md_error(mddev, conf->mirrors[r1_bio->read_disk].rdev);
+ md_error(mddev, conf->mirrors[r1_bio->read_disk].rdev, 0);
bio = r1_bio->bios[r1_bio->read_disk];
bdevname(bio->bi_bdev, b);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1a19c96..1497cd6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -430,7 +430,7 @@ static void raid10_end_write_request(struct bio *bio, int error)
/* Never record new bad blocks to replacement,
* just fail it.
*/
- md_error(rdev->mddev, rdev);
+ md_error(rdev->mddev, rdev, 1);
else {
set_bit(WriteErrorSeen, &rdev->flags);
if (!test_and_set_bit(WantReplacement, &rdev->flags))
@@ -1352,7 +1352,7 @@ static int enough(struct r10conf *conf, int ignore)
return 1;
}
-static void error(struct mddev *mddev, struct md_rdev *rdev)
+static void error(struct mddev *mddev, struct md_rdev *rdev, int force)
{
char b[BDEVNAME_SIZE];
struct r10conf *conf = mddev->private;
@@ -1364,6 +1364,7 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
* else mark the drive as failed
*/
if (test_bit(In_sync, &rdev->flags)
+ && !force
&& !enough(conf, rdev->raid_disk))
/*
* Don't fail the drive, just return an IO error.
@@ -1687,7 +1688,7 @@ static void end_sync_write(struct bio *bio, int error)
if (!uptodate) {
if (repl)
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 1);
else {
set_bit(WriteErrorSeen, &rdev->flags);
if (!test_and_set_bit(WantReplacement, &rdev->flags))
@@ -2019,7 +2020,7 @@ static int r10_sync_page_io(struct md_rdev *rdev, sector_t sector,
}
/* need to record an error - either for the block or the device */
if (!rdev_set_badblocks(rdev, sector, sectors, 0))
- md_error(rdev->mddev, rdev);
+ md_error(rdev->mddev, rdev, 0);
return 0;
}
@@ -2063,7 +2064,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
printk(KERN_NOTICE
"md/raid10:%s: %s: Failing raid device\n",
mdname(mddev), b);
- md_error(mddev, conf->mirrors[d].rdev);
+ md_error(mddev, conf->mirrors[d].rdev, 0);
r10_bio->devs[r10_bio->read_slot].bio = IO_BLOCKED;
return;
}
@@ -2119,7 +2120,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
r10_bio->devs[r10_bio->read_slot].addr
+ sect,
s, 0)) {
- md_error(mddev, rdev);
+ md_error(mddev, rdev, 0);
r10_bio->devs[r10_bio->read_slot].bio
= IO_BLOCKED;
}
@@ -2423,7 +2424,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
rdev,
r10_bio->devs[m].addr,
r10_bio->sectors, 0))
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
}
rdev = conf->mirrors[dev].replacement;
if (r10_bio->devs[m].repl_bio == NULL)
@@ -2439,7 +2440,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
rdev,
r10_bio->devs[m].addr,
r10_bio->sectors, 0))
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
}
}
put_buf(r10_bio);
@@ -2457,7 +2458,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
} else if (bio != NULL &&
!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
if (!narrow_write_error(r10_bio, m)) {
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
set_bit(R10BIO_Degraded,
&r10_bio->state);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 99b2bbf..d3b2fbf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1738,7 +1738,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
else {
clear_bit(R5_ReadError, &sh->dev[i].flags);
clear_bit(R5_ReWrite, &sh->dev[i].flags);
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
}
}
rdev_dec_pending(rdev, conf->mddev);
@@ -1786,7 +1786,7 @@ static void raid5_end_write_request(struct bio *bi, int error)
if (replacement) {
if (!uptodate)
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
else if (is_badblock(rdev, sh->sector,
STRIPE_SECTORS,
&first_bad, &bad_sectors))
@@ -1835,7 +1835,7 @@ static void raid5_build_block(struct stripe_head *sh, int i, int previous)
dev->sector = compute_blocknr(sh, i, previous);
}
-static void error(struct mddev *mddev, struct md_rdev *rdev)
+static void error(struct mddev *mddev, struct md_rdev *rdev, int force)
{
char b[BDEVNAME_SIZE];
struct r5conf *conf = mddev->private;
@@ -2383,7 +2383,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
rdev,
sh->sector,
STRIPE_SECTORS, 0))
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
rdev_dec_pending(rdev, conf->mddev);
}
}
@@ -3550,7 +3550,7 @@ finish:
rdev = conf->disks[i].rdev;
if (!rdev_set_badblocks(rdev, sh->sector,
STRIPE_SECTORS, 0))
- md_error(conf->mddev, rdev);
+ md_error(conf->mddev, rdev, 0);
rdev_dec_pending(rdev, conf->mddev);
}
if (test_and_clear_bit(R5_MadeGood, &dev->flags)) {
* [md PATCH 07/23] md: allow re-add to failed arrays.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.
However there may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.
So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only array types that check for a failed array).
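For example (device name illustrative), a device that was recently
removed from a bitmap-protected array which has since failed could be
put back with something like

	mdadm /dev/md0 --re-add /dev/sdb1

and, if the bitmap is clean, access to the data could be restored.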
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 2 +-
drivers/md/raid5.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 1497cd6..ca1ea25 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1484,7 +1484,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
* very different from resync
*/
return -EBUSY;
- if (!enough(conf, -1))
+ if (rdev->saved_raid_disk < 0 && !enough(conf, -1))
return -EINVAL;
if (rdev->raid_disk >= 0)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d3b2fbf..34f3b5c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5361,7 +5361,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;
- if (has_failed(conf))
+ if (rdev->saved_raid_disk < 0 && has_failed(conf))
/* no point adding a device */
return -EINVAL;
* [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
It seems that with recent kernels, writeback can still be happening
while shutdown is happening, and consequently data can be written
after the md reboot notifier switches all arrays to read-only.
This causes a BUG.
So don't switch them to read-only - just mark them clean and
set 'safemode' to '2', which means that immediately after any
write the array will be switched back to 'clean'.
This could result in the shutdown happening when the array is marked
dirty, thus forcing a resync on reboot. However if you reboot
without performing a "sync" first, you get to keep both halves.
This is suitable for any stable kernel (though there might be some
conflicts with obvious fixes in earlier kernels).
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 37 +++++++++++++++----------------------
1 files changed, 15 insertions(+), 22 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3ca53c6..f494e79 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8157,30 +8157,23 @@ static int md_notify_reboot(struct notifier_block *this,
struct mddev *mddev;
int need_delay = 0;
- if ((code == SYS_DOWN) || (code == SYS_HALT) || (code == SYS_POWER_OFF)) {
-
- printk(KERN_INFO "md: stopping all md devices.\n");
-
- for_each_mddev(mddev, tmp) {
- if (mddev_trylock(mddev)) {
- /* Force a switch to readonly even array
- * appears to still be in use. Hence
- * the '100'.
- */
- md_set_readonly(mddev, 100);
- mddev_unlock(mddev);
- }
- need_delay = 1;
+ for_each_mddev(mddev, tmp) {
+ if (mddev_trylock(mddev)) {
+ __md_stop_writes(mddev);
+ mddev->safemode = 2;
+ mddev_unlock(mddev);
}
- /*
- * certain more exotic SCSI devices are known to be
- * volatile wrt too early system reboots. While the
- * right place to handle this issue is the given
- * driver, we do want to have a safe RAID driver ...
- */
- if (need_delay)
- mdelay(1000*1);
+ need_delay = 1;
}
+ /*
+ * certain more exotic SCSI devices are known to be
+ * volatile wrt too early system reboots. While the
+ * right place to handle this issue is the given
+ * driver, we do want to have a safe RAID driver ...
+ */
+ if (need_delay)
+ mdelay(1000*1);
+
return NOTIFY_DONE;
}
* [md PATCH 09/23] md/bitmap: ensure to load bitmap when creating via sysfs.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
When commit 69e51b449d383e (md/bitmap: separate out loading a bitmap...)
created bitmap_load, it missed calling it after bitmap_create when a
bitmap is created through the sysfs interface.
So if a bitmap is added this way, we don't allocate memory properly
and can crash.
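The sysfs path in question is the bitmap 'location' attribute, e.g.
something like (offset illustrative):

	echo +8 > /sys/block/md0/md/bitmap/location

on a running array, which ends up in location_store() below.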
This is suitable for any -stable release since 2.6.35.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index cdf36b1..239af9a 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1904,6 +1904,8 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
if (mddev->pers) {
mddev->pers->quiesce(mddev, 1);
rv = bitmap_create(mddev);
+ if (!rv)
+ rv = bitmap_load(mddev);
if (rv) {
bitmap_destroy(mddev);
mddev->bitmap_info.offset = 0;
* [md PATCH 10/23] md/raid1, raid10: avoid deadlock during resync/recovery.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10. The first request gets
processed, blocks recovery and queues requests for the underlying
devices in current->bio_list. A resync request then starts,
which will wait for those requests and block new IO.
But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.
So allow that second request to proceed even though there is
a resync request about to start.
This is suitable for any -stable kernel.
Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 17 +++++++++++++++--
drivers/md/raid10.c | 17 +++++++++++++++--
2 files changed, 30 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cb04d56..3309de7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -737,9 +737,22 @@ static void wait_barrier(struct r1conf *conf)
spin_lock_irq(&conf->resync_lock);
if (conf->barrier) {
conf->nr_waiting++;
- wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
+ /* Wait for the barrier to drop.
+ * However if there are already pending
+ * requests (preventing the barrier from
+ * rising completely), and the
+ * pre-process bio queue isn't empty,
+ * then don't wait, as we need to empty
+ * that queue to get the nr_pending
+ * count down.
+ */
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->barrier ||
+ (conf->nr_pending &&
+ current->bio_list &&
+ !bio_list_empty(current->bio_list)),
conf->resync_lock,
- );
+ );
conf->nr_waiting--;
}
conf->nr_pending++;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ca1ea25..e0c5a88 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -863,9 +863,22 @@ static void wait_barrier(struct r10conf *conf)
spin_lock_irq(&conf->resync_lock);
if (conf->barrier) {
conf->nr_waiting++;
- wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
+ /* Wait for the barrier to drop.
+ * However if there are already pending
+ * requests (preventing the barrier from
+ * rising completely), and the
+ * pre-process bio queue isn't empty,
+ * then don't wait, as we need to empty
+ * that queue to get the nr_pending
+ * count down.
+ */
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->barrier ||
+ (conf->nr_pending &&
+ current->bio_list &&
+ !bio_list_empty(current->bio_list)),
conf->resync_lock,
- );
+ );
conf->nr_waiting--;
}
conf->nr_pending++;
* [md PATCH 11/23] md: tidy up rdev_for_each usage.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which would be useful documentation.
Consequently some places use this safe version without needing it, and
many use an explicit list_for_each_entry.
So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.
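To make the distinction concrete, a small usage sketch (not from the
patch; drop_rdev() is a hypothetical helper that unlinks an entry):

	int count = 0;
	struct md_rdev *rdev, *tmp;

	/* plain iteration: nothing is unlinked inside the loop */
	rdev_for_each(rdev, mddev)
		count++;

	/* 'safe' iteration: the entry being visited may be unlinked */
	rdev_for_each_safe(rdev, tmp, mddev)
		if (test_bit(Faulty, &rdev->flags))
			drop_rdev(mddev, rdev);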
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 2 +
drivers/md/dm-raid.c | 16 +++++-----
drivers/md/faulty.c | 2 +
drivers/md/linear.c | 2 +
drivers/md/md.c | 74 ++++++++++++++++++++++++------------------------
drivers/md/md.h | 5 +++
drivers/md/multipath.c | 2 +
drivers/md/raid0.c | 10 +++---
drivers/md/raid1.c | 4 +--
drivers/md/raid10.c | 4 +--
drivers/md/raid5.c | 8 +++--
11 files changed, 66 insertions(+), 63 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 239af9a..2c5dbc6 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -171,7 +171,7 @@ static struct page *read_sb_page(struct mddev *mddev, loff_t offset,
did_alloc = 1;
}
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (! test_bit(In_sync, &rdev->flags)
|| test_bit(Faulty, &rdev->flags))
continue;
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 787022c..c5a875d 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -615,14 +615,14 @@ static int read_disk_sb(struct md_rdev *rdev, int size)
static void super_sync(struct mddev *mddev, struct md_rdev *rdev)
{
- struct md_rdev *r, *t;
+ struct md_rdev *r;
uint64_t failed_devices;
struct dm_raid_superblock *sb;
sb = page_address(rdev->sb_page);
failed_devices = le64_to_cpu(sb->failed_devices);
- rdev_for_each(r, t, mddev)
+ rdev_for_each(r, mddev)
if ((r->raid_disk >= 0) && test_bit(Faulty, &r->flags))
failed_devices |= (1ULL << r->raid_disk);
@@ -707,7 +707,7 @@ static int super_init_validation(struct mddev *mddev, struct md_rdev *rdev)
struct dm_raid_superblock *sb;
uint32_t new_devs = 0;
uint32_t rebuilds = 0;
- struct md_rdev *r, *t;
+ struct md_rdev *r;
struct dm_raid_superblock *sb2;
sb = page_address(rdev->sb_page);
@@ -750,7 +750,7 @@ static int super_init_validation(struct mddev *mddev, struct md_rdev *rdev)
* case the In_sync bit will /not/ be set and
* recovery_cp must be MaxSector.
*/
- rdev_for_each(r, t, mddev) {
+ rdev_for_each(r, mddev) {
if (!test_bit(In_sync, &r->flags)) {
DMINFO("Device %d specified for rebuild: "
"Clearing superblock", r->raid_disk);
@@ -782,7 +782,7 @@ static int super_init_validation(struct mddev *mddev, struct md_rdev *rdev)
* Now we set the Faulty bit for those devices that are
* recorded in the superblock as failed.
*/
- rdev_for_each(r, t, mddev) {
+ rdev_for_each(r, mddev) {
if (!r->sb_page)
continue;
sb2 = page_address(r->sb_page);
@@ -855,11 +855,11 @@ static int super_validate(struct mddev *mddev, struct md_rdev *rdev)
static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
{
int ret;
- struct md_rdev *rdev, *freshest, *tmp;
+ struct md_rdev *rdev, *freshest;
struct mddev *mddev = &rs->md;
freshest = NULL;
- rdev_for_each(rdev, tmp, mddev) {
+ rdev_for_each(rdev, mddev) {
if (!rdev->meta_bdev)
continue;
@@ -888,7 +888,7 @@ static int analyse_superblocks(struct dm_target *ti, struct raid_set *rs)
if (super_validate(mddev, freshest))
return -EINVAL;
- rdev_for_each(rdev, tmp, mddev)
+ rdev_for_each(rdev, mddev)
if ((rdev != freshest) && super_validate(mddev, rdev))
return -EINVAL;
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index feb2c3c..45135f6 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -315,7 +315,7 @@ static int run(struct mddev *mddev)
}
conf->nfaults = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
conf->rdev = rdev;
md_set_array_sectors(mddev, faulty_size(mddev, 0, 0));
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 6274565..6794074 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -138,7 +138,7 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
cnt = 0;
conf->array_sectors = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
int j = rdev->raid_disk;
struct dev_info *disk = conf->disks + j;
sector_t sectors;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f494e79..84796a5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -439,7 +439,7 @@ static void submit_flushes(struct work_struct *ws)
INIT_WORK(&mddev->flush_work, md_submit_flush_data);
atomic_set(&mddev->flush_pending, 1);
rcu_read_lock();
- list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+ rdev_for_each_rcu(rdev, mddev)
if (rdev->raid_disk >= 0 &&
!test_bit(Faulty, &rdev->flags)) {
/* Take two references, one is dropped
@@ -749,7 +749,7 @@ static struct md_rdev * find_rdev_nr(struct mddev *mddev, int nr)
{
struct md_rdev *rdev;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->desc_nr == nr)
return rdev;
@@ -760,7 +760,7 @@ static struct md_rdev * find_rdev(struct mddev * mddev, dev_t dev)
{
struct md_rdev *rdev;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->bdev->bd_dev == dev)
return rdev;
@@ -1342,7 +1342,7 @@ static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->state |= (1<<MD_SB_BITMAP_PRESENT);
sb->disks[0].state = (1<<MD_DISK_REMOVED);
- list_for_each_entry(rdev2, &mddev->disks, same_set) {
+ rdev_for_each(rdev2, mddev) {
mdp_disk_t *d;
int desc_nr;
int is_active = test_bit(In_sync, &rdev2->flags);
@@ -1816,7 +1816,7 @@ retry:
}
max_dev = 0;
- list_for_each_entry(rdev2, &mddev->disks, same_set)
+ rdev_for_each(rdev2, mddev)
if (rdev2->desc_nr+1 > max_dev)
max_dev = rdev2->desc_nr+1;
@@ -1833,7 +1833,7 @@ retry:
for (i=0; i<max_dev;i++)
sb->dev_roles[i] = cpu_to_le16(0xfffe);
- list_for_each_entry(rdev2, &mddev->disks, same_set) {
+ rdev_for_each(rdev2, mddev) {
i = rdev2->desc_nr;
if (test_bit(Faulty, &rdev2->flags))
sb->dev_roles[i] = cpu_to_le16(0xfffe);
@@ -1948,7 +1948,7 @@ int md_integrity_register(struct mddev *mddev)
return 0; /* nothing to do */
if (!mddev->gendisk || blk_get_integrity(mddev->gendisk))
return 0; /* shouldn't register, or already is */
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
/* skip spares and non-functional disks */
if (test_bit(Faulty, &rdev->flags))
continue;
@@ -2175,7 +2175,7 @@ static void export_array(struct mddev *mddev)
{
struct md_rdev *rdev, *tmp;
- rdev_for_each(rdev, tmp, mddev) {
+ rdev_for_each_safe(rdev, tmp, mddev) {
if (!rdev->mddev) {
MD_BUG();
continue;
@@ -2307,11 +2307,11 @@ static void md_print_devices(void)
bitmap_print_sb(mddev->bitmap);
else
printk("%s: ", mdname(mddev));
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
printk("<%s>", bdevname(rdev->bdev,b));
printk("\n");
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
print_rdev(rdev, mddev->major_version);
}
printk("md: **********************************\n");
@@ -2328,7 +2328,7 @@ static void sync_sbs(struct mddev * mddev, int nospares)
* with the rest of the array)
*/
struct md_rdev *rdev;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->sb_events == mddev->events ||
(nospares &&
rdev->raid_disk < 0 &&
@@ -2351,7 +2351,7 @@ static void md_update_sb(struct mddev * mddev, int force_change)
repeat:
/* First make sure individual recovery_offsets are correct */
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->raid_disk >= 0 &&
mddev->delta_disks >= 0 &&
!test_bit(In_sync, &rdev->flags) &&
@@ -2364,7 +2364,7 @@ repeat:
clear_bit(MD_CHANGE_DEVS, &mddev->flags);
if (!mddev->external) {
clear_bit(MD_CHANGE_PENDING, &mddev->flags);
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
md_ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev, 0);
@@ -2430,7 +2430,7 @@ repeat:
mddev->events --;
}
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed)
any_badblocks_changed++;
if (test_bit(Faulty, &rdev->flags))
@@ -2444,7 +2444,7 @@ repeat:
mdname(mddev), mddev->in_sync);
bitmap_update_sb(mddev->bitmap);
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
char b[BDEVNAME_SIZE];
if (rdev->sb_loaded != 1)
@@ -2493,7 +2493,7 @@ repeat:
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
sysfs_notify(&mddev->kobj, NULL, "sync_completed");
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (test_and_clear_bit(FaultRecorded, &rdev->flags))
clear_bit(Blocked, &rdev->flags);
@@ -2896,7 +2896,7 @@ rdev_size_store(struct md_rdev *rdev, const char *buf, size_t len)
struct md_rdev *rdev2;
mddev_lock(mddev);
- list_for_each_entry(rdev2, &mddev->disks, same_set)
+ rdev_for_each(rdev2, mddev)
if (rdev->bdev == rdev2->bdev &&
rdev != rdev2 &&
overlaps(rdev->data_offset, rdev->sectors,
@@ -3193,7 +3193,7 @@ static void analyze_sbs(struct mddev * mddev)
char b[BDEVNAME_SIZE];
freshest = NULL;
- rdev_for_each(rdev, tmp, mddev)
+ rdev_for_each_safe(rdev, tmp, mddev)
switch (super_types[mddev->major_version].
load_super(rdev, freshest, mddev->minor_version)) {
case 1:
@@ -3214,7 +3214,7 @@ static void analyze_sbs(struct mddev * mddev)
validate_super(mddev, freshest);
i = 0;
- rdev_for_each(rdev, tmp, mddev) {
+ rdev_for_each_safe(rdev, tmp, mddev) {
if (mddev->max_disks &&
(rdev->desc_nr >= mddev->max_disks ||
i > mddev->max_disks)) {
@@ -3403,7 +3403,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
return -EINVAL;
}
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
rdev->new_raid_disk = rdev->raid_disk;
/* ->takeover must set new_* and/or delta_disks
@@ -3456,7 +3456,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
mddev->safemode = 0;
}
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->raid_disk < 0)
continue;
if (rdev->new_raid_disk >= mddev->raid_disks)
@@ -3465,7 +3465,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
continue;
sysfs_unlink_rdev(mddev, rdev);
}
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->raid_disk < 0)
continue;
if (rdev->new_raid_disk == rdev->raid_disk)
@@ -4796,7 +4796,7 @@ int md_run(struct mddev *mddev)
* the only valid external interface is through the md
* device.
*/
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (test_bit(Faulty, &rdev->flags))
continue;
sync_blockdev(rdev->bdev);
@@ -4867,8 +4867,8 @@ int md_run(struct mddev *mddev)
struct md_rdev *rdev2;
int warned = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set)
- list_for_each_entry(rdev2, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev)
+ rdev_for_each(rdev2, mddev) {
if (rdev < rdev2 &&
rdev->bdev->bd_contains ==
rdev2->bdev->bd_contains) {
@@ -4945,7 +4945,7 @@ int md_run(struct mddev *mddev)
mddev->in_sync = 1;
smp_wmb();
mddev->ready = 1;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0)
if (sysfs_link_rdev(mddev, rdev))
/* failure here is OK */;
@@ -5175,7 +5175,7 @@ static int do_md_stop(struct mddev * mddev, int mode, int is_open)
/* tell userspace to handle 'inactive' */
sysfs_notify_dirent_safe(mddev->sysfs_state);
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0)
sysfs_unlink_rdev(mddev, rdev);
@@ -5226,7 +5226,7 @@ static void autorun_array(struct mddev *mddev)
printk(KERN_INFO "md: running: ");
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
char b[BDEVNAME_SIZE];
printk("<%s>", bdevname(rdev->bdev,b));
}
@@ -5356,7 +5356,7 @@ static int get_array_info(struct mddev * mddev, void __user * arg)
struct md_rdev *rdev;
nr=working=insync=failed=spare=0;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
nr++;
if (test_bit(Faulty, &rdev->flags))
failed++;
@@ -5923,7 +5923,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
* grow, and re-add.
*/
return -EBUSY;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
sector_t avail = rdev->sectors;
if (fit && (num_sectors == 0 || num_sectors > avail))
@@ -6758,7 +6758,7 @@ static int md_seq_show(struct seq_file *seq, void *v)
}
sectors = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
char b[BDEVNAME_SIZE];
seq_printf(seq, " %s[%d]",
bdevname(rdev->bdev,b), rdev->desc_nr);
@@ -7170,7 +7170,7 @@ void md_do_sync(struct mddev *mddev)
max_sectors = mddev->dev_sectors;
j = MaxSector;
rcu_read_lock();
- list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+ rdev_for_each_rcu(rdev, mddev)
if (rdev->raid_disk >= 0 &&
!test_bit(Faulty, &rdev->flags) &&
!test_bit(In_sync, &rdev->flags) &&
@@ -7342,7 +7342,7 @@ void md_do_sync(struct mddev *mddev)
if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
mddev->curr_resync = MaxSector;
rcu_read_lock();
- list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
+ rdev_for_each_rcu(rdev, mddev)
if (rdev->raid_disk >= 0 &&
mddev->delta_disks >= 0 &&
!test_bit(Faulty, &rdev->flags) &&
@@ -7388,7 +7388,7 @@ static int remove_and_add_spares(struct mddev *mddev)
mddev->curr_resync_completed = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0 &&
!test_bit(Blocked, &rdev->flags) &&
(test_bit(Faulty, &rdev->flags) ||
@@ -7406,7 +7406,7 @@ static int remove_and_add_spares(struct mddev *mddev)
"degraded");
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (rdev->raid_disk >= 0 &&
!test_bit(In_sync, &rdev->flags) &&
!test_bit(Faulty, &rdev->flags))
@@ -7451,7 +7451,7 @@ static void reap_sync_thread(struct mddev *mddev)
* do the superblock for an incrementally recovered device
* written out.
*/
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (!mddev->degraded ||
test_bit(In_sync, &rdev->flags))
rdev->saved_raid_disk = -1;
@@ -7529,7 +7529,7 @@ void md_check_recovery(struct mddev *mddev)
* failed devices.
*/
struct md_rdev *rdev;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0 &&
!test_bit(Blocked, &rdev->flags) &&
test_bit(Faulty, &rdev->flags) &&
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 457885a..a8d91fe 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -519,7 +519,10 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
/*
* iterates through the 'same array disks' ringlist
*/
-#define rdev_for_each(rdev, tmp, mddev) \
+#define rdev_for_each(rdev, mddev) \
+ list_for_each_entry(rdev, &((mddev)->disks), same_set)
+
+#define rdev_for_each_safe(rdev, tmp, mddev) \
list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)
#define rdev_for_each_rcu(rdev, mddev) \
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index e626567..e7852c1 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -428,7 +428,7 @@ static int multipath_run (struct mddev *mddev)
}
working_disks = 0;
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
disk_idx = rdev->raid_disk;
if (disk_idx < 0 ||
disk_idx >= mddev->raid_disks)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 7294bd1..7ef5cbf 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -91,7 +91,7 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
if (!conf)
return -ENOMEM;
- list_for_each_entry(rdev1, &mddev->disks, same_set) {
+ rdev_for_each(rdev1, mddev) {
pr_debug("md/raid0:%s: looking at %s\n",
mdname(mddev),
bdevname(rdev1->bdev, b));
@@ -102,7 +102,7 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
sector_div(sectors, mddev->chunk_sectors);
rdev1->sectors = sectors * mddev->chunk_sectors;
- list_for_each_entry(rdev2, &mddev->disks, same_set) {
+ rdev_for_each(rdev2, mddev) {
pr_debug("md/raid0:%s: comparing %s(%llu)"
" with %s(%llu)\n",
mdname(mddev),
@@ -157,7 +157,7 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
smallest = NULL;
dev = conf->devlist;
err = -EINVAL;
- list_for_each_entry(rdev1, &mddev->disks, same_set) {
+ rdev_for_each(rdev1, mddev) {
int j = rdev1->raid_disk;
if (mddev->level == 10) {
@@ -329,7 +329,7 @@ static sector_t raid0_size(struct mddev *mddev, sector_t sectors, int raid_disks
WARN_ONCE(sectors || raid_disks,
"%s does not support generic reshape\n", __func__);
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
array_sectors += rdev->sectors;
return array_sectors;
@@ -543,7 +543,7 @@ static void *raid0_takeover_raid45(struct mddev *mddev)
return ERR_PTR(-EINVAL);
}
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
/* check slot number for a disk */
if (rdev->raid_disk == mddev->raid_disks-1) {
printk(KERN_ERR "md/raid0:%s: raid5 must have missing parity disk!\n",
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3309de7..c0d3ffb 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2505,7 +2505,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
err = -EINVAL;
spin_lock_init(&conf->device_lock);
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
int disk_idx = rdev->raid_disk;
if (disk_idx >= mddev->raid_disks
|| disk_idx < 0)
@@ -2623,7 +2623,7 @@ static int run(struct mddev *mddev)
if (IS_ERR(conf))
return PTR_ERR(conf);
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
if (!mddev->gendisk)
continue;
disk_stack_limits(mddev->gendisk, rdev->bdev,
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e0c5a88..9838f08 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3254,7 +3254,7 @@ static int run(struct mddev *mddev)
blk_queue_io_opt(mddev->queue, chunk_size *
(conf->raid_disks / conf->near_copies));
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
disk_idx = rdev->raid_disk;
if (disk_idx >= conf->raid_disks
@@ -3420,7 +3420,7 @@ static void *raid10_takeover_raid0(struct mddev *mddev)
conf = setup_conf(mddev);
if (!IS_ERR(conf)) {
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk >= 0)
rdev->new_raid_disk = rdev->raid_disk * 2;
conf->barrier = 1;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34f3b5c..dc9ab8a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4842,7 +4842,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
pr_debug("raid456: run(%s) called.\n", mdname(mddev));
- list_for_each_entry(rdev, &mddev->disks, same_set) {
+ rdev_for_each(rdev, mddev) {
raid_disk = rdev->raid_disk;
if (raid_disk >= max_disks
|| raid_disk < 0)
@@ -5177,7 +5177,7 @@ static int run(struct mddev *mddev)
blk_queue_io_opt(mddev->queue, chunk_size *
(conf->raid_disks - conf->max_degraded));
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
}
@@ -5500,7 +5500,7 @@ static int raid5_start_reshape(struct mddev *mddev)
if (!check_stripe_cache(mddev))
return -ENOSPC;
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (!test_bit(In_sync, &rdev->flags)
&& !test_bit(Faulty, &rdev->flags))
spares++;
@@ -5546,7 +5546,7 @@ static int raid5_start_reshape(struct mddev *mddev)
* such devices during the reshape and confusion could result.
*/
if (mddev->delta_disks >= 0) {
- list_for_each_entry(rdev, &mddev->disks, same_set)
+ rdev_for_each(rdev, mddev)
if (rdev->raid_disk < 0 &&
!test_bit(Faulty, &rdev->flags)) {
if (raid5_add_disk(mddev, rdev) == 0) {
* [md PATCH 12/23] md: add proper merge_bvec handling to RAID0 and Linear.
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
These personalities currently set a max request size of one page
when any member device has a merge_bvec_fn because they don't
bother to call that function.
This causes extra work in splitting and combining requests.
So make the extra effort to call the merge_bvec_fn when it exists
so that we end up with larger requests out the bottom.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/linear.c | 30 +++++-----
drivers/md/raid0.c | 154 ++++++++++++++++++++++++++++-----------------------
drivers/md/raid0.h | 11 ++--
3 files changed, 107 insertions(+), 88 deletions(-)
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 6794074..b0fcc7d 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -68,10 +68,19 @@ static int linear_mergeable_bvec(struct request_queue *q,
struct dev_info *dev0;
unsigned long maxsectors, bio_sectors = bvm->bi_size >> 9;
sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
+ int maxbytes = biovec->bv_len;
+ struct request_queue *subq;
rcu_read_lock();
dev0 = which_dev(mddev, sector);
maxsectors = dev0->end_sector - sector;
+ subq = bdev_get_queue(dev0->rdev->bdev);
+ if (subq->merge_bvec_fn) {
+ bvm->bi_bdev = dev0->rdev->bdev;
+ bvm->bi_sector -= dev0->end_sector - dev0->rdev->sectors;
+ maxbytes = min(maxbytes, subq->merge_bvec_fn(subq, bvm,
+ biovec));
+ }
rcu_read_unlock();
if (maxsectors < bio_sectors)
@@ -80,12 +89,12 @@ static int linear_mergeable_bvec(struct request_queue *q,
maxsectors -= bio_sectors;
if (maxsectors <= (PAGE_SIZE >> 9 ) && bio_sectors == 0)
- return biovec->bv_len;
- /* The bytes available at this offset could be really big,
- * so we cap at 2^31 to avoid overflow */
- if (maxsectors > (1 << (31-9)))
- return 1<<31;
- return maxsectors << 9;
+ return maxbytes;
+
+ if (maxsectors > (maxbytes >> 9))
+ return maxbytes;
+ else
+ return maxsectors << 9;
}
static int linear_congested(void *data, int bits)
@@ -158,15 +167,6 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit max_segments to 1 lying within
- * a single page.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
conf->array_sectors += rdev->sectors;
cnt++;
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 7ef5cbf..6f31f55 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -188,16 +188,10 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
disk_stack_limits(mddev->gendisk, rdev1->bdev,
rdev1->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit ->max_segments to 1, lying within
- * a single page.
- */
- if (rdev1->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
+ if (rdev1->bdev->bd_disk->queue->merge_bvec_fn)
+ conf->has_merge_bvec = 1;
+
if (!smallest || (rdev1->sectors < smallest->sectors))
smallest = rdev1;
cnt++;
@@ -290,8 +284,64 @@ abort:
return err;
}
+/* Find the zone which holds a particular offset
+ * Update *sectorp to be an offset in that zone
+ */
+static struct strip_zone *find_zone(struct r0conf *conf,
+ sector_t *sectorp)
+{
+ int i;
+ struct strip_zone *z = conf->strip_zone;
+ sector_t sector = *sectorp;
+
+ for (i = 0; i < conf->nr_strip_zones; i++)
+ if (sector < z[i].zone_end) {
+ if (i)
+ *sectorp = sector - z[i-1].zone_end;
+ return z + i;
+ }
+ BUG();
+}
+
+/*
+ * remaps the bio to the target device. we separate two flows.
+ * power 2 flow and a general flow for the sake of perfromance
+*/
+static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
+ sector_t sector, sector_t *sector_offset)
+{
+ unsigned int sect_in_chunk;
+ sector_t chunk;
+ struct r0conf *conf = mddev->private;
+ int raid_disks = conf->strip_zone[0].nb_dev;
+ unsigned int chunk_sects = mddev->chunk_sectors;
+
+ if (is_power_of_2(chunk_sects)) {
+ int chunksect_bits = ffz(~chunk_sects);
+ /* find the sector offset inside the chunk */
+ sect_in_chunk = sector & (chunk_sects - 1);
+ sector >>= chunksect_bits;
+ /* chunk in zone */
+ chunk = *sector_offset;
+ /* quotient is the chunk in real device*/
+ sector_div(chunk, zone->nb_dev << chunksect_bits);
+ } else{
+ sect_in_chunk = sector_div(sector, chunk_sects);
+ chunk = *sector_offset;
+ sector_div(chunk, chunk_sects * zone->nb_dev);
+ }
+ /*
+ * position the bio over the real device
+ * real sector = chunk in device + starting of zone
+ * + the position in the chunk
+ */
+ *sector_offset = (chunk * chunk_sects) + sect_in_chunk;
+ return conf->devlist[(zone - conf->strip_zone)*raid_disks
+ + sector_div(sector, zone->nb_dev)];
+}
+
/**
- * raid0_mergeable_bvec -- tell bio layer if a two requests can be merged
+ * raid0_mergeable_bvec -- tell bio layer if two requests can be merged
* @q: request queue
* @bvm: properties of new bio
* @biovec: the request that could be merged to it.
@@ -303,10 +353,15 @@ static int raid0_mergeable_bvec(struct request_queue *q,
struct bio_vec *biovec)
{
struct mddev *mddev = q->queuedata;
+ struct r0conf *conf = mddev->private;
sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
+ sector_t sector_offset = sector;
int max;
unsigned int chunk_sectors = mddev->chunk_sectors;
unsigned int bio_sectors = bvm->bi_size >> 9;
+ struct strip_zone *zone;
+ struct md_rdev *rdev;
+ struct request_queue *subq;
if (is_power_of_2(chunk_sectors))
max = (chunk_sectors - ((sector & (chunk_sectors-1))
@@ -314,10 +369,27 @@ static int raid0_mergeable_bvec(struct request_queue *q,
else
max = (chunk_sectors - (sector_div(sector, chunk_sectors)
+ bio_sectors)) << 9;
- if (max < 0) max = 0; /* bio_add cannot handle a negative return */
+ if (max < 0)
+ max = 0; /* bio_add cannot handle a negative return */
if (max <= biovec->bv_len && bio_sectors == 0)
return biovec->bv_len;
- else
+ if (max < biovec->bv_len)
+ /* too small already, no need to check further */
+ return max;
+ if (!conf->has_merge_bvec)
+ return max;
+
+ /* May need to check subordinate device */
+ sector = sector_offset;
+ zone = find_zone(mddev->private, &sector_offset);
+ rdev = map_sector(mddev, zone, sector, &sector_offset);
+ subq = bdev_get_queue(rdev->bdev);
+ if (subq->merge_bvec_fn) {
+ bvm->bi_bdev = rdev->bdev;
+ bvm->bi_sector = sector_offset + zone->dev_start +
+ rdev->data_offset;
+ return min(max, subq->merge_bvec_fn(subq, bvm, biovec));
+ } else
return max;
}
@@ -397,62 +469,6 @@ static int raid0_stop(struct mddev *mddev)
return 0;
}
-/* Find the zone which holds a particular offset
- * Update *sectorp to be an offset in that zone
- */
-static struct strip_zone *find_zone(struct r0conf *conf,
- sector_t *sectorp)
-{
- int i;
- struct strip_zone *z = conf->strip_zone;
- sector_t sector = *sectorp;
-
- for (i = 0; i < conf->nr_strip_zones; i++)
- if (sector < z[i].zone_end) {
- if (i)
- *sectorp = sector - z[i-1].zone_end;
- return z + i;
- }
- BUG();
-}
-
-/*
- * remaps the bio to the target device. we separate two flows.
- * power 2 flow and a general flow for the sake of perfromance
-*/
-static struct md_rdev *map_sector(struct mddev *mddev, struct strip_zone *zone,
- sector_t sector, sector_t *sector_offset)
-{
- unsigned int sect_in_chunk;
- sector_t chunk;
- struct r0conf *conf = mddev->private;
- int raid_disks = conf->strip_zone[0].nb_dev;
- unsigned int chunk_sects = mddev->chunk_sectors;
-
- if (is_power_of_2(chunk_sects)) {
- int chunksect_bits = ffz(~chunk_sects);
- /* find the sector offset inside the chunk */
- sect_in_chunk = sector & (chunk_sects - 1);
- sector >>= chunksect_bits;
- /* chunk in zone */
- chunk = *sector_offset;
- /* quotient is the chunk in real device*/
- sector_div(chunk, zone->nb_dev << chunksect_bits);
- } else{
- sect_in_chunk = sector_div(sector, chunk_sects);
- chunk = *sector_offset;
- sector_div(chunk, chunk_sects * zone->nb_dev);
- }
- /*
- * position the bio over the real device
- * real sector = chunk in device + starting of zone
- * + the position in the chunk
- */
- *sector_offset = (chunk * chunk_sects) + sect_in_chunk;
- return conf->devlist[(zone - conf->strip_zone)*raid_disks
- + sector_div(sector, zone->nb_dev)];
-}
-
/*
* Is io distribute over 1 or more chunks ?
*/
@@ -505,7 +521,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
}
sector_offset = bio->bi_sector;
- zone = find_zone(mddev->private, &sector_offset);
+ zone = find_zone(mddev->private, &sector_offset);
tmp_dev = map_sector(mddev, zone, bio->bi_sector,
&sector_offset);
bio->bi_bdev = tmp_dev->bdev;
diff --git a/drivers/md/raid0.h b/drivers/md/raid0.h
index 0884bba..05539d9 100644
--- a/drivers/md/raid0.h
+++ b/drivers/md/raid0.h
@@ -4,13 +4,16 @@
struct strip_zone {
sector_t zone_end; /* Start of the next zone (in sectors) */
sector_t dev_start; /* Zone offset in real dev (in sectors) */
- int nb_dev; /* # of devices attached to the zone */
+ int nb_dev; /* # of devices attached to the zone */
};
struct r0conf {
- struct strip_zone *strip_zone;
- struct md_rdev **devlist; /* lists of rdevs, pointed to by strip_zone->dev */
- int nr_strip_zones;
+ struct strip_zone *strip_zone;
+ struct md_rdev **devlist; /* lists of rdevs, pointed to
+ * by strip_zone->dev */
+ int nr_strip_zones;
+ int has_merge_bvec; /* at least one member has
+ * a merge_bvec_fn */
};
#endif
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 13/23] md/raid10: handle merge_bvec_fn in member devices.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (7 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 11/23] md: tidy up rdev_for_each usage NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 07/23] md: allow re-add to failed arrays NeilBrown
` (13 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.
So enhance the raid10 merge_bvec_fn to check that function in children
as well.
This introduces a small problem. There is no locking around calls
to ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.
Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.
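For illustration only, a minimal userspace sketch of that ordering (quiesce_readers() stands in for synchronize_sched() plus the raise_barrier()/lower_barrier() pair used in the patch; the structure and names are simplified stand-ins, the real code is in the diff below):
#include <stdbool.h>
#include <stdio.h>

struct toy_rdev {
	bool unmerged;	/* visible to merge decisions, skipped for real I/O */
	bool in_array;
};

static void quiesce_readers(void)
{
	/* placeholder: wait until every request whose merging was decided
	 * against the old device set has been submitted and completed */
}

static void hot_add(struct toy_rdev *rdev, bool has_merge_fn)
{
	if (has_merge_fn)
		rdev->unmerged = true;	/* I/O paths must skip it for now */

	rdev->in_array = true;		/* mergeable_bvec now consults it */

	if (rdev->unmerged) {
		quiesce_readers();	/* old merge decisions have drained */
		rdev->unmerged = false;	/* safe to send it real I/O */
	}
}

int main(void)
{
	struct toy_rdev r = { false, false };

	hot_add(&r, true);
	printf("unmerged=%d in_array=%d\n", r.unmerged, r.in_array);
	return 0;
}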
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 1
drivers/md/md.h | 8 +++
drivers/md/raid10.c | 122 ++++++++++++++++++++++++++++++++++-----------------
3 files changed, 90 insertions(+), 41 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 84796a5..3e8fcab 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5073,6 +5073,7 @@ static void md_clean(struct mddev *mddev)
mddev->changed = 0;
mddev->degraded = 0;
mddev->safemode = 0;
+ mddev->merge_check_needed = 0;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.default_offset = 0;
mddev->bitmap_info.chunksize = 0;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a8d91fe..c797f5f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -128,6 +128,10 @@ struct md_rdev {
enum flag_bits {
Faulty, /* device is known to have a fault */
In_sync, /* device is in_sync with rest of array */
+ Unmerged, /* device is being added to array and should
+ * be considerred for bvec_merge_fn but not
+ * yet for actual IO
+ */
WriteMostly, /* Avoid reading if at all possible */
AutoDetected, /* added by auto-detect */
Blocked, /* An error occurred but has not yet
@@ -345,6 +349,10 @@ struct mddev {
int degraded; /* whether md should consider
* adding a spare
*/
+ int merge_check_needed; /* at least one
+ * member device
+ * has a
+ * merge_bvec_fn */
atomic_t recovery_active; /* blocks scheduled, but not written */
wait_queue_head_t recovery_wait;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9838f08..7b3346a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -586,25 +586,68 @@ static sector_t raid10_find_virt(struct r10conf *conf, sector_t sector, int dev)
* @biovec: the request that could be merged to it.
*
* Return amount of bytes we can accept at this offset
- * If near_copies == raid_disk, there are no striping issues,
- * but in that case, the function isn't called at all.
+ * This requires checking for end-of-chunk if near_copies != raid_disks,
+ * and for subordinate merge_bvec_fns if merge_check_needed.
*/
static int raid10_mergeable_bvec(struct request_queue *q,
struct bvec_merge_data *bvm,
struct bio_vec *biovec)
{
struct mddev *mddev = q->queuedata;
+ struct r10conf *conf = mddev->private;
sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
int max;
unsigned int chunk_sectors = mddev->chunk_sectors;
unsigned int bio_sectors = bvm->bi_size >> 9;
- max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
- if (max < 0) max = 0; /* bio_add cannot handle a negative return */
- if (max <= biovec->bv_len && bio_sectors == 0)
- return biovec->bv_len;
- else
- return max;
+ if (conf->near_copies < conf->raid_disks) {
+ max = (chunk_sectors - ((sector & (chunk_sectors - 1))
+ + bio_sectors)) << 9;
+ if (max < 0)
+ /* bio_add cannot handle a negative return */
+ max = 0;
+ if (max <= biovec->bv_len && bio_sectors == 0)
+ return biovec->bv_len;
+ } else
+ max = biovec->bv_len;
+
+ if (mddev->merge_check_needed) {
+ struct r10bio r10_bio;
+ int s;
+ r10_bio.sector = sector;
+ raid10_find_phys(conf, &r10_bio);
+ rcu_read_lock();
+ for (s = 0; s < conf->copies; s++) {
+ int disk = r10_bio.devs[s].devnum;
+ struct md_rdev *rdev = rcu_dereference(
+ conf->mirrors[disk].rdev);
+ if (rdev && !test_bit(Faulty, &rdev->flags)) {
+ struct request_queue *q =
+ bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn) {
+ bvm->bi_sector = r10_bio.devs[s].addr
+ + rdev->data_offset;
+ bvm->bi_bdev = rdev->bdev;
+ max = min(max, q->merge_bvec_fn(
+ q, bvm, biovec));
+ }
+ }
+ rdev = rcu_dereference(conf->mirrors[disk].replacement);
+ if (rdev && !test_bit(Faulty, &rdev->flags)) {
+ struct request_queue *q =
+ bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn) {
+ bvm->bi_sector = r10_bio.devs[s].addr
+ + rdev->data_offset;
+ bvm->bi_bdev = rdev->bdev;
+ max = min(max, q->merge_bvec_fn(
+ q, bvm, biovec));
+ }
+ }
+ }
+ rcu_read_unlock();
+ }
+ return max;
}
/*
@@ -668,11 +711,12 @@ retry:
disk = r10_bio->devs[slot].devnum;
rdev = rcu_dereference(conf->mirrors[disk].replacement);
if (rdev == NULL || test_bit(Faulty, &rdev->flags) ||
+ test_bit(Unmerged, &rdev->flags) ||
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
rdev = rcu_dereference(conf->mirrors[disk].rdev);
- if (rdev == NULL)
- continue;
- if (test_bit(Faulty, &rdev->flags))
+ if (rdev == NULL ||
+ test_bit(Faulty, &rdev->flags) ||
+ test_bit(Unmerged, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
r10_bio->devs[slot].addr + sectors > rdev->recovery_offset)
@@ -1134,12 +1178,14 @@ retry_write:
blocked_rdev = rrdev;
break;
}
- if (rrdev && test_bit(Faulty, &rrdev->flags))
+ if (rrdev && (test_bit(Faulty, &rrdev->flags)
+ || test_bit(Unmerged, &rrdev->flags)))
rrdev = NULL;
r10_bio->devs[i].bio = NULL;
r10_bio->devs[i].repl_bio = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags)) {
+ if (!rdev || test_bit(Faulty, &rdev->flags) ||
+ test_bit(Unmerged, &rdev->flags)) {
set_bit(R10BIO_Degraded, &r10_bio->state);
continue;
}
@@ -1491,6 +1537,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
int mirror;
int first = 0;
int last = conf->raid_disks - 1;
+ struct request_queue *q = bdev_get_queue(rdev->bdev);
if (mddev->recovery_cp < MaxSector)
/* only hot-add to in-sync arrays, as recovery is
@@ -1503,6 +1550,11 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
+ if (q->merge_bvec_fn) {
+ set_bit(Unmerged, &rdev->flags);
+ mddev->merge_check_needed = 1;
+ }
+
if (rdev->saved_raid_disk >= first &&
conf->mirrors[rdev->saved_raid_disk].rdev == NULL)
mirror = rdev->saved_raid_disk;
@@ -1522,11 +1574,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
err = 0;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
conf->fullsync = 1;
rcu_assign_pointer(p->replacement, rdev);
break;
@@ -1534,17 +1581,6 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must
- * never risk violating it, so limit
- * ->max_segments to one lying with a single
- * page, as a one page request is never in
- * violation.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
p->head_position = 0;
p->recovery_disabled = mddev->recovery_disabled - 1;
@@ -1555,7 +1591,19 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
rcu_assign_pointer(p->rdev, rdev);
break;
}
-
+ if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
+ /* Some requests might not have seen this new
+ * merge_bvec_fn. We must wait for them to complete
+ * before merging the device fully.
+ * First we make sure any code which has tested
+ * our function has submitted the request, then
+ * we wait for all outstanding requests to complete.
+ */
+ synchronize_sched();
+ raise_barrier(conf, 0);
+ lower_barrier(conf);
+ clear_bit(Unmerged, &rdev->flags);
+ }
md_integrity_add_rdev(rdev, mddev);
print_conf(conf);
return err;
@@ -2099,6 +2147,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (rdev &&
+ !test_bit(Unmerged, &rdev->flags) &&
test_bit(In_sync, &rdev->flags) &&
is_badblock(rdev, r10_bio->devs[sl].addr + sect, s,
&first_bad, &bad_sectors) == 0) {
@@ -2152,6 +2201,7 @@ static void fix_read_error(struct r10conf *conf, struct mddev *mddev, struct r10
d = r10_bio->devs[sl].devnum;
rdev = rcu_dereference(conf->mirrors[d].rdev);
if (!rdev ||
+ test_bit(Unmerged, &rdev->flags) ||
!test_bit(In_sync, &rdev->flags))
continue;
@@ -3274,15 +3324,6 @@ static int run(struct mddev *mddev)
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit max_segments to 1 lying
- * within a single page.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
disk->head_position = 0;
}
@@ -3346,8 +3387,7 @@ static int run(struct mddev *mddev)
mddev->queue->backing_dev_info.ra_pages = 2* stripe;
}
- if (conf->near_copies < conf->raid_disks)
- blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
+ blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
if (md_integrity_register(mddev))
goto out_free_conf;
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 14/23] md/raid1: handle merge_bvec_fn in member devices.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (9 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 07/23] md: allow re-add to failed arrays NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 12/23] md: add proper merge_bvec handling to RAID0 and Linear NeilBrown
` (11 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.
So create a raid1 merge_bvec_fn to check that function in children
as well.
This introduces a small problem. There is no locking around calls
to ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.
Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid1.c | 77 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 56 insertions(+), 21 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c0d3ffb..fa4d840 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -523,6 +523,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
|| rdev == NULL
+ || test_bit(Unmerged, &rdev->flags)
|| test_bit(Faulty, &rdev->flags))
continue;
if (!test_bit(In_sync, &rdev->flags) &&
@@ -614,6 +615,39 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
return best_disk;
}
+static int raid1_mergeable_bvec(struct request_queue *q,
+ struct bvec_merge_data *bvm,
+ struct bio_vec *biovec)
+{
+ struct mddev *mddev = q->queuedata;
+ struct r1conf *conf = mddev->private;
+ sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
+ int max = biovec->bv_len;
+
+ if (mddev->merge_check_needed) {
+ int disk;
+ rcu_read_lock();
+ for (disk = 0; disk < conf->raid_disks * 2; disk++) {
+ struct md_rdev *rdev = rcu_dereference(
+ conf->mirrors[disk].rdev);
+ if (rdev && !test_bit(Faulty, &rdev->flags)) {
+ struct request_queue *q =
+ bdev_get_queue(rdev->bdev);
+ if (q->merge_bvec_fn) {
+ bvm->bi_sector = sector +
+ rdev->data_offset;
+ bvm->bi_bdev = rdev->bdev;
+ max = min(max, q->merge_bvec_fn(
+ q, bvm, biovec));
+ }
+ }
+ }
+ rcu_read_unlock();
+ }
+ return max;
+
+}
+
int md_raid1_congested(struct mddev *mddev, int bits)
{
struct r1conf *conf = mddev->private;
@@ -1015,7 +1049,8 @@ read_again:
break;
}
r1_bio->bios[i] = NULL;
- if (!rdev || test_bit(Faulty, &rdev->flags)) {
+ if (!rdev || test_bit(Faulty, &rdev->flags)
+ || test_bit(Unmerged, &rdev->flags)) {
if (i < conf->raid_disks)
set_bit(R1BIO_Degraded, &r1_bio->state);
continue;
@@ -1336,6 +1371,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
struct mirror_info *p;
int first = 0;
int last = conf->raid_disks - 1;
+ struct request_queue *q = bdev_get_queue(rdev->bdev);
if (mddev->recovery_disabled == conf->recovery_disabled)
return -EBUSY;
@@ -1343,23 +1379,17 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
if (rdev->raid_disk >= 0)
first = last = rdev->raid_disk;
+ if (q->merge_bvec_fn) {
+ set_bit(Unmerged, &rdev->flags);
+ mddev->merge_check_needed = 1;
+ }
+
for (mirror = first; mirror <= last; mirror++) {
p = conf->mirrors+mirror;
if (!p->rdev) {
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must
- * never risk violating it, so limit
- * ->max_segments to one lying with a single
- * page, as a one page request is never in
- * violation.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
p->head_position = 0;
rdev->raid_disk = mirror;
@@ -1384,6 +1414,19 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
break;
}
}
+ if (err == 0 && test_bit(Unmerged, &rdev->flags)) {
+ /* Some requests might not have seen this new
+ * merge_bvec_fn. We must wait for them to complete
+ * before merging the device fully.
+ * First we make sure any code which has tested
+ * our function has submitted the request, then
+ * we wait for all outstanding requests to complete.
+ */
+ synchronize_sched();
+ raise_barrier(conf);
+ lower_barrier(conf);
+ clear_bit(Unmerged, &rdev->flags);
+ }
md_integrity_add_rdev(rdev, mddev);
print_conf(conf);
return err;
@@ -2628,15 +2671,6 @@ static int run(struct mddev *mddev)
continue;
disk_stack_limits(mddev->gendisk, rdev->bdev,
rdev->data_offset << 9);
- /* as we don't honour merge_bvec_fn, we must never risk
- * violating it, so limit ->max_segments to 1 lying within
- * a single page, as a one page request is never in violation.
- */
- if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
- blk_queue_max_segments(mddev->queue, 1);
- blk_queue_segment_boundary(mddev->queue,
- PAGE_CACHE_SIZE - 1);
- }
}
mddev->degraded = 0;
@@ -2670,6 +2704,7 @@ static int run(struct mddev *mddev)
if (mddev->queue) {
mddev->queue->backing_dev_info.congested_fn = raid1_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
+ blk_queue_merge_bvec(mddev->queue, raid1_mergeable_bvec);
}
return md_integrity_register(mddev);
}
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (16 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 22/23] md: fix clearing of the 'changed' flags for the bad blocks list NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 6:17 ` keld
2012-03-14 4:40 ` [md PATCH 17/23] md/bitmap: move printing of bitmap status to bitmap.c NeilBrown
` (4 subsequent siblings)
22 siblings, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space on the
component devices.
This is not supported for arrays with a 'far' layout. However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.
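To make the rounding concrete, here is an illustrative userspace calculation along the lines of the comment added in the patch (the figures are invented and gcd/lcm are local helpers; the authoritative computation lives in raid10_size()):
#include <stdio.h>

typedef unsigned long long sector_t;

static sector_t gcd(sector_t a, sector_t b)
{
	while (b) {
		sector_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

static sector_t lcm(sector_t a, sector_t b)
{
	return a / gcd(a, b) * b;
}

int main(void)
{
	sector_t chunk_sectors = 1024;		/* 512KB chunks */
	sector_t raid_disks = 4, near_copies = 2, far_copies = 1;
	sector_t dev_sectors = 1953525168;	/* space now available per device */

	/* multiple suggested for a 'near' layout; for 'offset' it would be
	 * far_copies * chunk_sectors */
	sector_t multiple = lcm(raid_disks, near_copies) * far_copies
			    * chunk_sectors;
	sector_t rounded = dev_sectors - (dev_sectors % multiple);

	printf("rounding multiple: %llu sectors\n", multiple);
	printf("usable per device: %llu of %llu sectors\n", rounded, dev_sectors);
	return 0;
}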
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/raid10.c | 38 ++++++++++++++++++++++++++++++++++++++
1 files changed, 38 insertions(+), 0 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7b3346a..2f7665c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3437,6 +3437,43 @@ static void raid10_quiesce(struct mddev *mddev, int state)
}
}
+static int raid10_resize(struct mddev *mddev, sector_t sectors)
+{
+ /* Resize of 'far' arrays is not supported.
+ * For 'near' and 'offset' arrays we can set the
+ * number of sectors used to be an appropriate multiple
+ * of the chunk size.
+ * For 'offset', this is far_copies*chunksize.
+ * For 'near' the multiplier is the LCM of
+ * near_copies and raid_disks.
+ * So if far_copies > 1 && !far_offset, fail.
+ * Else find LCM(raid_disks, near_copy)*far_copies and
+ * multiply by chunk_size. Then round to this number.
+ * This is mostly done by raid10_size()
+ */
+ struct r10conf *conf = mddev->private;
+ sector_t oldsize, size;
+
+ if (conf->far_copies > 1 && !conf->far_offset)
+ return -EINVAL;
+
+ oldsize = raid10_size(mddev, 0, 0);
+ size = raid10_size(mddev, sectors, 0);
+ md_set_array_sectors(mddev, size);
+ if (mddev->array_sectors > size)
+ return -EINVAL;
+ set_capacity(mddev->gendisk, mddev->array_sectors);
+ revalidate_disk(mddev->gendisk);
+ if (sectors > mddev->dev_sectors &&
+ mddev->recovery_cp > oldsize) {
+ mddev->recovery_cp = oldsize;
+ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+ }
+ mddev->dev_sectors = sectors;
+ mddev->resync_max_sectors = size;
+ return 0;
+}
+
static void *raid10_takeover_raid0(struct mddev *mddev)
{
struct md_rdev *rdev;
@@ -3506,6 +3543,7 @@ static struct md_personality raid10_personality =
.sync_request = sync_request,
.quiesce = raid10_quiesce,
.size = raid10_size,
+ .resize = raid10_resize,
.takeover = raid10_takeover,
};
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 16/23] md/bitmap: remove some unused noise from bitmap.h
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (14 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 20/23] md/bitmap: remove unnecessary indirection when allocating NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 22/23] md: fix clearing of the 'changed' flags for the bad blocks list NeilBrown
` (6 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.h | 18 ------------------
1 files changed, 0 insertions(+), 18 deletions(-)
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index a15436d..557e3e8 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -13,8 +13,6 @@
#define BITMAP_MAJOR_HI 4
#define BITMAP_MAJOR_HOSTENDIAN 3
-#define BITMAP_MINOR 39
-
/*
* in-memory bitmap:
*
@@ -101,21 +99,11 @@ typedef __u16 bitmap_counter_t;
/* same, except a mask value for more efficient bitops */
#define PAGE_COUNTER_MASK (PAGE_COUNTER_RATIO - 1)
-#define BITMAP_BLOCK_SIZE 512
#define BITMAP_BLOCK_SHIFT 9
/* how many blocks per chunk? (this is variable) */
#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->mddev->bitmap_info.chunksize >> BITMAP_BLOCK_SHIFT)
#define CHUNK_BLOCK_SHIFT(bitmap) ((bitmap)->chunkshift - BITMAP_BLOCK_SHIFT)
-#define CHUNK_BLOCK_MASK(bitmap) (CHUNK_BLOCK_RATIO(bitmap) - 1)
-
-/* when hijacked, the counters and bits represent even larger "chunks" */
-/* there will be 1024 chunks represented by each counter in the page pointers */
-#define PAGEPTR_BLOCK_RATIO(bitmap) \
- (CHUNK_BLOCK_RATIO(bitmap) << PAGE_COUNTER_SHIFT >> 1)
-#define PAGEPTR_BLOCK_SHIFT(bitmap) \
- (CHUNK_BLOCK_SHIFT(bitmap) + PAGE_COUNTER_SHIFT - 1)
-#define PAGEPTR_BLOCK_MASK(bitmap) (PAGEPTR_BLOCK_RATIO(bitmap) - 1)
#endif
@@ -181,12 +169,6 @@ struct bitmap_page {
unsigned int count:31;
};
-/* keep track of bitmap file pages that have pending writes on them */
-struct page_list {
- struct list_head list;
- struct page *page;
-};
-
/* the main bitmap structure - one per mddev */
struct bitmap {
struct bitmap_page *bp;
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 17/23] md/bitmap: move printing of bitmap status to bitmap.c
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (17 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 21/23] md/bitmap: discard CHUNK_BLOCK_SHIFT macro NeilBrown
` (3 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c. So move it there.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 28 ++++++++++++++++++++++++++++
drivers/md/bitmap.h | 1 +
drivers/md/md.c | 23 +----------------------
3 files changed, 30 insertions(+), 22 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 2c5dbc6..04df18e 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -26,6 +26,7 @@
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/buffer_head.h>
+#include <linux/seq_file.h>
#include "md.h"
#include "bitmap.h"
@@ -1836,6 +1837,33 @@ out:
}
EXPORT_SYMBOL_GPL(bitmap_load);
+void bitmap_status(struct seq_file *seq, struct bitmap *bitmap)
+{
+ unsigned long chunk_kb;
+ unsigned long flags;
+
+ if (!bitmap)
+ return;
+
+ spin_lock_irqsave(&bitmap->lock, flags);
+ chunk_kb = bitmap->mddev->bitmap_info.chunksize >> 10;
+ seq_printf(seq, "bitmap: %lu/%lu pages [%luKB], "
+ "%lu%s chunk",
+ bitmap->pages - bitmap->missing_pages,
+ bitmap->pages,
+ (bitmap->pages - bitmap->missing_pages)
+ << (PAGE_SHIFT - 10),
+ chunk_kb ? chunk_kb : bitmap->mddev->bitmap_info.chunksize,
+ chunk_kb ? "KB" : "B");
+ if (bitmap->file) {
+ seq_printf(seq, ", file: ");
+ seq_path(seq, &bitmap->file->f_path, " \t\n");
+ }
+
+ seq_printf(seq, "\n");
+ spin_unlock_irqrestore(&bitmap->lock, flags);
+}
+
static ssize_t
location_show(struct mddev *mddev, char *page)
{
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 557e3e8..e196e6a5 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -227,6 +227,7 @@ void bitmap_destroy(struct mddev *mddev);
void bitmap_print_sb(struct bitmap *bitmap);
void bitmap_update_sb(struct bitmap *bitmap);
+void bitmap_status(struct seq_file *seq, struct bitmap *bitmap);
int bitmap_setallbits(struct bitmap *bitmap);
void bitmap_write_all(struct bitmap *bitmap);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 3e8fcab..f873e71 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6725,7 +6725,6 @@ static int md_seq_show(struct seq_file *seq, void *v)
struct mddev *mddev = v;
sector_t sectors;
struct md_rdev *rdev;
- struct bitmap *bitmap;
if (v == (void*)1) {
struct md_personality *pers;
@@ -6813,27 +6812,7 @@ static int md_seq_show(struct seq_file *seq, void *v)
} else
seq_printf(seq, "\n ");
- if ((bitmap = mddev->bitmap)) {
- unsigned long chunk_kb;
- unsigned long flags;
- spin_lock_irqsave(&bitmap->lock, flags);
- chunk_kb = mddev->bitmap_info.chunksize >> 10;
- seq_printf(seq, "bitmap: %lu/%lu pages [%luKB], "
- "%lu%s chunk",
- bitmap->pages - bitmap->missing_pages,
- bitmap->pages,
- (bitmap->pages - bitmap->missing_pages)
- << (PAGE_SHIFT - 10),
- chunk_kb ? chunk_kb : mddev->bitmap_info.chunksize,
- chunk_kb ? "KB" : "B");
- if (bitmap->file) {
- seq_printf(seq, ", file: ");
- seq_path(seq, &bitmap->file->f_path, " \t\n");
- }
-
- seq_printf(seq, "\n");
- spin_unlock_irqrestore(&bitmap->lock, flags);
- }
+ bitmap_status(seq, mddev->bitmap);
seq_printf(seq, "\n");
}
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 18/23] md/bitmap: change a 'goto' to a normal 'if' construct.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (20 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 19/23] md/bitmap: remove some pointless locking NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 23/23] md: Add judgement bb->unacked_exist in function md_ack_all_badblocks() NeilBrown
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
The use of a goto makes the control flow more obscure here.
So make it a normal:
if (x) {
Y;
}
No functional change.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 04df18e..fcf3c94 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -633,26 +633,28 @@ static int bitmap_read_sb(struct bitmap *bitmap)
/* keep the array size field of the bitmap superblock up to date */
sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
- if (!bitmap->mddev->persistent)
- goto success;
-
- /*
- * if we have a persistent array superblock, compare the
- * bitmap's UUID and event counter to the mddev's
- */
- if (memcmp(sb->uuid, bitmap->mddev->uuid, 16)) {
- printk(KERN_INFO "%s: bitmap superblock UUID mismatch\n",
- bmname(bitmap));
- goto out;
- }
- events = le64_to_cpu(sb->events);
- if (events < bitmap->mddev->events) {
- printk(KERN_INFO "%s: bitmap file is out of date (%llu < %llu) "
- "-- forcing full recovery\n", bmname(bitmap), events,
- (unsigned long long) bitmap->mddev->events);
- sb->state |= cpu_to_le32(BITMAP_STALE);
+ if (bitmap->mddev->persistent) {
+ /*
+ * We have a persistent array superblock, so compare the
+ * bitmap's UUID and event counter to the mddev's
+ */
+ if (memcmp(sb->uuid, bitmap->mddev->uuid, 16)) {
+ printk(KERN_INFO
+ "%s: bitmap superblock UUID mismatch\n",
+ bmname(bitmap));
+ goto out;
+ }
+ events = le64_to_cpu(sb->events);
+ if (events < bitmap->mddev->events) {
+ printk(KERN_INFO
+ "%s: bitmap file is out of date (%llu < %llu) "
+ "-- forcing full recovery\n",
+ bmname(bitmap), events,
+ (unsigned long long) bitmap->mddev->events);
+ sb->state |= cpu_to_le32(BITMAP_STALE);
+ }
}
-success:
+
/* assign fields using values from superblock */
bitmap->mddev->bitmap_info.chunksize = chunksize;
bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 19/23] md/bitmap: remove some pointless locking.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (19 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 21/23] md/bitmap: discard CHUNK_BLOCK_SHIFT macro NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 18/23] md/bitmap: change a 'goto' to a normal 'if' construct NeilBrown
2012-03-14 4:40 ` [md PATCH 23/23] md: Add judgement bb->unacked_exist in function md_ack_all_badblocks() NeilBrown
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
There is nothing gained by holding a lock while we check if a pointer
is NULL or not. If there could be a race, then it could become NULL
immediately after the unlock - but there is no race here.
So just remove the locking.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 14 ++------------
1 files changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index fcf3c94..e12b515 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -446,18 +446,13 @@ out:
void bitmap_update_sb(struct bitmap *bitmap)
{
bitmap_super_t *sb;
- unsigned long flags;
if (!bitmap || !bitmap->mddev) /* no bitmap for this array */
return;
if (bitmap->mddev->bitmap_info.external)
return;
- spin_lock_irqsave(&bitmap->lock, flags);
- if (!bitmap->sb_page) { /* no superblock */
- spin_unlock_irqrestore(&bitmap->lock, flags);
+ if (!bitmap->sb_page) /* no superblock */
return;
- }
- spin_unlock_irqrestore(&bitmap->lock, flags);
sb = kmap_atomic(bitmap->sb_page, KM_USER0);
sb->events = cpu_to_le64(bitmap->mddev->events);
if (bitmap->mddev->events < bitmap->events_cleared)
@@ -683,15 +678,10 @@ static int bitmap_mask_state(struct bitmap *bitmap, enum bitmap_state bits,
enum bitmap_mask_op op)
{
bitmap_super_t *sb;
- unsigned long flags;
int old;
- spin_lock_irqsave(&bitmap->lock, flags);
- if (!bitmap->sb_page) { /* can't set the state */
- spin_unlock_irqrestore(&bitmap->lock, flags);
+ if (!bitmap->sb_page) /* can't set the state */
return 0;
- }
- spin_unlock_irqrestore(&bitmap->lock, flags);
sb = kmap_atomic(bitmap->sb_page, KM_USER0);
old = le32_to_cpu(sb->state) & bits;
switch (op) {
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 20/23] md/bitmap: remove unnecessary indirection when allocating.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (13 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 08/23] md: don't set md arrays to readonly on shutdown NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 16/23] md/bitmap: remove some unused noise from bitmap.h NeilBrown
` (7 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
These functions don't add anything useful except possibly the trace
points, and I don't think they are worth the extra indirection.
So remove them.
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 31 +++----------------------------
1 files changed, 3 insertions(+), 28 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index e12b515..534e007 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -36,31 +36,6 @@ static inline char *bmname(struct bitmap *bitmap)
}
/*
- * just a placeholder - calls kmalloc for bitmap pages
- */
-static unsigned char *bitmap_alloc_page(struct bitmap *bitmap)
-{
- unsigned char *page;
-
- page = kzalloc(PAGE_SIZE, GFP_NOIO);
- if (!page)
- printk("%s: bitmap_alloc_page FAILED\n", bmname(bitmap));
- else
- pr_debug("%s: bitmap_alloc_page: allocated page at %p\n",
- bmname(bitmap), page);
- return page;
-}
-
-/*
- * for now just a placeholder -- just calls kfree for bitmap pages
- */
-static void bitmap_free_page(struct bitmap *bitmap, unsigned char *page)
-{
- pr_debug("%s: bitmap_free_page: free page %p\n", bmname(bitmap), page);
- kfree(page);
-}
-
-/*
* check a page and, if necessary, allocate it (or hijack it if the alloc fails)
*
* 1) check to see if this page is allocated, if it's not then try to alloc
@@ -97,7 +72,7 @@ __acquires(bitmap->lock)
/* this page has not been allocated yet */
spin_unlock_irq(&bitmap->lock);
- mappage = bitmap_alloc_page(bitmap);
+ mappage = kzalloc(PAGE_SIZE, GFP_NOIO);
spin_lock_irq(&bitmap->lock);
if (mappage == NULL) {
@@ -110,7 +85,7 @@ __acquires(bitmap->lock)
} else if (bitmap->bp[page].map ||
bitmap->bp[page].hijacked) {
/* somebody beat us to getting the page */
- bitmap_free_page(bitmap, mappage);
+ kfree(mappage);
return 0;
} else {
@@ -142,7 +117,7 @@ static void bitmap_checkfree(struct bitmap *bitmap, unsigned long page)
ptr = bitmap->bp[page].map;
bitmap->bp[page].map = NULL;
bitmap->missing_pages++;
- bitmap_free_page(bitmap, ptr);
+ kfree(ptr);
}
}
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 21/23] md/bitmap: discard CHUNK_BLOCK_SHIFT macro
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (18 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 17/23] md/bitmap: move printing of bitmap status to bitmap.c NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 19/23] md/bitmap: remove some pointless locking NeilBrown
` (2 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
By redefining ->chunkshift as the shift from sectors to chunks rather
than bytes to chunks, we can just use "bitmap->chunkshift" which is
shorter than the macro call, and less indirect.
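A small standalone illustration of why the two definitions agree (userspace only; trailing_zeros() models the kernel's ffz(~x)):
#include <stdio.h>

#define BITMAP_BLOCK_SHIFT 9	/* 512-byte blocks */

/* models ffz(~x): index of the lowest set bit of x (x must be non-zero) */
static unsigned int trailing_zeros(unsigned long x)
{
	unsigned int n = 0;
	while (!(x & 1)) {
		x >>= 1;
		n++;
	}
	return n;
}

int main(void)
{
	unsigned long chunksize = 64 * 1024;	/* bytes per bitmap chunk */
	unsigned long long block = 123456789;	/* offset in 512-byte blocks */

	/* old definition: shift from bytes to chunks */
	unsigned int old_shift = trailing_zeros(chunksize);
	/* new definition: shift from 512-byte blocks to chunks */
	unsigned int new_shift = trailing_zeros(chunksize) - BITMAP_BLOCK_SHIFT;

	/* the old code subtracted BITMAP_BLOCK_SHIFT at every use (the macro);
	 * the new code can shift directly */
	printf("chunk via old definition: %llu\n",
	       block >> (old_shift - BITMAP_BLOCK_SHIFT));
	printf("chunk via new definition: %llu\n", block >> new_shift);
	return 0;
}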
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/bitmap.c | 35 ++++++++++++++++++-----------------
drivers/md/bitmap.h | 3 +--
2 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 534e007..cf5863c 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -838,7 +838,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
unsigned long bit;
struct page *page;
void *kaddr;
- unsigned long chunk = block >> CHUNK_BLOCK_SHIFT(bitmap);
+ unsigned long chunk = block >> bitmap->chunkshift;
if (!bitmap->filemap)
return;
@@ -1037,10 +1037,10 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
kunmap_atomic(paddr, KM_USER0);
if (b) {
/* if the disk bit is set, set the memory bit */
- int needed = ((sector_t)(i+1) << (CHUNK_BLOCK_SHIFT(bitmap))
+ int needed = ((sector_t)(i+1) << bitmap->chunkshift
>= start);
bitmap_set_memory_bits(bitmap,
- (sector_t)i << CHUNK_BLOCK_SHIFT(bitmap),
+ (sector_t)i << bitmap->chunkshift,
needed);
bit_cnt++;
}
@@ -1084,7 +1084,7 @@ void bitmap_write_all(struct bitmap *bitmap)
static void bitmap_count_page(struct bitmap *bitmap, sector_t offset, int inc)
{
- sector_t chunk = offset >> CHUNK_BLOCK_SHIFT(bitmap);
+ sector_t chunk = offset >> bitmap->chunkshift;
unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
bitmap->bp[page].count += inc;
bitmap_checkfree(bitmap, page);
@@ -1190,7 +1190,7 @@ void bitmap_daemon_work(struct mddev *mddev)
bitmap->allclean = 0;
}
bmc = bitmap_get_counter(bitmap,
- (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
+ (sector_t)j << bitmap->chunkshift,
&blocks, 0);
if (!bmc)
j |= PAGE_COUNTER_MASK;
@@ -1199,7 +1199,7 @@ void bitmap_daemon_work(struct mddev *mddev)
/* we can clear the bit */
*bmc = 0;
bitmap_count_page(bitmap,
- (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
+ (sector_t)j << bitmap->chunkshift,
-1);
/* clear the bit */
@@ -1253,7 +1253,7 @@ __acquires(bitmap->lock)
* The lock must have been taken with interrupts enabled.
* If !create, we don't release the lock.
*/
- sector_t chunk = offset >> CHUNK_BLOCK_SHIFT(bitmap);
+ sector_t chunk = offset >> bitmap->chunkshift;
unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
unsigned long pageoff = (chunk & PAGE_COUNTER_MASK) << COUNTER_BYTE_SHIFT;
sector_t csize;
@@ -1263,10 +1263,10 @@ __acquires(bitmap->lock)
if (bitmap->bp[page].hijacked ||
bitmap->bp[page].map == NULL)
- csize = ((sector_t)1) << (CHUNK_BLOCK_SHIFT(bitmap) +
+ csize = ((sector_t)1) << (bitmap->chunkshift +
PAGE_COUNTER_SHIFT - 1);
else
- csize = ((sector_t)1) << (CHUNK_BLOCK_SHIFT(bitmap));
+ csize = ((sector_t)1) << bitmap->chunkshift;
*blocks = csize - (offset & (csize - 1));
if (err < 0)
@@ -1392,7 +1392,7 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
set_page_attr(bitmap,
filemap_get_page(
bitmap,
- offset >> CHUNK_BLOCK_SHIFT(bitmap)),
+ offset >> bitmap->chunkshift),
BITMAP_PAGE_PENDING);
bitmap->allclean = 0;
}
@@ -1480,7 +1480,7 @@ void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks, i
else {
if (*bmc <= 2) {
set_page_attr(bitmap,
- filemap_get_page(bitmap, offset >> CHUNK_BLOCK_SHIFT(bitmap)),
+ filemap_get_page(bitmap, offset >> bitmap->chunkshift),
BITMAP_PAGE_PENDING);
bitmap->allclean = 0;
}
@@ -1527,7 +1527,7 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
bitmap->mddev->curr_resync_completed = sector;
set_bit(MD_CHANGE_CLEAN, &bitmap->mddev->flags);
- sector &= ~((1ULL << CHUNK_BLOCK_SHIFT(bitmap)) - 1);
+ sector &= ~((1ULL << bitmap->chunkshift) - 1);
s = 0;
while (s < sector && s < bitmap->mddev->resync_max_sectors) {
bitmap_end_sync(bitmap, s, &blocks, 0);
@@ -1557,7 +1557,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n
struct page *page;
*bmc = 2 | (needed ? NEEDED_MASK : 0);
bitmap_count_page(bitmap, offset, 1);
- page = filemap_get_page(bitmap, offset >> CHUNK_BLOCK_SHIFT(bitmap));
+ page = filemap_get_page(bitmap, offset >> bitmap->chunkshift);
set_page_attr(bitmap, page, BITMAP_PAGE_PENDING);
bitmap->allclean = 0;
}
@@ -1570,7 +1570,7 @@ void bitmap_dirty_bits(struct bitmap *bitmap, unsigned long s, unsigned long e)
unsigned long chunk;
for (chunk = s; chunk <= e; chunk++) {
- sector_t sec = (sector_t)chunk << CHUNK_BLOCK_SHIFT(bitmap);
+ sector_t sec = (sector_t)chunk << bitmap->chunkshift;
bitmap_set_memory_bits(bitmap, sec, 1);
spin_lock_irq(&bitmap->lock);
bitmap_file_set_bit(bitmap, sec);
@@ -1727,11 +1727,12 @@ int bitmap_create(struct mddev *mddev)
goto error;
bitmap->daemon_lastrun = jiffies;
- bitmap->chunkshift = ffz(~mddev->bitmap_info.chunksize);
+ bitmap->chunkshift = (ffz(~mddev->bitmap_info.chunksize)
+ - BITMAP_BLOCK_SHIFT);
/* now that chunksize and chunkshift are set, we can use these macros */
- chunks = (blocks + CHUNK_BLOCK_RATIO(bitmap) - 1) >>
- CHUNK_BLOCK_SHIFT(bitmap);
+ chunks = (blocks + bitmap->chunkshift - 1) >>
+ bitmap->chunkshift;
pages = (chunks + PAGE_COUNTER_RATIO - 1) / PAGE_COUNTER_RATIO;
BUG_ON(!pages);
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index e196e6a5..55ca5ae 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -103,7 +103,6 @@ typedef __u16 bitmap_counter_t;
/* how many blocks per chunk? (this is variable) */
#define CHUNK_BLOCK_RATIO(bitmap) ((bitmap)->mddev->bitmap_info.chunksize >> BITMAP_BLOCK_SHIFT)
-#define CHUNK_BLOCK_SHIFT(bitmap) ((bitmap)->chunkshift - BITMAP_BLOCK_SHIFT)
#endif
@@ -178,7 +177,7 @@ struct bitmap {
struct mddev *mddev; /* the md device that the bitmap is for */
/* bitmap chunksize -- how much data does each bit represent? */
- unsigned long chunkshift; /* chunksize = 2^chunkshift (for bitops) */
+ unsigned long chunkshift; /* chunksize = 2^(chunkshift+9) (for bitops) */
unsigned long chunks; /* total number of data chunks for the array */
__u64 events_cleared;
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 22/23] md: fix clearing of the 'changed' flags for the bad blocks list.
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (15 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 16/23] md/bitmap: remove some unused noise from bitmap.h NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
2012-03-14 4:40 ` [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays NeilBrown
` (5 subsequent siblings)
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.
In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error. However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.
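As an illustration of the ordering argument, a toy single-threaded sketch (the seqcount, badblocks structure and helpers below are simplified stand-ins for the kernel's, with the racing writer invoked by hand):
#include <stdbool.h>
#include <stdio.h>

struct toy_badblocks {
	unsigned int seq;	/* toy seqcount: a writer bumps it twice */
	int count;		/* number of recorded bad blocks */
	bool changed;		/* true while the on-disk copy is stale */
};

static unsigned int read_begin(struct toy_badblocks *bb) { return bb->seq; }
static bool read_retry(struct toy_badblocks *bb, unsigned int s) { return bb->seq != s; }

static void writer_adds_block(struct toy_badblocks *bb)
{
	bb->seq++;		/* begin write */
	bb->count++;
	bb->changed = true;
	bb->seq++;		/* end write */
}

/* copy the list for the superblock; returns how many entries were copied */
static int sync_sb(struct toy_badblocks *bb, bool inject_race)
{
	int copied;
	unsigned int s;
retry:
	s = read_begin(bb);
	copied = bb->count;	/* "write" the current list to the superblock */
	bb->changed = false;	/* cleared BEFORE the retry check, as in the fix */
	if (inject_race) {	/* a writer sneaks in right here */
		writer_adds_block(bb);
		inject_race = false;
	}
	if (read_retry(bb, s))	/* the writer is detected, so we copy again */
		goto retry;
	return copied;
}

int main(void)
{
	struct toy_badblocks bb = { 0, 3, true };
	int copied = sync_sb(&bb, true);

	/* the late block is picked up and 'changed' ends up accurate */
	printf("copied %d blocks, changed=%d\n", copied, bb.changed);
	return 0;
}
Because the clear sits inside the retried region, a writer that slips in is either re-copied or leaves 'changed' set; clearing after the check could wipe its flag without the new block ever reaching the superblock.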
This patch is suitable for -stable release 3.0 and later.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f873e71..e9913bf 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1805,13 +1805,13 @@ retry:
| BB_LEN(internal_bb));
*bbp++ = cpu_to_le64(store_bb);
}
+ bb->changed = 0;
if (read_seqretry(&bb->lock, seq))
goto retry;
bb->sector = (rdev->sb_start +
(int)le32_to_cpu(sb->bblog_offset));
bb->size = le16_to_cpu(sb->bblog_size);
- bb->changed = 0;
}
}
@@ -2366,6 +2366,7 @@ repeat:
clear_bit(MD_CHANGE_PENDING, &mddev->flags);
rdev_for_each(rdev, mddev) {
if (rdev->badblocks.changed) {
+ rdev->badblocks.changed = 0;
md_ack_all_badblocks(&rdev->badblocks);
md_error(mddev, rdev, 0);
}
^ permalink raw reply related [flat|nested] 42+ messages in thread
* [md PATCH 23/23] md: Add judgement bb->unacked_exist in function md_ack_all_badblocks().
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
` (21 preceding siblings ...)
2012-03-14 4:40 ` [md PATCH 18/23] md/bitmap: change a 'goto' to a normal 'if' construct NeilBrown
@ 2012-03-14 4:40 ` NeilBrown
22 siblings, 0 replies; 42+ messages in thread
From: NeilBrown @ 2012-03-14 4:40 UTC (permalink / raw)
To: linux-raid
From: majianpeng <majianpeng@gmail.com>
If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
drivers/md/md.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e9913bf..3b6f4d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8021,7 +8021,7 @@ void md_ack_all_badblocks(struct badblocks *bb)
return;
write_seqlock_irq(&bb->lock);
- if (bb->changed == 0) {
+ if (bb->changed == 0 && bb->unacked_exist) {
u64 *p = bb->page;
int i;
for (i = 0; i < bb->count ; i++) {
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 4:40 ` [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays NeilBrown
@ 2012-03-14 6:17 ` keld
2012-03-14 6:27 ` NeilBrown
0 siblings, 1 reply; 42+ messages in thread
From: keld @ 2012-03-14 6:17 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hi Neil
What is the problem with adding space to the 'far' layout?
I would think you could just create the new array part 1 from the old array part 2, and
then sync the new array part 2 with the new array part 1. (in the case of a far=2 array,
for n>2 similar constructs would apply).
best regards
Keld
On Wed, Mar 14, 2012 at 03:40:41PM +1100, NeilBrown wrote:
> 'resizing' an array in this context means making use of extra
> space that has become available in component devices, not adding new
> devices.
> It also includes shrinking the array to take up less space of
> component devices.
>
> This is not supported for array with a 'far' layout. However
> for 'near' and 'offset' layout arrays, adding and removing space at
> the end of the devices is easy to support, and this patch provides
> that support.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>
> drivers/md/raid10.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 files changed, 38 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 7b3346a..2f7665c 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3437,6 +3437,43 @@ static void raid10_quiesce(struct mddev *mddev, int state)
> }
> }
>
> +static int raid10_resize(struct mddev *mddev, sector_t sectors)
> +{
> + /* Resize of 'far' arrays is not supported.
> + * For 'near' and 'offset' arrays we can set the
> + * number of sectors used to be an appropriate multiple
> + * of the chunk size.
> + * For 'offset', this is far_copies*chunksize.
> + * For 'near' the multiplier is the LCM of
> + * near_copies and raid_disks.
> + * So if far_copies > 1 && !far_offset, fail.
> + * Else find LCM(raid_disks, near_copy)*far_copies and
> + * multiply by chunk_size. Then round to this number.
> + * This is mostly done by raid10_size()
> + */
> + struct r10conf *conf = mddev->private;
> + sector_t oldsize, size;
> +
> + if (conf->far_copies > 1 && !conf->far_offset)
> + return -EINVAL;
> +
> + oldsize = raid10_size(mddev, 0, 0);
> + size = raid10_size(mddev, sectors, 0);
> + md_set_array_sectors(mddev, size);
> + if (mddev->array_sectors > size)
> + return -EINVAL;
> + set_capacity(mddev->gendisk, mddev->array_sectors);
> + revalidate_disk(mddev->gendisk);
> + if (sectors > mddev->dev_sectors &&
> + mddev->recovery_cp > oldsize) {
> + mddev->recovery_cp = oldsize;
> + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> + }
> + mddev->dev_sectors = sectors;
> + mddev->resync_max_sectors = size;
> + return 0;
> +}
> +
> static void *raid10_takeover_raid0(struct mddev *mddev)
> {
> struct md_rdev *rdev;
> @@ -3506,6 +3543,7 @@ static struct md_personality raid10_personality =
> .sync_request = sync_request,
> .quiesce = raid10_quiesce,
> .size = raid10_size,
> + .resize = raid10_resize,
> .takeover = raid10_takeover,
> };
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 6:17 ` keld
@ 2012-03-14 6:27 ` NeilBrown
2012-03-14 7:51 ` David Brown
0 siblings, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-03-14 6:27 UTC (permalink / raw)
To: keld; +Cc: linux-raid
On Wed, 14 Mar 2012 07:17:46 +0100 keld@keldix.com wrote:
> Hi Neil
>
> What is the problem with adding space to the 'far' layout?
>
> I would think you could just create the new array part 1 from the old array part 2, and
> then sync the new array part 2 with the new array part 1. (in the case of a far=2 array,
> for n>2 similar constructs would apply).
If I understand your proposal correctly, you would lose redundancy during the
process, which is not acceptable.
If I don't understand properly - please explain in a bit more detail.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 6:27 ` NeilBrown
@ 2012-03-14 7:51 ` David Brown
2012-03-14 8:32 ` NeilBrown
0 siblings, 1 reply; 42+ messages in thread
From: David Brown @ 2012-03-14 7:51 UTC (permalink / raw)
To: NeilBrown; +Cc: keld, linux-raid
On 14/03/2012 07:27, NeilBrown wrote:
> On Wed, 14 Mar 2012 07:17:46 +0100 keld@keldix.com wrote:
>
>> Hi Neil
>>
>> What is the problem with adding space to the 'far' layout?
>>
>> I would think you could just create the new array part 1 from the
>> old array part 2, and then sync the new array part 2 with the new
>> array part 1. (in the case of a far=2 array, for n>2 similar
>> constructs would apply).
>
> If I understand your proposal correctly, you would lose redundancy
> during the process, which is not acceptable.
>
That's how I understood the suggestion too. And in some cases, that
might be a good choice for the user - if they have good backups, they
might be happy to risk such a re-shape. Of course, they would have to
use the "--yes-I-really-understand-the-risks" flag to mdadm, but other
than that it should be pretty simple to implement.
For a safe re-shape of raid10, you would need to move the "far" copy
backwards to the right spot on the growing disk (or forwards if you are
shrinking the array). It could certainly be done safely, and would be
very useful for users, but it is not quite as simple as an unsafe re-size.
mvh.,
David
> If I don't understand properly - please explain in a bit more
> detail.
>
> Thanks, NeilBrown
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 7:51 ` David Brown
@ 2012-03-14 8:32 ` NeilBrown
2012-03-14 10:20 ` David Brown
0 siblings, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-03-14 8:32 UTC (permalink / raw)
To: David Brown; +Cc: keld, linux-raid
On Wed, 14 Mar 2012 08:51:44 +0100 David Brown <david@westcontrol.com> wrote:
> On 14/03/2012 07:27, NeilBrown wrote:
> > On Wed, 14 Mar 2012 07:17:46 +0100 keld@keldix.com wrote:
> >
> >> Hi Neil
> >>
> >> What is the problem with adding space to the 'far' layout?
> >>
> >> I would think you could just create the new array part 1 from the
> >> old array part 2, and then sync the new array part 2 with the new
> >> array part 1. (in the case of a far=2 array, for n>2 similar
> >> constructs would apply).
> >
> > If I understand your proposal correctly, you would lose redundancy
> > during the process, which is not acceptable.
> >
>
> That's how I understood the suggestion too. And in some cases, that
> might be a good choice for the user - if they have good backups, they
> might be happy to risk such a re-shape. Of course, they would have to
> use the "--yes-I-really-understand-the-risks" flag to mdadm, but other
> than that it should be pretty simple to implement.
Patches welcome :-)
(well, actually not - I really don't like the idea. But my point is that
these things turn out to be somewhat more complicated than they appear at
first).
>
> For a safe re-shape of raid10, you would need to move the "far" copy
> backwards to the right spot on the growing disk (or forwards if you are
> shrinking the array). It could certainly be done safely, and would be
> very useful for users, but it is not quite as simple as an unsafe re-size.
Reshaping a raid10-far to use a different amount of the device would
certainly be possible, but is far from trivial.
One interesting question is how to record all the intermediate states in the
metadata.
NeilBrown
>
> mvh.,
>
> David
>
>
> > If I don't understand properly - please explain in a bit more
> > detail.
> >
> > Thanks, NeilBrown
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 8:32 ` NeilBrown
@ 2012-03-14 10:20 ` David Brown
2012-03-14 12:37 ` keld
0 siblings, 1 reply; 42+ messages in thread
From: David Brown @ 2012-03-14 10:20 UTC (permalink / raw)
To: NeilBrown; +Cc: keld, linux-raid
On 14/03/2012 09:32, NeilBrown wrote:
> On Wed, 14 Mar 2012 08:51:44 +0100 David Brown<david@westcontrol.com> wrote:
>
>> On 14/03/2012 07:27, NeilBrown wrote:
>>> On Wed, 14 Mar 2012 07:17:46 +0100 keld@keldix.com wrote:
>>>
>>>> Hi Neil
>>>>
>>>> What is the problem with adding space to the 'far' layout?
>>>>
>>>> I would think you could just create the new array part 1 from the
>>>> old array part 2, and then sync the new array part 2 with the new
>>>> array part 1. (in the case of a far=2 array, for n>2 similar
>>>> constructs would apply).
>>>
>>> If I understand your proposal correctly, you would lose redundancy
>>> during the process, which is not acceptable.
>>>
>>
>> That's how I understood the suggestion too. And in some cases, that
>> might be a good choice for the user - if they have good backups, they
>> might be happy to risk such a re-shape. Of course, they would have to
>> use the "--yes-I-really-understand-the-risks" flag to mdadm, but other
>> than that it should be pretty simple to implement.
>
> Patches welcome :-)
>
> (well, actually not - I really don't like the idea. But my point is that
> these things turn out to be somewhat more complicated than they appear at
> first).
>
I haven't written any code for md raid, but I've looked at enough to
know that you have to tread carefully - especially as people expect a
particularly high level of code correctness in this area. "Pretty
simple to implement" is a relative term!
I can imagine use cases where it would be better to have an unsafe
resize than no resize - and maybe also cases where a fast unsafe resize
is better than a slow safe resize. But I can also imagine people
getting upset when they find they have used the wrong one, and I can
also see that implementing one "fast but unsafe" feature could easily be
the start of a slippery slope.
>>
>> For a safe re-shape of raid10, you would need to move the "far" copy
>> backwards to the right spot on the growing disk (or forwards if you are
>> shrinking the array). It could certainly be done safely, and would be
>> very useful for users, but it is not quite as simple as an unsafe re-size.
>
> Reshaping a raid10-far to use a different amount of the device would
> certainly be possible, but is far from trivial.
> One interesting question is how to record all the intermediate states in the
> metadata.
>
I had only been thinking of the data itself, not the metadata.
When doing the reshape, you would start off with some free space at the
end of the device (assuming you are growing the raid). You copy a block
of data from near the end of the far copy to its new place in the free
space. Then you can update the metadata to track that change. While
you are doing the metadata update, both the original part of the far
copy, and the new part are valid, so you should be safe if you get a
crash during the update. Once the metadata has been updated, you've got
some new free space ready to move the next block. I don't /think/ you'd
need to track much new metadata - just a "progress so far" record.
Of course, any changes made to the data and filesystems while this is in
progress might cause more complications...
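In code, a minimal sketch of that loop might look like this (hypothetical
helper names, not md code; grow case only, where the new far-copy offsets are
strictly higher than the old ones, so walking from the last chunk backwards
never overwrites a chunk that has not been moved yet):

#include <stdint.h>

#define CHUNK_BYTES (512 * 1024)        /* illustrative chunk size */

struct far_reshape {
        uint64_t old_stride;            /* old per-device offset of the far copy */
        uint64_t new_stride;            /* new (larger) offset after the grow    */
        uint64_t chunks_left;           /* the single "progress so far" record   */
};

/* Stand-ins for real block I/O and a synchronous metadata update. */
static void copy_chunk(uint64_t from, uint64_t to) { }
static void commit_progress(const struct far_reshape *r) { }

static void relocate_far_copy(struct far_reshape *r)
{
        while (r->chunks_left > 0) {
                uint64_t i = r->chunks_left - 1;        /* highest unmoved chunk */

                /* The old offset of chunk i stays authoritative until
                 * commit_progress() completes, and by then the new offset
                 * holds the same data, so a crash anywhere here is safe. */
                copy_chunk(r->old_stride + i * CHUNK_BYTES,
                           r->new_stride + i * CHUNK_BYTES);
                r->chunks_left = i;
                commit_progress(r);
        }
}

Shrinking would be the mirror image: the new offsets are lower, so you walk
from the first chunk forwards instead.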
One particular situation that might be easier as a special case, and
would be common in practice, would be when growing a raid10,far to
devices that are at least twice the size. If you pretend that the
existing raid10,f device sits on top of a newly created, bigger raid10,f
device, then standard raid10,far synchronisation code would copy over
everything to the right place in the bigger disks - even if the data
changes underway. This artificial big raid10,f would have its metadata
in memory only - there is no need to save anything, since you still have
the normal original raid10 copy for safety. Once the new big raid is
fully synchronised, you write its metadata over the original raid10
metadata.
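A very simplified model of the far=2 mapping shows why that works (this
assumes near_copies=1, far_offset=0 and chunk-aligned sizes; the real
raid10.c mapping handles more cases). The offset of the first copy does not
depend on the per-device stride, so after moving to bigger disks only the far
copy needs to be rewritten - which is exactly what a resync of the "pretend"
bigger array would do:

#include <stdint.h>
#include <stdio.h>

struct loc { int dev; uint64_t off; };

/* copy is 0 for the first copy, 1 for the far copy */
static struct loc far2_map(uint64_t chunk, int copy, int raid_disks,
                           uint64_t chunk_bytes, uint64_t stride_bytes)
{
        struct loc l;

        l.dev = (int)((chunk + (uint64_t)copy) % (uint64_t)raid_disks);
        l.off = (chunk / raid_disks) * chunk_bytes      /* place in the section */
              + (uint64_t)copy * stride_bytes;          /* which section        */
        return l;
}

int main(void)
{
        int disks = 4;
        uint64_t chunk_bytes = 512 * 1024;
        uint64_t old_stride = 100ULL << 30;     /* 100 GiB sections, old disks */
        uint64_t new_stride = 200ULL << 30;     /* 200 GiB sections, new disks */
        uint64_t chunk;

        for (chunk = 0; chunk < 8; chunk++) {
                struct loc c0 = far2_map(chunk, 0, disks, chunk_bytes, new_stride);
                struct loc f_old = far2_map(chunk, 1, disks, chunk_bytes, old_stride);
                struct loc f_new = far2_map(chunk, 1, disks, chunk_bytes, new_stride);

                printf("chunk %llu: copy0 dev %d @ %llu (unchanged), "
                       "far copy dev %d: %llu -> %llu\n",
                       (unsigned long long)chunk, c0.dev,
                       (unsigned long long)c0.off, f_new.dev,
                       (unsigned long long)f_old.off,
                       (unsigned long long)f_new.off);
        }
        return 0;
}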
I'm just throwing around ideas here. If they are of help or inspiration
to anyone, that's great - if not, that's okay too.
mvh.,
David
> NeilBrown
>
>
>
>>
>> mvh.,
>>
>> David
>>
>>
>>> If I don't understand properly - please explain in a bit more
>>> detail.
>>>
>>> Thanks, NeilBrown
>>>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays.
2012-03-14 10:20 ` David Brown
@ 2012-03-14 12:37 ` keld
0 siblings, 0 replies; 42+ messages in thread
From: keld @ 2012-03-14 12:37 UTC (permalink / raw)
To: David Brown; +Cc: NeilBrown, linux-raid
On Wed, Mar 14, 2012 at 11:20:31AM +0100, David Brown wrote:
> On 14/03/2012 09:32, NeilBrown wrote:
> >On Wed, 14 Mar 2012 08:51:44 +0100 David Brown<david@westcontrol.com>
> >wrote:
> >
> >>On 14/03/2012 07:27, NeilBrown wrote:
> >>>On Wed, 14 Mar 2012 07:17:46 +0100 keld@keldix.com wrote:
> >>>
> >>>>Hi Neil
> >>>>
> >>>>What is the problem with adding space to the 'far' layout?
> >>>>
> >>>>I would think you could just create the new array part 1 from the
> >>>>old array part 2, and then sync the new array part 2 with the new
> >>>>array part 1. (in the case of a far=2 array, for n>2 similar
> >>>>constructs would apply).
> >>>
> >>>If I understand your proposal correctly, you would lose redundancy
> >>>during the process, which is not acceptable.
> >>>
> >>
> >>That's how I understood the suggestion too. And in some cases, that
> >>might be a good choice for the user - if they have good backups, they
> >>might be happy to risk such a re-shape. Of course, they would have to
> >>use the "--yes-I-really-understand-the-risks" flag to mdadm, but other
> >>than that it should be pretty simple to implement.
> >
> >Patches welcome :-)
> >
> >(well, actually not - I really don't like the idea. But my point is that
> >these things turn out to be somewhat more complicated than they appear at
> >first).
> >
>
> I haven't written any code for md raid, but I've looked at enough to
> know that you have to tread carefully - especially as people expect a
> particularly high level of code correctness in this area. "Pretty
> simple to implement" is a relative term!
>
> I can imagine use cases where it would be better to have an unsafe
> resize than no resize - and maybe also cases where a fast unsafe resize
> is better than a slow safe resize. But I can also imagine people
> getting upset when they find they have used the wrong one, and I can
> also see that implementing one "fast but unsafe" feature could easily be
> the start of a slippery slope.
>
> >>
> >>For a safe re-shape of raid10, you would need to move the "far" copy
> >>backwards to the right spot on the growing disk (or forwards if you are
> >>shrinking the array). It could certainly be done safely, and would be
> >>very useful for users, but it is not quite as simple as an unsafe re-size.
> >
> >Reshaping a raid10-far to use a different amount of the device would
> >certainly be possible, but is far from trivial.
> >One interesting question is how to record all the intermediate states in
> >the
> >metadata.
> >
>
> I had only been thinking of the data itself, not the metadata.
>
> When doing the reshape, you would start off with some free space at the
> end of the device (assuming you are growing the raid). You copy a block
> of data from near the end of the far copy to its new place in the free
> space. Then you can update the metadata to track that change. While
> you are doing the metadata update, both the original part of the far
> copy, and the new part are valid, so you should be safe if you get a
> crash during the update. Once the metadata has been updated, you've got
> some new free space ready to move the next block. I don't /think/ you'd
> need to track much new metadata - just a "progress so far" record.
>
> Of course, any changes made to the data and filesystems while this is in
> progress might cause more complications...
>
>
>
> One particular situation that might be easier as a special case, and
> would be common in practice, would be when growing a raid10,far to
> devices that are at least twice the size. If you pretend that the
> existing raid10,f device sits on top of a newly created, bigger raid10,f
> device, then standard raid10,far synchronisation code would copy over
> everything to the right place in the bigger disks - even if the data
> changes underway. This artificial big raid10,f would have its metadata
> in memory only - there is no need to save anything, since you still have
> the normal original raid10 copy for safety. Once the new big raid is
> fully synchronised, you write its metadata over the original raid10
> metadata.
>
>
> I'm just throwing around ideas here. If they are of help or inspiration
> to anyone, that's great - if not, that's okay too.
>
> mvh.,
>
> David
>
>
>
>
> >NeilBrown
> >
> >
> >
> >>
> >>mvh.,
> >>
> >>David
> >>
> >>
> >>>If I don't understand properly - please explain in a bit more
> >>>detail.
Well, my knowledge of the kernel code is quite limited, and I originally did not know of your
aim of keeping it safe. Anyway, for keeping it safe I then have some ideas for raid10,f.
Assuming a grow, you could copy within the same raid0 part, that is: you could just make a raid0 grow,
and do such a raid0 grow for each of the raid0 parts in the raid10,f array.
Some details: initially you could copy the first blocks to an unused area of one of the new disks.
This would be to get started. Once you have gotten clear of the problem of writing over
data that you have just read, you just need to copy and keep track of where you are:
the block number read and the block number written. No need to keep other copies. And the ordinary read/writes
will go to the disk blocks according to the old/new blocks divide. For metadata, I don't know what
more you need to keep track of. For efficiency I would use quite large buffers, say 2 to 8 MB stripes.
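In rough code, the copy loop could look like this (hypothetical helper names;
the real condition for when writing directly into the new layout becomes safe
depends on the old and new geometry - here it is simplified to a fixed staging
window at the start):

#include <stdint.h>

#define BUF_BYTES   (4ULL * 1024 * 1024)        /* 2 to 8 MB copy units              */
#define STAGE_BYTES (64ULL * 1024 * 1024)       /* early blocks parked on a new disk */

struct grow_progress {
        uint64_t done;                  /* bytes relocated so far; block read and */
        uint64_t total;                 /* block written are both derived from it */
};

/* Stand-ins for stripe I/O against the old and new layout, the unused
 * staging area on one of the new disks, and the metadata update. */
static void read_old(uint64_t pos, void *buf, uint64_t len) { }
static void write_new(uint64_t pos, const void *buf, uint64_t len) { }
static void write_stage(uint64_t pos, const void *buf, uint64_t len) { }
static void record_progress(const struct grow_progress *p) { }
static void move_stage_into_place(uint64_t len) { }

static void grow_raid0_part(struct grow_progress *p)
{
        static char buf[BUF_BYTES];

        while (p->done < p->total) {
                uint64_t len = p->total - p->done;

                if (len > BUF_BYTES)
                        len = BUF_BYTES;
                read_old(p->done, buf, len);
                if (p->done < STAGE_BYTES)
                        write_stage(p->done, buf, len); /* still in the overlap window */
                else
                        write_new(p->done, buf, len);   /* clear of the data just read */
                p->done += len;
                record_progress(p);             /* "where we are" is all the metadata */
        }
        /* finally land the parked first blocks in their new location */
        move_stage_into_place(p->total < STAGE_BYTES ? p->total : STAGE_BYTES);
}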
Best regards
keld
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-03-14 4:40 ` [md PATCH 08/23] md: don't set md arrays to readonly on shutdown NeilBrown
@ 2012-04-18 15:37 ` Alexander Lyakas
2012-04-18 17:44 ` Paweł Brodacki
2012-04-18 22:48 ` NeilBrown
0 siblings, 2 replies; 42+ messages in thread
From: Alexander Lyakas @ 2012-04-18 15:37 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hi Neil,
> This could result in the shutdown happening when array is marked
> dirty, thus forcing a resync on reboot. However if you reboot
> without performing a "sync" first, you get to keep both halves.
Can you pls clarify the last statement?
Thanks,
Alex.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-18 15:37 ` Alexander Lyakas
@ 2012-04-18 17:44 ` Paweł Brodacki
2012-04-18 20:53 ` Alexander Lyakas
2012-04-18 22:48 ` NeilBrown
1 sibling, 1 reply; 42+ messages in thread
From: Paweł Brodacki @ 2012-04-18 17:44 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
2012/4/18 Alexander Lyakas <alex.bolshoy@gmail.com>:
> Hi Neil,
>
>> This could result in the shutdown happening when array is marked
>> dirty, thus forcing a resync on reboot. However if you reboot
>> without performing a "sync" first, you get to keep both halves.
> Can you pls clarify the last statement?
>
> Thanks,
> Alex.
The RAID array breaks (does not work because the disks are out of sync),
and you can keep the pieces as a keepsake, I guess :)
By the way, does it mean that performing a clean shutdown/boot
sequence can result in an array requiring a resync? If this is the
case, could you point me to the arguments supporting this change in the
behaviour of the shutdown process? I see no obvious reason and crave
understanding.
Regards,
Paweł
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-18 17:44 ` Paweł Brodacki
@ 2012-04-18 20:53 ` Alexander Lyakas
0 siblings, 0 replies; 42+ messages in thread
From: Alexander Lyakas @ 2012-04-18 20:53 UTC (permalink / raw)
To: Paweł Brodacki; +Cc: linux-raid
Pawel,
I did not understand that it was a joke (was it???).
Anyway, in our testing we do a lot of reboots, and the array comes up
clean after reboot, even if there's IO on the array at that moment. We
trigger other events to simulate an unclean shutdown of an array. The
problem (in our case) is that such an event can further cause a drive to
not be available when the array comes up. This leads to a dirty-degraded
situation, in which you have to decide whether to wait for the drive to
appear or allow the array to come up anyway. That's why I am a bit worried about
this new functionality.
Alex.
On Wed, Apr 18, 2012 at 8:44 PM, Paweł Brodacki
<pawel.brodacki@googlemail.com> wrote:
> 2012/4/18 Alexander Lyakas <alex.bolshoy@gmail.com>:
>> Hi Neil,
>>
>>> This could result in the shutdown happening when array is marked
>>> dirty, thus forcing a resync on reboot. However if you reboot
>>> without performing a "sync" first, you get to keep both halves.
>> Can you pls clarify the last statement?
>>
>> Thanks,
>> Alex.
>
> The RAID array breaks (does not work, because disks are out of sync),
> and you can keep the pieces as a keepsake, I guess :)
>
> By the way. Does it mean, that performing a clean shutdown/boot
> sequence can result in an array requiring a resync? If this is the
> case, could you point me to arguments supporting this change of
> behaviour of shutdown process? I see no obvious reason and crave
> understanding.
>
> Regards,
> Paweł
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-18 15:37 ` Alexander Lyakas
2012-04-18 17:44 ` Paweł Brodacki
@ 2012-04-18 22:48 ` NeilBrown
2012-04-19 9:11 ` Alexander Lyakas
1 sibling, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-04-18 22:48 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Wed, 18 Apr 2012 18:37:56 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
>
> > This could result in the shutdown happening when array is marked
> > dirty, thus forcing a resync on reboot. However if you reboot
> > without performing a "sync" first, you get to keep both halves.
> Can you pls clarify the last statement?
"If you break it, you get to keep both halves" is a colloquial phrase meaning
that the person saying it takes no responsibility for any damage that
the person being spoken to causes.
When applied to software issues it is ironic (because when a thing 'breaks' it
doesn't result in two pieces like, for example, when a plate breaks) and so
funny (I hope).
(when applied to a RAID mirror, it is doubly ironic :-)
If you force a fast shutdown without allowing a sync to happen
(e.g. reboot -f -n) there could be data on the way out to the array when the
system actually reboots, so the array will be dirty, so a resync will be
required when the system restarts.
This might be undesirable, but that isn't my problem and is not the problem
of md/raid. Rather the cause of the problem is the "reboot -f -n". i.e.
"your" problem. "You" broke it, you get to keep both halves.
Does that help?
NeilBrown
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-18 22:48 ` NeilBrown
@ 2012-04-19 9:11 ` Alexander Lyakas
2012-04-19 9:57 ` NeilBrown
0 siblings, 1 reply; 42+ messages in thread
From: Alexander Lyakas @ 2012-04-19 9:11 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Hi Neil,
Thanks for the clarification.
However, from your commit message it seems that during a normal
reboot (without -f -n), writes can still arrive after your reboot
notifier has cleaned the array. In such a case, the array might be dirty
after reboot. Is that so? If yes, then that's kind of a regression.
Thanks,
Alex.
On Thu, Apr 19, 2012 at 1:48 AM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 18 Apr 2012 18:37:56 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>>
>> > This could result in the shutdown happening when array is marked
>> > dirty, thus forcing a resync on reboot. However if you reboot
>> > without performing a "sync" first, you get to keep both halves.
>> Can you pls clarify the last statement?
>
> "If you break it, you get to keep both halves" is a colloquial phrase meaning
> that that the person saying it takes no responsibility for any damage that
> the person being spoken to causes.
> When applied to software issues it is ironic (because when a thing 'breaks' it
> doesn't result in two pieces like, for example, when a plate breaks) and so
> funny (I hope).
> (when applied to a RAID mirror, it is doubly ironic :-)
>
> If you force a fast shutdown without allowing a sync to happen
> (e.g. reboot -f -n) there could be data on the way out to the array when the
> system actually reboots, so the array will be dirty, so a resync will be
> required when the system restarts.
> This might be undesirable, but that isn't my problem and is not the problem
> of md/raid. Rather the cause of the problem is the "reboot -f -n". i.e.
> "your" problem. "You" broke it, you get to keep both halves.
>
> Does that help?
>
> NeilBrown
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-19 9:11 ` Alexander Lyakas
@ 2012-04-19 9:57 ` NeilBrown
2012-04-20 11:30 ` Paweł Brodacki
2012-04-20 16:26 ` John Robinson
0 siblings, 2 replies; 42+ messages in thread
From: NeilBrown @ 2012-04-19 9:57 UTC (permalink / raw)
To: Alexander Lyakas; +Cc: linux-raid
On Thu, 19 Apr 2012 12:11:43 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:
> Hi Neil,
> Thanks for the clarification.
> However, from your commit message, it stems that during a normal
> reboot (without -f -n), writes can still arrive after your reboot
> notifier has cleaned the array. In such case, array might be dirty
> after reboot. Is that so? If yes, then that's kind of regression.
Why do you think that?
I don't think that is the case.
NeilBrown
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-19 9:57 ` NeilBrown
@ 2012-04-20 11:30 ` Paweł Brodacki
2012-04-20 12:01 ` NeilBrown
2012-04-20 16:26 ` John Robinson
1 sibling, 1 reply; 42+ messages in thread
From: Paweł Brodacki @ 2012-04-20 11:30 UTC (permalink / raw)
To: NeilBrown; +Cc: Alexander Lyakas, linux-raid
2012/4/19 NeilBrown <neilb@suse.de>:
> On Thu, 19 Apr 2012 12:11:43 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> Thanks for the clarification.
>> However, from your commit message, it stems that during a normal
>> reboot (without -f -n), writes can still arrive after your reboot
>> notifier has cleaned the array. In such case, array might be dirty
>> after reboot. Is that so? If yes, then that's kind of regression.
>
> Why do you think that?
>
> I don't think that is the case.
>
> NeilBrown
Hello,
The way I read the message agrees with Alexander's perception. My
impression from reading the commit message is that a normal shutdown
may result in an unclean array, which I would perceive as a regression.
I would really appreciate a clear statement of whether this behaviour
(writeback during shutdown, with the possibility of a poweroff/reboot while
the array is dirty) can or cannot occur during a normal shutdown process.
I would bet that only a forced reboot/poweroff (reboot --force) may
result in a dirty array on boot, but I'd rather not bet my data.
Best regards,
Paweł Brodacki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-20 11:30 ` Paweł Brodacki
@ 2012-04-20 12:01 ` NeilBrown
2012-04-21 15:18 ` Paweł Brodacki
0 siblings, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-04-20 12:01 UTC (permalink / raw)
To: Paweł Brodacki; +Cc: Alexander Lyakas, linux-raid
On Fri, 20 Apr 2012 13:30:39 +0200 Paweł Brodacki
<pawel.brodacki@googlemail.com> wrote:
> 2012/4/19 NeilBrown <neilb@suse.de>:
> > On Thu, 19 Apr 2012 12:11:43 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi Neil,
> >> Thanks for the clarification.
> >> However, from your commit message, it stems that during a normal
> >> reboot (without -f -n), writes can still arrive after your reboot
> >> notifier has cleaned the array. In such case, array might be dirty
> >> after reboot. Is that so? If yes, then that's kind of regression.
> >
> > Why do you think that?
> >
> > I don't think that is the case.
> >
> > NeilBrown
>
> Hello,
>
> The way I read the message agrees with Alexander's perception. My
> impression from reading the commit message is, that normal shutdown
> may result in unclean array, which I would perceive as a regression.
I don't think you'll find the word "normal" in the original message :-)
>
> I would really appreciate clear statement, whether this behaviour
> (writeback during shutdown, with possibility of poweroff/reboot while
> array is dirty) can or cannot occur during normal shutdown process.
Define "normal".
If you kill any processes that could generate write-out, and then do a
'sync', then everything should be fine.
I suspect a "normal" shutdown sequence does this.
NeilBrown
>
> I would bet, that only forced reboot/poweroff (reboot --force) may
> result in dirty array on boot, but I'd rather not bet my data.
>
> Best regards,
> Paweł Brodacki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-19 9:57 ` NeilBrown
2012-04-20 11:30 ` Paweł Brodacki
@ 2012-04-20 16:26 ` John Robinson
1 sibling, 0 replies; 42+ messages in thread
From: John Robinson @ 2012-04-20 16:26 UTC (permalink / raw)
To: NeilBrown; +Cc: Alexander Lyakas, linux-raid
On 19/04/2012 10:57, NeilBrown wrote:
> On Thu, 19 Apr 2012 12:11:43 +0300 Alexander Lyakas<alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> Thanks for the clarification.
>> However, from your commit message, it stems that during a normal
>> reboot (without -f -n), writes can still arrive after your reboot
>> notifier has cleaned the array. In such case, array might be dirty
>> after reboot. Is that so? If yes, then that's kind of regression.
>
> Why do you think that?
>
> I don't think that is the case.
That's what I had thought too, because in your original message
describing the patch, you wrote:
> It seems that with recent kernel, writeback can still be happening
> while shutdown is happening, and consequently data can be written
> after the md reboot notifier switches all arrays to read-only.
I understand that to mean that someone somewhere has patched the kernel
so that writes (writeback?) can still be happening after the array is
supposed to be read-only. This sounds like a regression to me, though
not one you have created.
Cheers,
John.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-20 12:01 ` NeilBrown
@ 2012-04-21 15:18 ` Paweł Brodacki
2012-04-21 20:42 ` NeilBrown
0 siblings, 1 reply; 42+ messages in thread
From: Paweł Brodacki @ 2012-04-21 15:18 UTC (permalink / raw)
To: NeilBrown; +Cc: Alexander Lyakas, linux-raid
On 20 April 2012 at 14:01, NeilBrown <neilb@suse.de> wrote:
> On Fri, 20 Apr 2012 13:30:39 +0200 Paweł Brodacki
> <pawel.brodacki@googlemail.com> wrote:
>>
>> The way I read the message agrees with Alexander's perception. My
>> impression from reading the commit message is, that normal shutdown
>> may result in unclean array, which I would perceive as a regression.
>
> I don't think you'll find the word "normal" in the original message :-)
>
Neil, I respect you and admire your work and all, but I think you are
having a good time with us and this commit message, aren't you? :)
>>
>> I would really appreciate clear statement, whether this behaviour
>> (writeback during shutdown, with possibility of poweroff/reboot while
>> array is dirty) can or cannot occur during normal shutdown process.
>
> Define "normal".
> If you kill any processes that could generate write-out, and then do a
> 'sync', then everything should be fine.
>
I think "normal shutdown" can be defined for the purpose of this
discussion as whatever /sbin/shutdown does.
If I run /sbin/shutdown -h now, it is a normal shutdown. If I do fancy
stuff echoing cute strings and integers into various parts of /proc
and /sys, push magic buttons and/or pull cables at random, it is not
a normal shutdown.
> I suspect a "normal" shutdown sequence does this.
>
> NeilBrown
>
I smell a hint here.
man shutdown:
"shutdown arranges for the system to be brought down in a safe way."
Define safe...
Neil, could you point us to a bug report which inspired you to write
this patch? My google fu failed me and I was able to find just one bug
entry from over 2 years ago, and even that hit is of dubious relevance.
Pretty please with sugar on top?
Best regards,
Paweł Brodacki
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-21 15:18 ` Paweł Brodacki
@ 2012-04-21 20:42 ` NeilBrown
2012-04-30 10:32 ` Paweł Brodacki
0 siblings, 1 reply; 42+ messages in thread
From: NeilBrown @ 2012-04-21 20:42 UTC (permalink / raw)
To: Paweł Brodacki; +Cc: Alexander Lyakas, linux-raid
On Sat, 21 Apr 2012 17:18:58 +0200 Paweł Brodacki
<pawel.brodacki@googlemail.com> wrote:
> On 20 April 2012 at 14:01, NeilBrown <neilb@suse.de> wrote:
> > On Fri, 20 Apr 2012 13:30:39 +0200 Paweł Brodacki
> > <pawel.brodacki@googlemail.com> wrote:
> >>
> >> The way I read the message agrees with Alexander's perception. My
> >> impression from reading the commit message is, that normal shutdown
> >> may result in unclean array, which I would perceive as a regression.
> >
> > I don't think you'll find the word "normal" in the original message :-)
> >
> Neil, I respect you and admire your work and all, but I think you are
> having a good time with us and this commit message, aren't you? :)
"For what, we ask, is life - without a touch of poetry in it".
Maybe not an entirely relevant quote, but it is the first that came to mind.
"What's life without whimsy?"
Would be a slightly more modern quote.
Can you place them without google's help?
>
> >>
> >> I would really appreciate clear statement, whether this behaviour
> >> (writeback during shutdown, with possibility of poweroff/reboot while
> >> array is dirty) can or cannot occur during normal shutdown process.
> >
> > Define "normal".
> > If you kill any processes that could generate write-out, and then do a
> > 'sync', then everything should be fine.
> >
> I think "normal shutdown" can be defined for the purpose of this
> discussion as whatever /sbin/shutdown does.
>
> If I run /sbin/shutdown -h now, it is a normal shutdown. If I do fancy
> stuff echoing cute strings and integers into various parts of /proc
> and /sys, push magic buttons and/or pull cables at random, it' is not
> a normal shutdown.
>
> > I suspend a "normal" shutdown sequence does this.
> >
> > NeilBrown
> >
>
> I smell a hint here.
>
> man shutdown:
> "shutdown arranges for the system to be brought down in a safe way."
> Define safe...
>
> Neil, could you point us to a bug report which inspired you to write
> this patch? My google fu failed me and I was able to find just one bug
> entry from over 2 years ago, and even that hit is of dubious relevance.
>
> Pretty please with sugar on top?
>
https://bugzilla.novell.com/show_bug.cgi?id=713148
But that is against our "enterprise" product so is not globally visible :-(
But it gave me some search-term hints so:
http://lists.debian.org/debian-kernel/2011/05/msg00264.html
http://www.issociate.de/board/post/491554/Kernel_BUG.html
http://www.gossamer-threads.com/lists/xen/devel/174884?do=post_view_threaded
https://lkml.org/lkml/2008/11/24/414
http://comments.gmane.org/gmane.linux.raid/20710
The important elements are:
1/ you see "md: stopping all devices". This means that "reboot" has
really started. This is printed by md's reboot notifier.
2/ A stack trace showing something submitting IO.
In the first link it is bdi_writeback_task
In the second it is kjournald
This pattern can only happen on an unclean shutdown.
NeilBrown
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [md PATCH 08/23] md: don't set md arrays to readonly on shutdown.
2012-04-21 20:42 ` NeilBrown
@ 2012-04-30 10:32 ` Paweł Brodacki
0 siblings, 0 replies; 42+ messages in thread
From: Paweł Brodacki @ 2012-04-30 10:32 UTC (permalink / raw)
To: NeilBrown; +Cc: Alexander Lyakas, linux-raid
2012/4/21 NeilBrown <neilb@suse.de>:
> On Sat, 21 Apr 2012 17:18:58 +0200 Paweł Brodacki
> <pawel.brodacki@googlemail.com> wrote:
>
> "For what, we ask, is life - without a touch of poetry in it".
>
> Maybe not an entirely relevant quote, but it is the first that came to mind.
>
> "What's life without whimsy?"
>
> Would be a slightly more modern quote.
>
> Can you place them without google's help?
I plead ignorance in both cases. After employing Google I can say that:
1) Arthur Sullivan and W. S. Gilbert kinda ring a bell, but no, I
wouldn't have guessed.
2) I was dimly aware of the existence of the character of Dr S. Cooper, but at
no point did I memorise his words.
>> Neil, could you point us to a bug report which inspired you to write
>> this patch? My google fu failed me and I was able to find just one bug
>> entry from over 2 years ago, and even that hit is of dubious relevance.
>>
>> Pretty please with sugar on top?
>>
>
> https://bugzilla.novell.com/show_bug.cgi?id=713148
>
> But that is against our "enterprise" product so is not globally visible :-(
>
> But it gave me some search-term hints so:
>
> http://lists.debian.org/debian-kernel/2011/05/msg00264.html
>
> http://www.issociate.de/board/post/491554/Kernel_BUG.html
>
> http://www.gossamer-threads.com/lists/xen/devel/174884?do=post_view_threaded
>
> https://lkml.org/lkml/2008/11/24/414
> http://comments.gmane.org/gmane.linux.raid/20710
>
> The important elements are:
> 1/ you see "md: stopping all devices". This means that "reboot" has
> really started. This is printed by md's reboot notifier.
>
> 2/ A stack trace showing something submitting IO.
> In the first link it is bdi_writeback_task
> In the second it is kjournald
>
Amazingly enough, this is about what I found via google. Our
perception of "recent kernel" differs, though. Serves me well for
reading too much of linux-btrfs ;).
> This pattern can only happen on an unclean shutdown.
>
> NeilBrown
>
Thanks a lot for a clear, definitive and catharsis-bringing reply.
Kind regards,
Paweł Brodacki
^ permalink raw reply [flat|nested] 42+ messages in thread
end of thread
Thread overview: 42+ messages
2012-03-14 4:40 [md PATCH 00/23] md patches heading for 3.4 NeilBrown
2012-03-14 4:40 ` [md PATCH 02/23] md/raid10: remove unnecessary smp_mb() from end_sync_write NeilBrown
2012-03-14 4:40 ` [md PATCH 05/23] md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read() NeilBrown
2012-03-14 4:40 ` [md PATCH 04/23] md: Use existed macros instead of numbers NeilBrown
2012-03-14 4:40 ` [md PATCH 03/23] md/raid5: removed unused 'added_devices' variable NeilBrown
2012-03-14 4:40 ` [md PATCH 06/23] md: allow last device to be forcibly removed from RAID1/RAID10 NeilBrown
2012-03-14 4:40 ` [md PATCH 01/23] md/raid5: make sure reshape_position is cleared on error path NeilBrown
2012-03-14 4:40 ` [md PATCH 10/23] md/raid1, raid10: avoid deadlock during resync/recovery NeilBrown
2012-03-14 4:40 ` [md PATCH 11/23] md: tidy up rdev_for_each usage NeilBrown
2012-03-14 4:40 ` [md PATCH 13/23] md/raid10: handle merge_bvec_fn in member devices NeilBrown
2012-03-14 4:40 ` [md PATCH 07/23] md: allow re-add to failed arrays NeilBrown
2012-03-14 4:40 ` [md PATCH 14/23] md/raid1: handle merge_bvec_fn in member devices NeilBrown
2012-03-14 4:40 ` [md PATCH 12/23] md: add proper merge_bvec handling to RAID0 and Linear NeilBrown
2012-03-14 4:40 ` [md PATCH 09/23] md/bitmap: ensure to load bitmap when creating via sysfs NeilBrown
2012-03-14 4:40 ` [md PATCH 08/23] md: don't set md arrays to readonly on shutdown NeilBrown
2012-04-18 15:37 ` Alexander Lyakas
2012-04-18 17:44 ` Paweł Brodacki
2012-04-18 20:53 ` Alexander Lyakas
2012-04-18 22:48 ` NeilBrown
2012-04-19 9:11 ` Alexander Lyakas
2012-04-19 9:57 ` NeilBrown
2012-04-20 11:30 ` Paweł Brodacki
2012-04-20 12:01 ` NeilBrown
2012-04-21 15:18 ` Paweł Brodacki
2012-04-21 20:42 ` NeilBrown
2012-04-30 10:32 ` Paweł Brodacki
2012-04-20 16:26 ` John Robinson
2012-03-14 4:40 ` [md PATCH 20/23] md/bitmap: remove unnecessary indirection when allocating NeilBrown
2012-03-14 4:40 ` [md PATCH 16/23] md/bitmap: remove some unused noise from bitmap.h NeilBrown
2012-03-14 4:40 ` [md PATCH 22/23] md: fix clearing of the 'changed' flags for the bad blocks list NeilBrown
2012-03-14 4:40 ` [md PATCH 15/23] md/raid10 - support resizing some RAID10 arrays NeilBrown
2012-03-14 6:17 ` keld
2012-03-14 6:27 ` NeilBrown
2012-03-14 7:51 ` David Brown
2012-03-14 8:32 ` NeilBrown
2012-03-14 10:20 ` David Brown
2012-03-14 12:37 ` keld
2012-03-14 4:40 ` [md PATCH 17/23] md/bitmap: move printing of bitmap status to bitmap.c NeilBrown
2012-03-14 4:40 ` [md PATCH 21/23] md/bitmap: discard CHUNK_BLOCK_SHIFT macro NeilBrown
2012-03-14 4:40 ` [md PATCH 19/23] md/bitmap: remove some pointless locking NeilBrown
2012-03-14 4:40 ` [md PATCH 18/23] md/bitmap: change a 'goto' to a normal 'if' construct NeilBrown
2012-03-14 4:40 ` [md PATCH 23/23] md: Add judgement bb->unacked_exist in function md_ack_all_badblocks() NeilBrown