Linux RAID subsystem development
 help / color / mirror / Atom feed
* [PATCH/RFC] add "failfast" support for raid1/raid10.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare

Hi,

 I've been sitting on these patches for a while because although they
 solve a real problem, it is a fairly limited use-case, and I don't
 really like some of the details.

 So I'm posting them as RFC in the hope that a different perspective
 might help me like them better, or find a better approach.

 The core idea is that when you have multiple copies of data
 (i.e. mirrored drives) it doesn't make sense to wait for a read from
 a drive that seems to be having problems.  It will probably be faster
 to just cancel that read, and read from the other device.
 Similarly, in some circumstances, it might be better to fail a drive
 that is being slow to respond to writes, rather than cause all writes
 to be very slow.

 The particular context where this comes up is when mirroring across
 storage arrays, where the storage arrays can temporarily take an
 unusually long time to respond to requests (firmware updates have
 been mentioned).  As the array will have redundancy internally, there
 is little risk to the data.  The mirrored pair is really only for
 disaster recovery, and it is deemed better to lose the last few
 minutes of updates in the case of a serious disaster, rather than
 occasionally having latency issues because one array needs to do some
 maintenance for a few minutes.  The particular storage arrays in
 question are DASD devices which are part of the s390 ecosystem.

 Linux block layer has "failfast" flags to direct drivers to fail more
 quickly.  These patches allow devices in an md array to be given a
 "failfast" flag, which will cause IO requests to be marked as
 "failfast" providing there is another device available.  Once the
 array becomes degraded, we stop using failfast, as that could result
 in data loss.

 I don't like the whole "failfast" concept because it is not at all
 clear how fast "fast" is.  In fact, these block-layer flags are
 really a misnomer.  They should be "noretry" flags.
 REQ_FAILFAST_DEV means "don't retry requests which reported an error
 which seems to come from the device.
 REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
 indicate a problem with the transport, rather than the device"
 REQ_FAILFAST_DRIVER means  .... I'm not exactly sure.  I think it
 means whatever a particular driver wants it to mean, basically "I
 cannot seem to handle this right now, just resend and I'll probably
 be more in control next time".  It seems to be for internal-use only.

 Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
 btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
 to make sense.... why would you ever use _DEV without _TRANSPORT?

 None of these actually change the timeouts in the driver or in the
 device, which is what I would expect for "failfast", so to get real
 "fast failure" you need to enable failfast, and adjust the timeouts.
 That is what we do for our customers with DASD.

 Anyway, it seems to make sense to use _TRANSPORT and _DEV for
 requests from md where there is somewhere to fall-back on.
 If we get an error from a "failfast" request, and the array is still
 non-degraded, we just fail the device.  We don't try to repair read
 errors (which is pointless on storage arrays).

 It is assumed that some user-space code will notice the failure,
 monitor the device to see when it becomes available again, and then
 --re-add it.  Assuming the array has a bitmap, the --re-add should be
 fast and the array will become optimal again without experiencing
 excessive latencies.

 My two main concerns are:
  - does this functionality have any use-case outside of mirrored
    storage arrays, and are there other storage arrays which
    occasionally inserted excessive latency (seems like a serious
    misfeature to me, but I know few of the details)?
  - would it be at all possible to have "real" failfast functionality
    in the block layer?  I.e. something that is based on time rather
    than retry count.  Maybe in some cases a retry would be
    appropriate if the first failure was very fast.
    I.e. it would reduce timeouts and decide on retries based on
    elapsed time rather than number of attempts.
    With this would come the question of "how fast is fast" and I
    don't have a really good answer.  Maybe md would need to set a
    timeout, which it would double whenever it got failures on all
    drives.  Otherwise the timeout would drift towards (say) 10 times
    the typical response time.

 So: comments most welcome.  As I say, this does address a genuine
 need.  Just find it hard to like it :-(


Thanks,
NeilBrown

---

NeilBrown (6):
      md/failfast:  add failfast flag for md to be used by some personalities.
      md: Use REQ_FAILFAST_* on metadata writes where appropriate
      md/raid1: add failfast handling for reads.
      md/raid1: add failfast handling for writes.
      md/raid10: add failfast handling for reads.
      md/raid10: add failfast handling for writes.


 drivers/md/bitmap.c            |   15 ++++++--
 drivers/md/md.c                |   71 +++++++++++++++++++++++++++++++-----
 drivers/md/md.h                |   27 +++++++++++++-
 drivers/md/raid1.c             |   79 ++++++++++++++++++++++++++++++++++------
 drivers/md/raid1.h             |    1 +
 drivers/md/raid10.c            |   79 +++++++++++++++++++++++++++++++++++++---
 drivers/md/raid10.h            |    2 +
 include/uapi/linux/raid/md_p.h |    7 +++-
 8 files changed, 249 insertions(+), 32 deletions(-)

--
Signature


^ permalink raw reply

* [md PATCH 1/6] md/failfast: add failfast flag for md to be used by some personalities.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error.  The underlying driver will be directed not to retry requests
that result in failures.  There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail, it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device.  No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c                |   27 +++++++++++++++++++++++++++
 drivers/md/md.h                |    6 ++++++
 include/uapi/linux/raid/md_p.h |    7 ++++++-
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f007b68..0d1a9fd9d1c1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1163,6 +1163,8 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 		}
 		if (desc->state & (1<<MD_DISK_WRITEMOSTLY))
 			set_bit(WriteMostly, &rdev->flags);
+		if (desc->state & (1<<MD_DISK_FAILFAST))
+			set_bit(FailFast, &rdev->flags);
 	} else /* MULTIPATH are always insync */
 		set_bit(In_sync, &rdev->flags);
 	return 0;
@@ -1288,6 +1290,8 @@ static void super_90_sync(struct mddev *mddev, struct md_rdev *rdev)
 		}
 		if (test_bit(WriteMostly, &rdev2->flags))
 			d->state |= (1<<MD_DISK_WRITEMOSTLY);
+		if (test_bit(FailFast, &rdev2->flags))
+			d->state |= (1<<MD_DISK_FAILFAST);
 	}
 	/* now set the "removed" and "faulty" bits on any missing devices */
 	for (i=0 ; i < mddev->raid_disks ; i++) {
@@ -1672,6 +1676,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 		}
 		if (sb->devflags & WriteMostly1)
 			set_bit(WriteMostly, &rdev->flags);
+		if (sb->devflags & FailFast1)
+			set_bit(FailFast, &rdev->flags);
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_REPLACEMENT)
 			set_bit(Replacement, &rdev->flags);
 	} else /* MULTIPATH are always insync */
@@ -1710,6 +1716,10 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
 	sb->level = cpu_to_le32(mddev->level);
 	sb->layout = cpu_to_le32(mddev->layout);
+	if (test_bit(FailFast, &rdev->flags))
+		sb->devflags |= FailFast1;
+	else
+		sb->devflags &= ~FailFast1;
 
 	if (test_bit(WriteMostly, &rdev->flags))
 		sb->devflags |= WriteMostly1;
@@ -2554,6 +2564,8 @@ state_show(struct md_rdev *rdev, char *page)
 		len += sprintf(page+len, "replacement%s", sep);
 	if (test_bit(ExternalBbl, &flags))
 		len += sprintf(page+len, "external_bbl%s", sep);
+	if (test_bit(FailFast, &flags))
+		len += sprintf(page+len, "failfast%s", sep);
 
 	if (len)
 		len -= strlen(sep);
@@ -2576,6 +2588,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	 *            so that it gets rebuilt based on bitmap
 	 *  write_error - sets WriteErrorSeen
 	 *  -write_error - clears WriteErrorSeen
+	 *  {,-}failfast - set/clear FailFast
 	 */
 	int err = -EINVAL;
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -2634,6 +2647,12 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
 		set_bit(In_sync, &rdev->flags);
 		err = 0;
+	} else if (cmd_match(buf, "failfast")) {
+		set_bit(FailFast, &rdev->flags);
+		err = 0;
+	} else if (cmd_match(buf, "-failfast")) {
+		clear_bit(FailFast, &rdev->flags);
+		err = 0;
 	} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0 &&
 		   !test_bit(Journal, &rdev->flags)) {
 		if (rdev->mddev->pers == NULL) {
@@ -5939,6 +5958,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
 			info.state |= (1<<MD_DISK_JOURNAL);
 		if (test_bit(WriteMostly, &rdev->flags))
 			info.state |= (1<<MD_DISK_WRITEMOSTLY);
+		if (test_bit(FailFast, &rdev->flags))
+			info.state |= (1<<MD_DISK_FAILFAST);
 	} else {
 		info.major = info.minor = 0;
 		info.raid_disk = -1;
@@ -6046,6 +6067,10 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
 			set_bit(WriteMostly, &rdev->flags);
 		else
 			clear_bit(WriteMostly, &rdev->flags);
+		if (info->state & (1<<MD_DISK_FAILFAST))
+			set_bit(FailFast, &rdev->flags);
+		else
+			clear_bit(FailFast, &rdev->flags);
 
 		if (info->state & (1<<MD_DISK_JOURNAL)) {
 			struct md_rdev *rdev2;
@@ -6135,6 +6160,8 @@ static int add_new_disk(struct mddev *mddev, mdu_disk_info_t *info)
 
 		if (info->state & (1<<MD_DISK_WRITEMOSTLY))
 			set_bit(WriteMostly, &rdev->flags);
+		if (info->state & (1<<MD_DISK_FAILFAST))
+			set_bit(FailFast, &rdev->flags);
 
 		if (!mddev->persistent) {
 			pr_debug("md: nonpersistent superblock ...\n");
diff --git a/drivers/md/md.h b/drivers/md/md.h
index af6b33c30d2d..bc6712ef8c81 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -171,6 +171,12 @@ enum flag_bits {
 	ExternalBbl,            /* External metadata provides bad
 				 * block management for a disk
 				 */
+	FailFast,		/* Minimal retries should be attempted on
+				 * this device, so use REQ_FAILFAST_DEV.
+				 * Also don't try to repair failed reads.
+				 * It is expects that no bad block log
+				 * is present.
+				 */
 };
 
 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index c3e654c6d518..9930f3e9040f 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -84,6 +84,10 @@
 #define MD_DISK_CANDIDATE	5 /* disk is added as spare (local) until confirmed
 				   * For clustered enviroments only.
 				   */
+#define MD_DISK_FAILFAST	10 /* Send REQ_FAILFAST if there are multiple
+				    * devices available - and don't try to
+				    * correct read errors.
+				    */
 
 #define	MD_DISK_WRITEMOSTLY	9 /* disk is "write-mostly" is RAID1 config.
 				   * read requests will only be sent here in
@@ -265,8 +269,9 @@ struct mdp_superblock_1 {
 	__le32	dev_number;	/* permanent identifier of this  device - not role in raid */
 	__le32	cnt_corrected_read; /* number of read errors that were corrected by re-writing */
 	__u8	device_uuid[16]; /* user-space setable, ignored by kernel */
-	__u8	devflags;	/* per-device flags.  Only one defined...*/
+	__u8	devflags;	/* per-device flags.  Only two defined...*/
 #define	WriteMostly1	1	/* mask for writemostly flag in above */
+#define	FailFast1	2	/* Should avoid retries and fixups and just fail */
 	/* Bad block log.  If there are any bad blocks the feature flag is set.
 	 * If offset and size are non-zero, that space is reserved and available
 	 */

^ permalink raw reply related

* [md PATCH 2/6] md: Use REQ_FAILFAST_* on metadata writes where appropriate
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state.  i.e. if marking a device Faulty would cause some
data to be inaccessible, the device is status is left as
non-Faulty.  This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve chance of success.  We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/bitmap.c |   15 ++++++++++++---
 drivers/md/md.c     |   44 ++++++++++++++++++++++++++++++++++----------
 drivers/md/md.h     |   21 ++++++++++++++++++++-
 drivers/md/raid1.c  |    1 +
 drivers/md/raid10.c |    1 +
 5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index cf77cbf9ed22..c4621571b718 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -209,11 +209,13 @@ static struct md_rdev *next_active_rdev(struct md_rdev *rdev, struct mddev *mdde
 
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
-	struct md_rdev *rdev = NULL;
+	struct md_rdev *rdev;
 	struct block_device *bdev;
 	struct mddev *mddev = bitmap->mddev;
 	struct bitmap_storage *store = &bitmap->storage;
 
+restart:
+	rdev = NULL;
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 		int size = PAGE_SIZE;
 		loff_t offset = mddev->bitmap_info.offset;
@@ -269,8 +271,8 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 			       page);
 	}
 
-	if (wait)
-		md_super_wait(mddev);
+	if (wait && md_super_wait(mddev) < 0)
+		goto restart;
 	return 0;
 
  bad_alignment:
@@ -428,6 +430,13 @@ static void bitmap_wait_writes(struct bitmap *bitmap)
 		wait_event(bitmap->write_wait,
 			   atomic_read(&bitmap->pending_writes)==0);
 	else
+		/* Note that we ignore the return value.  The writes
+		 * might have failed, but that would just mean that
+		 * some bits which should be cleared haven't been,
+		 * which is safe.  The relevant bitmap blocks will
+		 * probably get written again, but there is no great
+		 * loss if they aren't.
+		 */
 		md_super_wait(bitmap->mddev);
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0d1a9fd9d1c1..047e7db1381b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -726,7 +726,13 @@ static void super_written(struct bio *bio)
 	if (bio->bi_error) {
 		pr_err("md: super_written gets error=%d\n", bio->bi_error);
 		md_error(mddev, rdev);
-	}
+		if (!test_bit(Faulty, &rdev->flags)
+		    && (bio->bi_opf & MD_FAILFAST)) {
+			set_bit(MD_NEED_REWRITE, &mddev->flags);
+			set_bit(LastDev, &rdev->flags);
+		}
+	} else
+		clear_bit(LastDev, &rdev->flags);
 
 	if (atomic_dec_and_test(&mddev->pending_writes))
 		wake_up(&mddev->sb_wait);
@@ -743,7 +749,13 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 	 * if zero is reached.
 	 * If an error occurred, call md_error
 	 */
-	struct bio *bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);
+	struct bio *bio;
+	int ff = 0;
+
+	if (test_bit(Faulty, &rdev->flags))
+		return;
+
+	bio = bio_alloc_mddev(GFP_NOIO, 1, mddev);
 
 	atomic_inc(&rdev->nr_pending);
 
@@ -752,16 +764,24 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 	bio_add_page(bio, page, size, 0);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;
-	bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH_FUA);
+
+	if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
+	    test_bit(FailFast, &rdev->flags) &&
+	    !test_bit(LastDev, &rdev->flags))
+		ff = MD_FAILFAST;
+	bio_set_op_attrs(bio, REQ_OP_WRITE, WRITE_FLUSH_FUA | ff);
 
 	atomic_inc(&mddev->pending_writes);
 	submit_bio(bio);
 }
 
-void md_super_wait(struct mddev *mddev)
+int md_super_wait(struct mddev *mddev)
 {
 	/* wait for all superblock writes that were scheduled to complete */
 	wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
+	if (test_and_clear_bit(MD_NEED_REWRITE, &mddev->flags))
+		return -EAGAIN;
+	return 0;
 }
 
 int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
@@ -1333,9 +1353,10 @@ super_90_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	if (IS_ENABLED(CONFIG_LBDAF) && (u64)num_sectors >= (2ULL << 32) &&
 	    rdev->mddev->level >= 1)
 		num_sectors = (sector_t)(2ULL << 32) - 2;
-	md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
+	do
+		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
 		       rdev->sb_page);
-	md_super_wait(rdev->mddev);
+	while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 }
 
@@ -1876,9 +1897,10 @@ super_1_rdev_size_change(struct md_rdev *rdev, sector_t num_sectors)
 	sb->data_size = cpu_to_le64(num_sectors);
 	sb->super_offset = rdev->sb_start;
 	sb->sb_csum = calc_sb_1_csum(sb);
-	md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
-		       rdev->sb_page);
-	md_super_wait(rdev->mddev);
+	do
+		md_super_write(rdev->mddev, rdev, rdev->sb_start, rdev->sb_size,
+			       rdev->sb_page);
+	while (md_super_wait(rdev->mddev) < 0);
 	return num_sectors;
 
 }
@@ -2413,6 +2435,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
 	pr_debug("md: updating %s RAID superblock on device (in sync %d)\n",
 		 mdname(mddev), mddev->in_sync);
 
+rewrite:
 	bitmap_update_sb(mddev->bitmap);
 	rdev_for_each(rdev, mddev) {
 		char b[BDEVNAME_SIZE];
@@ -2444,7 +2467,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
 			/* only need to write one superblock... */
 			break;
 	}
-	md_super_wait(mddev);
+	if (md_super_wait(mddev) < 0)
+		goto rewrite;
 	/* if there was a failure, MD_CHANGE_DEVS was set, and we re-write super */
 
 	if (mddev_is_clustered(mddev) && ret == 0)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bc6712ef8c81..5c08f84101fa 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -30,6 +30,16 @@
 #define MaxSector (~(sector_t)0)
 
 /*
+ * These flags should really be called "NO_RETRY" rather than
+ * "FAILFAST" because they don't make any promise about time lapse,
+ * only about the number of retries, which will be zero.
+ * REQ_FAILFAST_DRIVER is not included because
+ * Commit: 4a27446f3e39 ("[SCSI] modify scsi to handle new fail fast flags.")
+ * seems to suggest that the errors it avoids retrying should usually
+ * be retried.
+ */
+#define	MD_FAILFAST	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)
+/*
  * MD's 'extended' device
  */
 struct md_rdev {
@@ -177,6 +187,10 @@ enum flag_bits {
 				 * It is expects that no bad block log
 				 * is present.
 				 */
+	LastDev,		/* Seems to be the last working dev as
+				 * it didn't fail, so don't use FailFast
+				 * any more for metadata
+				 */
 };
 
 static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
@@ -213,6 +227,11 @@ enum mddev_flags {
 	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 				   * already took resync lock, need to
 				   * release the lock */
+	MD_FAILFAST_SUPPORTED,	/* Using MD_FAILFAST on metadata writes is
+				 * supported as calls to md_error() will
+				 * never cause the array to become failed.
+				 */
+	MD_NEED_REWRITE,	/* metadata write needs to be repeated */
 };
 #define MD_UPDATE_SB_FLAGS (BIT(MD_CHANGE_DEVS) | \
 			    BIT(MD_CHANGE_CLEAN) | \
@@ -628,7 +647,7 @@ extern int mddev_congested(struct mddev *mddev, int bits);
 extern void md_flush_request(struct mddev *mddev, struct bio *bio);
 extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
 			   sector_t sector, int size, struct page *page);
-extern void md_super_wait(struct mddev *mddev);
+extern int md_super_wait(struct mddev *mddev);
 extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size,
 			struct page *page, int op, int op_flags,
 			bool metadata_op);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 96474f0af1b8..cfd73730e6af 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2989,6 +2989,7 @@ static int raid1_run(struct mddev *mddev)
 	mddev->thread = conf->thread;
 	conf->thread = NULL;
 	mddev->private = conf;
+	set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
 
 	md_set_array_sectors(mddev, raid1_size(mddev, 0, 0));
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b53610c59104..763ca45b6b32 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3728,6 +3728,7 @@ static int raid10_run(struct mddev *mddev)
 	size = raid10_size(mddev, 0, 0);
 	md_set_array_sectors(mddev, size);
 	mddev->resync_max_sectors = size;
+	set_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
 
 	if (mddev->queue) {
 		int stripe = conf->geo.raid_disks *

^ permalink raw reply related

* [md PATCH 3/6] md/raid1: add failfast handling for reads.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

If a device is marked FailFast and it is not the only device
we can read from, we mark the bio with REQ_FAILFAST_* flags.

If this does fail, we don't try read repair but just allow
failure.  If it was the last device it doesn't fail of
course, so the retry happens on the same device - this time
without FAILFAST.  A subsequent failure will not retry but
will just pass up the error.

During resync we may use FAILFAST requests and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case were there are multiple places to read from - i.e. if
there are > 2 devices.  If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

The new R1BIO_FailFast flag is set on read reqest to suggest
the a FAILFAST request might be acceptable.  The rdev needs
to have FailFast set as well for the read to actually use
REQ_FAILFAST_*.

We need to know there are at least two working devices
before we can set R1BIO_FailFast, so we mustn't stop looking
at the first device we find.  So the "min_pending == 0"
handling to not exit early, but too always choose the
best_pending_disk if min_pending == 0.

The spinlocked region in raid1_error() in enlarged to ensure
that if two bios, reading from two different devices, fail
at the same time, then there is no risk that both devices
will be marked faulty, leaving zero "In_sync" devices.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   52 ++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid1.h |    1 +
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index cfd73730e6af..44f93297698d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -330,6 +330,11 @@ static void raid1_end_read_request(struct bio *bio)
 
 	if (uptodate)
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	else if (test_bit(FailFast, &rdev->flags) &&
+		 test_bit(R1BIO_FailFast, &r1_bio->state))
+		/* This was a fail-fast read so we definitely
+		 * want to retry */
+		;
 	else {
 		/* If all other devices have failed, we want to return
 		 * the error upwards rather than fail the last device.
@@ -536,6 +541,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	best_good_sectors = 0;
 	has_nonrot_disk = 0;
 	choose_next_idle = 0;
+	clear_bit(R1BIO_FailFast, &r1_bio->state);
 
 	if ((conf->mddev->recovery_cp < this_sector + sectors) ||
 	    (mddev_is_clustered(conf->mddev) &&
@@ -609,6 +615,10 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 		} else
 			best_good_sectors = sectors;
 
+		if (best_disk >= 0)
+			/* At least two disks to choose from so failfast is OK */
+			set_bit(R1BIO_FailFast, &r1_bio->state);
+
 		nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
 		has_nonrot_disk |= nonrot;
 		pending = atomic_read(&rdev->nr_pending);
@@ -647,11 +657,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 			}
 			break;
 		}
-		/* If device is idle, use it */
-		if (pending == 0) {
-			best_disk = disk;
-			break;
-		}
 
 		if (choose_next_idle)
 			continue;
@@ -674,7 +679,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
 	 * mixed ratation/non-rotational disks depending on workload.
 	 */
 	if (best_disk == -1) {
-		if (has_nonrot_disk)
+		if (has_nonrot_disk || min_pending == 0)
 			best_disk = best_pending_disk;
 		else
 			best_disk = best_dist_disk;
@@ -1168,6 +1173,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		read_bio->bi_bdev = mirror->rdev->bdev;
 		read_bio->bi_end_io = raid1_end_read_request;
 		bio_set_op_attrs(read_bio, op, do_sync);
+		if (test_bit(FailFast, &mirror->rdev->flags) &&
+		    test_bit(R1BIO_FailFast, &r1_bio->state))
+			read_bio->bi_opf |= MD_FAILFAST;
 		read_bio->bi_private = r1_bio;
 
 		if (mddev->gendisk)
@@ -1465,6 +1473,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	 * next level up know.
 	 * else mark the drive as failed
 	 */
+	spin_lock_irqsave(&conf->device_lock, flags);
 	if (test_bit(In_sync, &rdev->flags)
 	    && (conf->raid_disks - mddev->degraded) == 1) {
 		/*
@@ -1474,10 +1483,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 		 * it is very likely to fail.
 		 */
 		conf->recovery_disabled = mddev->recovery_disabled;
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 		return;
 	}
 	set_bit(Blocked, &rdev->flags);
-	spin_lock_irqsave(&conf->device_lock, flags);
 	if (test_and_clear_bit(In_sync, &rdev->flags)) {
 		mddev->degraded++;
 		set_bit(Faulty, &rdev->flags);
@@ -1816,12 +1825,24 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
 	sector_t sect = r1_bio->sector;
 	int sectors = r1_bio->sectors;
 	int idx = 0;
+	struct md_rdev *rdev;
+
+	rdev = conf->mirrors[r1_bio->read_disk].rdev;
+	if (test_bit(FailFast, &rdev->flags)) {
+		/* Don't try recovering from here - just fail it
+		 * ... unless it is the last working device of course */
+		md_error(mddev, rdev);
+		if (test_bit(Faulty, &rdev->flags))
+			/* Don't try to read from here, but make sure
+			 * put_buf does it's thing
+			 */
+			bio->bi_end_io = end_sync_write;
+	}
 
 	while(sectors) {
 		int s = sectors;
 		int d = r1_bio->read_disk;
 		int success = 0;
-		struct md_rdev *rdev;
 		int start;
 
 		if (s > (PAGE_SIZE>>9))
@@ -2332,7 +2353,9 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 	bio_put(bio);
 	r1_bio->bios[r1_bio->read_disk] = NULL;
 
-	if (mddev->ro == 0) {
+	rdev = conf->mirrors[r1_bio->read_disk].rdev;
+	if (mddev->ro == 0
+	    && !test_bit(FailFast, &rdev->flags)) {
 		freeze_array(conf, 1);
 		fix_read_error(conf, r1_bio->read_disk,
 			       r1_bio->sector, r1_bio->sectors);
@@ -2341,7 +2364,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 		r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
 	}
 
-	rdev_dec_pending(conf->mirrors[r1_bio->read_disk].rdev, conf->mddev);
+	rdev_dec_pending(rdev, conf->mddev);
 
 read_more:
 	disk = read_balance(conf, r1_bio, &max_sectors);
@@ -2366,6 +2389,9 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 		bio->bi_bdev = rdev->bdev;
 		bio->bi_end_io = raid1_end_read_request;
 		bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
+		if (test_bit(FailFast, &rdev->flags) &&
+		    test_bit(R1BIO_FailFast, &r1_bio->state))
+			bio->bi_opf |= MD_FAILFAST;
 		bio->bi_private = r1_bio;
 		if (max_sectors < r1_bio->sectors) {
 			/* Drat - have to split this up more */
@@ -2654,6 +2680,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 			bio->bi_iter.bi_sector = sector_nr + rdev->data_offset;
 			bio->bi_bdev = rdev->bdev;
 			bio->bi_private = r1_bio;
+			if (test_bit(FailFast, &rdev->flags))
+				bio->bi_opf |= MD_FAILFAST;
 		}
 	}
 	rcu_read_unlock();
@@ -2784,6 +2812,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 			if (bio->bi_end_io == end_sync_read) {
 				read_targets--;
 				md_sync_acct(bio->bi_bdev, nr_sectors);
+				if (read_targets == 1)
+					bio->bi_opf &= ~MD_FAILFAST;
 				generic_make_request(bio);
 			}
 		}
@@ -2791,6 +2821,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 		atomic_set(&r1_bio->remaining, 1);
 		bio = r1_bio->bios[r1_bio->read_disk];
 		md_sync_acct(bio->bi_bdev, nr_sectors);
+		if (read_targets == 1)
+			bio->bi_opf &= ~MD_FAILFAST;
 		generic_make_request(bio);
 
 	}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 5ec19449779d..c52ef424a24b 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -183,5 +183,6 @@ enum r1bio_state {
  */
 	R1BIO_MadeGood,
 	R1BIO_WriteError,
+	R1BIO_FailFast,
 };
 #endif



^ permalink raw reply related

* [md PATCH 4/6] md/raid1: add failfast handling for writes.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

When writing to a fastfail device we use MD_FASTFAIL unless
it is the only device being written to.

For resync/recovery, assume there was a working device to
read from so always use REQ_FASTFAIL_DEV.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal failfast write fails, but the device cannot be
failed (must be only one left), we queue for write error
handling.  This will call narrow_write_error() to retry the
write synchronously and without any FAILFAST flags.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 44f93297698d..731fd9fe79ef 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -423,7 +423,24 @@ static void raid1_end_write_request(struct bio *bio)
 			set_bit(MD_RECOVERY_NEEDED, &
 				conf->mddev->recovery);
 
-		set_bit(R1BIO_WriteError, &r1_bio->state);
+		if (test_bit(FailFast, &rdev->flags) &&
+		    (bio->bi_opf & MD_FAILFAST) &&
+		    /* We never try FailFast to WriteMostly devices */
+		    !test_bit(WriteMostly, &rdev->flags)) {
+			md_error(r1_bio->mddev, rdev);
+			if (!test_bit(Faulty, &rdev->flags))
+				/* This is the only remaining device,
+				 * We need to retry the write without
+				 * FailFast
+				 */
+				set_bit(R1BIO_WriteError, &r1_bio->state);
+			else {
+				/* Finished with this branch */
+				r1_bio->bios[mirror] = NULL;
+				to_put = bio;
+			}
+		} else
+			set_bit(R1BIO_WriteError, &r1_bio->state);
 	} else {
 		/*
 		 * Set R1BIO_Uptodate in our master bio, so that we
@@ -1393,6 +1410,10 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
 		mbio->bi_end_io	= raid1_end_write_request;
 		bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
+		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) &&
+		    !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) &&
+		    conf->raid_disks - mddev->degraded > 1)
+			mbio->bi_opf |= MD_FAILFAST;
 		mbio->bi_private = r1_bio;
 
 		atomic_inc(&r1_bio->remaining);
@@ -2061,6 +2082,9 @@ static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 			continue;
 
 		bio_set_op_attrs(wbio, REQ_OP_WRITE, 0);
+		if (test_bit(FailFast, &conf->mirrors[i].rdev->flags))
+			wbio->bi_opf |= MD_FAILFAST;
+
 		wbio->bi_end_io = end_sync_write;
 		atomic_inc(&r1_bio->remaining);
 		md_sync_acct(conf->mirrors[i].rdev->bdev, bio_sectors(wbio));



^ permalink raw reply related

* [md PATCH 6/6] md/raid10: add failfast handling for writes.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

When writing to a fastfail device, we use MD_FASTFAIL unless
it is the only device being written to.  For
resync/recovery, assume there was a working device to read
from so always use MD_FASTFAIL.

If a write for resync/recovery fails, we just fail the
device - there is not much else to do.

If a normal write fails, but the device cannot be marked
Faulty (must be only one left), we queue for write error
handling which calls narrow_write_error() to write the block
synchronously without any failfast flags.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |   29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 99fa1b980371..c191d00055d0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -100,6 +100,7 @@ static int max_queued_requests = 1024;
 static void allow_barrier(struct r10conf *conf);
 static void lower_barrier(struct r10conf *conf);
 static int _enough(struct r10conf *conf, int previous, int ignore);
+static int enough(struct r10conf *conf, int ignore);
 static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
 				int *skipped);
 static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
@@ -451,6 +452,7 @@ static void raid10_end_write_request(struct bio *bio)
 	struct r10conf *conf = r10_bio->mddev->private;
 	int slot, repl;
 	struct md_rdev *rdev = NULL;
+	struct bio *to_put = NULL;
 	bool discard_error;
 
 	discard_error = bio->bi_error && bio_op(bio) == REQ_OP_DISCARD;
@@ -478,8 +480,24 @@ static void raid10_end_write_request(struct bio *bio)
 			if (!test_and_set_bit(WantReplacement, &rdev->flags))
 				set_bit(MD_RECOVERY_NEEDED,
 					&rdev->mddev->recovery);
-			set_bit(R10BIO_WriteError, &r10_bio->state);
+
 			dec_rdev = 0;
+			if (test_bit(FailFast, &rdev->flags) &&
+			    (bio->bi_opf & MD_FAILFAST)) {
+				md_error(rdev->mddev, rdev);
+				if (!test_bit(Faulty, &rdev->flags))
+					/* This is the only remaining device,
+					 * We need to retry the write without
+					 * FailFast
+					 */
+					set_bit(R10BIO_WriteError, &r10_bio->state);
+				else {
+					r10_bio->devs[slot].bio = NULL;
+					to_put = bio;
+					dec_rdev = 1;
+				}
+			} else
+				set_bit(R10BIO_WriteError, &r10_bio->state);
 		}
 	} else {
 		/*
@@ -529,6 +547,8 @@ static void raid10_end_write_request(struct bio *bio)
 	one_write_done(r10_bio);
 	if (dec_rdev)
 		rdev_dec_pending(rdev, conf->mddev);
+	if (to_put)
+		bio_put(to_put);
 }
 
 /*
@@ -1391,6 +1411,9 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 			mbio->bi_bdev = rdev->bdev;
 			mbio->bi_end_io	= raid10_end_write_request;
 			bio_set_op_attrs(mbio, op, do_sync | do_fua);
+			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags) &&
+			    enough(conf, d))
+				mbio->bi_opf |= MD_FAILFAST;
 			mbio->bi_private = r10_bio;
 
 			if (conf->mddev->gendisk)
@@ -2051,6 +2074,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 		atomic_inc(&r10_bio->remaining);
 		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
 
+		if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+			tbio->bi_opf |= MD_FAILFAST;
 		tbio->bi_iter.bi_sector += conf->mirrors[d].rdev->data_offset;
 		tbio->bi_bdev = conf->mirrors[d].rdev->bdev;
 		generic_make_request(tbio);
@@ -3340,6 +3365,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			bio->bi_private = r10_bio;
 			bio->bi_end_io = end_sync_write;
 			bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+				bio->bi_opf |= MD_FAILFAST;
 			bio->bi_iter.bi_sector = sector + rdev->data_offset;
 			bio->bi_bdev = rdev->bdev;
 			count++;



^ permalink raw reply related

* [md PATCH 5/6] md/raid10: add failfast handling for reads.
From: NeilBrown @ 2016-11-18  5:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

If a device is marked FailFast, and it is not the only
device we can read from, we mark the bio as MD_FAILFAST.

If this does fail-fast, we don't try read repair but just
allow failure.

If it was the last device, it doesn't get marked Faulty so
the retry happens on the same device - this time without
FAILFAST.  A subsequent failure will not retry but will just
pass up the error.

During resync we may use FAILFAST requests, and on a failure
we will simply use the other device(s).

During recovery we will only use FAILFAST in the unusual
case were there are multiple places to read from - i.e. if
there are > 2 devices.  If we get a failure we will fail the
device and complete the resync/recovery with remaining
devices.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid10.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid10.h |    2 ++
 2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 763ca45b6b32..99fa1b980371 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -720,6 +720,7 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 	best_dist = MaxSector;
 	best_good_sectors = 0;
 	do_balance = 1;
+	clear_bit(R10BIO_FailFast, &r10_bio->state);
 	/*
 	 * Check if we can balance. We can balance on the whole
 	 * device if no resync is going on (recovery is ok), or below
@@ -784,15 +785,18 @@ static struct md_rdev *read_balance(struct r10conf *conf,
 		if (!do_balance)
 			break;
 
+		if (best_slot >= 0)
+			/* At least 2 disks to choose from so failfast is OK */
+			set_bit(R10BIO_FailFast, &r10_bio->state);
 		/* This optimisation is debatable, and completely destroys
 		 * sequential read speed for 'far copies' arrays.  So only
 		 * keep it for 'near' arrays, and review those later.
 		 */
 		if (geo->near_copies > 1 && !atomic_read(&rdev->nr_pending))
-			break;
+			new_distance = 0;
 
 		/* for far > 1 always use the lowest address */
-		if (geo->far_copies > 1)
+		else if (geo->far_copies > 1)
 			new_distance = r10_bio->devs[slot].addr;
 		else
 			new_distance = abs(r10_bio->devs[slot].addr -
@@ -1171,6 +1175,9 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		read_bio->bi_bdev = rdev->bdev;
 		read_bio->bi_end_io = raid10_end_read_request;
 		bio_set_op_attrs(read_bio, op, do_sync);
+		if (test_bit(FailFast, &rdev->flags) &&
+		    test_bit(R10BIO_FailFast, &r10_bio->state))
+			read_bio->bi_opf |= MD_FAILFAST;
 		read_bio->bi_private = r10_bio;
 
 		if (mddev->gendisk)
@@ -1987,6 +1994,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 	/* now find blocks with errors */
 	for (i=0 ; i < conf->copies ; i++) {
 		int  j, d;
+		struct md_rdev *rdev;
 
 		tbio = r10_bio->devs[i].bio;
 
@@ -1994,6 +2002,8 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			continue;
 		if (i == first)
 			continue;
+		d = r10_bio->devs[i].devnum;
+		rdev = conf->mirrors[d].rdev;
 		if (!r10_bio->devs[i].bio->bi_error) {
 			/* We know that the bi_io_vec layout is the same for
 			 * both 'first' and 'i', so we just compare them.
@@ -2016,6 +2026,10 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
 				/* Don't fix anything. */
 				continue;
+		} else if (test_bit(FailFast, &rdev->flags)) {
+			/* Just give up on this device */
+			md_error(rdev->mddev, rdev);
+			continue;
 		}
 		/* Ok, we need to write this bio, either to correct an
 		 * inconsistency or to correct an unreadable block.
@@ -2033,7 +2047,6 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 
 		bio_copy_data(tbio, fbio);
 
-		d = r10_bio->devs[i].devnum;
 		atomic_inc(&conf->mirrors[d].rdev->nr_pending);
 		atomic_inc(&r10_bio->remaining);
 		md_sync_acct(conf->mirrors[d].rdev->bdev, bio_sectors(tbio));
@@ -2540,12 +2553,14 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	bio_put(bio);
 	r10_bio->devs[slot].bio = NULL;
 
-	if (mddev->ro == 0) {
+	if (mddev->ro)
+		r10_bio->devs[slot].bio = IO_BLOCKED;
+	else if (!test_bit(FailFast, &rdev->flags)) {
 		freeze_array(conf, 1);
 		fix_read_error(conf, mddev, r10_bio);
 		unfreeze_array(conf);
 	} else
-		r10_bio->devs[slot].bio = IO_BLOCKED;
+		md_error(mddev, rdev);
 
 	rdev_dec_pending(rdev, mddev);
 
@@ -2574,6 +2589,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 		+ choose_data_offset(r10_bio, rdev);
 	bio->bi_bdev = rdev->bdev;
 	bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
+	if (test_bit(FailFast, &rdev->flags) &&
+	    test_bit(R10BIO_FailFast, &r10_bio->state))
+		bio->bi_opf |= MD_FAILFAST;
 	bio->bi_private = r10_bio;
 	bio->bi_end_io = raid10_end_read_request;
 	trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
@@ -3095,6 +3113,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				bio->bi_private = r10_bio;
 				bio->bi_end_io = end_sync_read;
 				bio_set_op_attrs(bio, REQ_OP_READ, 0);
+				if (test_bit(FailFast, &rdev->flags))
+					bio->bi_opf |= MD_FAILFAST;
 				from_addr = r10_bio->devs[j].addr;
 				bio->bi_iter.bi_sector = from_addr +
 					rdev->data_offset;
@@ -3200,6 +3220,23 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			rdev_dec_pending(mrdev, mddev);
 			if (mreplace)
 				rdev_dec_pending(mreplace, mddev);
+			if (r10_bio->devs[0].bio->bi_opf & MD_FAILFAST) {
+				/* Only want this if there is elsewhere to
+				 * read from. 'j' is currently the first
+				 * readable copy.
+				 */
+				int targets = 1;
+				for (; j < conf->copies; j++) {
+					int d = r10_bio->devs[j].devnum;
+					if (conf->mirrors[d].rdev &&
+					    test_bit(In_sync,
+						      &conf->mirrors[d].rdev->flags))
+						targets++;
+				}
+				if (targets == 1)
+					r10_bio->devs[0].bio->bi_opf
+						&= ~MD_FAILFAST;
+			}
 		}
 		if (biolist == NULL) {
 			while (r10_bio) {
@@ -3278,6 +3315,8 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 			bio->bi_private = r10_bio;
 			bio->bi_end_io = end_sync_read;
 			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+				bio->bi_opf |= MD_FAILFAST;
 			bio->bi_iter.bi_sector = sector + rdev->data_offset;
 			bio->bi_bdev = rdev->bdev;
 			count++;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 18ec1f7a98bf..3162615e57bd 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -156,5 +156,7 @@ enum r10bio_state {
  * flag is set
  */
 	R10BIO_Previous,
+/* failfast devices did receive failfast requests. */
+	R10BIO_FailFast,
 };
 #endif

^ permalink raw reply related

* Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
From: Hannes Reinecke @ 2016-11-18  7:09 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li
  Cc: linux-raid, linux-block, Christoph Hellwig, linux-kernel
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

(Seeing that it was me who initiated those patches I guess I should
speak up here)

On 11/18/2016 06:16 AM, NeilBrown wrote:
> Hi,
> 
>  I've been sitting on these patches for a while because although they
>  solve a real problem, it is a fairly limited use-case, and I don't
>  really like some of the details.
> 
>  So I'm posting them as RFC in the hope that a different perspective
>  might help me like them better, or find a better approach.
> 
[ .. ]
>  My two main concerns are:
>   - does this functionality have any use-case outside of mirrored
>     storage arrays, and are there other storage arrays which
>     occasionally inserted excessive latency (seems like a serious
>     misfeature to me, but I know few of the details)?
Yes, there are.
I've come across some storage arrays which really take some liberty when
doing internal error recovery; some even take up to 20 minutes
before sending a command completion (the response was "there's nothing
in the SCSI spec which forbids us to do so")

>   - would it be at all possible to have "real" failfast functionality
>     in the block layer?  I.e. something that is based on time rather
>     than retry count.  Maybe in some cases a retry would be
>     appropriate if the first failure was very fast.
>     I.e. it would reduce timeouts and decide on retries based on
>     elapsed time rather than number of attempts.
>     With this would come the question of "how fast is fast" and I
>     don't have a really good answer.  Maybe md would need to set a
>     timeout, which it would double whenever it got failures on all
>     drives.  Otherwise the timeout would drift towards (say) 10 times
>     the typical response time.
> 
The current 'failfast' is rather a 'do not attempt error recovery' flag;
ie the SCSI stack should _not_ start error recovery but rather pass the
request upwards in case of failure.
Problem is that there is no real upper limit on the time error recovery
could take, and it's virtually impossible to give an I/O response time
guarantees once error recovery had been invoked.
And to make matters worse, in most cases error recovery won't work
_anyway_ if the transport is severed.
So this is more to do with error recovery, and not so much on the time
each request can/should spend on the fly.

The S/390 DASD case is even worse, as the DASD driver _by design_ will
always have to wait for an answer from the storage array. So if the link
to the array is severed you are in deep trouble, as you'll never get a
completion (or any status, for that matter) until the array is reconnected.

So while the FAILFAST flag is a mere convenience for SCSI, it's a
positive must for S/390 if you want to have a functional RAID.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH] block: call trace_block_split() from bio_split()
From: Christoph Hellwig @ 2016-11-18 12:55 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-raid,
	linux-kernel
In-Reply-To: <87vavlfsar.fsf@notabene.neil.brown.name>

On Fri, Nov 18, 2016 at 01:14:20PM +1100, NeilBrown wrote:
> 
> 
> Somewhere around
> Commit: 20d0189b1012 ("block: Introduce new bio_split()")
> and
> Commit: 4b1faf931650 ("block: Kill bio_pair_split()")
> 
> in 3.14 we lost the call to trace_block_split() from bio_split().
> 
> Commit: cda22646adaa ("block: add call to split trace point")
> 
> in 4.5 added it back for blk_queue_split(), but not for other users of
> bio_split(), and particularly not for md/raid.
> 
> This patch moves the trace_block_split() call from blk_queue_split()
> to bio_split().
> As blk_queue_split() calls bio_split() (via various helper functions)
> the same events that were traced before will still be traced.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  block/bio.c       | 1 +
>  block/blk-merge.c | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index db85c5753a76..212ea95a7401 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1804,6 +1804,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
>  		bio_integrity_trim(split, 0, sectors);
>  
>  	bio_advance(bio, split->bi_iter.bi_size);
> +	trace_block_split(bdev_get_queue(bio->bi_bdev), split, bio->bi_iter.bi_sector);

This line needs to be wrapped.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH/RFC] add "failfast" support for raid1/raid10.
From: Jack Wang @ 2016-11-18 15:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, linux-raid, linux-block, Christoph Hellwig,
	linux-kernel, hare
In-Reply-To: <147944614789.3302.1959091446949640579.stgit@noble>

2016-11-18 6:16 GMT+01:00 NeilBrown <neilb@suse.com>:
> Hi,
>
>  I've been sitting on these patches for a while because although they
>  solve a real problem, it is a fairly limited use-case, and I don't
>  really like some of the details.
>
>  So I'm posting them as RFC in the hope that a different perspective
>  might help me like them better, or find a better approach.
>
>  The core idea is that when you have multiple copies of data
>  (i.e. mirrored drives) it doesn't make sense to wait for a read from
>  a drive that seems to be having problems.  It will probably be faster
>  to just cancel that read, and read from the other device.
>  Similarly, in some circumstances, it might be better to fail a drive
>  that is being slow to respond to writes, rather than cause all writes
>  to be very slow.
>
>  The particular context where this comes up is when mirroring across
>  storage arrays, where the storage arrays can temporarily take an
>  unusually long time to respond to requests (firmware updates have
>  been mentioned).  As the array will have redundancy internally, there
>  is little risk to the data.  The mirrored pair is really only for
>  disaster recovery, and it is deemed better to lose the last few
>  minutes of updates in the case of a serious disaster, rather than
>  occasionally having latency issues because one array needs to do some
>  maintenance for a few minutes.  The particular storage arrays in
>  question are DASD devices which are part of the s390 ecosystem.

Hi Neil,

Thanks for pushing this feature also to mainline.
We at Profitbricks use raid1 across IB network, one pserver with
raid1, both legs on 2 remote storages.
We've noticed if one remote storage crash , and raid1 still keep
sending IO to the faulty leg, even after 5 minutes,
md still redirect I/Os, and md refuse to remove active disks, eg:

2016-10-27T19:47:07.776233+02:00 pserver25 kernel: [184749.101984]
md/raid1:md23: Disk failure on ibnbd47, disabling device.

2016-10-27T19:47:07.776243+02:00 pserver25 kernel: [184749.101984]
md/raid1:md23: Operation continuing on 1 devices.

[...]

2016-10-27T19:47:16.171694+02:00 pserver25 kernel: [184757.498693]
md/raid1:md23: redirecting sector 79104 to other mirror: ibnbd46

[...]

2016-10-27T19:47:21.301732+02:00 pserver25 kernel: [184762.627288]
md/raid1:md23: redirecting sector 79232 to other mirror: ibnbd46

[...]

2016-10-27T19:47:35.501725+02:00 pserver25 kernel: [184776.829069] md:
cannot remove active disk ibnbd47 from md23 ...

2016-10-27T19:47:36.801769+02:00 pserver25 kernel: [184778.128856] md:
cannot remove active disk ibnbd47 from md23 ...

[...]

2016-10-27T19:52:33.401816+02:00 pserver25 kernel: [185074.727859]
md/raid1:md23: redirecting sector 72832 to other mirror: ibnbd46

2016-10-27T19:52:36.601693+02:00 pserver25 kernel: [185077.924835]
md/raid1:md23: redirecting sector 78336 to other mirror: ibnbd46

2016-10-27T19:52:36.601728+02:00 pserver25 kernel: [185077.925083]
RAID1 conf printout:

2016-10-27T19:52:36.601731+02:00 pserver25 kernel: [185077.925087]
--- wd:1 rd:2

2016-10-27T19:52:36.601733+02:00 pserver25 kernel: [185077.925091]
disk 0, wo:0, o:1, dev:ibnbd46

2016-10-27T19:52:36.601735+02:00 pserver25 kernel: [185077.925093]
disk 1, wo:1, o:0, dev:ibnbd47

2016-10-27T19:52:36.681691+02:00 pserver25 kernel: [185078.003392]
RAID1 conf printout:

2016-10-27T19:52:36.681706+02:00 pserver25 kernel: [185078.003404]
--- wd:1 rd:2

2016-10-27T19:52:36.681709+02:00 pserver25 kernel: [185078.003409]
disk 0, wo:0, o:1, dev:ibnbd46

I tried to port you patch from SLES[1], with the patchset, it reduce
the time to ~30 seconds.

I'm happy to see this feature upstream :)
I will test again this new patchset.

Cheers,
Jack Wang

^ permalink raw reply

* Re: [PATCH v2] md: add block tracing for bio_remapping
From: Shaohua Li @ 2016-11-18 17:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87shqpfrxv.fsf@notabene.neil.brown.name>

On Fri, Nov 18, 2016 at 01:22:04PM +1100, Neil Brown wrote:
> 
> The block tracing infrastructure (accessed with blktrace/blkparse)
> supports the tracing of mapping bios from one device to another.
> This is currently used when a bio in a partition is mapped to the
> whole device, when bios are mapped by dm, and for mapping in md/raid5.
> Other md personalities do not include this tracing yet, so add it.
> 
> When a read-error is detected we redirect the request to a different device.
> This could justifiably be seen as a new mapping for the originial bio,
> or a secondary mapping for the bio that errors.  This patch uses
> the second option.
> 
> When md is used under dm-raid, the mappings are not traced as we do
> not have access to the block device number of the parent.

thanks, applied patch 1, 3, 4.

Thanks,
Shaohua
 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> This is the revised version based on discussions.
> Now uses correct sector for linear and raid0, and code for raid1/raid10
> rearranged a bit.
> 
>  drivers/md/linear.c | 18 ++++++++++++------
>  drivers/md/raid0.c  | 13 ++++++++++---
>  drivers/md/raid1.c  | 26 ++++++++++++++++++++++++--
>  drivers/md/raid10.c | 29 +++++++++++++++++++++++++++--
>  4 files changed, 73 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index 9c7d4f5483ea..5975c9915684 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -21,6 +21,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "linear.h"
>  
> @@ -227,22 +228,22 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
>  	}
>  
>  	do {
> -		tmp_dev = which_dev(mddev, bio->bi_iter.bi_sector);
> +		sector_t bio_sector = bio->bi_iter.bi_sector;
> +		tmp_dev = which_dev(mddev, bio_sector);
>  		start_sector = tmp_dev->end_sector - tmp_dev->rdev->sectors;
>  		end_sector = tmp_dev->end_sector;
>  		data_offset = tmp_dev->rdev->data_offset;
>  		bio->bi_bdev = tmp_dev->rdev->bdev;
>  
> -		if (unlikely(bio->bi_iter.bi_sector >= end_sector ||
> -			     bio->bi_iter.bi_sector < start_sector))
> +		if (unlikely(bio_sector >= end_sector ||
> +			     bio_sector < start_sector))
>  			goto out_of_bounds;
>  
>  		if (unlikely(bio_end_sector(bio) > end_sector)) {
>  			/* This bio crosses a device boundary, so we have to
>  			 * split it.
>  			 */
> -			split = bio_split(bio, end_sector -
> -					  bio->bi_iter.bi_sector,
> +			split = bio_split(bio, end_sector - bio_sector,
>  					  GFP_NOIO, fs_bio_set);
>  			bio_chain(split, bio);
>  		} else {
> @@ -256,8 +257,13 @@ static void linear_make_request(struct mddev *mddev, struct bio *bio)
>  			 !blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
>  			/* Just ignore it */
>  			bio_endio(split);
> -		} else
> +		} else {
> +			if (mddev->gendisk)
> +				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
> +						      split, disk_devt(mddev->gendisk),
> +						      bio_sector);
>  			generic_make_request(split);
> +		}
>  	} while (split != bio);
>  	return;
>  
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index b3ba77a3c3bc..e628f187e5ad 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -21,6 +21,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/module.h>
>  #include <linux/slab.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "raid0.h"
>  #include "raid5.h"
> @@ -463,7 +464,8 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
>  	}
>  
>  	do {
> -		sector_t sector = bio->bi_iter.bi_sector;
> +		sector_t bio_sector = bio->bi_iter.bi_sector;
> +		sector_t sector = bio_sector;
>  		unsigned chunk_sects = mddev->chunk_sectors;
>  
>  		unsigned sectors = chunk_sects -
> @@ -472,7 +474,7 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
>  			 : sector_div(sector, chunk_sects));
>  
>  		/* Restore due to sector_div */
> -		sector = bio->bi_iter.bi_sector;
> +		sector = bio_sector;
>  
>  		if (sectors < bio_sectors(bio)) {
>  			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> @@ -491,8 +493,13 @@ static void raid0_make_request(struct mddev *mddev, struct bio *bio)
>  			 !blk_queue_discard(bdev_get_queue(split->bi_bdev)))) {
>  			/* Just ignore it */
>  			bio_endio(split);
> -		} else
> +		} else {
> +			if (mddev->gendisk)
> +				trace_block_bio_remap(bdev_get_queue(split->bi_bdev),
> +						      split, disk_devt(mddev->gendisk),
> +						      bio_sector);
>  			generic_make_request(split);
> +		}
>  	} while (split != bio);
>  }
>  
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 9ac61cd85e5c..2dc1934925ec 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -37,6 +37,7 @@
>  #include <linux/module.h>
>  #include <linux/seq_file.h>
>  #include <linux/ratelimit.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "raid1.h"
>  #include "bitmap.h"
> @@ -1162,6 +1163,11 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  		bio_set_op_attrs(read_bio, op, do_sync);
>  		read_bio->bi_private = r1_bio;
>  
> +		if (mddev->gendisk)
> +			trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
> +					      read_bio, disk_devt(mddev->gendisk),
> +					      r1_bio->sector);
> +
>  		if (max_sectors < r1_bio->sectors) {
>  			/* could not read all from this device, so we will
>  			 * need another r1_bio.
> @@ -1367,13 +1373,20 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  
>  		mbio->bi_iter.bi_sector	= (r1_bio->sector +
>  				   conf->mirrors[i].rdev->data_offset);
> -		mbio->bi_bdev = (void*)conf->mirrors[i].rdev;
> +		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
>  		mbio->bi_end_io	= raid1_end_write_request;
>  		bio_set_op_attrs(mbio, op, do_flush_fua | do_sync);
>  		mbio->bi_private = r1_bio;
>  
>  		atomic_inc(&r1_bio->remaining);
>  
> +		if (mddev->gendisk)
> +			trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
> +					      mbio, disk_devt(mddev->gendisk),
> +					      r1_bio->sector);
> +		/* flush_pending_writes() needs access to the rdev so...*/
> +		mbio->bi_bdev = (void*)conf->mirrors[i].rdev;
> +
>  		cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>  		if (cb)
>  			plug = container_of(cb, struct raid1_plug_cb, cb);
> @@ -2290,6 +2303,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>  	struct bio *bio;
>  	char b[BDEVNAME_SIZE];
>  	struct md_rdev *rdev;
> +	dev_t bio_dev;
> +	sector_t bio_sector;
>  
>  	clear_bit(R1BIO_ReadError, &r1_bio->state);
>  	/* we got a read error. Maybe the drive is bad.  Maybe just
> @@ -2303,6 +2318,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>  
>  	bio = r1_bio->bios[r1_bio->read_disk];
>  	bdevname(bio->bi_bdev, b);
> +	bio_dev = bio->bi_bdev->bd_dev;
> +	bio_sector = conf->mirrors[r1_bio->read_disk].rdev->data_offset + r1_bio->sector;
>  	bio_put(bio);
>  	r1_bio->bios[r1_bio->read_disk] = NULL;
>  
> @@ -2353,6 +2370,8 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>  			else
>  				mbio->bi_phys_segments++;
>  			spin_unlock_irq(&conf->device_lock);
> +			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
> +					      bio, bio_dev, bio_sector);
>  			generic_make_request(bio);
>  			bio = NULL;
>  
> @@ -2367,8 +2386,11 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>  				sectors_handled;
>  
>  			goto read_more;
> -		} else
> +		} else {
> +			trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
> +					      bio, bio_dev, bio_sector);
>  			generic_make_request(bio);
> +		}
>  	}
>  }
>  
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 5290be3d5c26..c63041ec9415 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -25,6 +25,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/ratelimit.h>
>  #include <linux/kthread.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "raid10.h"
>  #include "raid0.h"
> @@ -1165,6 +1166,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  		bio_set_op_attrs(read_bio, op, do_sync);
>  		read_bio->bi_private = r10_bio;
>  
> +		if (mddev->gendisk)
> +			trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
> +					      read_bio, disk_devt(mddev->gendisk),
> +					      r10_bio->sector);
>  		if (max_sectors < r10_bio->sectors) {
>  			/* Could not read all from this device, so we will
>  			 * need another r10_bio.
> @@ -1367,11 +1372,17 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr+
>  					   choose_data_offset(r10_bio,
>  							      rdev));
> -			mbio->bi_bdev = (void*)rdev;
> +			mbio->bi_bdev = rdev->bdev;
>  			mbio->bi_end_io	= raid10_end_write_request;
>  			bio_set_op_attrs(mbio, op, do_sync | do_fua);
>  			mbio->bi_private = r10_bio;
>  
> +			if (conf->mddev->gendisk)
> +				trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
> +						      mbio, disk_devt(conf->mddev->gendisk),
> +						      r10_bio->sector);
> +			mbio->bi_bdev = (void*)rdev;
> +
>  			atomic_inc(&r10_bio->remaining);
>  
>  			cb = blk_check_plugged(raid10_unplug, mddev,
> @@ -1409,11 +1420,17 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  			mbio->bi_iter.bi_sector	= (r10_bio->devs[i].addr +
>  					   choose_data_offset(
>  						   r10_bio, rdev));
> -			mbio->bi_bdev = (void*)rdev;
> +			mbio->bi_bdev = rdev->bdev;
>  			mbio->bi_end_io	= raid10_end_write_request;
>  			bio_set_op_attrs(mbio, op, do_sync | do_fua);
>  			mbio->bi_private = r10_bio;
>  
> +			if (conf->mddev->gendisk)
> +				trace_block_bio_remap(bdev_get_queue(mbio->bi_bdev),
> +						      mbio, disk_devt(conf->mddev->gendisk),
> +						      r10_bio->sector);
> +			mbio->bi_bdev = (void*)rdev;
> +
>  			atomic_inc(&r10_bio->remaining);
>  			spin_lock_irqsave(&conf->device_lock, flags);
>  			bio_list_add(&conf->pending_bio_list, mbio);
> @@ -2496,6 +2513,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>  	char b[BDEVNAME_SIZE];
>  	unsigned long do_sync;
>  	int max_sectors;
> +	dev_t bio_dev;
> +	sector_t bio_last_sector;
>  
>  	/* we got a read error. Maybe the drive is bad.  Maybe just
>  	 * the block and we can fix it.
> @@ -2507,6 +2526,8 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>  	 */
>  	bio = r10_bio->devs[slot].bio;
>  	bdevname(bio->bi_bdev, b);
> +	bio_dev = bio->bi_bdev->bd_dev;
> +	bio_last_sector = r10_bio->devs[slot].addr + rdev->data_offset + r10_bio->sectors;
>  	bio_put(bio);
>  	r10_bio->devs[slot].bio = NULL;
>  
> @@ -2546,6 +2567,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>  	bio_set_op_attrs(bio, REQ_OP_READ, do_sync);
>  	bio->bi_private = r10_bio;
>  	bio->bi_end_io = raid10_end_read_request;
> +	trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
> +			      bio, bio_dev,
> +			      bio_last_sector - r10_bio->sectors);
> +
>  	if (max_sectors < r10_bio->sectors) {
>  		/* Drat - have to split this up more */
>  		struct bio *mbio = r10_bio->master_bio;
> -- 
> 2.10.2
> 



^ permalink raw reply

* Re: [PATCH v2] md: add block tracing for bio_remapping
From: Shaohua Li @ 2016-11-18 17:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <20161118172622.hzvmtt5rbtmdh7hh@kernel.org>

On Fri, Nov 18, 2016 at 09:26:22AM -0800, Shaohua Li wrote:
> On Fri, Nov 18, 2016 at 01:22:04PM +1100, Neil Brown wrote:
> > 
> > The block tracing infrastructure (accessed with blktrace/blkparse)
> > supports the tracing of mapping bios from one device to another.
> > This is currently used when a bio in a partition is mapped to the
> > whole device, when bios are mapped by dm, and for mapping in md/raid5.
> > Other md personalities do not include this tracing yet, so add it.
> > 
> > When a read-error is detected we redirect the request to a different device.
> > This could justifiably be seen as a new mapping for the originial bio,
> > or a secondary mapping for the bio that errors.  This patch uses
> > the second option.
> > 
> > When md is used under dm-raid, the mappings are not traced as we do
> > not have access to the block device number of the parent.
> 
> thanks, applied patch 1, 3, 4.

BTW, I added below patch


commit 504634f60f463e73e7d58c6810a04437da942dba
Author: Shaohua Li <shli@fb.com>
Date:   Fri Nov 18 09:44:08 2016 -0800

    md: add blktrace event for writes to superblock
    
    superblock write is an expensive operation. With raid5-cache, it can be called
    regularly. Tracing to help performance debug.
    
    Signed-off-by: Shaohua Li <shli@fb.com>
    Cc: NeilBrown <neilb@suse.com>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f0..d3cef77 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -64,6 +64,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/raid/md_u.h>
 #include <linux/slab.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "bitmap.h"
 #include "md-cluster.h"
@@ -2403,6 +2404,8 @@ void md_update_sb(struct mddev *mddev, int force_change)
 	pr_debug("md: updating %s RAID superblock on device (in sync %d)\n",
 		 mdname(mddev), mddev->in_sync);
 
+	if (mddev->queue)
+		blk_add_trace_msg(mddev->queue, "md md_update_sb");
 	bitmap_update_sb(mddev->bitmap);
 	rdev_for_each(rdev, mddev) {
 		char b[BDEVNAME_SIZE];

^ permalink raw reply related

* Re: MD Remnants After --stop
From: Marc Smith @ 2016-11-18 19:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87fun3ond9.fsf@notabene.neil.brown.name>

On Mon, Nov 7, 2016 at 12:44 AM, NeilBrown <neilb@suse.com> wrote:
> On Sat, Nov 05 2016, Marc Smith wrote:
>
>> Hi,
>>
>> It may be that I've never noticed this before, so maybe its not a
>> problem... after using '--stop' to deactivate/stop an MD array, there
>> are remnants of it lingering, namely an entry in /sys/block (eg,
>> /sys/block/md127) and the device node in /dev remains (eg,
>> /dev/md127).
>>
>> Is this normal? Like I said, it probably is, and I've just never
>> noticed it before. I assume its not going to hurt anything, but is
>> there a way to clean it up, without rebooting? Obviously I could
>> remove the /dev entry, but what about /sys/block?
>>
>
> You can remove them both by running
>   mdadm -S /dev/md127
>
> but they'll probably just reappear again.
>
> This seems to be an on-going battle between md and udev.  I've "fixed"
> it at least once, but it keeps coming back.
>
> When md removes the md127 device, a message is sent to udev.
> As part of its response to this message, udev tries to open /dev/md127.
> Because of the rather unusual way that md devices are created (it made
> sense nearly 20 years ago when it was designed), opening /dev/md127
> causes md to create device md127 again.
>
> You could
>   mv /dev/md127 /dev/md127X
>   mdadm -S /dev/md127X
>   rm /dev/md127X
> that stop udev from opening /dev/md127.  It seems to work reliably.
>
> md used to generate a CHANGE event before the REMOVE event, and only the
> CHANGE event caused udev to open the device file.  I removed that and
> the problem went away.  Apparently some change has happened to udev and
> now it opens the file in response to REMOVE as well.

I used "udevadm monitor -pku" to watch the events when running "mdadm
--stop /dev/md127" and this is what I see:

--snip--
KERNEL[163074.119778] change   /devices/virtual/block/md127 (block)
ACTION=change
DEVNAME=/dev/md127
DEVPATH=/devices/virtual/block/md127
DEVTYPE=disk
MAJOR=9
MINOR=127
SEQNUM=3701
SUBSYSTEM=block

UDEV  [163074.121569] change   /devices/virtual/block/md127 (block)
ACTION=change
DEVNAME=/dev/md127
DEVPATH=/devices/virtual/block/md127
DEVTYPE=disk
MAJOR=9
MINOR=127
SEQNUM=3701
SUBSYSTEM=block
SYSTEMD_READY=0
USEC_INITIALIZED=370470
--snip--

I don't see any 'remove' event generated. I should mention if I hadn't
already that I'm testing md-cluster (--bitmap=clustered), and
currently using Linux 4.9-rc3.


--Marc

>
> So to "fix" it again, you need to figure out what udev is doing and fix
> that.
>
> Alternately... place "create names=yes" in your mdadm.conf
> and always use names, not numbers, to work with md arrays.
> e.g. /dev/md/home /dev/md/root /dev/md/scratch etc.
>
> When will trigger the use of an alternate scheme for creating md devices
> (using minor numbers >= 512) which udev cannot break so easily.  When it
> tries to open /dev/md_home, that will fail.
>
> NeilBrown

^ permalink raw reply

* Re: [PATCH v7 00/10] raid5-cache: enabling cache features
From: Shaohua Li @ 2016-11-18 21:22 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161117232445.1798305-1-songliubraving@fb.com>

On Thu, Nov 17, 2016 at 03:24:35PM -0800, Song Liu wrote:
> These are the 7th version of patches to enable write cache part of
> raid5-cache. The journal part was released with kernel 4.4.
> 
> The caching part uses same disk format of raid456 journal, and provides
> acceleration to writes. Write operations are committed (bio_endio) once
> the data is secured in journal. Reconstruct and RMW are postponed to
> writing-out phase, which is usually not on the critical path.
> 
> The changes are organized in 10 patches (details below).
> 
> Patch for chunk_aligned_read in earlier RFC is not included yet
> (http://marc.info/?l=linux-raid&m=146432700719277). But we may still need
> some optimizations later, especially for SSD raid devices.
> 
> Changes between v7 and v6 (http://marc.info/?l=linux-raid&m=147881079931937):
>  1. Incorprate feedbacks;
>  2. Modify representation of write-back state machine to use STRIPE_R5C_CACHING
>     instead of STRIPE_R5C_WRITE_OUT. This reduces state trasitions in
>     write-through mode;
>  3. For prexor, allocate extra orig_page instead of page;
>  4. Rename and refactor of some functions and variables;
>  5. Remove path 11 (handle alloc_page failure). I will propose a simpler retry
>     patch later.

Thanks for doing this, Song! And thanks Neil for the code review! I think this
series are in good shape. We do have a lot of todo stuffes to make these work
well, but we can always improve based on these. And since the default setting
is write-through, there should be no regression (hopefully). I'm going to apply
them to my for-next tree. If you have further bug fixes/improvements, please do
them against the for-next tree.

I don't apply the last patch though. That one doesn't work well for !write-back
case (sorry, didn't notice it in last review), as for write-through, we don't
need do any flush/fua. Please send an updated patch, I'll apply.

I'll list the todo here so we don't forget:
- makes read not enter state machine
- write checkpoint block to make recovery faster
- improve recovery
- handle memory allocation failure
- allow removing journal capability if journal disk fails

And of course performance tuning.

Thanks,
Shaohua 

^ permalink raw reply

* [PATCH v8] md/r5cache: handle SYNC and FUA
From: Song Liu @ 2016-11-19  0:46 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

With raid5 cache, we committing data from journal device. When
there is flush request, we need to flush journal device's cache.
This was not needed in raid5 journal, because we will flush the
journal before committing data to raid disks.

This is similar to FUA, except that we also need flush journal for
FUA. Otherwise, corruptions in earlier meta data will stop recovery
from reaching FUA data.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 163 +++++++++++++++++++++++++++++++++++++++++------
 drivers/md/raid5.c       |  12 ++++
 drivers/md/raid5.h       |   1 +
 3 files changed, 158 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6b99570..a904268 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -19,6 +19,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/crc32c.h>
 #include <linux/random.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -159,6 +160,9 @@ struct r5l_log {
 
 	spinlock_t stripe_in_journal_lock;
 	atomic_t stripe_in_journal_count;
+
+	/* to submit async io_units, to fulfill ordering of flush */
+	struct work_struct deferred_io_work;
 };
 
 /*
@@ -185,6 +189,18 @@ struct r5l_io_unit {
 
 	int state;
 	bool need_split_bio;
+	struct bio *split_bio;
+
+	unsigned int has_flush:1;      /* include flush request */
+	unsigned int has_fua:1;        /* include fua request */
+	unsigned int has_null_flush:1; /* include empty flush request */
+	/*
+	 * io isn't sent yet, flush/fua request can only be submitted till it's
+	 * the first IO in running_ios list
+	 */
+	unsigned int io_deferred:1;
+
+	struct bio_list flush_barriers;   /* size == 0 flush bios */
 };
 
 /* r5l_io_unit state */
@@ -494,9 +510,11 @@ static void r5l_move_to_end_ios(struct r5l_log *log)
 	}
 }
 
+static void __r5l_stripe_write_finished(struct r5l_io_unit *io);
 static void r5l_log_endio(struct bio *bio)
 {
 	struct r5l_io_unit *io = bio->bi_private;
+	struct r5l_io_unit *io_deferred;
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
@@ -512,18 +530,89 @@ static void r5l_log_endio(struct bio *bio)
 		r5l_move_to_end_ios(log);
 	else
 		r5l_log_run_stripes(log);
+	if (!list_empty(&log->running_ios)) {
+		/*
+		 * FLUSH/FUA io_unit is deferred because of ordering, now we
+		 * can dispatch it
+		 */
+		io_deferred = list_first_entry(&log->running_ios,
+					       struct r5l_io_unit, log_sibling);
+		if (io_deferred->io_deferred)
+			schedule_work(&log->deferred_io_work);
+	}
+
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 
 	if (log->need_cache_flush)
 		md_wakeup_thread(log->rdev->mddev->thread);
+
+	if (io->has_null_flush) {
+		struct bio *bi;
+
+		WARN_ON(bio_list_empty(&io->flush_barriers));
+		while ((bi = bio_list_pop(&io->flush_barriers)) != NULL) {
+			bio_endio(bi);
+			atomic_dec(&io->pending_stripe);
+		}
+		if (atomic_read(&io->pending_stripe) == 0)
+			__r5l_stripe_write_finished(io);
+	}
+}
+
+static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->current_bio);
+
+	if (!io->split_bio)
+		return;
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->split_bio);
+}
+
+/* deferred io_unit will be dispatched here */
+static void r5l_submit_io_async(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   deferred_io_work);
+	struct r5l_io_unit *io = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	if (!list_empty(&log->running_ios)) {
+		io = list_first_entry(&log->running_ios, struct r5l_io_unit,
+				      log_sibling);
+		if (!io->io_deferred)
+			io = NULL;
+		else
+			io->io_deferred = 0;
+	}
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+	if (io)
+		r5l_do_submit_io(log, io);
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
 {
 	struct r5l_io_unit *io = log->current_io;
+	struct bio *bio;
 	struct r5l_meta_block *block;
 	unsigned long flags;
 	u32 crc;
+	bool do_submit = true;
 
 	if (!io)
 		return;
@@ -532,13 +621,20 @@ static void r5l_submit_current_io(struct r5l_log *log)
 	block->meta_size = cpu_to_le32(io->meta_offset);
 	crc = crc32c_le(log->uuid_checksum, block, PAGE_SIZE);
 	block->checksum = cpu_to_le32(crc);
+	bio = io->current_bio;
 
 	log->current_io = NULL;
 	spin_lock_irqsave(&log->io_list_lock, flags);
-	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	if (io->has_flush || io->has_fua) {
+		if (io != list_first_entry(&log->running_ios,
+					   struct r5l_io_unit, log_sibling)) {
+			io->io_deferred = 1;
+			do_submit = false;
+		}
+	}
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
-
-	submit_bio(io->current_bio);
+	if (do_submit)
+		r5l_do_submit_io(log, io);
 }
 
 static struct bio *r5l_bio_alloc(struct r5l_log *log)
@@ -583,6 +679,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
 	io->log = log;
 	INIT_LIST_HEAD(&io->log_sibling);
 	INIT_LIST_HEAD(&io->stripe_list);
+	bio_list_init(&io->flush_barriers);
 	io->state = IO_UNIT_RUNNING;
 
 	io->meta_page = mempool_alloc(log->meta_pool, GFP_NOIO);
@@ -653,12 +750,11 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
 	struct r5l_io_unit *io = log->current_io;
 
 	if (io->need_split_bio) {
-		struct bio *prev = io->current_bio;
-
+		BUG_ON(io->split_bio);
+		io->split_bio = io->current_bio;
 		io->current_bio = r5l_bio_alloc(log);
-		bio_chain(io->current_bio, prev);
-
-		submit_bio(prev);
+		bio_chain(io->current_bio, io->split_bio);
+		io->need_split_bio = false;
 	}
 
 	if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0))
@@ -687,12 +783,24 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 
 	io = log->current_io;
 
+	if (test_and_clear_bit(STRIPE_R5C_PREFLUSH, &sh->state))
+		io->has_flush = 1;
+
 	for (i = 0; i < sh->disks; i++) {
 		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) ||
 		    test_bit(R5_InJournal, &sh->dev[i].flags))
 			continue;
 		if (i == sh->pd_idx || i == sh->qd_idx)
 			continue;
+		if (test_bit(R5_WantFUA, &sh->dev[i].flags) &&
+		    log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK) {
+			io->has_fua = 1;
+			/*
+			 * we need to flush journal to make sure recovery can
+			 * reach the data with fua flag
+			 */
+			io->has_flush = 1;
+		}
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA,
 					raid5_compute_blocknr(sh, i, 0),
 					sh->dev[i].log_checksum, 0, false);
@@ -856,17 +964,34 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
 {
 	if (!log)
 		return -ENODEV;
-	/*
-	 * we flush log disk cache first, then write stripe data to raid disks.
-	 * So if bio is finished, the log disk cache is flushed already. The
-	 * recovery guarantees we can recovery the bio from log disk, so we
-	 * don't need to flush again
-	 */
-	if (bio->bi_iter.bi_size == 0) {
-		bio_endio(bio);
-		return 0;
+
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
+		/*
+		 * in write through (journal only)
+		 * we flush log disk cache first, then write stripe data to
+		 * raid disks. So if bio is finished, the log disk cache is
+		 * flushed already. The recovery guarantees we can recovery
+		 * the bio from log disk, so we don't need to flush again
+		 */
+		if (bio->bi_iter.bi_size == 0) {
+			bio_endio(bio);
+			return 0;
+		}
+		bio->bi_opf &= ~REQ_PREFLUSH;
+	} else {
+		/* write back (with cache) */
+		if (bio->bi_iter.bi_size == 0) {
+			mutex_lock(&log->io_mutex);
+			r5l_get_meta(log, 0);
+			bio_list_add(&log->current_io->flush_barriers, bio);
+			log->current_io->has_flush = 1;
+			log->current_io->has_null_flush = 1;
+			atomic_inc(&log->current_io->pending_stripe);
+			r5l_submit_current_io(log);
+			mutex_unlock(&log->io_mutex);
+			return 0;
+		}
 	}
-	bio->bi_opf &= ~REQ_PREFLUSH;
 	return -EAGAIN;
 }
 
@@ -2470,6 +2595,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
+
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 	INIT_LIST_HEAD(&log->stripe_in_journal_list);
 	spin_lock_init(&log->stripe_in_journal_lock);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index aa4968c..a850663 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5248,6 +5248,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
+	bool do_flush = false;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = r5l_handle_flush_request(conf->log, bi);
@@ -5259,6 +5260,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 			return;
 		}
 		/* ret == -EAGAIN, fallback */
+		/*
+		 * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
+		 * we need to flush journal device
+		 */
+		do_flush = (bi->bi_opf & REQ_PREFLUSH) != 0;
 	}
 
 	md_write_start(mddev, bi);
@@ -5398,6 +5404,12 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 				do_prepare = true;
 				goto retry;
 			}
+			if (do_flush) {
+				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+				/* we only need flush for one stripe */
+				do_flush = false;
+			}
+
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if ((!sh->batch_head || sh == sh->batch_head) &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index a698113..d13fe45 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -376,6 +376,7 @@ enum {
 	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
 				 * in conf->r5c_full_stripe_list)
 				 */
+	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v8] md/r5cache: handle SYNC and FUA
From: Shaohua Li @ 2016-11-19  1:00 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan
In-Reply-To: <20161119004650.557540-1-songliubraving@fb.com>

On Fri, Nov 18, 2016 at 04:46:50PM -0800, Song Liu wrote:
> With raid5 cache, we committing data from journal device. When
> there is flush request, we need to flush journal device's cache.
> This was not needed in raid5 journal, because we will flush the
> journal before committing data to raid disks.
> 
> This is similar to FUA, except that we also need flush journal for
> FUA. Otherwise, corruptions in earlier meta data will stop recovery
> from reaching FUA data.

Looks good, applied!

> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 163 +++++++++++++++++++++++++++++++++++++++++------
>  drivers/md/raid5.c       |  12 ++++
>  drivers/md/raid5.h       |   1 +
>  3 files changed, 158 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 6b99570..a904268 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -19,6 +19,7 @@
>  #include <linux/raid/md_p.h>
>  #include <linux/crc32c.h>
>  #include <linux/random.h>
> +#include <trace/events/block.h>
>  #include "md.h"
>  #include "raid5.h"
>  #include "bitmap.h"
> @@ -159,6 +160,9 @@ struct r5l_log {
>  
>  	spinlock_t stripe_in_journal_lock;
>  	atomic_t stripe_in_journal_count;
> +
> +	/* to submit async io_units, to fulfill ordering of flush */
> +	struct work_struct deferred_io_work;
>  };
>  
>  /*
> @@ -185,6 +189,18 @@ struct r5l_io_unit {
>  
>  	int state;
>  	bool need_split_bio;
> +	struct bio *split_bio;
> +
> +	unsigned int has_flush:1;      /* include flush request */
> +	unsigned int has_fua:1;        /* include fua request */
> +	unsigned int has_null_flush:1; /* include empty flush request */
> +	/*
> +	 * io isn't sent yet, flush/fua request can only be submitted till it's
> +	 * the first IO in running_ios list
> +	 */
> +	unsigned int io_deferred:1;
> +
> +	struct bio_list flush_barriers;   /* size == 0 flush bios */
>  };
>  
>  /* r5l_io_unit state */
> @@ -494,9 +510,11 @@ static void r5l_move_to_end_ios(struct r5l_log *log)
>  	}
>  }
>  
> +static void __r5l_stripe_write_finished(struct r5l_io_unit *io);
>  static void r5l_log_endio(struct bio *bio)
>  {
>  	struct r5l_io_unit *io = bio->bi_private;
> +	struct r5l_io_unit *io_deferred;
>  	struct r5l_log *log = io->log;
>  	unsigned long flags;
>  
> @@ -512,18 +530,89 @@ static void r5l_log_endio(struct bio *bio)
>  		r5l_move_to_end_ios(log);
>  	else
>  		r5l_log_run_stripes(log);
> +	if (!list_empty(&log->running_ios)) {
> +		/*
> +		 * FLUSH/FUA io_unit is deferred because of ordering, now we
> +		 * can dispatch it
> +		 */
> +		io_deferred = list_first_entry(&log->running_ios,
> +					       struct r5l_io_unit, log_sibling);
> +		if (io_deferred->io_deferred)
> +			schedule_work(&log->deferred_io_work);
> +	}
> +
>  	spin_unlock_irqrestore(&log->io_list_lock, flags);
>  
>  	if (log->need_cache_flush)
>  		md_wakeup_thread(log->rdev->mddev->thread);
> +
> +	if (io->has_null_flush) {
> +		struct bio *bi;
> +
> +		WARN_ON(bio_list_empty(&io->flush_barriers));
> +		while ((bi = bio_list_pop(&io->flush_barriers)) != NULL) {
> +			bio_endio(bi);
> +			atomic_dec(&io->pending_stripe);
> +		}
> +		if (atomic_read(&io->pending_stripe) == 0)
> +			__r5l_stripe_write_finished(io);
> +	}
> +}
> +
> +static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&log->io_list_lock, flags);
> +	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
> +	spin_unlock_irqrestore(&log->io_list_lock, flags);
> +
> +	if (io->has_flush)
> +		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FLUSH);
> +	if (io->has_fua)
> +		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FUA);
> +	submit_bio(io->current_bio);
> +
> +	if (!io->split_bio)
> +		return;
> +
> +	if (io->has_flush)
> +		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FLUSH);
> +	if (io->has_fua)
> +		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FUA);
> +	submit_bio(io->split_bio);
> +}
> +
> +/* deferred io_unit will be dispatched here */
> +static void r5l_submit_io_async(struct work_struct *work)
> +{
> +	struct r5l_log *log = container_of(work, struct r5l_log,
> +					   deferred_io_work);
> +	struct r5l_io_unit *io = NULL;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&log->io_list_lock, flags);
> +	if (!list_empty(&log->running_ios)) {
> +		io = list_first_entry(&log->running_ios, struct r5l_io_unit,
> +				      log_sibling);
> +		if (!io->io_deferred)
> +			io = NULL;
> +		else
> +			io->io_deferred = 0;
> +	}
> +	spin_unlock_irqrestore(&log->io_list_lock, flags);
> +	if (io)
> +		r5l_do_submit_io(log, io);
>  }
>  
>  static void r5l_submit_current_io(struct r5l_log *log)
>  {
>  	struct r5l_io_unit *io = log->current_io;
> +	struct bio *bio;
>  	struct r5l_meta_block *block;
>  	unsigned long flags;
>  	u32 crc;
> +	bool do_submit = true;
>  
>  	if (!io)
>  		return;
> @@ -532,13 +621,20 @@ static void r5l_submit_current_io(struct r5l_log *log)
>  	block->meta_size = cpu_to_le32(io->meta_offset);
>  	crc = crc32c_le(log->uuid_checksum, block, PAGE_SIZE);
>  	block->checksum = cpu_to_le32(crc);
> +	bio = io->current_bio;
>  
>  	log->current_io = NULL;
>  	spin_lock_irqsave(&log->io_list_lock, flags);
> -	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
> +	if (io->has_flush || io->has_fua) {
> +		if (io != list_first_entry(&log->running_ios,
> +					   struct r5l_io_unit, log_sibling)) {
> +			io->io_deferred = 1;
> +			do_submit = false;
> +		}
> +	}
>  	spin_unlock_irqrestore(&log->io_list_lock, flags);
> -
> -	submit_bio(io->current_bio);
> +	if (do_submit)
> +		r5l_do_submit_io(log, io);
>  }
>  
>  static struct bio *r5l_bio_alloc(struct r5l_log *log)
> @@ -583,6 +679,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
>  	io->log = log;
>  	INIT_LIST_HEAD(&io->log_sibling);
>  	INIT_LIST_HEAD(&io->stripe_list);
> +	bio_list_init(&io->flush_barriers);
>  	io->state = IO_UNIT_RUNNING;
>  
>  	io->meta_page = mempool_alloc(log->meta_pool, GFP_NOIO);
> @@ -653,12 +750,11 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
>  	struct r5l_io_unit *io = log->current_io;
>  
>  	if (io->need_split_bio) {
> -		struct bio *prev = io->current_bio;
> -
> +		BUG_ON(io->split_bio);
> +		io->split_bio = io->current_bio;
>  		io->current_bio = r5l_bio_alloc(log);
> -		bio_chain(io->current_bio, prev);
> -
> -		submit_bio(prev);
> +		bio_chain(io->current_bio, io->split_bio);
> +		io->need_split_bio = false;
>  	}
>  
>  	if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0))
> @@ -687,12 +783,24 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>  
>  	io = log->current_io;
>  
> +	if (test_and_clear_bit(STRIPE_R5C_PREFLUSH, &sh->state))
> +		io->has_flush = 1;
> +
>  	for (i = 0; i < sh->disks; i++) {
>  		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) ||
>  		    test_bit(R5_InJournal, &sh->dev[i].flags))
>  			continue;
>  		if (i == sh->pd_idx || i == sh->qd_idx)
>  			continue;
> +		if (test_bit(R5_WantFUA, &sh->dev[i].flags) &&
> +		    log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK) {
> +			io->has_fua = 1;
> +			/*
> +			 * we need to flush journal to make sure recovery can
> +			 * reach the data with fua flag
> +			 */
> +			io->has_flush = 1;
> +		}
>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA,
>  					raid5_compute_blocknr(sh, i, 0),
>  					sh->dev[i].log_checksum, 0, false);
> @@ -856,17 +964,34 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
>  {
>  	if (!log)
>  		return -ENODEV;
> -	/*
> -	 * we flush log disk cache first, then write stripe data to raid disks.
> -	 * So if bio is finished, the log disk cache is flushed already. The
> -	 * recovery guarantees we can recovery the bio from log disk, so we
> -	 * don't need to flush again
> -	 */
> -	if (bio->bi_iter.bi_size == 0) {
> -		bio_endio(bio);
> -		return 0;
> +
> +	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
> +		/*
> +		 * in write through (journal only)
> +		 * we flush log disk cache first, then write stripe data to
> +		 * raid disks. So if bio is finished, the log disk cache is
> +		 * flushed already. The recovery guarantees we can recovery
> +		 * the bio from log disk, so we don't need to flush again
> +		 */
> +		if (bio->bi_iter.bi_size == 0) {
> +			bio_endio(bio);
> +			return 0;
> +		}
> +		bio->bi_opf &= ~REQ_PREFLUSH;
> +	} else {
> +		/* write back (with cache) */
> +		if (bio->bi_iter.bi_size == 0) {
> +			mutex_lock(&log->io_mutex);
> +			r5l_get_meta(log, 0);
> +			bio_list_add(&log->current_io->flush_barriers, bio);
> +			log->current_io->has_flush = 1;
> +			log->current_io->has_null_flush = 1;
> +			atomic_inc(&log->current_io->pending_stripe);
> +			r5l_submit_current_io(log);
> +			mutex_unlock(&log->io_mutex);
> +			return 0;
> +		}
>  	}
> -	bio->bi_opf &= ~REQ_PREFLUSH;
>  	return -EAGAIN;
>  }
>  
> @@ -2470,6 +2595,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	INIT_LIST_HEAD(&log->no_space_stripes);
>  	spin_lock_init(&log->no_space_stripes_lock);
>  
> +	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
> +
>  	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>  	INIT_LIST_HEAD(&log->stripe_in_journal_list);
>  	spin_lock_init(&log->stripe_in_journal_lock);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index aa4968c..a850663 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5248,6 +5248,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>  	int remaining;
>  	DEFINE_WAIT(w);
>  	bool do_prepare;
> +	bool do_flush = false;
>  
>  	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
>  		int ret = r5l_handle_flush_request(conf->log, bi);
> @@ -5259,6 +5260,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>  			return;
>  		}
>  		/* ret == -EAGAIN, fallback */
> +		/*
> +		 * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
> +		 * we need to flush journal device
> +		 */
> +		do_flush = (bi->bi_opf & REQ_PREFLUSH) != 0;
>  	}
>  
>  	md_write_start(mddev, bi);
> @@ -5398,6 +5404,12 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>  				do_prepare = true;
>  				goto retry;
>  			}
> +			if (do_flush) {
> +				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
> +				/* we only need flush for one stripe */
> +				do_flush = false;
> +			}
> +
>  			set_bit(STRIPE_HANDLE, &sh->state);
>  			clear_bit(STRIPE_DELAYED, &sh->state);
>  			if ((!sh->batch_head || sh == sh->batch_head) &&
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index a698113..d13fe45 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -376,6 +376,7 @@ enum {
>  	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
>  				 * in conf->r5c_full_stripe_list)
>  				 */
> +	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
>  };
>  
>  #define STRIPE_EXPAND_SYNC_FLAGS \
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] md/r5cache: handle alloc_page failure
From: Song Liu @ 2016-11-19  7:20 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

RMW of r5c write back cache uses an extra page to store old data for
prexor. handle_stripe_dirtying() allocates this page by calling
alloc_page(). However, alloc_page() may fail.

To handle alloc_page() failures, this patch adds a small mempool
in r5l_log. When alloc_page fails, the stripe is added to a waiting
list. Then, these stripes get pages from the mempool (from work queue).

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.c       |  34 +++++++++++-----
 drivers/md/raid5.h       |   6 +++
 3 files changed, 130 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 8cb79fc..ce7f114 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -51,6 +51,8 @@
  */
 #define R5L_POOL_SIZE	4
 
+#define R5C_EXTRA_PAGE_POOL_SIZE	2
+
 /*
  * r5c journal modes of the array: write-back or write-through.
  * write-through mode has identical behavior as existing log only
@@ -162,6 +164,16 @@ struct r5l_log {
 
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
+
+	/* to handle alloc_page failures in handle_stripe_dirtying() */
+	/* mempool for up to R5C_EXTRA_PAGE_POOL_SIZE stripes */
+	mempool_t *extra_page_pool;
+	/* list of stripes waiting to use the page pool */
+	struct list_head extra_page_pool_list;
+	/* lock for extra_page_pool_list */
+	spinlock_t extra_page_pool_lock;
+	/* work that allocates pages from extra_page_pool */
+	struct work_struct extra_page_pool_work;
 };
 
 /*
@@ -252,6 +264,69 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+void r5c_add_to_extra_page_pool_list(struct stripe_head *sh,
+				     struct stripe_head_state *s)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
+	int i;
+	struct page *p;
+
+	BUG_ON(!log);
+
+	/* free any extra orig_page from alloc_page() */
+	for (i = sh->disks; i--; )
+		if (sh->dev[i].page != sh->dev[i].orig_page) {
+			p = sh->dev[i].orig_page;
+			sh->dev[i].orig_page = sh->dev[i].page;
+			put_page(p);
+		}
+
+	WARN_ON(!list_empty(&sh->log_list));
+
+	atomic_inc(&sh->count);
+	set_bit(STRIPE_R5C_EXTRA_PAGE, &sh->state);
+
+	spin_lock(&log->extra_page_pool_lock);
+	list_add_tail(&sh->log_list, &log->extra_page_pool_list);
+	spin_unlock(&log->extra_page_pool_lock);
+
+	s->waiting_extra_page = 1;
+	schedule_work(&log->extra_page_pool_work);
+}
+
+static void r5c_run_extra_page_pool_list(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   extra_page_pool_work);
+	struct stripe_head *sh;
+	struct r5dev *dev;
+	int i;
+
+	while (1) {
+		spin_lock(&log->extra_page_pool_lock);
+		if (list_empty(&log->extra_page_pool_list)) {
+			spin_unlock(&log->extra_page_pool_lock);
+			break;
+		}
+		sh = list_first_entry(&log->extra_page_pool_list,
+				      struct stripe_head, log_list);
+		list_del_init(&sh->log_list);
+		spin_unlock(&log->extra_page_pool_lock);
+
+		for (i = sh->disks; i--; ) {
+			dev = &sh->dev[i];
+			BUG_ON(dev->page != dev->orig_page);
+
+			if (test_bit(R5_InJournal, &dev->flags))
+				dev->orig_page = mempool_alloc(
+					log->extra_page_pool, GFP_NOIO);
+		}
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+}
+
 static void
 r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
 			      struct bio_list *return_bi)
@@ -2334,14 +2409,25 @@ int r5c_try_caching_write(struct r5conf *conf,
  */
 void r5c_release_extra_page(struct stripe_head *sh)
 {
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
 	int i;
+	bool using_extra_page_pool;;
+
+	BUG_ON(!log);
+
+	using_extra_page_pool = test_and_clear_bit(
+		STRIPE_R5C_EXTRA_PAGE, &sh->state);
 
 	for (i = sh->disks; i--; )
 		if (sh->dev[i].page != sh->dev[i].orig_page) {
 			struct page *p = sh->dev[i].orig_page;
 
 			sh->dev[i].orig_page = sh->dev[i].page;
-			put_page(p);
+			if (using_extra_page_pool)
+				mempool_free(p, log->extra_page_pool);
+			else
+				put_page(p);
 		}
 }
 
@@ -2581,6 +2667,15 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->meta_pool)
 		goto out_mempool;
 
+	log->extra_page_pool = mempool_create_page_pool(
+		R5C_EXTRA_PAGE_POOL_SIZE *
+		(conf->raid_disks - conf->max_degraded), 0);
+	if (!log->extra_page_pool)
+		goto extra_page_pool;
+	INIT_LIST_HEAD(&log->extra_page_pool_list);
+	spin_lock_init(&log->extra_page_pool_lock);
+	INIT_WORK(&log->extra_page_pool_work, r5c_run_extra_page_pool_list);
+
 	log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
@@ -2611,6 +2706,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 error:
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
+	mempool_destroy(log->extra_page_pool);
+extra_page_pool:
 	mempool_destroy(log->meta_pool);
 out_mempool:
 	bioset_free(log->bs);
@@ -2626,6 +2723,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 void r5l_exit_log(struct r5l_log *log)
 {
 	md_unregister_thread(&log->reclaim_thread);
+	mempool_destroy(log->extra_page_pool);
 	mempool_destroy(log->meta_pool);
 	bioset_free(log->bs);
 	mempool_destroy(log->io_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index dbab8c7..5870ca9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -876,6 +876,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
 		/* writing out phase */
+		if (s->waiting_extra_page)
+			return;
 		if (r5l_write_stripe(conf->log, sh) == 0)
 			return;
 	} else {  /* caching phase */
@@ -2007,6 +2009,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		INIT_LIST_HEAD(&sh->r5c);
+		INIT_LIST_HEAD(&sh->log_list);
 		atomic_set(&sh->count, 1);
 		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
@@ -2897,6 +2900,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 				s->locked++;
 			}
 		}
+
 		/* if we are not expanding this is a proper write request, and
 		 * there will be bios with new data to be drained into the
 		 * stripe cache
@@ -3580,10 +3584,10 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 		break_stripe_batch_list(head_sh, STRIPE_EXPAND_SYNC_FLAGS);
 }
 
-static void handle_stripe_dirtying(struct r5conf *conf,
-				   struct stripe_head *sh,
-				   struct stripe_head_state *s,
-				   int disks)
+static int handle_stripe_dirtying(struct r5conf *conf,
+				  struct stripe_head *sh,
+				  struct stripe_head_state *s,
+				  int disks)
 {
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
@@ -3649,12 +3653,19 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 			    dev->page == dev->orig_page &&
 			    !test_bit(R5_LOCKED, &sh->dev[sh->pd_idx].flags)) {
 				/* alloc page for prexor */
-				dev->orig_page = alloc_page(GFP_NOIO);
+				struct page *p;
 
-				/* will handle failure in a later patch*/
-				BUG_ON(!dev->orig_page);
+				p = alloc_page(GFP_NOIO);
+				if (!p) {
+					r5c_add_to_extra_page_pool_list(sh, s);
+					return -EAGAIN;
+				}
+				dev->orig_page = p;
 			}
+		}
 
+		for (i = disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
 			if ((dev->towrite ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
@@ -3730,6 +3741,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
 	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
 		schedule_reconstruction(sh, s, rcw == 0, 0);
+	return 0;
 }
 
 static void handle_parity_checks5(struct r5conf *conf, struct stripe_head *sh,
@@ -4545,8 +4557,12 @@ static void handle_stripe(struct stripe_head *sh)
 			if (ret == -EAGAIN ||
 			    /* stripe under reclaim: !caching && injournal */
 			    (!test_bit(STRIPE_R5C_CACHING, &sh->state) &&
-			     s.injournal > 0))
-				handle_stripe_dirtying(conf, sh, &s, disks);
+			     s.injournal > 0)) {
+				ret = handle_stripe_dirtying(conf, sh, &s,
+							     disks);
+				if (ret == -EAGAIN)
+					goto finish;
+			}
 		}
 	}
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index d13fe45..e0efd46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -276,6 +276,7 @@ struct stripe_head_state {
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
 	int log_failed;
+	int waiting_extra_page;
 };
 
 /* Flags for struct r5dev.flags */
@@ -377,6 +378,9 @@ enum {
 				 * in conf->r5c_full_stripe_list)
 				 */
 	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
+	STRIPE_R5C_EXTRA_PAGE,	/* extra orig_page of this stripe is allocated
+				 * from r5l_log->extra_page_pool
+				 */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -774,5 +778,7 @@ extern void r5c_make_stripe_write_out(struct stripe_head *sh);
 extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern void r5c_add_to_extra_page_pool_list(struct stripe_head *sh,
+					    struct stripe_head_state *s);
 extern struct md_sysfs_entry r5c_journal_mode;
 #endif
-- 
2.9.3


^ permalink raw reply related

* linux raid wiki - backup files
From: Wols Lists @ 2016-11-19 16:03 UTC (permalink / raw)
  To: linux-raid

I'm updating the mdadm overview at the moment, and I'm giving examples
for changing raid levels etc.

Is a backup file still required? I get the impression with current
kernels and mdadm, any required backup is put in the space left by "data
offset", or if you're adding a disk it stores it in the space on that disk.

Okay, I'll have to document that older versions still need it, but if
that's not needed any more, I'd rather not emphasize it.

I'd also like to confirm my understanding of how a resize works ...

If the number of raid devices is increased, am I right in thinking that
a new stripe one is created from the old stripes one and two, then
written, then the new stripe two is created from the old two and three,
etc etc? In effect, the data slumps down from the old array into the new?

And of course the reverse when a device is removed - it starts with the
highest stripe, which gets pulled upwards so the old array gets pulled
up into the new?

(And during the reshape, you have a marker of the current stripe being
updated, so any stripes above that get read from one array, and below
they get read from the other.)

Cheers,
Wol

^ permalink raw reply

* Re: linux raid wiki - backup files
From: Phil Turmel @ 2016-11-20  0:27 UTC (permalink / raw)
  To: Wols Lists, linux-raid
In-Reply-To: <583077D0.5070804@youngman.org.uk>

On 11/19/2016 11:03 AM, Wols Lists wrote:
> I'm updating the mdadm overview at the moment, and I'm giving examples
> for changing raid levels etc.
> 
> Is a backup file still required? I get the impression with current
> kernels and mdadm, any required backup is put in the space left by "data
> offset", or if you're adding a disk it stores it in the space on that disk.

You can't change the data offset of zero for metadata v0.90 and v1.0,
the latter of which isn't obsolete.  So a backup file would be needed
for many --grow operations on those arrays.

> I'd also like to confirm my understanding of how a resize works ...

Just to clarify:  --grow with a change in number of devices or layout is
called a reshape, not a resize, even though the size may change.  Just
changing the amount of space used per device is a "resize".  A simple
resize never requires a backup file, as no chunks will move.

> If the number of raid devices is increased, am I right in thinking that
> a new stripe one is created from the old stripes one and two, then
> written, then the new stripe two is created from the old two and three,
> etc etc? In effect, the data slumps down from the old array into the new?
> 
> And of course the reverse when a device is removed - it starts with the
> highest stripe, which gets pulled upwards so the old array gets pulled
> up into the new?

Yes.  But the new stripes lay on top of the old stripes, unless you move
the data offset.  Which is why a backup file holds the old stripe just
in case.  If you can move the offset, you use the lower offset for the
lower addresses in the array, and the higher offset for the higher
addresses, on either side of the reshape position.

> (And during the reshape, you have a marker of the current stripe being
> updated, so any stripes above that get read from one array, and below
> they get read from the other.)

Yes.  And the stripe being worked on is "frozen" to prevent an upper
layer from interfering.

Phil

^ permalink raw reply

* Re: [BUG 4.4.26] bio->bi_bdev == NULL in raid6 return_io()
From: Konstantin Khlebnikov @ 2016-11-20 10:55 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Shaohua Li
  Cc: linux-kernel@vger.kernel.org, linux-raid, linux-block, Neil Brown,
	Jens Axboe
In-Reply-To: <CALYGNiMSEH-tnfzYMKuYe4KRaY6+s8KeKgUUSZRK5fQiwv3y_g@mail.gmail.com>

On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li <shli@kernel.org> wrote:
>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>> return_io() resolves request_queue even if trace point isn't active:
>>>
>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>> {
>>>       return bdev->bd_disk->queue;    /* this is never NULL */
>>> }
>>>
>>> static void return_io(struct bio_list *return_bi)
>>> {
>>>       struct bio *bi;
>>>       while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>               bi->bi_iter.bi_size = 0;
>>>               trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>                                        bi, 0);
>>>               bio_endio(bi);
>>>       }
>>> }
>>
>> I can't see how this could happen. What kind of tests/environment are these running?
>
> That was a random piece of production somewhere.
> Cording to time all crashes happened soon after reboot.
> There're several raids, probably some of them were still under resync.
>
> For now we have only few machines with this kernel. But I'm sure that
> I'll get much more soon =)

I've added this debug patch for catching overflow of active stripes in bio

--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
  {
         atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
         atomic_inc(segments);
+       BUG_ON(!(atomic_read(segments) & 0xffff));
  }

And got this. Counter in %edx = 0x00010000

So, looks like one bio (discard?) can cover more than 65535 stripes

<2>[97201.226871] kernel BUG at drivers/md/raid5.c:167!
<4>[97201.232128] invalid opcode: 0000 [#1] SMP
<4>[97201.236735] Modules linked in: netconsole configfs ip6table_filter ip6_tables tcp_diag iptable_filter inet_diag ip_tables unix_diag 
x_tables ipmi_devintf 8021q mrp garp stp llc cls_u32 sch_prio ipmi_ssif nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc grace intel_rapl 
iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul serio_raw joydev input_leds sb_edac edac_core 
mei_me lpc_ich mei ioatdma ipmi_si ipmi_msghandler 8250_fintek wmi shpchp mac_hid ip6_tunnel tunnel6 ipip ip_tunnel tunnel4 xfs raid10 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c hid_generic usbhid igb hid i2c_algo_bit raid1 isci 
dca raid0 libsas ahci ptp psmouse multipath libahci scsi_transport_sas pps_core fjes linear
<4>[97201.313832] CPU: 0 PID: 667624 Comm: kworker/0:0H Not tainted 4.4.32-11 #1
<4>[97201.321519] Hardware name: Aquarius Aquarius Server/X9DRW, BIOS 3.0c 10/30/2014
<4>[97201.329739] Workqueue: xfs-log/dm-0 xfs_buf_ioend_work [xfs]
<4>[97201.336083] task: ffff88084a27ec00 ti: ffff8808f70a8000 task.ti: ffff8808f70a8000
<4>[97201.344448] RIP: raid5_inc_bi_active_stripes.part.34 (drivers/md/raid5.c:589) raid456
<4>[97201.356423] RSP: 0018:ffff8808f70ab948  EFLAGS: 00010046
<4>[97201.362388] RAX: 0000000000000002 RBX: ffff8810356a6000 RCX: ffff8810356a6290
<4>[97201.370392] RDX: 0000000000010000 RSI: 0000000000000087 RDI: 0000000000000087
<4>[97201.378395] RBP: ffff8808f70ab948 R08: ffff8808f70ab638 R09: 000000001a7a8400
<4>[97201.386398] R10: 0000000000000006 R11: 0000000000000006 R12: ffff881efd418000
<4>[97201.394401] R13: 000000001a7a84b0 R14: ffff881034edec00 R15: 000000001a7c8400
<4>[97201.402406] FS:  0000000000000000(0000) GS:ffff88103fa00000(0000) knlGS:0000000000000000
<4>[97201.411500] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[97201.417949] CR2: 000055acd2d2bb10 CR3: 00000010364f3000 CR4: 00000000000406f0
<4>[97201.425953] Stack:
<4>[97201.428224]  ffff8808f70aba48 ffffffffa01b54fa ffff881034edec00 ffff88084a27ec00
<4>[97201.436589]  ffff8808f70ab998 ffff881034edec00 0000000000000081 ffff88084a27ec00
<4>[97201.444952]  ffff8808f70ab9f0 ffff881038a37800 ffff8810356a6288 ffff881efd418068
<4>[97201.453317] Call Trace:
<4>[97201.456078] make_request (drivers/md/raid5.c:4792 drivers/md/raid5.c:4882 drivers/md/raid5.c:5177) raid456
<4>[97201.463209] ? add_wait_queue (kernel/sched/wait.c:292)
<4>[97201.469562] ? add_wait_queue (kernel/sched/wait.c:292)
<4>[97201.475915] md_make_request (./arch/x86/include/asm/preempt.h:69 include/linux/rcupdate.h:301 include/linux/rcupdate.h:859 
drivers/md/md.c:300)
<4>[97201.482365] ? dm_make_request (drivers/md/dm.c:1785)
<4>[97201.488814] generic_make_request (./arch/x86/include/asm/preempt.h:69 include/linux/rcupdate.h:975 include/linux/percpu-refcount.h:273 
include/linux/percpu-refcount.h:294 block/blk-core.c:674 block/blk-core.c:2069 block/blk-core.c:2022)
<4>[97201.495643] submit_bio (block/blk-core.c:2132)
<4>[97201.501510] blkdev_issue_discard (block/blk-lib.c:119)
<4>[97201.508461] xfs_discard_extents (fs/xfs/xfs_discard.c:228) xfs
<4>[97201.515803] xlog_cil_committed (fs/xfs/xfs_log_cil.c:410) xfs
<4>[97201.523142] xlog_state_do_callback (fs/xfs/xfs_log.c:2756) xfs
<4>[97201.530875] xlog_state_done_syncing (fs/xfs/xfs_log.c:2881) xfs
<4>[97201.538604] ? xfs_buf_ioend_work (fs/xfs/xfs_buf.c:1053) xfs
<4>[97201.545941] xlog_iodone (fs/xfs/xfs_log.c:1218) xfs
<4>[97201.552404] xfs_buf_ioend (fs/xfs/xfs_buf.c:1043) xfs
<4>[97201.559157] xfs_buf_ioend_work (fs/xfs/xfs_buf.c:1053) xfs
<4>[97201.566286] process_one_work (./arch/x86/include/asm/jump_label.h:21 include/linux/jump_label.h:138 
include/trace/events/workqueue.h:111 kernel/workqueue.c:2074)
<4>[97201.572832] worker_thread (kernel/workqueue.c:2202)
<4>[97201.579080] ? create_worker (kernel/workqueue.c:2146)
<4>[97201.585522] kthread (kernel/kthread.c:209)
<4>[97201.591000] ? flush_kthread_worker (kernel/kthread.c:178)
<4>[97201.597936] ret_from_fork (arch/x86/entry/entry_64.S:469)
<4>[97201.603997] ? flush_kthread_worker (kernel/kthread.c:178)
<4>[97201.610937] Code: 01 00 00 48 83 ec 08 e8 83 21 4c e1 48 8b bb 50 01 00 00 e8 97 8d 01 e1 48 89 df e8 9f 01 05 e1 48 83 c4 08 5b 5d c3 
55 48 89 e5 <0f> 0b 55 48 89 e5 41 54 49 89 cc 53 48 03 77 58 48 89 fb 48 8d
All code
========
    0:	01 00                	add    %eax,(%rax)
    2:	00 48 83             	add    %cl,-0x7d(%rax)
    5:	ec                   	in     (%dx),%al
    6:	08 e8                	or     %ch,%al
    8:	83 21 4c             	andl   $0x4c,(%rcx)
    b:	e1 48                	loope  0x55
    d:	8b bb 50 01 00 00    	mov    0x150(%rbx),%edi
   13:	e8 97 8d 01 e1       	callq  0xffffffffe1018daf
   18:	48 89 df             	mov    %rbx,%rdi
   1b:	e8 9f 01 05 e1       	callq  0xffffffffe10501bf
   20:	48 83 c4 08          	add    $0x8,%rsp
   24:	5b                   	pop    %rbx
   25:	5d                   	pop    %rbp
   26:	c3                   	retq
   27:	55                   	push   %rbp
   28:	48 89 e5             	mov    %rsp,%rbp
   2b:*	0f 0b                	ud2    		<-- trapping instruction
   2d:	55                   	push   %rbp
   2e:	48 89 e5             	mov    %rsp,%rbp
   31:	41 54                	push   %r12
   33:	49 89 cc             	mov    %rcx,%r12
   36:	53                   	push   %rbx
   37:	48 03 77 58          	add    0x58(%rdi),%rsi
   3b:	48 89 fb             	mov    %rdi,%rbx
   3e:	48                   	rex.W
   3f:	8d                   	.byte 0x8d

Code starting with the faulting instruction
===========================================
    0:	0f 0b                	ud2
    2:	55                   	push   %rbp
    3:	48 89 e5             	mov    %rsp,%rbp
    6:	41 54                	push   %r12
    8:	49 89 cc             	mov    %rcx,%r12
    b:	53                   	push   %rbx
    c:	48 03 77 58          	add    0x58(%rdi),%rsi
   10:	48 89 fb             	mov    %rdi,%rbx
   13:	48                   	rex.W
   14:	8d                   	.byte 0x8d
<1>[97201.632944] RIP raid5_inc_bi_active_stripes.part.34 (drivers/md/raid5.c:589) raid456
<4>[97201.642343]  RSP <ffff8808f70ab948>
<5>[97201.646732] ---[ now 2016-11-19 14:36:20+03 ]---
<4>[97201.653089] ---[ end trace 77df1d3ea4bbcda7 ]---

I'm not sure what's going on here - xfs sening siscard but

in make_request() stack trace points to this branch

	if (rw == READ && mddev->degraded == 0 &&
	    mddev->reshape_position == MaxSector) {
		bi = chunk_aligned_read(mddev, bi);
		if (!bi)
			return;
	}

probably this is error in debug-info because line of raid5_inc_bi_active_stripes is completely wrong too.

and it actually execures next branch

	if (unlikely(bi->bi_rw & REQ_DISCARD)) {
		make_discard_request(mddev, bi);
		return;
	}


>
>>
>> Thanks,
>> Shaohua
>>
>>> kernel build with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) from ubuntu precise
>>>
>>> <6>[ 1659.710716] md: md2: resync done.
>>> <6>[ 1659.968273] md: resync of RAID array md0
>>> <6>[ 1659.968281] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
>>> <6>[ 1659.968284] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
>>> <6>[ 1659.968310] md: using 128k window, over a total of 16770816k.
>>> <6>[ 1659.968311] md: resuming resync of md0 from checkpoint.
>>> <7>[ 1659.968674] RAID conf printout:
>>> <7>[ 1659.968678]  --- level:6 rd:6 wd:6
>>> <7>[ 1659.968680]  disk 0, o:1, dev:sda3
>>> <7>[ 1659.968682]  disk 1, o:1, dev:sdc3
>>> <7>[ 1659.968683]  disk 2, o:1, dev:sdb3
>>> <7>[ 1659.968684]  disk 3, o:1, dev:sdd3
>>> <7>[ 1659.968685]  disk 4, o:1, dev:sde3
>>> <7>[ 1659.968686]  disk 5, o:1, dev:sdf3
>>> <7>[ 1779.468199] RAID conf printout:
>>> <7>[ 1779.468204]  --- level:6 rd:6 wd:6
>>> <7>[ 1779.468206]  disk 0, o:1, dev:sda1
>>> <7>[ 1779.468208]  disk 1, o:1, dev:sdc1
>>> <7>[ 1779.468209]  disk 2, o:1, dev:sdb1
>>> <7>[ 1779.468210]  disk 3, o:1, dev:sdd1
>>> <7>[ 1779.468211]  disk 4, o:1, dev:sde1
>>> <7>[ 1779.468212]  disk 5, o:1, dev:sdf1
>>> <1>[ 4658.730260] IP: return_io (include/linux/blkdev.h:825 drivers/md/raid5.c:231) raid456
>>> <4>[ 4658.737189] PGD 0
>>> <4>[ 4658.739452] Oops: 0000 [#1] SMP
>>> <4>[ 4658.743080] Modules linked in: netconsole(E) configfs(E) unix_diag(E)
>>> tcp_diag(E) inet_diag(E) ip6t_REJECT(E) nf_reject_ipv6(E) ip6table_filter(E)
>>> ip6table_mangle(E) ip6_tables(E) ipt_R
>>> EJECT(E) nf_reject_ipv4(E) iptable_filter(E) iptable_mangle(E) ip_tables(E)
>>> x_tables(E) ipmi_devintf(E) nfsd(E) nfs_acl(E) auth_rpcgss(E) nfs(E)
>>> fscache(E) lockd(E) sunrpc(E) grace(E) cls_u32
>>> (E) sch_prio(E) ipmi_ssif(E) intel_rapl(E) iosf_mbi(E)
>>> x86_pkg_temp_thermal(E) intel_powerclamp(E) 8021q(E) coretemp(E) mrp(E)
>>> garp(E) stp(E) kvm_intel(E) llc(E) kvm(E) irqbypass(E) crc32_pcl
>>> mul(E) sb_edac(E) serio_raw(E) joydev(E) input_leds(E) edac_core(E)
>>> mei_me(E) mei(E) ioatdma(E) lpc_ich(E) ipmi_si(E) 8250_fintek(E)
>>> ipmi_msghandler(E) shpchp(E) wmi(E) mac_hid(E) ip6_tunnel(
>>> E) tunnel6(E) ipip(E) ip_tunnel(E) tunnel4(E) xfs(E)<4>[ 4658.822823]
>>> raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E)
>>> async_xor(E) async_tx(E) xor(E) hid_generic(E) usb
>>> hid(E) raid6_pq(E) libcrc32c(E) hid(E) igb(E) i2c_algo_bit(E) raid1(E)
>>> isci(E) dca(E) raid0(E) ptp(E) multipath(E) libsas(E) ahci(E) pps_core(E)
>>> scsi_transport_sas(E) psmouse(E) libahci(E) fj
>>> es(E) linear(E)
>>> <4>[ 4658.855131] CPU: 14 PID: 501 Comm: md2_raid6 Tainted: G            E   4.4.26-9 #1
>>> <4>[ 4658.863621] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.00 07/05/2013
>>> <4>[ 4658.871041] task: ffff882035781a80 ti: ffff882033c08000 task.ti: ffff882033c08000
>>> <4>[ 4658.879455] RIP: return_io (include/linux/blkdev.h:825 drivers/md/raid5.c:231) raid456
>>> <4>[ 4658.889155] RSP: 0018:ffff882033c0bb18  EFLAGS: 00010246
>>> <4>[ 4658.895118] RAX: 0000000000000000 RBX: ffff881ff22af2c0 RCX: ffff881ff22af4e0
>>> <4>[ 4658.903122] RDX: 0000000000000000 RSI: ffff881ff22af2c0 RDI: ffff882033c0bc28
>>> <4>[ 4658.911127] RBP: ffff882033c0bb48 R08: 0000000000000000 R09: 0000000000000000
>>> <4>[ 4658.919130] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88203643db00
>>> <4>[ 4658.927134] R13: 0000000000000006 R14: 0000000000000004 R15: ffff882033c0bc28
>>> <4>[ 4658.935139] FS:  0000000000000000(0000) GS:ffff88203f380000(0000) knlGS:0000000000000000
>>> <4>[ 4658.944233] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> <4>[ 4658.950683] CR2: 0000000000000098 CR3: 0000000001e0b000 CR4: 00000000000406e0
>>> <4>[ 4658.958682] Stack:
>>> <4>[ 4658.960952]  ffff882033c0bb78 ffff881ff22af2c0 ffff8820354b0800 0000000000000006
>>> <4>[ 4658.969329]  0000000000000004 0000000000000006 ffff882033c0bc88 ffffffffa015c0dd
>>> <4>[ 4658.977697]  0000000000000000 ffff8820354b0a78 0000000000000000 0000000000000000
>>> <4>[ 4658.986067] Call Trace:
>>> <4>[ 4658.988827] handle_stripe (drivers/md/raid5.c:4635) raid456
>>> <4>[ 4658.996156] ? default_wake_function (kernel/sched/core.c:3376)
>>> <4>[ 4659.003190] ? autoremove_wake_function (kernel/sched/wait.c:295)
>>> <4>[ 4659.010516] ? __wake_up_common (kernel/sched/wait.c:73)
>>> <4>[ 4659.017065] handle_active_stripes.isra.49 (drivers/md/raid5.c:5776) raid456
>>> <4>[ 4659.025877] raid5d (drivers/md/raid5.c:5889) raid456
>>> <4>[ 4659.032428] ? _raw_spin_lock_irqsave
>>> (./arch/x86/include/asm/paravirt.h:696 ./arch/x86/include/asm/qspinlock.h:28
>>> include/asm-generic/qspinlock.h:102 include/linux/spinlock.h:155
>>> include/linux/spinlock_api_smp.h:121 kernel/locking/spinlock.c:159)
>>> <4>[ 4659.039559] md_thread (drivers/md/md.c:7099)
>>> <4>[ 4659.045434] ? add_wait_queue (kernel/sched/wait.c:292)
>>> <4>[ 4659.051786] ? md_rdev_init (drivers/md/md.c:7083)
>>> <4>[ 4659.058140] kthread (kernel/kthread.c:209)
>>> <4>[ 4659.063618] ? flush_kthread_worker (kernel/kthread.c:178)
>>> <4>[ 4659.070554] ret_from_fork (arch/x86/entry/entry_64.S:469)
>>> <4>[ 4659.076619] ? flush_kthread_worker (kernel/kthread.c:178)
>>> <4>[ 4659.083553] Code: 83 ec 08 eb 41 49 8b 04 24 48 85 c0 49 89 07 0f 84
>>> a3 00 00 00 49 8b 44 24 08 49 c7 04 24 00 00 00 00 41 c7 44 24 28 00 00 00
>>> 00 <48> 8b 80 98 00 00 00 4c 8b a8 c0 03 00 00 66 66 66 66 90 4c 89
>>> All code
>>> ========
>>>    0: 83 ec 08                sub    $0x8,%esp
>>>    3: eb 41                   jmp    0x46
>>>    5: 49 8b 04 24             mov    (%r12),%rax
>>>    9: 48 85 c0                test   %rax,%rax
>>>    c: 49 89 07                mov    %rax,(%r15)
>>>    f: 0f 84 a3 00 00 00       je     0xb8
>>>   15: 49 8b 44 24 08          mov    0x8(%r12),%rax
>>>   1a: 49 c7 04 24 00 00 00    movq   $0x0,(%r12)
>>>   21: 00
>>>   22: 41 c7 44 24 28 00 00    movl   $0x0,0x28(%r12)
>>>   29: 00 00
>>>   2b:*        48 8b 80 98 00 00 00    mov    0x98(%rax),%rax          <-- trapping instruction
>>>   32: 4c 8b a8 c0 03 00 00    mov    0x3c0(%rax),%r13
>>>   39: 66 66 66 66 90          data32 data32 data32 xchg %ax,%ax
>>>   3e: 4c                      rex.WR
>>>   3f: 89                      .byte 0x89
>>>
>>> Code starting with the faulting instruction
>>> ===========================================
>>>    0: 48 8b 80 98 00 00 00    mov    0x98(%rax),%rax
>>>    7: 4c 8b a8 c0 03 00 00    mov    0x3c0(%rax),%r13
>>>    e: 66 66 66 66 90          data32 data32 data32 xchg %ax,%ax
>>>   13: 4c                      rex.WR
>>>   14: 89                      .byte 0x89
>>> <1>[ 4659.105577] RIP return_io (include/linux/blkdev.h:825 drivers/md/raid5.c:231) raid456
>>>
>>>
>>> Couple times kernel failed second dereference
>>>
>>> <6>[ 1815.549178] md: md2: resync done.
>>> <7>[ 1815.675433] RAID conf printout:
>>> <7>[ 1815.675439]  --- level:6 rd:6 wd:6
>>> <7>[ 1815.675441]  disk 0, o:1, dev:sda3
>>> <7>[ 1815.675442]  disk 1, o:1, dev:sdb3
>>> <7>[ 1815.675443]  disk 2, o:1, dev:sdc3
>>> <7>[ 1815.675444]  disk 3, o:1, dev:sdd3
>>> <7>[ 1815.675445]  disk 4, o:1, dev:sde3
>>> <7>[ 1815.675446]  disk 5, o:1, dev:sdf3
>>>
>>> <1>[ 2698.718595] IP: return_io (include/linux/blkdev.h:825 drivers/md/raid5.c:231) raid456
>>> <4>[ 2698.725521] PGD 0
>>> <4>[ 2698.727774] Oops: 0000 [#1] SMP
>>> <4>[ 2698.731409] Modules linked in: netconsole(E) configfs(E) unix_diag(E)
>>> tcp_diag(E) inet_diag(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E)
>>> ip_tables(E) x_tables(E) ipmi_devintf(E
>>> ) nfsd(E) nfs_acl(E) auth_rpcgss(E) nfs(E) fscache(E) lockd(E) sunrpc(E)
>>> grace(E) cls_u32(E) sch_prio(E) ipmi_ssif(E) intel_rapl(E) iosf_mbi(E)
>>> x86_pkg_temp_thermal(E) intel_powerclamp(E) cor
>>> etemp(E) kvm_intel(E) kvm(E) 8021q(E) irqbypass(E) mrp(E) garp(E)
>>> crc32_pclmul(E) stp(E) llc(E) serio_raw(E) input_leds(E) joydev(E)
>>> sb_edac(E) edac_core(E) mei_me(E) lpc_ich(E) mei(E) ipmi_s
>>> i(E) ioatdma(E) ipmi_msghandler(E) 8250_fintek(E) shpchp(E) wmi(E)
>>> mac_hid(E) ip6_tunnel(E) tunnel6(E) ipip(E) ip_tunnel(E) tunnel4(E) xfs(E)
>>> raid10(E) raid456(E) async_raid6_recov(E) async_m
>>> emcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E)<4>[ 2698.811146]
>>> hid_generic(E) raid6_pq(E) libcrc32c(E) usbhid(E) igb(E) hid(E)
>>> i2c_algo_bit(E) raid1(E) isci(E) dca(E) raid0(E) libsas(
>>> E) ahci(E) ptp(E) multipath(E) psmouse(E) libahci(E) scsi_transport_sas(E) pps_core(E) linear(E) fjes(E)
>>> <4>[ 2698.833480] CPU: 2 PID: 514 Comm: md2_raid6 Tainted: G            E   4.4.26-9 #1
>>> <4>[ 2698.841845] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0c 10/30/2014
>>> <4>[ 2698.849241] task: ffff882033ec1a80 ti: ffff882033ef4000 task.ti: ffff882033ef4000
>>> <4>[ 2698.857656] RIP: return_io (include/linux/blkdev.h:825 drivers/md/raid5.c:231) raid456
>>> <4>[ 2698.867346] RSP: 0018:ffff882033ef7b18  EFLAGS: 00010246
>>> <4>[ 2698.873307] RAX: 0000000000000000 RBX: ffff881fef26afd0 RCX: ffff881fef26b1f0
>>> <4>[ 2698.881311] RDX: 0000000000000000 RSI: ffff881fef26afd0 RDI: ffff882033ef7c28
>>> <4>[ 2698.889314] RBP: ffff882033ef7b48 R08: 0000000000000000 R09: 0000000000000000
>>> <4>[ 2698.897319] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880f9880cd00
>>> <4>[ 2698.905322] R13: 0000000000000006 R14: 0000000000000004 R15: ffff882033ef7c28
>>> <4>[ 2698.913327] FS:  0000000000000000(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
>>> <4>[ 2698.922420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> <4>[ 2698.928869] CR2: 00000000000003c0 CR3: 0000000001e0b000 CR4: 00000000000406e0
>>> <4>[ 2698.936871] Stack:
>>> <4>[ 2698.939153]  ffff882033ef7b78 ffff881fef26afd0 ffff8810355cbc00 0000000000000006
>>> <4>[ 2698.947512]  0000000000000004 0000000000000006 ffff882033ef7c88 ffffffffa01970dd
>>> <4>[ 2698.955867]  0000000000000000 ffff8810355cbe78 0000000000000000 0000000000000000
>>> <4>[ 2698.964222] Call Trace:
>>> <4>[ 2698.966982] handle_stripe (drivers/md/raid5.c:4635) raid456
>>> <4>[ 2698.974318] ? default_wake_function (kernel/sched/core.c:3376)
>>> <4>[ 2698.981350] ? autoremove_wake_function (kernel/sched/wait.c:295)
>>> <4>[ 2698.988665] ? __wake_up_common (kernel/sched/wait.c:73)
>>> <4>[ 2698.995214] handle_active_stripes.isra.49 (drivers/md/raid5.c:5776) raid456
>>> <4>[ 2699.004020] raid5d (drivers/md/raid5.c:5889) raid456
>>> <4>[ 2699.010571] ? _raw_spin_lock_irqsave
>>> (./arch/x86/include/asm/paravirt.h:696 ./arch/x86/include/asm/qspinlock.h:28
>>> include/asm-generic/qspinlock.h:102 include/linux/spinlock.h:155
>>> include/linux/spinlock_api_smp.h:121 kernel/locking/spinlock.c:159)
>>> <4>[ 2699.017711] md_thread (drivers/md/md.c:7099)
>>> <4>[ 2699.023592] ? add_wait_queue (kernel/sched/wait.c:292)
>>> <4>[ 2699.029952] ? md_rdev_init (drivers/md/md.c:7083)
>>> <4>[ 2699.036305] kthread (kernel/kthread.c:209)
>>> <4>[ 2699.041783] ? flush_kthread_worker (kernel/kthread.c:178)
>>> <4>[ 2699.048719] ret_from_fork (arch/x86/entry/entry_64.S:469)
>>> <4>[ 2699.054779] ? flush_kthread_worker (kernel/kthread.c:178)
>>> <4>[ 2699.061713] Code: 04 24 48 85 c0 49 89 07 0f 84 a3 00 00 00 49 8b 44
>>> 24 08 49 c7 04 24 00 00 00 00 41 c7 44 24 28 00 00 00 00 48 8b 80 98 00 00
>>> 00 <4c> 8b a8 c0 03 00 00 66 66 66 66 90 4c 89 e7 e8 f4 09 20 e1 4d
>>> All code
>>> ========
>>>    0: 04 24                   add    $0x24,%al
>>>    2: 48 85 c0                test   %rax,%rax
>>>    5: 49 89 07                mov    %rax,(%r15)
>>>    8: 0f 84 a3 00 00 00       je     0xb1
>>>    e: 49 8b 44 24 08          mov    0x8(%r12),%rax
>>>   13: 49 c7 04 24 00 00 00    movq   $0x0,(%r12)
>>>   1a: 00
>>>   1b: 41 c7 44 24 28 00 00    movl   $0x0,0x28(%r12)
>>>   22: 00 00
>>>   24: 48 8b 80 98 00 00 00    mov    0x98(%rax),%rax
>>>   2b:*        4c 8b a8 c0 03 00 00    mov    0x3c0(%rax),%r13         <-- trapping instruction
>>>   32: 66 66 66 66 90          data32 data32 data32 xchg %ax,%ax
>>>   37: 4c 89 e7                mov    %r12,%rdi
>>>   3a: e8 f4 09 20 e1          callq  0xffffffffe1200a33
>>>   3f: 4d                      rex.WRB
>>>
>>> Code starting with the faulting instruction
>>> ===========================================
>>>    0: 4c 8b a8 c0 03 00 00    mov    0x3c0(%rax),%r13
>>>    7: 66 66 66 66 90          data32 data32 data32 xchg %ax,%ax
>>>    c: 4c 89 e7                mov    %r12,%rdi
>>>    f: e8 f4 09 20 e1          callq  0xffffffffe1200a08
>>>   14: 4d                      rex.WRB
>>>
>>>
>>> --
>>> Konstantin
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Konstantin

^ permalink raw reply

* Re: linux raid wiki - backup files
From: Wols Lists @ 2016-11-20 14:48 UTC (permalink / raw)
  To: Phil Turmel, linux-raid
In-Reply-To: <ffb47575-b655-924d-8626-07b9e48a6371@turmel.org>

On 20/11/16 00:27, Phil Turmel wrote:
> Yes.  But the new stripes lay on top of the old stripes, unless you move
> the data offset.  Which is why a backup file holds the old stripe just
> in case.  If you can move the offset, you use the lower offset for the
> lower addresses in the array, and the higher offset for the higher
> addresses, on either side of the reshape position.

Okay, understood. So v0.9 and v1.0 always need a backup for a reshape.

But if we have a data offset with v1.2, a reshape will use that space if
it can rather than needing a backup file?

(And changing topic slightly, that space is also used for an internal
bitmap?)

Cheers,
Wol

^ permalink raw reply

* [md PATCH 0/5] Stop using bi_phys_segments as a counter
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid

There are 2 problems with using bi_phys_segments as a counter
1/ we only use 16bits, which limits bios to 256M
2/ it is poor form to reuse a field like this.  It interferes
   with other changes to bios.

We need to clean up a few things before we can change the use the
counter which is now available inside a bio.

I have only tested this lightly.  More review and testing would be
appreciated.

NeilBrown


---

NeilBrown (5):
      md: optimize md_write_start() slightly.
      md/raid5: use md_write_start to count stripes, not bios
      md/raid5: simplfy delaying of writes while metadata is updated.
      md/raid5: call bio_endio() directly rather than queuing for later.
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter


 drivers/md/md.c    |   32 ++++++------
 drivers/md/raid5.c |  136 +++++++++++-----------------------------------------
 drivers/md/raid5.h |    4 --
 3 files changed, 44 insertions(+), 128 deletions(-)

--
Signature


^ permalink raw reply

* [md PATCH 1/5] md: optimize md_write_start() slightly.
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

If md_write_start() finds that ->writes_pending is non-zero, it
should be able to avoid most of the other checks.

To ensure that a non-zero ->writes_pending does mean that other checks
have completed, move it down until after ->in_sync is known to be
clear.  To avoid races with places like array_state_store() which
possible sets ->in_sync, we need to increment ->write_pending inside
the locked region.  As ->writes_pending is now incremented *after*
->in_sync is tested, we must always take the spin_lock, but only if
->writes_pending was found to be zero.

If ->writes_pending is found to be non-zero, we still need to wait it
MD_CHANGE_PENDING is true.

In the common case, md_write_start() will now only
 - check if data_dir is WRITE
 - increment ->writes_pending
 - check MD_CHANGE_PENDING is cleared.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1f1c7f007b68..2f21f6c7156f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,20 +7686,18 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	int did_change = 0;
 	if (bio_data_dir(bi) != WRITE)
 		return;
-
-	BUG_ON(mddev->ro == 1);
-	if (mddev->ro == 2) {
-		/* need to switch to read/write */
-		mddev->ro = 0;
-		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-		md_wakeup_thread(mddev->thread);
-		md_wakeup_thread(mddev->sync_thread);
-		did_change = 1;
-	}
-	atomic_inc(&mddev->writes_pending);
-	if (mddev->safemode == 1)
-		mddev->safemode = 0;
-	if (mddev->in_sync) {
+	if (!atomic_inc_not_zero(&mddev->writes_pending)) {
+		BUG_ON(mddev->ro == 1);
+		if (mddev->ro == 2) {
+			/* need to switch to read/write */
+			mddev->ro = 0;
+			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			md_wakeup_thread(mddev->thread);
+			md_wakeup_thread(mddev->sync_thread);
+			did_change = 1;
+		}
+		if (mddev->safemode == 1)
+			mddev->safemode = 0;
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
@@ -7708,10 +7706,12 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 			md_wakeup_thread(mddev->thread);
 			did_change = 1;
 		}
+		atomic_inc(&mddev->writes_pending);
 		spin_unlock(&mddev->lock);
+
+		if (did_change)
+			sysfs_notify_dirent_safe(mddev->sysfs_state);
 	}
-	if (did_change)
-		sysfs_notify_dirent_safe(mddev->sysfs_state);
 	wait_event(mddev->sb_wait,
 		   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 }



^ permalink raw reply related

* [md PATCH 2/5] md/raid5: use md_write_start to count stripes, not bios
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
has been attached to.

So raid5_make_request() call md_write_start() and then md_write_end().
add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling when raid5_dec_bi_active_stripes() returns 0.

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index df88656d8798..d07d2dce6856 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2998,6 +2998,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		bi->bi_next = *bip;
 	*bip = bi;
 	raid5_inc_bi_active_stripes(bi);
+	md_write_start(conf->mddev, bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3121,10 +3122,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = nextbi;
 		}
 		if (bitmap_end)
@@ -3145,10 +3145,9 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 
 			bi->bi_error = -EIO;
-			if (!raid5_dec_bi_active_stripes(bi)) {
-				md_write_end(conf->mddev);
+			md_write_end(conf->mddev);
+			if (!raid5_dec_bi_active_stripes(bi))
 				bio_list_add(return_bi, bi);
-			}
 			bi = bi2;
 		}
 
@@ -3490,10 +3489,9 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				while (wbi && wbi->bi_iter.bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_active_stripes(wbi)) {
-						md_write_end(conf->mddev);
+					md_write_end(conf->mddev);
+					if (!raid5_dec_bi_active_stripes(wbi))
 						bio_list_add(return_bi, wbi);
-					}
 					wbi = wbi2;
 				}
 				bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -5142,9 +5140,9 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		release_stripe_plug(mddev, sh);
 	}
 
+	md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-		md_write_end(mddev);
 		bio_endio(bi);
 	}
 }
@@ -5173,8 +5171,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		/* ret == -EAGAIN, fallback */
 	}
 
-	md_write_start(mddev, bi);
-
 	/*
 	 * If array is degraded, better not do chunk aligned read because
 	 * later we might have to read it again in order to reconstruct
@@ -5196,6 +5192,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
+	md_write_start(mddev, bi);
 
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
@@ -5324,11 +5321,11 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (rw == WRITE)
+		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
 
-		if ( rw == WRITE )
-			md_write_end(mddev);
 
 		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
 					 bi, 0);



^ permalink raw reply related

* [md PATCH 3/5] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2016-11-21  1:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Konstantin Khlebnikov, Christoph Hellwig, linux-raid
In-Reply-To: <147969099621.5434.12384452255155063186.stgit@noble>

If a device fails during a write, we must ensure the failure is
recorded in the metadata before the completion of the write is
acknowleged.

Commit: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.")
added code for this, but it was unnecessarily complicated.  We already
had similar function for handling updated to the bad-block-list thanks to
Commit: de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.")

So revert most of the former commit, and instead avoid collecting
completed write if MD_CHANGE_PENDING is set.  raid5d will then flush
the metadata and retry the stripe_head.

We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
asynchronously.  After analyse_stripe(), we have collected stable data
about the data of devices, which will be used to make decisions.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c |   27 ++++-----------------------
 drivers/md/raid5.h |    3 ---
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d07d2dce6856..e53b8f499a4c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4344,7 +4344,8 @@ static void handle_stripe(struct stripe_head *sh)
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
-	if (s.handle_bad_blocks) {
+	if (s.handle_bad_blocks ||
+	    test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
 		set_bit(STRIPE_HANDLE, &sh->state);
 		goto finish;
 	}
@@ -4632,15 +4633,8 @@ static void handle_stripe(struct stripe_head *sh)
 			md_wakeup_thread(conf->mddev->thread);
 	}
 
-	if (!bio_list_empty(&s.return_bi)) {
-		if (test_bit(MD_CHANGE_PENDING, &conf->mddev->flags)) {
-			spin_lock_irq(&conf->device_lock);
-			bio_list_merge(&conf->return_bi, &s.return_bi);
-			spin_unlock_irq(&conf->device_lock);
-			md_wakeup_thread(conf->mddev->thread);
-		} else
-			return_io(&s.return_bi);
-	}
+	if (!bio_list_empty(&s.return_bi))
+		return_io(&s.return_bi);
 
 	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
 }
@@ -5846,18 +5840,6 @@ static void raid5d(struct md_thread *thread)
 
 	md_check_recovery(mddev);
 
-	if (!bio_list_empty(&conf->return_bi) &&
-	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-		struct bio_list tmp = BIO_EMPTY_LIST;
-		spin_lock_irq(&conf->device_lock);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			bio_list_merge(&tmp, &conf->return_bi);
-			bio_list_init(&conf->return_bi);
-		}
-		spin_unlock_irq(&conf->device_lock);
-		return_io(&tmp);
-	}
-
 	blk_start_plug(&plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
@@ -6490,7 +6472,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
-	bio_list_init(&conf->return_bi);
 	init_llist_head(&conf->released_stripes);
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 57ec49f0839e..f654f8207a44 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -482,9 +482,6 @@ struct r5conf {
 	int			skip_copy; /* Don't copy data from bio to stripe cache */
 	struct list_head	*last_hold; /* detect hold_list promotions */
 
-	/* bios to have bi_end_io called after metadata is synced */
-	struct bio_list		return_bi;
-
 	atomic_t		reshape_stripes; /* stripes with pending writes for reshape */
 	/* unfortunately we need two cache names as we temporarily have
 	 * two caches.



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox