Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: Process stuck in md_flush_request (state: D)
From: Shaohua Li @ 2017-02-17 20:06 UTC (permalink / raw)
  To: Les Stroud; +Cc: linux-raid
In-Reply-To: <99A92F4D-338D-4486-BB1C-C114A8524403@lesstroud.com>

On Fri, Feb 17, 2017 at 02:05:49PM -0500, Les Stroud wrote:
> 
> I have a problem with processes entering an uninterruptible sleep state in md_flush_request and never returning. I having trouble identifying the underlying issue. I’m hoping someone on here may be able to help.
> 
> The servers in question are running in aws (xen hvm) with kernel 3.8.13-118.16.2.el6uek.x86_64.  The server has two mounts.  The first is vanilla ext4.  The second is a software RAID0 array, striped with 256k chunks, buiIt with md.  It’s file system is ext4. 
> 
> The most immediately and obvious symptom of the issue are kernel errors “kernel: INFO task [some process]: blocked for more than 120 seconds.”.  Shortly there after, other processes start entering the same uninterruptible wait state (D). This almost always impacts ssh logins.
> 
> The problem does not occur when the system is under load, or was recently under load.  It happens when the system is quiet (no cpu, very little io).

This seems suggesting we have a missed blk-plug flush in light workload. Can
you check the output of /sys/block/disk-bame/inflight for both md and the
underlayer disks? This will let us know if there is IO pending.
Also it would be great if you can test a upstream kernel.

Thanks,
Shaohua

^ permalink raw reply

* Re: Process stuck in md_flush_request (state: D)
From: Les Stroud @ 2017-02-17 20:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid
In-Reply-To: <20170217200644.amaxgira4nqlbchh@kernel.org>

It’ll take a day or two for it to happen again.  When it does, I’ll pull the inflight stats.  Anything else I should grab while I’m at it?

Thanx,
LES


> On Feb 17, 2017, at 3:06 PM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Fri, Feb 17, 2017 at 02:05:49PM -0500, Les Stroud wrote:
>> 
>> I have a problem with processes entering an uninterruptible sleep state in md_flush_request and never returning. I having trouble identifying the underlying issue. I’m hoping someone on here may be able to help.
>> 
>> The servers in question are running in aws (xen hvm) with kernel 3.8.13-118.16.2.el6uek.x86_64.  The server has two mounts.  The first is vanilla ext4.  The second is a software RAID0 array, striped with 256k chunks, buiIt with md.  It’s file system is ext4. 
>> 
>> The most immediately and obvious symptom of the issue are kernel errors “kernel: INFO task [some process]: blocked for more than 120 seconds.”.  Shortly there after, other processes start entering the same uninterruptible wait state (D). This almost always impacts ssh logins.
>> 
>> The problem does not occur when the system is under load, or was recently under load.  It happens when the system is quiet (no cpu, very little io).
> 
> This seems suggesting we have a missed blk-plug flush in light workload. Can
> you check the output of /sys/block/disk-bame/inflight for both md and the
> underlayer disks? This will let us know if there is IO pending.
> Also it would be great if you can test a upstream kernel.
> 
> Thanks,
> Shaohua


^ permalink raw reply

* [PATCH 0/3]md/raid5: improve IO pattern
From: Shaohua Li @ 2017-02-17 23:32 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, songliubraving, kernel-team

Hi,

These are some patches I'm testing to improve raid5-cache performance. The basic
idea is to flush a lot of stripes to raid disks and then do smart scheduling of
IOs. In my test of a 12-disk raid6 array, this improves around 20% throughput
and request size/disk seek are improved a lot. fio script I'm using are:

[global]
ioengine=libaio
direct=1
loops=1000
runtime=120
time_based=1
file_service_type=random:36
overwrite=1
thread=0
group_reporting=1

[test]
filename=/dev/md0
bs=4k
readwrite=randwrite
numjobs=8
offset_increment=10G

--------------------------------------------
Shaohua Li (3):
  md/raid5: prioritize stripes for writeback
  md/raid5-cache: bump flush stripe batch size
  md/raid5: sort bios

 drivers/md/raid5-cache.c |   2 +-
 drivers/md/raid5.c       | 189 ++++++++++++++++++++++++++++++++++++++---------
 drivers/md/raid5.h       |  13 +++-
 3 files changed, 169 insertions(+), 35 deletions(-)

-- 
2.9.3


^ permalink raw reply

* [PATCH 1/3] md/raid5: prioritize stripes for writeback
From: Shaohua Li @ 2017-02-17 23:32 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, songliubraving, kernel-team
In-Reply-To: <cover.1487373517.git.shli@fb.com>

In raid5-cache writeback mode, we have two types of stripes to handle.
- stripes which aren't cached yet
- stripes which are cached and flushing out to raid disks

Upperlayer is more sensistive to latency of the first type of stripes
generally. But we only one handle list for all these stripes, where the
two types of stripes are mixed together. When reclaim flushes a lot of
stripes, the first type of stripes could be noticeably delayed. On the
other hand, if the log space is tight, we'd like to handle the second
type of stripes faster and free log space.

This patch destinguishes the two types stripes. They are added into
different handle list. When we try to get a stripe to handl, we prefer
the first type of stripes unless log space is tight.

This should have no impact for !writeback case.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 48 +++++++++++++++++++++++++++++++++++++++---------
 drivers/md/raid5.h |  2 ++
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b7722bb2..d7c90da 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -174,6 +174,13 @@ static int stripe_operations_active(struct stripe_head *sh)
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }
 
+static bool stripe_is_lowprio(struct stripe_head *sh)
+{
+	return (test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state) ||
+		test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) &&
+	       !test_bit(STRIPE_R5C_CACHING, &sh->state);
+}
+
 static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
@@ -189,7 +196,10 @@ static void raid5_wakeup_stripe_thread(struct stripe_head *sh)
 	if (list_empty(&sh->lru)) {
 		struct r5worker_group *group;
 		group = conf->worker_groups + cpu_to_group(cpu);
-		list_add_tail(&sh->lru, &group->handle_list);
+		if (stripe_is_lowprio(sh))
+			list_add_tail(&sh->lru, &group->loprio_list);
+		else
+			list_add_tail(&sh->lru, &group->handle_list);
 		group->stripes_cnt++;
 		sh->group = group;
 	}
@@ -252,7 +262,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			clear_bit(STRIPE_BIT_DELAY, &sh->state);
 			if (conf->worker_cnt_per_group == 0) {
-				list_add_tail(&sh->lru, &conf->handle_list);
+				if (stripe_is_lowprio(sh))
+					list_add_tail(&sh->lru,
+							&conf->loprio_list);
+				else
+					list_add_tail(&sh->lru,
+							&conf->handle_list);
 			} else {
 				raid5_wakeup_stripe_thread(sh);
 				return;
@@ -5169,19 +5184,27 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
  */
 static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 {
-	struct stripe_head *sh = NULL, *tmp;
+	struct stripe_head *sh, *tmp;
 	struct list_head *handle_list = NULL;
-	struct r5worker_group *wg = NULL;
+	struct r5worker_group *wg;
+	bool second_try = !r5c_is_writeback(conf->log);
+	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state);
 
+again:
+	wg = NULL;
+	sh = NULL;
 	if (conf->worker_cnt_per_group == 0) {
-		handle_list = &conf->handle_list;
+		handle_list = try_loprio ? &conf->loprio_list :
+					&conf->handle_list;
 	} else if (group != ANY_GROUP) {
-		handle_list = &conf->worker_groups[group].handle_list;
+		handle_list = try_loprio ? &conf->worker_groups[group].loprio_list :
+				&conf->worker_groups[group].handle_list;
 		wg = &conf->worker_groups[group];
 	} else {
 		int i;
 		for (i = 0; i < conf->group_cnt; i++) {
-			handle_list = &conf->worker_groups[i].handle_list;
+			handle_list = try_loprio ? &conf->worker_groups[i].loprio_list :
+				&conf->worker_groups[i].handle_list;
 			wg = &conf->worker_groups[i];
 			if (!list_empty(handle_list))
 				break;
@@ -5232,8 +5255,13 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 		wg = NULL;
 	}
 
-	if (!sh)
-		return NULL;
+	if (!sh) {
+		if (second_try)
+			return NULL;
+		second_try = true;
+		try_loprio = !try_loprio;
+		goto again;
+	}
 
 	if (wg) {
 		wg->stripes_cnt--;
@@ -6543,6 +6571,7 @@ static int alloc_thread_groups(struct r5conf *conf, int cnt,
 
 		group = &(*worker_groups)[i];
 		INIT_LIST_HEAD(&group->handle_list);
+		INIT_LIST_HEAD(&group->loprio_list);
 		group->conf = conf;
 		group->workers = workers + i * cnt;
 
@@ -6770,6 +6799,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
+	INIT_LIST_HEAD(&conf->loprio_list);
 	INIT_LIST_HEAD(&conf->hold_list);
 	INIT_LIST_HEAD(&conf->delayed_list);
 	INIT_LIST_HEAD(&conf->bitmap_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4bb27b9..6b9d2e8 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -542,6 +542,7 @@ struct r5worker {
 
 struct r5worker_group {
 	struct list_head handle_list;
+	struct list_head loprio_list;
 	struct r5conf *conf;
 	struct r5worker *workers;
 	int stripes_cnt;
@@ -608,6 +609,7 @@ struct r5conf {
 						  */
 
 	struct list_head	handle_list; /* stripes needing handling */
+	struct list_head	loprio_list; /* low priority stripes */
 	struct list_head	hold_list; /* preread ready stripes */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
-- 
2.9.3


^ permalink raw reply related

* [PATCH 2/3] md/raid5-cache: bump flush stripe batch size
From: Shaohua Li @ 2017-02-17 23:32 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, songliubraving, kernel-team
In-Reply-To: <cover.1487373517.git.shli@fb.com>

Bump the flush stripe batch size to 2048. For my 12 disks raid
array, the stripes takes:
12 * 4k * 2048 = 96MB

This is still quite small. A hardware raid card generally has 1GB size,
which we suggest the raid5-cache has similar cache size.

The advantage of a big batch size is we can dispatch a lot of IO in the
same time, then we can do some scheduling to make better IO pattern.

Last patch prioritizes stripes, so we don't worry about a big flush
stripe batch will starve normal stripes.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be..b25512c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -43,7 +43,7 @@
 /* wake up reclaim thread periodically */
 #define R5C_RECLAIM_WAKEUP_INTERVAL (30 * HZ)
 /* start flush with these full stripes */
-#define R5C_FULL_STRIPE_FLUSH_BATCH 256
+#define R5C_FULL_STRIPE_FLUSH_BATCH 2048
 /* reclaim stripes in groups */
 #define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)
 
-- 
2.9.3


^ permalink raw reply related

* [PATCH 3/3] md/raid5: sort bios
From: Shaohua Li @ 2017-02-17 23:32 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, songliubraving, kernel-team
In-Reply-To: <cover.1487373517.git.shli@fb.com>

Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
defers IO dispatching. The goal is to create better IO pattern. At that
time, we don't sort the deffered IO and hope the block layer can do IO
merge and sort. Now the raid5-cache writeback could create large amount
of bios. And if we enable muti-thread for stripe handling, we can't
control when to dispatch IO to raid disks. In a lot of time, we are
dispatching IO which block layer can't do merge effectively.

This patch moves further for the IO dispatching defer. We accumulate
bios, but we don't dispatch all the bios after a threshold is met. This
'dispatch partial portion of bios' stragety allows bios coming in a
large time window are sent to disks together. At the dispatching time,
there is large chance the block layer can merge the bios. To make this
more effective, we dispatch IO in ascending order. This increases
request merge chance and reduces disk seek.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 141 ++++++++++++++++++++++++++++++++++++++++++++---------
 drivers/md/raid5.h |  11 ++++-
 2 files changed, 127 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d7c90da..fe6232f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -56,6 +56,7 @@
 #include <linux/nodemask.h>
 #include <linux/flex_array.h>
 #include <trace/events/block.h>
+#include <linux/sort.h>
 
 #include "md.h"
 #include "raid5.h"
@@ -876,41 +877,121 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
 	return 1;
 }
 
-static void flush_deferred_bios(struct r5conf *conf)
+static void dispatch_bio_list(struct bio_list *tmp)
 {
-	struct bio_list tmp;
 	struct bio *bio;
 
+	while ((bio = bio_list_pop(tmp)))
+		generic_make_request(bio);
+}
+
+static int cmp_sh(const void *a, const void *b)
+{
+	const struct r5pending_data *da = a;
+	const struct r5pending_data *db = b;
+
+	if (da->sector > db->sector)
+		return 1;
+	if (da->sector < db->sector)
+		return -1;
+	return 0;
+}
+
+static void dispatch_defer_bios(struct r5conf *conf, int target,
+				struct bio_list *list)
+{
+	int start = 0;
+	int end = conf->pending_data_cnt - 1;
+	int mid;
+	int cnt = 0;
+
+	if (conf->pending_data_cnt == 0)
+		return;
+	sort(conf->pending_data, conf->pending_data_cnt,
+		sizeof(struct r5pending_data), cmp_sh, NULL);
+
+	while (start <= end) {
+		mid = (start + end) / 2;
+		if (conf->pending_data[mid].sector > conf->last_bio_pos)
+			end = mid - 1;
+		else if (conf->pending_data[mid].sector < conf->last_bio_pos)
+			start = mid + 1;
+		else {
+			start = mid;
+			break;
+		}
+	}
+
+	end = conf->pending_data_cnt - 1;
+	for (mid = start; mid <= end; mid++) {
+		conf->last_bio_pos = conf->pending_data[mid].sector;
+		bio_list_merge(list, &conf->pending_data[mid].bios);
+
+		cnt++;
+		if (cnt >= target) {
+			mid++;
+			break;
+		}
+	}
+	conf->pending_data_cnt -= cnt;
+	BUG_ON(conf->pending_data_cnt < 0);
+	if (mid <= end) {
+		memmove(&conf->pending_data[start], &conf->pending_data[mid],
+			(end - mid + 1) * sizeof(struct r5pending_data));
+		return;
+	}
+
+	for (mid = 0; mid < start; mid++) {
+		conf->last_bio_pos = conf->pending_data[mid].sector;
+		bio_list_merge(list, &conf->pending_data[mid].bios);
+
+		cnt++;
+		if (cnt >= target) {
+			mid++;
+			break;
+		}
+	}
+
+	conf->pending_data_cnt -= mid;
+	BUG_ON(conf->pending_data_cnt < 0);
+	if (mid == start) {
+		BUG_ON(conf->pending_data_cnt);
+		return;
+	}
+	memmove(&conf->pending_data[0], &conf->pending_data[mid],
+			(start - mid) * sizeof(struct r5pending_data));
+}
+
+static void flush_deferred_bios(struct r5conf *conf)
+{
+	struct bio_list tmp = BIO_EMPTY_LIST;
+
 	if (!conf->batch_bio_dispatch || !conf->group_cnt)
 		return;
 
-	bio_list_init(&tmp);
 	spin_lock(&conf->pending_bios_lock);
-	bio_list_merge(&tmp, &conf->pending_bios);
-	bio_list_init(&conf->pending_bios);
+	dispatch_defer_bios(conf, conf->pending_data_cnt, &tmp);
 	spin_unlock(&conf->pending_bios_lock);
 
-	while ((bio = bio_list_pop(&tmp)))
-		generic_make_request(bio);
+	dispatch_bio_list(&tmp);
 }
 
-static void defer_bio_issue(struct r5conf *conf, struct bio *bio)
+static void defer_issue_bios(struct r5conf *conf, sector_t sector,
+				struct bio_list *bios)
 {
-	/*
-	 * change group_cnt will drain all bios, so this is safe
-	 *
-	 * A read generally means a read-modify-write, which usually means a
-	 * randwrite, so we don't delay it
-	 */
-	if (!conf->batch_bio_dispatch || !conf->group_cnt ||
-	    bio_op(bio) == REQ_OP_READ) {
-		generic_make_request(bio);
-		return;
-	}
+	struct bio_list tmp = BIO_EMPTY_LIST;
+
 	spin_lock(&conf->pending_bios_lock);
-	bio_list_add(&conf->pending_bios, bio);
+	conf->pending_data[conf->pending_data_cnt].sector = sector;
+	bio_list_init(&conf->pending_data[conf->pending_data_cnt].bios);
+	bio_list_merge(&conf->pending_data[conf->pending_data_cnt].bios,
+		bios);
+	conf->pending_data_cnt++;
+	if (conf->pending_data_cnt >= PENDING_IO_MAX)
+		dispatch_defer_bios(conf, PENDING_IO_ONE_FLUSH, &tmp);
 	spin_unlock(&conf->pending_bios_lock);
-	md_wakeup_thread(conf->mddev->thread);
+
+	dispatch_bio_list(&tmp);
 }
 
 static void
@@ -923,6 +1004,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 	struct r5conf *conf = sh->raid_conf;
 	int i, disks = sh->disks;
 	struct stripe_head *head_sh = sh;
+	struct bio_list pending_bios = BIO_EMPTY_LIST;
+	bool should_defer;
 
 	might_sleep();
 
@@ -939,6 +1022,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		}
 	}
 
+	should_defer = conf->batch_bio_dispatch && conf->group_cnt;
+
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
 		int replace_only = 0;
@@ -1093,7 +1178,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(bi->bi_bdev),
 						      bi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			defer_bio_issue(conf, bi);
+			if (should_defer && op_is_write(op))
+				bio_list_add(&pending_bios, bi);
+			else
+				generic_make_request(bi);
 		}
 		if (rrdev) {
 			if (s->syncing || s->expanding || s->expanded
@@ -1138,7 +1226,10 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(rbi->bi_bdev),
 						      rbi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			defer_bio_issue(conf, rbi);
+			if (should_defer && op_is_write(op))
+				bio_list_add(&pending_bios, rbi);
+			else
+				generic_make_request(rbi);
 		}
 		if (!rdev && !rrdev) {
 			if (op_is_write(op))
@@ -1156,6 +1247,9 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		if (sh != head_sh)
 			goto again;
 	}
+
+	if (should_defer && !bio_list_empty(&pending_bios))
+		defer_issue_bios(conf, head_sh->sector, &pending_bios);
 }
 
 static struct dma_async_tx_descriptor *
@@ -6808,7 +6902,6 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
-	bio_list_init(&conf->pending_bios);
 	spin_lock_init(&conf->pending_bios_lock);
 	conf->batch_bio_dispatch = true;
 	rdev_for_each(rdev, mddev) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6b9d2e8..eb6d5d5 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -572,6 +572,13 @@ enum r5_cache_state {
 				 */
 };
 
+#define PENDING_IO_MAX 512
+#define PENDING_IO_ONE_FLUSH 128
+struct r5pending_data {
+	sector_t sector; /* stripe sector */
+	struct bio_list bios;
+};
+
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
 	/* only protect corresponding hash list and inactive_list */
@@ -689,9 +696,11 @@ struct r5conf {
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
 
-	struct bio_list		pending_bios;
 	spinlock_t		pending_bios_lock;
 	bool			batch_bio_dispatch;
+	struct r5pending_data	pending_data[PENDING_IO_MAX];
+	int			pending_data_cnt;
+	sector_t		last_bio_pos;
 };
 
 
-- 
2.9.3


^ permalink raw reply related

* mdadm: can only add devices to linear arrays
From: Boylan, Ross @ 2017-02-18  2:33 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

I got "mdadm: can only add devices to linear arrays" when I did not expect it.  Can anyone explain what's  going on, and what the right way to handle the situation is?

In brief, I had a raid1 set for 2 devices, but with only 1.  I tried adding a 2nd.

root@tempserver:/var/log# mdadm -D /dev/md/media4
/dev/md/media4:
        Version : 1.2
  Creation Time : Thu Feb 18 14:09:00 2016
     Raid Level : raid1
     Array Size : 1953376064 (1862.88 GiB 2000.26 GB)
  Used Dev Size : 1953376064 (1862.88 GiB 2000.26 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Feb 17 17:59:40 2017
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : tempserver:media4  (local to host tempserver)
           UUID : 69440821:e9aa9259:ab2b06ce:d09bedc7
         Events : 555297

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       2       8       34        1      active sync   /dev/sdc2
root@tempserver:/var/log# mdadm --grow /dev/md/media4 --add /dev/sda2
mdadm: can only add devices to linear arrays
# I figured since it was raid1 with a missing device it would use the one I added
root@tempserver:/var/log# mdadm --grow /dev/md/media4 --add /dev/sda2 -n 2
mdadm: /dev/md/media4: no change requested
root@tempserver:/var/log# date; mdadm --grow /dev/md/media4 -n 1
# so I reduced the size of the array.  I was a little worried this would kick sdc2 out of the array.
#  It did not.  I then was able to add
# the new partition in  what I think is the usual way.
root@tempserver:/var/log# date; mdadm --grow /dev/md/media4 --force -n 1
Fri Feb 17 18:18:28 PST 2017
raid_disks for /dev/md/media4 set to 1
root@tempserver:/var/log# mdadm -D /dev/md/media4
/dev/md/media4:
        Version : 1.2
  Creation Time : Thu Feb 18 14:09:00 2016
     Raid Level : raid1
     Array Size : 1953376064 (1862.88 GiB 2000.26 GB)
  Used Dev Size : 1953376064 (1862.88 GiB 2000.26 GB)
   Raid Devices : 1
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Fri Feb 17 18:18:41 2017
          State : clean 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : tempserver:media4  (local to host tempserver)
           UUID : 69440821:e9aa9259:ab2b06ce:d09bedc7
         Events : 557442

    Number   Major   Minor   RaidDevice State
       2       8       34        0      active sync   /dev/sdc2

root@tempserver:/var/log# date; mdadm --grow /dev/md/media4 --add /dev/sda2 -n 2
Fri Feb 17 18:19:24 PST 2017
mdadm: added /dev/sda2
raid_disks for /dev/md/media4 set to 2
# and mdadm -D shows the array rebuilding with sda2 being rebuilt.

Thanks.
Ross Boylan

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-18  2:40 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown
  Cc: linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <20170217194117.cgbdm247p5p4gj6v@kernel.org>

On 2017/2/18 上午3:41, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:

[snip]
>>> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>  
>>>  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>>>  {
>>> -	struct r1conf *conf = mddev->private;
>>> -	struct r1bio *r1_bio;
>>> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>>> +	struct bio *split;
>>> +	sector_t sectors;
>>>  
>>> -	/*
>>> -	 * make_request() can abort the operation when read-ahead is being
>>> -	 * used and no empty request is available.
>>> -	 *
>>> -	 */
>>> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>>> +	make_request_fn = (bio_data_dir(bio) == READ) ?
>>> +			  raid1_read_request : raid1_write_request;
>>>  
>>> -	r1_bio->master_bio = bio;
>>> -	r1_bio->sectors = bio_sectors(bio);
>>> -	r1_bio->state = 0;
>>> -	r1_bio->mddev = mddev;
>>> -	r1_bio->sector = bio->bi_iter.bi_sector;
>>> -
>>> -	/*
>>> -	 * We might need to issue multiple reads to different devices if there
>>> -	 * are bad blocks around, so we keep track of the number of reads in
>>> -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
>>> -	 * no locking will be needed when requests complete.  If it is
>>> -	 * non-zero, then it is the number of not-completed requests.
>>> -	 */
>>> -	bio->bi_phys_segments = 0;
>>> -	bio_clear_flag(bio, BIO_SEG_VALID);
>>> +	/* if bio exceeds barrier unit boundary, split it */
>>> +	do {
>>> +		sectors = align_to_barrier_unit_end(
>>> +				bio->bi_iter.bi_sector, bio_sectors(bio));
>>> +		if (sectors < bio_sectors(bio)) {
>>> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>>> +			bio_chain(split, bio);
>>> +		} else {
>>> +			split = bio;
>>> +		}
>>>  
>>> -	if (bio_data_dir(bio) == READ)
>>> -		raid1_read_request(mddev, bio, r1_bio);
>>> -	else
>>> -		raid1_write_request(mddev, bio, r1_bio);
>>> +		make_request_fn(mddev, split);
>>> +	} while (split != bio);
>>>  }
>>
>> I know you are going to change this as Shaohua wantsthe spitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1.  If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>>
>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         make_request_fn(split);
>>         generic_make_request(bio);
>>    } else
>>         make_request_fn(mddev, bio);
>>
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices.  These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.  Then it will process the remainder
>> of the original bio once the first section has been fully processed.
> 
> Good point! raid10 has the same problem. Looks this doesn't solve the issue for
> device with 3 times stack though.
> 
> I knew you guys are working on this issue in block layer. Should we fix the
> issue in MD side (for 2 stack devices) or wait for the block layer patch?

Obviously I don't get the point at all ... Could you please explain a
little more about why it is an issue and how it may happen ? Thanks a
lot :-)

Coly


^ permalink raw reply

* Re: [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-18  2:56 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20170217200036.2ca5iotzhc4oauwa@kernel.org>

On 2017/2/18 上午4:00, Shaohua Li wrote:
> On Sat, Feb 18, 2017 at 03:05:56AM +0800, colyli@suse.de wrote:
[snip]

>> Changelog
>> V4:
>> - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
>> - Fix many typos in patch comments.
>> - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
>> V3:
>> - Rebase the patch against latest upstream kernel code.
>> - Many fixes by review comments from Neil,
>>   - Back to use pointers to replace arraries in struct r1conf
>>   - Remove total_barriers from struct r1conf
>>   - Add more patch comments to explain how/why the values of
>>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>>   - Use get_unqueued_pending() to replace get_all_pendings() and
>>     get_all_queued()
>>   - Increase bucket number from 512 to 1024
>> - Change code comments format by review from Shaohua.
>> V2:
>> - Use bio_split() to split the orignal bio if it goes across barrier unit
>>   bounday, to make the code more simple, by suggestion from Shaohua and
>>   Neil.
>> - Use hash_long() to replace original linear hash, to avoid a possible
>>   confilict between resync I/O and sequential write I/O, by suggestion from
>>   Shaohua.
>> - Add conf->total_barriers to record barrier depth, which is used to
>>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
>> - In V1 patch the bellowed barrier buckets related members in r1conf are
>>   allocated in memory page. To make the code more simple, V2 patch moves
>>   the memory space into struct r1conf, like this,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>>         +       int                     barrier[BARRIER_BUCKETS_NR];
>>   This change is by the suggestion from Shaohua.
>> - Remove some inrelavent code comments, by suggestion from Guoqing.
>> - Add a missing wait_barrier() before jumping to retry_write, in
>>   raid1_make_write_request().
>> V1:
>> - Original RFC patch for comments
> 
> Thanks, I applied the two patches. For the deadlock issue Neil pointed out, I
> don't want it to hold this patch, we can fix it later after we figured out how
> to fix. The r1bio allocation is guaranteed to success because it's mempool
> allocation, so I deleted the warn statement.

Good to know this, yes please delete the warn message line.

Thanks.

Coly


^ permalink raw reply

* interesting case of a hung 'recovery'
From: Eyal Lebedinsky @ 2017-02-18 12:14 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

I should start by saying that this is an old fedora 19 system

Executive summary: after '--add'ing a new member a 'recovery' starts but 'sync_max' is not reset.

$ uname -a
Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

so the issue may have been fixed since.

I had a disk fail in a raid6. After some 'pending' sectors were logged I decided to do a 'check'
around that location (set sync_min/max and echo 'check'). Sure enough it elicited disk errors,
but the disk did not recover and it was kicked out of the array. Moreover it became unresponsive.
It needed a power cycle so I shutdown and rebooted the machine.

Not one to give up easily I tried the check again, with the same result.
It was time to '--remove' this array member. I then '--add'ed a new disk which started a recovery.

A few hours later I noticed that it slowed down. A lot. It actually did not progress at all for
a few hours (I was away from the machine).

As I was staring at the screen (for a long while) I realised that it stopped at 55.5%, and this
number is exactly where the original 'check' failed (I still do not understand why with my bad
memory I remembered this number).

I checked 'sync_completed' and it was proper.
I then examined 'sync_max' and it was wrong - it had the location where the very early 'check'
failed earlier in the day. It was the same sector where it is now paused at - looks related.

I decided to take a (small) risk and do
	# echo 'max' >/sys/block/md127/md/sync_max
at which point the recovery moved on. It should be finished in about 5 hours.

I do not think that it is correct for 'sync_max' to not be set to 'max' when a new member is
added - it surely requires a full recovery.

I really hope (and expect) that this was actually fixed, but this note may help others facing
same predicament.

cheers

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Phil Turmel @ 2017-02-18 22:20 UTC (permalink / raw)
  To: Reindl Harald, linux-raid
In-Reply-To: <f6ed7560-97b8-78ef-26be-789a8bc372b5@thelounge.net>

On 02/17/2017 05:03 AM, Reindl Harald wrote:

>> Be careful. Don't confuse Raid10 with Raid1+0. They are NOT the
>> same thing (on linux at least), although they are very similar
> 
> yeah, i realized that but anyways thought the "writemostly" logic is 
> there too and maybe the docs not up-to-date

Linux MD raid10 doesn't have a requirement that the number of devices
be a multiple of the number of data copies.  Which creates "interesting"
data layouts with odd numbers of devices or similar effects with ,n3 or
,f3 layouts.  Which makes it difficult if not impossible to designate
specific devices as write mostly without weird operational asymmetries
across the assembled array.

In other words, it is not at all like raid 1 on top of raid 0, except in
certain very limited cases, and your assumptions are simply wrong.

If there are features (other than layouts) of raid10 that make you
prefer it to raid1, it would make sense to ask for those features to
be implemented in raid1.

> sadly i can't write a patch on my own but only point how useful it
> would be

That's unfortunate.  Patches are generally welcome.

Phil

^ permalink raw reply

* Re: mdadm: can only add devices to linear arrays
From: Phil Turmel @ 2017-02-18 22:29 UTC (permalink / raw)
  To: Boylan, Ross, linux-raid@vger.kernel.org
In-Reply-To: <dffb96dc-ee78-4bad-82a8-7f9cfdeb80cd@EXHT01.net.ucsf.edu>

On 02/17/2017 09:33 PM, Boylan, Ross wrote:
> I got "mdadm: can only add devices to linear arrays" when I did not
> expect it.  Can anyone explain what's  going on, and what the right
> way to handle the situation is?
> 
> In brief, I had a raid1 set for 2 devices, but with only 1.  I tried
> adding a 2nd.

I think you simply misunderstood the use of --grow.  All you needed was
--add to place a new device in the empty slot.

Use --grow to change the layout or number of devices of a non-degraded
array.  Use --add, --fail, and --remove to manage the devices in an
array without changing its structure.

Often, you see --grow with an --add because a spare is needed for the
structural changes one is commanding, and it is fine to combine the
operations (adding a spare then growing onto it).

Phil

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Reindl Harald @ 2017-02-18 23:35 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <7090c5c7-7f02-14a3-36f1-23b8bfe53043@turmel.org>


Am 18.02.2017 um 23:20 schrieb Phil Turmel:
> On 02/17/2017 05:03 AM, Reindl Harald wrote:
>
>>> Be careful. Don't confuse Raid10 with Raid1+0. They are NOT the
>>> same thing (on linux at least), although they are very similar
>>
>> yeah, i realized that but anyways thought the "writemostly" logic is
>> there too and maybe the docs not up-to-date
>
> Linux MD raid10 doesn't have a requirement that the number of devices
> be a multiple of the number of data copies.  Which creates "interesting"
> data layouts with odd numbers of devices or similar effects with ,n3 or
> ,f3 layouts.  Which makes it difficult if not impossible to designate
> specific devices as write mostly without weird operational asymmetries
> across the assembled array.

but since --writemostly doesn't get without manually intervention that 
cases would be unchanged (besides that they are unlikely)

> In other words, it is not at all like raid 1 on top of raid 0, except in
> certain very limited cases, and your assumptions are simply wrong.
>
> If there are features (other than layouts) of raid10 that make you
> prefer it to raid1, it would make sense to ask for those features to
> be implemented in raid1.

writemostly it's also very appealing on existing setups, the machine 
from where i type was installed in 2011

RAID1 don't have the benefit of doubled performance (also for writes, on 
a hybrid RAID slower but still faster than RAID1) *and* doubled space 
compared to a single disk combined with mirroring

another example: on machines like a HP microserver with only 4 drive 
slots that you could easily improve read-performance which is for many 
workloads the most important part by just switch half of the disk to SSD

price calculation for a hybrid RAID10 with 10 disks:
5x4 TB SSD = 5 x 1400€ = 7000€
5x4 TB HDD = 5 x 100€ = 500€
total price 7500€ versus 14000€ for flash-only

>> sadly i can't write a patch on my own but only point how useful it
>> would be
>
> That's unfortunate.  Patches are generally welcome

i would be *seriously* willing to pay the inital patch for any kernel 
maintainer who takes it over - Fedora regulary does kernel-rebases on GA 
versions

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Phil Turmel @ 2017-02-19 17:31 UTC (permalink / raw)
  To: Reindl Harald, linux-raid
In-Reply-To: <5a11be45-1793-d4b7-7cf5-3adaef5740d6@thelounge.net>

On 02/18/2017 06:35 PM, Reindl Harald wrote:
> 
> Am 18.02.2017 um 23:20 schrieb Phil Turmel:

>> If there are features (other than layouts) of raid10 that make you
>> prefer it to raid1, it would make sense to ask for those features to
>> be implemented in raid1.
> 
> writemostly it's also very appealing on existing setups, the machine
> from where i type was installed in 2011
> 
> RAID1 don't have the benefit of doubled performance (also for writes, on
> a hybrid RAID slower but still faster than RAID1) *and* doubled space
> compared to a single disk combined with mirroring

Doubled capacity?  Vs. raid1?  No.  Raid10,n2 (,n2 is default) on two
devices yields the same capacity as raid1 on two devices.  Unless I'm
misunderstanding your point.

> another example: on machines like a HP microserver with only 4 drive
> slots that you could easily improve read-performance which is for many
> workloads the most important part by just switch half of the disk to SSD
> 
> price calculation for a hybrid RAID10 with 10 disks:
> 5x4 TB SSD = 5 x 1400€ = 7000€
> 5x4 TB HDD = 5 x 100€ = 500€
> total price 7500€ versus 14000€ for flash-only

What is preventing you from using the existing raid1 in pairs with
write mostly, then layering raid0 on top of them for the capacity you
are trying to achieve?  No new code required.  What you are asking for
really is raid1+0, which MD raid allows you to assemble yourself.

> i would be *seriously* willing to pay the inital patch for any kernel
> maintainer who takes it over - Fedora regulary does kernel-rebases on GA
> versions

Since no new kernel code is needed to achieve what you desire, I doubt
a kernel patch for it would be accepted. (But I'm not a maintainer, so
YMMV.)  This is really a user-space question, along the lines of
"should/could mdadm automate creation of dual layers like raid1+0?"

Phil

^ permalink raw reply

* Re: RAID10 and 'writemostly' support
From: Reindl Harald @ 2017-02-19 17:48 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <08f52436-9de4-76d0-99c0-c90dedee8ae1@turmel.org>



Am 19.02.2017 um 18:31 schrieb Phil Turmel:
> On 02/18/2017 06:35 PM, Reindl Harald wrote:
>>
>> Am 18.02.2017 um 23:20 schrieb Phil Turmel:
>
>>> If there are features (other than layouts) of raid10 that make you
>>> prefer it to raid1, it would make sense to ask for those features to
>>> be implemented in raid1.
>>
>> writemostly it's also very appealing on existing setups, the machine
>> from where i type was installed in 2011
>>
>> RAID1 don't have the benefit of doubled performance (also for writes, on
>> a hybrid RAID slower but still faster than RAID1) *and* doubled space
>> compared to a single disk combined with mirroring
>
> Doubled capacity?  Vs. raid1?  No.  Raid10,n2 (,n2 is default) on two
> devices yields the same capacity as raid1 on two devices.  Unless I'm
> misunderstanding your point.

you are misunderstanding

RAID1:  2x2 TB = 2 TB usable
RAID10: 4x2 TB = 4 TB useable

typically smaller disks are cheaper and when i installed the 4x2 TB 
RAID10 4 TB disks where not that common and 4 TB SSD not available at 
all (and 2 TB SSD unpaibale)

>> another example: on machines like a HP microserver with only 4 drive
>> slots that you could easily improve read-performance which is for many
>> workloads the most important part by just switch half of the disk to SSD
>>
>> price calculation for a hybrid RAID10 with 10 disks:
>> 5x4 TB SSD = 5 x 1400€ = 7000€
>> 5x4 TB HDD = 5 x 100€ = 500€
>> total price 7500€ versus 14000€ for flash-only
>
> What is preventing you from using the existing raid1 in pairs with
> write mostly, then layering raid0 on top of them for the capacity you
> are trying to achieve?  No new code required.  What you are asking for
> really is raid1+0, which MD raid allows you to assemble yourself.

already existing setups and the easier configuraion of RAID10 than wrap 
2 RAID1 into a RAID0 especially at inital setup time when you also cover 
the os setup itself

/dev/md0         ext4        485M     33M  448M    7% /boot
/dev/md1         ext4         29G    6,8G   22G   24% /
/dev/md2         ext4        3,6T    2,3T  1,4T   63% /mnt/data

md0: RAID1
md1: RAID10
md2: RAID10

it's really not funny to change that existing layout from RAID10 to 
RAID0+RAID1

>> i would be *seriously* willing to pay the inital patch for any kernel
>> maintainer who takes it over - Fedora regulary does kernel-rebases on GA
>> versions
>
> Since no new kernel code is needed to achieve what you desire, I doubt
> a kernel patch for it would be accepted. (But I'm not a maintainer, so
> YMMV.)  This is really a user-space question, along the lines of
> "should/could mdadm automate creation of dual layers like raid1+0?"

at least "mdadm" in the current state should just refuse 
"--write-mostly" when the array is a RAID10 - in that case i would have 
known by testing it based on http://www.tansi.org/hybrid/ in a virtual 
machine that it *really* don't work with RAID10

obviously there is code needed to achieve "writemostly" on the most 
common setup of 4 disks for a RAID10 where you later try to replace half 
of the disks with SSD and have writes only on the remaining HDD

there are so many workloads where read-performance is more imprtant 
(boot, start of large applications, start virtual machines, rsync large 
data...)

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-02-19 23:42 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown
  Cc: colyli, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <20170217194117.cgbdm247p5p4gj6v@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 12744 bytes --]

On Fri, Feb 17 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@suse.de wrote:
>> 
>> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> > I/O barriers to happen only inside a slidingresync window, for regular
>> > I/Os out of this resync window they don't need to wait for barrier any
>> > more. On large raid1 device, it helps a lot to improve parallel writing
>> > I/O throughput when there are background resync I/Os performing at
>> > same time.
>> >
>> > The idea of sliding resync widow is awesome, but code complexity is a
>> > challenge. Sliding resync window requires several veriables to work
>> 
>> variables
>> 
>> > collectively, this is complexed and very hard to make it work correctly.
>> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
>> > to fix the original resync window patch. This is not the end, any further
>> > related modification may easily introduce more regreassion.
>> >
>> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> > removing resync window code, I believe life will be much easier.
>> >
>> > The brief idea of the simpler barrier is,
>> >  - Do not maintain a logbal unique resync window
>> 
>> global
>> 
>> >  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>> 
>> conflicts
>> 
>> >    I/O only has to wait for a resync I/O when both them have same barrier
>> >    bucket index, vice versa.
>> >  - I/O barrier can be recuded to an acceptable number if there are enought
>> 
>> reduced
>> enough
>> 
>> >    barrier buckets
>> >
>> > Here I explain how the barrier buckets are designed,
>> >  - BARRIER_UNIT_SECTOR_SIZE
>> >    The whole LBA address space of a raid1 device is divided into multiple
>> >    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> >    Bio request won't go across border of barrier unit size, that means
>> 
>> requests
>> 
>> >    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>> >    For random I/O 64MB is large enough for both read and write requests,
>> >    for sequential I/O considering underlying block layer may merge them
>> >    into larger requests, 64MB is still good enough.
>> >    Neil also points out that for resync operation, "we want the resync to
>> >    move from region to region fairly quickly so that the slowness caused
>> >    by having to synchronize with the resync is averaged out over a fairly
>> >    small time frame". For full speed resync, 64MB should take less then 1
>> >    second. When resync is competing with other I/O, it could take up a few
>> >    minutes. Therefore 64MB size is fairly good range for resync.
>> >
>> >  - BARRIER_BUCKETS_NR
>> >    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>> >         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>> >         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>> >    this patch makes the bellowed members of struct r1conf from integer
>> >    to array of integers,
>> >         -       int                     nr_pending;
>> >         -       int                     nr_waiting;
>> >         -       int                     nr_queued;
>> >         -       int                     barrier;
>> >         +       int                     *nr_pending;
>> >         +       int                     *nr_waiting;
>> >         +       int                     *nr_queued;
>> >         +       int                     *barrier;
>> >    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>> >    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>> >    barrier buckets, and each array of integers occupies single memory page.
>> >    1024 means for a request which is smaller than the I/O barrier unit size
>> >    has ~0.1% chance to wait for resync to pause, which is quite a small
>> >    enough fraction. Also requesting single memory page is more friendly to
>> >    kernel page allocator than larger memory size.
>> >
>> >  - I/O barrier bucket is indexed by bio start sector
>> >    If multiple I/O requests hit different I/O barrier units, they only need
>> >    to compete I/O barrier with other I/Os which hit the same I/O barrier
>> >    bucket index with each other. The index of a barrier bucket which a
>> >    bio should look for is calculated by sector_to_idx() which is defined
>> >    in raid1.h as an inline function,
>> >         static inline int sector_to_idx(sector_t sector)
>> >         {
>> >                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>> >                                 BARRIER_BUCKETS_NR_BITS);
>> >         }
>> >    Here sector_nr is the start sector number of a bio.
>> 
>> "hash_long() is used so that sequential writes in are region of the
>> array which is not being resynced will not consistently align with
>> the buckets that are being sequentially resynced, as described below"
>> 
>> >
>> >  - Single bio won't go across boundary of a I/O barrier unit
>> >    If a request goes across boundary of barrier unit, it will be split. A
>> >    bio may be split in raid1_make_request() or raid1_sync_request(), if
>> >    sectors returned by align_to_barrier_unit_end() is small than original
>> 
>> smaller
>> 
>> >    bio size.
>> >
>> > Comparing to single sliding resync window,
>> >  - Currently resync I/O grows linearly, therefore regular and resync I/O
>> >    will have confliction within a single barrier units. So the I/O
>> 
>> ... will conflict within ...
>> 
>> >    behavior is similar to single sliding resync window.
>> >  - But a barrier unit bucket is shared by all barrier units with identical
>> >    barrier uinit index, the probability of confliction might be higher
>> >    than single sliding resync window, in condition that writing I/Os
>> >    always hit barrier units which have identical barrier bucket indexs with
>> >    the resync I/Os. This is a very rare condition in real I/O work loads,
>> >    I cannot imagine how it could happen in practice.
>> >  - Therefore we can achieve a good enough low confliction rate with much
>> 
>> ... low conflict rate ...
>> 
>> >    simpler barrier algorithm and implementation.
>> >
>> > There are two changes should be noticed,
>> >  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>> >    single loop, it looks like this,
>> >         spin_lock_irqsave(&conf->device_lock, flags);
>> >         conf->nr_queued[idx]--;
>> >         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >    This change generates more spin lock operations, but in next patch of
>> >    this patch set, it will be replaced by a single line code,
>> >         atomic_dec(&conf->nr_queueud[idx]);
>> >    So we don't need to worry about spin lock cost here.
>> >  - Mainline raid1 code split original raid1_make_request() into
>> >    raid1_read_request() and raid1_write_request(). If the original bio
>> >    goes across an I/O barrier unit size, this bio will be split before
>> >    calling raid1_read_request() or raid1_write_request(),  this change
>> >    the code logic more simple and clear.
>> >  - In this patch wait_barrier() is moved from raid1_make_request() to
>> >    raid1_write_request(). In raid_read_request(), original wait_barrier()
>> >    is replaced by raid1_read_request().
>> >    The differnece is wait_read_barrier() only waits if array is frozen,
>> >    using different barrier function in different code path makes the code
>> >    more clean and easy to read.
>> 
>> Thank you for putting the effort into writing a comprehensve change
>> description.  I really appreciate it.
>> 
>> >  
>> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> >  
>> >  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>> >  {
>> > -	struct r1conf *conf = mddev->private;
>> > -	struct r1bio *r1_bio;
>> > +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>> > +	struct bio *split;
>> > +	sector_t sectors;
>> >  
>> > -	/*
>> > -	 * make_request() can abort the operation when read-ahead is being
>> > -	 * used and no empty request is available.
>> > -	 *
>> > -	 */
>> > -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> > +	make_request_fn = (bio_data_dir(bio) == READ) ?
>> > +			  raid1_read_request : raid1_write_request;
>> >  
>> > -	r1_bio->master_bio = bio;
>> > -	r1_bio->sectors = bio_sectors(bio);
>> > -	r1_bio->state = 0;
>> > -	r1_bio->mddev = mddev;
>> > -	r1_bio->sector = bio->bi_iter.bi_sector;
>> > -
>> > -	/*
>> > -	 * We might need to issue multiple reads to different devices if there
>> > -	 * are bad blocks around, so we keep track of the number of reads in
>> > -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
>> > -	 * no locking will be needed when requests complete.  If it is
>> > -	 * non-zero, then it is the number of not-completed requests.
>> > -	 */
>> > -	bio->bi_phys_segments = 0;
>> > -	bio_clear_flag(bio, BIO_SEG_VALID);
>> > +	/* if bio exceeds barrier unit boundary, split it */
>> > +	do {
>> > +		sectors = align_to_barrier_unit_end(
>> > +				bio->bi_iter.bi_sector, bio_sectors(bio));
>> > +		if (sectors < bio_sectors(bio)) {
>> > +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>> > +			bio_chain(split, bio);
>> > +		} else {
>> > +			split = bio;
>> > +		}
>> >  
>> > -	if (bio_data_dir(bio) == READ)
>> > -		raid1_read_request(mddev, bio, r1_bio);
>> > -	else
>> > -		raid1_write_request(mddev, bio, r1_bio);
>> > +		make_request_fn(mddev, split);
>> > +	} while (split != bio);
>> >  }
>> 
>> I know you are going to change this as Shaohua wantsthe spitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1.  If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>> 
>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         make_request_fn(split);
>>         generic_make_request(bio);
>>    } else
>>         make_request_fn(mddev, bio);
>> 
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices.  These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.  Then it will process the remainder
>> of the original bio once the first section has been fully processed.
>
> Good point! raid10 has the same problem. Looks this doesn't solve the issue for
> device with 3 times stack though.
>
> I knew you guys are working on this issue in block layer. Should we fix the
> issue in MD side (for 2 stack devices) or wait for the block layer patch?

We cannot fix everything at the block layer, or at the individual device
layer.  We need changes in both.
I think that looping over bios in a device driver is wrong and can
easily lead to deadlocks.  We should remove that from md.
If the block layer gets fixed that way I want it to, then we could move
the generic_make_request() call earlier, so that above could be

>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         generic_make_request(bio);
>>         bio = split()
>>    }
>>    make_request_fn(mddev, bio);

which is slightly simpler.  But the original would still work.

So yes, I think we need this change in md/raid1. I suspect that if
you built a kernel with a smaller BARRIER_UNIT_SECTOR_BITS  - e.g. 4 -
you could very easily trigger a deadlock with md/raid1 on scsi.
At 17, it is not quite so easy, but is a real possibility.

I've had similar deadlocks reported before when the code wasn't quite
careful enough.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-02-19 23:50 UTC (permalink / raw)
  To: Coly Li, NeilBrown, linux-raid
  Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <2f6b3d68-1536-3167-7362-78fdfa91e149@suse.de>

[-- Attachment #1: Type: text/plain, Size: 2454 bytes --]

On Fri, Feb 17 2017, Coly Li wrote:

> On 2017/2/16 下午3:04, NeilBrown wrote:
>> I know you are going to change this as Shaohua wantsthe spitting to
>> happen in a separate function, which I agree with, but there is 
>> something else wrong here. Calling bio_split/bio_chain repeatedly
>> in a loop is dangerous. It is OK for simple devices, but when one
>> request can wait for another request to the same device it can
>> deadlock. This can happen with raid1.  If a resync request calls
>> raise_barrier() between one request and the next, then the next has
>> to wait for the resync request, which has to wait for the first
>> request. As the first request will be stuck in the queue in 
>> generic_make_request(), you get a deadlock.
>
> For md raid1, queue in generic_make_request(), can I understand it as
> bio_list_on_stack in this function? And queue in underlying device,
> can I understand it as the data structures like plug->pending and
> conf->pending_bio_list ?

Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
is the only queue I am talking about.  I'm not referring to
plug->pending or conf->pending_bio_list at all.

>
> I still don't get the point of deadlock, let me try to explain why I
> don't see the possible deadlock. If a bio is split, and the first part
> is processed by make_request_fn(), and then a resync comes and it will
> raise a barrier, there are 3 possible conditions,
> - the resync I/O tries to raise barrier on same bucket of the first
> regular bio. Then the resync task has to wait to the first bio drops
> its conf->nr_pending[idx]

Not quite.
First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
to be zero.  We can assume this happens immediately.
Then the resync_task will increment ->barrier[idx].
Only then will it wait for the first bio to drop ->nr_pending[idx].
The processing of that first bio will have submitted bios to the
underlying device, and they will be in the bio_list_on_stack queue, and
will not be processed until raid1_make_request() completes.

The loop in raid1_make_request() will then call make_request_fn() which
will call wait_barrier(), which will wait for ->barrier[idx] to be zero.

So raid1_make_request is waiting for the resync to progress, and resync
is waiting for a bio which is on bio_list_on_stack which won't be
processed until raid1_make_request() completes.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH V4 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-02-20  0:31 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1487358357-123924-1-git-send-email-colyli@suse.de>

[-- Attachment #1: Type: text/plain, Size: 1592 bytes --]

On Sat, Feb 18 2017, colyli@suse.de wrote:

> @@ -1447,36 +1497,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  
>  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>  {
> -	struct r1conf *conf = mddev->private;
> -	struct r1bio *r1_bio;
> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> +	struct bio *split;
> +	sector_t sectors;
>  
> -	/*
> -	 * make_request() can abort the operation when read-ahead is being
> -	 * used and no empty request is available.
> -	 *
> -	 */
> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +	make_request_fn = (bio_data_dir(bio) == READ) ?
> +			  raid1_read_request : raid1_write_request;
>  
....
>  
> -	if (bio_data_dir(bio) == READ)
> -		raid1_read_request(mddev, bio, r1_bio);
> -	else
> -		raid1_write_request(mddev, bio, r1_bio);
> +		make_request_fn(mddev, split);
> +	} while (split != bio);
>  }

I don't think the use of make_request_fn() makes the code more readable
or more efficient, and it quite possibly has a cost.
If you left it as

	if (bio_data_dir(bio) == READ)
		raid1_read_request(mddev, bio, r1_bio);
	else
		raid1_write_request(mddev, bio, r1_bio);

then gcc would notice that both raid1_read_request and
raid1_write_request are static functions that are only used once, and
will normally inline them.  This will reduce the total stack depth,
which you expressed some concern about in a previous email.
Using a function pointer like this makes it harder for gcc to perform
that optimization.


NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* assistance recovering failed raid6 array
From: Martin Bosner @ 2017-02-20  1:49 UTC (permalink / raw)
  To: linux-raid

I am running a software raid6 with 36 x 3TB disks (sda to sdaj). All
disks have one partition (gpt, 100%, primary, raid on) and i am using
btrfs on top of the raid.

Last week one of the disks failed and was unrecoverable. I replaced the
disk (sdk) with a new one and the resync process started.
At around 80% recovery two further disks failed and the recovery process
was stopped. That failed disks are sdm and sdh.

All other disks seem to be fine and I was about the use the "mdadm
--create" command when i remembered the lines
"You have been warned! It's better to send an email to the linux-raid
mailing list with detailed information"

So here i am for an advice how to continue.

More details:

Only 35% of the raid space is used.

The disks status is:
sdk: original disk is dead and the replacement was around 80% recovered.
sdm: i was able to copy the first 2 TB with two errors (128kbyte) and
the third TB with around 200GB missing data using ddrescue to a new disk.
sdh: the original disk is dead and i replaced it with a brand new one
and created the partition sdh1.

Since the array is offline i cannot add sdh1 to the raid and trying to
assemble the array gives me:

For mdadm --assemble --force with sdh1:
mdadm: no RAID superblock on /dev/sdh1
mdadm: /dev/sdh1 has no superblock - assembly aborted

For mdadm --assemble --force without sdh1:
mdadm: /dev/md0 assembled from 33 drives, 1 rebuilding and 1 spare - not
enough to start the array.

Full status of /dev/sda1:

mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 5c7c227e:22de5fc1:ca3ebb65:9c283567
           Name : media-storage:0  (local to host media-storage)
  Creation Time : Sun Sep 18 22:46:42 2016
     Raid Level : raid6
   Raid Devices : 36

 Avail Dev Size : 5860268032 (2794.39 GiB 3000.46 GB)
     Array Size : 99624556544 (95009.38 GiB 102015.55 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : f90e9c41:5aa3c3b2:d715781b:1abbb439

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Feb 15 14:08:28 2017
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : b0b57ef2 - correct
         Events : 140559

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAAAA.AA.AAAAAAAAAAAAAAAAAAAAAAAAA ('A' == active,
'.' == missing, 'R' == replacing)

mdadm --examine for each drive to get "Device Role":

"sda    Device Role : Active device 0"
"sdb    Device Role : Active device 1"
"sdc    Device Role : Active device 2"
"sdd    Device Role : Active device 3"
"sde    Device Role : Active device 4"
"sdf    Device Role : Active device 5"
"sdg    Device Role : Active device 6"
"sdh" mdadm: No md superblock detected on /dev/sdh1.
"sdi    Device Role : Active device 8"
"sdj    Device Role : Active device 9"
"sdk    Device Role : spare"
"sdl    Device Role : Active device 11"
"sdm    Device Role : Active device 12"
"sdn    Device Role : Active device 13"
"sdo    Device Role : Active device 14"
"sdp    Device Role : Active device 15"
"sdq    Device Role : Active device 16"
"sdr    Device Role : Active device 17"
"sds    Device Role : Active device 18"
"sdt    Device Role : Active device 19"
"sdu    Device Role : Active device 20"
"sdv    Device Role : Active device 21"
"sdw    Device Role : Active device 22"
"sdx    Device Role : Active device 23"
"sdy    Device Role : Active device 24"
"sdz    Device Role : Active device 25"
"sdaa    Device Role : Active device 26"
"sdab    Device Role : Active device 27"
"sdac    Device Role : Active device 28"
"sdad    Device Role : Active device 29"
"sdae    Device Role : Active device 30"
"sdaf    Device Role : Active device 31"
"sdag    Device Role : Active device 32"
"sdah    Device Role : Active device 33"
"sdai    Device Role : Active device 34"
"sdaj    Device Role : Active device 35"

The system is Ubuntu 16.04.2 LTS (x86_64) with a 4.4.0-62-generic kernel.

mdadm --version gives me: mdadm - v3.3 - 3rd September 2013

-- 

<https://www.postbox-inc.com/?utm_source=email&utm_medium=siglink&utm_campaign=reach>

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-02-20  2:51 UTC (permalink / raw)
  To: Coly Li, NeilBrown, linux-raid
  Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <87shn9spsy.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 2936 bytes --]

On Mon, Feb 20 2017, NeilBrown wrote:

> On Fri, Feb 17 2017, Coly Li wrote:
>
>> On 2017/2/16 下午3:04, NeilBrown wrote:
>>> I know you are going to change this as Shaohua wantsthe spitting to
>>> happen in a separate function, which I agree with, but there is 
>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>> in a loop is dangerous. It is OK for simple devices, but when one
>>> request can wait for another request to the same device it can
>>> deadlock. This can happen with raid1.  If a resync request calls
>>> raise_barrier() between one request and the next, then the next has
>>> to wait for the resync request, which has to wait for the first
>>> request. As the first request will be stuck in the queue in 
>>> generic_make_request(), you get a deadlock.
>>
>> For md raid1, queue in generic_make_request(), can I understand it as
>> bio_list_on_stack in this function? And queue in underlying device,
>> can I understand it as the data structures like plug->pending and
>> conf->pending_bio_list ?
>
> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
> is the only queue I am talking about.  I'm not referring to
> plug->pending or conf->pending_bio_list at all.
>
>>
>> I still don't get the point of deadlock, let me try to explain why I
>> don't see the possible deadlock. If a bio is split, and the first part
>> is processed by make_request_fn(), and then a resync comes and it will
>> raise a barrier, there are 3 possible conditions,
>> - the resync I/O tries to raise barrier on same bucket of the first
>> regular bio. Then the resync task has to wait to the first bio drops
>> its conf->nr_pending[idx]
>
> Not quite.
> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> to be zero.  We can assume this happens immediately.
> Then the resync_task will increment ->barrier[idx].
> Only then will it wait for the first bio to drop ->nr_pending[idx].
> The processing of that first bio will have submitted bios to the
> underlying device, and they will be in the bio_list_on_stack queue, and
> will not be processed until raid1_make_request() completes.
>
> The loop in raid1_make_request() will then call make_request_fn() which
> will call wait_barrier(), which will wait for ->barrier[idx] to be
> zero.

Thinking more carefully about this.. the 'idx' that the second bio will
wait for will normally be different, so there won't be a deadlock after
all.

However it is possible for hash_long() to produce the same idx for two
consecutive barrier_units so there is still the possibility of a
deadlock, though it isn't as likely as I thought at first.

NeilBrown


>
> So raid1_make_request is waiting for the resync to progress, and resync
> is waiting for a bio which is on bio_list_on_stack which won't be
> processed until raid1_make_request() completes.
>
> NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-02-20  7:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: Coly Li, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <87k28lshg5.fsf@notabene.neil.brown.name>

On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> On Mon, Feb 20 2017, NeilBrown wrote:
> 
> > On Fri, Feb 17 2017, Coly Li wrote:
> >
> >> On 2017/2/16 下午3:04, NeilBrown wrote:
> >>> I know you are going to change this as Shaohua wantsthe spitting to
> >>> happen in a separate function, which I agree with, but there is 
> >>> something else wrong here. Calling bio_split/bio_chain repeatedly
> >>> in a loop is dangerous. It is OK for simple devices, but when one
> >>> request can wait for another request to the same device it can
> >>> deadlock. This can happen with raid1.  If a resync request calls
> >>> raise_barrier() between one request and the next, then the next has
> >>> to wait for the resync request, which has to wait for the first
> >>> request. As the first request will be stuck in the queue in 
> >>> generic_make_request(), you get a deadlock.
> >>
> >> For md raid1, queue in generic_make_request(), can I understand it as
> >> bio_list_on_stack in this function? And queue in underlying device,
> >> can I understand it as the data structures like plug->pending and
> >> conf->pending_bio_list ?
> >
> > Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
> > is the only queue I am talking about.  I'm not referring to
> > plug->pending or conf->pending_bio_list at all.
> >
> >>
> >> I still don't get the point of deadlock, let me try to explain why I
> >> don't see the possible deadlock. If a bio is split, and the first part
> >> is processed by make_request_fn(), and then a resync comes and it will
> >> raise a barrier, there are 3 possible conditions,
> >> - the resync I/O tries to raise barrier on same bucket of the first
> >> regular bio. Then the resync task has to wait to the first bio drops
> >> its conf->nr_pending[idx]
> >
> > Not quite.
> > First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> > to be zero.  We can assume this happens immediately.
> > Then the resync_task will increment ->barrier[idx].
> > Only then will it wait for the first bio to drop ->nr_pending[idx].
> > The processing of that first bio will have submitted bios to the
> > underlying device, and they will be in the bio_list_on_stack queue, and
> > will not be processed until raid1_make_request() completes.
> >
> > The loop in raid1_make_request() will then call make_request_fn() which
> > will call wait_barrier(), which will wait for ->barrier[idx] to be
> > zero.
> 
> Thinking more carefully about this.. the 'idx' that the second bio will
> wait for will normally be different, so there won't be a deadlock after
> all.
> 
> However it is possible for hash_long() to produce the same idx for two
> consecutive barrier_units so there is still the possibility of a
> deadlock, though it isn't as likely as I thought at first.

Wrapped the function pointer issue Neil pointed out into Coly's original patch.
Also fix a 'use-after-free' bug. For the deadlock issue, I'll add below patch,
please check.

Thanks,
Shaohua

From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Sun, 19 Feb 2017 22:18:32 -0800
Subject: [PATCH] md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:

1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrer, add underlayer disk bio to
current->bio_list
4. __make_request(bio2), wait_barrer

If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() doesn't finished yet.

The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
   } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c  | 28 ++++++++++++++--------------
 drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 676f72d..e55d865 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t sectors;
 
 	/* if bio exceeds barrier unit boundary, split it */
-	do {
-		sectors = align_to_barrier_unit_end(
-				bio->bi_iter.bi_sector, bio_sectors(bio));
-		if (sectors < bio_sectors(bio)) {
-			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	sectors = align_to_barrier_unit_end(
+			bio->bi_iter.bi_sector, bio_sectors(bio));
+	if (sectors < bio_sectors(bio)) {
+		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		if (bio_data_dir(split) == READ)
-			raid1_read_request(mddev, split);
-		else
-			raid1_write_request(mddev, split);
-	} while (split != bio);
+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a1f8e98..b495049 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
-	do {
-
-		/*
-		 * If this request crosses a chunk boundary, we need to split
-		 * it.
-		 */
-		if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
-			     bio_sectors(bio) > chunk_sects
-			     && (conf->geo.near_copies < conf->geo.raid_disks
-				 || conf->prev.near_copies <
-				 conf->prev.raid_disks))) {
-			split = bio_split(bio, chunk_sects -
-					  (bio->bi_iter.bi_sector &
-					   (chunk_sects - 1)),
-					  GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	/*
+	 * If this request crosses a chunk boundary, we need to split
+	 * it.
+	 */
+	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
+		     bio_sectors(bio) > chunk_sects
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies <
+			 conf->prev.raid_disks))) {
+		split = bio_split(bio, chunk_sects -
+				  (bio->bi_iter.bi_sector &
+				   (chunk_sects - 1)),
+				  GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		__make_request(mddev, split);
-	} while (split != bio);
+	__make_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-20  8:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <20170220070430.4mca7clpaw7kpj4j@kernel.org>




发自我的 iPhone
> 在 2017年2月20日，下午3:04，Shaohua Li <shli@kernel.org> 写道：
> 
>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>> 
>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>> 
>>>>> On 2017/2/16 下午3:04, NeilBrown wrote:
>>>>> I know you are going to change this as Shaohua wantsthe spitting to
>>>>> happen in a separate function, which I agree with, but there is 
>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>> request can wait for another request to the same device it can
>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>> raise_barrier() between one request and the next, then the next has
>>>>> to wait for the resync request, which has to wait for the first
>>>>> request. As the first request will be stuck in the queue in 
>>>>> generic_make_request(), you get a deadlock.
>>>> 
>>>> For md raid1, queue in generic_make_request(), can I understand it as
>>>> bio_list_on_stack in this function? And queue in underlying device,
>>>> can I understand it as the data structures like plug->pending and
>>>> conf->pending_bio_list ?
>>> 
>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>> is the only queue I am talking about.  I'm not referring to
>>> plug->pending or conf->pending_bio_list at all.
>>> 
>>>> 
>>>> I still don't get the point of deadlock, let me try to explain why I
>>>> don't see the possible deadlock. If a bio is split, and the first part
>>>> is processed by make_request_fn(), and then a resync comes and it will
>>>> raise a barrier, there are 3 possible conditions,
>>>> - the resync I/O tries to raise barrier on same bucket of the first
>>>> regular bio. Then the resync task has to wait to the first bio drops
>>>> its conf->nr_pending[idx]
>>> 
>>> Not quite.
>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>> to be zero.  We can assume this happens immediately.
>>> Then the resync_task will increment ->barrier[idx].
>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>> The processing of that first bio will have submitted bios to the
>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>> will not be processed until raid1_make_request() completes.
>>> 
>>> The loop in raid1_make_request() will then call make_request_fn() which
>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>> zero.
>> 
>> Thinking more carefully about this.. the 'idx' that the second bio will
>> wait for will normally be different, so there won't be a deadlock after
>> all.
>> 
>> However it is possible for hash_long() to produce the same idx for two
>> consecutive barrier_units so there is still the possibility of a
>> deadlock, though it isn't as likely as I thought at first.
> 
> Wrapped the function pointer issue Neil pointed out into Coly's original patch.
> Also fix a 'use-after-free' bug. For the deadlock issue, I'll add below patch,
> please check.
> 
> Thanks,
> Shaohua
> 

Hmm, please hold, I am still thinking of it. With barrier bucket and hash_long(), I don't see dead lock yet. For raid10 it might happen, but once we have barrier bucket on it , there will no deadlock.

My question is, this deadlock only happens when a big bio is split, and the split small bios are continuous, and the resync io visiting barrier buckets in sequntial order too. In the case if adjacent split regular bios or resync bios hit same barrier bucket, it will be a very big failure of hash design, and should have been found already. But no one complain it, so I don't convince myself tje deadlock is real with io barrier buckets (this is what Neil concerns).

For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..

The loop in raid1_make_request() is quite high level, I am not sure whether CPU brach pridiction may work correctly, especially when it is a big DISCARD bio, using function pointer may drop a possible brach.

So I need to check what we get and lose when use function pointer or not. If it is not urgent, please hold this patch for a while.

The only thing I worry in the bellowed patch is, if a very big DISCARD bio comes, will the kernel space stack trend to be overflow？

Thanks.

Coly





> From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
> Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
> From: Shaohua Li <shli@fb.com>
> Date: Sun, 19 Feb 2017 22:18:32 -0800
> Subject: [PATCH] md/raid1/10: fix potential deadlock
> 
> Neil Brown pointed out a potential deadlock in raid 10 code with
> bio_split/chain. The raid1 code could have the same issue, but recent
> barrier rework makes it less likely to happen. The deadlock happens in
> below sequence:
> 
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request will split bio to bio1 and bio2
> 3. __make_request(bio1), wait_barrer, add underlayer disk bio to
> current->bio_list
> 4. __make_request(bio2), wait_barrer
> 
> If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
> raise_barrier waits for IO completion from 3. And since raise_barrier
> sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
> dispatched because raid10_make_request() doesn't finished yet.
> 
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
> 
>    if (need to split) {
>        split = bio_split(bio, ...)
>        bio_chain(...)
>        make_request_fn(split);
>        generic_make_request(bio);
>   } else
>        make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
> 
> Cc: Coly Li <colyli@suse.de>
> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/raid1.c  | 28 ++++++++++++++--------------
> drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
> 2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 676f72d..e55d865 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>    sector_t sectors;
> 
>    /* if bio exceeds barrier unit boundary, split it */
> -    do {
> -        sectors = align_to_barrier_unit_end(
> -                bio->bi_iter.bi_sector, bio_sectors(bio));
> -        if (sectors < bio_sectors(bio)) {
> -            split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    sectors = align_to_barrier_unit_end(
> +            bio->bi_iter.bi_sector, bio_sectors(bio));
> +    if (sectors < bio_sectors(bio)) {
> +        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        if (bio_data_dir(split) == READ)
> -            raid1_read_request(mddev, split);
> -        else
> -            raid1_write_request(mddev, split);
> -    } while (split != bio);
> +    if (bio_data_dir(split) == READ)
> +        raid1_read_request(mddev, split);
> +    else
> +        raid1_write_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> }
> 
> static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a1f8e98..b495049 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>        return;
>    }
> 
> -    do {
> -
> -        /*
> -         * If this request crosses a chunk boundary, we need to split
> -         * it.
> -         */
> -        if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> -                 bio_sectors(bio) > chunk_sects
> -                 && (conf->geo.near_copies < conf->geo.raid_disks
> -                 || conf->prev.near_copies <
> -                 conf->prev.raid_disks))) {
> -            split = bio_split(bio, chunk_sects -
> -                      (bio->bi_iter.bi_sector &
> -                       (chunk_sects - 1)),
> -                      GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    /*
> +     * If this request crosses a chunk boundary, we need to split
> +     * it.
> +     */
> +    if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> +             bio_sectors(bio) > chunk_sects
> +             && (conf->geo.near_copies < conf->geo.raid_disks
> +             || conf->prev.near_copies <
> +             conf->prev.raid_disks))) {
> +        split = bio_split(bio, chunk_sects -
> +                  (bio->bi_iter.bi_sector &
> +                   (chunk_sects - 1)),
> +                  GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        __make_request(mddev, split);
> -    } while (split != bio);
> +    __make_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> 
>    /* In case raid10d snuck in to freeze_array */
>    wake_up(&conf->wait_barrier);
> -- 
> 2.9.3
> 


^ permalink raw reply

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Coly Li @ 2017-02-20  8:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <7178ADEA-6263-4FB0-95A5-96E5F71A9740@suse.de>


> 在 2017年2月20日，下午4:07，Coly Li <colyli@suse.de> 写道：
> 
> 
> 
> 
> 发自我的 iPhone
>>> 在 2017年2月20日，下午3:04，Shaohua Li <shli@kernel.org> 写道：
>>> 
>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>> 
>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>> 
>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote:
>>>>>> I know you are going to change this as Shaohua wantsthe spitting to
>>>>>> happen in a separate function, which I agree with, but there is 
>>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>>> request can wait for another request to the same device it can
>>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>>> raise_barrier() between one request and the next, then the next has
>>>>>> to wait for the resync request, which has to wait for the first
>>>>>> request. As the first request will be stuck in the queue in 
>>>>>> generic_make_request(), you get a deadlock.
>>>>> 
>>>>> For md raid1, queue in generic_make_request(), can I understand it as
>>>>> bio_list_on_stack in this function? And queue in underlying device,
>>>>> can I understand it as the data structures like plug->pending and
>>>>> conf->pending_bio_list ?
>>>> 
>>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>>> is the only queue I am talking about.  I'm not referring to
>>>> plug->pending or conf->pending_bio_list at all.
>>>> 
>>>>> 
>>>>> I still don't get the point of deadlock, let me try to explain why I
>>>>> don't see the possible deadlock. If a bio is split, and the first part
>>>>> is processed by make_request_fn(), and then a resync comes and it will
>>>>> raise a barrier, there are 3 possible conditions,
>>>>> - the resync I/O tries to raise barrier on same bucket of the first
>>>>> regular bio. Then the resync task has to wait to the first bio drops
>>>>> its conf->nr_pending[idx]
>>>> 
>>>> Not quite.
>>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>>> to be zero.  We can assume this happens immediately.
>>>> Then the resync_task will increment ->barrier[idx].
>>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>>> The processing of that first bio will have submitted bios to the
>>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>>> will not be processed until raid1_make_request() completes.
>>>> 
>>>> The loop in raid1_make_request() will then call make_request_fn() which
>>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>>> zero.
>>> 
>>> Thinking more carefully about this.. the 'idx' that the second bio will
>>> wait for will normally be different, so there won't be a deadlock after
>>> all.
>>> 
>>> However it is possible for hash_long() to produce the same idx for two
>>> consecutive barrier_units so there is still the possibility of a
>>> deadlock, though it isn't as likely as I thought at first.
>> 
>> Wrapped the function pointer issue Neil pointed out into Coly's original patch.
>> Also fix a 'use-after-free' bug. For the deadlock issue, I'll add below patch,
>> please check.
>> 
>> Thanks,
>> Shaohua
>> 
> 
> Hmm, please hold, I am still thinking of it. With barrier bucket and hash_long(), I don't see dead lock yet. For raid10 it might happen, but once we have barrier bucket on it , there will no deadlock.
> 
> My question is, this deadlock only happens when a big bio is split, and the split small bios are continuous, and the resync io visiting barrier buckets in sequntial order too. In the case if adjacent split regular bios or resync bios hit same barrier bucket, it will be a very big failure of hash design, and should have been found already. But no one complain it, so I don't convince myself tje deadlock is real with io barrier buckets (this is what Neil concerns).
> 
> For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..
> 
> The loop in raid1_make_request() is quite high level, I am not sure whether CPU brach pridiction may work correctly, especially when it is a big DISCARD bio, using function pointer may drop a possible brach.
> 
> So I need to check what we get and lose when use function pointer or not. If it is not urgent, please hold this patch for a while.
> 
> The only thing I worry in the bellowed patch is, if a very big DISCARD bio comes, will the kernel space stack trend to be overflow？
> 

Before calling generic_make_request(), if we can do a check whether the stack is about to full, and schedule a small time out if there are too many bios stacked, maybe the stack oveflow can be avoided.

Coly



^ permalink raw reply

* Re: [md PATCH 10/14] md/raid1: stop using bi_phys_segment
From: Ming Lei @ 2017-02-20 10:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Christoph Hellwig
In-Reply-To: <148721994272.7521.4656357457037580183.stgit@noble>

On Thu, Feb 16, 2017 at 12:39 PM, NeilBrown <neilb@suse.com> wrote:
> Change to use bio->__bi_remaining to count number of r1bio attached
> to a bio.
> See precious raid10 patch for more details.
>
> inc_pending is a little more complicated in raid1 as we need to adjust
> next_window_requests or current_window_requests.
>
> The wait_event() call if start_next_window is no longer needed as each
> r1_bio is completed separately.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/raid1.c |   99 ++++++++++++++++++----------------------------------
>  1 file changed, 34 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c1d0675880fb..5a57111c7bc9 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -241,36 +241,19 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  static void call_bio_endio(struct r1bio *r1_bio)
>  {
>         struct bio *bio = r1_bio->master_bio;
> -       int done;
>         struct r1conf *conf = r1_bio->mddev->private;
>         sector_t start_next_window = r1_bio->start_next_window;
>         sector_t bi_sector = bio->bi_iter.bi_sector;
>
> -       if (bio->bi_phys_segments) {
> -               unsigned long flags;
> -               spin_lock_irqsave(&conf->device_lock, flags);
> -               bio->bi_phys_segments--;
> -               done = (bio->bi_phys_segments == 0);
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -               /*
> -                * make_request() might be waiting for
> -                * bi_phys_segments to decrease
> -                */
> -               wake_up(&conf->wait_barrier);
> -       } else
> -               done = 1;
> -
>         if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>                 bio->bi_error = -EIO;
>
> -       if (done) {
> -               bio_endio(bio);
> -               /*
> -                * Wake up any possible resync thread that waits for the device
> -                * to go idle.
> -                */
> -               allow_barrier(conf, start_next_window, bi_sector);
> -       }
> +       bio_endio(bio);
> +       /*
> +        * Wake up any possible resync thread that waits for the device
> +        * to go idle.
> +        */
> +       allow_barrier(conf, start_next_window, bi_sector);
>  }
>
>  static void raid_end_bio_io(struct r1bio *r1_bio)
> @@ -923,6 +906,26 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>         return sector;
>  }
>
> +static void inc_pending(struct r1conf *conf, sector_t start_next_window,
> +                       sector_t bi_sector)
> +{
> +       /* The current request requires multiple r1_bio, so
> +        * we need to increment the pending count, and the corresponding
> +        * window count.
> +        */
> +       spin_lock(&conf->resync_lock);
> +       conf->nr_pending++;

Just be curious, in current code 'nr_pending' is increased in wait_barrier(),
and looks this patch introduces inc_pending() to do that on each r10_bio, but
not see any change in wait_barrier(), so that means there might be issue in
current implementation about operating on this counter?

> +       if (start_next_window == conf->start_next_window) {
> +               if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> +                   <= bi_sector)
> +                       conf->next_window_requests++;
> +               else
> +                       conf->current_window_requests++;
> +       } else
> +               conf->current_window_requests++;
> +       spin_unlock(&conf->resync_lock);
> +}
> +
>  static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>                           sector_t bi_sector)
>  {
> @@ -1137,12 +1140,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>                 sectors_handled = (r1_bio->sector + max_sectors
>                                    - bio->bi_iter.bi_sector);
>                 r1_bio->sectors = max_sectors;
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               bio_inc_remaining(bio);
>
>                 /*
>                  * Cannot call generic_make_request directly as that will be
> @@ -1304,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>         if (unlikely(blocked_rdev)) {
>                 /* Wait for this device to become unblocked */
>                 int j;
> -               sector_t old = start_next_window;
>
>                 for (j = 0; j < i; j++)
>                         if (r1_bio->bios[j])
> @@ -1314,15 +1311,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>                 raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
>                 md_wait_for_blocked_rdev(blocked_rdev, mddev);
>                 start_next_window = wait_barrier(conf, bio);
> -               /*
> -                * We must make sure the multi r1bios of bio have
> -                * the same value of bi_phys_segments
> -                */
> -               if (bio->bi_phys_segments && old &&
> -                   old != start_next_window)
> -                       /* Wait for the former r1bio(s) to complete */
> -                       wait_event(conf->wait_barrier,
> -                                  bio->bi_phys_segments == 1);
>                 goto retry_write;
>         }
>
> @@ -1417,22 +1405,18 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>          * as it could result in the bio being freed.
>          */
>         if (sectors_handled < bio_sectors(bio)) {
> -               /* We need another r1_bio, which must be accounted
> -                * in bio->bi_phys_segments
> -                */
> -               spin_lock_irq(&conf->device_lock);
> -               if (bio->bi_phys_segments == 0)
> -                       bio->bi_phys_segments = 2;
> -               else
> -                       bio->bi_phys_segments++;
> -               spin_unlock_irq(&conf->device_lock);
> +               /* We need another r1_bio, which must be counted */
> +               sector_t sect = bio->bi_iter.bi_sector + sectors_handled;
> +
> +               inc_pending(conf, start_next_window, sect);
> +               bio_inc_remaining(bio);
>                 r1_bio_write_done(r1_bio);
>                 r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>                 r1_bio->master_bio = bio;
>                 r1_bio->sectors = bio_sectors(bio) - sectors_handled;
>                 r1_bio->state = 0;
>                 r1_bio->mddev = mddev;
> -               r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
> +               r1_bio->sector = sect;
>                 goto retry_write;
>         }
>
> @@ -1460,16 +1444,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>         r1_bio->mddev = mddev;
>         r1_bio->sector = bio->bi_iter.bi_sector;
>
> -       /*
> -        * We might need to issue multiple reads to different devices if there
> -        * are bad blocks around, so we keep track of the number of reads in
> -        * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
> -        * no locking will be needed when requests complete.  If it is
> -        * non-zero, then it is the number of not-completed requests.
> -        */
> -       bio->bi_phys_segments = 0;
> -       bio_clear_flag(bio, BIO_SEG_VALID);
> -
>         if (bio_data_dir(bio) == READ)
>                 raid1_read_request(mddev, bio, r1_bio);
>         else
> @@ -2434,12 +2408,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>                         int sectors_handled = (r1_bio->sector + max_sectors
>                                                - mbio->bi_iter.bi_sector);
>                         r1_bio->sectors = max_sectors;
> -                       spin_lock_irq(&conf->device_lock);
> -                       if (mbio->bi_phys_segments == 0)
> -                               mbio->bi_phys_segments = 2;
> -                       else
> -                               mbio->bi_phys_segments++;
> -                       spin_unlock_irq(&conf->device_lock);
> +                       bio_inc_remaining(mbio);
>                         trace_block_bio_remap(bdev_get_queue(bio->bi_bdev),
>                                               bio, bio_dev, bio_sector);
>                         generic_make_request(bio);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Ming Lei

^ permalink raw reply

* Re: assistance recovering failed raid6 array
From: Phil Turmel @ 2017-02-20 15:39 UTC (permalink / raw)
  To: Martin Bosner, linux-raid
In-Reply-To: <58AA4B1E.1030809@bosner.de>

On 02/19/2017 08:49 PM, Martin Bosner wrote:
> I am running a software raid6 with 36 x 3TB disks (sda to sdaj). All
> disks have one partition (gpt, 100%, primary, raid on) and i am using
> btrfs on top of the raid.
> 
> Last week one of the disks failed and was unrecoverable. I replaced the
> disk (sdk) with a new one and the resync process started.
> At around 80% recovery two further disks failed and the recovery process
> was stopped. That failed disks are sdm and sdh.
> 
> All other disks seem to be fine and I was about the use the "mdadm
> --create" command when i remembered the lines
> "You have been warned! It's better to send an email to the linux-raid
> mailing list with detailed information"
> 
> So here i am for an advice how to continue.

More information, please.  Paste inline, untrimmed, in your reply with
line wrapping disabled.  Plain text only.  Use multiple mails if needed.
List limit is ~100k IIRC.

# dmesg

# for x in /dev/sd[a-z] /dev/sda[a-j] ; do echo mdadm -E ${x}1 ; smartctl -iA -l scterc $x ; done

Phil

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox