Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: [lvm2 PATCH] Remove special-case for md in 69-dm-lvm-metadata.rules
From: Peter Rajnoha @ 2017-01-04 10:27 UTC (permalink / raw)
  To: LVM2 development, linux-raid; +Cc: Lidong Zhong, neilb, GuoQing Jiang
In-Reply-To: <87eg0jmrgz.fsf@notabene.neil.brown.name>

On 01/04/2017 04:30 AM, NeilBrown wrote:
> 
> This special casing brings no value.  It appears to attempt to
> determine if the array is active yet or not, and to skip
> processing if the array has not yet been started.

Hi Neil,

those rules also have another use which is to not trigger further
unnecessary actions if the device is already up and running, hence
avoiding useless resource consumption if it brings no new information.
Currently, this is applied only for running pvscan on top of newly
activated MD device (that's why it's part of 69-dm-lvm-metad.rules
at the moment).

So what we also need is to detect the very first CHANGE event
that makes the device active and to make a difference between this
first CHANGE event and further CHANGE events which are possibly
part of the WATCH rule and any other possible CHANGE events which
do not notify about the device switching from "not ready" to "ready"
state (of course, counting with possible coldplug events).

> However, if the array hasn't been started, then "blkid" will
> not have been able to read a signature, so:
>   ENV{ID_FS_TYPE}!="LVM2_member|LVM1_member", GOTO="lvm_end"
> will have caused all this code to be skipped.
> 
> Further, this code causes incorrect behaviour in at least one case.
> It assumes that the first "add" event should be ignored, as it will be
> followed by a "change" event which indicates the array coming on line.
> This is consistent with how the kernel sends events, but not always
> consistent with how this script sees event.
> Specifically: if the initrd has "mdadm" support installed, but not
> "lvm2" support, then the initial "add" and "change" events will
> happen while the initrd is in charge and this file is not available.
> Once the root filesystem is mountd, this file will be available
> and "udevadm trigger --action=add" will be run.
> So the first and only event seen by this script for an md device will be
> "add", and it will incorrectly ignore it.
> 

Yes, you're right that in this case, it's not behaving correctly when
the initrd doesn't have this rule while the root FS does.  To fix this
issue for now, I suggest to separate those rule out of 69-dm-lvm-metad.rules
and make it a part of MD rules so that rule is always available both in
initrd and root fs when MD is used (while LVM doesn't need to be
installed in initrd).

This comes right on time because right at this very moment, I'm working
on a design for a solution which covers this area - I'll surely pass you
the design doc once it's more complete (should be in next few days) so
you we can discuss this problem further. This will cover the *standard*
notification about when the block device is ready which provides a
standard way of letting others (rules or any uevent monitors) know when
the switch from "not ready" to "ready" state happens exactly. This should
save us lots of unnecessary work that is done at the moment - we don't
need to fire scans and further inspection of the device for all the
events all the time. This work also covers identification of spurious
events coming as a result of the WATCH rule and minimization of its
impact on uevent processing performance in userspace.

-- 
Peter

--
lvm-devel mailing list
lvm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/lvm-devel

^ permalink raw reply

* [PATCH] Don't assume VMD sysfs path ends with a disk entry
From: Tomasz Majchrzak @ 2017-01-04 10:45 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, tomasz.majchrzak, Alexey Obitotskiy

From: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>

From: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Date: Wed, 4 Jan 2017 11:31:23 +0100

When VMD is enabled but no drive is attached to the PCIe port, mdadm
crashes trying to parse the path. Skip entry if valid path has not been
returned. Do it early to avoid unnecessary memory allocation.

Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 platform-intel.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/platform-intel.c b/platform-intel.c
index c60fd9e..7ad8831 100644
--- a/platform-intel.c
+++ b/platform-intel.c
@@ -48,9 +48,9 @@ static void free_sys_dev(struct sys_dev **list)
 struct sys_dev *find_driver_devices(const char *bus, const char *driver)
 {
 	/* search sysfs for devices driven by 'driver' */
-	char path[292];
-	char link[256];
-	char *c;
+	char path[PATH_MAX];
+	char link[PATH_MAX];
+	char *c, *p;
 	DIR *driver_dir;
 	struct dirent *de;
 	struct sys_dev *head = NULL;
@@ -123,6 +123,22 @@ struct sys_dev *find_driver_devices(const char *bus, const char *driver)
 		if (devpath_to_ll(path, "class", &class) != 0)
 			continue;
 
+		/*
+		 * Each VMD device (domain) adds separate PCI bus, it is better
+		 * to store path as a path to that bus (easier further
+		 * determination which NVMe dev is connected to this particular
+		 * VMD domain).
+		 */
+		if (type == SYS_DEV_VMD) {
+			sprintf(path, "/sys/bus/%s/drivers/%s/%s/domain/device",
+				bus, driver, de->d_name);
+		}
+		p = realpath(path, NULL);
+		if (p == NULL) {
+			pr_err("Unable to get real path for '%s'\n", path);
+			continue;
+		}
+
 		/* start / add list entry */
 		if (!head) {
 			head = xmalloc(sizeof(*head));
@@ -140,16 +156,9 @@ struct sys_dev *find_driver_devices(const char *bus, const char *driver)
 		list->dev_id = (__u16) dev_id;
 		list->class = (__u32) class;
 		list->type = type;
-		/* Each VMD device (domain) adds separate PCI bus, it is better to
-		 * store path as a path to that bus (easier further determination which
-		 * NVMe dev is connected to this particular VMD domain).
-		 */
-		if (type == SYS_DEV_VMD) {
-			sprintf(path, "/sys/bus/%s/drivers/%s/%s/domain/device",
-			bus, driver, de->d_name);
-		}
-		list->path = realpath(path, NULL);
 		list->next = NULL;
+		list->path = p;
+
 		if ((list->pci_id = strrchr(list->path, '/')) != NULL)
 			list->pci_id++;
 	}
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH] Don't assume VMD sysfs path ends with a disk entry
From: Jes Sorensen @ 2017-01-04 13:27 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid, Alexey Obitotskiy
In-Reply-To: <1483526724-29456-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> From: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
>
> From: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
> Date: Wed, 4 Jan 2017 11:31:23 +0100
>
> When VMD is enabled but no drive is attached to the PCIe port, mdadm
> crashes trying to parse the path. Skip entry if valid path has not been
> returned. Do it early to avoid unnecessary memory allocation.
>
> Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  platform-intel.c | 33 +++++++++++++++++++++------------
>  1 file changed, 21 insertions(+), 12 deletions(-)

Applied!

I had to fixup the commit message since your message had a mail header
from Alaexy as well.

Thanks,
Jes

^ permalink raw reply

* Re: [PATCH v2 00/12] Partial Parity Log for MD RAID 5
From: Jes Sorensen @ 2017-01-04 13:29 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: Shaohua Li, NeilBrown, linux-raid
In-Reply-To: <84d3a102-faef-13ae-44e2-2f9a4d561c37@intel.com>

Artur Paszkiewicz <artur.paszkiewicz@intel.com> writes:
> On 01/03/2017 04:42 PM, Jes Sorensen wrote:
>> Artur,
>> 
>> Did you make any progress getting the alignment issue resolved?
>> 
>> I'd really like to get an mdadm release out the door this week, so
>> getting this resolved would be awesome. Hint hint ;)
>
> Jes,
>
> The alignment issue is fixed but I'm currently making changes in the
> kernel part after comments from Neil. This also requires modifications
> in mdadm, maybe even more than in the kernel. I need a few more days to
> finish this, I'll be sending the next version of the patches probably
> next week. So don't let me keep you if you want to make a release this
> week.

Artur,

Thanks for the update - it sounds like it is probably wise to let this
stew a bit and let all the dust settle before we put it into an official
release.

Cheers,
Jes

^ permalink raw reply

* [PATCH] raid5: only dispatch IO from raid5d for harddisk raid
From: Shaohua Li @ 2017-01-04 17:57 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, NeilBrown, Song Liu

We made raid5 stripe handling multi-thread before. It works well for
SSD. But for harddisk, the multi-threading creates more disk seek, so
not always improve performance. For several hard disks based raid5,
multi-threading is required as raid5d becames a bottleneck especially
for sequential write.

To overcome the disk seek issue, we only dispatch IO from raid5d if the
array is harddisk based. Other threads can still handle stripes, but
can't dispatch IO.

Idealy, we should control IO dispatching order according to IO position
interrnally. Right now we still depend on block layer, which isn't very
efficient sometimes though.

My setup has 9 harddisks, each disk can do around 180M/s sequential
write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
write. The test machine uses an ATOM CPU. I measure sequential write
with large iodepth bandwidth to raid array:

without patch: ~600M/s
without patch and group_thread_cnt=4: 750M/s
with patch and group_thread_cnt=4: 950M/s
with patch, group_thread_cnt=4, skip_copy=1: 1150M/s

We are pretty close to the maximum bandwidth in the large iodepth
iodepth case. The performance gap of small iodepth sequential write
between software raid and theory value is still very big though, because
we don't have an efficient pipeline.

Cc: NeilBrown <neilb@suse.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/raid5.h |  4 ++++
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 06d7279..81417cf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -861,6 +861,43 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
 	return 1;
 }
 
+static void flush_deferred_bios(struct r5conf *conf)
+{
+	struct bio_list tmp;
+	struct bio *bio;
+
+	if (!conf->batch_bio_dispatch || !conf->group_cnt)
+		return;
+
+	bio_list_init(&tmp);
+	spin_lock(&conf->pending_bios_lock);
+	bio_list_merge(&tmp, &conf->pending_bios);
+	bio_list_init(&conf->pending_bios);
+	spin_unlock(&conf->pending_bios_lock);
+
+	while ((bio = bio_list_pop(&tmp)))
+		generic_make_request(bio);
+}
+
+static void defer_bio_issue(struct r5conf *conf, struct bio *bio)
+{
+	/*
+	 * change group_cnt will drain all bios, so this is safe
+	 *
+	 * A read generally means a read-modify-write, which usually means a
+	 * randwrite, so we don't delay it
+	 */
+	if (!conf->batch_bio_dispatch || !conf->group_cnt ||
+	    bio_op(bio) == REQ_OP_READ) {
+		generic_make_request(bio);
+		return;
+	}
+	spin_lock(&conf->pending_bios_lock);
+	bio_list_add(&conf->pending_bios, bio);
+	spin_unlock(&conf->pending_bios_lock);
+	md_wakeup_thread(conf->mddev->thread);
+}
+
 static void
 raid5_end_read_request(struct bio *bi);
 static void
@@ -1031,7 +1068,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(bi->bi_bdev),
 						      bi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			generic_make_request(bi);
+			defer_bio_issue(conf, bi);
 		}
 		if (rrdev) {
 			if (s->syncing || s->expanding || s->expanded
@@ -1076,7 +1113,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 				trace_block_bio_remap(bdev_get_queue(rbi->bi_bdev),
 						      rbi, disk_devt(conf->mddev->gendisk),
 						      sh->dev[i].sector);
-			generic_make_request(rbi);
+			defer_bio_issue(conf, rbi);
 		}
 		if (!rdev && !rrdev) {
 			if (op_is_write(op))
@@ -6057,6 +6094,8 @@ static void raid5d(struct md_thread *thread)
 		mutex_unlock(&conf->cache_size_mutex);
 	}
 
+	flush_deferred_bios(conf);
+
 	r5l_flush_stripe_to_raid(conf->log);
 
 	async_tx_issue_pending_all();
@@ -6642,6 +6681,16 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	atomic_set(&conf->active_stripes, 0);
 	atomic_set(&conf->preread_active_stripes, 0);
 	atomic_set(&conf->active_aligned_reads, 0);
+	bio_list_init(&conf->pending_bios);
+	spin_lock_init(&conf->pending_bios_lock);
+	conf->batch_bio_dispatch = true;
+	rdev_for_each(rdev, mddev) {
+		if (blk_queue_nonrot(bdev_get_queue(rdev->bdev))) {
+			conf->batch_bio_dispatch = false;
+			break;
+		}
+	}
+
 	conf->bypass_threshold = BYPASS_THRESHOLD;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ed8e136..2af5bea 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -679,6 +679,10 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+
+	struct bio_list		pending_bios;
+	spinlock_t		pending_bios_lock;
+	bool			batch_bio_dispatch;
 };
 
 
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH] raid5: only dispatch IO from raid5d for harddisk raid
From: Song Liu @ 2017-01-04 18:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid@vger.kernel.org, Kernel Team, NeilBrown
In-Reply-To: <9e987c0d95c49224da7fb82477dd10fcacf4d794.1483552525.git.shli@fb.com>


> On Jan 4, 2017, at 9:57 AM, Shaohua Li <shli@fb.com> wrote:
> 
> +	spin_lock_init(&conf->pending_bios_lock);
> +	conf->batch_bio_dispatch = true;
> +	rdev_for_each(rdev, mddev) {
> +		if (blk_queue_nonrot(bdev_get_queue(rdev->bdev))) {
> +			conf->batch_bio_dispatch = false;

Shall we skip this check for the journal device? 

Song

^ permalink raw reply

* Re: [PATCH] raid5: only dispatch IO from raid5d for harddisk raid
From: Shaohua Li @ 2017-01-04 18:14 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid@vger.kernel.org, Kernel Team, NeilBrown
In-Reply-To: <E6506DA7-E269-4092-8761-12FFC1770F82@fb.com>

On Wed, Jan 04, 2017 at 06:10:59PM +0000, Song Liu wrote:
> 
> > On Jan 4, 2017, at 9:57 AM, Shaohua Li <shli@fb.com> wrote:
> > 
> > +	spin_lock_init(&conf->pending_bios_lock);
> > +	conf->batch_bio_dispatch = true;
> > +	rdev_for_each(rdev, mddev) {
> > +		if (blk_queue_nonrot(bdev_get_queue(rdev->bdev))) {
> > +			conf->batch_bio_dispatch = false;
> 
> Shall we skip this check for the journal device?

yep, will add it in next post

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] md/r5cache: fix spelling mistake on "recoverying"
From: Shaohua Li @ 2017-01-04 18:20 UTC (permalink / raw)
  To: Colin King; +Cc: linux-raid, linux-kernel
In-Reply-To: <20161223005230.7731-1-colin.king@canonical.com>

On Fri, Dec 23, 2016 at 12:52:30AM +0000, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake "recoverying" to "recovering" in
> pr_dbg message.

applied, thanks
 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/md/raid5-cache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index bff1b4a..0e8ed2c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2170,7 +2170,7 @@ static int r5l_recovery_log(struct r5l_log *log)
>  		pr_debug("md/raid:%s: starting from clean shutdown\n",
>  			 mdname(mddev));
>  	else {
> -		pr_debug("md/raid:%s: recoverying %d data-only stripes and %d data-parity stripes\n",
> +		pr_debug("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
>  			 mdname(mddev), ctx.data_only_stripes,
>  			 ctx.data_parity_stripes);
>  
> -- 
> 2.10.2
> 

^ permalink raw reply

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
From: Mike Snitzer @ 2017-01-04 18:50 UTC (permalink / raw)
  To: NeilBrown, Mikulas Patocka
  Cc: Peter Zijlstra, Ming Lei, Zheng Liu, Keith Busch,
	device-mapper development, Jack Wang, Alasdair Kergon,
	Roland Kammerer, Michael Wang, Takashi Iwai, Ingo Molnar,
	Shaohua Li, Kent Overstreet, linux-block,
	linux-bcache@vger.kernel.org, Jens Axboe, linux-raid,
	Martin K. Petersen, Jiri Kosina, linux-kernel, Lars
In-Reply-To: <878tqrmmqx.fsf@notabene.neil.brown.name>

On Wed, Jan 04 2017 at 12:12am -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Jan 03 2017, Jack Wang wrote:
> 
> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> >>> Dear Maintainers
> >>>
> >>> I'd like to ask for the status of this patch since we hit the
> >>> issue too during our testing on md raid1.
> >>>
> >>> Split remainder bio_A was queued ahead, following by bio_B for
> >>> lower device, at this moment raid start freezing, the loop take
> >>> out bio_A firstly and deliver it, which will hung since raid is
> >>> freezing, while the freezing never end since it waiting for
> >>> bio_B to finish, and bio_B is still on the queue, waiting for
> >>> bio_A to finish...
> >>>
> >>> We're looking for a good solution and we found this patch
> >>> already progressed a lot, but we can't find it on linux-next,
> >>> so we'd like to ask are we still planning to have this fix
> >>> in upstream?
> >>
> >> I don't see why not, I'd even like to have it in older kernels,
> >> but did not have the time and energy to push it.
> >>
> >> Thanks for the bump.
> >>
> >>         Lars
> >>
> > Hi folks,
> >
> > As Michael mentioned, we hit a bug this patch is trying to fix.
> > Neil suggested another way to fix it.  I attached below.
> > I personal prefer Neil's version as it's less code change, and straight forward.
> >
> > Could you share your comments, we can get one fix into mainline.
> >
> > Thanks,
> > Jinpu
> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.com>
> > Date: Wed, 14 Dec 2016 16:55:52 +0100
> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
> >
> > When we call wait_barrier, we might have some bios waiting
> > in current->bio_list, which prevents the array_freeze call to
> > complete. Those can only be internal READs, which have already
> > passed the wait_barrier call (thus incrementing nr_pending), but
> > still were not submitted to the lower level, due to generic_make_request
> > logic to avoid recursive calls. In such case, we have a deadlock:
> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> > - internal READ bios will not be submitted, thus freeze_array will
> > never completes.
> >
> > To fix this, modify generic_make_request to always sort bio_list_on_stack
> > first with lowest level, then higher, until same level.
> >
> > Sent to linux-raid mail list:
> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
> >
> 
> This should probably also have
> 
>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> or something that, as I was building on Lars' ideas when I wrote this.
> 
> It would also be worth noting in the description that this addresses
> issues with dm and drbd as well as md.

I never saw this patch but certainly like the relative simplicity of the
solution when compared with other approaches taken, e.g. (5 topmost
commits on this branch):
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

> In fact, I think that with this patch in place, much of the need for the
> rescue_workqueue won't exist any more.  I cannot promise it can be
> removed completely, but it should be to hard to make it optional and
> only enabled for those few block devices that will still need it.
> The rescuer should only be needed for a bioset which can be allocated
> From twice in the one call the ->make_request_fn.  This would include
> raid0 for example, though raid0_make_reqest could be re-written to not
> use a loop and to just call generic_make_request(bio) if bio != split.

Mikulas, would you be willing to try the below patch with the
dm-snapshot deadlock scenario and report back on whether it fixes that?

Patch below looks to be the same as here:
https://marc.info/?l=linux-raid&m=148232453107685&q=p3

Neil and/or others if that isn't the patch that should be tested please
provide a pointer to the latest.

Thanks,
Mike

> > Suggested-by: NeilBrown <neilb@suse.com>
> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > ---
> >  block/blk-core.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 9e3ac56..47ef373 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >  
> >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > +			struct bio_list lower, same, hold;
> > +
> > +			/* Create a fresh bio_list for all subordinate requests */
> > +			bio_list_init(&hold);
> > +			bio_list_merge(&hold, &bio_list_on_stack);
> > +			bio_list_init(&bio_list_on_stack);
> >  
> >  			ret = q->make_request_fn(q, bio);
> >  
> >  			blk_queue_exit(q);
> > +			/* sort new bios into those for a lower level
> > +			 * and those for the same level
> > +			 */
> > +			bio_list_init(&lower);
> > +			bio_list_init(&same);
> > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > +				if (q == bdev_get_queue(bio->bi_bdev))
> > +					bio_list_add(&same, bio);
> > +				else
> > +					bio_list_add(&lower, bio);
> > +			/* now assemble so we handle the lowest level first */
> > +			bio_list_merge(&bio_list_on_stack, &lower);
> > +			bio_list_merge(&bio_list_on_stack, &same);
> > +			bio_list_merge(&bio_list_on_stack, &hold);
> >  
> >  			bio = bio_list_pop(current->bio_list);
> >  		} else {
> > -- 
> > 2.7.4

^ permalink raw reply

* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: Shaohua Li @ 2017-01-04 19:35 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1482853658-82535-1-git-send-email-colyli@suse.de>

On Tue, Dec 27, 2016 at 11:47:37PM +0800, Coly Li wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
> 
> The idea of sliding resync widow is awesome, but there are several
> challenges are very difficult to solve,
>  - code complexity
>    Sliding resync window requires several veriables to work collectively,
>    this is complexed and very hard to make it work correctly. Just grep
>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>    the original resync window patch. This is not the end, any further
>    related modification may easily introduce more regreassion.
>  - multiple sliding resync windows
>    Currently raid1 code only has a single sliding resync window, we cannot
>    do parallel resync with current I/O barrier implementation.
>    Implementing multiple resync windows are much more complexed, and very
>    hard to make it correctly.
> 
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
> 
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought
>    barrier buckets
> 
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   9
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>    if multiple I/O requests hit different barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by,
>         int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
>    that sector_nr is the start sector number of a bio. We use function
>    align_to_barrier_unit_end() to calculate sectors number from sector_nr
>    to the next barrier unit size boundary, if the requesting bio size
>    goes across the boundary, we split the bio in raid1_make_request(), to
>    make sure the finall bio sent into generic_make_request() won't exceed
>    barrier unit boundary.
> 
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will have confliction within a single barrier units. So it is similar to
>    single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of confliction might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket index with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low confliction rate with much
>    simpler barrier algorithm and implementation.
> 
> If user has a (realy) large raid1 device, for example 10PB size, we may
> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
> it is possible to be a raid1-created-time-defined variable in future.
> 
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Original function raid1_make_request() is split into two functions,
>    - raid1_make_read_request(): handles regular read request and calls
>      wait_read_barrier() for I/O barrier.
>    - raid1_make_write_request(): handles regular write request and calls
>      wait_barrier() for I/O barrier.
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.
>  - align_to_barrier_unit_end() is called to make sure both regular and
>    resync I/O won't go across a barrier unit boundary.
> 
> Changelog
> V1:
> - Original RFC patch for comments
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
>   bounday, to make the code more simple, by suggestion from Shaohua and
>   Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
>   confilict between resync I/O and sequential write I/O, by suggestion from
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
> - In V1 patch the bellowed barrier buckets related members in r1conf are
>   allocated in memory page. To make the code more simple, V2 patch moves
>   the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */

The correct format for comments is:
/*
 * something
 */

There are some other places with the same issue

> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;

This one makes me nervous. shouldn't this be:
 +	len = ((start_sector +  (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
 +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
 +	      start_sector;

And you can use round_up()

>  
> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
>  {

Please rebase the patches to latest md-next. The raid1_make_request already
split for read/write code path recently.

Otherwise, the patch looks good. After these are fixed, I'll add it for 4.11

Thanks,
Shaohua

^ permalink raw reply

* Re: [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: Shaohua Li @ 2017-01-04 19:59 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1482853658-82535-2-git-send-email-colyli@suse.de>

On Tue, Dec 27, 2016 at 11:47:38PM +0800, Coly Li wrote:
> When I run a parallel reading performan testing on a md raid1 device with
> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
> only 2.7GB/s, this is around 50% of the idea performance number.
> 
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
> 
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock firstly, even there are only regular
> reading I/Os and no resync I/O at all. This is a huge performance penalty.
> 
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it is really necessary.
> 
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. Now I write the patch based on new simpler raid1
> I/O barrier code.
> 
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which in turns calls _wait_barrier(), is used for regular write I/O.
>    If there is resync I/O happening on the same barrier bucket index, or
>    the whole array is frozen, task will wait until no barrier on same
>    bucket index, or the whold array is unfreezed.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only uptodate data will be read out), so it is
>    unnecessary to wait for barrier in regular read I/Os, they only have to
>    wait only when the whole array is frozen.
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() or
> wait_read_barrier() if no barrier raised in same barrier bucket index or
> array is not frozen, the regular I/O doesn't need to hold conf->
> resync_lock, it can just increase conf->nr_pending[idx], and return to its
> caller. For heavy parallel reading I/Os, the lockless I/O barrier code
> almostly gets rid of all spin lock cost.
> 
> This patch significantly improves raid1 reading peroformance. From my
> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
> 
> Open question:
> Shaohua points out the memory barrier should be added to some atomic
> operations. Now I am reading the document to learn how to add the memory
> barriers correctly. Anyway, if anyone has suggestion, please don't
> hesitate to let me know.

Yes, because the raise_barrier/_wait_barrier depend on the atomic opertions
order, while atomic_inc/atomic_read don't imply a barrier.
 
> @@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
>  {
>  	/* reverse the effect of the freeze */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 0;
> +	atomic_set(&conf->array_frozen, 0);
>  	wake_up(&conf->wait_barrier);
>  	spin_unlock_irq(&conf->resync_lock);
>  }

Nitpick: This one doesn't need the lock.

Thanks,
Shaohua

^ permalink raw reply

* Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: Shaohua Li @ 2017-01-04 22:30 UTC (permalink / raw)
  To: MasterPrenium
  Cc: linux-kernel, xen-users, linux-raid, MasterPrenium@gmail.com,
	xen-devel
In-Reply-To: <585D6C34.2020908@gmail.com>

On Fri, Dec 23, 2016 at 07:25:56PM +0100, MasterPrenium wrote:
> Hello Guys,
> 
> I've having some trouble on a new system I'm setting up. I'm getting a kernel BUG message, seems to be related with the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
> Here is configuration :
> - 3x Hard Drives running on RAID 5 Software raid created by mdadm
> - On top of it, DRBD for replication over another node (Active/passive cluster)
> - On top of it, a BTRFS FileSystem with a few subvolumes
> - On top of it, XEN VMs running.
> 
> The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync for example) on the RAID5 stack.
> I've to reset system to make it work again.

what did you mean 'huge' I/O (20M/s)? Is it possible you can reproduce the
issue with a raw raid5 raid? It would be even better if you can give me a fio
job file with the issue, so I can easily debug it.

also please check if upstream patch (e8d7c33 md/raid5: limit request size
according to implementation limits) helps.

Thanks,
Shaohua

^ permalink raw reply

* [PATCH] md: cleanup mddev flag clear for takeover
From: Shaohua Li @ 2017-01-05  0:10 UTC (permalink / raw)
  To: linux-raid

Commit 6995f0b (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.h    |  8 ++++++++
 drivers/md/raid0.c | 12 ++++++++----
 drivers/md/raid1.c |  8 ++++++--
 drivers/md/raid5.c |  5 ++++-
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2302536..4aae26d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -212,6 +212,7 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 				int is_new);
 struct md_cluster_info;
 
+/* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added */
 enum mddev_flags {
 	MD_ARRAY_FIRST_USE,	/* First use of array, needs initialization */
 	MD_CLOSING,		/* If set, we are closing the array, do not open
@@ -704,4 +705,11 @@ static inline int mddev_is_clustered(struct mddev *mddev)
 }
 
 extern void md_writesame_setup(struct mddev *mddev, struct bio *bio);
+
+/* clear unsupported mddev_flags */
+static inline void mddev_clear_unsupported_flags(struct mddev *mddev,
+	unsigned long unsupported_flags)
+{
+	mddev->flags &= ~unsupported_flags;
+}
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5e4cddb..5b3db36 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -26,6 +26,11 @@
 #include "raid0.h"
 #include "raid5.h"
 
+#define UNSUPPORTED_MDDEV_FLAGS		\
+	((1L << MD_HAS_JOURNAL) |	\
+	 (1L << MD_JOURNAL_CLEAN) |	\
+	 (1L << MD_FAILFAST_SUPPORTED))
+
 static int raid0_congested(struct mddev *mddev, int bits)
 {
 	struct r0conf *conf = mddev->private;
@@ -541,8 +546,7 @@ static void *raid0_takeover_raid45(struct mddev *mddev)
 	mddev->delta_disks = -1;
 	/* make sure it will be not marked as dirty */
 	mddev->recovery_cp = MaxSector;
-	clear_bit(MD_HAS_JOURNAL, &mddev->flags);
-	clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
+	mddev_clear_unsupported_flags(mddev, UNSUPPORTED_MDDEV_FLAGS);
 
 	create_strip_zones(mddev, &priv_conf);
 
@@ -585,7 +589,7 @@ static void *raid0_takeover_raid10(struct mddev *mddev)
 	mddev->degraded = 0;
 	/* make sure it will be not marked as dirty */
 	mddev->recovery_cp = MaxSector;
-	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+	mddev_clear_unsupported_flags(mddev, UNSUPPORTED_MDDEV_FLAGS);
 
 	create_strip_zones(mddev, &priv_conf);
 	return priv_conf;
@@ -628,7 +632,7 @@ static void *raid0_takeover_raid1(struct mddev *mddev)
 	mddev->raid_disks = 1;
 	/* make sure it will be not marked as dirty */
 	mddev->recovery_cp = MaxSector;
-	clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+	mddev_clear_unsupported_flags(mddev, UNSUPPORTED_MDDEV_FLAGS);
 
 	create_strip_zones(mddev, &priv_conf);
 	return priv_conf;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1442240..7b0f647 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -42,6 +42,10 @@
 #include "raid1.h"
 #include "bitmap.h"
 
+#define UNSUPPORTED_MDDEV_FLAGS		\
+	((1L << MD_HAS_JOURNAL) |	\
+	 (1L << MD_JOURNAL_CLEAN))
+
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
  */
@@ -3257,8 +3261,8 @@ static void *raid1_takeover(struct mddev *mddev)
 		if (!IS_ERR(conf)) {
 			/* Array must appear to be quiesced */
 			conf->array_frozen = 1;
-			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
-			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
+			mddev_clear_unsupported_flags(mddev,
+				UNSUPPORTED_MDDEV_FLAGS);
 		}
 		return conf;
 	}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ac5fd69..071dfdd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -62,6 +62,8 @@
 #include "raid0.h"
 #include "bitmap.h"
 
+#define UNSUPPORTED_MDDEV_FLAGS	(1L << MD_FAILFAST_SUPPORTED)
+
 #define cpu_to_group(cpu) cpu_to_node(cpu)
 #define ANY_GROUP NUMA_NO_NODE
 
@@ -7881,7 +7883,8 @@ static void *raid5_takeover_raid1(struct mddev *mddev)
 
 	ret = setup_conf(mddev);
 	if (!IS_ERR_VALUE(ret))
-		clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
+		mddev_clear_unsupported_flags(mddev,
+			UNSUPPORTED_MDDEV_FLAGS);
 	return ret;
 }
 
-- 
2.9.3


^ permalink raw reply related

* Re: [lvm-devel] [lvm2 PATCH] Remove special-case for md in 69-dm-lvm-metadata.rules
From: NeilBrown @ 2017-01-05  3:44 UTC (permalink / raw)
  To: Peter Rajnoha, LVM2 development, linux-raid; +Cc: Lidong Zhong, GuoQing Jiang
In-Reply-To: <e0225e4c-80fe-359e-462a-5829195ca5b6@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 3809 bytes --]

On Wed, Jan 04 2017, Peter Rajnoha wrote:

> On 01/04/2017 04:30 AM, NeilBrown wrote:
>> 
>> This special casing brings no value.  It appears to attempt to
>> determine if the array is active yet or not, and to skip
>> processing if the array has not yet been started.
>
> Hi Neil,
>
> those rules also have another use which is to not trigger further
> unnecessary actions if the device is already up and running, hence
> avoiding useless resource consumption if it brings no new information.
> Currently, this is applied only for running pvscan on top of newly
> activated MD device (that's why it's part of 69-dm-lvm-metad.rules
> at the moment).

I am having difficulty seeing why MD devices should be handled
differently from any others in this respect.  If you want to filter out
unhelpful event for MD, surely you should filter them out for all
devices??

I also wonder if there are really so many unhelpful events that there is
a genuine gain in filtering them out.

>
> So what we also need is to detect the very first CHANGE event
> that makes the device active and to make a difference between this
> first CHANGE event and further CHANGE events which are possibly
> part of the WATCH rule and any other possible CHANGE events which
> do not notify about the device switching from "not ready" to "ready"
> state (of course, counting with possible coldplug events).
>
>> However, if the array hasn't been started, then "blkid" will
>> not have been able to read a signature, so:
>>   ENV{ID_FS_TYPE}!="LVM2_member|LVM1_member", GOTO="lvm_end"
>> will have caused all this code to be skipped.
>> 
>> Further, this code causes incorrect behaviour in at least one case.
>> It assumes that the first "add" event should be ignored, as it will be
>> followed by a "change" event which indicates the array coming on line.
>> This is consistent with how the kernel sends events, but not always
>> consistent with how this script sees event.
>> Specifically: if the initrd has "mdadm" support installed, but not
>> "lvm2" support, then the initial "add" and "change" events will
>> happen while the initrd is in charge and this file is not available.
>> Once the root filesystem is mountd, this file will be available
>> and "udevadm trigger --action=add" will be run.
>> So the first and only event seen by this script for an md device will be
>> "add", and it will incorrectly ignore it.
>> 
>
> Yes, you're right that in this case, it's not behaving correctly when
> the initrd doesn't have this rule while the root FS does.  To fix this
> issue for now, I suggest to separate those rule out of 69-dm-lvm-metad.rules
> and make it a part of MD rules so that rule is always available both in
> initrd and root fs when MD is used (while LVM doesn't need to be
> installed in initrd).
>
> This comes right on time because right at this very moment, I'm working
> on a design for a solution which covers this area - I'll surely pass you
> the design doc once it's more complete (should be in next few days) so
> you we can discuss this problem further. This will cover the *standard*
> notification about when the block device is ready which provides a
> standard way of letting others (rules or any uevent monitors) know when
> the switch from "not ready" to "ready" state happens exactly. This should
> save us lots of unnecessary work that is done at the moment - we don't
> need to fire scans and further inspection of the device for all the
> events all the time. This work also covers identification of spurious
> events coming as a result of the WATCH rule and minimization of its
> impact on uevent processing performance in userspace.

I'd certainly be very interested to read and comment on your design.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2017-01-05  6:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Milan Broz, Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <20170102065325.GA19553@gondor.apana.org.au>

Hi Herbert,

On 2 January 2017 at 12:23, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Jan 02, 2017 at 12:16:45PM +0530, Binoy Jayan wrote:
>
> Right.  The actual number of underlying tfms that do the work
> won't change compared to the status quo.  We're just structuring
> it such that if the overall scheme is supported by the hardware
> then we can feed more than one sector at a time to it.

I was thinking of continuing to have the iv generation algorithms as template
ciphers instead of regular 'skcipher' as it is easier to inherit the parameters
from the underlying cipher (e.g. aes) like cra_blocksize, cra_alignmask,
ivsize, chunksize etc.

Usually, the underlying cipher for the template ciphers are instantiated
in the following function:

skcipher_instance:skcipher_alg:init()

Since the number of such cipher instances depend on the key count, which is
not known at the time of creation of the cipher (it's passed to as an argument
to the setkey api), the creation of those have to be delayed until the setkey
operation of the template cipher. But as Mark pointed out, the users of this
cipher may get confused if the creation of the underlying cipher fails while
trying to do a 'setkey' on the template cipher. I was wondering if I can create
a single instance of the cipher and assign it to tfms[0] and allocate the
remaining instances when the setkey operation is called later with the encoded
key_count so that errors during cipher creation are uncovered earlier.

Thanks,
Binoy

^ permalink raw reply

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
From: 王金浦 @ 2017-01-05 10:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Mikulas Patocka, Jack Wang, Lars Ellenberg, Jens Axboe,
	linux-raid, Michael Wang, Peter Zijlstra, Jiri Kosina, Ming Lei,
	LKML, Zheng Liu, linux-block, Takashi Iwai,
	linux-bcache@vger.kernel.org, Ingo Molnar, Alasdair Kergon,
	Martin K. Petersen, Keith Busch, device-mapper development
In-Reply-To: <20170104185046.GA982@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4392 bytes --]

2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@redhat.com>:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers
>> >>>
>> >>> I'd like to ask for the status of this patch since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> Split remainder bio_A was queued ahead, following by bio_B for
>> >>> lower device, at this moment raid start freezing, the loop take
>> >>> out bio_A firstly and deliver it, which will hung since raid is
>> >>> freezing, while the freezing never end since it waiting for
>> >>> bio_B to finish, and bio_B is still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution and we found this patch
>> >>> already progressed a lot, but we can't find it on linux-next,
>> >>> so we'd like to ask are we still planning to have this fix
>> >>> in upstream?
>> >>
>> >> I don't see why not, I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >>         Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit a bug this patch is trying to fix.
>> > Neil suggested another way to fix it.  I attached below.
>> > I personal prefer Neil's version as it's less code change, and straight forward.
>> >
>> > Could you share your comments, we can get one fix into mainline.
>> >
>> > Thanks,
>> > Jinpu
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown <neilb@suse.com>
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the array_freeze call to
>> > complete. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > still were not submitted to the lower level, due to generic_make_request
>> > logic to avoid recursive calls. In such case, we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, thus freeze_array will
>> > never completes.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > first with lowest level, then higher, until same level.
>> >
>> > Sent to linux-raid mail list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
>>
>> or something that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more.  I cannot promise it can be
>> removed completely, but it should be to hard to make it optional and
>> only enabled for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> From twice in the one call the ->make_request_fn.  This would include
>> raid0 for example, though raid0_make_reqest could be re-written to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others if that isn't the patch that should be tested please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch on to Linux-4.10-rc2, and updated the
description as Neil suggested.
If Mikulas get possitive feedback, then we can go with it.

Cheers,
Jinpu

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2467 bytes --]

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the array_freeze call to
complete. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to generic_make_request
logic to avoid recursive calls. In such case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never completes.

To fix this, modify generic_make_request to always sort bio_list_on_stack
first with lowest level, then higher, until same level.

This would address issuses with dm and drbd as well as md.

Sent to linux-raid mail list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4


^ permalink raw reply related

* Re: recovering failed raid5
From: Alexander Shenkin @ 2017-01-05 12:08 UTC (permalink / raw)
  To: Wols Lists, linux-raid, rm, robin
In-Reply-To: <582C8B6C.5040303@youngman.org.uk>

Hi again all,

I've finally gotten new disks and copies ready, and have a small 
operational question.  But first, just a reminder, as this thread is a 
bit old.

My sdb went down in a 4-disk RAID5 array.  After adding a new sdb and 
rebuilding, sdc went down.  I ddrescue'd sdc to a new drive (previous 
attempts were marred by errors when using a USB enclosure; all finally 
went well when using direct motherboard SATA interface - just one 4096 
byte sector couldn't be read).  So now I have: sda (good), sdc 
(ddrescued), and sdd (good).  I have copied the partition table, and 
randomized the IDs, to a new drive and connected it to the sdb SATA 
interface on the motherboard.  All of this was done using the system 
rescue cd on a USB drive (https://www.system-rescue-cd.org/).

Now the question is: how do I actually get the system up to a state 
where I can run "mdadm --assemble /dev/sd[adc]n" as suggested by Wol 
below?  The system won't boot from the HDD's since there are only 2 
working members of the RAID apparently (I guess it must have removed sdc 
previously?  not sure.).  And trying to run mdadm from the system rescue 
cd OS says that the md config isn't there (or something to that effect). 
  (note: i do have the timeout script running on the USB OS).

Should I somehow recreate the md config on the OS on the USB drive?  Or 
something else?  Thanks again all!

Best,
Allie

On 11/16/2016 4:38 PM, Wols Lists wrote:
> On 16/11/16 15:50, Alexander Shenkin wrote:
>>
>>
>> On 11/16/2016 3:35 PM, Wols Lists wrote:
>>> On 16/11/16 09:04, Alexander Shenkin wrote:
>>>> Hello all,
>>>>
>>>> As a quick reminder, my sdb failed in a 4-disk RAID5, and then sdc
>>>> failed when trying to replace sdb.  I'm now trying to recover sdc with
>>>> ddrescue.
>>>>
>>>> After much back and forth, I've finally got ddrescue running to
>>>> replicate my apparently-faulty sdc.  I'm ddrescue'ing from a seagate 3TB
>>>> to a toshiba 3TB drive, and I'm getting a 'No space left on device
>>>> error'.  Any thoughts?
>>>>
>>>> One further question: should I also try to ddrescue my original failed
>>>> sdb in the hopes that anything lost on sdc would be covered by the
>>>> recovered sdb?
>>>
>>> Depends how badly out of sync the event counts are. However, I note that
>>> your ddrescue copy appeared to run without any errors (apart from
>>> falling off the end of the drive :-) ?
>>
>> Thanks Wol.
>>
>> From my newbie reading, it looked like there was on 65kb error... but
>> i'm not sure how to tell if it got read properly by ddrescue in the end
>> - any tips?  I don't see any "retrying bad sectors" (-) lines in the
>> logfile below...
>>
>> username@Ubuntu-VirtualBox:~$ sudo ddrescue -d -f -r3 /dev/sdb /dev/sdc
>> ~/rescue.logfile
>> [sudo] password for username:
>> GNU ddrescue 1.19
>> Press Ctrl-C to interrupt
>> rescued:     3000 GB,  errsize:   65536 B,  current rate:   55640 kB/s
>>    ipos:     3000 GB,   errors:       1,    average rate:   83070 kB/s
>>    opos:     3000 GB, run time:   10.03 h,  successful read:       0 s ago
>> Copying non-tried blocks... Pass 1 (forwards)
>> ddrescue: Write error: No space left on device
>>
>> # Rescue Logfile. Created by GNU ddrescue version 1.19
>> # Command line: ddrescue -d -f -r3 /dev/sdb /dev/sdc
>> /home/username/rescue.logfile
>> # Start time:   2016-11-15 13:54:24
>> # Current time: 2016-11-15 23:56:25
>> # Copying non-tried blocks... Pass 1 (forwards)
>> # current_pos  current_status
>> 0x2BAA1470000     ?
>> #      pos        size  status
>> 0x00000000  0x7F5A0000  +
>> 0x7F5A0000  0x00010000  *
>> 0x7F5B0000  0x00010000  ?
>> 0x7F5C0000  0x2BA21EB0000  +
>> 0x2BAA1470000  0x00006000  ?
>>
>>>
>>> In which case, you haven't lost anything on sdc. Which is why the wiki
>>> says don't mount your array writeable while you're trying to recover it
>>> - you're not going to muck up your data and have user-space provoke
>>> further errors.
>>
>> gotcha - i'm doing this with removed drives on a different (virtual)
>> machine.  Seemed like the arrays were getting mounted read-only by
>> default when the disks were having issues...
>>
>>>
>>> If the array barfs while it's rebuilding, it's hopefully just a
>>> transient, and do another assemble with --force to get it back again.
>>
>> so, i guess i put the copied drive back in as sdc, and a new blank drive
>> as sdb, add sdb, and just let it rebuild from there?  Or, do I issue
>> this command as appropriate?
>>
>> mdadm --force --assemble /dev/mdN /dev/sd[XYZ]1
>
> Let me get my thoughts straight - cross check what I'm writing but ...
>
> sda and sdd have never failed. sdc is the new drive you've ddrescue'd onto.
>
> So in order to get a working array, you need to do
> "mdadm --assemble /dev/sd[adc]n"
> This will give you a working, degraded array, which unfortunately
> probably has a little bit of corruption - whatever you were writing when
> the array first failed will not have been saved properly. You've
> basically recovered the array with the two drives that are okay, and a
> copy of the drive that failed most recently.
>
> IFF the smarts report that your two failed drives are okay, then you can
> add them back in. I'm hoping it was just the timeout problem - with
> Barracudas that's quite likely.
>
> MAKE SURE that you've run the timeout script on all the Barracudas, or
> the array is simply going to crash again.
>
> WIPE THE SUPERBLOCKS on the old drives. I'm not sure what the mdadm
> command is, but we're adding them back in as new drives.
>
> mdadm --add /dev/old-b /dev/old-c
>
> This will think they are two new drives and will rebuild on to one of
> them. You can then convert the array to raid 6 and it will rebuild on to
> the other one.
>
> Once you've got back to a fully-working raid-5, you can do a fsck on the
> filesystem(s) to find the corruption.
>
> Lastly, if you can get another Toshiba drive, add that in as a spare.
>
> This will leave you with a 6-drive raid-6 - 3xdata, 2xparity, 1xspare.
>
> If the smarts report that any of your barracudas have a load of errors,
> it's not worth faffing about with them. Bin them and replace them.
>
> Going back to an earlier point of yours - DO NOT try to force re-add the
> first drive that failed back into the array. The mismatch in event count
> will mean loads of corruption.
>
> Cheers,
> Wol
>>
>>>
>>> Once you've got the array properly back up again :-
>>>
>>> 1) make sure that the timeout script is run EVERY BOOT to fix the kernel
>>> defaults for your remaining barracudas.
>>>
>>> 2) make sure smarts are enabled EVERY BOOT because barracudas forget
>>> their settings on power-off.
>>>
>>> 3) You've now got a spare drive. If a smart self-check comes back pretty
>>> clean and it looks like a transient problem not a dud drive, then put it
>>> back in and convert the array to raid 6.
>>>
>>> 4) MONITOR MONITOR MONITOR
>>>
>>> You've seen the comments elsewhere about the 3TB barracudas? Barracudas
>>> in general aren't bad drives, but the 3TB model has a reputation for
>>> dying early and quickly. You can then plan to replace the drives at your
>>> leisure, knowing that provided you catch any failure, you've still got
>>> redundancy with one dead drive in a raid-6. Even better, get another
>>> Toshiba and go raid-6+spare. And don't say you haven't got enough sata
>>> ports - an add-in card is about £20 :-)
>>>
>>> Cheers,
>>> Wol
>>>
>>
>

^ permalink raw reply

* Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: MasterPrenium @ 2017-01-05 14:16 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, xen-users, MasterPrenium@gmail.com, linux-kernel,
	xen-devel
In-Reply-To: <20170104223015.cr6vtyhxuwxrg76g@kernel.org>

Hi Shaohua,

Thanks for your reply.

Let me explain my "huge". For example, if I'm making a low rate i/o 
stream, I don't get a crash (<1MB written / sec) with random i/o, but if 
I'm making a random I/O of about 20MB/sec, the kernel crashes in a few 
minutes (for example, making an rsync, or even synchronising my DRBD 
stack is causing the crash).
I don't know if this can help, but in most of case, when the kernel 
crashes, after a reboot, my raid 5 stack is re-synchronizing.

I'm not able to reproduce the crash with a raw RAID5 stack (with dd/fio 
...).

It seems I need to stack filesystems to help reproduce it:

Here is a configuration test, command lines to explain (the way I'm able 
to reproduce the crash). Everything is done in dom0.
- mdadm --create /dev/md10 --raid-devices=3 --level=5 /dev/sdc1 
/dev/sdd1 /dev/sde1
- mkfs.btrfs /dev/md10
- mkdir /tmp/btrfs /mnt/XenVM /tmp/ext4
- mount /dev/md10 /tmp/btrfs
- btrfs subvolume create /tmp/btrfs/XenVM
- umount /tmp/btrfs
- mount /dev/md10 /mnt/XenVM -osubvol=XenVM
- truncate /mnt/XenVM/VMTestFile.dat -s 800G
- mkfs.ext4 /mnt/XenVM/VMTestFile.dat
- mount /mnt/XenVM/VMTestFile.dat /tmp/ext4

-> Doing this, doesn't seem to crash the kernel :
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite 
--rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600 
--group_reporting --filename=/mnt/XenVM/Fio.dat

-> Doing this, is crashing the kernel in a few minutes :
fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite 
--rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600 
--group_reporting --filename=/tmp/ext4/ext4.dat

Note : --direct=1 or --direct=0 doesn't seem to change the behaviour. 
Also having the raid 5 stack re-synchronizing or already synchronized, 
doesn't change the behaviour.

Here another "crash" : http://pastebin.com/uqLzL4fn

Regarding your patch, I can't find it. Is it the one sent by Konstantin 
Khlebnikov ?

Do you want the "ext4.dat" fio file ? It will be really difficult for me 
to provide it to you as I've only a poor ADSL network connection.

Thanks for your help,

MasterPrenium

Le 04/01/2017 à 23:30, Shaohua Li a écrit :
> On Fri, Dec 23, 2016 at 07:25:56PM +0100, MasterPrenium wrote:
>> Hello Guys,
>>
>> I've having some trouble on a new system I'm setting up. I'm getting a kernel BUG message, seems to be related with the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
>> Here is configuration :
>> - 3x Hard Drives running on RAID 5 Software raid created by mdadm
>> - On top of it, DRBD for replication over another node (Active/passive cluster)
>> - On top of it, a BTRFS FileSystem with a few subvolumes
>> - On top of it, XEN VMs running.
>>
>> The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync for example) on the RAID5 stack.
>> I've to reset system to make it work again.
> what did you mean 'huge' I/O (20M/s)? Is it possible you can reproduce the
> issue with a raw raid5 raid? It would be even better if you can give me a fio
> job file with the issue, so I can easily debug it.
>
> also please check if upstream patch (e8d7c33 md/raid5: limit request size
> according to implementation limits) helps.
>
> Thanks,
> Shaohua


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply

* Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode
From: Shaohua Li @ 2017-01-05 19:37 UTC (permalink / raw)
  To: MasterPrenium
  Cc: linux-raid, xen-users, MasterPrenium@gmail.com, linux-kernel,
	xen-devel
In-Reply-To: <e56edc2b-f2ad-2ab1-4184-5d7cad80085a@gmail.com>

On Thu, Jan 05, 2017 at 03:16:53PM +0100, MasterPrenium wrote:
> Hi Shaohua,
> 
> Thanks for your reply.
> 
> Let me explain my "huge". For example, if I'm making a low rate i/o stream,
> I don't get a crash (<1MB written / sec) with random i/o, but if I'm making
> a random I/O of about 20MB/sec, the kernel crashes in a few minutes (for
> example, making an rsync, or even synchronising my DRBD stack is causing the
> crash).
> I don't know if this can help, but in most of case, when the kernel crashes,
> after a reboot, my raid 5 stack is re-synchronizing.
> 
> I'm not able to reproduce the crash with a raw RAID5 stack (with dd/fio
> ...).
> 
> It seems I need to stack filesystems to help reproduce it:
> 
> Here is a configuration test, command lines to explain (the way I'm able to
> reproduce the crash). Everything is done in dom0.
> - mdadm --create /dev/md10 --raid-devices=3 --level=5 /dev/sdc1 /dev/sdd1
> /dev/sde1
> - mkfs.btrfs /dev/md10
> - mkdir /tmp/btrfs /mnt/XenVM /tmp/ext4
> - mount /dev/md10 /tmp/btrfs
> - btrfs subvolume create /tmp/btrfs/XenVM
> - umount /tmp/btrfs
> - mount /dev/md10 /mnt/XenVM -osubvol=XenVM
> - truncate /mnt/XenVM/VMTestFile.dat -s 800G
> - mkfs.ext4 /mnt/XenVM/VMTestFile.dat
> - mount /mnt/XenVM/VMTestFile.dat /tmp/ext4
> 
> -> Doing this, doesn't seem to crash the kernel :
> fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite
> --rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600
> --group_reporting --filename=/mnt/XenVM/Fio.dat
> 
> -> Doing this, is crashing the kernel in a few minutes :
> fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite
> --rwmixwrite=95 --bs=1M --direct=1 --size=80G --numjobs=8 --runtime=600
> --group_reporting --filename=/tmp/ext4/ext4.dat
> 
> Note : --direct=1 or --direct=0 doesn't seem to change the behaviour. Also
> having the raid 5 stack re-synchronizing or already synchronized, doesn't
> change the behaviour.
> 
> Here another "crash" : http://pastebin.com/uqLzL4fn

I'm trying to reproduce, but no success. So
ext4->btrfs->raid5, crash
btrfs->raid5, no crash
right? does subvolume matter? When you create the raid5 array, does adding
'--assume-clean' option change the behavior? I'd like to narrow down the issue.
If you can capture the blktrace to the raid5 array, it would be great to hint
us what kind of IO it is.
 
> Regarding your patch, I can't find it. Is it the one sent by Konstantin
> Khlebnikov ?

Right.

> Do you want the "ext4.dat" fio file ? It will be really difficult for me to
> provide it to you as I've only a poor ADSL network connection.

Not necessary.

Thanks,
Shaohua

> Thanks for your help,
> 
> MasterPrenium
> 
> Le 04/01/2017 à 23:30, Shaohua Li a écrit :
> > On Fri, Dec 23, 2016 at 07:25:56PM +0100, MasterPrenium wrote:
> > > Hello Guys,
> > > 
> > > I've having some trouble on a new system I'm setting up. I'm getting a kernel BUG message, seems to be related with the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
> > > Here is configuration :
> > > - 3x Hard Drives running on RAID 5 Software raid created by mdadm
> > > - On top of it, DRBD for replication over another node (Active/passive cluster)
> > > - On top of it, a BTRFS FileSystem with a few subvolumes
> > > - On top of it, XEN VMs running.
> > > 
> > > The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync for example) on the RAID5 stack.
> > > I've to reset system to make it work again.
> > what did you mean 'huge' I/O (20M/s)? Is it possible you can reproduce the
> > issue with a raw raid5 raid? It would be even better if you can give me a fio
> > job file with the issue, so I can easily debug it.
> > 
> > also please check if upstream patch (e8d7c33 md/raid5: limit request size
> > according to implementation limits) helps.
> > 
> > Thanks,
> > Shaohua
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply

* Linux Foundation's open source file & storage conference CFP
From: Ric Wheeler @ 2017-01-05 22:30 UTC (permalink / raw)
  To: linux-xfs, linux-ext4, linux-nfs, linux-btrfs, dm-devel,
	linux-scsi, samba-devel, linux-raid

Hi all,

The CFP for the Linux Foundation's Vault conference is coming close to an end. 
The event is being held this year in Cambridge, Massachusetts on the days 
following the LSF/MM summit.

The first two year's events have been solid, focused events in my (slightly 
biased) opinion, so worth submitting to and definitely worth attending.

Submit a Proposal to Speak at Vault, Linux Storage and Filesystems Conference. 
Vault will be held alongside LSF-MM Summit on March 22 & 23 in Cambridge, MA. To 
submit (by 1/14) visit http://events.linuxfoundation.org/events/vault/program/cfp

Happy to answer any questions about the event.

Ric

^ permalink raw reply

* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
From: NeilBrown @ 2017-01-05 23:08 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang
In-Reply-To: <1482853658-82535-1-git-send-email-colyli@suse.de>

[-- Attachment #1: Type: text/plain, Size: 40643 bytes --]

On Wed, Dec 28 2016, Coly Li wrote:

> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
>
> The idea of sliding resync widow is awesome, but there are several
> challenges are very difficult to solve,
>  - code complexity
>    Sliding resync window requires several veriables to work collectively,
>    this is complexed and very hard to make it work correctly. Just grep
>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>    the original resync window patch. This is not the end, any further
>    related modification may easily introduce more regreassion.
>  - multiple sliding resync windows
>    Currently raid1 code only has a single sliding resync window, we cannot
>    do parallel resync with current I/O barrier implementation.
>    Implementing multiple resync windows are much more complexed, and very
>    hard to make it correctly.

I think I've asked this before, but why do you think that parallel
resync might ever be a useful idea?  I don't think it makes any sense, so
it is wrong for you use it as part of the justification for this patch.
Just don't mention it at all unless you have a genuine expectation that
it would really be a good thing, in which case: explain the value.

>
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought
>    barrier buckets
>
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.

It would be good to say here what number you chose, and why you chose
it.
You have picked 64MB.  This divides a 1TB device into 4096 regions.
Any write request must fit into one of these regions, so we mustn't make
the region too small, else we would get the benefits for sending large
requests down.

We want the resync to move from region to region fairly quickly so that
the slowness caused by having to synchronize with the resync is averaged
out overa fairly small time frame.  At full speed, 64MB should take less
than 1 second.  When resync is competing with other IO, it could easily
take up to a minute(?).  I think that is a fairly good range.

So I think 64MB is probably a very good choice.  I just would like to
see the justification clearly stated.

>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   9
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)

Why 512 buckets?  What are the tradeoffs?
More buckets means more memory consumed for counters.
Fewer buckets means more false sharing.
With 512 buckets, a request which is smaller than the region size has a
0.2% chance of having to wait for resync to pause.  I think that is
quite a small enough fraction.
I think you originally chose the number of buckets so that a set of
4-byte counters fits exactly into a page.  I think that is still a good
guideline, so I would have
	#define BARRIER_BUCKETS_NR_BITS	(PAGE_SHIFT - 2)
(which makes it 10 ...).

>    if multiple I/O requests hit different barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by,
>         int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)

This isn't right.  You have to divide by BARRIER_UNIT_SECTOR_SIZE first.
	int idx = hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS, BARRIER_BUCKETS_NR_BITS);

>    that sector_nr is the start sector number of a bio. We use function
>    align_to_barrier_unit_end() to calculate sectors number from sector_nr
>    to the next barrier unit size boundary, if the requesting bio size
>    goes across the boundary, we split the bio in raid1_make_request(), to
>    make sure the finall bio sent into generic_make_request() won't exceed
>    barrier unit boundary.
>
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will have confliction within a single barrier units. So it is similar to
>    single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of confliction might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket index with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low confliction rate with much
>    simpler barrier algorithm and implementation.
>
> If user has a (realy) large raid1 device, for example 10PB size, we may
> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
> it is possible to be a raid1-created-time-defined variable in future.

Why?  Why would a large array require more buckets?  Are you just
guessing, or do you see some concrete reason for there to be a
relationship between the size of the array and the number of buckets?
If you can see a connection, please state it.  If not, don't mention it.

>
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Original function raid1_make_request() is split into two functions,
>    - raid1_make_read_request(): handles regular read request and calls
>      wait_read_barrier() for I/O barrier.
>    - raid1_make_write_request(): handles regular write request and calls
>      wait_barrier() for I/O barrier.
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.
>  - align_to_barrier_unit_end() is called to make sure both regular and
>    resync I/O won't go across a barrier unit boundary.
>
> Changelog
> V1:
> - Original RFC patch for comments
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
>   bounday, to make the code more simple, by suggestion from Shaohua and
>   Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
>   confilict between resync I/O and sequential write I/O, by suggestion from
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control number of parallel sync I/O barriers, by suggestion from Shaohua.

I really don't think this is needed.
As long as RESYNC_DEPTH * RESYNC_SECTORS is less than BARRIER_UNIT_SECTOR_SIZE
just testing again ->barrier[idx] will ensure the number of barrier
requests never exceeds RESYNC_DEPTH*2.  That is sufficient.

Also, I think the reason for imposing the RESYNC_DEPTH limit is to make
sure regular IO never has to wait too long for pending resync requests
to flush.  With the simple test, regular IO will never need to wait for
more than RESYNC_DEPTH requests to complete.

So I think have this field brings no valid, and is potentially confusing.

> - In V1 patch the bellowed barrier buckets related members in r1conf are
>   allocated in memory page. To make the code more simple, V2 patch moves
>   the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];

I don't like this.  It makes the r1conf 4 pages is size, most of which
is wasted.  A 4-page allocation is more likely to fail than a few 1-page
allocations.
I think these should be:
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;

Then use kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL)
to allocate each array.   I think this approach addresses Shaohua's
concerns without requiring a multi-page allocation.

>   This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid1.c | 485 ++++++++++++++++++++++++++++++-----------------------
>  drivers/md/raid1.h |  37 ++--
>  2 files changed, 291 insertions(+), 231 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a1f3fbe..5813656 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -67,9 +67,8 @@
>   */
>  static int max_queued_requests = 1024;
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector);
> -static void lower_barrier(struct r1conf *conf);
> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>  
>  #define raid1_log(md, fmt, args...)				\
>  	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> @@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>  
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
>  
>  	mempool_free(r1_bio, conf->r1buf_pool);
>  
> -	lower_barrier(conf);
> +	lower_barrier(conf, r1_bio->sector);
>  }
>  
>  static void reschedule_retry(struct r1bio *r1_bio)
> @@ -219,10 +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  	unsigned long flags;
>  	struct mddev *mddev = r1_bio->mddev;
>  	struct r1conf *conf = mddev->private;
> +	int idx;
>  
> +	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> -	conf->nr_queued ++;
> +	conf->nr_queued[idx]++;
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  	wake_up(&conf->wait_barrier);
> @@ -239,8 +239,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  	struct bio *bio = r1_bio->master_bio;
>  	int done;
>  	struct r1conf *conf = r1_bio->mddev->private;
> -	sector_t start_next_window = r1_bio->start_next_window;
> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>  
>  	if (bio->bi_phys_segments) {
>  		unsigned long flags;
> @@ -265,7 +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  		 * Wake up any possible resync thread that waits for the device
>  		 * to go idle.
>  		 */
> -		allow_barrier(conf, start_next_window, bi_sector);
> +		allow_barrier(conf, bio->bi_iter.bi_sector);

Why did you change this to use "bio->bi_iter.bi_sector" instead of
"bi_sector"?

I assume you thought it was an optimization that you would just slip
in.  Can't hurt, right?

Just before this line is:
		bio_endio(bio);
and that might cause the bio to be freed.  So your code could
access freed memory.

Please be *very* cautious when making changes that are not directly
related to the purpose of the patch.
                

>  	}
>  }
>  
> @@ -513,6 +511,25 @@ static void raid1_end_write_request(struct bio *bio)
>  		bio_put(to_put);
>  }
>  
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */
> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;

This would be better as

    len = round_up(start_sector+1, BARRIER_UNIT_SECTOR_SIZE) - start_sector;


> +
> +	if (len > sectors)
> +		len = sectors;
> +
> +	return len;
> +}
> +
>  /*
>   * This routine returns the disk from which the requested read should
>   * be done. There is a per-array 'next expected sequential IO' sector
> @@ -809,168 +826,179 @@ static void flush_pending_writes(struct r1conf *conf)
>   */
>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
>  	spin_lock_irq(&conf->resync_lock);
>  
>  	/* Wait until no block IO is waiting */
> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>  			    conf->resync_lock);
>  
>  	/* block any new IO from starting */
> -	conf->barrier++;
> -	conf->next_resync = sector_nr;
> +	conf->barrier[idx]++;
> +	conf->total_barriers++;
>  
>  	/* For these conditions we must wait:
>  	 * A: while the array is in frozen state
> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
> -	 *    the max count which allowed.
> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
> -	 *    next resync will reach to the window which normal bios are
> -	 *    handling.
> -	 * D: while there are any active requests in the current window.
> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> +	 *    existing in sector number ranges corresponding to idx.
> +	 * C: while conf->total_barriers >= RESYNC_DEPTH, meaning resync reach
> +	 *    the max count which allowed on the whole raid1 device.
>  	 */
>  	wait_event_lock_irq(conf->wait_barrier,
>  			    !conf->array_frozen &&
> -			    conf->barrier < RESYNC_DEPTH &&
> -			    conf->current_window_requests == 0 &&
> -			    (conf->start_next_window >=
> -			     conf->next_resync + RESYNC_SECTORS),
> +			     !conf->nr_pending[idx] &&
> +			     conf->total_barriers < RESYNC_DEPTH,
>  			    conf->resync_lock);
>  
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static void lower_barrier(struct r1conf *conf)
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
>  	unsigned long flags;
> -	BUG_ON(conf->barrier <= 0);
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
> +
>  	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->barrier--;
> -	conf->nr_pending--;
> +	conf->barrier[idx]--;
> +	conf->total_barriers--;
> +	conf->nr_pending[idx]--;
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> +static void _wait_barrier(struct r1conf *conf, int idx)
>  {
> -	bool wait = false;
> -
> -	if (conf->array_frozen || !bio)
> -		wait = true;
> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
> -		if ((conf->mddev->curr_resync_completed
> -		     >= bio_end_sector(bio)) ||
> -		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> -		     <= bio->bi_iter.bi_sector))
> -			wait = false;
> -		else
> -			wait = true;
> +	spin_lock_irq(&conf->resync_lock);
> +	if (conf->array_frozen || conf->barrier[idx]) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for the barrier to drop. */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen && !conf->barrier[idx],
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
>  
> -	return wait;
> +	conf->nr_pending[idx]++;
> +	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> -	sector_t sector = 0;
> +	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
>  	spin_lock_irq(&conf->resync_lock);
> -	if (need_to_wait_for_sync(conf, bio)) {
> -		conf->nr_waiting++;
> -		/* Wait for the barrier to drop.
> -		 * However if there are already pending
> -		 * requests (preventing the barrier from
> -		 * rising completely), and the
> -		 * per-process bio queue isn't empty,
> -		 * then don't wait, as we need to empty
> -		 * that queue to allow conf->start_next_window
> -		 * to increase.
> -		 */
> -		raid1_log(conf->mddev, "wait barrier");
> -		wait_event_lock_irq(conf->wait_barrier,
> -				    !conf->array_frozen &&
> -				    (!conf->barrier ||
> -				     ((conf->start_next_window <
> -				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> -				    conf->resync_lock);
> -		conf->nr_waiting--;
> -	}
> -
> -	if (bio && bio_data_dir(bio) == WRITE) {
> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
> -			if (conf->start_next_window == MaxSector)
> -				conf->start_next_window =
> -					conf->next_resync +
> -					NEXT_NORMALIO_DISTANCE;
> -
> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
> -			    <= bio->bi_iter.bi_sector)
> -				conf->next_window_requests++;
> -			else
> -				conf->current_window_requests++;
> -			sector = conf->start_next_window;
> -		}
> +	if (conf->array_frozen) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for array to unfreeze */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen,
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
>  
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
> -	return sector;
>  }
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector)
> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	_wait_barrier(conf, idx);
> +}
> +
> +static void wait_all_barriers(struct r1conf *conf)
> +{
> +	int idx;
> +
> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		_wait_barrier(conf, idx);
> +}
> +
> +static void _allow_barrier(struct r1conf *conf, int idx)
>  {
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->nr_pending--;
> -	if (start_next_window) {
> -		if (start_next_window == conf->start_next_window) {
> -			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> -			    <= bi_sector)
> -				conf->next_window_requests--;
> -			else
> -				conf->current_window_requests--;
> -		} else
> -			conf->current_window_requests--;
> -
> -		if (!conf->current_window_requests) {
> -			if (conf->next_window_requests) {
> -				conf->current_window_requests =
> -					conf->next_window_requests;
> -				conf->next_window_requests = 0;
> -				conf->start_next_window +=
> -					NEXT_NORMALIO_DISTANCE;
> -			} else
> -				conf->start_next_window = MaxSector;
> -		}
> -	}
> +	conf->nr_pending[idx]--;
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	_allow_barrier(conf, idx);
> +}
> +
> +static void allow_all_barriers(struct r1conf *conf)
> +{
> +	int idx;
> +
> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		_allow_barrier(conf, idx);
> +}
> +
> +/* conf->resync_lock should be held */
> +static int get_all_pendings(struct r1conf *conf)
> +{
> +	int idx, ret;
> +
> +	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		ret += conf->nr_pending[idx];
> +	return ret;
> +}
> +
> +/* conf->resync_lock should be held */
> +static int get_all_queued(struct r1conf *conf)
> +{
> +	int idx, ret;
> +
> +	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		ret += conf->nr_queued[idx];
> +	return ret;
> +}
> +
>  static void freeze_array(struct r1conf *conf, int extra)
>  {
> -	/* stop syncio and normal IO and wait for everything to
> +	/* Stop sync I/O and normal I/O and wait for everything to
>  	 * go quite.
> -	 * We wait until nr_pending match nr_queued+extra
> -	 * This is called in the context of one normal IO request
> -	 * that has failed. Thus any sync request that might be pending
> -	 * will be blocked by nr_pending, and we need to wait for
> -	 * pending IO requests to complete or be queued for re-try.
> -	 * Thus the number queued (nr_queued) plus this request (extra)
> -	 * must match the number of pending IOs (nr_pending) before
> -	 * we continue.
> +	 * This is called in two situations:
> +	 * 1) management command handlers (reshape, remove disk, quiesce).
> +	 * 2) one normal I/O request failed.
> +
> +	 * After array_frozen is set to 1, new sync IO will be blocked at
> +	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier().
> +	 * The flying I/Os will either complete or be queued. When everything
> +	 * goes quite, there are only queued I/Os left.
> +
> +	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
> +	 * barrier bucket index which this I/O request hits. When all sync and
> +	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
> +	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
> +	 * in handle_read_error(), we may call freeze_array() before trying to
> +	 * fix the read error. In this case, the error read I/O is not queued,
> +	 * so get_all_pending() == get_all_queued() + 1.
> +	 *
> +	 * Therefore before this function returns, we need to wait until
> +	 * get_all_pendings(conf) gets equal to get_all_queued(conf)+extra. For
> +	 * normal I/O context, extra is 1, in rested situations extra is 0.
>  	 */
>  	spin_lock_irq(&conf->resync_lock);
>  	conf->array_frozen = 1;
>  	raid1_log(conf->mddev, "wait freeze");
> -	wait_event_lock_irq_cmd(conf->wait_barrier,
> -				conf->nr_pending == conf->nr_queued+extra,
> -				conf->resync_lock,
> -				flush_pending_writes(conf));
> +	wait_event_lock_irq_cmd(
> +		conf->wait_barrier,
> +		get_all_pendings(conf) == get_all_queued(conf)+extra,
> +		conf->resync_lock,
> +		flush_pending_writes(conf));
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  static void unfreeze_array(struct r1conf *conf)
> @@ -1066,64 +1094,23 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	kfree(plug);
>  }
>  
> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r1conf *conf = mddev->private;
>  	struct raid1_info *mirror;
>  	struct r1bio *r1_bio;
>  	struct bio *read_bio;
> -	int i, disks;
>  	struct bitmap *bitmap;
> -	unsigned long flags;
>  	const int op = bio_op(bio);
> -	const int rw = bio_data_dir(bio);
>  	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> -	const unsigned long do_flush_fua = (bio->bi_opf &
> -						(REQ_PREFLUSH | REQ_FUA));
> -	struct md_rdev *blocked_rdev;
> -	struct blk_plug_cb *cb;
> -	struct raid1_plug_cb *plug = NULL;
> -	int first_clone;
>  	int sectors_handled;
>  	int max_sectors;
> -	sector_t start_next_window;
> +	int rdisk;
>  
> -	/*
> -	 * Register the new request and wait if the reconstruction
> -	 * thread has put up a bar for new requests.
> -	 * Continue immediately if no resync is active currently.
> +	/* Still need barrier for READ in case that whole
> +	 * array is frozen.
>  	 */
> -
> -	md_write_start(mddev, bio); /* wait on superblock update early */
> -
> -	if (bio_data_dir(bio) == WRITE &&
> -	    ((bio_end_sector(bio) > mddev->suspend_lo &&
> -	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> -	    (mddev_is_clustered(mddev) &&
> -	     md_cluster_ops->area_resyncing(mddev, WRITE,
> -		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
> -		/* As the suspend_* range is controlled by
> -		 * userspace, we want an interruptible
> -		 * wait.
> -		 */
> -		DEFINE_WAIT(w);
> -		for (;;) {
> -			flush_signals(current);
> -			prepare_to_wait(&conf->wait_barrier,
> -					&w, TASK_INTERRUPTIBLE);
> -			if (bio_end_sector(bio) <= mddev->suspend_lo ||
> -			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
> -			    (mddev_is_clustered(mddev) &&
> -			     !md_cluster_ops->area_resyncing(mddev, WRITE,
> -				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
> -				break;
> -			schedule();
> -		}
> -		finish_wait(&conf->wait_barrier, &w);
> -	}
> -
> -	start_next_window = wait_barrier(conf, bio);
> -
> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
>  	bitmap = mddev->bitmap;
>  
>  	/*
> @@ -1149,12 +1136,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	bio->bi_phys_segments = 0;
>  	bio_clear_flag(bio, BIO_SEG_VALID);
>  
> -	if (rw == READ) {
>  		/*
>  		 * read balancing logic:
>  		 */
> -		int rdisk;
> -
>  read_again:
>  		rdisk = read_balance(conf, r1_bio, &max_sectors);
>  
> @@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  				   atomic_read(&bitmap->behind_writes) == 0);
>  		}
>  		r1_bio->read_disk = rdisk;
> -		r1_bio->start_next_window = 0;
>  
>  		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>  		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
> @@ -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  		} else
>  			generic_make_request(read_bio);
>  		return;
> +}
> +
> +static void raid1_make_write_request(struct mddev *mddev, struct bio *bio)
> +{
> +	struct r1conf *conf = mddev->private;
> +	struct r1bio *r1_bio;
> +	int i, disks;
> +	struct bitmap *bitmap;
> +	unsigned long flags;
> +	const int op = bio_op(bio);
> +	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> +	const unsigned long do_flush_fua = (bio->bi_opf &
> +						(REQ_PREFLUSH | REQ_FUA));
> +	struct md_rdev *blocked_rdev;
> +	struct blk_plug_cb *cb;
> +	struct raid1_plug_cb *plug = NULL;
> +	int first_clone;
> +	int sectors_handled;
> +	int max_sectors;
> +
> +	/*
> +	 * Register the new request and wait if the reconstruction
> +	 * thread has put up a bar for new requests.
> +	 * Continue immediately if no resync is active currently.
> +	 */
> +
> +	md_write_start(mddev, bio); /* wait on superblock update early */
> +
> +	if (((bio_end_sector(bio) > mddev->suspend_lo &&
> +	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> +	    (mddev_is_clustered(mddev) &&
> +	     md_cluster_ops->area_resyncing(mddev, WRITE,
> +		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
> +		/* As the suspend_* range is controlled by
> +		 * userspace, we want an interruptible
> +		 * wait.
> +		 */
> +		DEFINE_WAIT(w);
> +
> +		for (;;) {
> +			flush_signals(current);
> +			prepare_to_wait(&conf->wait_barrier,
> +					&w, TASK_INTERRUPTIBLE);
> +			if (bio_end_sector(bio) <= mddev->suspend_lo ||
> +			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
> +			    (mddev_is_clustered(mddev) &&
> +			     !md_cluster_ops->area_resyncing(
> +						mddev,
> +						WRITE,
> +						bio->bi_iter.bi_sector,
> +						bio_end_sector(bio))))
> +				break;
> +			schedule();
> +		}
> +		finish_wait(&conf->wait_barrier, &w);
>  	}
>  
> +	wait_barrier(conf, bio->bi_iter.bi_sector);
> +	bitmap = mddev->bitmap;
> +
>  	/*
> -	 * WRITE:
> +	 * make_request() can abort the operation when read-ahead is being
> +	 * used and no empty request is available.
> +	 *
> +	 */
> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +
> +	r1_bio->master_bio = bio;
> +	r1_bio->sectors = bio_sectors(bio);
> +	r1_bio->state = 0;
> +	r1_bio->mddev = mddev;
> +	r1_bio->sector = bio->bi_iter.bi_sector;
> +
> +	/* We might need to issue multiple reads to different
> +	 * devices if there are bad blocks around, so we keep
> +	 * track of the number of reads in bio->bi_phys_segments.
> +	 * If this is 0, there is only one r1_bio and no locking
> +	 * will be needed when requests complete.  If it is
> +	 * non-zero, then it is the number of not-completed requests.

This comment mentions "reads".  It should probably be changed to discuss
what happens to "writes" since this is raid1_make_write_request().

>  	 */
> +	bio->bi_phys_segments = 0;
> +	bio_clear_flag(bio, BIO_SEG_VALID);
> +
>  	if (conf->pending_count >= max_queued_requests) {
>  		md_wakeup_thread(mddev->thread);
>  		raid1_log(mddev, "wait queued");
> @@ -1256,7 +1317,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  
>  	disks = conf->raid_disks * 2;
>   retry_write:
> -	r1_bio->start_next_window = start_next_window;
>  	blocked_rdev = NULL;
>  	rcu_read_lock();
>  	max_sectors = r1_bio->sectors;
> @@ -1324,25 +1384,15 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	if (unlikely(blocked_rdev)) {
>  		/* Wait for this device to become unblocked */
>  		int j;
> -		sector_t old = start_next_window;
>  
>  		for (j = 0; j < i; j++)
>  			if (r1_bio->bios[j])
>  				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
>  		r1_bio->state = 0;
> -		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
> +		allow_barrier(conf, bio->bi_iter.bi_sector);
>  		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
>  		md_wait_for_blocked_rdev(blocked_rdev, mddev);
> -		start_next_window = wait_barrier(conf, bio);
> -		/*
> -		 * We must make sure the multi r1bios of bio have
> -		 * the same value of bi_phys_segments
> -		 */
> -		if (bio->bi_phys_segments && old &&
> -		    old != start_next_window)
> -			/* Wait for the former r1bio(s) to complete */
> -			wait_event(conf->wait_barrier,
> -				   bio->bi_phys_segments == 1);
> +		wait_barrier(conf, bio->bi_iter.bi_sector);
>  		goto retry_write;
>  	}
>  
> @@ -1464,6 +1514,31 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	wake_up(&conf->wait_barrier);
>  }
>  
> +static void raid1_make_request(struct mddev *mddev, struct bio *bio)
> +{
> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> +	struct bio *split;
> +	sector_t sectors;
> +
> +	make_request_fn = (bio_data_dir(bio) == READ) ?
> +			  raid1_make_read_request :
> +			  raid1_make_write_request;
> +
> +	/* if bio exceeds barrier unit boundary, split it */
> +	do {
> +		sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
> +						    bio_sectors(bio));
> +		if (sectors < bio_sectors(bio)) {
> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +			bio_chain(split, bio);
> +		} else {
> +			split = bio;
> +		}
> +
> +		make_request_fn(mddev, split);
> +	} while (split != bio);
> +}
> +
>  static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>  {
>  	struct r1conf *conf = mddev->private;
> @@ -1552,19 +1627,11 @@ static void print_conf(struct r1conf *conf)
>  
>  static void close_sync(struct r1conf *conf)
>  {
> -	wait_barrier(conf, NULL);
> -	allow_barrier(conf, 0, 0);
> +	wait_all_barriers(conf);
> +	allow_all_barriers(conf);
>  
>  	mempool_destroy(conf->r1buf_pool);
>  	conf->r1buf_pool = NULL;
> -
> -	spin_lock_irq(&conf->resync_lock);
> -	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
> -	conf->start_next_window = MaxSector;
> -	conf->current_window_requests +=
> -		conf->next_window_requests;
> -	conf->next_window_requests = 0;
> -	spin_unlock_irq(&conf->resync_lock);
>  }
>  
>  static int raid1_spare_active(struct mddev *mddev)
> @@ -2311,8 +2378,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
>  
>  static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  {
> -	int m;
> +	int m, idx;
>  	bool fail = false;
> +
>  	for (m = 0; m < conf->raid_disks * 2 ; m++)
>  		if (r1_bio->bios[m] == IO_MADE_GOOD) {
>  			struct md_rdev *rdev = conf->mirrors[m].rdev;
> @@ -2338,7 +2406,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  	if (fail) {
>  		spin_lock_irq(&conf->device_lock);
>  		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
> -		conf->nr_queued++;
> +		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> +		conf->nr_queued[idx]++;
>  		spin_unlock_irq(&conf->device_lock);
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -2460,6 +2529,7 @@ static void raid1d(struct md_thread *thread)
>  	struct r1conf *conf = mddev->private;
>  	struct list_head *head = &conf->retry_list;
>  	struct blk_plug plug;
> +	int idx;
>  
>  	md_check_recovery(mddev);
>  
> @@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread *thread)
>  	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
>  		LIST_HEAD(tmp);
>  		spin_lock_irqsave(&conf->device_lock, flags);
> -		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> -			while (!list_empty(&conf->bio_end_io_list)) {
> -				list_move(conf->bio_end_io_list.prev, &tmp);
> -				conf->nr_queued--;
> -			}
> -		}
> +		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +			list_splice_init(&conf->bio_end_io_list, &tmp);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  		while (!list_empty(&tmp)) {
>  			r1_bio = list_first_entry(&tmp, struct r1bio,
>  						  retry_list);
>  			list_del(&r1_bio->retry_list);
> +			idx = hash_long(r1_bio->sector,
> +					BARRIER_BUCKETS_NR_BITS);
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			conf->nr_queued[idx]--;
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
>  			if (mddev->degraded)
>  				set_bit(R1BIO_Degraded, &r1_bio->state);
>  			if (test_bit(R1BIO_WriteError, &r1_bio->state))
> @@ -2498,7 +2569,8 @@ static void raid1d(struct md_thread *thread)
>  		}
>  		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
>  		list_del(head->prev);
> -		conf->nr_queued--;
> +		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> +		conf->nr_queued[idx]--;
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  		mddev = r1_bio->mddev;
> @@ -2537,7 +2609,6 @@ static int init_resync(struct r1conf *conf)
>  					  conf->poolinfo);
>  	if (!conf->r1buf_pool)
>  		return -ENOMEM;
> -	conf->next_resync = 0;
>  	return 0;
>  }
>  
> @@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	int still_degraded = 0;
>  	int good_sectors = RESYNC_SECTORS;
>  	int min_bad = 0; /* number of sectors that are bad in all devices */
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
>  	if (!conf->r1buf_pool)
>  		if (init_resync(conf))
> @@ -2615,7 +2687,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	 * If there is non-resync activity waiting for a turn, then let it
>  	 * though before starting on this new sync request.
>  	 */
> -	if (conf->nr_waiting)
> +	if (conf->nr_waiting[idx])
>  		schedule_timeout_uninterruptible(1);
>  
>  	/* we are incrementing sector_nr below. To be safe, we check against
> @@ -2642,6 +2714,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	r1_bio->sector = sector_nr;
>  	r1_bio->state = 0;
>  	set_bit(R1BIO_IsSync, &r1_bio->state);
> +	/* make sure good_sectors won't go across barrier unit boundary */
> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>  
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  		struct md_rdev *rdev;
> @@ -2927,9 +3001,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
>  	conf->pending_count = 0;
>  	conf->recovery_disabled = mddev->recovery_disabled - 1;
>  
> -	conf->start_next_window = MaxSector;
> -	conf->current_window_requests = conf->next_window_requests = 0;
> -
>  	err = -EIO;
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index c52ef42..817115d 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -1,6 +1,14 @@
>  #ifndef _RAID1_H
>  #define _RAID1_H
>  
> +/* each barrier unit size is 64MB fow now
> + * note: it must be larger than RESYNC_DEPTH
> + */
> +#define BARRIER_UNIT_SECTOR_BITS	17
> +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
> +#define BARRIER_BUCKETS_NR_BITS		9
> +#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
> +
>  struct raid1_info {
>  	struct md_rdev	*rdev;
>  	sector_t	head_position;
> @@ -35,25 +43,6 @@ struct r1conf {
>  						 */
>  	int			raid_disks;
>  
> -	/* During resync, read_balancing is only allowed on the part
> -	 * of the array that has been resynced.  'next_resync' tells us
> -	 * where that is.
> -	 */
> -	sector_t		next_resync;
> -
> -	/* When raid1 starts resync, we divide array into four partitions
> -	 * |---------|--------------|---------------------|-------------|
> -	 *        next_resync   start_next_window       end_window
> -	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
> -	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
> -	 * current_window_requests means the count of normalIO between
> -	 *   start_next_window and end_window.
> -	 * next_window_requests means the count of normalIO after end_window.
> -	 * */
> -	sector_t		start_next_window;
> -	int			current_window_requests;
> -	int			next_window_requests;
> -
>  	spinlock_t		device_lock;
>  
>  	/* list of 'struct r1bio' that need to be processed by raid1d,
> @@ -79,10 +68,11 @@ struct r1conf {
>  	 */
>  	wait_queue_head_t	wait_barrier;
>  	spinlock_t		resync_lock;
> -	int			nr_pending;
> -	int			nr_waiting;
> -	int			nr_queued;
> -	int			barrier;
> +	int			nr_pending[BARRIER_BUCKETS_NR];
> +	int			nr_waiting[BARRIER_BUCKETS_NR];
> +	int			nr_queued[BARRIER_BUCKETS_NR];
> +	int			barrier[BARRIER_BUCKETS_NR];
> +	int			total_barriers;
>  	int			array_frozen;
>  
>  	/* Set to 1 if a full sync is needed, (fresh device added).
> @@ -135,7 +125,6 @@ struct r1bio {
>  						 * in this BehindIO request
>  						 */
>  	sector_t		sector;
> -	sector_t		start_next_window;
>  	int			sectors;
>  	unsigned long		state;
>  	struct mddev		*mddev;
> -- 
> 2.6.6


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
From: NeilBrown @ 2017-01-06  1:52 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang
In-Reply-To: <1482853658-82535-2-git-send-email-colyli@suse.de>

[-- Attachment #1: Type: text/plain, Size: 18858 bytes --]

On Wed, Dec 28 2016, Coly Li wrote:

> When I run a parallel reading performan testing on a md raid1 device with
> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
> only 2.7GB/s, this is around 50% of the idea performance number.
>
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
>
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock firstly, even there are only regular
> reading I/Os and no resync I/O at all. This is a huge performance penalty.
>
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it is really necessary.
>
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. Now I write the patch based on new simpler raid1
> I/O barrier code.
>
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which in turns calls _wait_barrier(), is used for regular write I/O.
>    If there is resync I/O happening on the same barrier bucket index, or
>    the whole array is frozen, task will wait until no barrier on same
>    bucket index, or the whold array is unfreezed.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only uptodate data will be read out), so it is
>    unnecessary to wait for barrier in regular read I/Os, they only have to
>    wait only when the whole array is frozen.
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() or
> wait_read_barrier() if no barrier raised in same barrier bucket index or
> array is not frozen, the regular I/O doesn't need to hold conf->
> resync_lock, it can just increase conf->nr_pending[idx], and return to its
> caller. For heavy parallel reading I/Os, the lockless I/O barrier code
> almostly gets rid of all spin lock cost.
>
> This patch significantly improves raid1 reading peroformance. From my
> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
>
> Open question:
> Shaohua points out the memory barrier should be added to some atomic
> operations. Now I am reading the document to learn how to add the memory
> barriers correctly. Anyway, if anyone has suggestion, please don't
> hesitate to let me know.

When converting code from the use of spinlocks to the use of atomics
the most important consideration is to understand changed ordering
requirements.
When spinlocks are used, a sequence of operations within a spinlocked
region are indivisible with respect to other threads running spinlocked
code, so the order within those regions is not relevant.  When you drop
the spinlocks, the order might become relevant, and so needs to be
understood.
The most obvious ordering requirement that your code shows is in
the need to move the increment of nr_pending[] before checking for
barrier[] to be zero.  There might be others.

Once you understand the ordering requirements, you can then determine
if any barriers might be needed to make sure the ordering is globally
visible.

>
> Changelog
> V1:
> - Original RFC patch for comments.
> V2:
> - Remove a spin_lock/unlock pair in raid1d().
> - Add more code comments to explain why there is no racy when checking two
>   atomic_t variables at same time.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid1.c | 134 +++++++++++++++++++++++++++++++----------------------
>  drivers/md/raid1.h |  12 ++---
>  2 files changed, 85 insertions(+), 61 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 5813656..b1fb4c1 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -222,7 +222,7 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> -	conf->nr_queued[idx]++;
> +	atomic_inc(&conf->nr_queued[idx]);
>  	spin_unlock_irqrestore(&conf->device_lock, flags);

nr_queued is only tested in freeze_array().
freeze_array() might be waiting for nr_queued to be incremented so that
it matches nr_pending.  That will implies that all pending requests are
in a queue, and so are not active.
So the increment must happen before the wake_up.  That is the only
ordering requirement on nr_queued.
In every case, nr_queued is still incremented inside the
device_lock spinlocked region, so the spin_unlock() will ensure the new
value is visible.
There is one case (in raid1d) where nr_queued is decremented outside of
any spinlock, both nothing is waiting for nr_queued to be decremented,
so that cannot matter.

I do wonder if nr_pending really needs to be an atomic_t.  It quite be
quite easy to leave it as a simple 'int'.  Maybe not helpful though.


>  
>  	wake_up(&conf->wait_barrier);
> @@ -831,13 +831,13 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  	spin_lock_irq(&conf->resync_lock);
>  
>  	/* Wait until no block IO is waiting */
> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->nr_waiting[idx]),
>  			    conf->resync_lock);
>  
>  	/* block any new IO from starting */
> -	conf->barrier[idx]++;
> -	conf->total_barriers++;
> -
> +	atomic_inc(&conf->barrier[idx]);
> +	atomic_inc(&conf->total_barriers);
>  	/* For these conditions we must wait:
>  	 * A: while the array is in frozen state
>  	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> @@ -846,44 +846,69 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  	 *    the max count which allowed on the whole raid1 device.
>  	 */
>  	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->array_frozen &&
> -			     !conf->nr_pending[idx] &&
> -			     conf->total_barriers < RESYNC_DEPTH,
> +			    !atomic_read(&conf->array_frozen) &&
> +			     !atomic_read(&conf->nr_pending[idx]) &&
> +			     atomic_read(&conf->total_barriers) < RESYNC_DEPTH,
>  			    conf->resync_lock);
>  
> -	conf->nr_pending[idx]++;
> +	atomic_inc(&conf->nr_pending[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
>  static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> -	unsigned long flags;
>  	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
> -	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
> -
> -	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->barrier[idx]--;
> -	conf->total_barriers--;
> -	conf->nr_pending[idx]--;
> -	spin_unlock_irqrestore(&conf->resync_lock, flags);
> +	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
> +	BUG_ON(atomic_read(&conf->total_barriers) <= 0);
> +	atomic_dec(&conf->barrier[idx]);
> +	atomic_dec(&conf->total_barriers);
> +	atomic_dec(&conf->nr_pending[idx]);
>  	wake_up(&conf->wait_barrier);

This is the first place where you remove a spin_lock call so it
is worth look at this.

The only code that really cares about the ->barrier and ->nr_pending
values are the wait_event*() calls in freeze_array(), raise_barrier(),
wait_barrier(). As long as the change happens *before* the wake_up()
these changes are safe, and the wake_up() provides all barriers that
were needed.

>  }
>  
>  static void _wait_barrier(struct r1conf *conf, int idx)
>  {
> -	spin_lock_irq(&conf->resync_lock);
> -	if (conf->array_frozen || conf->barrier[idx]) {
> -		conf->nr_waiting[idx]++;
> -		/* Wait for the barrier to drop. */
> -		wait_event_lock_irq(
> -			conf->wait_barrier,
> -			!conf->array_frozen && !conf->barrier[idx],
> -			conf->resync_lock);
> -		conf->nr_waiting[idx]--;
> -	}
> +	/* We need to increase conf->nr_pending[idx] very early here,
> +	 * then raise_barrier() can be blocked when it waits for
> +	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
> +	 * conf->resync_lock when there is no barrier raised in same
> +	 * barrier unit bucket. Also if the array is frozen, I/O
> +	 * should be blocked until array is unfrozen.
> +	 */
> +	atomic_inc(&conf->nr_pending[idx]);

It is important that this atomic_inc() happens *before* the
atomic_read() of ->barrier below.
If the order is reversed then immediately after we read ->barrier,
raise_barrer() might increment->barrier, then test ->nr_pending and find
it to be zero - just before we increment it.
In that case, both wait_barrier() and raise_barrier() would proceed
without waiting.
So it is important that the atomic_inc() is visible to raise_barrier()
before we read conf->barrier.
It is equally important that the atomic_inc of conf->barrer in
raise_barrier() is visible here before raise_barrier() reads
conf->nr_pending.

It would be nice we could call
         atomic_inc_acquire(&conf->nr_pending[idx]);

but there is no atomic_inc_acquire().  The closest is
atomic_inc_acquire_return(), but we don't want the return value.

We could either use it, and just ignore the return value, or
put an explicit smp_mb__after_atomic() after the atomic_inc().

Whatever is done here should also be done in raise_barrier().


> +
> +	/* Don't worry about checking two atomic_t variables at same time
> +	 * here. conf->array_frozen MUST be checked firstly, The logic is,
> +	 * if the array is frozen, no matter there is any barrier or not,
> +	 * all I/O should be blocked. If there is no barrier in current
> +	 * barrier bucket, we still need to check whether the array is frozen,
> +	 * otherwise I/O will happen on frozen array, that's buggy.
> +	 * If during we check conf->barrier[idx], the array is frozen (a.k.a
> +	 * conf->array_frozen is set), and chonf->barrier[idx] is 0, it is
> +	 * safe to return and make the I/O continue. Because the array is
> +	 * frozen, all I/O returned here will eventually complete or be
> +	 * queued, see code comment in frozen_array().
> +	 */
> +	if (!atomic_read(&conf->array_frozen) &&
> +	    !atomic_read(&conf->barrier[idx]))
> +		return;
>  
> -	conf->nr_pending[idx]++;
> +	/* After holding conf->resync_lock, conf->nr_pending[idx]
> +	 * should be decreased before waiting for barrier to drop.
> +	 * Otherwise, we may encounter a race condition because
> +	 * raise_barrer() might be waiting for conf->nr_pending[idx]
> +	 * to be 0 at same time.
> +	 */

You say "After holding conf->resync_lock", but you aren't holding
conf->resync_lock here.

> +	atomic_inc(&conf->nr_waiting[idx]);
> +	atomic_dec(&conf->nr_pending[idx]);

There is an ordering dependency between this atomic_inc and atomic_dec.
If the spinlock is held here (Which I think is your intention)
then you won't need any extra barrier.

> +	/* Wait for the barrier in same barrier unit bucket to drop. */
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->array_frozen) &&
> +			     !atomic_read(&conf->barrier[idx]),
> +			    conf->resync_lock);
> +	atomic_inc(&conf->nr_pending[idx]);
> +	atomic_dec(&conf->nr_waiting[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> @@ -891,18 +916,23 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
>  	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
> -	spin_lock_irq(&conf->resync_lock);
> -	if (conf->array_frozen) {
> -		conf->nr_waiting[idx]++;
> -		/* Wait for array to unfreeze */
> -		wait_event_lock_irq(
> -			conf->wait_barrier,
> -			!conf->array_frozen,
> -			conf->resync_lock);
> -		conf->nr_waiting[idx]--;
> -	}
> +	/* Very similar to _wait_barrier(). The difference is, for read
> +	 * I/O we don't need wait for sync I/O, but if the whole array
> +	 * is frozen, the read I/O still has to wait until the array is
> +	 * unfrozen.
> +	 */
> +	atomic_inc(&conf->nr_pending[idx]);

I think that above should be atomic_inc_return_acquire() too.

> +	if (!atomic_read(&conf->array_frozen))
> +		return;
>  
> -	conf->nr_pending[idx]++;

again, missing spin_lock(resync_lock);

> +	atomic_inc(&conf->nr_waiting[idx]);
> +	atomic_dec(&conf->nr_pending[idx]);
> +	/* Wait for array to be unfrozen */
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->array_frozen),
> +			    conf->resync_lock);
> +	atomic_inc(&conf->nr_pending[idx]);
> +	atomic_dec(&conf->nr_waiting[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> @@ -923,11 +953,7 @@ static void wait_all_barriers(struct r1conf *conf)
>  
>  static void _allow_barrier(struct r1conf *conf, int idx)
>  {
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->nr_pending[idx]--;
> -	spin_unlock_irqrestore(&conf->resync_lock, flags);
> +	atomic_dec(&conf->nr_pending[idx]);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> @@ -952,7 +978,7 @@ static int get_all_pendings(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_pending[idx];
> +		ret += atomic_read(&conf->nr_pending[idx]);
>  	return ret;
>  }
>  
> @@ -962,7 +988,7 @@ static int get_all_queued(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_queued[idx];
> +		ret += atomic_read(&conf->nr_queued[idx]);
>  	return ret;
>  }

I don't really like these get_all_pending, and get_all_queued, but I
can see why they are needed.
Maybe just have a single get_unqueued_pending() which does:

  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
		ret += atomic_read(&conf->nr_pending[idx]) - atomic_read(&conf->nr_queued[idx]);

then have freeze_array() wait for  get_unqueued_pending() == extra.


>  
> @@ -992,7 +1018,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>  	 * normal I/O context, extra is 1, in rested situations extra is 0.
>  	 */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 1;
> +	atomic_set(&conf->array_frozen, 1);

There is no reason for ->array_frozen to be an atomic_t.
Just leave it as an 'int'.
You may still need to think about what ordering dependences it has.

>  	raid1_log(conf->mddev, "wait freeze");
>  	wait_event_lock_irq_cmd(
>  		conf->wait_barrier,
> @@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
>  {
>  	/* reverse the effect of the freeze */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 0;
> +	atomic_set(&conf->array_frozen, 0);
>  	wake_up(&conf->wait_barrier);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
> @@ -2407,7 +2433,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  		spin_lock_irq(&conf->device_lock);
>  		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
>  		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> -		conf->nr_queued[idx]++;
> +		atomic_inc(&conf->nr_queued[idx]);
>  		spin_unlock_irq(&conf->device_lock);
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -2546,9 +2572,7 @@ static void raid1d(struct md_thread *thread)
>  			list_del(&r1_bio->retry_list);
>  			idx = hash_long(r1_bio->sector,
>  					BARRIER_BUCKETS_NR_BITS);
> -			spin_lock_irqsave(&conf->device_lock, flags);
> -			conf->nr_queued[idx]--;
> -			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			atomic_dec(&conf->nr_queued[idx]);
>  			if (mddev->degraded)
>  				set_bit(R1BIO_Degraded, &r1_bio->state);
>  			if (test_bit(R1BIO_WriteError, &r1_bio->state))
> @@ -2570,7 +2594,7 @@ static void raid1d(struct md_thread *thread)
>  		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
>  		list_del(head->prev);
>  		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> -		conf->nr_queued[idx]--;
> +		atomic_dec(&conf->nr_queued[idx]);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  		mddev = r1_bio->mddev;
> @@ -2687,7 +2711,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	 * If there is non-resync activity waiting for a turn, then let it
>  	 * though before starting on this new sync request.
>  	 */
> -	if (conf->nr_waiting[idx])
> +	if (atomic_read(&conf->nr_waiting[idx]))
>  		schedule_timeout_uninterruptible(1);
>  
>  	/* we are incrementing sector_nr below. To be safe, we check against
> @@ -3316,7 +3340,7 @@ static void *raid1_takeover(struct mddev *mddev)
>  		conf = setup_conf(mddev);
>  		if (!IS_ERR(conf)) {
>  			/* Array must appear to be quiesced */
> -			conf->array_frozen = 1;
> +			atomic_set(&conf->array_frozen, 1);
>  			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
>  			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
>  		}
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index 817115d..bbe65f7 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -68,12 +68,12 @@ struct r1conf {
>  	 */
>  	wait_queue_head_t	wait_barrier;
>  	spinlock_t		resync_lock;
> -	int			nr_pending[BARRIER_BUCKETS_NR];
> -	int			nr_waiting[BARRIER_BUCKETS_NR];
> -	int			nr_queued[BARRIER_BUCKETS_NR];
> -	int			barrier[BARRIER_BUCKETS_NR];
> -	int			total_barriers;
> -	int			array_frozen;
> +	atomic_t		nr_pending[BARRIER_BUCKETS_NR];
> +	atomic_t		nr_waiting[BARRIER_BUCKETS_NR];
> +	atomic_t		nr_queued[BARRIER_BUCKETS_NR];
> +	atomic_t		barrier[BARRIER_BUCKETS_NR];
> +	atomic_t		total_barriers;
> +	atomic_t		array_frozen;
>  
>  	/* Set to 1 if a full sync is needed, (fresh device added).
>  	 * Cleared when a sync completes.
> -- 
> 2.6.6

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH v2 0/7] uapi: export all headers under uapi directories
From: Nicolas Dichtel @ 2017-01-06  9:43 UTC (permalink / raw)
  To: arnd
  Cc: mmarek, linux-kbuild, linux-doc, linux-kernel, linux-alpha,
	linux-snps-arc, linux-arm-kernel, adi-buildroot-devel,
	linux-c6x-dev, linux-cris-kernel, uclinux-h8-devel, linux-hexagon,
	linux-ia64, linux-m68k, linux-metag, linux-mips, linux-am33-list,
	nios2-dev, openrisc, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, sparclinux, linux-xtensa, linux-arch
In-Reply-To: <bf83da6b-01ef-bf44-b3e1-ca6fc5636818@6wind.com>


Here is the v2 of this series. The first 5 patches are just cleanup: some
exported headers were still under a non-uapi directory.
The patch 6 was spotted by code review: there is no in-tree user of this
functionality.
The last patch remove the use of header-y. Now all files under an uapi
directory are exported.

asm is a bit special, most of architectures export asm/<arch>/include/uapi/asm
only, but there is two exceptions:
 - cris which exports arch/cris/include/uapi/arch-v[10|32];
 - tile which exports arch/tile/include/uapi/arch.
Because I don't know if the output of 'make headers_install_all' can be changed,
I introduce subdir-y in Kbuild file. The headers_install_all target copies all
asm/<arch>/include/uapi/asm to usr/include/asm-<arch> but
arch/cris/include/uapi/arch-v[10|32] and arch/tile/include/uapi/arch are not
prefixed (they are put asis in usr/include/). If it's acceptable to modify the
output of 'make headers_install_all' to export asm headers in
usr/include/asm-<arch>/asm, then I could remove this new subdir-y and exports
everything under arch/<arch>/include/uapi/.

Note also that exported files for asm are a mix of files listed by:
 - include/uapi/asm-generic/Kbuild.asm;
 - arch/x86/include/uapi/asm/Kbuild;
 - arch/x86/include/asm/Kbuild.
This complicates a lot the processing (arch/x86/include/asm/Kbuild is also
used by scripts/Makefile.asm-generic).

This series has been tested with a 'make headers_install' on x86 and a
'make headers_install_all'. I've checked the result of both commands.

This patch is built against linus tree. I don't know if it should be
made against antoher tree.

Comments are welcomed,
Nicolas

^ permalink raw reply

* [PATCH v2 1/7] arm: put types.h in uapi
From: Nicolas Dichtel @ 2017-01-06  9:43 UTC (permalink / raw)
  To: arnd
  Cc: linux-mips, alsa-devel, linux-ia64, linux-doc, airlied,
	linux-fbdev, dri-devel, linux-mtd, sparclinux, linux-arch,
	linux-s390, linux-am33-list, linux-c6x-dev, linux-rdma,
	linux-hexagon, linux-sh, coreteam, fcoe-devel, xen-devel,
	linux-snps-arc, linux-media, uclinux-h8-devel, linux-xtensa,
	linux-kbuild, adi-buildroot-devel, linux-raid, linux-m68k,
	openrisc, Nicolas Dichtel
In-Reply-To: <1483695839-18660-1-git-send-email-nicolas.dichtel@6wind.com>

This header file is exported, thus move it to uapi.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 arch/arm/include/asm/types.h      | 36 +----------------------------------
 arch/arm/include/uapi/asm/types.h | 40 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 35 deletions(-)
 create mode 100644 arch/arm/include/uapi/asm/types.h

diff --git a/arch/arm/include/asm/types.h b/arch/arm/include/asm/types.h
index a53cdb8f068c..c48fee3d7b3b 100644
--- a/arch/arm/include/asm/types.h
+++ b/arch/arm/include/asm/types.h
@@ -1,40 +1,6 @@
 #ifndef _ASM_TYPES_H
 #define _ASM_TYPES_H
 
-#include <asm-generic/int-ll64.h>
-
-/*
- * The C99 types uintXX_t that are usually defined in 'stdint.h' are not as
- * unambiguous on ARM as you would expect. For the types below, there is a
- * difference on ARM between GCC built for bare metal ARM, GCC built for glibc
- * and the kernel itself, which results in build errors if you try to build with
- * -ffreestanding and include 'stdint.h' (such as when you include 'arm_neon.h'
- * in order to use NEON intrinsics)
- *
- * As the typedefs for these types in 'stdint.h' are based on builtin defines
- * supplied by GCC, we can tweak these to align with the kernel's idea of those
- * types, so 'linux/types.h' and 'stdint.h' can be safely included from the same
- * source file (provided that -ffreestanding is used).
- *
- *                    int32_t         uint32_t               uintptr_t
- * bare metal GCC     long            unsigned long          unsigned int
- * glibc GCC          int             unsigned int           unsigned int
- * kernel             int             unsigned int           unsigned long
- */
-
-#ifdef __INT32_TYPE__
-#undef __INT32_TYPE__
-#define __INT32_TYPE__		int
-#endif
-
-#ifdef __UINT32_TYPE__
-#undef __UINT32_TYPE__
-#define __UINT32_TYPE__	unsigned int
-#endif
-
-#ifdef __UINTPTR_TYPE__
-#undef __UINTPTR_TYPE__
-#define __UINTPTR_TYPE__	unsigned long
-#endif
+#include <uapi/asm/types.h>
 
 #endif /* _ASM_TYPES_H */
diff --git a/arch/arm/include/uapi/asm/types.h b/arch/arm/include/uapi/asm/types.h
new file mode 100644
index 000000000000..9435a42f575e
--- /dev/null
+++ b/arch/arm/include/uapi/asm/types.h
@@ -0,0 +1,40 @@
+#ifndef _UAPI_ASM_TYPES_H
+#define _UAPI_ASM_TYPES_H
+
+#include <asm-generic/int-ll64.h>
+
+/*
+ * The C99 types uintXX_t that are usually defined in 'stdint.h' are not as
+ * unambiguous on ARM as you would expect. For the types below, there is a
+ * difference on ARM between GCC built for bare metal ARM, GCC built for glibc
+ * and the kernel itself, which results in build errors if you try to build with
+ * -ffreestanding and include 'stdint.h' (such as when you include 'arm_neon.h'
+ * in order to use NEON intrinsics)
+ *
+ * As the typedefs for these types in 'stdint.h' are based on builtin defines
+ * supplied by GCC, we can tweak these to align with the kernel's idea of those
+ * types, so 'linux/types.h' and 'stdint.h' can be safely included from the same
+ * source file (provided that -ffreestanding is used).
+ *
+ *                    int32_t         uint32_t               uintptr_t
+ * bare metal GCC     long            unsigned long          unsigned int
+ * glibc GCC          int             unsigned int           unsigned int
+ * kernel             int             unsigned int           unsigned long
+ */
+
+#ifdef __INT32_TYPE__
+#undef __INT32_TYPE__
+#define __INT32_TYPE__		int
+#endif
+
+#ifdef __UINT32_TYPE__
+#undef __UINT32_TYPE__
+#define __UINT32_TYPE__	unsigned int
+#endif
+
+#ifdef __UINTPTR_TYPE__
+#undef __UINTPTR_TYPE__
+#define __UINTPTR_TYPE__	unsigned long
+#endif
+
+#endif /* _UAPI_ASM_TYPES_H */
-- 
2.8.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related

* [PATCH v2 2/7] h8300: put bitsperlong.h in uapi
From: Nicolas Dichtel @ 2017-01-06  9:43 UTC (permalink / raw)
  To: arnd
  Cc: linux-mips, alsa-devel, linux-ia64, linux-doc, airlied,
	linux-fbdev, dri-devel, linux-mtd, sparclinux, linux-arch,
	linux-s390, linux-am33-list, linux-c6x-dev, linux-rdma,
	linux-hexagon, linux-sh, coreteam, fcoe-devel, xen-devel,
	linux-snps-arc, linux-media, uclinux-h8-devel, linux-xtensa,
	linux-kbuild, adi-buildroot-devel, linux-raid, linux-m68k,
	openrisc, Nicolas Dichtel
In-Reply-To: <1483695839-18660-1-git-send-email-nicolas.dichtel@6wind.com>

This header file is exported, thus move it to uapi.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 arch/h8300/include/asm/bitsperlong.h      | 10 +---------
 arch/h8300/include/uapi/asm/bitsperlong.h | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 9 deletions(-)
 create mode 100644 arch/h8300/include/uapi/asm/bitsperlong.h

diff --git a/arch/h8300/include/asm/bitsperlong.h b/arch/h8300/include/asm/bitsperlong.h
index e140e46729ac..c0a8e2ee531e 100644
--- a/arch/h8300/include/asm/bitsperlong.h
+++ b/arch/h8300/include/asm/bitsperlong.h
@@ -1,14 +1,6 @@
 #ifndef __ASM_H8300_BITS_PER_LONG
 #define __ASM_H8300_BITS_PER_LONG
 
-#include <asm-generic/bitsperlong.h>
-
-#if !defined(__ASSEMBLY__)
-/* h8300-unknown-linux required long */
-#define __kernel_size_t __kernel_size_t
-typedef unsigned long	__kernel_size_t;
-typedef long		__kernel_ssize_t;
-typedef long		__kernel_ptrdiff_t;
-#endif
+#include <uapi/asm/bitsperlong.h>
 
 #endif /* __ASM_H8300_BITS_PER_LONG */
diff --git a/arch/h8300/include/uapi/asm/bitsperlong.h b/arch/h8300/include/uapi/asm/bitsperlong.h
new file mode 100644
index 000000000000..e56cf72369b6
--- /dev/null
+++ b/arch/h8300/include/uapi/asm/bitsperlong.h
@@ -0,0 +1,14 @@
+#ifndef _UAPI_ASM_H8300_BITS_PER_LONG
+#define _UAPI_ASM_H8300_BITS_PER_LONG
+
+#include <asm-generic/bitsperlong.h>
+
+#if !defined(__ASSEMBLY__)
+/* h8300-unknown-linux required long */
+#define __kernel_size_t __kernel_size_t
+typedef unsigned long	__kernel_size_t;
+typedef long		__kernel_ssize_t;
+typedef long		__kernel_ptrdiff_t;
+#endif
+
+#endif /* _UAPI_ASM_H8300_BITS_PER_LONG */
-- 
2.8.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox