Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-03-07  8:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Johannes Thumshirn, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <ed4d83a1-9bdd-9e24-7768-ba5e85429110@suse.de>

On Tue, Mar 07, 2017 at 08:48:06AM +0100, Hannes Reinecke wrote:
> On 03/07/2017 08:23 AM, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
> >> On 03/07/2017 06:22 AM, Minchan Kim wrote:
> >>> Hello Johannes,
> >>>
> >>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
> >>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> >>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> >>>> pages attached to the bio's bvec this results in a kernel panic because of
> >>>> array out of bounds accesses in zram_decompress_page().
> >>>
> >>> First of all, thanks for the report and fix up!
> >>> Unfortunately, I'm not familiar with that interface of block layer.
> >>>
> >>> It seems this is a material for stable so I want to understand it clear.
> >>> Could you say more specific things to educate me?
> >>>
> >>> What scenario/When/How it is problem?  It will help for me to understand!
> >>>
> > 
> > Thanks for the quick response!
> > 
> >> The problem is that zram as it currently stands can only handle bios
> >> where each bvec contains a single page (or, to be precise, a chunk of
> >> data with a length of a page).
> > 
> > Right.
> > 
> >>
> >> This is not an automatic guarantee from the block layer (who is free to
> >> send us bios with arbitrary-sized bvecs), so we need to set the queue
> >> limits to ensure that.
> > 
> > What does it mean "bios with arbitrary-sized bvecs"?
> > What kinds of scenario is it used/useful?
> > 
> Each bio contains a list of bvecs, each of which points to a specific
> memory area:
> 
> struct bio_vec {
> 	struct page	*bv_page;
> 	unsigned int	bv_len;
> 	unsigned int	bv_offset;
> };
> 
> The trick now is that while 'bv_page' does point to a page, the memory
> area pointed to might in fact be contiguous (if several pages are
> adjacent). Hence we might be getting a bio_vec where bv_len is _larger_
> than a page.

Thanks for detail, Hannes!

If I understand it correctly, it seems to be related to bid_add_page
with high-order page. Right?

If so, I really wonder why I don't see such problem because several
places have used it and I expected some of them might do IO with
contiguous pages intentionally or by chance. Hmm,

IIUC, it's not a nvme specific problme but general problem which
can trigger normal FSes if they uses contiguos pages?

> 
> Hence the check for 'is_partial_io' in zram_drv.c (which just does a
> test 'if bv_len != PAGE_SIZE) is in fact wrong, as it would trigger for
> partial I/O (ie if the overall length of the bio_vec is _smaller_ than a
> page), but also for multipage bvecs (where the length of the bio_vec is
> _larger_ than a page).

Right. I need to look into that. Thanks for the pointing out!

> 
> So rather than fixing the bio scanning loop in zram it's easier to set
> the queue limits correctly so that 'is_partial_io' does the correct
> thing and the overall logic in zram doesn't need to be altered.


Isn't that approach require new bio allocation through blk_queue_split?
Maybe, it wouldn't make severe regression in zram-FS workload but need
to test.

Is there any ways to trigger the problem without real nvme device?
It would really help to test/measure zram.

Anyway, to me, it's really subtle at this moment so I doubt it should
be stable material. :(

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Hannes Reinecke @ 2017-03-07  7:00 UTC (permalink / raw)
  To: Minchan Kim, Johannes Thumshirn
  Cc: Jens Axboe, Nitin Gupta, Christoph Hellwig, Sergey Senozhatsky,
	yizhan, Linux Block Layer Mailinglist, Linux Kernel Mailinglist
In-Reply-To: <20170307052242.GA29458@bbox>

On 03/07/2017 06:22 AM, Minchan Kim wrote:
> Hello Johannes,
> 
> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
>> pages attached to the bio's bvec this results in a kernel panic because of
>> array out of bounds accesses in zram_decompress_page().
> 
> First of all, thanks for the report and fix up!
> Unfortunately, I'm not familiar with that interface of block layer.
> 
> It seems this is a material for stable so I want to understand it clear.
> Could you say more specific things to educate me?
> 
> What scenario/When/How it is problem?  It will help for me to understand!
> 
The problem is that zram as it currently stands can only handle bios
where each bvec contains a single page (or, to be precise, a chunk of
data with a length of a page).

This is not an automatic guarantee from the block layer (who is free to
send us bios with arbitrary-sized bvecs), so we need to set the queue
limits to ensure that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-03-07  7:23 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Johannes Thumshirn, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <95c31a93-32cd-ad06-6cc0-e11b42ec2f68@suse.de>

Hi Hannes,

On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
> On 03/07/2017 06:22 AM, Minchan Kim wrote:
>> Hello Johannes,
>>
>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
>>> pages attached to the bio's bvec this results in a kernel panic because of
>>> array out of bounds accesses in zram_decompress_page().
>>
>> First of all, thanks for the report and fix up!
>> Unfortunately, I'm not familiar with that interface of block layer.
>>
>> It seems this is a material for stable so I want to understand it clear.
>> Could you say more specific things to educate me?
>>
>> What scenario/When/How it is problem?  It will help for me to understand!
>>

Thanks for the quick response!

> The problem is that zram as it currently stands can only handle bios
> where each bvec contains a single page (or, to be precise, a chunk of
> data with a length of a page).

Right.

>
> This is not an automatic guarantee from the block layer (who is free to
> send us bios with arbitrary-sized bvecs), so we need to set the queue
> limits to ensure that.

What does it mean "bios with arbitrary-sized bvecs"?
What kinds of scenario is it used/useful?

And how can we solve it by setting queue limit?

Sorry for the many questions due to limited knowledge.

Thanks.

^ permalink raw reply

* Re: [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
From: Hannes Reinecke @ 2017-03-07  6:47 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown,
	Omar Sandoval
In-Reply-To: <20170306163404.1238-2-jack@suse.cz>

On 03/06/2017 05:33 PM, Jan Kara wrote:
> When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
> restart the process of opening the block device. However we forget to
> switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
> inode will be pointing to a stale bdi. Fix the problem by setting
> bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/block_dev.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: F. Imendï¿½rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nï¿½rnberg)

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-03-07  5:22 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Nitin Gupta, Christoph Hellwig, Sergey Senozhatsky,
	Hannes Reinecke, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <20170306102335.9180-1-jthumshirn@suse.de>

Hello Johannes,

On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> pages attached to the bio's bvec this results in a kernel panic because of
> array out of bounds accesses in zram_decompress_page().

First of all, thanks for the report and fix up!
Unfortunately, I'm not familiar with that interface of block layer.

It seems this is a material for stable so I want to understand it clear.
Could you say more specific things to educate me?

What scenario/When/How it is problem?  It will help for me to understand!

Thanks.

> 
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/block/zram/zram_drv.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index e27d89a..dceb5ed 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1189,6 +1189,8 @@ static int zram_add(void)
>  	blk_queue_io_min(zram->disk->queue, PAGE_SIZE);
>  	blk_queue_io_opt(zram->disk->queue, PAGE_SIZE);
>  	zram->disk->queue->limits.discard_granularity = PAGE_SIZE;
> +	zram->disk->queue->limits.max_sectors = SECTORS_PER_PAGE;
> +	zram->disk->queue->limits.chunk_sectors = 0;
>  	blk_queue_max_discard_sectors(zram->disk->queue, UINT_MAX);
>  	/*
>  	 * zram_bio_discard() will clear all logical blocks if logical block
> -- 
> 1.8.5.6
> 

^ permalink raw reply

* Re: cfq-iosched: two questions about the hrtimer version of CFQ
From: Hou Tao @ 2017-03-07  1:22 UTC (permalink / raw)
  To: Jan Kara, linux-block; +Cc: axboe, linux-kernel, Vivek Goyal
In-Reply-To: <775f3ecb-45d1-4264-885a-f14e0458d36b@huawei.com>

Sorry for the resend, please refer to the later one.

On 2017/3/6 21:50, Hou Tao wrote:
> Hi Jan and list,
> 
> When testing the hrtimer version of CFQ, we found a performance degradation
> problem which seems to be caused by commit 0b31c10 ("cfq-iosched: Charge at
> least 1 jiffie instead of 1 ns").
> 
> The following is the test process:
> 
> * filesystem and block device
> 	* XFS + /dev/sda mounted on /tmp/sda
> * CFQ configuration
> 	* default configurations
> * fio job configuration
> 	[global]
> 	bs=4k
> 	ioengine=psync
> 	iodepth=1
> 	direct=1
> 	rw=randwrite
> 	time_based
> 	runtime=15
> 	cgroup_nodelete=1
> 	group_reporting=1
> 
> 	[cfq_a]
> 	filename=/tmp/sda/cfq_a.dat
> 	size=2G
> 	cgroup_weight=500
> 	cgroup=cfq_a
> 	thread=1
> 	numjobs=2
> 
> 	[cfq_b]
> 	new_group
> 	filename=/tmp/sda/cfq_b.dat
> 	size=2G
> 	rate=4m
> 	cgroup_weight=500
> 	cgroup=cfq_b
> 	thread=1
> 	numjobs=2
> 
> 
> The following is the test result:
> * with 0b31c10:
> 	* fio report
> 		cfq_a: bw=5312.6KB/s, iops=1328
> 		cfq_b: bw=8192.6KB/s, iops=2048
> 
> 	* blkcg debug files
> 		./cfq_a/blkio.group_wait_time:8:0 12062571233
> 		./cfq_b/blkio.group_wait_time:8:0 155841600
> 		./cfq_a/blkio.io_serviced:Total 19922
> 		./cfq_b/blkio.io_serviced:Total 30722
> 		./cfq_a/blkio.time:8:0 19406083246
> 		./cfq_b/blkio.time:8:0 19417146869
> 
> * without 0b31c10:
> 	* fio report
> 		cfq_a: bw=21670KB/s, iops=5417
> 		cfq_b: bw=8191.2KB/s, iops=2047
> 
> 	* blkcg debug files
> 		./cfq_a/blkio.group_wait_time:8:0 5798452504
> 		./cfq_b/blkio.group_wait_time:8:0 5131844007
> 		./cfq_a/blkio.io_serviced:8:0 Write 81261
> 		./cfq_b/blkio.io_serviced:8:0 Write 30722
> 		./cfq_a/blkio.time:8:0 5642608173
> 		./cfq_b/blkio.time:8:0 5849949812
> 
> We want to known the reason why you revert the minimal used slice to 1 jiffy
> when the slice has not been allocated. Does it lead to some performance
> regressions or something similar ? If not, I think we could revert the minimal
> slice to 1 ns again.
> 
> Another problem is about the time comparison in CFQ code. In no-hrtimer version
> of CFQ, it uses time_after or time_before when possible, Why the hrtimer version
> doesn't use the equivalent time_after64/time_before64 ? Can ktime_get_ns()
> ensure there will be no wrapping problem ?
> 
> Thanks very much.
> 
> Regards,
> 
> Tao
> 
> 
> 
> .
> 

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Bart Van Assche @ 2017-03-07  1:00 UTC (permalink / raw)
  To: tj@kernel.org, paolo.valente@linaro.org, axboe@kernel.dk
  Cc: ulf.hansson@linaro.org, linux-kernel@vger.kernel.org,
	fchecconi@gmail.com, avanzini.arianna@gmail.com,
	linux-block@vger.kernel.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <20170304160131.57366-1-paolo.valente@linaro.org>

On Sat, 2017-03-04 at 17:01 +0100, Paolo Valente wrote:
> Finally, a few details on the patchset.
>=20
> The first two patches introduce BFQ-v0, which is more or less the
> first version of BFQ submitted a few years ago [1].  The remaining
> patches turn progressively BFQ-v0 into BFQ-v8r8, the current version
> of BFQ.

Hello Paolo,

Thank you for having done the work to improve, test, fix and post the
BFQ scheduler as a patch series. However, from what I have seen in the
patches there is a large number of tunable constants in the code for
which no scientific approach exists to chose an optimal value.
Additionally, the complexity of the code is huge. Just like for CFQ,
sooner or later someone will run into a bug or a performance issue
and will post a patch to fix it. However, the complexity of BFQ is
such that a source code review alone won't be sufficient to verify
whether or not such a patch negatively affects a workload or device
that has not been tested by the author of the patch. This makes me
wonder what process should be followed to verify future BFQ patches?

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH RFC 13/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
From: Bart Van Assche @ 2017-03-07  0:54 UTC (permalink / raw)
  To: tj@kernel.org, paolo.valente@linaro.org, axboe@kernel.dk
  Cc: ulf.hansson@linaro.org, linux-kernel@vger.kernel.org,
	fchecconi@gmail.com, avanzini.arianna@gmail.com,
	linux-block@vger.kernel.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <20170304160131.57366-14-paolo.valente@linaro.org>

On Sat, 2017-03-04 at 17:01 +0100, Paolo Valente wrote:
> @@ -8301,7 +8297,7 @@ static struct blkcg_policy blkcg_policy_bfq =3D {
>  static int __init bfq_init(void)
>  {
>  	int ret;
> -	char msg[50] =3D "BFQ I/O-scheduler: v6";
> +	char msg[50] =3D "BFQ I/O-scheduler: v7r3";
> =20
>  #ifdef CONFIG_BFQ_GROUP_IOSCHED
>  	ret =3D blkcg_policy_register(&blkcg_policy_bfq);

In the Linux kernel the kernel as a whole has a version number but we
do not assign a version number to individual core components.

Bart.=

^ permalink raw reply

* Re: [PATCH RFC 04/14] block, bfq: modify the peak-rate estimator
From: Bart Van Assche @ 2017-03-07  0:47 UTC (permalink / raw)
  To: tj@kernel.org, paolo.valente@linaro.org, axboe@kernel.dk
  Cc: ulf.hansson@linaro.org, linux-kernel@vger.kernel.org,
	fchecconi@gmail.com, avanzini.arianna@gmail.com,
	linux-block@vger.kernel.org, linus.walleij@linaro.org,
	broonie@kernel.org
In-Reply-To: <20170304160131.57366-5-paolo.valente@linaro.org>

On Sat, 2017-03-04 at 17:01 +0100, Paolo Valente wrote:
> +static sector_t get_sdist(sector_t last_pos, struct request *rq)
> +{
> +	sector_t sdist =3D 0;
> +
> +	if (last_pos) {
> +		if (last_pos < blk_rq_pos(rq))
> +			sdist =3D blk_rq_pos(rq) - last_pos;
> +		else
> +			sdist =3D last_pos - blk_rq_pos(rq);
> +	}
> +
> +	return sdist;
> +}

Have you considered to use abs() from <linux/kernel.h>?

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Bart Van Assche @ 2017-03-07  0:22 UTC (permalink / raw)
  To: Paolo Valente, Jens Axboe, Tejun Heo
  Cc: Fabio Checconi, Arianna Avanzini, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, ulf.hansson@linaro.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <20170304160131.57366-1-paolo.valente@linaro.org>

On 03/04/2017 08:01 AM, Paolo Valente wrote:=0A=
> Some patch generates WARNINGS with checkpatch.pl, but these WARNINGS=0A=
> seem to be either unavoidable for the involved pieces of code (which=0A=
> the patch just extends), or false positives.=0A=
=0A=
The code in this series looks reasonably clean from a code style point=0A=
of view, but please address all checkpatch warnings that can be=0A=
addressed easily. A few examples of such checkpatch warnings:=0A=
=0A=
ERROR: "foo * bar" should be "foo *bar"=0A=
=0A=
WARNING: Symbolic permissions 'S_IRUGO|S_IWUSR' are not preferred.=0A=
Consider using octal permissions '0644'.=0A=
=0A=
Thanks,=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* cfq-iosched: two questions about the hrtimer version of CFQ
From: Hou Tao @ 2017-03-07  0:11 UTC (permalink / raw)
  To: Jan Kara, linux-block; +Cc: Jens Axboe, linux-kernel, jmoyer, Vivek Goyal

Hi Jan and list,

When testing the hrtimer version of CFQ, we found a performance degradation
problem which seems to be caused by commit 0b31c10 ("cfq-iosched: Charge at
least 1 jiffie instead of 1 ns").

The following is the test process:

* filesystem and block device
	* XFS + /dev/sda mounted on /tmp/sda
* CFQ configuration
	* default configuration
* run "fio ./cfq.job"
* fio job configuration cfq.job
	[global]
	bs=4k
	ioengine=psync
	iodepth=1
	direct=1
	rw=randwrite
	time_based
	runtime=15
	cgroup_nodelete=1
	group_reporting=1

	[cfq_a]
	filename=/tmp/sda/cfq_a.dat
	size=2G
	cgroup_weight=500
	cgroup=cfq_a
	thread=1
	numjobs=2

	[cfq_b]
	new_group
	filename=/tmp/sda/cfq_b.dat
	size=2G
	rate=4m
	cgroup_weight=500
	cgroup=cfq_b
	thread=1
	numjobs=2

The following is the test result:
* with 0b31c10:
	* fio report
		cfq_a: bw=5312.6KB/s, iops=1328
		cfq_b: bw=8192.6KB/s, iops=2048

	* blkcg debug files
		./cfq_a/blkio.group_wait_time:8:0 12062571233
		./cfq_b/blkio.group_wait_time:8:0 155841600
		./cfq_a/blkio.io_serviced:Total 19922
		./cfq_b/blkio.io_serviced:Total 30722
		./cfq_a/blkio.time:8:0 19406083246
		./cfq_b/blkio.time:8:0 19417146869

* without 0b31c10:
	* fio report
		cfq_a: bw=21670KB/s, iops=5417
		cfq_b: bw=8191.2KB/s, iops=2047

	* blkcg debug files
		./cfq_a/blkio.group_wait_time:8:0 5798452504
		./cfq_b/blkio.group_wait_time:8:0 5131844007
		./cfq_a/blkio.io_serviced:8:0 Write 81261
		./cfq_b/blkio.io_serviced:8:0 Write 30722
		./cfq_a/blkio.time:8:0 5642608173
		./cfq_b/blkio.time:8:0 5849949812

We want to known the reason why you revert the minimal used slice to 1 jiffy
when the slice has not been allocated. Doest it lead to some performance
regressions or something similar ? If not, I think we could revert the minimal
slice to 1 ns again.

Another problem is about the time comparison in CFQ code. In no-hrtimer version
of CFQ, it uses time_after or time_before when possible, Why the hrtimer version
doesn't use the equivalent time_after64/time_before64 ? Can ktime_get_ns()
ensure there will be no wrapping problem ?

Thanks very much.

Regards,

Tao

^ permalink raw reply

* Re: [PATCH 07/11] bdi: Do not wait for cgwbs release in bdi_unregister()
From: Tejun Heo @ 2017-03-06 22:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval
In-Reply-To: <20170306163404.1238-8-jack@suse.cz>

On Mon, Mar 06, 2017 at 05:34:00PM +0100, Jan Kara wrote:
> Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
> (called from bdi_unregister()). That is however unnecessary now when
> cgwb->bdi is a proper refcounted reference (thus bdi cannot get
> released before all cgwbs are released) and when cgwb_bdi_destroy()
> shuts down writeback directly.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 06/11] bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()
From: Tejun Heo @ 2017-03-06 22:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval
In-Reply-To: <20170306163404.1238-7-jack@suse.cz>

On Mon, Mar 06, 2017 at 05:33:59PM +0100, Jan Kara wrote:
> Currently we waited for all cgwbs to get freed in cgwb_bdi_destroy()
> which also means that writeback has been shutdown on them. Since this
> wait is going away, directly shutdown writeback on cgwbs from
> cgwb_bdi_destroy() to avoid live writeback structures after
> bdi_unregister() has finished. To make that safe with concurrent
> shutdown from cgwb_release_workfn(), we also have to make sure
> wb_shutdown() returns only after the bdi_writeback structure is really
> shutdown.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Tejun Heo @ 2017-03-06 22:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval
In-Reply-To: <20170306163404.1238-3-jack@suse.cz>

Hello,

On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> +	disk->flags &= ~GENHD_FL_UP;
> +	/*
> +	 * Make sure __blkdev_open() sees the disk is going away before
> +	 * starting to unhash bdev inodes.
> +	 */
> +	smp_wmb();

But which rmb is this paired with?  Without paring memory barriers
don't do anything.  Given that this isn't a super hot path, I think
it'd be far better to stick to a simpler synchronization mechanism.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 01/11] block: Fix bdi assignment to bdev inode when racing with disk delete
From: Tejun Heo @ 2017-03-06 22:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval
In-Reply-To: <20170306163404.1238-2-jack@suse.cz>

On Mon, Mar 06, 2017 at 05:33:54PM +0100, Jan Kara wrote:
> When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
> restart the process of opening the block device. However we forget to
> switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
> inode will be pointing to a stale bdi. Fix the problem by setting
> bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Jens Axboe @ 2017-03-06 20:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Thumshirn, Minchan Kim, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, Hannes Reinecke, yizhan,
	Linux Block Layer Mailinglist, Linux Kernel Mailinglist
In-Reply-To: <20170306121840.aaa0525d3dbdeb82aad3c284@linux-foundation.org>

On 03/06/2017 01:18 PM, Andrew Morton wrote:
> On Mon, 6 Mar 2017 08:21:11 -0700 Jens Axboe <axboe@fb.com> wrote:
> 
>> On 03/06/2017 03:23 AM, Johannes Thumshirn wrote:
>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
>>> pages attached to the bio's bvec this results in a kernel panic because of
>>> array out of bounds accesses in zram_decompress_page().
>>
>> Applied, thanks.
> 
> With an added cc:stable, hopefully?

I didn't. But I can email it to stable@ when it lands in Linus's tree.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Andrew Morton @ 2017-03-06 20:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Thumshirn, Minchan Kim, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, Hannes Reinecke, yizhan,
	Linux Block Layer Mailinglist, Linux Kernel Mailinglist
In-Reply-To: <96ed9003-6299-b303-a901-d040a8cfe03f@fb.com>

On Mon, 6 Mar 2017 08:21:11 -0700 Jens Axboe <axboe@fb.com> wrote:

> On 03/06/2017 03:23 AM, Johannes Thumshirn wrote:
> > zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> > the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> > pages attached to the bio's bvec this results in a kernel panic because of
> > array out of bounds accesses in zram_decompress_page().
> 
> Applied, thanks.

With an added cc:stable, hopefully?

^ permalink raw reply

* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Jens Axboe @ 2017-03-06 20:46 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Tejun Heo, Fabio Checconi, Arianna Avanzini, linux-block,
	Linux-Kernal, Ulf Hansson, Linus Walleij, broonie
In-Reply-To: <060CBCB3-AB73-4F88-9CDC-828F502A8FF7@linaro.org>

On 03/05/2017 09:02 AM, Paolo Valente wrote:
> 
>> Il giorno 05 mar 2017, alle ore 16:16, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 03/04/2017 09:01 AM, Paolo Valente wrote:
>>> We tag as v0 the version of BFQ containing only BFQ's engine plus
>>> hierarchical support. BFQ's engine is introduced by this commit, while
>>> hierarchical support is added by next commit. We use the v0 tag to
>>> distinguish this minimal version of BFQ from the versions containing
>>> also the features and the improvements added by next commits. BFQ-v0
>>> coincides with the version of BFQ submitted a few years ago [1], apart
>>> from the introduction of preemption, described below.
>>>
>>> BFQ is a proportional-share I/O scheduler, whose general structure,
>>> plus a lot of code, are borrowed from CFQ.
>>
>> I'll take a closer look at this in the coming week.
> 
> ok
> 
>> But one quick
>> comment - don't default to BFQ. Both because it might not be fully
>> stable yet, and also because the performance limitation of it is
>> quite severe. Whereas deadline doesn't really hurt single queue
>> flash at all, BFQ will.
>>
> 
> Ok, sorry.  I was doubtful on what to do, but, to not bother you on
> every details, I went for setting it as default, because I thought
> people would have preferred to test it, even from boot, in this
> preliminary stage.  I reset elevator.c in the submission, unless you
> want me to do it even before receiving your and others' reviews.

I don't think it's stable enough for that yet, it's seen very little
testing outside of your own testing. Given that, it's much better that
people opt in to testing BFQ, so they at least know they can expect
crashes.

Speaking of testing, I ran into this bug:

[ 9469.621413] general protection fault: 0000 [#1] PREEMPT SMP
[ 9469.627872] Modules linked in: loop dm_mod xfs libcrc32c bfq_iosched x86_pkg_temp_thermal btrfe
[ 9469.648196] CPU: 0 PID: 2114 Comm: kworker/0:1H Tainted: G        W       4.11.0-rc1+ #249
[ 9469.657873] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 9469.666742] Workqueue: xfs-log/nvme4n1p1 xfs_buf_ioend_work [xfs]
[ 9469.674213] task: ffff881fe97646c0 task.stack: ffff881ff13d0000
[ 9469.681053] RIP: 0010:__bfq_bfqq_expire+0xb3/0x110 [bfq_iosched]
[ 9469.687991] RSP: 0018:ffff881fff603dd8 EFLAGS: 00010082
[ 9469.694052] RAX: 6b6b6b6b6b6b6b6b RBX: ffff883fe8e0eb58 RCX: 0000000000010004
[ 9469.702251] RDX: 0000000000010004 RSI: 0000000000000000 RDI: 00000000ffffffff
[ 9469.710456] RBP: ffff881fff603de8 R08: 0000000000000000 R09: 0000000000000001
[ 9469.718659] R10: 0000000000000001 R11: 0000000000000000 R12: ffff883fe8dbf4e8
[ 9469.726863] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000007904
[ 9469.735063] FS:  0000000000000000(0000) GS:ffff881fff600000(0000) knlGS:0000000000000000
[ 9469.744539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9469.751189] CR2: 00000000020c0018 CR3: 0000001fec1db000 CR4: 00000000003406f0
[ 9469.759392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9469.767596] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9469.775794] Call Trace:
[ 9469.778748]  <IRQ>
[ 9469.781222]  bfq_bfqq_expire+0x104/0x2f0 [bfq_iosched]
[ 9469.787193]  ? bfq_idle_slice_timer+0x2a/0xc0 [bfq_iosched]
[ 9469.793650]  bfq_idle_slice_timer+0x7c/0xc0 [bfq_iosched]
[ 9469.799914]  __hrtimer_run_queues+0xd9/0x500
[ 9469.804911]  ? bfq_rq_enqueued+0x340/0x340 [bfq_iosched]
[ 9469.811072]  hrtimer_interrupt+0xb0/0x200
[ 9469.815781]  local_apic_timer_interrupt+0x31/0x50
[ 9469.821264]  smp_apic_timer_interrupt+0x33/0x50
[ 9469.826555]  apic_timer_interrupt+0x90/0xa0

just running the xfstest suite. It's test generic/299. Have you done
full runs of xfstest? I'd greatly recommend that for shaking out bugs.
Run a full loop with xfs, one with btrfs, and one with ext4 for better
confidence in the stability of the code.

(gdb) l *__bfq_bfqq_expire+0xb3
0x5983 is in __bfq_bfqq_expire (block/bfq-iosched.c:2664).
2659		 * been properly deactivated or requeued, so we can safely
2660		 * execute the final step: reset in_service_entity along the
2661		 * path from entity to the root.
2662		 */
2663		for_each_entity(entity)
2664			entity->sched_data->in_service_entity = NULL;
2665	}
2666	
2667	static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
2668					bool ins_into_idle_tree, bool expiration)

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Omar Sandoval @ 2017-03-06 20:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown
In-Reply-To: <20170301151109.GI20512@quack2.suse.cz>

On Wed, Mar 01, 2017 at 04:11:09PM +0100, Jan Kara wrote:
> On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> > On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > this is a second revision of the patch set to fix several different races and
> > > > > > issues I've found when testing device shutdown and reuse. The first three
> > > > > > patches are fixes to problems in my previous series fixing BDI lifetime issues.
> > > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
> > > > > > reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
> > > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
> > > > > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
> > > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > > > > > where get_gendisk() can return already freed gendisk structure (again triggered
> > > > > > by Omar's stress test).
> > > > > > 
> > > > > > People, please have a look at patches. They are mostly simple however the
> > > > > > interactions are rather complex so I may have missed something. Also I'm
> > > > > > happy for any additional testing these patches can get - I've stressed them
> > > > > > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > > > > > device inodes.
> > > > > > 
> > > > > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > > > > already have in your tree (or shortly after them). It is up to you whether
> > > > > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > > > > to use it instead of those patches.
> > > > > 
> > > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > > the initial pull request has been pulled by Linus. Consider fixing up
> > > > > #4 so it applies, I like it.
> > > > 
> > > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > > already has linux-block changes pulled in. I've left put_disk_devt() in
> > > > blk_cleanup_queue() to maintain the logic in the original patch (now commit
> > > > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > > > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > > > now protected by the gendisk reference instead of the request_queue one
> > > > so it still maintains the property that devt reference protects bdi
> > > > registration-unregistration lifetime (as much as that is not needed anymore
> > > > after this patch).
> > > > 
> > > > I have also updated the comment in the code and the changelog - they were
> > > > somewhat stale after changes to the whole series Tejun suggested.
> > > > 
> > > > 								Honza
> > > 
> > > Hey, Jan, I just tested this out when I was seeing similar crashes with
> > > sr instead of sd, and this fixed it.
> > > 
> > > Tested-by: Omar Sandoval <osandov@fb.com>
> > 
> > Just realized it wasn't clear, I'm talking about patch 4 specifically.
> 
> Thanks for confirmation!
> 
> 								Honza

Unfortunately, this patch actually seems to have introduced a
regression. Reverting the patch fixes it.

I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
like this:

[    6.741773] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[    6.744401] IP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
[    6.744985] PGD 0
[    6.744985]
[    6.744985] Oops: 0002 [#1] PREEMPT SMP
[    6.744985] Modules linked in: virtio_scsi scsi_mod virtio_net
[    6.744985] CPU: 3 PID: 5 Comm: kworker/u8:0 Not tainted 4.11.0-rc1 #36
[    6.744985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[    6.744985] Workqueue: events_unbound async_run_entry_fn
[    6.744985] task: ffff88007c8672c0 task.stack: ffffc9000033c000
[    6.750038] RIP: 0010:virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
[    6.750038] RSP: 0018:ffffc9000033f8b0 EFLAGS: 00010246
[    6.750038] RAX: ffff88007c98d000 RBX: ffff880078c814c8 RCX: 0000000000000000
[    6.750038] RDX: ffff880078c814c8 RSI: ffff880078c814c8 RDI: ffff88007c98d000
[    6.750038] RBP: ffffc9000033f8b0 R08: 0000000000000000 R09: 0000000000000400
[    6.750038] R10: ffff88007b1fe748 R11: 0000000000000024 R12: ffff880078c814c8
[    6.750038] R13: ffff88007c98d000 R14: ffff880078c81380 R15: ffff88007b532000
[    6.750038] FS:  0000000000000000(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
[    6.750038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.750038] CR2: 0000000000000004 CR3: 0000000001c09000 CR4: 00000000000006e0
[    6.750038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.750038] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.750038] Call Trace:
[    6.750038]  scsi_dispatch_cmd+0xa3/0x1d0 [scsi_mod]
[    6.750038]  scsi_queue_rq+0x586/0x740 [scsi_mod]
[    6.750038]  blk_mq_dispatch_rq_list+0x22a/0x420
[    6.750038]  blk_mq_sched_dispatch_requests+0x142/0x1b0
[    6.750038]  __blk_mq_run_hw_queue+0x94/0xb0
[    6.750038]  blk_mq_run_hw_queue+0x82/0xa0
[    6.750038]  blk_mq_sched_insert_request+0x89/0x1a0
[    6.750038]  ? blk_execute_rq+0xc0/0xc0
[    6.750038]  blk_execute_rq_nowait+0x9a/0x140
[    6.750038]  ? bio_phys_segments+0x19/0x20
[    6.750038]  blk_execute_rq+0x56/0xc0
[    6.750038]  scsi_execute+0x128/0x270 [scsi_mod]
[    6.750038]  scsi_probe_and_add_lun+0x247/0xc60 [scsi_mod]
[    6.750038]  ? __pm_runtime_resume+0x4c/0x60
[    6.750038]  __scsi_scan_target+0x102/0x520 [scsi_mod]
[    6.750038]  ? account_entity_dequeue+0xab/0xe0
[    6.750038]  scsi_scan_channel+0x81/0xa0 [scsi_mod]
[    6.750038]  scsi_scan_host_selected+0xcc/0x100 [scsi_mod]
[    6.750038]  do_scsi_scan_host+0x8d/0x90 [scsi_mod]
[    6.750038]  do_scan_async+0x1c/0x1a0 [scsi_mod]
[    6.750038]  async_run_entry_fn+0x4a/0x170
[    6.750038]  process_one_work+0x165/0x430
[    6.750038]  worker_thread+0x4e/0x490
[    6.750038]  kthread+0x101/0x140
[    6.750038]  ? process_one_work+0x430/0x430
[    6.750038]  ? kthread_create_on_node+0x60/0x60
[    6.750038]  ret_from_fork+0x2c/0x40
[    6.750038] Code: ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 4e 30 48 89 f8 48 89 f2 48 89 e5 48 8b 89 88 01 00 00 48 8b 89 08 03 00 00 <f0> ff 41 04 48 8d bf c0 07 00 00 48 8d b0 c8 09 00 00 e8 08 fd
[    6.750038] RIP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi] RSP: ffffc9000033f8b0
[    6.750038] CR2: 0000000000000004
[    6.750038] ---[ end trace cee5df55872abfa0 ]---
[    6.750038] note: kworker/u8:0[5] exited with preempt_count 1

Arch Linux 4.11.0-rc1 (ttyS0)

silver login: [   27.370267] scsi 0:0:53:0: tag#766 abort
[   27.374501] scsi 0:0:53:0: device reset
[   27.378539] scsi 0:0:53:0: Device offlined - not ready after error recovery

That crash is because tgt is NULL here:

----
static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
					struct scsi_cmnd *sc)
{
	struct virtio_scsi *vscsi = shost_priv(sh);
	struct virtio_scsi_target_state *tgt =
				scsi_target(sc->device)->hostdata;

======> atomic_inc(&tgt->reqs);
	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
}
----

Applying the rest of this patch set didn't fix it. Additionally,
enabling KASAN gives this use-after-free spew:
----
[    7.789280] SCSI subsystem initialized
[    7.809991] virtio_net virtio0 ens2: renamed from eth0
[    7.828301] scsi host0: Virtio SCSI HBA
[    7.916214] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[    7.946713] ==================================================================
[    7.947609] BUG: KASAN: use-after-free in rb_erase+0x13a2/0x1a20 at addr ffff88006a915e30
[    7.948583] Write of size 8 by task dhcpcd-run-hook/146
[    7.949202] CPU: 0 PID: 146 Comm: dhcpcd-run-hook Not tainted 4.11.0-rc1-00011-gc35ccd2d43e9 #42
[    7.950021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[    7.950021] Call Trace:
[    7.950021]  <IRQ>
[    7.950021]  dump_stack+0x63/0x86
[    7.950021]  kasan_object_err+0x21/0x70
[    7.950021]  kasan_report.part.1+0x23a/0x520
[    7.950021]  ? rb_erase+0x13a2/0x1a20
[    7.950021]  ? swake_up_locked+0x94/0x1c0
[    7.950021]  __asan_report_store8_noabort+0x31/0x40
[    7.950021]  rb_erase+0x13a2/0x1a20
[    7.950021]  wb_congested_put+0x6a/0xd0
[    7.950021]  __blkg_release_rcu+0x16b/0x230
[    7.950021]  rcu_process_callbacks+0x602/0xfc0
[    7.950021]  __do_softirq+0x14e/0x5b3
[    7.950021]  irq_exit+0x14e/0x180
[    7.950021]  smp_apic_timer_interrupt+0x7b/0xa0
[    7.950021]  apic_timer_interrupt+0x89/0x90
[    7.950021] RIP: 0010:copy_strings.isra.7+0x312/0x6b0
[    7.950021] RSP: 0018:ffff880063bffcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[    7.950021] RAX: 0000000000002000 RBX: 0000000000000fe3 RCX: 0000000000000002
[    7.950021] RDX: ffff88006a734880 RSI: 0000000000000002 RDI: ffff880069750808
[    7.950021] RBP: ffff880063bffdd0 R08: 8000000063ec3067 R09: 0000000000000040
[    7.950021] R10: ffffed000c7d8600 R11: ffffffff82afa6a0 R12: 00007fffffffefe3
[    7.950021] R13: 0000000000000015 R14: ffff880069750780 R15: dffffc0000000000
[    7.950021]  </IRQ>
[    7.950021]  ? set_binfmt+0x120/0x120
[    7.950021]  ? insert_vm_struct+0x148/0x2e0
[    7.950021]  ? kasan_slab_alloc+0x12/0x20
[    7.950021]  ? count.isra.6.constprop.16+0x52/0x100
[    7.950021]  do_execveat_common.isra.14+0xef1/0x1b80
[    7.950021]  ? prepare_bprm_creds+0x100/0x100
[    7.950021]  ? getname_flags+0x90/0x3f0
[    7.950021]  ? __do_page_fault+0x5cc/0xbc0
[    7.950021]  SyS_execve+0x3a/0x50
[    7.950021]  ? ptregs_sys_vfork+0x10/0x10
[    7.950021]  do_syscall_64+0x180/0x380
[    7.950021]  entry_SYSCALL64_slow_path+0x25/0x25
[    7.950021] RIP: 0033:0x7f85de145477
[    7.950021] RSP: 002b:00007ffdf2da3568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[    7.950021] RAX: ffffffffffffffda RBX: 0000000000d771c0 RCX: 00007f85de145477
[    7.950021] RDX: 0000000000d35fb0 RSI: 0000000000d697e0 RDI: 0000000000d771c0
[    7.950021] RBP: 0000000000000000 R08: 00007ffdf2da3560 R09: 0000000000000030
[    7.950021] R10: 000000000000059a R11: 0000000000000202 R12: 0000000000d771c0
[    7.950021] R13: 0000000000d697e0 R14: 0000000000d35fb0 R15: 0000000000000000
[    7.950021] Object at ffff88006a915b00, in cache kmalloc-1024 size: 1024
[    7.950021] Allocated:
[    7.950021] PID = 72
[    7.950021]  save_stack_trace+0x1b/0x20
[    7.950021]  kasan_kmalloc+0xee/0x190
[    7.950021]  kmem_cache_alloc_node_trace+0x138/0x200
[    7.950021]  bdi_alloc_node+0x4c/0xa0
[    7.950021]  blk_alloc_queue_node+0xdd/0x870
[    7.950021]  blk_mq_init_queue+0x41/0x90
[    7.950021]  scsi_mq_alloc_queue+0x41/0x130 [scsi_mod]
[    7.950021]  scsi_alloc_sdev+0x90e/0xd00 [scsi_mod]
[    7.950021]  scsi_probe_and_add_lun+0x14b3/0x2250 [scsi_mod]
[    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
[    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
[    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
[    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
[    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
[    7.950021]  async_run_entry_fn+0xc3/0x630
[    7.950021]  process_one_work+0x531/0xf40
[    7.950021]  worker_thread+0xe4/0x10d0
[    7.950021]  kthread+0x298/0x390
[    7.950021]  ret_from_fork+0x2c/0x40
[    7.950021] Freed:
[    7.950021] PID = 72
[    7.950021]  save_stack_trace+0x1b/0x20
[    7.950021]  kasan_slab_free+0xb0/0x180
[    7.950021]  kfree+0x9f/0x1d0
[    7.950021]  bdi_put+0x2a/0x30
[    7.950021]  blk_release_queue+0x51/0x320
[    7.950021]  kobject_put+0x154/0x430
[    7.950021]  blk_put_queue+0x15/0x20
[    7.950021]  scsi_device_dev_release_usercontext+0x59b/0x880 [scsi_mod]
[    7.950021]  execute_in_process_context+0xd9/0x130
[    7.950021]  scsi_device_dev_release+0x1c/0x20 [scsi_mod]
[    7.950021]  device_release+0x76/0x1e0
[    7.950021]  kobject_put+0x154/0x430
[    7.950021]  put_device+0x17/0x20
[    7.950021]  __scsi_remove_device+0x18d/0x250 [scsi_mod]
[    7.950021]  scsi_probe_and_add_lun+0x14d6/0x2250 [scsi_mod]
[    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
[    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
[    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
[    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
[    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
[    7.950021]  async_run_entry_fn+0xc3/0x630
[    7.950021]  process_one_work+0x531/0xf40
[    7.950021]  worker_thread+0xe4/0x10d0
[    7.950021]  kthread+0x298/0x390
[    7.950021]  ret_from_fork+0x2c/0x40
[    7.950021] Memory state around the buggy address:
[    7.950021]  ffff88006a915d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]  ffff88006a915d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021] >ffff88006a915e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]                                      ^
[    7.950021]  ffff88006a915e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]  ffff88006a915f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
----

Any ideas Jan?

Thanks.

^ permalink raw reply

* Re: [PATCH] block/sed: Fix opal user range check and unused variables
From: Jens Axboe @ 2017-03-06 20:15 UTC (permalink / raw)
  To: Jon Derrick
  Cc: linux-block, linux-nvme, David Binderman, Scott Bauer,
	Rafael Antognolli, Christoph Hellwig, Keith Busch
In-Reply-To: <1488814864-28478-1-git-send-email-jonathan.derrick@intel.com>

On 03/06/2017 08:41 AM, Jon Derrick wrote:
> Fixes check that the opal user is within the range, and cleans up unused
> method variables.

Applied, thanks.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v4] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tejun Heo @ 2017-03-06 20:03 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: Jens Axboe, linux-block, David Rientjes, linux-kernel
In-Reply-To: <20170305141243.20817-1-tahsin@google.com>

On Sun, Mar 05, 2017 at 06:12:43AM -0800, Tahsin Erdogan wrote:
> blkg_conf_prep() currently calls blkg_lookup_create() while holding
> request queue spinlock. This means allocating memory for struct
> blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
> failures in call paths like below:
> 
>   pcpu_alloc+0x68f/0x710
>   __alloc_percpu_gfp+0xd/0x10
>   __percpu_counter_init+0x55/0xc0
>   cfq_pd_alloc+0x3b2/0x4e0
>   blkg_alloc+0x187/0x230
>   blkg_create+0x489/0x670
>   blkg_lookup_create+0x9a/0x230
>   blkg_conf_prep+0x1fb/0x240
>   __cfqg_set_weight_device.isra.105+0x5c/0x180
>   cfq_set_weight_on_dfl+0x69/0xc0
>   cgroup_file_write+0x39/0x1c0
>   kernfs_fop_write+0x13f/0x1d0
>   __vfs_write+0x23/0x120
>   vfs_write+0xc2/0x1f0
>   SyS_write+0x44/0xb0
>   entry_SYSCALL_64_fastpath+0x18/0xad
> 
> In the code path above, percpu allocator cannot call vmalloc() due to
> queue spinlock.
> 
> A failure in this call path gives grief to tools which are trying to
> configure io weights. We see occasional failures happen shortly after
> reboots even when system is not under any memory pressure. Machines
> with a lot of cpus are more vulnerable to this condition.
> 
> Update blkg_create() function to temporarily drop the rcu and queue
> locks when it is allowed by gfp mask.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Tahsin Erdogan <tahsin@google.com>

Cc: stable@vger.kernel.org # v4.2+

Looks good to me.

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks for your patience!

-- 
tejun

^ permalink raw reply

* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Bart Van Assche @ 2017-03-06 19:40 UTC (permalink / raw)
  To: Paolo Valente, Jens Axboe, Tejun Heo
  Cc: Fabio Checconi, Arianna Avanzini, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, ulf.hansson@linaro.org,
	linus.walleij@linaro.org, broonie@kernel.org
In-Reply-To: <20170304160131.57366-2-paolo.valente@linaro.org>

On 03/04/2017 08:01 AM, Paolo Valente wrote:=0A=
> BFQ is a proportional-share I/O scheduler, whose general structure,=0A=
> plus a lot of code, are borrowed from CFQ.=0A=
> [ ... ]=0A=
=0A=
This description is very useful. However, since it is identical to the=0A=
description this patch adds to Documentation/block/bfq-iosched.txt I=0A=
propose to leave it out from the patch description.=0A=
=0A=
What seems missing to me is an overview of the limitations of BFQ. Does=0A=
BFQ e.g. support multiple hardware queues?=0A=
=0A=
> +3. What are BFQ's tunable?=0A=
> +=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=0A=
> +[ ... ]=0A=
=0A=
A thorough knowledge of BFQ is required to tune it properly. Users don't=0A=
want to tune I/O schedulers. Has it been considered to invent algorithms=0A=
to tune these parameters automatically?=0A=
=0A=
> + * Licensed under GPL-2.=0A=
=0A=
The COPYING file at the top of the tree mentions that GPL-v2 licensing=0A=
should be specified as follows close to the start of each source file:=0A=
=0A=
    This program is free software; you can redistribute it and/or modify=0A=
    it under the terms of the GNU General Public License as published by=0A=
    the Free Software Foundation; either version 2 of the License, or=0A=
    (at your option) any later version.=0A=
=0A=
    This program is distributed in the hope that it will be useful,=0A=
    but WITHOUT ANY WARRANTY; without even the implied warranty of=0A=
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the=0A=
    GNU General Public License for more details.=0A=
=0A=
    You should have received a copy of the GNU General Public License=0A=
    along with this program; if not, write to the Free Software=0A=
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA=0A=
    02110-1301 USA=0A=
=0A=
> + * BFQ is a proportional-share I/O scheduler, with some extra=0A=
> + * low-latency capabilities. BFQ also supports full hierarchical=0A=
> + * scheduling through cgroups. Next paragraphs provide an introduction=
=0A=
> + * on BFQ inner workings. Details on BFQ benefits and usage can be=0A=
> + * found in Documentation/block/bfq-iosched.txt.=0A=
=0A=
That reference should be sufficient - please do not duplicate=0A=
Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.=0A=
=0A=
> +/**=0A=
> + * struct bfq_service_tree - per ioprio_class service tree.=0A=
> + *=0A=
> + * Each service tree represents a B-WF2Q+ scheduler on its own.  Each=0A=
> + * ioprio_class has its own independent scheduler, and so its own=0A=
> + * bfq_service_tree.  All the fields are protected by the queue lock=0A=
> + * of the containing bfqd.=0A=
> + */=0A=
> +struct bfq_service_tree {=0A=
> +	/* tree for active entities (i.e., those backlogged) */=0A=
> +	struct rb_root active;=0A=
> +	/* tree for idle entities (i.e., not backlogged, with V <=3D F_i)*/=0A=
> +	struct rb_root idle;=0A=
> +=0A=
> +	struct bfq_entity *first_idle;	/* idle entity with minimum F_i */=0A=
> +	struct bfq_entity *last_idle;	/* idle entity with maximum F_i */=0A=
> +=0A=
> +	u64 vtime; /* scheduler virtual time */=0A=
> +	/* scheduler weight sum; active and idle entities contribute to it */=
=0A=
> +	unsigned long wsum;=0A=
> +};=0A=
=0A=
Inline comments next to structure members are ugly and make the=0A=
structure definition hard to read. Please follow the instructions in=0A=
Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.=
=0A=
=0A=
> +	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */=0A=
> +	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */=0A=
=0A=
For all times and timestamps, please document the time unit (e.g. s, ms,=0A=
us, ns, jiffies, ...).=0A=
=0A=
> +enum bfq_device_speed {=0A=
> +	BFQ_BFQD_FAST,=0A=
> +	BFQ_BFQD_SLOW,=0A=
> +};=0A=
=0A=
What is the meaning of "fast" and "slow" devices in this context?=0A=
Anyway, since the first patch that uses this enum is patch 6, please=0A=
defer introduction of this enum until patch 6.=0A=
=0A=
> +=0A=
> +/**=0A=
> + * struct bfq_data - per-device data structure.=0A=
> + *=0A=
> + * All the fields are protected by @lock.=0A=
> + */=0A=
> +struct bfq_data {=0A=
> +	/* device request queue */=0A=
> +	struct request_queue *queue;=0A=
> +	[ ... ]=0A=
> +=0A=
> +	/* on-disk position of the last served request */=0A=
> +	sector_t last_position;=0A=
=0A=
What is the relevance of last_position if there are multiple hardware=0A=
queues? Will the BFQ algorithm fail to realize its guarantees in that case?=
=0A=
=0A=
What is the relevance of this structure member for block devices that=0A=
have multiple spindles, e.g. arrays of hard disks?=0A=
=0A=
> +enum bfqq_state_flags {=0A=
> +	BFQ_BFQQ_FLAG_busy =3D 0,		/* has requests or is in service */=0A=
> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */=0A=
> +	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*=0A=
> +					     * waiting for a request=0A=
> +					     * without idling the device=0A=
> +					     */=0A=
> +	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */=0A=
> +	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */=0A=
> +	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */=0A=
> +	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */=0A=
> +	BFQ_BFQQ_FLAG_IO_bound,		/*=0A=
> +					 * bfqq has timed-out at least once=0A=
> +					 * having consumed at most 2/10 of=0A=
> +					 * its budget=0A=
> +					 */=0A=
> +};=0A=
=0A=
The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about=0A=
e.g. using the prefix "BFQQF_" instead?=0A=
=0A=
> +#define BFQ_BFQQ_FNS(name)						\=0A=
> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\=0A=
> +{									\=0A=
> +	(bfqq)->flags |=3D (1 << BFQ_BFQQ_FLAG_##name);			\=0A=
> +}									\=0A=
> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		\=0A=
> +{									\=0A=
> +	(bfqq)->flags &=3D ~(1 << BFQ_BFQQ_FLAG_##name);			\=0A=
> +}									\=0A=
> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\=0A=
> +{									\=0A=
> +	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) !=3D 0;	\=0A=
> +}=0A=
=0A=
Are the bodies of the above functions duplicates of __set_bit(),=0A=
__clear_bit() and test_bit()?=0A=
=0A=
> +/* Expiration reasons. */=0A=
> +enum bfqq_expiration {=0A=
> +	BFQ_BFQQ_TOO_IDLE =3D 0,		/*=0A=
> +					 * queue has been idling for=0A=
> +					 * too long=0A=
> +					 */=0A=
> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */=0A=
> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */=0A=
> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */=0A=
> +	BFQ_BFQQ_PREEMPTED		/* preemption in progress */=0A=
> +};=0A=
=0A=
The prefix of these constants refers twice to "BFQ" and does not make it=0A=
clear that these constants are about expiration. How about using the=0A=
"BFQQE_" prefix instead?=0A=
=0A=
> +/* Maximum backwards seek, in KiB. */=0A=
> +static const int bfq_back_max =3D 16 * 1024;=0A=
=0A=
Where does this constant come from? Should it depend on geometry data=0A=
like e.g. the number of sectors in a cylinder?=0A=
=0A=
> +#define for_each_entity(entity)	\=0A=
> +	for (; entity ; entity =3D NULL)=0A=
=0A=
Why has this confusing #define been introduced? Shouldn't all=0A=
occurrences of this macro be changed into the equivalent "if (entity)"?=0A=
We don't want silly macros like this in the Linux kernel.=0A=
=0A=
> +#define for_each_entity_safe(entity, parent) \=0A=
> +	for (parent =3D NULL; entity ; entity =3D parent)=0A=
=0A=
Same question here - why has this macro been introduced and how has its=0A=
name been chosen? Since this macro is used only once and since no value=0A=
is assigned to 'parent' in the code controlled by this construct, please=0A=
remove this macro and use something that is less confusing than a "for"=0A=
loop for something that is not a loop.=0A=
=0A=
> +/**=0A=
> + * bfq_weight_to_ioprio - calc an ioprio from a weight.=0A=
> + * @weight: the weight value to convert.=0A=
> + *=0A=
> + * To preserve as much as possible the old only-ioprio user interface,=
=0A=
> + * 0 is used as an escape ioprio value for weights (numerically) equal o=
r=0A=
> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.=0A=
> + */=0A=
> +static unsigned short bfq_weight_to_ioprio(int weight)=0A=
> +{=0A=
> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?=0A=
> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;=0A=
> +}=0A=
=0A=
Please consider using max() or max_t() to make this function less verbose.=
=0A=
=0A=
> +=0A=
> +/**=0A=
> + * bfq_active_extract - remove an entity from the active tree.=0A=
> + * @st: the service_tree containing the tree.=0A=
> + * @entity: the entity being removed.=0A=
> + */=0A=
> +static void bfq_active_extract(struct bfq_service_tree *st,=0A=
> +			       struct bfq_entity *entity)=0A=
> +{=0A=
> +	struct bfq_queue *bfqq =3D bfq_entity_to_bfqq(entity);=0A=
> +	struct rb_node *node;=0A=
> +=0A=
> +	node =3D bfq_find_deepest(&entity->rb_node);=0A=
> +	bfq_extract(&st->active, entity);=0A=
> +=0A=
> +	if (node)=0A=
> +		bfq_update_active_tree(node);=0A=
> +=0A=
> +	if (bfqq)=0A=
> +		list_del(&bfqq->bfqq_list);=0A=
> +}=0A=
=0A=
Which locks protect the data structures manipulated by this and other=0A=
functions? Have you considered to use lockdep_assert_held() to document=0A=
these assumptions?=0A=
=0A=
> +	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */=0A=
=0A=
Please don't use parentheses if no confusion is possible. Additionally,=0A=
checkpatch should have requested you to insert a space before and after=0A=
the logical or operator.=0A=
=0A=
> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,=0A=
> +				       struct bfq_queue *bfqq)=0A=
> +{=0A=
> +	if (bfqq) {=0A=
> +		bfq_mark_bfqq_budget_new(bfqq);=0A=
> +		bfq_clear_bfqq_fifo_expire(bfqq);=0A=
> +=0A=
> +		bfqd->budgets_assigned =3D (bfqd->budgets_assigned*7 + 256) / 8;=0A=
=0A=
Checkpatch should have asked you to insert spaces around the=0A=
multiplication operator.=0A=
=0A=
> +/*=0A=
> + * bfq_default_budget - return the default budget for @bfqq on @bfqd.=0A=
> + * @bfqd: the device descriptor.=0A=
> + * @bfqq: the queue to consider.=0A=
> + *=0A=
> + * We use 3/4 of the @bfqd maximum budget as the default value=0A=
> + * for the max_budget field of the queues.  This lets the feedback=0A=
> + * mechanism to start from some middle ground, then the behavior=0A=
> + * of the process will drive the heuristics towards high values, if=0A=
> + * it behaves as a greedy sequential reader, or towards small values=0A=
> + * if it shows a more intermittent behavior.=0A=
> + */=0A=
> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,=0A=
> +					struct bfq_queue *bfqq)=0A=
> +{=0A=
> +	unsigned long budget;=0A=
> +=0A=
> +	/*=0A=
> +	 * When we need an estimate of the peak rate we need to avoid=0A=
> +	 * to give budgets that are too short due to previous measurements.=0A=
> +	 * So, in the first 10 assignments use a ``safe'' budget value.=0A=
> +	 */=0A=
> +	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget =3D=3D 0)=
=0A=
> +		budget =3D bfq_default_max_budget;=0A=
> +	else=0A=
> +		budget =3D bfqd->bfq_max_budget;=0A=
> +=0A=
> +	return budget - budget / 4;=0A=
> +}=0A=
=0A=
Where does the magic constant "194" come from?=0A=
=0A=
=0A=
> +	} else=0A=
> +		/*=0A=
> +		 * Async queues get always the maximum possible=0A=
> +		 * budget, as for them we do not care about latency=0A=
> +		 * (in addition, their ability to dispatch is limited=0A=
> +		 * by the charging factor).=0A=
> +		 */=0A=
> +		budget =3D bfqd->bfq_max_budget;=0A=
> +=0A=
=0A=
Please balance braces. Checkpatch should have warned about the use of "}=0A=
else" instead of "} else {".=0A=
=0A=
> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)=0A=
> +{=0A=
> +	unsigned long max_budget;=0A=
> +=0A=
> +	/*=0A=
> +	 * The max_budget calculated when autotuning is equal to the=0A=
> +	 * amount of sectors transferred in timeout at the=0A=
> +	 * estimated peak rate.=0A=
> +	 */=0A=
> +	max_budget =3D (unsigned long)(peak_rate * 1000 *=0A=
> +				     timeout >> BFQ_RATE_SHIFT);=0A=
> +=0A=
> +	return max_budget;=0A=
> +}=0A=
=0A=
Where does the constant 1000 come from? What are the units of peak_rate=0A=
and timeout? What is the maximum value of peak_rate? Can the=0A=
multiplication overflow?=0A=
=0A=
> +/*=0A=
> + * In addition to updating the peak rate, checks whether the process=0A=
> + * is "slow", and returns 1 if so. This slow flag is used, in addition=
=0A=
> + * to the budget timeout, to reduce the amount of service provided to=0A=
> + * seeky processes, and hence reduce their chances to lower the=0A=
> + * throughput. See the code for more details.=0A=
> + */=0A=
> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue=
 *bfqq,=0A=
> +				 bool compensate)=0A=
> +{=0A=
> +	u64 bw, usecs, expected, timeout;=0A=
> +	ktime_t delta;=0A=
> +	int update =3D 0;=0A=
> +=0A=
> +	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))=0A=
> +		return false;=0A=
> +=0A=
> +	if (compensate)=0A=
> +		delta =3D bfqd->last_idling_start;=0A=
> +	else=0A=
> +		delta =3D ktime_get();=0A=
> +	delta =3D ktime_sub(delta, bfqd->last_budget_start);=0A=
> +	usecs =3D ktime_to_us(delta);=0A=
> +=0A=
> +	/* Don't trust short/unrealistic values. */=0A=
> +	if (usecs < 100 || usecs >=3D LONG_MAX)=0A=
> +		return false;=0A=
=0A=
If usecs >=3D LONG_MAX that indicates a kernel bug. Please consider=0A=
triggering a kernel warning in that case.=0A=
=0A=
> +/*=0A=
> + * Budget timeout is not implemented through a dedicated timer, but=0A=
> + * just checked on request arrivals and completions, as well as on=0A=
> + * idle timer expirations.=0A=
> + */=0A=
> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)=0A=
> +{=0A=
> +	if (bfq_bfqq_budget_new(bfqq) ||=0A=
> +	    time_before(jiffies, bfqq->budget_timeout))=0A=
> +		return false;=0A=
> +	return true;=0A=
> +}=0A=
=0A=
Have you considered to use time_is_after_jiffies() instead of=0A=
time_before(jiffies, ...)?=0A=
=0A=
> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,=
=0A=
> +			  struct bfq_io_cq *bic, pid_t pid, int is_sync)=0A=
> +{=0A=
> +	RB_CLEAR_NODE(&bfqq->entity.rb_node);=0A=
> +	INIT_LIST_HEAD(&bfqq->fifo);=0A=
> +=0A=
> +	bfqq->ref =3D 0;=0A=
> +	bfqq->bfqd =3D bfqd;=0A=
> +=0A=
> +	if (bic)=0A=
> +		bfq_set_next_ioprio_data(bfqq, bic);=0A=
> +=0A=
> +	if (is_sync) {=0A=
> +		if (!bfq_class_idle(bfqq))=0A=
> +			bfq_mark_bfqq_idle_window(bfqq);=0A=
> +		bfq_mark_bfqq_sync(bfqq);=0A=
> +	} else=0A=
> +		bfq_clear_bfqq_sync(bfqq);=0A=
> +=0A=
> +	bfqq->ttime.last_end_request =3D ktime_get_ns() - (1ULL<<32);=0A=
> +=0A=
> +	bfq_mark_bfqq_IO_bound(bfqq);=0A=
> +=0A=
> +	bfqq->pid =3D pid;=0A=
> +=0A=
> +	/* Tentative initial value to trade off between thr and lat */=0A=
> +	bfqq->max_budget =3D bfq_default_budget(bfqd, bfqq);=0A=
> +	bfqq->budget_timeout =3D bfq_smallest_from_now();=0A=
> +	bfqq->pid =3D pid;=0A=
> +=0A=
> +	/* first request is almost certainly seeky */=0A=
> +	bfqq->seek_history =3D 1;=0A=
> +}=0A=
=0A=
What is the meaning of the 1ULL << 32 constant?=0A=
=0A=
> +static int __init bfq_init(void)=0A=
> +{=0A=
> +	int ret;=0A=
> +	char msg[50] =3D "BFQ I/O-scheduler: v0";=0A=
=0A=
Please leave out "[50]" and use "static const char" instead of "char".=0A=
=0A=
> diff --git a/block/elevator.c b/block/elevator.c=0A=
> index 01139f5..786fdcd 100644=0A=
> --- a/block/elevator.c=0A=
> +++ b/block/elevator.c=0A=
> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *na=
me)=0A=
>  =0A=
>  	if (!e) {=0A=
>  		/*=0A=
> -		 * For blk-mq devices, we default to using mq-deadline,=0A=
> -		 * if available, for single queue devices. If deadline=0A=
> -		 * isn't available OR we have multiple queues, default=0A=
> -		 * to "none".=0A=
> +		 * For blk-mq devices, we default to using bfq, if=0A=
> +		 * available, for single queue devices. If bfq isn't=0A=
> +		 * available, we try mq-deadline. If neither is=0A=
> +		 * available, OR we have multiple queues, default to=0A=
> +		 * "none".=0A=
>  		 */=0A=
>  		if (q->mq_ops) {=0A=
> +			if (q->nr_hw_queues =3D=3D 1) {=0A=
> +				e =3D elevator_get("bfq", false);=0A=
> +				if (!e)=0A=
> +					e =3D elevator_get("mq-deadline", false);=0A=
> +			}=0A=
>  			if (q->nr_hw_queues =3D=3D 1)=0A=
> -				e =3D elevator_get("mq-deadline", false);=0A=
> +				e =3D elevator_get("bfq", false);=0A=
>  			if (!e)=0A=
>  				return 0;=0A=
>  		} else=0A=
> =0A=
=0A=
As Jens wrote, it's way too early to make BFQ the default scheduler.=0A=
=0A=
Bart.=0A=

^ permalink raw reply

* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 18:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <68e790b7-bee4-9cb7-66c8-3f9157d50b3b@kernel.dk>

On 03/06/2017 08:27 PM, Jens Axboe wrote:
> On 03/06/2017 11:17 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>>>
>>>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>>>> this information to the workqueue.
>>>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>>>> pending work items (and IOs).
>>>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>>>> resources needed.
>>>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>>>> the caller having flagged support for that change of behavior would
>>>>>>> be problematic.
>>>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>>>> maxevents?
>>>>> It's a setup thing. We check these limits when someone creates an IO
>>>>> context, and carve out the specified entries form our global pool. Then
>>>>> we free those "resources" when the io context is freed.
>>>>>
>>>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>>>> number has NO bearing on when io_submit() would potentially block or
>>>>> return EAGAIN.
>>>>>
>>>>> We can have a huge gap on the intent signaled by io context setup, and
>>>>> the reality imposed by what actually happens on the IO submission side.
>>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>>> EAGAIN?
>>>>
>>>> Just tested it, and maxevents is not respected for this:
>>>>
>>>> io_setup(1, [0x7fc64537f000])           = 0
>>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>>
>>>> which is unexpected, to me.
>>> ioctx_alloc()
>>> {
>>>           [...]
>>>
>>>           /*
>>>            * We keep track of the number of available ringbuffer slots, to prevent
>>>            * overflow (reqs_available), and we also use percpu counters for this.
>>>            *
>>>            * So since up to half the slots might be on other cpu's percpu counters
>>>            * and unavailable, double nr_events so userspace sees what they
>>>            * expected: additionally, we move req_batch slots to/from percpu
>>>            * counters at a time, so make sure that isn't 0:
>>>            */
>>>           nr_events = max(nr_events, num_possible_cpus() * 4);
>>>           nr_events *= 2;
>>> }
>> On a 4-lcore desktop:
>>
>> io_setup(1, [0x7fc210041000])           = 0
>> io_submit(0x7fc210041000, 10000, [big array]) = 126
>> io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource
>> temporarily unavailable)
>>
>> so, the user should already expect EAGAIN from io_submit() due to
>> resource limits.  I'm sure the check could be tightened so that if we do
>> have to use a workqueue, we respect the user's limit rather than some
>> inflated number.
> This is why I previously said that the 1000 requests you potentially
> asks for when setting up your IO context has NOTHING to do with when you
> will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
> limit that it itself has set is existing behavior and it certainly makes
> sense. And it's an easily handled condition, since the app can just
> backoff and wait/reap completion events.

Every time I used aio, I considered maxevents to be the maximum number 
of in-flight requests for that queue, and observed this limit 
religiously.  It's possible others don't.

> But if we allow EAGAIN to bubble up from block request submission, then
> that's a change in behavior. This can happen without the app having any
> pending IO against that IO context, hence we can return EAGAIN to the
> app that then has no reasonable way to handle that condition.
>

For sure (and it's a different EAGAIN -- it's tied to the iocb, not 
request submission).  But we do have an upper bound for the number of 
concurrent requests, even if inflated, so having the kernel convert a 
blocking iocb into a workqueue item does not allow userspace to exploit 
the kernel.

We could limit the number of workqueue submissions to maxevents, and 
queue anything between maxevents and (maxevents * inflation_factor) 
using a regular queue.  So the intent of maxevents is respected, and 
applications that overflow it are not regressed.

   if iocb would overflow inflated maxevents:
       io_submit returns EAGAIN
   elseif iocb can be submitted asynchronously:
       do that
   elseif number of iocbs running in workqueues < maxevents:
       push to a workqueue
   else
       queue somewhere, when a work_item completes it can pick up an 
iocb from the queue

this enables aio for all filesystems, and doesn't require lots of idle 
thread pools if the filesystem works or useless syscalls if it doesn't.  
It's a lot more work for the kernel, but results in a tighter and 
simpler interface.

^ permalink raw reply

* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Jens Axboe @ 2017-03-06 18:27 UTC (permalink / raw)
  To: Avi Kivity, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <4cbafb12-a30e-bb57-da43-de7c47726c81@scylladb.com>

On 03/06/2017 11:17 AM, Avi Kivity wrote:
> 
> 
> On 03/06/2017 07:06 PM, Jens Axboe wrote:
>> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>>
>>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>>
>>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>>> this information to the workqueue.
>>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>>> pending work items (and IOs).
>>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>>> resources needed.
>>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>>> the caller having flagged support for that change of behavior would
>>>>>> be problematic.
>>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>>> maxevents?
>>>> It's a setup thing. We check these limits when someone creates an IO
>>>> context, and carve out the specified entries form our global pool. Then
>>>> we free those "resources" when the io context is freed.
>>>>
>>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>>> number has NO bearing on when io_submit() would potentially block or
>>>> return EAGAIN.
>>>>
>>>> We can have a huge gap on the intent signaled by io context setup, and
>>>> the reality imposed by what actually happens on the IO submission side.
>>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>>> EAGAIN?
>>>
>>> Just tested it, and maxevents is not respected for this:
>>>
>>> io_setup(1, [0x7fc64537f000])           = 0
>>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>>
>>> which is unexpected, to me.
>> ioctx_alloc()
>> {
>>          [...]
>>
>>          /*
>>           * We keep track of the number of available ringbuffer slots, to prevent
>>           * overflow (reqs_available), and we also use percpu counters for this.
>>           *
>>           * So since up to half the slots might be on other cpu's percpu counters
>>           * and unavailable, double nr_events so userspace sees what they
>>           * expected: additionally, we move req_batch slots to/from percpu
>>           * counters at a time, so make sure that isn't 0:
>>           */
>>          nr_events = max(nr_events, num_possible_cpus() * 4);
>>          nr_events *= 2;
>> }
> 
> On a 4-lcore desktop:
> 
> io_setup(1, [0x7fc210041000])           = 0
> io_submit(0x7fc210041000, 10000, [big array]) = 126
> io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource 
> temporarily unavailable)
> 
> so, the user should already expect EAGAIN from io_submit() due to 
> resource limits.  I'm sure the check could be tightened so that if we do 
> have to use a workqueue, we respect the user's limit rather than some 
> inflated number.

This is why I previously said that the 1000 requests you potentially
asks for when setting up your IO context has NOTHING to do with when you
will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
limit that it itself has set is existing behavior and it certainly makes
sense. And it's an easily handled condition, since the app can just
backoff and wait/reap completion events.

But if we allow EAGAIN to bubble up from block request submission, then
that's a change in behavior. This can happen without the app having any
pending IO against that IO context, hence we can return EAGAIN to the
app that then has no reasonable way to handle that condition.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 0/8 v2] Non-blocking AIO
From: Avi Kivity @ 2017-03-06 18:17 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: Goldwyn Rodrigues, jack, hch, linux-fsdevel, linux-block,
	linux-btrfs, linux-ext4, linux-xfs
In-Reply-To: <7aabb6b4-df8d-8554-fbe3-90504887fb8e@kernel.dk>



On 03/06/2017 07:06 PM, Jens Axboe wrote:
> On 03/06/2017 09:59 AM, Avi Kivity wrote:
>>
>> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>>
>>>>>>>>> Is it not possible to push the iocb to a workqueue?  This will allow
>>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>>> we're saving anything by pushing it there.
>>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>>> this information to the workqueue.
>>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>>> queue work like that, and let any user potentially create millions of
>>>>>>> pending work items (and IOs).
>>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>>> resources needed.
>>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>>> the caller having flagged support for that change of behavior would
>>>>> be problematic.
>>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>>> maxevents?
>>> It's a setup thing. We check these limits when someone creates an IO
>>> context, and carve out the specified entries form our global pool. Then
>>> we free those "resources" when the io context is freed.
>>>
>>> Right now I can setup an IO context with 1000 entries on it, yet that
>>> number has NO bearing on when io_submit() would potentially block or
>>> return EAGAIN.
>>>
>>> We can have a huge gap on the intent signaled by io context setup, and
>>> the reality imposed by what actually happens on the IO submission side.
>> Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
>> EAGAIN?
>>
>> Just tested it, and maxevents is not respected for this:
>>
>> io_setup(1, [0x7fc64537f000])           = 0
>> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
>> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
>> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
>> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
>> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
>> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>>
>> which is unexpected, to me.
> ioctx_alloc()
> {
>          [...]
>
>          /*
>           * We keep track of the number of available ringbuffer slots, to prevent
>           * overflow (reqs_available), and we also use percpu counters for this.
>           *
>           * So since up to half the slots might be on other cpu's percpu counters
>           * and unavailable, double nr_events so userspace sees what they
>           * expected: additionally, we move req_batch slots to/from percpu
>           * counters at a time, so make sure that isn't 0:
>           */
>          nr_events = max(nr_events, num_possible_cpus() * 4);
>          nr_events *= 2;
> }

On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])           = 0
io_submit(0x7fc210041000, 10000, [big array]) = 126
io_submit(0x7fc210041000, 10000, [big array]) = -1 EAGAIN (Resource 
temporarily unavailable)

so, the user should already expect EAGAIN from io_submit() due to 
resource limits.  I'm sure the check could be tightened so that if we do 
have to use a workqueue, we respect the user's limit rather than some 
inflated number.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox