Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH 2/2] block: fix leak of q->rq_wb
From: Omar Sandoval @ 2017-03-28  3:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, kernel-team
In-Reply-To: <20170328033850.GB23217@ming.t460p>

On Tue, Mar 28, 2017 at 11:43:04AM +0800, Ming Lei wrote:
> On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> > couple of cases: when a request queue is reregistered and when gendisks
> > share a request_queue. This has been a problem since wbt was introduced,
> > but the WARN_ON(!list_empty(&stats->callbacks)) in the blk-stat rework
> > exposed it. The fix is unfortunately a hack until we fix all of the
> > drivers sharing a request_queue.
> > 
> > Fixes: 87760e5eef35 ("block: hook up writeback throttling")
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  block/blk-sysfs.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index fa831cb2fc30..a187e3f70028 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -893,7 +893,21 @@ int blk_register_queue(struct gendisk *disk)
> >  
> >  	kobject_uevent(&q->kobj, KOBJ_ADD);
> >  
> > -	blk_wb_init(q);
> > +	/*
> > +	 * There are two cases where wbt may have already been initialized:
> > +	 * 1. A call sequence of blk_register_queue(); blk_unregister_queue();
> > +	 *    blk_register_queue().
> > +	 * 2. Multiple gendisks sharing a request_queue.
> > +	 *
> > +	 * To fix case 1, we'd like to call wbt_exit() in
> > +	 * blk_unregister_queue(). However, that's unsafe for case 2. So, we're
> > +	 * forced to do this and call wbt_exit() in blk_release_queue() instead.
> > +	 *
> > +	 * Note that in case 2, wbt will account across disks until those legacy
> > +	 * drivers are fixed.
> > +	 */
> > +	if (!q->rq_wb)
> > +		blk_wb_init(q);
> 
> Since 'rq_wb' is per-queue and its life time is same with queue's, I
> am wondering why blk_wb_init() isn't put into blk_alloc_queue_node() or
> queue's initialization api(blk_init_allocated_queue(), or
> blk_mq_init_allocated_queue())?

Doing it at queue init time might be cleaner, I'll try that.

^ permalink raw reply

* Re: [PATCH 2/2] block: fix leak of q->rq_wb
From: Ming Lei @ 2017-03-28  3:43 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-block, kernel-team
In-Reply-To: <d6f2baa4f64dce04da94a649d080acdaaf7d6b57.1490636543.git.osandov@fb.com>

On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> couple of cases: when a request queue is reregistered and when gendisks
> share a request_queue. This has been a problem since wbt was introduced,
> but the WARN_ON(!list_empty(&stats->callbacks)) in the blk-stat rework
> exposed it. The fix is unfortunately a hack until we fix all of the
> drivers sharing a request_queue.
> 
> Fixes: 87760e5eef35 ("block: hook up writeback throttling")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  block/blk-sysfs.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fa831cb2fc30..a187e3f70028 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -893,7 +893,21 @@ int blk_register_queue(struct gendisk *disk)
>  
>  	kobject_uevent(&q->kobj, KOBJ_ADD);
>  
> -	blk_wb_init(q);
> +	/*
> +	 * There are two cases where wbt may have already been initialized:
> +	 * 1. A call sequence of blk_register_queue(); blk_unregister_queue();
> +	 *    blk_register_queue().
> +	 * 2. Multiple gendisks sharing a request_queue.
> +	 *
> +	 * To fix case 1, we'd like to call wbt_exit() in
> +	 * blk_unregister_queue(). However, that's unsafe for case 2. So, we're
> +	 * forced to do this and call wbt_exit() in blk_release_queue() instead.
> +	 *
> +	 * Note that in case 2, wbt will account across disks until those legacy
> +	 * drivers are fixed.
> +	 */
> +	if (!q->rq_wb)
> +		blk_wb_init(q);

Since 'rq_wb' is per-queue and its life time is same with queue's, I
am wondering why blk_wb_init() isn't put into blk_alloc_queue_node() or
queue's initialization api(blk_init_allocated_queue(), or
blk_mq_init_allocated_queue())?

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH 2/2] block: fix leak of q->rq_wb
From: Omar Sandoval @ 2017-03-28  2:52 UTC (permalink / raw)
  To: linux-block; +Cc: kernel-team
In-Reply-To: <d6f2baa4f64dce04da94a649d080acdaaf7d6b57.1490636543.git.osandov@fb.com>

On Mon, Mar 27, 2017 at 10:43:59AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb in a
> couple of cases: when a request queue is reregistered and when gendisks
> share a request_queue. This has been a problem since wbt was introduced,
> but the WARN_ON(!list_empty(&stats->callbacks)) in the blk-stat rework
> exposed it. The fix is unfortunately a hack until we fix all of the
> drivers sharing a request_queue.

FYI, I went through the remaining cases and fixed most of them up,
working on the remaining few. Maybe we can finally put this crap to
rest...

^ permalink raw reply

* Re: [PATCH 1/2] blk-mq: fix leak of q->stats
From: Ming Lei @ 2017-03-28  2:43 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-block, kernel-team
In-Reply-To: <de751892a392533e8e26fffbf72ba7904cfda468.1490636543.git.osandov@fb.com>

On Mon, Mar 27, 2017 at 10:43:58AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> blk_alloc_queue_node() already allocates q->stats, so
> blk_mq_init_allocated_queue() is overwriting it with a new allocation.
> 
> Fixes: a83b576c9c25 ("block: fix stacked driver stats init and free")
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  block/blk-mq.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c212b9644a9f..9c6b39bd8358 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2212,10 +2212,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  	/* mark the queue as mq asap */
>  	q->mq_ops = set->ops;
>  
> -	q->stats = blk_alloc_queue_stats();
> -	if (!q->stats)
> -		goto err_exit;
> -
>  	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
>  					     blk_stat_rq_ddir, 2, q);
>  	if (!q->poll_cb)
> -- 
> 2.12.1
> 

Reviewed-by: Ming Lei <tom.leiming@gmail.com>

-- 
Ming

^ permalink raw reply

* Re: [dm-devel] [PATCH v3] block: trace completion of all bios.
From: NeilBrown @ 2017-03-27 23:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-raid, Martin K . Petersen,
	Mike Snitzer, Ming Lei, linux-kernel, Christoph Hellwig, dm-devel,
	Shaohua Li, Alasdair Kergon
In-Reply-To: <20170327171421.GA464@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On Mon, Mar 27 2017, Christoph Hellwig wrote:

> On Mon, Mar 27, 2017 at 08:49:57PM +1100, NeilBrown wrote:
>> On Mon, Mar 27 2017, Christoph Hellwig wrote:
>> 
>> > I don't really like the flag at all.  I'd much prefer a __bio_endio
>> > with a 'bool trace' flag.  Also please remove the manual tracing in
>> > dm.ċ.  Once that is done I suspect we can also remove the
>> > block_bio_complete export.
>> 
>> Can you say why you don't like it?
>
> It uses up a precious bit in the bio for something that should be state
> that can be determined in the caller at compile time.

I've already demonstrated that the bit is not "precious" at all.  I have
shown how I could easily give you 20 unused flag bits without increasing
the size of struct bio.
Yes, the state could be determined in the caller at compiler time.  That
would require developers to make the correct choice between two very
similar interfaces, where the consequences of an correct choice are not
immediately obvious.
I think that spending one bit (out of 20) to relieve developers of the
burden of choice (and to spare as all of the consequences of wrong
choice) is a price worth paying.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [PATCH 7/7] sd: use ZERO_PAGE for WRITE_SAME payloads
From: Bart Van Assche @ 2017-03-27 23:40 UTC (permalink / raw)
  To: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
  Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <20170320204319.12628-8-hch@lst.de>

On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> We're never touching the contents of the page, so save a memory
> allocation for these cases.
>=20
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/sd.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>=20
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index c18fe9ff1f8f..af632e350ab4 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -756,7 +756,7 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmn=
d *cmd)
>  	u32 nr_sectors =3D blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
>  	u32 data_len =3D sdp->sector_size;
> =20
> -	rq->special_vec.bv_page =3D alloc_page(GFP_ATOMIC | __GFP_ZERO);
> +	rq->special_vec.bv_page =3D ZERO_PAGE(0);
>  	if (!rq->special_vec.bv_page)
>  		return BLKPREP_DEFER;
>  	rq->special_vec.bv_offset =3D 0;
> @@ -785,7 +785,7 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmn=
d *cmd, bool unmap)
>  	u32 nr_sectors =3D blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
>  	u32 data_len =3D sdp->sector_size;
> =20
> -	rq->special_vec.bv_page =3D alloc_page(GFP_ATOMIC | __GFP_ZERO);
> +	rq->special_vec.bv_page =3D ZERO_PAGE(0);
>  	if (!rq->special_vec.bv_page)
>  		return BLKPREP_DEFER;
>  	rq->special_vec.bv_offset =3D 0;
> @@ -1256,7 +1256,8 @@ static void sd_uninit_command(struct scsi_cmnd *SCp=
nt)
>  {
>  	struct request *rq =3D SCpnt->request;
> =20
> -	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
> +	if ((rq->rq_flags & RQF_SPECIAL_PAYLOAD) &&
> +	    rq->special_vec.bv_page !=3D ZERO_PAGE(0))
>  		__free_page(rq->special_vec.bv_page);
> =20
>  	if (SCpnt->cmnd !=3D scsi_req(rq)->cmd) {

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH 6/7] sd: support multi-range TRIM for ATA disks
From: Bart Van Assche @ 2017-03-27 23:38 UTC (permalink / raw)
  To: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
  Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <20170320204319.12628-7-hch@lst.de>

On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> @@ -698,13 +698,19 @@ static void sd_config_discard(struct scsi_disk *sdk=
p, unsigned int mode)
>  		break;
> =20
>  	case SD_LBP_ATA_TRIM:
> -		max_blocks =3D 65535 * (512 / sizeof(__le64));
> +		max_ranges =3D 512 / sizeof(__le64);
> +		max_range_size =3D USHRT_MAX;
> +		max_blocks =3D max_ranges * max_range_size;
>  		if (sdkp->device->ata_trim_zeroes_data)
>  			q->limits.discard_zeroes_data =3D 1;
>  		break;
>  	}
> =20
>  	blk_queue_max_discard_sectors(q, max_blocks * (logical_block_size >> 9)=
);
> +	if (max_ranges)
> +		blk_queue_max_discard_segments(q, max_ranges);
> +	if (max_range_size)
> +		blk_queue_max_discard_segment_size(q, max_range_size);
>  	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>  }

Hello Christoph,

Does blk_queue_max_discard_segment_size() expect a second argument that is =
a
number of bytes? Is max_range_size a number of logical blocks that each hav=
e
a size 1 << sector_shift?

> @@ -826,14 +832,21 @@ static int sd_setup_ata_trim_cmnd(struct scsi_cmnd =
*cmd)
>  	cmd->cmnd[8] =3D data_len;
> =20
>  	buf =3D page_address(rq->special_vec.bv_page);
> -	for (i =3D 0; i < (data_len >> 3); i++) {
> -		u64 n =3D min(nr_sectors, 0xffffu);
> +	__rq_for_each_bio(bio, rq) {
> +		u64 sector =3D bio->bi_iter.bi_sector >> (sector_shift - 9);
> +		u32 nr_sectors =3D bio->bi_iter.bi_size >> sector_shift;
> =20
> -		buf[i] =3D cpu_to_le64(sector | (n << 48));
> -		if (nr_sectors <=3D 0xffff)
> -			break;
> -		sector +=3D 0xffff;
> -		nr_sectors -=3D 0xffff;
> +		do {
> +			u64 n =3D min(nr_sectors, 0xffffu);
> +
> +			buf[i] =3D cpu_to_le64(sector | (n << 48));
> +			if (nr_sectors <=3D 0xffff)
> +				break;
> +			sector +=3D 0xffff;
> +			nr_sectors -=3D 0xffff;
> +			i++;
> +
> +		} while (!WARN_ON_ONCE(i >=3D data_len >> 3));
>  	}
> =20
>  	cmd->allowed =3D SD_MAX_RETRIES;

It's unfortunate that the loop-end test occurs twice (i < data_len >> 3).
Please consider using put_unaligned_le64() instead of cpu_to_le64() such
that the data type of buf can be changed from __le64* into void *. With
that change and by introducing e.g. the following:

void *end;
[ ... ]
end =3D (void *)buf + data_len;

the loop variable 'i' can be eliminated. If buf[i++] =3D ... would be
changed into *buf++ =3D ... then that would allow to change the two
occurrences of (i < data_len >> 3) into (buf < end).

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH 5/7] block: add a max_discard_segment_size queue limit
From: Bart Van Assche @ 2017-03-27 23:09 UTC (permalink / raw)
  To: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
  Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <20170320204319.12628-6-hch@lst.de>

On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5a7da607ca04..3b3bd646f580 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -333,6 +333,7 @@ struct queue_limits {
>  	unsigned short		max_segments;
>  	unsigned short		max_integrity_segments;
>  	unsigned short		max_discard_segments;
> +	unsigned int		max_discard_segment_size;
> =20
>  	unsigned char		misaligned;
>  	unsigned char		discard_misaligned;
> @@ -1150,6 +1151,8 @@ extern void blk_queue_max_segments(struct request_q=
ueue *, unsigned short);
>  extern void blk_queue_max_discard_segments(struct request_queue *,
>  		unsigned short);
>  extern void blk_queue_max_segment_size(struct request_queue *, unsigned =
int);
> +extern void blk_queue_max_discard_segment_size(struct request_queue *,
> +		unsigned int);
>  extern void blk_queue_max_discard_sectors(struct request_queue *q,
>  		unsigned int max_discard_sectors);
>  extern void blk_queue_max_write_same_sectors(struct request_queue *q,
> @@ -1415,6 +1418,11 @@ static inline unsigned int queue_max_segment_size(=
struct request_queue *q)
>  	return q->limits.max_segment_size;
>  }
> =20
> +static inline unsigned int queue_max_discard_segment_size(struct request=
_queue *q)
> +{
> +	return q->limits.max_discard_segment_size;
> +}
> +
>  static inline unsigned short queue_logical_block_size(struct request_que=
ue *q)
>  {
>  	int retval =3D 512;

Hello Christoph,

It's probably a good idea to add a comment somewhere that the unit of
max_discard_segment_size is one byte.

Additionally, should it be mentioned in the commit message that this
patch limits the maximum discard segment size to 4 GB?

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH 2/7] sd: provide a new ata trim provisioning mode
From: Bart Van Assche @ 2017-03-27 22:40 UTC (permalink / raw)
  To: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
  Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <20170320204319.12628-3-hch@lst.de>

On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> +	case SD_LBP_ATA_TRIM:
> +		max_blocks =3D 65535 * (512 / sizeof(__le64));
> +		if (sdkp->device->ata_trim_zeroes_data)
> +			q->limits.discard_zeroes_data =3D 1;
> +		break;

Do we need a comment here that explains where the numbers 65535 and
512 come from?

> +	u64 sector =3D blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
> +	u32 nr_sectors =3D blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);

Please consider using logical_to_sectors() instead of open-coding this
function.

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH 1/7] ѕd: split sd_setup_discard_cmnd
From: Bart Van Assche @ 2017-03-27 22:24 UTC (permalink / raw)
  To: hch@lst.de, tj@kernel.org, martin.petersen@oracle.com,
	axboe@kernel.dk
  Cc: linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
	linux-ide@vger.kernel.org
In-Reply-To: <20170320204319.12628-2-hch@lst.de>

On Mon, 2017-03-20 at 16:43 -0400, Christoph Hellwig wrote:
> +	u64 sector =3D blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
> +	u32 nr_sectors =3D blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);

Although I know this is an issue in the existing code and not something
introduced by you: please consider using logical_to_sectors() instead of
open-coding this function. Otherwise this patch looks fine to me.

Thanks,

Bart.

^ permalink raw reply

* [PATCH 3/3] blk-throttle: add latency target support
From: Shaohua Li @ 2017-03-27 22:19 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490651903.git.shli@fb.com>

One hard problem adding .low limit is to detect idle cgroup. If one
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine if other cgroups dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have mechanism to calculate latency threshold for each IO
size. For every IO dispatched from a cgorup, we compare its latency
against its threshold and record the info. If most IO latency is below
threshold (in the code I use 75%), the cgroup could be treated idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSD as we can't
calcualte the latency target for hard disk. And this is only for cgroup
leaf node so far.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 140da29..c82bf9b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -165,6 +165,10 @@ struct throtl_grp {
 	unsigned long checked_last_finish_time; /* ns / 1024 */
 	unsigned long avg_idletime; /* ns / 1024 */
 	unsigned long idletime_threshold; /* us */
+
+	unsigned int bio_cnt; /* total bios */
+	unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+	unsigned long bio_cnt_reset_time;
 };
 
 /* We measure latency for request size from <= 4k to >= 1M */
@@ -1720,12 +1724,15 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	 * - single idle is too long, longer than a fixed value (in case user
 	 *   configure a too big threshold) or 4 times of slice
 	 * - average think time is more than threshold
+	 * - IO latency is largely below threshold
 	 */
 	unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);
 
 	time = min_t(unsigned long, MAX_IDLE_TIME, time);
 	return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
-	       tg->avg_idletime > tg->idletime_threshold;
+	       tg->avg_idletime > tg->idletime_threshold ||
+	       (tg->latency_target && tg->bio_cnt &&
+		tg->bad_bio_cnt * 5 < tg->bio_cnt);
 }
 
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2194,12 +2201,36 @@ void blk_throtl_bio_endio(struct bio *bio)
 
 	start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
 	finish_time = __blk_stat_time(finish_time_ns) >> 10;
+	if (!start_time || finish_time <= start_time)
+		return;
+
+	lat = finish_time - start_time;
 	/* this is only for bio based driver */
-	if (start_time && finish_time > start_time &&
-	    !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
-		lat = finish_time - start_time;
+	if (!(bio->bi_issue_stat.stat & SKIP_LATENCY))
 		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
 			bio_op(bio), lat);
+
+	if (tg->latency_target) {
+		int bucket;
+		unsigned int threshold;
+
+		bucket = request_bucket_index(
+			blk_stat_size(&bio->bi_issue_stat));
+		threshold = tg->td->avg_buckets[bucket].latency +
+			tg->latency_target;
+		if (lat > threshold)
+			tg->bad_bio_cnt++;
+		/*
+		 * Not race free, could get wrong count, which means cgroups
+		 * will be throttled
+		 */
+		tg->bio_cnt++;
+	}
+
+	if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+		tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+		tg->bio_cnt /= 2;
+		tg->bad_bio_cnt /= 2;
 	}
 }
 #endif
-- 
2.9.3

^ permalink raw reply related

* [PATCH 2/3] blk-throttle: add a mechanism to estimate IO latency
From: Shaohua Li @ 2017-03-27 22:19 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490651903.git.shli@fb.com>

User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.

But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.

Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-stat.c          |  15 ++++-
 block/blk-stat.h          |   3 +
 block/blk-throttle.c      | 166 ++++++++++++++++++++++++++++++++++++++++++++--
 block/blk.h               |   2 +
 include/linux/blk_types.h |   9 +--
 5 files changed, 185 insertions(+), 10 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 188b535..e77ec52 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -9,12 +9,14 @@
 
 #include "blk-stat.h"
 #include "blk-mq.h"
+#include "blk.h"
 
 #define BLK_RQ_STAT_BATCH	64
 
 struct blk_queue_stats {
 	struct list_head callbacks;
 	spinlock_t lock;
+	bool enable_accounting;
 };
 
 unsigned int blk_stat_rq_ddir(const struct request *rq)
@@ -96,6 +98,8 @@ void blk_stat_add(struct request *rq)
 
 	value = now - blk_stat_time(&rq->issue_stat);
 
+	blk_throtl_stat_add(rq, value);
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
 		if (blk_stat_is_active(cb)) {
@@ -190,7 +194,7 @@ void blk_stat_remove_callback(struct request_queue *q,
 {
 	spin_lock(&q->stats->lock);
 	list_del_rcu(&cb->list);
-	if (list_empty(&q->stats->callbacks))
+	if (list_empty(&q->stats->callbacks) && !q->stats->enable_accounting)
 		clear_bit(QUEUE_FLAG_STATS, &q->queue_flags);
 	spin_unlock(&q->stats->lock);
 
@@ -215,6 +219,14 @@ void blk_stat_free_callback(struct blk_stat_callback *cb)
 }
 EXPORT_SYMBOL_GPL(blk_stat_free_callback);
 
+void blk_stat_enable_accounting(struct request_queue *q)
+{
+	spin_lock(&q->stats->lock);
+	q->stats->enable_accounting = true;
+	set_bit(QUEUE_FLAG_STATS, &q->queue_flags);
+	spin_unlock(&q->stats->lock);
+}
+
 struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
 	struct blk_queue_stats *stats;
@@ -225,6 +237,7 @@ struct blk_queue_stats *blk_alloc_queue_stats(void)
 
 	INIT_LIST_HEAD(&stats->callbacks);
 	spin_lock_init(&stats->lock);
+	stats->enable_accounting = false;
 
 	return stats;
 }
diff --git a/block/blk-stat.h b/block/blk-stat.h
index ee47f81..53f08a6 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -108,6 +108,9 @@ static inline void blk_stat_set_issue(struct blk_issue_stat *stat,
 		(((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT);
 }
 
+/* record time/size info in request but not add a callback */
+void blk_stat_enable_accounting(struct request_queue *q);
+
 /*
  * blk_stat_rq_ddir() - Bucket callback function for the request data direction.
  * @rq: Request.
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6e1c298..140da29 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,8 @@ static int throtl_quantum = 32;
 /* default latency target is 0, eg, guarantee IO latency by default */
 #define DFL_LATENCY_TARGET (0)
 
+#define SKIP_LATENCY (((u64)1) << BLK_STAT_RES_SHIFT)
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -165,6 +167,19 @@ struct throtl_grp {
 	unsigned long idletime_threshold; /* us */
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+	unsigned long total_latency; /* ns / 1024 */
+	int samples;
+};
+
+struct avg_latency_bucket {
+	unsigned long latency; /* ns / 1024 */
+	bool valid;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -188,6 +203,13 @@ struct throtl_data
 	unsigned long low_downgrade_time;
 
 	unsigned int scale;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	unsigned long last_calculate_time;
+
+	bool track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +328,9 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+#define request_bucket_index(sectors) \
+	clamp_t(int, order_base_2(sectors) - 3, 0, LATENCY_BUCKET_SIZE - 1)
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1931,6 +1956,73 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	int i, cpu;
+	unsigned long last_latency = 0;
+	unsigned long latency;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			/* this isn't race free, but ok in practice */
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			int samples = tmp->samples;
+
+			latency = tmp->total_latency;
+
+			tmp->total_latency = 0;
+			tmp->samples = 0;
+			latency /= samples;
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!avg_latency[i].latency) {
+			if (td->avg_buckets[i].latency < last_latency)
+				td->avg_buckets[i].latency = last_latency;
+			continue;
+		}
+
+		if (!td->avg_buckets[i].valid)
+			latency = avg_latency[i].latency;
+		else
+			latency = (td->avg_buckets[i].latency * 7 +
+				avg_latency[i].latency) >> 3;
+
+		td->avg_buckets[i].latency = max(latency, last_latency);
+		td->avg_buckets[i].valid = true;
+		last_latency = td->avg_buckets[i].latency;
+	}
+}
+#else
+static inline void throtl_update_latency_buckets(struct throtl_data *td)
+{
+}
+#endif
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1939,6 +2031,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
+	struct throtl_data *td = tg->td;
 	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -1949,6 +2042,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_update_latency_buckets(td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
@@ -1956,6 +2051,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	if (ret == 0 || ret == -EBUSY)
 		bio->bi_cg_private = tg;
+	blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
 #endif
 	blk_throtl_update_idletime(tg);
 
@@ -1974,8 +2070,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
 			tg->last_low_overflow_time[rw] = jiffies;
-			if (throtl_can_upgrade(tg->td, tg)) {
-				throtl_upgrade_state(tg->td);
+			if (throtl_can_upgrade(td, tg)) {
+				throtl_upgrade_state(td);
 				goto again;
 			}
 			break;
@@ -2019,7 +2115,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_low_overflow_time[rw] = jiffies;
 
-	tg->td->nr_queued[rw]++;
+	td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
@@ -2044,20 +2140,67 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 */
 	if (!throttled)
 		bio_clear_flag(bio, BIO_THROTTLED);
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+	if (throttled || !td->track_bio_latency)
+		bio->bi_issue_stat.stat |= SKIP_LATENCY;
+#endif
 	return throttled;
 }
 
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+	int op, unsigned long time)
+{
+	struct latency_bucket *latency;
+	int index;
+
+	if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+	    !blk_queue_nonrot(td->queue))
+		return;
+
+	index = request_bucket_index(size);
+
+	latency = get_cpu_ptr(td->latency_buckets);
+	latency[index].total_latency += time;
+	latency[index].samples++;
+	put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_stat_add(struct request *rq, u64 time_ns)
+{
+	struct request_queue *q = rq->q;
+	struct throtl_data *td = q->td;
+
+	throtl_track_latency(td, blk_stat_size(&rq->issue_stat),
+		req_op(rq), time_ns >> 10);
+}
+
 void blk_throtl_bio_endio(struct bio *bio)
 {
 	struct throtl_grp *tg;
+	u64 finish_time_ns;
+	unsigned long finish_time;
+	unsigned long start_time;
+	unsigned long lat;
 
 	tg = bio->bi_cg_private;
 	if (!tg)
 		return;
 	bio->bi_cg_private = NULL;
 
-	tg->last_finish_time = ktime_get_ns() >> 10;
+	finish_time_ns = ktime_get_ns();
+	tg->last_finish_time = finish_time_ns >> 10;
+
+	start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
+	finish_time = __blk_stat_time(finish_time_ns) >> 10;
+	/* this is only for bio based driver */
+	if (start_time && finish_time > start_time &&
+	    !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
+		lat = finish_time - start_time;
+		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
+			bio_op(bio), lat);
+	}
 }
 #endif
 
@@ -2133,6 +2276,12 @@ int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2147,8 +2296,10 @@ int blk_throtl_init(struct request_queue *q)
 
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2157,6 +2308,7 @@ void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
@@ -2181,6 +2333,10 @@ void blk_throtl_register_queue(struct request_queue *q)
 	td->throtl_slice = DFL_THROTL_SLICE_HD;
 #endif
 
+	td->track_bio_latency = !q->mq_ops && !q->request_fn;
+	if (!td->track_bio_latency)
+		blk_stat_enable_accounting(q);
+
 	/*
 	 * some tg are created before queue is fully initialized, eg, nonrot
 	 * isn't initialized yet
diff --git a/block/blk.h b/block/blk.h
index 3ac833e..07d3751 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -331,8 +331,10 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
 extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_stat_add(struct request *rq, u64 time);
 #else
 static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
 #endif
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3ad5673..67bcf8a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,6 +17,10 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 
+struct blk_issue_stat {
+	u64 stat;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -60,6 +64,7 @@ struct bio {
 	struct cgroup_subsys_state *bi_css;
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	void			*bi_cg_private;
+	struct blk_issue_stat	bi_issue_stat;
 #endif
 #endif
 	union {
@@ -286,10 +291,6 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
 	return (cookie & BLK_QC_T_INTERNAL) != 0;
 }
 
-struct blk_issue_stat {
-	u64 stat;
-};
-
 struct blk_rq_stat {
 	s64 mean;
 	u64 min;
-- 
2.9.3

^ permalink raw reply related

* [PATCH 1/3] block: track request size in blk_issue_stat
From: Shaohua Li @ 2017-03-27 22:19 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490651903.git.shli@fb.com>

Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c          |  2 +-
 block/blk-mq.c            |  2 +-
 block/blk-stat.h          | 41 ++++++++++++++++++++++++++++++-----------
 block/blk-wbt.h           | 10 +++++-----
 include/linux/blk_types.h |  2 +-
 5 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e..4234332 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2483,7 +2483,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);
 
 	if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
-		blk_stat_set_issue_time(&req->issue_stat);
+		blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
 		req->rq_flags |= RQF_STATS;
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45b9beb..caef6ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -488,7 +488,7 @@ void blk_mq_start_request(struct request *rq)
 	trace_block_rq_issue(q, rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		blk_stat_set_issue_time(&rq->issue_stat);
+		blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
 		rq->rq_flags |= RQF_STATS;
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 6ad5b8c..ee47f81 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
 #include <linux/timer.h>
 
 /*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
  */
 #define BLK_STAT_RES_BITS	3
-#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS	12
+#define BLK_STAT_RES_SHIFT	(64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT	(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK	\
+	(((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK	(~((1ULL << BLK_STAT_RES_SHIFT) - 1))
 
 /**
  * struct blk_stat_callback - Block statistics callback.
@@ -73,12 +80,6 @@ void blk_free_queue_stats(struct blk_queue_stats *);
 
 void blk_stat_add(struct request *);
 
-static inline void blk_stat_set_issue_time(struct blk_issue_stat *stat)
-{
-	stat->time = ((stat->time & BLK_STAT_MASK) |
-		      (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK));
-}
-
 static inline u64 __blk_stat_time(u64 time)
 {
 	return time & BLK_STAT_TIME_MASK;
@@ -86,7 +87,25 @@ static inline u64 __blk_stat_time(u64 time)
 
 static inline u64 blk_stat_time(struct blk_issue_stat *stat)
 {
-	return __blk_stat_time(stat->time);
+	return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+	return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+	return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
+}
+
+static inline void blk_stat_set_issue(struct blk_issue_stat *stat,
+	sector_t size)
+{
+	stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+		(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+		(((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT);
 }
 
 /*
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 591ff2f..ad6c785 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {
 
 static inline void wbt_clear_state(struct blk_issue_stat *stat)
 {
-	stat->time &= BLK_STAT_TIME_MASK;
+	stat->stat &= ~BLK_STAT_RES_MASK;
 }
 
 static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
 {
-	return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+	return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
 }
 
 static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
 {
-	stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+	stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
 }
 
 static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
 }
 
 static inline bool wbt_is_read(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
 }
 
 struct rq_wait {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 07a9e96..3ad5673 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -287,7 +287,7 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
 }
 
 struct blk_issue_stat {
-	u64 time;
+	u64 stat;
 };
 
 struct blk_rq_stat {
-- 
2.9.3

^ permalink raw reply related

* [PATCH 0/3] blk-throttle: add .low limit fix
From: Shaohua Li @ 2017-03-27 22:19 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <567d5361-7d6b-c53e-8ada-a2966e48dc54@fb.com>

Jens,

The 3 patches replace the last two patches (patch 17/18) of the series I sent
out today. The new patch overloads blk stat as you suggested.

Thanks,
Shaohua

Shaohua Li (3):
  block: track request size in blk_issue_stat
  blk-throttle: add a mechanism to estimate IO latency
  blk-throttle: add latency target support

 block/blk-core.c          |   2 +-
 block/blk-mq.c            |   2 +-
 block/blk-stat.c          |  15 +++-
 block/blk-stat.h          |  44 +++++++---
 block/blk-throttle.c      | 199 ++++++++++++++++++++++++++++++++++++++++++++--
 block/blk-wbt.h           |  10 +--
 block/blk.h               |   2 +
 include/linux/blk_types.h |   9 ++-
 8 files changed, 254 insertions(+), 29 deletions(-)

-- 
2.9.3

^ permalink raw reply

* v4.11-rc blk-mq lockup?
From: Bart Van Assche @ 2017-03-27 21:44 UTC (permalink / raw)
  To: axboe@fb.com; +Cc: linux-block@vger.kernel.org

Hello Jens,

If I leave the srp-test software running for a few minutes using the
following command:

# while ~bart/software/infiniband/srp-test/run_tests -d -r 30; do :; done

then after some time the following complaint appears for multiple
kworkers:

INFO: task kworker/9:0:65 blocked for more than 480 seconds.
=A0=A0=A0=A0=A0=A0Tainted: G=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0I=A0=A0=A0=A0=A04=
.11.0-rc4-dbg+ #5
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/9:0=A0=A0=A0=A0=A0D=A0=A0=A0=A00=A0=A0=A0=A065=A0=A0=A0=A0=A0=A02 0=
x00000000
Workqueue: dio/dm-0 dio_aio_complete_work
Call Trace:
=A0__schedule+0x3df/0xc10
=A0schedule+0x38/0x90
=A0rwsem_down_write_failed+0x2c4/0x4c0
=A0call_rwsem_down_write_failed+0x17/0x30
=A0down_write+0x5a/0x70
=A0__generic_file_fsync+0x43/0x90
=A0ext4_sync_file+0x2d0/0x550
=A0vfs_fsync_range+0x46/0xa0
=A0dio_complete+0x181/0x1b0
=A0dio_aio_complete_work+0x17/0x20
=A0process_one_work+0x208/0x6a0
=A0worker_thread+0x49/0x4a0
=A0kthread+0x107/0x140
=A0ret_from_fork+0x2e/0x40

I had not yet observed this behavior with kernel v4.10 or older. If this
happens and I check the queue state with the following script:

#!/bin/bash

cd /sys/class/block || exit $?
for dev in *; do
=A0=A0=A0=A0if [ -e "$dev/mq" ]; then
	echo "$dev"
	for f in "$dev"/mq/*/{pending,*/rq_list}; do
	=A0=A0=A0=A0[ -e "$f" ] || continue
	=A0=A0=A0=A0if { read -r line1 && read -r line2; } <"$f"; then
		echo "$f"
		echo "$line1 $line2" >/dev/null
		head -n 9 "$f"
	=A0=A0=A0=A0fi
	done
	(
	=A0=A0=A0=A0cd /sys/kernel/debug/block >&/dev/null &&
	=A0=A0=A0=A0for d in "$dev"/mq/*; do
		[ ! -d "$d" ] && continue
		grep -q '^busy=3D0$' "$d/tags" && continue
	=A0=A0=A0=A0=A0=A0=A0=A0for f in "$d"/{dispatch,tags*,cpu*/rq_list}; do
		=A0=A0=A0=A0[ -e "$f" ] && grep -aH '' "$f"
		done
	=A0=A0=A0=A0done
	)
=A0=A0=A0=A0fi
done

then the following output appears:

sda
sdb
sdc
sdd
sdd/mq/3/dispatch:ffff880401655d00 {.cmd_flags=3D0xca01, .rq_flags=3D0x2040=
, .tag=3D59, .internal_tag=3D-1}
sdd/mq/3/tags:nr_tags=3D62
sdd/mq/3/tags:nr_reserved_tags=3D0
sdd/mq/3/tags:active_queues=3D0
sdd/mq/3/tags:
sdd/mq/3/tags:bitmap_tags:
sdd/mq/3/tags:depth=3D62
sdd/mq/3/tags:busy=3D31
sdd/mq/3/tags:bits_per_word=3D8
sdd/mq/3/tags:map_nr=3D8
sdd/mq/3/tags:alloc_hint=3D{23, 23, 52, 1, 55, 29, 17, 22, 34, 48, 25, 49, =
37, 43, 58, 25, 6, 20, 50, 14, 55, 7, 21, 17, 26, 36, 43, 43, 4, 6, 3, 47}
sdd/mq/3/tags:wake_batch=3D7
sdd/mq/3/tags:wake_index=3D0
sdd/mq/3/tags:ws=3D{
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sdd/mq/3/tags:}
sdd/mq/3/tags:round_robin=3D0
sdd/mq/3/tags_bitmap:00000000: ffff ff1f 0000 0018
sdd/mq/3/cpu5/rq_list:ffff880401657440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D60, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba0000 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D0, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba1740 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D1, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba2e80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D2, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba45c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D3, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba5d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D4, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba7440 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D5, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037aba8b80 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D6, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037abaa2c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D7, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037ababa00 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D8, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037abad140 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D9, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88037abae880 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D10, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369900000 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D11, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369901740 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D12, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369902e80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D13, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8803699045c0 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D14, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369905d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D15, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369907440 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D16, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff880369908b80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D17, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88036990a2c0 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D18, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88036990ba00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D19, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88036990d140 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D20, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff88036990e880 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D21, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b0000 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D22, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b1740 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D23, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b2e80 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D24, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b45c0 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D25, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b5d00 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D26, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b7440 {.cmd_flags=3D0xca01, .rq_flags=3D0x=
2040, .tag=3D27, .internal_tag=3D-1}
sdd/mq/3/cpu5/rq_list:ffff8804001b8b80 {.cmd_flags=3D0x7a01, .rq_flags=3D0x=
2040, .tag=3D28, .internal_tag=3D-1}
sde
sde/mq/3/tags:nr_tags=3D62
sde/mq/3/tags:nr_reserved_tags=3D0
sde/mq/3/tags:active_queues=3D0
sde/mq/3/tags:
sde/mq/3/tags:bitmap_tags:
sde/mq/3/tags:depth=3D62
sde/mq/3/tags:busy=3D31
sde/mq/3/tags:bits_per_word=3D8
sde/mq/3/tags:map_nr=3D8
sde/mq/3/tags:alloc_hint=3D{23, 23, 52, 1, 55, 29, 17, 22, 34, 48, 25, 49, =
37, 43, 58, 25, 6, 20, 50, 14, 55, 7, 21, 17, 26, 36, 43, 43, 4, 6, 3, 47}
sde/mq/3/tags:wake_batch=3D7
sde/mq/3/tags:wake_index=3D0
sde/mq/3/tags:ws=3D{
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:	{.wait_cnt=3D7, .wait=3Dinactive},
sde/mq/3/tags:}
sde/mq/3/tags:round_robin=3D0
sde/mq/3/tags_bitmap:00000000: ffff ff1f 0000 0018
sdf
sdg
sdh
sdi
sdj
sr0

I am using the "none" scheduler:

# cat /sys/class/block/sdd/queue/scheduler =A0
[none]=A0
# cat /sys/class/block/sde/queue/scheduler =A0=A0
[none]

What is remarkable is that I see pending requests for the sd* devices
but not for any dm* device and also that the number of busy requests (31)
is exactly half of the queue depth (62). Could this indicate that the
block layer stopped processing these blk-mq queues?

If this happens and I run the following command to trigger SRP logout:

# for p in /sys/class/srp_remote_ports/*; do echo 1 >$p/delete;=A0done

then the test that was running finishes, reports that removing the
multipath device failed and echo w >/proc/sysrq-trigger produces the
following output:

sysrq: SysRq : Show Blocked State
=A0 task=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=
=A0=A0PC stack=A0=A0=A0pid father
systemd-udevd=A0=A0=A0D=A0=A0=A0=A00 14490=A0=A0=A0=A0508 0x00000106
Call Trace:
=A0__schedule+0x3df/0xc10
=A0schedule+0x38/0x90
=A0io_schedule+0x11/0x40
=A0__lock_page+0x10c/0x140
=A0truncate_inode_pages_range+0x45d/0x780
=A0truncate_inode_pages+0x10/0x20
=A0kill_bdev+0x30/0x40
=A0__blkdev_put+0x71/0x220
=A0blkdev_put+0x49/0x170
=A0blkdev_close+0x20/0x30
=A0__fput+0xe8/0x1f0
=A0____fput+0x9/0x10
=A0task_work_run+0x80/0xb0
=A0do_exit+0x30c/0xc70
=A0do_group_exit+0x4b/0xc0
=A0get_signal+0x2c2/0x930
=A0do_signal+0x23/0x670
=A0exit_to_usermode_loop+0x5d/0xa0
=A0do_syscall_64+0xd5/0x140
=A0entry_SYSCALL64_slow_path+0x25/0x25

Does this indicate that truncate_inode_pages_range() is waiting
because a block layer queue got stuck?

The kernel tree I used in my tests is the result of merging the
following commits:
* commit 3dca2c2f3d3b=A0from git://git.kernel.dk/linux-block.git
  ("Merge branch 'for-4.12/block' into for-next")
* commit f88ab0c4b481 from git://git.kernel.org/pub/scm/linux/kernel/git/mk=
p/scsi.git
  ("scsi: libsas: fix ata xfer length")
* commit ad0376eb1483 from git://git.kernel.org/pub/scm/linux/kernel/git/to=
rvalds/linux.git
  ("Merge tag 'edac_for_4.11_2' of git://git.kernel.org/pub/scm/linux/kerne=
l/git/bp/bp")

Please let me know if you need more information.

Thanks,

Bart.=

^ permalink raw reply

* Re: [PATCH V7 00/18] blk-throttle: add .low limit
From: Jens Axboe @ 2017-03-27 19:11 UTC (permalink / raw)
  To: Shaohua Li, Jens Axboe
  Cc: linux-kernel, linux-block, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <20170327190043.GA96008@dhcp-172-20-189-90.dhcp.thefacebook.com>

On 03/27/2017 01:00 PM, Shaohua Li wrote:
> On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
>> On 03/27/2017 11:51 AM, Shaohua Li wrote:
>>> V6->V7:
>>> - Don't overload blk stat, which will simplify the code. This will add extra
>>>   space in bio/request though with the low interface configure option on.
>>
>> Hmm, why? As far as I can see, it just makes things worse.
> 
> Part of the reason is the new callback based blk stat makes enabling stat less
> straightforward. I probably can create a fake callback and enalbing the stat
> though. Only the last 3 patches depend on this, if you prefer, I can resend
> them and add the overload back.

Yeah, I think that would be a big improvement.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH V7 00/18] blk-throttle: add .low limit
From: Shaohua Li @ 2017-03-27 19:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-block, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <dfc4078b-a130-2997-9c44-20d3d17db9cc@kernel.dk>

On Mon, Mar 27, 2017 at 12:15:29PM -0600, Jens Axboe wrote:
> On 03/27/2017 11:51 AM, Shaohua Li wrote:
> > V6->V7:
> > - Don't overload blk stat, which will simplify the code. This will add extra
> >   space in bio/request though with the low interface configure option on.
> 
> Hmm, why? As far as I can see, it just makes things worse.

Part of the reason is the new callback based blk stat makes enabling stat less
straightforward. I probably can create a fake callback and enalbing the stat
though. Only the last 3 patches depend on this, if you prefer, I can resend
them and add the overload back.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock
From: Tahsin Erdogan @ 2017-03-27 18:29 UTC (permalink / raw)
  To: Julia Lawall
  Cc: Tejun Heo, Jens Axboe, linux-block, David Rientjes, linux-kernel
In-Reply-To: <alpine.DEB.2.20.1703261252310.2204@hadrien>

On Sun, Mar 26, 2017 at 3:54 AM, Julia Lawall <julia.lawall@lip6.fr> wrote:
> Is an unlock needed before line 848?  Or does blkg_lookup_check take care
> of it?

Unlock is not needed. On success, function returns with locks held.
It is documented at line 805:

"... This function returns with RCU read
 * lock and queue lock held and must be paired with blkg_conf_finish()."

^ permalink raw reply

* Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
From: Paul E. McKenney @ 2017-03-27 18:17 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: linux-kernel, linux-block, pprakash, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger
In-Reply-To: <d23f3f77-8158-24a4-727d-123ec526dffa@codeaurora.org>

On Mon, Mar 27, 2017 at 12:02:27PM -0600, Jeffrey Hugo wrote:
> Hi Paul.
> 
> Thanks for the quick reply.
> 
> On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> >On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:
> 
> >>It is a race between this work running, and the cpu offline processing.
> >
> >One quick way to test this assumption is to build a kernel with Kconfig
> >options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.  This will
> >cause call_rcu_sched() to queue the work to a kthread, which can migrate
> >to some other CPU.  If your analysis is correct, this should avoid
> >the deadlock.  (Note that the deadlock should be fixed in any case,
> >just a diagnostic assumption-check procedure.)
> 
> I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y,
> CONFIG_RCU_NOCB_CPU_ALL=y in my build.  I've only had time so far to
> do one test run however the issue reproduced, but it took a fair bit
> longer to do so.  An initial look at the data indicates that the
> work is still not running.  An odd observation, the two threads are
> no longer blocked on the same queue, but different ones.

I was afraid of that...

> Let me look at this more and see what is going on now.

Another thing to try would be to affinity the "rcuo" kthreads to
some CPU that is never taken offline, just in case that kthread is
sometimes somehow getting stuck during the CPU-hotplug operation.

> >>What is the opinion of the domain experts?
> >
> >I do hope that we can come up with a better fix.  No offense intended,
> >as coming up with -any- fix in the CPU-hotplug domain is not to be
> >denigrated, but this looks to be at vest quite fragile.
> >
> >							Thanx, Paul
> >
> 
> None taken.  I'm not particularly attached to the current fix.  I
> agree, it does appear to be quite fragile.
> 
> I'm still not sure what a better solution would be though.  Maybe
> the RCU framework flushes the work somehow during cpu offline?  It
> would need to ensure further work is not queued after that point,
> which seems like it might be tricky to synchronize.  I don't know
> enough about the working of RCU to even attempt to implement that.

There are some ways that RCU might be able to shrink the window during
which the outgoing CPU's callbacks are in limbo, but they are not free
of risk, so we really need to compleetly understand what is going on
before making any possibly ill-conceived changes.  ;-)

> In any case, it seem like some more analysis is needed based on the
> latest data.

Looking forward to hearing about you find!

							Thanx, Paul

> -- 
> Jeffrey Hugo
> Qualcomm Datacenter Technologies as an affiliate of Qualcomm
> Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the
> Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply

* Re: [PATCH V7 00/18] blk-throttle: add .low limit
From: Jens Axboe @ 2017-03-27 18:15 UTC (permalink / raw)
  To: Shaohua Li, linux-kernel, linux-block
  Cc: tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490634565.git.shli@fb.com>

On 03/27/2017 11:51 AM, Shaohua Li wrote:
> V6->V7:
> - Don't overload blk stat, which will simplify the code. This will add extra
>   space in bio/request though with the low interface configure option on.

Hmm, why? As far as I can see, it just makes things worse.

-- 
Jens Axboe

^ permalink raw reply

* Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
From: Jeffrey Hugo @ 2017-03-27 18:02 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, linux-block, pprakash, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
	Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
	Boris Ostrovsky, Richard Weinberger
In-Reply-To: <20170326232843.GA3637@linux.vnet.ibm.com>

Hi Paul.

Thanks for the quick reply.

On 3/26/2017 5:28 PM, Paul E. McKenney wrote:
> On Sun, Mar 26, 2017 at 05:10:40PM -0600, Jeffrey Hugo wrote:

>> It is a race between this work running, and the cpu offline processing.
>
> One quick way to test this assumption is to build a kernel with Kconfig
> options CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y.  This will
> cause call_rcu_sched() to queue the work to a kthread, which can migrate
> to some other CPU.  If your analysis is correct, this should avoid
> the deadlock.  (Note that the deadlock should be fixed in any case,
> just a diagnostic assumption-check procedure.)

I enabled CONFIG_RCU_EXPERT=y, CONFIG_RCU_NOCB_CPU=y, 
CONFIG_RCU_NOCB_CPU_ALL=y in my build.  I've only had time so far to do 
one test run however the issue reproduced, but it took a fair bit longer 
to do so.  An initial look at the data indicates that the work is still 
not running.  An odd observation, the two threads are no longer blocked 
on the same queue, but different ones.

Let me look at this more and see what is going on now.

>> What is the opinion of the domain experts?
>
> I do hope that we can come up with a better fix.  No offense intended,
> as coming up with -any- fix in the CPU-hotplug domain is not to be
> denigrated, but this looks to be at vest quite fragile.
>
> 							Thanx, Paul
>

None taken.  I'm not particularly attached to the current fix.  I agree, 
it does appear to be quite fragile.

I'm still not sure what a better solution would be though.  Maybe the 
RCU framework flushes the work somehow during cpu offline?  It would 
need to ensure further work is not queued after that point, which seems 
like it might be tricky to synchronize.  I don't know enough about the 
working of RCU to even attempt to implement that.

In any case, it seem like some more analysis is needed based on the 
latest data.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* [PATCH V7 18/18] blk-throttle: add latency target support
From: Shaohua Li @ 2017-03-27 17:51 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490634565.git.shli@fb.com>

One hard problem adding .low limit is to detect idle cgroup. If one
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to determine if other cgroups dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have mechanism to calculate latency threshold for each IO
size. For every IO dispatched from a cgorup, we compare its latency
against its threshold and record the info. If most IO latency is below
threshold (in the code I use 75%), the cgroup could be treated idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSD as we can't
calcualte the latency target for hard disk. And this is only for cgroup
leaf node so far.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4b9c6a1..e506d94 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,10 @@ struct throtl_grp {
 	unsigned long checked_last_finish_time; /* ns / 1024 */
 	unsigned long avg_idletime; /* ns / 1024 */
 	unsigned long idletime_threshold; /* us */
+
+	unsigned int bio_cnt; /* total bios */
+	unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+	unsigned long bio_cnt_reset_time;
 };
 
 /* We measure latency for request size from <= 4k to >= 1M */
@@ -1725,12 +1729,15 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	 * - single idle is too long, longer than a fixed value (in case user
 	 *   configure a too big threshold) or 4 times of slice
 	 * - average think time is more than threshold
+	 * - IO latency is largely below threshold
 	 */
 	unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);
 
 	time = min_t(unsigned long, MAX_IDLE_TIME, time);
 	return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
-	       tg->avg_idletime > tg->idletime_threshold;
+	       tg->avg_idletime > tg->idletime_threshold ||
+	       (tg->latency_target && tg->bio_cnt &&
+		tg->bad_bio_cnt * 5 < tg->bio_cnt);
 }
 
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2211,12 +2218,36 @@ void blk_throtl_bio_endio(struct bio *bio)
 
 	start_time = THROTL_STAT_TIME(bio->bi_throtl_stat) >> 10;
 	finish_time = THROTL_STAT_TIME(finish_time_ns) >> 10;
-	if (start_time && finish_time > start_time &&
-	    !(bio->bi_throtl_stat & THROTL_SKIP_LAT)) {
-		lat = finish_time - start_time;
+	if (!start_time || finish_time <= start_time)
+		return;
+
+	lat = finish_time - start_time;
+	if (!(bio->bi_throtl_stat & THROTL_SKIP_LAT))
 		throtl_track_latency(tg->td,
 			THROTL_STAT_SIZE(bio->bi_throtl_stat),
 			bio_op(bio), lat);
+
+	if (tg->latency_target) {
+		int bucket;
+		unsigned int threshold;
+
+		bucket = request_bucket_index(
+			THROTL_STAT_SIZE(bio->bi_throtl_stat));
+		threshold = tg->td->avg_buckets[bucket].latency +
+			tg->latency_target;
+		if (lat > threshold)
+			tg->bad_bio_cnt++;
+		/*
+		 * Not race free, could get wrong count, which means cgroups
+		 * will be throttled
+		 */
+		tg->bio_cnt++;
+	}
+
+	if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+		tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+		tg->bio_cnt /= 2;
+		tg->bad_bio_cnt /= 2;
 	}
 }
 #endif
-- 
2.9.3

^ permalink raw reply related

* [PATCH V7 17/18] blk-throttle: add a mechanism to estimate IO latency
From: Shaohua Li @ 2017-03-27 17:51 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490634565.git.shli@fb.com>

User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.

But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.

Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c          |   4 +
 block/blk-mq.c            |   4 +
 block/blk-throttle.c      | 181 ++++++++++++++++++++++++++++++++++++++++++++--
 block/blk.h               |   4 +
 include/linux/blk_types.h |   1 +
 include/linux/blkdev.h    |   3 +
 6 files changed, 192 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad388d5e..d5b5169 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2482,6 +2482,8 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
+	blk_throtl_start_request(req);
+
 	if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
 		blk_stat_set_issue_time(&req->issue_stat);
 		req->rq_flags |= RQF_STATS;
@@ -2703,6 +2705,8 @@ void blk_finish_request(struct request *req, int error)
 {
 	struct request_queue *q = req->q;
 
+	blk_throtl_finish_request(req);
+
 	if (req->rq_flags & RQF_STATS)
 		blk_stat_add(req);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45b9beb..b04a564 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -434,6 +434,8 @@ static void blk_mq_ipi_complete_request(struct request *rq)
 
 static void blk_mq_stat_add(struct request *rq)
 {
+	blk_throtl_finish_request(rq);
+
 	if (rq->rq_flags & RQF_STATS) {
 		blk_mq_poll_stats_start(rq->q);
 		blk_stat_add(rq);
@@ -487,6 +489,8 @@ void blk_mq_start_request(struct request *rq)
 
 	trace_block_rq_issue(q, rq);
 
+	blk_throtl_start_request(rq);
+
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
 		blk_stat_set_issue_time(&rq->issue_stat);
 		rq->rq_flags |= RQF_STATS;
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6e1c298..4b9c6a1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,13 @@ static int throtl_quantum = 32;
 /* default latency target is 0, eg, guarantee IO latency by default */
 #define DFL_LATENCY_TARGET (0)
 
+#define THROTL_STAT(time, size) \
+	(((u64)time & (((u64)1 << 48) - 1)) | \
+	 (((u64)size & (((u64)1 << 12) - 1)) << 48))
+#define THROTL_SKIP_LAT ((u64)1 << 63)
+#define THROTL_STAT_TIME(stat) (stat & (((u64)1 << 48) - 1))
+#define THROTL_STAT_SIZE(stat) ((stat >> 48) & (((u64)1 << 12) - 1))
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -165,6 +172,19 @@ struct throtl_grp {
 	unsigned long idletime_threshold; /* us */
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+	unsigned long total_latency; /* ns / 1024 */
+	int samples;
+};
+
+struct avg_latency_bucket {
+	unsigned long latency; /* ns / 1024 */
+	bool valid;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -188,6 +208,13 @@ struct throtl_data
 	unsigned long low_downgrade_time;
 
 	unsigned int scale;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	unsigned long last_calculate_time;
+
+	bool track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +333,9 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+#define request_bucket_index(sectors) \
+	clamp_t(int, order_base_2(sectors) - 3, 0, LATENCY_BUCKET_SIZE - 1)
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1931,6 +1961,73 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	int i, cpu;
+	unsigned long last_latency = 0;
+	unsigned long latency;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			/* this isn't race free, but ok in practice */
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			int samples = tmp->samples;
+
+			latency = tmp->total_latency;
+
+			tmp->total_latency = 0;
+			tmp->samples = 0;
+			latency /= samples;
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!avg_latency[i].latency) {
+			if (td->avg_buckets[i].latency < last_latency)
+				td->avg_buckets[i].latency = last_latency;
+			continue;
+		}
+
+		if (!td->avg_buckets[i].valid)
+			latency = avg_latency[i].latency;
+		else
+			latency = (td->avg_buckets[i].latency * 7 +
+				avg_latency[i].latency) >> 3;
+
+		td->avg_buckets[i].latency = max(latency, last_latency);
+		td->avg_buckets[i].valid = true;
+		last_latency = td->avg_buckets[i].latency;
+	}
+}
+#else
+static inline void throtl_update_latency_buckets(struct throtl_data *td)
+{
+}
+#endif
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1939,6 +2036,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
+	struct throtl_data *td = tg->td;
 	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -1949,6 +2047,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_update_latency_buckets(td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
@@ -1956,6 +2056,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	if (ret == 0 || ret == -EBUSY)
 		bio->bi_cg_private = tg;
+	bio->bi_throtl_stat = THROTL_STAT(ktime_get_ns(), bio_sectors(bio));
 #endif
 	blk_throtl_update_idletime(tg);
 
@@ -1974,8 +2075,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
 			tg->last_low_overflow_time[rw] = jiffies;
-			if (throtl_can_upgrade(tg->td, tg)) {
-				throtl_upgrade_state(tg->td);
+			if (throtl_can_upgrade(td, tg)) {
+				throtl_upgrade_state(td);
 				goto again;
 			}
 			break;
@@ -2019,7 +2120,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_low_overflow_time[rw] = jiffies;
 
-	tg->td->nr_queued[rw]++;
+	td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
@@ -2044,20 +2145,79 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 */
 	if (!throttled)
 		bio_clear_flag(bio, BIO_THROTTLED);
+
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+	if (throttled || !td->track_bio_latency)
+		bio->bi_throtl_stat |= THROTL_SKIP_LAT;
+#endif
 	return throttled;
 }
 
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+	int op, unsigned long time)
+{
+	struct latency_bucket *latency;
+	int index;
+
+	if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+	    !blk_queue_nonrot(td->queue))
+		return;
+
+	index = request_bucket_index(size);
+
+	latency = get_cpu_ptr(td->latency_buckets);
+	latency[index].total_latency += time;
+	latency[index].samples++;
+	put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_start_request(struct request *rq)
+{
+	rq->throtl_stat = THROTL_STAT(ktime_get_ns(),
+		blk_rq_sectors(rq));
+}
+
+void blk_throtl_finish_request(struct request *rq)
+{
+	struct request_queue *q = rq->q;
+	struct throtl_data *td = q->td;
+	u64 finish_time = THROTL_STAT_TIME(ktime_get_ns());
+	u64 time_ns;
+
+	if (finish_time < THROTL_STAT_TIME(rq->throtl_stat))
+		return;
+	time_ns = finish_time - THROTL_STAT_TIME(rq->throtl_stat);
+
+	throtl_track_latency(td, THROTL_STAT_SIZE(rq->throtl_stat),
+		req_op(rq), time_ns >> 10);
+}
+
 void blk_throtl_bio_endio(struct bio *bio)
 {
 	struct throtl_grp *tg;
+	u64 finish_time_ns;
+	unsigned long finish_time;
+	unsigned long start_time;
+	unsigned long lat;
 
 	tg = bio->bi_cg_private;
 	if (!tg)
 		return;
 	bio->bi_cg_private = NULL;
 
-	tg->last_finish_time = ktime_get_ns() >> 10;
+	finish_time_ns = ktime_get_ns();
+	tg->last_finish_time = finish_time_ns >> 10;
+
+	start_time = THROTL_STAT_TIME(bio->bi_throtl_stat) >> 10;
+	finish_time = THROTL_STAT_TIME(finish_time_ns) >> 10;
+	if (start_time && finish_time > start_time &&
+	    !(bio->bi_throtl_stat & THROTL_SKIP_LAT)) {
+		lat = finish_time - start_time;
+		throtl_track_latency(tg->td,
+			THROTL_STAT_SIZE(bio->bi_throtl_stat),
+			bio_op(bio), lat);
+	}
 }
 #endif
 
@@ -2133,6 +2293,12 @@ int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2147,8 +2313,10 @@ int blk_throtl_init(struct request_queue *q)
 
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2157,6 +2325,7 @@ void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
@@ -2181,6 +2350,8 @@ void blk_throtl_register_queue(struct request_queue *q)
 	td->throtl_slice = DFL_THROTL_SLICE_HD;
 #endif
 
+	td->track_bio_latency = !q->mq_ops && !q->request_fn;
+
 	/*
 	 * some tg are created before queue is fully initialized, eg, nonrot
 	 * isn't initialized yet
diff --git a/block/blk.h b/block/blk.h
index 3ac833e..fa5610e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -331,8 +331,12 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
 extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_start_request(struct request *rq);
+extern void blk_throtl_finish_request(struct request *rq);
 #else
 static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_start_request(struct request *rq) { }
+static inline void blk_throtl_finish_request(struct request *rq) { }
 #endif
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 07a9e96..112fd26 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -60,6 +60,7 @@ struct bio {
 	struct cgroup_subsys_state *bi_css;
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	void			*bi_cg_private;
+	u64			bi_throtl_stat;
 #endif
 #endif
 	union {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1a7dc42..b14ae55 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -204,6 +204,9 @@ struct request {
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
 	unsigned long long io_start_time_ns;    /* when passed to hardware */
+#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
+	u64 throtl_stat;
+#endif
 #endif
 	/* Number of scatter-gather DMA addr+len pairs after
 	 * physical address coalescing is performed.
-- 
2.9.3

^ permalink raw reply related

* [PATCH V7 16/18] blk-throttle: add interface for per-cgroup target latency
From: Shaohua Li @ 2017-03-27 17:51 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490634565.git.shli@fb.com>

Here we introduce per-cgroup latency target. The target determines how a
cgroup can afford latency increasement. We will use the target latency
to calculate a threshold and use it to schedule IO for cgroups. If a
cgroup's bandwidth is below its low limit but its average latency is
below the threshold, other cgroups can safely dispatch more IO even
their bandwidth is higher than their low limits. On the other hand, if
the first cgroup's latency is higher than the threshold, other cgroups
are throttled to their low limits. So the target latency determines how
we efficiently utilize free disk resource without sacifice of worload's
IO latency.

For example, assume 4k IO average latency is 50us when disk isn't
congested. A cgroup sets the target latency to 30us. Then the cgroup can
accept 50+30=80us IO latency. If the cgroupt's average IO latency is
90us and its bandwidth is below low limit, other cgroups are throttled
to their low limit. If the cgroup's average IO latency is 60us, other
cgroups are allowed to dispatch more IO. When other cgroups dispatch
more IO, the first cgroup's IO latency will increase. If it increases to
81us, we then throttle other cgroups.

User will configure the interface in this way:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

latency is in microsecond unit

By default, latency target is 0, which means to guarantee IO latency.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0ea8698..6e1c298 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
 #define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */
 #define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */
 #define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */
+/* default latency target is 0, eg, guarantee IO latency by default */
+#define DFL_LATENCY_TARGET (0)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -152,6 +154,7 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	unsigned long latency_target; /* us */
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -449,6 +452,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
 	/* LIMIT_LOW will have default value 0 */
 
+	tg->latency_target = DFL_LATENCY_TARGET;
+
 	return &tg->pd;
 }
 
@@ -1445,6 +1450,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	u64 bps_dft;
 	unsigned int iops_dft;
 	char idle_time[26] = "";
+	char latency_time[26] = "";
 
 	if (!dname)
 		return 0;
@@ -1461,8 +1467,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	    tg->bps_conf[WRITE][off] == bps_dft &&
 	    tg->iops_conf[READ][off] == iops_dft &&
 	    tg->iops_conf[WRITE][off] == iops_dft &&
-	    (off != LIMIT_LOW || tg->idletime_threshold ==
-				  tg->td->dft_idletime_threshold))
+	    (off != LIMIT_LOW ||
+	     (tg->idletime_threshold == tg->td->dft_idletime_threshold &&
+	      tg->latency_target == DFL_LATENCY_TARGET)))
 		return 0;
 
 	if (tg->bps_conf[READ][off] != bps_dft)
@@ -1483,10 +1490,17 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 		else
 			snprintf(idle_time, sizeof(idle_time), " idle=%lu",
 				tg->idletime_threshold);
+
+		if (tg->latency_target == ULONG_MAX)
+			strcpy(latency_time, " latency=max");
+		else
+			snprintf(latency_time, sizeof(latency_time),
+				" latency=%lu", tg->latency_target);
 	}
 
-	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
-		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+		   latency_time);
 	return 0;
 }
 
@@ -1505,6 +1519,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	unsigned long idle_time;
+	unsigned long latency_time;
 	int ret;
 	int index = of_cft(of)->private;
 
@@ -1520,6 +1535,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	v[3] = tg->iops_conf[WRITE][index];
 
 	idle_time = tg->idletime_threshold;
+	latency_time = tg->latency_target;
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
@@ -1553,6 +1569,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			v[3] = min_t(u64, val, UINT_MAX);
 		else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
 			idle_time = val;
+		else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+			latency_time = val;
 		else
 			goto out_finish;
 	}
@@ -1583,6 +1601,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 		tg->idletime_threshold = (idle_time == ULONG_MAX) ?
 			ULONG_MAX : idle_time;
+		tg->latency_target = (latency_time == ULONG_MAX) ?
+			ULONG_MAX : latency_time;
 	}
 	tg_conf_updated(tg);
 	ret = 0;
-- 
2.9.3

^ permalink raw reply related

* [PATCH V7 15/18] blk-throttle: ignore idle cgroup limit
From: Shaohua Li @ 2017-03-27 17:51 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: axboe, tj, Vivek Goyal, jmoyer, Kernel-team
In-Reply-To: <cover.1490634565.git.shli@fb.com>

Last patch introduces a way to detect idle cgroup. We use it to make
upgrade/downgrade decision. And the new algorithm can detect completely
idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f03e158..0ea8698 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -152,8 +152,6 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
-	unsigned long last_dispatch_time[2];
-
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -508,8 +506,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
-	tg->last_dispatch_time[READ] = jiffies;
-	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1708,9 +1704,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 		return true;
 
 	if (time_after_eq(jiffies,
-	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-	    time_after_eq(jiffies,
-	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+	    throtl_tg_is_idle(tg))
 		return true;
 	return false;
 }
@@ -1756,6 +1751,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_LOW)
+		return;
+
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+		return;
+
+	tg->last_check_time = now;
+
+	if (!time_after_eq(now,
+	     __tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+		return;
+
+	if (throtl_can_upgrade(tg->td, NULL))
+		throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
 	struct cgroup_subsys_state *pos_css;
@@ -1797,18 +1812,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
-	if (time_after_eq(now, tg->last_dispatch_time[READ] +
-					td->throtl_slice) &&
-	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
-					td->throtl_slice))
-		return false;
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
 	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
 	    time_after_eq(now, tg_last_low_overflow_time(tg) +
-					td->throtl_slice))
+					td->throtl_slice) &&
+	    (!throtl_tg_is_idle(tg) ||
+	     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
 		return true;
 	return false;
 }
@@ -1931,10 +1943,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
-		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
+		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.9.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox