* [PATCH 0/4] block: more scalable inflight tracking
@ 2017-08-03 20:01 Jens Axboe
From: Jens Axboe @ 2017-08-03 20:01 UTC (permalink / raw)
To: linux-block; +Cc: brking
One scaling issue we currently have in the block code is the
inflight accounting. It's based on a per-device atomic count
for reads and writes, which means that even for an mq device
with lots of hardware queues, we end up dirtying a per-device
cacheline for each IO. The issue can easily be observed by
using null_blk:
modprobe null_blk submit_queues=48 queue_mode=2
and running a fio job that has 32 jobs doing sync reads on
the device (average of 3 runs, though deviation is low):
stats      IOPS      usr       sys
------------------------------------------------------
on         2.6M      5.4%     94.6%
off       21.0M     33.7%     67.3%
which shows a 10x slowdown with stats enabled. If we look at
the profile for stats on, the top entries are:
37.38% fio [kernel.vmlinux] [k] blk_account_io_done
14.65% fio [kernel.vmlinux] [k] blk_account_io_start
14.29% fio [kernel.vmlinux] [k] part_round_stats_single
11.81% fio [kernel.vmlinux] [k] blk_account_io_completion
3.62% fio [kernel.vmlinux] [k] part_round_stats
0.81% fio [kernel.vmlinux] [k] __blkdev_direct_IO_simple
which shows the system time being dominated by the stats accounting.
This patch series replaces the atomic counters with the Hamming weight
of the tag maps for tracking inflight counts. This means we don't have
to do anything when IO starts or completes, and reading the value just
means counting how many bits are set in the tag maps on the queues.
That read side is rate limited to 1000 per second (for HZ=1000).
Using this approach, running the same test again results in:
stats      IOPS      usr       sys
------------------------------------------------------
on        20.4M     30.9%     69.0%
off       21.4M     32.4%     67.4%
and doing a profiled run with stats on, the top of stats reporting is
now:
1.23% fio [kernel.vmlinux] [k] blk_account_io_done
0.83% fio [kernel.vmlinux] [k] blk_account_io_start
0.55% fio [kernel.vmlinux] [k] blk_account_io_completion
which is a lot more reasonable. The difference between stats on and
off is now also negligible.
--
Jens Axboe
* [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Jens Axboe @ 2017-08-03 20:01 UTC
To: linux-block; +Cc: brking, Jens Axboe

Since we introduced blk-mq-sched, the tags->rqs[] array has been
dynamically assigned. So we need to check for NULL when iterating,
since we could be racing with completion.

This is perfectly safe, since the memory backing of the request is
never going away while the device is alive. Only the pointer in
->rqs[] may be reset.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-mq-tag.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..b856b2827157 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
 
-	if (rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue)
 		iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
-
-	iter_data->fn(rq, iter_data->data, reserved);
+	if (rq)
+		iter_data->fn(rq, iter_data->data, reserved);
 	return true;
 }
-- 
2.7.4
* Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Bart Van Assche @ 2017-08-03 20:29 UTC
To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com

On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> Since we introduced blk-mq-sched, the tags->rqs[] array has been
> dynamically assigned. So we need to check for NULL when iterating,
> since we could be racing with completion.
>
> This is perfectly safe, since the memory backing of the request is
> never going away while the device is alive. Only the pointer in
> ->rqs[] may be reset.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> [...]

Hello Jens,

I agree with what you wrote in the description of this patch. However,
since I have not yet found the code that clears tags->rqs[], would it be
possible to show me that code?

Thanks,

Bart.
* Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Jens Axboe @ 2017-08-03 20:35 UTC
To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com

On 08/03/2017 02:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> Since we introduced blk-mq-sched, the tags->rqs[] array has been
>> dynamically assigned. So we need to check for NULL when iterating,
>> since we could be racing with completion.
>> [...]
>
> Hello Jens,
>
> I agree with what you wrote in the description of this patch. However,
> since I have not yet found the code that clears tags->rqs[], would it
> be possible to show me that code?
Since it's been a month since I wrote this code, I went and looked too.
My memory was that we set/clear it dynamically since we added
scheduling, but looks like we don't clear it. The race is still valid
for when someone runs a tag check in parallel with someone allocating a
tag, since there's a window of time where the tag bit is set, but
->rqs[tag] isn't set yet. That's probably the race I hit, not the
completion race mentioned in the change log.

-- 
Jens Axboe
* Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Jens Axboe @ 2017-08-03 20:40 UTC
To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com

On 08/03/2017 02:35 PM, Jens Axboe wrote:
>> I agree with what you wrote in the description of this patch.
>> However, since I have not yet found the code that clears tags->rqs[],
>> would it be possible to show me that code?
>
> Since it's been a month since I wrote this code, I went and looked
> too. My memory was that we set/clear it dynamically since we added
> scheduling, but looks like we don't clear it. The race is still valid
> for when someone runs a tag check in parallel with someone allocating
> a tag, since there's a window of time where the tag bit is set, but
> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
> completion race mentioned in the change log.

Rewrote the commit message:

http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081

-- 
Jens Axboe
* Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Bart Van Assche @ 2017-08-03 20:50 UTC
To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com

On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
> On 08/03/2017 02:35 PM, Jens Axboe wrote:
>>> I agree with what you wrote in the description of this patch.
>>> However, since I have not yet found the code that clears tags->rqs[],
>>> would it be possible to show me that code?
>>
>> Since it's been a month since I wrote this code, I went and looked
>> too. My memory was that we set/clear it dynamically since we added
>> scheduling, but looks like we don't clear it. The race is still valid
>> for when someone runs a tag check in parallel with someone allocating
>> a tag, since there's a window of time where the tag bit is set, but
>> ->rqs[tag] isn't set yet. That's probably the race I hit, not the
>> completion race mentioned in the change log.
>
> Rewrote the commit message:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081

Hello Jens,

This is what I found in the updated commit:

    blk-mq-tag: check for NULL rq when iterating tags

    Since we introduced blk-mq-sched, the tags->rqs[] array has been
    dynamically assigned. So we need to check for NULL when iterating,
    since there's a window of time where the bit is set, but we haven't
    dynamically assigned the tags->rqs[] array position yet.

    This is perfectly safe, since the memory backing of the request is
    never going away while the device is alive.
Does this mean that blk_mq_tagset_busy_iter() can skip requests that it
shouldn't skip, and also that blk_mq_tagset_busy_iter() can pass a
pointer to the previous request that was associated with a tag, instead
of the current request, to its busy_tag_iter_fn argument? Shouldn't
these races be fixed, e.g. by swapping the order in which the tag bit is
set and tags->rqs[] is assigned, such that the correct request pointer
is passed to the busy_tag_iter_fn argument?

Thanks,

Bart.
* Re: [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags
From: Jens Axboe @ 2017-08-03 20:56 UTC
To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com

On 08/03/2017 02:50 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:40 -0600, Jens Axboe wrote:
>> Rewrote the commit message:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=1908e43118e688e41ac8656edcf3e7a150f3f5081
>
> Hello Jens,
>
> This is what I found in the updated commit:
>
>     blk-mq-tag: check for NULL rq when iterating tags
>
>     Since we introduced blk-mq-sched, the tags->rqs[] array has been
>     dynamically assigned. So we need to check for NULL when iterating,
>     since there's a window of time where the bit is set, but we haven't
>     dynamically assigned the tags->rqs[] array position yet.
>
>     This is perfectly safe, since the memory backing of the request is
>     never going away while the device is alive.
> Does this mean that blk_mq_tagset_busy_iter() can skip requests that
> it shouldn't skip and also that blk_mq_tagset_busy_iter() can pass a
> pointer to the previous request that was associated with a tag instead
> of the current request to its busy_tag_iter_fn argument? Shouldn't
> these races be fixed, e.g. by swapping the order in which the tag are
> set and tags->rqs[] are assigned such that the correct request pointer
> is passed to the busy_tag_iter_fn argument?

We can't swap them. We need to reserve a bit first, which entails
setting that bit. Once that's set, we can assign ->rqs[]. The race isn't
a big deal, and I don't want to add any code to prevent it, since that
would mean locking. Since we have the luxury of the request itself
always being valid memory, we can deal with stale info or having a NULL
because of the ordering. It's a conscious trade off.

-- 
Jens Axboe
* [PATCH 2/4] block: pass in queue to inflight accounting 2017-08-03 20:01 [PATCH 0/4] block: more scalable inflight tracking Jens Axboe 2017-08-03 20:01 ` [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags Jens Axboe @ 2017-08-03 20:01 ` Jens Axboe 2017-08-03 20:35 ` Bart Van Assche 2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe 2017-08-03 20:01 ` [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time Jens Axboe 3 siblings, 1 reply; 24+ messages in thread From: Jens Axboe @ 2017-08-03 20:01 UTC (permalink / raw) To: linux-block; +Cc: brking, Jens Axboe No functional change in this patch, just in preparation for basing the inflight mechanism on the queue in question. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- block/bio.c | 16 ++++++++-------- block/blk-core.c | 22 ++++++++++++---------- block/blk-merge.c | 4 ++-- block/genhd.c | 4 ++-- block/partition-generic.c | 5 +++-- drivers/block/drbd/drbd_req.c | 10 +++++++--- drivers/block/rsxx/dev.c | 6 +++--- drivers/block/zram/zram_drv.c | 5 +++-- drivers/md/bcache/request.c | 7 ++++--- drivers/md/dm.c | 6 +++--- drivers/nvdimm/nd.h | 5 +++-- include/linux/bio.h | 9 +++++---- include/linux/genhd.h | 14 +++++++++----- 13 files changed, 64 insertions(+), 49 deletions(-) diff --git a/block/bio.c b/block/bio.c index 9cf98b29588a..4d751a5b56f0 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1729,29 +1729,29 @@ void bio_check_pages_dirty(struct bio *bio) } } -void generic_start_io_acct(int rw, unsigned long sectors, - struct hd_struct *part) +void generic_start_io_acct(struct request_queue *q, int rw, + unsigned long sectors, struct hd_struct *part) { int cpu = part_stat_lock(); - part_round_stats(cpu, part); + part_round_stats(q, cpu, part); part_stat_inc(cpu, part, ios[rw]); part_stat_add(cpu, part, sectors[rw], sectors); - part_inc_in_flight(part, rw); + part_inc_in_flight(q, part, rw); part_stat_unlock(); } EXPORT_SYMBOL(generic_start_io_acct); 
-void generic_end_io_acct(int rw, struct hd_struct *part, - unsigned long start_time) +void generic_end_io_acct(struct request_queue *q, int rw, + struct hd_struct *part, unsigned long start_time) { unsigned long duration = jiffies - start_time; int cpu = part_stat_lock(); part_stat_add(cpu, part, ticks[rw], duration); - part_round_stats(cpu, part); - part_dec_in_flight(part, rw); + part_round_stats(q, cpu, part); + part_dec_in_flight(q, part, rw); part_stat_unlock(); } diff --git a/block/blk-core.c b/block/blk-core.c index af393d5a9680..ba88d71b442f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1469,15 +1469,15 @@ static void add_acct_request(struct request_queue *q, struct request *rq, __elv_add_request(q, rq, where); } -static void part_round_stats_single(int cpu, struct hd_struct *part, - unsigned long now) +static void part_round_stats_single(struct request_queue *q, int cpu, + struct hd_struct *part, unsigned long now) { int inflight; if (now == part->stamp) return; - inflight = part_in_flight(part); + inflight = part_in_flight(q, part); if (inflight) { __part_stat_add(cpu, part, time_in_queue, inflight * (now - part->stamp)); @@ -1488,6 +1488,7 @@ static void part_round_stats_single(int cpu, struct hd_struct *part, /** * part_round_stats() - Round off the performance stats on a struct disk_stats. + * @q: target block queue * @cpu: cpu number for stats access * @part: target partition * @@ -1502,13 +1503,14 @@ static void part_round_stats_single(int cpu, struct hd_struct *part, * /proc/diskstats. This accounts immediately for all queue usage up to * the current jiffies and restarts the counters again. 
*/ -void part_round_stats(int cpu, struct hd_struct *part) +void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) { unsigned long now = jiffies; if (part->partno) - part_round_stats_single(cpu, &part_to_disk(part)->part0, now); - part_round_stats_single(cpu, part, now); + part_round_stats_single(q, cpu, &part_to_disk(part)->part0, + now); + part_round_stats_single(q, cpu, part, now); } EXPORT_SYMBOL_GPL(part_round_stats); @@ -2434,8 +2436,8 @@ void blk_account_io_done(struct request *req) part_stat_inc(cpu, part, ios[rw]); part_stat_add(cpu, part, ticks[rw], duration); - part_round_stats(cpu, part); - part_dec_in_flight(part, rw); + part_round_stats(req->q, cpu, part); + part_dec_in_flight(req->q, part, rw); hd_struct_put(part); part_stat_unlock(); @@ -2492,8 +2494,8 @@ void blk_account_io_start(struct request *rq, bool new_io) part = &rq->rq_disk->part0; hd_struct_get(part); } - part_round_stats(cpu, part); - part_inc_in_flight(part, rw); + part_round_stats(rq->q, cpu, part); + part_inc_in_flight(rq->q, part, rw); rq->part = part; } diff --git a/block/blk-merge.c b/block/blk-merge.c index 99038830fb42..05f116bfb99d 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -633,8 +633,8 @@ static void blk_account_io_merge(struct request *req) cpu = part_stat_lock(); part = req->part; - part_round_stats(cpu, part); - part_dec_in_flight(part, rq_data_dir(req)); + part_round_stats(req->q, cpu, part); + part_dec_in_flight(req->q, part, rq_data_dir(req)); hd_struct_put(part); part_stat_unlock(); diff --git a/block/genhd.c b/block/genhd.c index 7f520fa25d16..f735af67a0c9 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1217,7 +1217,7 @@ static int diskstats_show(struct seq_file *seqf, void *v) disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0); while ((hd = disk_part_iter_next(&piter))) { cpu = part_stat_lock(); - part_round_stats(cpu, hd); + part_round_stats(gp->queue, cpu, hd); part_stat_unlock(); seq_printf(seqf, "%4d %7d %s %lu 
%lu %lu " "%u %lu %lu %lu %u %u %u %u\n", @@ -1231,7 +1231,7 @@ static int diskstats_show(struct seq_file *seqf, void *v) part_stat_read(hd, merges[WRITE]), part_stat_read(hd, sectors[WRITE]), jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])), - part_in_flight(hd), + part_in_flight(gp->queue, hd), jiffies_to_msecs(part_stat_read(hd, io_ticks)), jiffies_to_msecs(part_stat_read(hd, time_in_queue)) ); diff --git a/block/partition-generic.c b/block/partition-generic.c index c5ec8246e25e..d1bdd61e1d71 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -112,10 +112,11 @@ ssize_t part_stat_show(struct device *dev, struct device_attribute *attr, char *buf) { struct hd_struct *p = dev_to_part(dev); + struct request_queue *q = dev_to_disk(dev)->queue; int cpu; cpu = part_stat_lock(); - part_round_stats(cpu, p); + part_round_stats(q, cpu, p); part_stat_unlock(); return sprintf(buf, "%8lu %8lu %8llu %8u " @@ -130,7 +131,7 @@ ssize_t part_stat_show(struct device *dev, part_stat_read(p, merges[WRITE]), (unsigned long long)part_stat_read(p, sectors[WRITE]), jiffies_to_msecs(part_stat_read(p, ticks[WRITE])), - part_in_flight(p), + part_in_flight(q, p), jiffies_to_msecs(part_stat_read(p, io_ticks)), jiffies_to_msecs(part_stat_read(p, time_in_queue))); } diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c index f6e865b2d543..8d6b5d137b5e 100644 --- a/drivers/block/drbd/drbd_req.c +++ b/drivers/block/drbd/drbd_req.c @@ -36,14 +36,18 @@ static bool drbd_may_do_local_read(struct drbd_device *device, sector_t sector, /* Update disk stats at start of I/O request */ static void _drbd_start_io_acct(struct drbd_device *device, struct drbd_request *req) { - generic_start_io_acct(bio_data_dir(req->master_bio), req->i.size >> 9, - &device->vdisk->part0); + struct request_queue *q = device->rq_queue; + + generic_start_io_acct(q, bio_data_dir(req->master_bio), + req->i.size >> 9, &device->vdisk->part0); } /* Update disk stats when completing request 
upwards */ static void _drbd_end_io_acct(struct drbd_device *device, struct drbd_request *req) { - generic_end_io_acct(bio_data_dir(req->master_bio), + struct request_queue *q = device->rq_queue; + + generic_end_io_acct(q, bio_data_dir(req->master_bio), &device->vdisk->part0, req->start_jif); } diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c index 7f4acebf4657..e397d3ee7308 100644 --- a/drivers/block/rsxx/dev.c +++ b/drivers/block/rsxx/dev.c @@ -112,7 +112,7 @@ static const struct block_device_operations rsxx_fops = { static void disk_stats_start(struct rsxx_cardinfo *card, struct bio *bio) { - generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio), + generic_start_io_acct(card->queue, bio_data_dir(bio), bio_sectors(bio), &card->gendisk->part0); } @@ -120,8 +120,8 @@ static void disk_stats_complete(struct rsxx_cardinfo *card, struct bio *bio, unsigned long start_time) { - generic_end_io_acct(bio_data_dir(bio), &card->gendisk->part0, - start_time); + generic_end_io_acct(card->queue, bio_data_dir(bio), + &card->gendisk->part0, start_time); } static void bio_dma_done_cb(struct rsxx_cardinfo *card, diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index debee952dcc1..56746f420faa 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -811,9 +811,10 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, { unsigned long start_time = jiffies; int rw_acct = is_write ? 
REQ_OP_WRITE : REQ_OP_READ; + struct request_queue *q = zram->disk->queue; int ret; - generic_start_io_acct(rw_acct, bvec->bv_len >> SECTOR_SHIFT, + generic_start_io_acct(q, rw_acct, bvec->bv_len >> SECTOR_SHIFT, &zram->disk->part0); if (!is_write) { @@ -825,7 +826,7 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index, ret = zram_bvec_write(zram, bvec, index, offset); } - generic_end_io_acct(rw_acct, &zram->disk->part0, start_time); + generic_end_io_acct(q, rw_acct, &zram->disk->part0, start_time); if (unlikely(ret)) { if (!is_write) diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 019b3df9f1c6..72eb97176403 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -607,7 +607,8 @@ static void request_endio(struct bio *bio) static void bio_complete(struct search *s) { if (s->orig_bio) { - generic_end_io_acct(bio_data_dir(s->orig_bio), + struct request_queue *q = bdev_get_queue(s->orig_bio->bi_bdev); + generic_end_io_acct(q, bio_data_dir(s->orig_bio), &s->d->disk->part0, s->start_time); trace_bcache_request_end(s->d, s->orig_bio); @@ -959,7 +960,7 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q, struct cached_dev *dc = container_of(d, struct cached_dev, disk); int rw = bio_data_dir(bio); - generic_start_io_acct(rw, bio_sectors(bio), &d->disk->part0); + generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); bio->bi_bdev = dc->bdev; bio->bi_iter.bi_sector += dc->sb.data_offset; @@ -1074,7 +1075,7 @@ static blk_qc_t flash_dev_make_request(struct request_queue *q, struct bcache_device *d = bio->bi_bdev->bd_disk->private_data; int rw = bio_data_dir(bio); - generic_start_io_acct(rw, bio_sectors(bio), &d->disk->part0); + generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); s = search_alloc(bio, d); cl = &s->cl; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 402946035308..856c4d69e743 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -516,7 +516,7 @@ 
static void start_io_acct(struct dm_io *io) io->start_time = jiffies; cpu = part_stat_lock(); - part_round_stats(cpu, &dm_disk(md)->part0); + part_round_stats(md->queue, cpu, &dm_disk(md)->part0); part_stat_unlock(); atomic_set(&dm_disk(md)->part0.in_flight[rw], atomic_inc_return(&md->pending[rw])); @@ -535,7 +535,7 @@ static void end_io_acct(struct dm_io *io) int pending; int rw = bio_data_dir(bio); - generic_end_io_acct(rw, &dm_disk(md)->part0, io->start_time); + generic_end_io_acct(md->queue, rw, &dm_disk(md)->part0, io->start_time); if (unlikely(dm_stats_used(&md->stats))) dm_stats_account_io(&md->stats, bio_data_dir(bio), @@ -1408,7 +1408,7 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) map = dm_get_live_table(md, &srcu_idx); - generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0); + generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0); /* if we're suspended, we have to queue this io for later */ if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index 03852d738eec..7063cad88bd9 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -383,7 +383,7 @@ static inline bool nd_iostat_start(struct bio *bio, unsigned long *start) return false; *start = jiffies; - generic_start_io_acct(bio_data_dir(bio), + generic_start_io_acct(disk->queue, bio_data_dir(bio), bio_sectors(bio), &disk->part0); return true; } @@ -391,7 +391,8 @@ static inline void nd_iostat_end(struct bio *bio, unsigned long start) { struct gendisk *disk = bio->bi_bdev->bd_disk; - generic_end_io_acct(bio_data_dir(bio), &disk->part0, start); + generic_end_io_acct(disk->queue, bio_data_dir(bio), &disk->part0, + start); } static inline bool is_bad_pmem(struct badblocks *bb, sector_t sector, unsigned int len) diff --git a/include/linux/bio.h b/include/linux/bio.h index 4907bea03908..c447959968c1 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -447,10 +447,11 @@ 
extern struct bio *bio_copy_kern(struct request_queue *, void *, unsigned int, extern void bio_set_pages_dirty(struct bio *bio); extern void bio_check_pages_dirty(struct bio *bio); -void generic_start_io_acct(int rw, unsigned long sectors, - struct hd_struct *part); -void generic_end_io_acct(int rw, struct hd_struct *part, - unsigned long start_time); +void generic_start_io_acct(struct request_queue *q, int rw, + unsigned long sectors, struct hd_struct *part); +void generic_end_io_acct(struct request_queue *q, int rw, + struct hd_struct *part, + unsigned long start_time); #ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE # error "You should define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE for your platform" diff --git a/include/linux/genhd.h b/include/linux/genhd.h index e619fae2f037..7f7427e00f9c 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -362,23 +362,27 @@ static inline void free_part_stats(struct hd_struct *part) #define part_stat_sub(cpu, gendiskp, field, subnd) \ part_stat_add(cpu, gendiskp, field, -subnd) -static inline void part_inc_in_flight(struct hd_struct *part, int rw) +static inline void part_inc_in_flight(struct request_queue *q, + struct hd_struct *part, int rw) { atomic_inc(&part->in_flight[rw]); if (part->partno) atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); } -static inline void part_dec_in_flight(struct hd_struct *part, int rw) +static inline void part_dec_in_flight(struct request_queue *q, + struct hd_struct *part, int rw) { atomic_dec(&part->in_flight[rw]); if (part->partno) atomic_dec(&part_to_disk(part)->part0.in_flight[rw]); } -static inline int part_in_flight(struct hd_struct *part) +static inline int part_in_flight(struct request_queue *q, + struct hd_struct *part) { - return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]); + return atomic_read(&part->in_flight[0]) + + atomic_read(&part->in_flight[1]); } static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk) @@ -395,7 +399,7 @@ 
static inline void free_part_info(struct hd_struct *part) } /* block/blk-core.c */ -extern void part_round_stats(int cpu, struct hd_struct *part); +extern void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part); /* block/genhd.c */ extern void device_add_disk(struct device *parent, struct gendisk *disk); -- 2.7.4 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 2/4] block: pass in queue to inflight accounting
From: Bart Van Assche @ 2017-08-03 20:35 UTC
To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com

On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> No functional change in this patch, just in preparation for
> basing the inflight mechanism on the queue in question.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/bio.c                   | 16 ++++++++--------
>  block/blk-core.c              | 22 ++++++++++++----------
>  block/blk-merge.c             |  4 ++--
>  block/genhd.c                 |  4 ++--
>  block/partition-generic.c     |  5 +++--
>  drivers/block/drbd/drbd_req.c | 10 +++++++---
>  drivers/block/rsxx/dev.c      |  6 +++---
>  drivers/block/zram/zram_drv.c |  5 +++--
>  drivers/md/bcache/request.c   |  7 ++++---
>  drivers/md/dm.c               |  6 +++---
>  drivers/nvdimm/nd.h           |  5 +++--
>  include/linux/bio.h           |  9 +++++----
>  include/linux/genhd.h         | 14 +++++++++-----
>  13 files changed, 64 insertions(+), 49 deletions(-)

Hello Jens,

Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?
Anyway:

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
* Re: [PATCH 2/4] block: pass in queue to inflight accounting
From: Jens Axboe @ 2017-08-03 20:37 UTC
To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com

On 08/03/2017 02:35 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>> No functional change in this patch, just in preparation for
>> basing the inflight mechanism on the queue in question.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> [...]
>
> Hello Jens,
>
> Should the DRBD, rsxx, zram, md and nvdimm driver maintainers be CC-ed?

Yeah, they should, forgot to do so. I'll make sure they see it, since
this is the initial posting, I'm assuming I'll have to do a v2 anyway.

-- 
Jens Axboe
* [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:01 [PATCH 0/4] block: more scalable inflight tracking Jens Axboe 2017-08-03 20:01 ` [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags Jens Axboe 2017-08-03 20:01 ` [PATCH 2/4] block: pass in queue to inflight accounting Jens Axboe @ 2017-08-03 20:01 ` Jens Axboe 2017-08-03 20:41 ` Bart Van Assche ` (2 more replies) 2017-08-03 20:01 ` [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time Jens Axboe 3 siblings, 3 replies; 24+ messages in thread From: Jens Axboe @ 2017-08-03 20:01 UTC (permalink / raw) To: linux-block; +Cc: brking, Jens Axboe We don't have to inc/dec some counter, since we can just iterate the tags. That makes inc/dec a noop, but means we have to iterate busy tags to get an in-flight count. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- block/blk-mq.c | 24 ++++++++++++++++++++++++ block/blk-mq.h | 2 ++ block/genhd.c | 29 +++++++++++++++++++++++++++++ include/linux/genhd.h | 25 +++---------------------- 4 files changed, 58 insertions(+), 22 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 05dfa3f270ae..37035891e120 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); } +struct mq_inflight { + struct hd_struct *part; + unsigned int inflight; +}; + +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, + struct request *rq, void *priv, + bool reserved) +{ + struct mq_inflight *mi = priv; + + if (rq->part == mi->part) + mi->inflight++; +} + +unsigned int blk_mq_in_flight(struct request_queue *q, + struct hd_struct *part) +{ + struct mq_inflight mi = { .part = part, .inflight = 0 }; + + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); + return mi.inflight; +} + void blk_freeze_queue_start(struct request_queue *q) { int freeze_depth; diff --git a/block/blk-mq.h b/block/blk-mq.h index 
1a06fdf9fd4d..cade1a512a01 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -138,4 +138,6 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx) return hctx->nr_ctx && hctx->tags; } +unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part); + #endif diff --git a/block/genhd.c b/block/genhd.c index f735af67a0c9..ad5dc567d57f 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk); static void disk_del_events(struct gendisk *disk); static void disk_release_events(struct gendisk *disk); +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw) +{ + if (q->mq_ops) + return; + + atomic_inc(&part->in_flight[rw]); + if (part->partno) + atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); +} + +void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw) +{ + if (q->mq_ops) + return; + + atomic_dec(&part->in_flight[rw]); + if (part->partno) + atomic_dec(&part_to_disk(part)->part0.in_flight[rw]); +} + +int part_in_flight(struct request_queue *q, struct hd_struct *part) +{ + if (q->mq_ops) + return blk_mq_in_flight(q, part); + + return atomic_read(&part->in_flight[0]) + + atomic_read(&part->in_flight[1]); +} + /** * disk_get_part - get partition * @disk: disk to look partition from diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 7f7427e00f9c..f2c5096b3a7e 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part) #define part_stat_sub(cpu, gendiskp, field, subnd) \ part_stat_add(cpu, gendiskp, field, -subnd) -static inline void part_inc_in_flight(struct request_queue *q, - struct hd_struct *part, int rw) -{ - atomic_inc(&part->in_flight[rw]); - if (part->partno) - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); -} - -static inline void part_dec_in_flight(struct request_queue *q, - struct hd_struct *part, int rw) -{ - 
atomic_dec(&part->in_flight[rw]); - if (part->partno) - atomic_dec(&part_to_disk(part)->part0.in_flight[rw]); -} - -static inline int part_in_flight(struct request_queue *q, - struct hd_struct *part) -{ - return atomic_read(&part->in_flight[0]) + - atomic_read(&part->in_flight[1]); -} +int part_in_flight(struct request_queue *q, struct hd_struct *part); +void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw); +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw); static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk) { -- 2.7.4 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe @ 2017-08-03 20:41 ` Bart Van Assche 2017-08-03 20:45 ` Jens Axboe 2017-08-03 21:25 ` Bart Van Assche 2017-08-04 11:17 ` Ming Lei 2 siblings, 1 reply; 24+ messages in thread From: Bart Van Assche @ 2017-08-03 20:41 UTC (permalink / raw) To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote: > We don't have to inc/dec some counter, since we can just > iterate the tags. That makes inc/dec a noop, but means we > have to iterate busy tags to get an in-flight count. > [ ... ] > +unsigned int blk_mq_in_flight(struct request_queue *q, > + struct hd_struct *part) > +{ > + struct mq_inflight mi = { .part = part, .inflight = 0 }; Hello Jens, A minor stylistic comment: since a C compiler is required to initialize to zero all member variables that have not been initialized explicitly I think ".inflight = 0" can be left out. > diff --git a/block/genhd.c b/block/genhd.c > index f735af67a0c9..ad5dc567d57f 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk); > static void disk_del_events(struct gendisk *disk); > static void disk_release_events(struct gendisk *disk); > > +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw) > +{ > + if (q->mq_ops) > + return; > + > + atomic_inc(&part->in_flight[rw]); > + if (part->partno) > + atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); > +} > [ ... ] > diff --git a/include/linux/genhd.h b/include/linux/genhd.h > index 7f7427e00f9c..f2c5096b3a7e 100644 > --- a/include/linux/genhd.h > +++ b/include/linux/genhd.h > @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part) > #define part_stat_sub(cpu, gendiskp, field, subnd) \ > part_stat_add(cpu, gendiskp, field, -subnd) > > -static inline void part_inc_in_flight(struct request_queue *q, > - struct hd_struct *part, int rw) > -{ > - atomic_inc(&part->in_flight[rw]); > - if (part->partno) > - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); > -} > [ ... ] Sorry but to me it seems like this part of the patch does not match the patch description? The patch description mentions that inc and dec become a noop but it seems to me that these functions have been uninlined instead of making these a noop? Bart. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:41 ` Bart Van Assche @ 2017-08-03 20:45 ` Jens Axboe 2017-08-03 20:54 ` Bart Van Assche 0 siblings, 1 reply; 24+ messages in thread From: Jens Axboe @ 2017-08-03 20:45 UTC (permalink / raw) To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com On 08/03/2017 02:41 PM, Bart Van Assche wrote: > On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote: >> We don't have to inc/dec some counter, since we can just >> iterate the tags. That makes inc/dec a noop, but means we >> have to iterate busy tags to get an in-flight count. >> [ ... ] >> +unsigned int blk_mq_in_flight(struct request_queue *q, >> + struct hd_struct *part) >> +{ >> + struct mq_inflight mi = { .part = part, .inflight = 0 }; > > Hello Jens, > > A minor stylistic comment: since a C compiler is required to > initialize to zero all member variables that have not been initialized > explicitly I think ".inflight = 0" can be left out. It can, I'll kill it. >> diff --git a/block/genhd.c b/block/genhd.c >> index f735af67a0c9..ad5dc567d57f 100644 >> --- a/block/genhd.c >> +++ b/block/genhd.c >> @@ -45,6 +45,35 @@ static void disk_add_events(struct gendisk *disk); >> static void disk_del_events(struct gendisk *disk); >> static void disk_release_events(struct gendisk *disk); >> >> +void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw) >> +{ >> + if (q->mq_ops) >> + return; >> + >> + atomic_inc(&part->in_flight[rw]); >> + if (part->partno) >> + atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); >> +} >> [ ... 
] >> diff --git a/include/linux/genhd.h b/include/linux/genhd.h >> index 7f7427e00f9c..f2c5096b3a7e 100644 >> --- a/include/linux/genhd.h >> +++ b/include/linux/genhd.h >> @@ -362,28 +362,9 @@ static inline void free_part_stats(struct hd_struct *part) >> #define part_stat_sub(cpu, gendiskp, field, subnd) \ >> part_stat_add(cpu, gendiskp, field, -subnd) >> >> -static inline void part_inc_in_flight(struct request_queue *q, >> - struct hd_struct *part, int rw) >> -{ >> - atomic_inc(&part->in_flight[rw]); >> - if (part->partno) >> - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); >> -} >> [ ... ] > > Sorry but to me it seems like this part of the patch does match with > the patch description? The patch description mentions that inc and dec > become a noop but it seems to me that these functions have been > uninlined instead of making these a noop? The inc/dec goes away for mq, the non-mq path still has to use them. I just move them as well. Could be a prep patch, but it's just moving the code out of the header and into a normal C file instead. -- Jens Axboe ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:45 ` Jens Axboe @ 2017-08-03 20:54 ` Bart Van Assche 0 siblings, 0 replies; 24+ messages in thread From: Bart Van Assche @ 2017-08-03 20:54 UTC (permalink / raw) To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com On Thu, 2017-08-03 at 14:45 -0600, Jens Axboe wrote: > The inc/dec goes away for mq, the non-mq path still has to use them. I just > move them as well. Could be a prep patch, but it's just moving the code out > of the header and into a normal C file instead. Hello Jens, I misread that part of the patch. Now that I have had another look, these changes look fine to me. Hence: Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe 2017-08-03 20:41 ` Bart Van Assche @ 2017-08-03 21:25 ` Bart Van Assche 2017-08-03 22:36 ` Jens Axboe 2017-08-04 11:17 ` Ming Lei 2 siblings, 1 reply; 24+ messages in thread From: Bart Van Assche @ 2017-08-03 21:25 UTC (permalink / raw) To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote: > +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > + struct request *rq, void *priv, > + bool reserved) > +{ > + struct mq_inflight *mi = priv; > + > + if (rq->part == mi->part) > + mi->inflight++; > +} > [ ... ] > -static inline void part_inc_in_flight(struct request_queue *q, > - struct hd_struct *part, int rw) > -{ > - atomic_inc(&part->in_flight[rw]); > - if (part->partno) > - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); > -} Hello Jens, The existing part_inc_in_flight() code includes all requests in the in_flight statistics for part0 but the new code in blk_mq_check_inflight() does not. Is that on purpose? Should the rq->part == mi->part check perhaps be skipped if mi->part represents part0? Thanks, Bart. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 21:25 ` Bart Van Assche @ 2017-08-03 22:36 ` Jens Axboe 0 siblings, 0 replies; 24+ messages in thread From: Jens Axboe @ 2017-08-03 22:36 UTC (permalink / raw) To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com On 08/03/2017 03:25 PM, Bart Van Assche wrote: > On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote: >> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, >> + struct request *rq, void *priv, >> + bool reserved) >> +{ >> + struct mq_inflight *mi = priv; >> + >> + if (rq->part == mi->part) >> + mi->inflight++; >> +} >> [ ... ] >> -static inline void part_inc_in_flight(struct request_queue *q, >> - struct hd_struct *part, int rw) >> -{ >> - atomic_inc(&part->in_flight[rw]); >> - if (part->partno) >> - atomic_inc(&part_to_disk(part)->part0.in_flight[rw]); >> -} > > Hello Jens, > > The existing part_inc_in_flight() code includes all requests in the in_flight > statistics for part0 but the new code in blk_mq_check_inflight() not. Is that > on purpose? Should the rq->part == mi->part check perhaps be skipped if > mi->part represents part0? The existing code increments always for the partition in question, and for the root if it's a partition. I'll take a look at that logic, and ensure it's all correct. -- Jens Axboe ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe 2017-08-03 20:41 ` Bart Van Assche 2017-08-03 21:25 ` Bart Van Assche @ 2017-08-04 11:17 ` Ming Lei 2017-08-04 13:55 ` Jens Axboe 2 siblings, 1 reply; 24+ messages in thread From: Ming Lei @ 2017-08-04 11:17 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, brking On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote: > We don't have to inc/dec some counter, since we can just > iterate the tags. That makes inc/dec a noop, but means we > have to iterate busy tags to get an in-flight count. > > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > block/blk-mq.c | 24 ++++++++++++++++++++++++ > block/blk-mq.h | 2 ++ > block/genhd.c | 29 +++++++++++++++++++++++++++++ > include/linux/genhd.h | 25 +++---------------------- > 4 files changed, 58 insertions(+), 22 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 05dfa3f270ae..37035891e120 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, > sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); > } > > +struct mq_inflight { > + struct hd_struct *part; > + unsigned int inflight; > +}; > + > +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > + struct request *rq, void *priv, > + bool reserved) > +{ > + struct mq_inflight *mi = priv; > + > + if (rq->part == mi->part) > + mi->inflight++; > +} > + > +unsigned int blk_mq_in_flight(struct request_queue *q, > + struct hd_struct *part) > +{ > + struct mq_inflight mi = { .part = part, .inflight = 0 }; > + > + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); > + return mi.inflight; > +} IMO it might not be as efficient as per-cpu variable. For example, NVMe on one 128-core system, if we use percpu variable, it is enough to read 128 local variable from each CPU for accounting one in_flight. 
But in this way of blk_mq_in_flight(), we need to do 128 sbitmap search, and one sbitmap search need to read at least 16 words of 'unsigned long', and total 128*16 read. So maybe we need to compare the two approaches first. Thanks, Ming ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-04 11:17 ` Ming Lei @ 2017-08-04 13:55 ` Jens Axboe 2017-08-04 22:19 ` Ming Lei 0 siblings, 1 reply; 24+ messages in thread From: Jens Axboe @ 2017-08-04 13:55 UTC (permalink / raw) To: Ming Lei; +Cc: linux-block, brking On 08/04/2017 05:17 AM, Ming Lei wrote: > On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote: >> We don't have to inc/dec some counter, since we can just >> iterate the tags. That makes inc/dec a noop, but means we >> have to iterate busy tags to get an in-flight count. >> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> --- >> block/blk-mq.c | 24 ++++++++++++++++++++++++ >> block/blk-mq.h | 2 ++ >> block/genhd.c | 29 +++++++++++++++++++++++++++++ >> include/linux/genhd.h | 25 +++---------------------- >> 4 files changed, 58 insertions(+), 22 deletions(-) >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 05dfa3f270ae..37035891e120 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, >> sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); >> } >> >> +struct mq_inflight { >> + struct hd_struct *part; >> + unsigned int inflight; >> +}; >> + >> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, >> + struct request *rq, void *priv, >> + bool reserved) >> +{ >> + struct mq_inflight *mi = priv; >> + >> + if (rq->part == mi->part) >> + mi->inflight++; >> +} >> + >> +unsigned int blk_mq_in_flight(struct request_queue *q, >> + struct hd_struct *part) >> +{ >> + struct mq_inflight mi = { .part = part, .inflight = 0 }; >> + >> + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); >> + return mi.inflight; >> +} > > IMO it might not be as efficient as per-cpu variable. > > For example, NVMe on one 128-core system, if we use percpu variable, > it is enough to read 128 local variable from each CPU for accounting > one in_flight. 
IFF the system is configured with NR_CPUS=128. Most distros go much bigger. On the other hand, we know that nr_queues will never be bigger than the number of online cpus, not the number of possible cpus. > But in this way of blk_mq_in_flight(), we need to do 128 > sbitmap search, and one sbitmap search need to read at least > 16 words of 'unsigned long', and total 128*16 read. If that ends up being a problem, it hasn't in testing, then we could always stuff an index in front of the full sbitmap. > So maybe we need to compare the two approaches first. We already did, back when this was originally posted. See the thread from end may / start june and the results from Brian. -- Jens Axboe ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-04 13:55 ` Jens Axboe @ 2017-08-04 22:19 ` Ming Lei 2017-08-07 19:54 ` Brian King 0 siblings, 1 reply; 24+ messages in thread From: Ming Lei @ 2017-08-04 22:19 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, brking On Fri, Aug 04, 2017 at 07:55:41AM -0600, Jens Axboe wrote: > On 08/04/2017 05:17 AM, Ming Lei wrote: > > On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote: > >> We don't have to inc/dec some counter, since we can just > >> iterate the tags. That makes inc/dec a noop, but means we > >> have to iterate busy tags to get an in-flight count. > >> > >> Signed-off-by: Jens Axboe <axboe@kernel.dk> > >> --- > >> block/blk-mq.c | 24 ++++++++++++++++++++++++ > >> block/blk-mq.h | 2 ++ > >> block/genhd.c | 29 +++++++++++++++++++++++++++++ > >> include/linux/genhd.h | 25 +++---------------------- > >> 4 files changed, 58 insertions(+), 22 deletions(-) > >> > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index 05dfa3f270ae..37035891e120 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, > >> sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); > >> } > >> > >> +struct mq_inflight { > >> + struct hd_struct *part; > >> + unsigned int inflight; > >> +}; > >> + > >> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, > >> + struct request *rq, void *priv, > >> + bool reserved) > >> +{ > >> + struct mq_inflight *mi = priv; > >> + > >> + if (rq->part == mi->part) > >> + mi->inflight++; > >> +} > >> + > >> +unsigned int blk_mq_in_flight(struct request_queue *q, > >> + struct hd_struct *part) > >> +{ > >> + struct mq_inflight mi = { .part = part, .inflight = 0 }; > >> + > >> + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); > >> + return mi.inflight; > >> +} > > > > IMO it might not be as efficient as per-cpu variable. 
> > > > For example, NVMe on one 128-core system, if we use percpu variable, > > it is enough to read 128 local variable from each CPU for accounting > > one in_flight. > > IFF the system is configured with NR_CPUS=128. Most distros go > much bigger. On the other hand, we know that nr_queues will > never be bigger than the number of online cpus, not the number > of possible cpus. We usually use for_each_possible_cpu() for aggregating CPU local counters, and num_possible_cpus() is the number of CPUs populatable in system, which is much less than NR_CPUs: include/linux/cpumask.h: * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable > > > But in this way of blk_mq_in_flight(), we need to do 128 > > sbitmap search, and one sbitmap search need to read at least > > 16 words of 'unsigned long', and total 128*16 read. > > If that ends up being a problem, it hasn't in testing, then we > could always stuff an index in front of the full sbitmap. > > > So maybe we need to compare the two approaches first. > > We already did, back when this was originally posted. See the > thread from end may / start june and the results from Brian. Can't find the comparison data between percpu accounting vs. mq-inflight in that thread. Just saw Brian mentioned in patch log that percpu may reach 11.4M (I guess 'M' is missed) [1]: "When running this on a Power system, to a single null_blk device with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s." But in link[2], he said mq-inflight can reach 9.4M. Could Brian explain it a bit? Maybe the two tests were run in different settings, don't know. Even though mq-inflight is better, I guess we need to understand the principle behind why it is better than percpu... [1] http://marc.info/?l=linux-block&m=149868436905520&w=2 [2] http://marc.info/?l=linux-block&m=149920174301644&w=2 Thanks, Ming ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] blk-mq: provide internal in-flight variant 2017-08-04 22:19 ` Ming Lei @ 2017-08-07 19:54 ` Brian King 0 siblings, 0 replies; 24+ messages in thread From: Brian King @ 2017-08-07 19:54 UTC (permalink / raw) To: Ming Lei, Jens Axboe; +Cc: linux-block On 08/04/2017 05:19 PM, Ming Lei wrote: > On Fri, Aug 04, 2017 at 07:55:41AM -0600, Jens Axboe wrote: >> On 08/04/2017 05:17 AM, Ming Lei wrote: >>> On Thu, Aug 03, 2017 at 02:01:55PM -0600, Jens Axboe wrote: >>>> We don't have to inc/dec some counter, since we can just >>>> iterate the tags. That makes inc/dec a noop, but means we >>>> have to iterate busy tags to get an in-flight count. >>>> >>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>>> --- >>>> block/blk-mq.c | 24 ++++++++++++++++++++++++ >>>> block/blk-mq.h | 2 ++ >>>> block/genhd.c | 29 +++++++++++++++++++++++++++++ >>>> include/linux/genhd.h | 25 +++---------------------- >>>> 4 files changed, 58 insertions(+), 22 deletions(-) >>>> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>> index 05dfa3f270ae..37035891e120 100644 >>>> --- a/block/blk-mq.c >>>> +++ b/block/blk-mq.c >>>> @@ -86,6 +86,30 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, >>>> sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); >>>> } >>>> >>>> +struct mq_inflight { >>>> + struct hd_struct *part; >>>> + unsigned int inflight; >>>> +}; >>>> + >>>> +static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx, >>>> + struct request *rq, void *priv, >>>> + bool reserved) >>>> +{ >>>> + struct mq_inflight *mi = priv; >>>> + >>>> + if (rq->part == mi->part) >>>> + mi->inflight++; >>>> +} >>>> + >>>> +unsigned int blk_mq_in_flight(struct request_queue *q, >>>> + struct hd_struct *part) >>>> +{ >>>> + struct mq_inflight mi = { .part = part, .inflight = 0 }; >>>> + >>>> + blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi); >>>> + return mi.inflight; >>>> +} >>> >>> IMO it might not be as efficient as per-cpu variable. 
>>> >>> For example, NVMe on one 128-core system, if we use percpu variable, >>> it is enough to read 128 local variable from each CPU for accounting >>> one in_flight. >> >> IFF the system is configured with NR_CPUS=128. Most distros go >> much bigger. On the other hand, we know that nr_queues will >> never be bigger than the number of online cpus, not the number >> of possible cpus. > > We usually use for_each_possible_cpu() for aggregating CPU > local counters, and num_possible_cpus() is the number of > CPUs populatable in system, which is much less than NR_CPUs: > > include/linux/cpumask.h: > * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable > >> >>> But in this way of blk_mq_in_flight(), we need to do 128 >>> sbitmap search, and one sbitmap search need to read at least >>> 16 words of 'unsigned long', and total 128*16 read. >> >> If that ends up being a problem, it hasn't in testing, then we >> could always stuff an index in front of the full sbitmap. >> >>> So maybe we need to compare the two approaches first. >> >> We already did, back when this was originally posted. See the >> thread from end may / start june and the results from Brian. > > Can't find the comparison data between percpu accounting vs. mq-inflight > in that thread. > > Just saw Brian mentioned in patch log that percpu may reach > 11.4M (I guess 'M' is missed) [1]: > > "When running this on a Power system, to a single null_blk device > with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs > go from 1.5M IO/s to 11.4 IO/s." > > But in link[2], he said mq-inflight can reach 9.4M. > > Could Brian explain it a bit? Maybe the two tests were run in > different settings, don't know. The 11.4M IOPs vs 9.4M IOPs runs cannot be directly compared, as they were run on different systems with different NVMe devices. The key comparison I kept coming back to in my measurements was: N jobs run to 1 null_blk vs. 
N null_blk devices, each with 1 job If the IOPs I get between the two is similar, that should show that we don't have scaling issues in blk-mq. There were three variations of patches I tried with: * per-cpu - Patch from me * per-node-atomic - Patch from Ming * mq-inflight - Patch from Jens All of them provided a massive improvement in my environment. The mq-inflight was only marginally better than the per node and it was most prominent in the N null_blk each with 1 job. While the per-node atomic approach certainly reduced cross node contention on the atomics, they are still atomics, which have a bit of overhead, particularly on the Power platform. As for the difference between the percpu approach and the mq-inflight approach, I didn't compare them directly in the same config, since I didn't think the percpu approach would go anywhere after the initial discussion we had on the list. I'll get some time on the test machine again and do a direct comparison between in-flight and percpu to see if there are any significant difference. Thanks, Brian -- Brian King Power Linux I/O IBM Linux Technology Center ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time 2017-08-03 20:01 [PATCH 0/4] block: more scalable inflight tracking Jens Axboe ` (2 preceding siblings ...) 2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe @ 2017-08-03 20:01 ` Jens Axboe 2017-08-03 21:29 ` Bart Van Assche 2017-08-03 22:30 ` Bart Van Assche 3 siblings, 2 replies; 24+ messages in thread From: Jens Axboe @ 2017-08-03 20:01 UTC (permalink / raw) To: linux-block; +Cc: brking, Jens Axboe We only need to iterate the queues and tags once, if we just pass in both partition structs. Add part_in_flight_double() for returning this value. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- block/blk-core.c | 34 +++++++++++++++++++++++----------- block/blk-mq.c | 25 +++++++++++++++++-------- block/blk-mq.h | 3 ++- block/genhd.c | 23 +++++++++++++++++++++-- include/linux/genhd.h | 2 ++ 5 files changed, 65 insertions(+), 22 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index ba88d71b442f..d111a8f5bdf1 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1470,14 +1470,9 @@ static void add_acct_request(struct request_queue *q, struct request *rq, } static void part_round_stats_single(struct request_queue *q, int cpu, - struct hd_struct *part, unsigned long now) + struct hd_struct *part, unsigned long now, + unsigned int inflight) { - int inflight; - - if (now == part->stamp) - return; - - inflight = part_in_flight(q, part); if (inflight) { __part_stat_add(cpu, part, time_in_queue, inflight * (now - part->stamp)); @@ -1505,12 +1500,29 @@ static void part_round_stats_single(struct request_queue *q, int cpu, */ void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part) { + struct hd_struct *part2 = NULL; unsigned long now = jiffies; + unsigned int inflight[2]; + int stats = 0; + + if (part->stamp != now) + stats |= 1; + + if (part->partno) { + part2 = &part_to_disk(part)->part0; + if (part2->stamp != now) + stats |= 
2;
+	}
+
+	if (!stats)
+		return;
+
+	part_in_flight_double(q, part, part2, inflight);
-	if (part->partno)
-		part_round_stats_single(q, cpu, &part_to_disk(part)->part0,
-					now);
-	part_round_stats_single(q, cpu, part, now);
+	if (stats & 2)
+		part_round_stats_single(q, cpu, part2, now, inflight[1]);
+	if (stats & 1)
+		part_round_stats_single(q, cpu, part, now, inflight[0]);
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37035891e120..03d164917160 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -87,8 +87,9 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 }
 
 struct mq_inflight {
-	struct hd_struct *part;
-	unsigned int inflight;
+	struct hd_struct *part1;
+	struct hd_struct *part2;
+	unsigned int *inflight;
 };
 
 static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
@@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
 {
 	struct mq_inflight *mi = priv;
 
-	if (rq->part == mi->part)
-		mi->inflight++;
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+		return;
+	if (rq->part == mi->part1) {
+		mi->inflight[0]++;
+		if (mi->part1->partno &&
+		    &part_to_disk(mi->part1)->part0 == mi->part2)
+			mi->inflight[1]++;
+	} else if (rq->part == mi->part2)
+		mi->inflight[1]++;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q,
-			      struct hd_struct *part)
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+		      struct hd_struct *part2, unsigned int *inflight)
 {
-	struct mq_inflight mi = { .part = part, .inflight = 0 };
+	struct mq_inflight mi = { .part1 = part1, .part2 = part2,
+				  .inflight = inflight };
 
+	inflight[0] = inflight[1] = 0;
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
-	return mi.inflight;
 }
 
 void blk_freeze_queue_start(struct request_queue *q)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index cade1a512a01..c5264a64fb7c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,6 +138,7 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
-unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part);
+void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
+		      struct hd_struct *part2, unsigned int *inflight);
 
 #endif
diff --git a/block/genhd.c b/block/genhd.c
index ad5dc567d57f..6faacccd8453 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -67,13 +67,32 @@ void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw)
 
 int part_in_flight(struct request_queue *q, struct hd_struct *part)
 {
-	if (q->mq_ops)
-		return blk_mq_in_flight(q, part);
+	if (q->mq_ops) {
+		unsigned int inflight[2];
+
+		blk_mq_in_flight(q, part, NULL, inflight);
+		return inflight[0];
+	}
 
 	return atomic_read(&part->in_flight[0]) +
 		atomic_read(&part->in_flight[1]);
 }
 
+void part_in_flight_double(struct request_queue *q, struct hd_struct *part1,
+			   struct hd_struct *part2, unsigned int *inflight)
+{
+	if (q->mq_ops) {
+		blk_mq_in_flight(q, part1, part2, inflight);
+		return;
+	}
+
+	inflight[0] = atomic_read(&part1->in_flight[0]) +
+			atomic_read(&part1->in_flight[1]);
+	if (part2) {
+		inflight[1] = atomic_read(&part2->in_flight[0]) +
+				atomic_read(&part2->in_flight[1]);
+	}
+}
+
 /**
  * disk_get_part - get partition
  * @disk: disk to look partition from
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index f2c5096b3a7e..408690079143 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -363,6 +363,8 @@ static inline void free_part_stats(struct hd_struct *part)
 	part_stat_add(cpu, gendiskp, field, -subnd)
 
 int part_in_flight(struct request_queue *q, struct hd_struct *part);
+void part_in_flight_double(struct request_queue *q, struct hd_struct *part1,
+			   struct hd_struct *part2, unsigned int *inflight);
 void part_dec_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
 void part_inc_in_flight(struct request_queue *q, struct hd_struct *part, int rw);
-- 
2.7.4
* Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
  2017-08-03 20:01 ` [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time Jens Axboe
@ 2017-08-03 21:29 ` Bart Van Assche
  2017-08-03 22:38   ` Jens Axboe
  2017-08-03 22:30 ` Bart Van Assche
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2017-08-03 21:29 UTC (permalink / raw)
To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com

On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct mq_inflight *mi = priv;
>  
> -	if (rq->part == mi->part)
> -		mi->inflight++;
> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +		return;

Should the REQ_ATOM_STARTED test perhaps have been introduced in patch 3/4
instead of in this patch?

> +	if (rq->part == mi->part1) {
> +		mi->inflight[0]++;
> +		if (mi->part1->partno &&
> +		    &part_to_disk(mi->part1)->part0 == mi->part2)
> +			mi->inflight[1]++;
> +	} else if (rq->part == mi->part2)
> +		mi->inflight[1]++;
> }

So mi->part2 may represent part0 but mi->part1 not? Does that deserve a comment?

Additionally, shouldn't the mi->part2 == part0 test be moved out of the
if-statement such that all requests are counted for part0 instead of storing
the same count in inflight[0] and inflight[1] if mi->part2 == part0?

> -unsigned int blk_mq_in_flight(struct request_queue *q,
> -			      struct hd_struct *part)
> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
> +		      struct hd_struct *part2, unsigned int *inflight)
> {

Should inflight be declared as an array to make it clear that it is a pointer
to an array with two elements?

Thanks,

Bart.
* Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
  2017-08-03 21:29 ` Bart Van Assche
@ 2017-08-03 22:38   ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2017-08-03 22:38 UTC (permalink / raw)
To: Bart Van Assche, linux-block@vger.kernel.org; +Cc: brking@linux.vnet.ibm.com

On 08/03/2017 03:29 PM, Bart Van Assche wrote:
> On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
>>  static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>> @@ -97,17 +98,25 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>>  {
>>  	struct mq_inflight *mi = priv;
>>  
>> -	if (rq->part == mi->part)
>> -		mi->inflight++;
>> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>> +		return;
>
> Should the REQ_ATOM_STARTED test perhaps have been introduced in patch 3/4
> instead of in this patch?

It should, moved.

>> +	if (rq->part == mi->part1) {
>> +		mi->inflight[0]++;
>> +		if (mi->part1->partno &&
>> +		    &part_to_disk(mi->part1)->part0 == mi->part2)
>> +			mi->inflight[1]++;
>> +	} else if (rq->part == mi->part2)
>> +		mi->inflight[1]++;
>> }
>
> So mi->part2 may represent part0 but mi->part1 not? Does that deserve a comment?
>
> Additionally, shouldn't the mi->part2 == part0 test be moved out of the
> if-statement such that all requests are counted for part0 instead of storing
> the same count in inflight[0] and inflight[1] if mi->part2 == part0?

I think I'll just clean up the whole thing and get rid of part1/part2.
The two can only exist together when part1 is a partition, so it will be
easier to just unconditionally sum part + root whenever part is a
partition. Then we only need to pass in 'part', not part1/part2.

>> -unsigned int blk_mq_in_flight(struct request_queue *q,
>> -			      struct hd_struct *part)
>> +void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part1,
>> +		      struct hd_struct *part2, unsigned int *inflight)
>> {
>
> Should inflight be declared as an array to make it clear that it is a pointer
> to an array with two elements?

Sure, I can do that.

-- 
Jens Axboe
* Re: [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time
  2017-08-03 20:01 ` [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time Jens Axboe
  2017-08-03 21:29 ` Bart Van Assche
@ 2017-08-03 22:30 ` Bart Van Assche
  1 sibling, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2017-08-03 22:30 UTC (permalink / raw)
To: linux-block@vger.kernel.org, axboe@kernel.dk; +Cc: brking@linux.vnet.ibm.com

On Thu, 2017-08-03 at 14:01 -0600, Jens Axboe wrote:
> We only need to iterate the queues and tags once, we if just pass

Hello Jens,

Regarding the patch description: it seems to me like in the text "we if"
these two words appear in the wrong order?

Bart.
end of thread, other threads:[~2017-08-07 19:54 UTC | newest]

Thread overview: 24+ messages
2017-08-03 20:01 [PATCH 0/4] block: more scalable inflight tracking Jens Axboe
2017-08-03 20:01 ` [PATCH 1/4] blk-mq-tag: check for NULL rq when iterating tags Jens Axboe
2017-08-03 20:29   ` Bart Van Assche
2017-08-03 20:35     ` Jens Axboe
2017-08-03 20:40       ` Jens Axboe
2017-08-03 20:50         ` Bart Van Assche
2017-08-03 20:56           ` Jens Axboe
2017-08-03 20:01 ` [PATCH 2/4] block: pass in queue to inflight accounting Jens Axboe
2017-08-03 20:35   ` Bart Van Assche
2017-08-03 20:37     ` Jens Axboe
2017-08-03 20:01 ` [PATCH 3/4] blk-mq: provide internal in-flight variant Jens Axboe
2017-08-03 20:41   ` Bart Van Assche
2017-08-03 20:45     ` Jens Axboe
2017-08-03 20:54       ` Bart Van Assche
2017-08-03 21:25         ` Bart Van Assche
2017-08-03 22:36           ` Jens Axboe
2017-08-04 11:17 ` Ming Lei
2017-08-04 13:55   ` Jens Axboe
2017-08-04 22:19     ` Ming Lei
2017-08-07 19:54       ` Brian King
2017-08-03 20:01 ` [PATCH 4/4] blk-mq: enable checking two part inflight counts at the same time Jens Axboe
2017-08-03 21:29   ` Bart Van Assche
2017-08-03 22:38     ` Jens Axboe
2017-08-03 22:30   ` Bart Van Assche