From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.garry@huawei.com>,
axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com,
don.brace@microsemi.com, Sumit Saxena <sumit.saxena@broadcom.com>,
bvanassche@acm.org, hare@suse.com, hch@lst.de,
Shivasharan Srikanteshwara
<shivasharan.srikanteshwara@broadcom.com>,
linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
esc.storagedev@microsemi.com, chenxiang66@hisilicon.com,
"PDL,MEGARAIDLINUX" <megaraidlinux.pdl@broadcom.com>
Subject: RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
Date: Wed, 17 Jun 2020 16:56:50 +0530 [thread overview]
Message-ID: <f9a05331a46a8c60c10e35df4aa08c45@mail.gmail.com> (raw)
In-Reply-To: <20200616010055.GA27192@T590>
> On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > > >
> > > > John -
> > > >
> > > > I tried V7 series and debug further on mq-deadline interface. This
> > > > time I have used another setup since HDD based setup is not
> > > > readily available for me.
> > > > In fact, I was able to simulate issue very easily using single
> > > > scsi_device as well. BTW, this is not an issue with this RFC, but
> > generic issue.
> > > > Since I have converted nr_hw_queue > 1 for Broadcom product using
> > > > this RFC, It becomes noticeable now.
> > > >
> > > > Problem - Using below command I see heavy CPU utilization on "
> > > > native_queued_spin_lock_slowpath". This is because kblockd work
> > > > queue is submitting IO from all the CPUs even though fio is bound
> > > > to single
> > CPU.
> > > > Lock contention from " dd_dispatch_request" is causing this issue.
> > > >
> > > > numactl -C 13 fio
> > > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > > --numjobs=1 --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > > --group_reporting --filename=/dev/sdd
> > > >
> > > > While running above command, ideally we expect only kworker/13 to
> > > > be
> > > active.
> > > > But you can see below - All the CPU is attempting submission and
> > > > lots of CPU consumption is due to lock contention.
> > > >
> > > > PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+
> > COMMAND
> > > > 2726 root 0 -20 0 0 0 R 56.5 0.0
0:53.20
> > > > kworker/13:1H-k
> > > > 7815 root 20 0 712404 15536 2228 R 43.2 0.0
0:05.03
> > fio
> > > > 2792 root 0 -20 0 0 0 I 26.6 0.0
0:22.19
> > > > kworker/18:1H-k
> > > > 2791 root 0 -20 0 0 0 I 19.9 0.0
0:17.17
> > > > kworker/19:1H-k
> > > > 1419 root 0 -20 0 0 0 I 19.6 0.0
0:17.03
> > > > kworker/20:1H-k
> > > > 2793 root 0 -20 0 0 0 I 18.3 0.0
0:15.64
> > > > kworker/21:1H-k
> > > > 1424 root 0 -20 0 0 0 I 17.3 0.0
0:14.99
> > > > kworker/22:1H-k
> > > > 2626 root 0 -20 0 0 0 I 16.9 0.0
0:14.68
> > > > kworker/26:1H-k
> > > > 2794 root 0 -20 0 0 0 I 16.9 0.0
0:14.87
> > > > kworker/23:1H-k
> > > > 2795 root 0 -20 0 0 0 I 16.9 0.0
0:14.81
> > > > kworker/24:1H-k
> > > > 2797 root 0 -20 0 0 0 I 16.9 0.0
0:14.62
> > > > kworker/27:1H-k
> > > > 1415 root 0 -20 0 0 0 I 16.6 0.0
0:14.44
> > > > kworker/30:1H-k
> > > > 2669 root 0 -20 0 0 0 I 16.6 0.0
0:14.38
> > > > kworker/31:1H-k
> > > > 2796 root 0 -20 0 0 0 I 16.6 0.0
0:14.74
> > > > kworker/25:1H-k
> > > > 2799 root 0 -20 0 0 0 I 16.6 0.0
0:14.56
> > > > kworker/28:1H-k
> > > > 1425 root 0 -20 0 0 0 I 16.3 0.0
0:14.21
> > > > kworker/34:1H-k
> > > > 2746 root 0 -20 0 0 0 I 16.3 0.0
0:14.33
> > > > kworker/32:1H-k
> > > > 2798 root 0 -20 0 0 0 I 16.3 0.0
0:14.50
> > > > kworker/29:1H-k
> > > > 2800 root 0 -20 0 0 0 I 16.3 0.0
0:14.27
> > > > kworker/33:1H-k
> > > > 1423 root 0 -20 0 0 0 I 15.9 0.0
0:14.10
> > > > kworker/54:1H-k
> > > > 1784 root 0 -20 0 0 0 I 15.9 0.0
0:14.03
> > > > kworker/55:1H-k
> > > > 2801 root 0 -20 0 0 0 I 15.9 0.0
0:14.15
> > > > kworker/35:1H-k
> > > > 2815 root 0 -20 0 0 0 I 15.9 0.0
0:13.97
> > > > kworker/56:1H-k
> > > > 1484 root 0 -20 0 0 0 I 15.6 0.0
0:13.90
> > > > kworker/57:1H-k
> > > > 1485 root 0 -20 0 0 0 I 15.6 0.0
0:13.82
> > > > kworker/59:1H-k
> > > > 1519 root 0 -20 0 0 0 I 15.6 0.0
0:13.64
> > > > kworker/62:1H-k
> > > > 2315 root 0 -20 0 0 0 I 15.6 0.0
0:13.87
> > > > kworker/58:1H-k
> > > > 2627 root 0 -20 0 0 0 I 15.6 0.0
0:13.69
> > > > kworker/61:1H-k
> > > > 2816 root 0 -20 0 0 0 I 15.6 0.0
0:13.75
> > > > kworker/60:1H-k
> > > >
> > > >
> > > > I root cause this issue -
> > > >
> > > > Block layer always queue IO on hctx context mapped to CPU-13, but
> > > > hw queue run from all the hctx context.
> > > > I noticed in my test hctx48 has queued all the IOs. No other hctx
> > > > has queued IO. But all the hctx is counting for "run".
> > > >
> > > > # cat hctx48/queued
> > > > 2087058
> > > >
> > > > #cat hctx*/run
> > > > 151318
> > > > 30038
> > > > 83110
> > > > 50680
> > > > 69907
> > > > 60391
> > > > 111239
> > > > 18036
> > > > 33935
> > > > 91648
> > > > 34582
> > > > 22853
> > > > 61286
> > > > 19489
> > > >
> > > > Below patch has fix - "Run the hctx queue for which request was
> > > > completed instead of running all the hardware queue."
> > > > If this looks valid fix, please include in V8 OR I can post
> > > > separate patch for this. Just want to have some level of review
> > > > from this
> > discussion.
> > > >
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index 0652acd..f52118f 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > > struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > > > struct scsi_device *sdev = cmd->device;
> > > > struct request_queue *q = sdev->request_queue;
> > > > + struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > > >
> > > > if (blk_update_request(req, error, bytes))
> > > > return true;
> > > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > > !list_empty(&sdev->host->starved_list))
> > > > kblockd_schedule_work(&sdev->requeue_work);
> > > > else
> > > > - blk_mq_run_hw_queues(q, true);
> > > > + blk_mq_run_hw_queue(mq_hctx, true);
> > > > + //blk_mq_run_hw_queues(q, true);
> > >
> > > This way may cause IO hang because ->device_busy is shared by all
hctxs.
> >
> > From SCSI stack, if we attempt to run all h/w queue, is it possible
> > that block layer actually run hw_queue which has really not queued any
IO.
> > Currently, in case of mq-deadline, IOS are inserted using
> > "dd_insert_request". This function will add IOs on elevator data which
> > is per request queue and not per hctx.
> > When there is an attempt to run hctx, "blk_mq_sched_has_work" will
> > check pending work which is per request queue and not per hctx.
> > Because of this, IOs queued on only one hctx will be run from all the
> > hctx and this will create unnecessary lock contention.
>
> Deadline is supposed for HDD. slow disks, so the lock contention
shouldn't
> have been one problem given there is seldom MQ HDD. before this
patchset.
>
> Another related issue is default scheduler, I guess deadline still
should have
> been the default io sched for HDDs. attached to this kind HBA with
multiple
> reply queues and single submission queue.
>
> >
> > How about below patch - ?
> >
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index
> > 126021f..1d30bd3 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx) {
> > struct elevator_queue *e = hctx->queue->elevator;
> >
> > + /* If current hctx has not queued any request, there is no
> > + need to
> > run.
> > + * blk_mq_run_hw_queue() on hctx which has queued IO will
handle
> > + * running specific hctx.
> > + */
> > + if (!hctx->queued)
> > + return false;
> > +
> > if (e && e->type->ops.has_work)
> > return e->type->ops.has_work(hctx);
>
> ->queued is increased only and not decreased just for debug purpose so
> ->far, so
> it can't be relied for this purpose.
Thanks. I overlooked that that it is only incremental counter.
>
> One approach is to add one similar counter, and maintain it by
scheduler's
> insert/dispatch callback.
I tried below and I see performance is on expected range.
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fdcc2c1..ea201d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
bool at_head,
list_add(&rq->queuelist, &list);
e->type->ops.insert_requests(hctx, &list, at_head);
+ atomic_inc(&hctx->elevator_queued);
} else {
spin_lock(&ctx->lock);
__blk_mq_insert_request(hctx, rq, at_head);
@@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
blk_mq_hw_ctx *hctx,
percpu_ref_get(&q->q_usage_counter);
e = hctx->queue->elevator;
- if (e && e->type->ops.insert_requests)
+ if (e && e->type->ops.insert_requests) {
e->type->ops.insert_requests(hctx, list, false);
+ atomic_inc(&hctx->elevator_queued);
+ }
else {
/*
* try to issue requests directly if the hw queue isn't
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
blk_mq_hw_ctx *hctx)
{
struct elevator_queue *e = hctx->queue->elevator;
+ /* If current hctx has not queued any request, there is no need to
run.
+ * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+ * running specific hctx.
+ */
+ if (!atomic_read(&hctx->elevator_queued))
+ return false;
+
if (e && e->type->ops.has_work)
return e->type->ops.has_work(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..48f1824 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
if (rq->rq_flags & RQF_ELVPRIV) {
- if (e && e->type->ops.finish_request)
+ if (e && e->type->ops.finish_request) {
e->type->ops.finish_request(rq);
+ atomic_dec(&hctx->elevator_queued);
+ }
if (rq->elv.icq) {
put_io_context(rq->elv.icq->ioc);
rq->elv.icq = NULL;
@@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
blk_mq_tag_set *set,
goto free_hctx;
atomic_set(&hctx->nr_active, 0);
+ atomic_set(&hctx->elevator_queued, 0);
if (node == NUMA_NO_NODE)
node = set->numa_node;
hctx->numa_node = node;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
* shared across request queues.
*/
atomic_t nr_active;
+ /**
+ * @elevator_queued: Number of queued requests on hctx.
+ */
+ atomic_t elevator_queued;
/** @cpuhp_online: List to store request if CPU is going to die */
struct hlist_node cpuhp_online;
>
> Thanks,
> Ming
next prev parent reply other threads:[~2020-06-17 11:28 UTC|newest]
Thread overview: 123+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-06-11 2:57 ` Ming Lei
2020-06-11 8:26 ` John Garry
2020-06-23 11:25 ` John Garry
2020-06-23 14:23 ` Hannes Reinecke
2020-06-24 8:13 ` Kashyap Desai
2020-06-29 16:18 ` John Garry
2020-08-10 16:51 ` Kashyap Desai
2020-08-11 8:01 ` John Garry
2020-08-11 16:34 ` Kashyap Desai
2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-06-11 3:37 ` Ming Lei
2020-06-11 10:09 ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
2020-06-11 4:04 ` Ming Lei
2020-06-11 10:22 ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
2020-06-11 13:16 ` Hannes Reinecke
2020-06-11 14:22 ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
2020-06-11 13:19 ` Hannes Reinecke
2020-06-11 14:33 ` John Garry
2020-06-12 6:06 ` Hannes Reinecke
2020-06-29 15:32 ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
2020-06-30 6:33 ` Hannes Reinecke
2020-06-30 7:30 ` John Garry
2020-06-30 11:36 ` John Garry
2020-06-30 14:55 ` Bart Van Assche
2020-07-13 9:41 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
2020-07-13 12:20 ` Hannes Reinecke
2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
2020-07-02 10:23 ` Kashyap Desai
2020-07-06 8:23 ` John Garry
2020-07-06 8:45 ` Hannes Reinecke
2020-07-06 9:26 ` John Garry
2020-07-06 9:40 ` Hannes Reinecke
2020-07-06 19:19 ` Kashyap Desai
2020-07-07 7:58 ` John Garry
2020-07-07 14:45 ` Kashyap Desai
2020-07-07 16:17 ` John Garry
2020-07-09 19:01 ` Kashyap Desai
2020-07-10 8:10 ` John Garry
2020-07-13 7:55 ` Kashyap Desai
2020-07-13 8:42 ` John Garry
2020-07-19 19:07 ` Kashyap Desai
2020-07-20 7:23 ` Kashyap Desai
2020-07-20 9:18 ` John Garry
2020-07-21 1:13 ` Ming Lei
2020-07-21 6:53 ` Kashyap Desai
2020-07-22 4:12 ` Ming Lei
2020-07-22 5:30 ` Kashyap Desai
2020-07-22 8:04 ` Ming Lei
2020-07-22 9:32 ` John Garry
2020-07-23 14:07 ` Ming Lei
2020-07-23 17:29 ` John Garry
2020-07-24 2:47 ` Ming Lei
2020-07-28 7:54 ` John Garry
2020-07-28 8:45 ` Ming Lei
2020-07-29 5:25 ` Kashyap Desai
2020-07-29 15:36 ` Ming Lei
2020-07-29 18:31 ` Kashyap Desai
2020-08-04 8:36 ` Ming Lei
2020-08-04 9:27 ` Kashyap Desai
2020-08-05 8:40 ` Ming Lei
2020-08-06 10:25 ` Kashyap Desai
2020-08-06 13:38 ` Ming Lei
2020-08-06 14:37 ` Kashyap Desai
2020-08-06 15:29 ` Ming Lei
2020-08-08 19:05 ` Kashyap Desai
2020-08-09 2:16 ` Ming Lei
2020-08-10 16:38 ` Kashyap Desai
2020-08-11 8:09 ` John Garry
2020-08-04 17:00 ` John Garry
2020-08-05 2:56 ` Ming Lei
2020-07-28 8:01 ` Kashyap Desai
2020-07-08 11:31 ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
2020-07-14 13:16 ` John Garry
2020-07-14 13:31 ` John Garry
2020-07-14 18:16 ` Don.Brace
2020-07-15 7:28 ` John Garry
2020-07-14 14:02 ` Hannes Reinecke
2020-08-18 8:33 ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
2020-07-14 7:37 ` John Garry
2020-07-14 7:41 ` Hannes Reinecke
2020-07-14 7:52 ` John Garry
2020-07-14 8:06 ` Ming Lei
2020-07-14 9:53 ` John Garry
2020-07-14 10:14 ` Ming Lei
2020-07-14 10:43 ` Hannes Reinecke
2020-07-14 10:19 ` Hannes Reinecke
2020-07-14 10:35 ` John Garry
2020-07-14 10:44 ` Ming Lei
2020-07-14 10:52 ` John Garry
2020-07-14 12:04 ` Ming Lei
2020-08-03 20:39 ` Don.Brace
2020-08-04 9:27 ` John Garry
2020-08-04 15:18 ` Don.Brace
2020-08-05 11:21 ` John Garry
2020-08-14 21:04 ` Don.Brace
2020-08-17 8:00 ` John Garry
2020-08-17 18:39 ` Don.Brace
2020-08-18 7:14 ` Hannes Reinecke
2020-07-16 16:14 ` Don.Brace
2020-07-16 19:45 ` Don.Brace
2020-07-17 10:11 ` John Garry
2020-06-11 3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
2020-06-11 9:35 ` John Garry
2020-06-12 18:47 ` Kashyap Desai
2020-06-15 2:13 ` Ming Lei
2020-06-15 6:57 ` Kashyap Desai
2020-06-16 1:00 ` Ming Lei
2020-06-17 11:26 ` Kashyap Desai [this message]
2020-06-22 6:24 ` Hannes Reinecke
2020-06-23 0:55 ` Ming Lei
2020-06-23 11:50 ` Kashyap Desai
2020-06-23 12:11 ` Kashyap Desai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f9a05331a46a8c60c10e35df4aa08c45@mail.gmail.com \
--to=kashyap.desai@broadcom.com \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=chenxiang66@hisilicon.com \
--cc=don.brace@microsemi.com \
--cc=esc.storagedev@microsemi.com \
--cc=hare@suse.com \
--cc=hch@lst.de \
--cc=jejb@linux.ibm.com \
--cc=john.garry@huawei.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=megaraidlinux.pdl@broadcom.com \
--cc=ming.lei@redhat.com \
--cc=shivasharan.srikanteshwara@broadcom.com \
--cc=sumit.saxena@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.