From: Kashyap Desai <kashyap.desai@broadcom.com>
To: John Garry <john.garry@huawei.com>,
axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com,
ming.lei@redhat.com, bvanassche@acm.org, hare@suse.de,
don.brace@microsemi.com, Sumit Saxena <sumit.saxena@broadcom.com>,
hch@infradead.org,
Shivasharan Srikanteshwara
<shivasharan.srikanteshwara@broadcom.com>
Cc: "chenxiang (M)" <chenxiang66@hisilicon.com>,
linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
esc.storagedev@microsemi.com, Hannes Reinecke <hare@suse.com>
Subject: RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
Date: Tue, 28 Apr 2020 00:28:49 +0530 [thread overview]
Message-ID: <f712fd935562dcff73f0f6cf15f9cce7@mail.gmail.com> (raw)
In-Reply-To: <8e16d68b-4d71-58f1-ede9-92ffe5d65ba9@huawei.com>
> Hi Kashyap,
>
> >> hmmmm...
> >
> > I did some more experiments. It looks like issue is with both <none>
> > and <mq-deadline> scheduler. Let me simplify what happens with
> > ioscheduler = <none>.
>
> I know it's good to compare like-for-like, but, as I understand, "none"
> is more suited for MQ host, while deadline is more suited for SQ host.
I am using <mq-deadline> which is MQ version of SQ deadline.
For now we can just consider case without scheduler.
I have some more findings. May be someone from upstream community can
connect the dots.
#1. hctx_may_queue() throttle the IO at hctx level. This is eventually per
sdev throttling for megaraid_sas driver because It creates only one
context - hctx0 for each scsi device.
If driver is using only one h/w queue, active_queues will be always steady.
In my test it was 64 thread, so active_queues=64.
Even though <fio> thread is shared among allowed cpumask
(cpus_allowed_policy=shared option in fio), active_queues will be always 64
because we have only one h/w queue.
All the logical cpus are mapped to one h/w queue. It means, thread moving
from one cpu to another cpu will not change active_queues per hctx.
In case of this RFC, active_queues are now per hctx and there are multiple
hctx, but tags are shared. This can create unwanted throttling and
eventually more lock contention in sbitmap.
I added below patch and things improved a bit, but not a full proof.
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..c708fbc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
* For shared tag users, we track the number of currently active users
* and attempt to provide a fair share of the tag depth for each of them.
*/
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct request_queue *q,
+ struct blk_mq_hw_ctx *hctx,
struct sbitmap_queue *bt)
{
- unsigned int depth, users;
+ unsigned int depth, users, i, outstanding = 0;
+ struct blk_mq_hw_ctx *hctx_iter;
if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
return true;
@@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
*hctx,
* Allow at least some tags
*/
depth = max((bt->sb.depth + users - 1) / users, 4U);
- return atomic_read(&hctx->nr_active) < depth;
+
+ queue_for_each_hw_ctx(q, hctx_iter, i)
+ outstanding += atomic_read(&hctx_iter->nr_active);
+
+ return outstanding < depth;
}
static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
struct sbitmap_queue *bt)
{
if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
- !hctx_may_queue(data->hctx, bt))
#2 - In completion path, scsi module call blk_mq_run_hw_queues() upon IO
completion. I am not sure if it is good to run all the h/w queue or just
h/w queue of current reference is good enough.
Below patch helped to reduce contention in hcxt_lock().
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41..f72de2a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
blk_status_t error,
struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
struct scsi_device *sdev = cmd->device;
struct request_queue *q = sdev->request_queue;
+ struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
if (blk_update_request(req, error, bytes))
return true;
@@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
blk_status_t error,
!list_empty(&sdev->host->starved_list))
kblockd_schedule_work(&sdev->requeue_work);
else
- blk_mq_run_hw_queues(q, true);
+ blk_mq_run_hw_queue(mq_hctx, true);
percpu_ref_put(&q->q_usage_counter);
return false;
#3 - __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not be
optimal for shared queue.
There is a penalty if we are calling __sbq_wake_up() frequently. In case of
nr_hw_queue = 1, things are better because one hctx and hctx->state will
avoid multiple calls.
If blk_mq_tag_wakeup_all is called from hctx0 context, there is no need to
call from hctx1, hctx2 etc.
I have added below patch in my testing.
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..5b331e5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
atomic_dec(&tags->active_queues);
- blk_mq_tag_wakeup_all(tags, false);
+ /* TBD - Do this only for hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED */
+ if (hctx->queue_num == 0)
+ blk_mq_tag_wakeup_all(tags, false);
}
/*
With all above mentioned changes, I see performance improved from 2.2M IOPS
to 2.7M on same workload and profile.
Kashyap
>
> >
> > Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio> queue
> > depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
> > host can_queue.
>
> So I think I need to find a point where we start to get throttled.
>
> > Note - Very low contention in sbitmap_get()
> >
> > - 23.58% 0.25% fio [kernel.vmlinux] [k]
> > blk_mq_make_request
> > - 23.33% blk_mq_make_request
> > - 21.68% blk_mq_get_request
> > - 20.19% blk_mq_get_tag
> > + 10.08% prepare_to_wait_exclusive
> > + 4.51% io_schedule
> > - 3.59% __sbitmap_queue_get
> > - 2.82% sbitmap_get
> > 0.86% __sbitmap_get_word
> > 0.75% _raw_spin_lock_irqsave
> > 0.55% _raw_spin_unlock_irqrestore
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 128. We get 2.3 M IOPS in this config. This eventually
> > exhaust host can_queue.
> > Note - Very high contention in sbitmap_get()
> >
> > - 42.39% 0.12% fio [kernel.vmlinux] [k]
> > generic_make_request
> > - 42.27% generic_make_request
> > - 41.00% blk_mq_make_request
> > - 38.28% blk_mq_get_request
> > - 33.76% blk_mq_get_tag
> > - 30.25% __sbitmap_queue_get
> > - 29.90% sbitmap_get
> > + 9.06% _raw_spin_lock_irqsave
> > + 7.94% _raw_spin_unlock_irqrestore
> > + 3.86% __sbitmap_get_word
> > + 1.78% call_function_single_interrupt
> > + 0.67% ret_from_intr
> > + 1.69% io_schedule
> > 0.59% prepare_to_wait_exclusive
> > 0.55% __blk_mq_get_tag
> >
> > In this particular case, I observed alloc_hint = zeros which means,
> > sbitmap_get is not able to find free tags from hint. That may lead to
> > contention.
> > This condition is not happening with nr_hw_queue=1 (without RFC) driver.
> >
> > alloc_hint=
> > {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
> > ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
> > 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
> > 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
> > 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
> > 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 32. We get 3.1M IOPS in this config. This workload does
> > *not* exhaust host can_queue.
>
> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is as
> per
> RFC, and I don't think you modified from the RFC for your test.
> But I just wanted to mention that to be crystal clear.
Yes I have two separate driver copy. One with RFC change and another without
RFC.
>
> >
> > - 5.07% 0.14% fio [kernel.vmlinux] [k]
> > generic_make_request
> > - 4.93% generic_make_request
> > - 3.61% blk_mq_make_request
> > - 2.04% blk_mq_get_request
> > - 1.08% blk_mq_get_tag
> > - 0.70% __sbitmap_queue_get
> > 0.67% sbitmap_get
> >
> > In summary, RFC has some performance bottleneck in sbitmap_get () if
> > outstanding per shost is about to exhaust. Without this RFC also
> > driver works in nr_hw_queue = 1, but that case is managed very well.
> > I am not sure why it happens only with shared host tag ? Theoretically
> > all the hctx is sharing the same bitmaptag which is same as
> > nr_hw_queue=1, so why contention is only visible in shared host tag
> > case.
>
> Let me check this.
>
> >
> > If you want to reproduce this issue, may be you have to reduce the
> > can_queue in hisi_sas driver.
> >
>
> Thanks,
> John
next prev parent reply other threads:[~2020-04-27 18:58 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-03-05 11:54 ` [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-03-05 11:54 ` [PATCH RFC v6 02/10] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-03-05 11:54 ` [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-03-05 12:42 ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-03-05 12:49 ` Hannes Reinecke
2020-03-05 13:52 ` John Garry
2020-03-05 11:54 ` [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
2020-03-05 12:52 ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset' John Garry
2020-03-06 11:12 ` John Garry
2020-03-05 11:54 ` [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-03-05 12:52 ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters " John Garry
2020-04-07 11:14 ` Kashyap Desai
2020-04-08 9:33 ` John Garry
2020-04-08 9:59 ` Kashyap Desai
2020-04-17 16:46 ` John Garry
2020-04-20 17:47 ` Kashyap Desai
2020-04-21 12:35 ` John Garry
2020-04-22 18:59 ` Kashyap Desai
2020-04-22 21:28 ` John Garry
2020-04-23 16:31 ` John Garry
2020-04-24 16:31 ` Kashyap Desai
2020-04-27 17:06 ` John Garry
2020-04-27 18:58 ` Kashyap Desai [this message]
2020-04-28 15:55 ` John Garry
2020-04-29 11:29 ` John Garry
2020-04-29 15:50 ` Kashyap Desai
2020-04-29 17:55 ` John Garry
2020-04-30 17:40 ` John Garry
2020-04-30 19:18 ` Kashyap Desai
2020-03-05 11:54 ` [PATCH RFC v6 09/10] smartpqi: enable host tagset John Garry
2020-03-05 11:54 ` [PATCH RFC v6 10/10] hpsa: enable host_tagset and switch to MQ John Garry
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f712fd935562dcff73f0f6cf15f9cce7@mail.gmail.com \
--to=kashyap.desai@broadcom.com \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=chenxiang66@hisilicon.com \
--cc=don.brace@microsemi.com \
--cc=esc.storagedev@microsemi.com \
--cc=hare@suse.com \
--cc=hare@suse.de \
--cc=hch@infradead.org \
--cc=jejb@linux.ibm.com \
--cc=john.garry@huawei.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-scsi@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=ming.lei@redhat.com \
--cc=shivasharan.srikanteshwara@broadcom.com \
--cc=sumit.saxena@broadcom.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).