From: Kashyap Desai <kashyap.desai@broadcom.com>
To: John Garry <john.garry@huawei.com>,
	axboe@kernel.dk, jejb@linux.ibm.com, martin.petersen@oracle.com,
	ming.lei@redhat.com, bvanassche@acm.org, hare@suse.de,
	don.brace@microsemi.com, Sumit Saxena <sumit.saxena@broadcom.com>,
	hch@infradead.org,
	Shivasharan Srikanteshwara 
	<shivasharan.srikanteshwara@broadcom.com>
Cc: chenxiang66@hisilicon.com, linux-block@vger.kernel.org,
	linux-scsi@vger.kernel.org, esc.storagedev@microsemi.com,
	Hannes Reinecke <hare@suse.com>
Subject: RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
Date: Fri, 24 Apr 2020 22:01:32 +0530
Message-ID: <d2ae343770a83466b870a33ffae5fa23@mail.gmail.com>
In-Reply-To: <e0c5a076-9fe5-4401-fd41-97f457888ad3@huawei.com>

> >
> >>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with
> >>> "mq-
> >>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
> >>> performance drops alot, to maybe 700K IOPS. Do you have a similar
> >>> experience?
> >>
> >> I am using mq-deadline only for HDD. I have not tried on SSD since it
> >> is not useful scheduler for SSDs.
> >>
> >
> > I ask as I only have SAS SSDs to test.
> >
> >> I noticed that when I used mq-deadline, performance drop starts if I
> >> have
> >> more number of drives.
> >> I am running <fio> script which has 64 Drives, 64 thread and all
> >> treads are
> >> bound to local numa node which has 36 logical cores.
> >> I noticed that lock contention is in " dd_dispatch_request". I am not
> >> sure
> >> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.
> >
> > So this could be just pre-existing issue of exposing multiple queues for
> > SCSI HBAs combined with mq-deadline iosched. I mean, that's really the
> > only significant change in this series, apart from the shared sbitmap,
> > and, at this point, I don't think that is the issue.
>
> As an experiment, I modified hisi_sas mainline driver to expose hw
> queues and manage tags itself, and I see the same issue I mentioned:
>
> Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0
> iops] [eta 0
>
> The odd time I see "sched: RT throttling activated" around the time the
> throughput falls. I think issue is the per-queue threaded irq threaded
> handlers consuming too many cycles. With "none" io scheduler, IOPS is
> flat at around 2M.
>
> >
> >>
> >> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> >> {
> >>          struct deadline_data *dd =
> >> hctx->queue->elevator->elevator_data;
> >>          struct request *rq;
> >>
> >>          spin_lock(&dd->lock);
> >
> > So if multiple hctx's are accessing this lock, then much contention
> > possible.
> >
> >>          rq = __dd_dispatch_request(dd);
> >>          spin_unlock(&dd->lock);
> >>
> >>          return rq;
> >> }
> >>
> >> Here is perf report -
> >>
> >> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>       0.99% ret_from_fork
> >>      -   kthread
> >>        - worker_thread
> >>           - 0.98% process_one_work
> >>              - 0.98% __blk_mq_run_hw_queue
> >>                 - blk_mq_sched_dispatch_requests
> >>                    - 0.98% blk_mq_do_dispatch_sched
> >>                       - 0.97% dd_dispatch_request
> >>                          + 0.97% queued_spin_lock_slowpath
> >> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>
> >
> > I'll try to capture a perf report and compare to mine.
>
> Mine is spending a huge amount of time (circa 33% on a cpu servicing
> completion irqs) in mod_delayed_work_on():
>
> --79.89%--sas_scsi_task_done |
>     |--76.72%--scsi_mq_done
>     |    |
>     |     --76.53%--blk_mq_complete_request
>     |    |
>     |    |--74.81%--scsi_softirq_done
>     |    |    |
>     |    |     --73.91%--scsi_finish_command
>     |    |    |
>     |    |    |--72.11%--scsi_io_completion
>     |    |    |    |
>     |    |    |     --71.89%--scsi_end_request
>     |    |    |    |
>     |    |    |    |--40.82%--blk_mq_run_hw_queues
>     |    |    |    |    |
>     |    |    |    |    |--35.86%--blk_mq_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.59%--__blk_mq_delay_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.38%--kblockd_mod_delayed_work_on
>     |    |    |    |    |          |
>     |    |    |    |    |                --33.31%--mod_delayed_work_on
>
> hmmmm...

I did some more experiments. It looks like the issue occurs with both the
<none> and <mq-deadline> schedulers. Let me simplify by describing what
happens with ioscheduler = <none>.

Old driver, which has nr_hw_queue = 1; I issue IOs from <fio> with queue
depth = 128. We get 3.1M IOPS in this config. This workload eventually
exhausts host can_queue.
Note - Very low contention in sbitmap_get()

-   23.58%     0.25%  fio              [kernel.vmlinux]            [k] blk_mq_make_request
   - 23.33% blk_mq_make_request
      - 21.68% blk_mq_get_request
         - 20.19% blk_mq_get_tag
            + 10.08% prepare_to_wait_exclusive
            + 4.51% io_schedule
            - 3.59% __sbitmap_queue_get
               - 2.82% sbitmap_get
                    0.86% __sbitmap_get_word
                    0.75% _raw_spin_lock_irqsave
                    0.55% _raw_spin_unlock_irqrestore

Driver with RFC, which has nr_hw_queue = N; I issue IOs from <fio> with queue
depth = 128. We get 2.3M IOPS in this config. This workload eventually
exhausts host can_queue.
Note - Very high contention in sbitmap_get()

-   42.39%     0.12%  fio              [kernel.vmlinux]            [k] generic_make_request
   - 42.27% generic_make_request
      - 41.00% blk_mq_make_request
         - 38.28% blk_mq_get_request
            - 33.76% blk_mq_get_tag
               - 30.25% __sbitmap_queue_get
                  - 29.90% sbitmap_get
                     + 9.06% _raw_spin_lock_irqsave
                     + 7.94% _raw_spin_unlock_irqrestore
                     + 3.86% __sbitmap_get_word
                     + 1.78% call_function_single_interrupt
                     + 0.67% ret_from_intr
               + 1.69% io_schedule
                 0.59% prepare_to_wait_exclusive
                 0.55% __blk_mq_get_tag

In this particular case, I observed that several alloc_hint entries are zero,
which means sbitmap_get() was not able to find a free tag starting from the
hint. That may lead to contention.
This condition does not occur with the nr_hw_queue=1 (without RFC) driver.

alloc_hint=
{663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
1892, 1669, 2420,
3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
418, 4501, 3717,
2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
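
To illustrate how a per-CPU hint can end up as zero, here is a minimal
userspace sketch. It is only a toy model loosely based on my reading of the
sbitmap hint handling of that era, not the kernel code itself. All toy_*
names, the depth of 8 and the 4 CPUs are hypothetical and chosen for
illustration only.

```c
#include <stdbool.h>

#define TOY_DEPTH   8   /* hypothetical tag depth, kept tiny for illustration */
#define TOY_NR_CPUS 4   /* hypothetical number of submitting CPUs */

static bool toy_busy[TOY_DEPTH];                            /* one flag per tag */
static unsigned int toy_hint[TOY_NR_CPUS] = { 1, 3, 5, 6 }; /* arbitrary start hints */

/* Scan from 'hint', wrapping once; return a free tag or -1 if all are busy. */
static int toy_get(unsigned int hint)
{
	for (unsigned int i = 0; i < TOY_DEPTH; i++) {
		unsigned int nr = (hint + i) % TOY_DEPTH;

		if (!toy_busy[nr]) {
			toy_busy[nr] = true;
			return (int)nr;
		}
	}
	return -1;
}

/* Allocate a tag on behalf of 'cpu' and update that CPU's hint. */
static int toy_queue_get(unsigned int cpu)
{
	unsigned int hint = toy_hint[cpu];
	int nr = toy_get(hint);

	if (nr == -1) {
		toy_hint[cpu] = 0;	/* map is full: the hint is reset */
	} else if ((unsigned int)nr == hint) {
		/* hint was used directly: advance it, wrapping near the end */
		toy_hint[cpu] = (hint + 1 >= TOY_DEPTH - 1) ? 0 : hint + 1;
	}
	return nr;
}

/* Release a tag so that it can be allocated again. */
static void toy_put(unsigned int tag)
{
	toy_busy[tag] = false;
}

/* Let every CPU allocate in turn for 'rounds' rounds; count zeroed hints. */
static unsigned int toy_zero_hints_after(unsigned int rounds)
{
	unsigned int zeros = 0;

	for (unsigned int r = 0; r < rounds; r++)
		for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			(void)toy_queue_get(cpu);

	for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		zeros += (toy_hint[cpu] == 0);
	return zeros;
}
```

In this model, once the 8 tags are all taken, every CPU that fails an
allocation has its hint reset to 0. After a tag is released with toy_put(),
each of those CPUs starts its next scan from the same position, the beginning
of the map. Whether that is what causes the contention seen in the perf
report above is not established here.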

Driver with RFC, which has nr_hw_queue = N; I issue IOs from <fio> with queue
depth = 32. We get 3.1M IOPS in this config. This workload does *not*
exhaust host can_queue.

-    5.07%     0.14%  fio              [kernel.vmlinux]  [k] generic_make_request
   - 4.93% generic_make_request
      - 3.61% blk_mq_make_request
         - 2.04% blk_mq_get_request
            - 1.08% blk_mq_get_tag
               - 0.70% __sbitmap_queue_get
                    0.67% sbitmap_get
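
The difference between the queue depth = 128 and queue depth = 32 runs can be
put as simple arithmetic. The sketch below is only illustrative. It assumes
that the 64 drives mentioned earlier in the thread are each driven at the
full fio queue depth. The can_queue value of 5000 used in the note after the
code is purely hypothetical; the real value is not stated in this mail. The
toy_* helpers are made up for this example.

```c
#include <stdbool.h>

/*
 * Worst-case number of commands the submitters try to keep in flight:
 * every device is driven at the full fio queue depth.
 */
static unsigned int toy_demand(unsigned int nr_devices, unsigned int iodepth)
{
	return nr_devices * iodepth;
}

/* True when the offered load is larger than the host-wide tag budget. */
static bool toy_exhausts_can_queue(unsigned int nr_devices,
				   unsigned int iodepth,
				   unsigned int can_queue)
{
	return toy_demand(nr_devices, iodepth) > can_queue;
}
```

With these assumed numbers, toy_exhausts_can_queue(64, 128, 5000) is true
(64 * 128 = 8192 commands offered) while toy_exhausts_can_queue(64, 32, 5000)
is false (64 * 32 = 2048), which matches the behaviour described for the two
RFC runs.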

In summary, the RFC has a performance bottleneck in sbitmap_get() when the
number of outstanding commands per shost is about to exhaust can_queue.
Without this RFC the driver also runs with nr_hw_queue = 1, and that case is
handled very well.
I am not sure why it happens only with the shared host tag. Theoretically,
all the hctxs share the same tag bitmap, which is the same as nr_hw_queue=1,
so why is the contention only visible in the shared host tag case?

If you want to reproduce this issue, you may have to reduce can_queue in the
hisi_sas driver.

Kashyap

>
> Thanks,
> John
