From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Christoph Hellwig <hch@infradead.org>,
Mike Snitzer <snitzer@redhat.com>,
linux-scsi@vger.kernel.org, Arun Easi <arun.easi@cavium.com>,
Omar Sandoval <osandov@fb.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
James Bottomley <james.bottomley@hansenpartnership.com>,
Christoph Hellwig <hch@lst.de>,
Don Brace <don.brace@microsemi.com>,
Peter Rivera <peter.rivera@broadcom.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Laurence Oberman <loberman@redhat.com>
Subject: RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
Date: Wed, 14 Feb 2018 11:58:33 +0530
Message-ID: <8f52c08f51f9e2ff54aee5311670e6b4@mail.gmail.com>
In-Reply-To: <20180213004031.GA8109@ming.t460p>
> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Tuesday, February 13, 2018 6:11 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Sunday, February 11, 2018 11:01 AM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > > Hi Kashyap,
> > > >
> > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > > -----Original Message-----
> > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > > > To: Kashyap Desai
> > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > Arun Easi; Omar Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace; Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > > -----Original Message-----
> > > > > > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > > > To: Hannes Reinecke
> > > > > > > > Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval;
> > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > > Don Brace; Peter
> > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > > tags & introduce force_blk_mq
> > > > > > > >
> > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
> > wrote:
> > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > > >> -----Original Message-----
> > > > > > > > > >> From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > > > >> To: Hannes Reinecke
> > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval;
> > > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph
> > > > > > > > > >> Hellwig; Don Brace; Peter
> > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support
> > > > > > > > > >> global tags & introduce force_blk_mq
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes
> > > > > > > > > >> Reinecke wrote:
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> [ .. ]
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us your patch for enabling
> > > > > > > > > >>>>> global_tags/MQ on megaraid_sas
> > > > > > > > > >>>>> so that I can reproduce your test?
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4
> > > > > > > > > >>>>>> times more CPU.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization
> > > > > > > > > >>>>> effect is after applying the
> > > > > > > > > >>>>> patch V2? And your test script?
> > > > > > > > > >>>> Regarding CPU utilization, I need to test one more
> > > > > > > > > >>>> time. Currently the system is in use.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I ran the below fio test on a total of 24 expander-attached SSDs.
> > > > > > > > > >>>>
> > > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread
> > > > > > > > > >>>> --iodepth=64 --bs=4k --ioengine=libaio
> > > > > > > > > >>>> --rw=randread
> > > > > > > > > >>>>
> > > > > > > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > > > > > > >>>>
> > > > > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > > > > >>
> > > > > > > > > >> Hi Hannes,
> > > > > > > > > >>
> > > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch
> > > > > > > > > >> has a big issue, which causes only reply queue 0 to be used.
> > > > > > > > > >>
> > > > > > > > > >> [1]
> > > > > > > > > >> https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > > > > >>
> > > > > > > > > >> So could you guys run your performance test again
> > > > > > > > > >> after fixing the patch?
> > > > > > > > > >
> > > > > > > > > > Ming -
> > > > > > > > > >
> > > > > > > > > > I tried after the change you requested. The performance
> > > > > > > > > > drop is still unresolved: from 1.6M IOPS to 770K IOPS.
> > > > > > > > > >
> > > > > > > > > > See below data. All 24 reply queues are in use correctly.
> > > > > > > > > >
> > > > > > > > > > IRQs / 1 second(s)
> > > > > > > > > > IRQ# TOTAL NODE0 NODE1 NAME
> > > > > > > > > > 360 16422 0 16422 IR-PCI-MSI 70254653-edge megasas
> > > > > > > > > > 364 15980 0 15980 IR-PCI-MSI 70254657-edge megasas
> > > > > > > > > > 362 15979 0 15979 IR-PCI-MSI 70254655-edge megasas
> > > > > > > > > > 345 15696 0 15696 IR-PCI-MSI 70254638-edge megasas
> > > > > > > > > > 341 15659 0 15659 IR-PCI-MSI 70254634-edge megasas
> > > > > > > > > > 369 15656 0 15656 IR-PCI-MSI 70254662-edge megasas
> > > > > > > > > > 359 15650 0 15650 IR-PCI-MSI 70254652-edge megasas
> > > > > > > > > > 358 15596 0 15596 IR-PCI-MSI 70254651-edge megasas
> > > > > > > > > > 350 15574 0 15574 IR-PCI-MSI 70254643-edge megasas
> > > > > > > > > > 342 15532 0 15532 IR-PCI-MSI 70254635-edge megasas
> > > > > > > > > > 344 15527 0 15527 IR-PCI-MSI 70254637-edge megasas
> > > > > > > > > > 346 15485 0 15485 IR-PCI-MSI 70254639-edge megasas
> > > > > > > > > > 361 15482 0 15482 IR-PCI-MSI 70254654-edge megasas
> > > > > > > > > > 348 15467 0 15467 IR-PCI-MSI 70254641-edge megasas
> > > > > > > > > > 368 15463 0 15463 IR-PCI-MSI 70254661-edge megasas
> > > > > > > > > > 354 15420 0 15420 IR-PCI-MSI 70254647-edge megasas
> > > > > > > > > > 351 15378 0 15378 IR-PCI-MSI 70254644-edge megasas
> > > > > > > > > > 352 15377 0 15377 IR-PCI-MSI 70254645-edge megasas
> > > > > > > > > > 356 15348 0 15348 IR-PCI-MSI 70254649-edge megasas
> > > > > > > > > > 337 15344 0 15344 IR-PCI-MSI 70254630-edge megasas
> > > > > > > > > > 343 15320 0 15320 IR-PCI-MSI 70254636-edge megasas
> > > > > > > > > > 355 15266 0 15266 IR-PCI-MSI 70254648-edge megasas
> > > > > > > > > > 335 15247 0 15247 IR-PCI-MSI 70254628-edge megasas
> > > > > > > > > > 363 15233 0 15233 IR-PCI-MSI 70254656-edge megasas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Average: CPU  %usr  %nice  %sys  %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
> > > > > > > > > > Average: 18  3.80  0.00  14.78  10.08  0.00  0.00  4.01  0.00  0.00  67.33
> > > > > > > > > > Average: 19  3.26  0.00  15.35  10.62  0.00  0.00  4.03  0.00  0.00  66.74
> > > > > > > > > > Average: 20  3.42  0.00  14.57  10.67  0.00  0.00  3.84  0.00  0.00  67.50
> > > > > > > > > > Average: 21  3.19  0.00  15.60  10.75  0.00  0.00  4.16  0.00  0.00  66.30
> > > > > > > > > > Average: 22  3.58  0.00  15.15  10.66  0.00  0.00  3.51  0.00  0.00  67.11
> > > > > > > > > > Average: 23  3.34  0.00  15.36  10.63  0.00  0.00  4.17  0.00  0.00  66.50
> > > > > > > > > > Average: 24  3.50  0.00  14.58  10.93  0.00  0.00  3.85  0.00  0.00  67.13
> > > > > > > > > > Average: 25  3.20  0.00  14.68  10.86  0.00  0.00  4.31  0.00  0.00  66.95
> > > > > > > > > > Average: 26  3.27  0.00  14.80  10.70  0.00  0.00  3.68  0.00  0.00  67.55
> > > > > > > > > > Average: 27  3.58  0.00  15.36  10.80  0.00  0.00  3.79  0.00  0.00  66.48
> > > > > > > > > > Average: 28  3.46  0.00  15.17  10.46  0.00  0.00  3.32  0.00  0.00  67.59
> > > > > > > > > > Average: 29  3.34  0.00  14.42  10.72  0.00  0.00  3.34  0.00  0.00  68.18
> > > > > > > > > > Average: 30  3.34  0.00  15.08  10.70  0.00  0.00  3.89  0.00  0.00  66.99
> > > > > > > > > > Average: 31  3.26  0.00  15.33  10.47  0.00  0.00  3.33  0.00  0.00  67.61
> > > > > > > > > > Average: 32  3.21  0.00  14.80  10.61  0.00  0.00  3.70  0.00  0.00  67.67
> > > > > > > > > > Average: 33  3.40  0.00  13.88  10.55  0.00  0.00  4.02  0.00  0.00  68.15
> > > > > > > > > > Average: 34  3.74  0.00  17.41  10.61  0.00  0.00  4.51  0.00  0.00  63.73
> > > > > > > > > > Average: 35  3.35  0.00  14.37  10.74  0.00  0.00  3.84  0.00  0.00  67.71
> > > > > > > > > > Average: 36  0.54  0.00   1.77   0.00  0.00  0.00  0.00  0.00  0.00  97.69
> > > > > > > > > > ..
> > > > > > > > > > Average: 54  3.60  0.00  15.17  10.39  0.00  0.00  4.22  0.00  0.00  66.62
> > > > > > > > > > Average: 55  3.33  0.00  14.85  10.55  0.00  0.00  3.96  0.00  0.00  67.31
> > > > > > > > > > Average: 56  3.40  0.00  15.19  10.54  0.00  0.00  3.74  0.00  0.00  67.13
> > > > > > > > > > Average: 57  3.41  0.00  13.98  10.78  0.00  0.00  4.10  0.00  0.00  67.73
> > > > > > > > > > Average: 58  3.32  0.00  15.16  10.52  0.00  0.00  4.01  0.00  0.00  66.99
> > > > > > > > > > Average: 59  3.17  0.00  15.80  10.35  0.00  0.00  3.86  0.00  0.00  66.80
> > > > > > > > > > Average: 60  3.00  0.00  14.63  10.59  0.00  0.00  3.97  0.00  0.00  67.80
> > > > > > > > > > Average: 61  3.34  0.00  14.70  10.66  0.00  0.00  4.32  0.00  0.00  66.97
> > > > > > > > > > Average: 62  3.34  0.00  15.29  10.56  0.00  0.00  3.89  0.00  0.00  66.92
> > > > > > > > > > Average: 63  3.29  0.00  14.51  10.72  0.00  0.00  3.85  0.00  0.00  67.62
> > > > > > > > > > Average: 64  3.48  0.00  15.31  10.65  0.00  0.00  3.97  0.00  0.00  66.60
> > > > > > > > > > Average: 65  3.34  0.00  14.36  10.80  0.00  0.00  4.11  0.00  0.00  67.39
> > > > > > > > > > Average: 66  3.13  0.00  14.94  10.70  0.00  0.00  4.10  0.00  0.00  67.13
> > > > > > > > > > Average: 67  3.06  0.00  15.56  10.69  0.00  0.00  3.82  0.00  0.00  66.88
> > > > > > > > > > Average: 68  3.33  0.00  14.98  10.61  0.00  0.00  3.81  0.00  0.00  67.27
> > > > > > > > > > Average: 69  3.20  0.00  15.43  10.70  0.00  0.00  3.82  0.00  0.00  66.85
> > > > > > > > > > Average: 70  3.34  0.00  17.14  10.59  0.00  0.00  3.00  0.00  0.00  65.92
> > > > > > > > > > Average: 71  3.41  0.00  14.94  10.56  0.00  0.00  3.41  0.00  0.00  67.69
> > > > > > > > > >
> > > > > > > > > > Perf top -
> > > > > > > > > >
> > > > > > > > > > 64.33% [kernel] [k] bt_iter
> > > > > > > > > > 4.86%  [kernel] [k] blk_mq_queue_tag_busy_iter
> > > > > > > > > > 4.23%  [kernel] [k] _find_next_bit
> > > > > > > > > > 2.40%  [kernel] [k] native_queued_spin_lock_slowpath
> > > > > > > > > > 1.09%  [kernel] [k] sbitmap_any_bit_set
> > > > > > > > > > 0.71%  [kernel] [k] sbitmap_queue_clear
> > > > > > > > > > 0.63%  [kernel] [k] find_next_bit
> > > > > > > > > > 0.54%  [kernel] [k] _raw_spin_lock_irqsave
> > > > > > > > > >
> > > > > > > > > Ah. So we're spending quite some time in trying to find
> > > > > > > > > a free tag.
> > > > > > > > > I guess this is due to every queue starting at the same
> > > > > > > > > position trying to find a free tag, which inevitably
> > > > > > > > > leads to a contention.
> > > > > > > >
> > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be
> > > > > > > > the
> > > > > > > bottleneck,
> > > > > > > > and looks not related with tag allocation.
> > > > > > > >
> > > > > > > > Kashyap, could you run your performance test again after
> > > > > > > > disabling
> > > > > > > iostat by
> > > > > > > > the following command on all test devices and killing all
> > > > > > > > utilities
> > > > > > > which may
> > > > > > > > read iostat(/proc/diskstats, ...)?
> > > > > > > >
> > > > > > > > echo 0 > /sys/block/sdN/queue/iostat
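
As a side note on why iostats matters here at all: with iostats enabled, every read of /proc/diskstats (or a per-device stat file) has to produce an in-flight count, and under blk-mq that count comes from walking the shared tag space per disk, which is roughly where the bt_iter / blk_mq_queue_tag_busy_iter samples above come from. The following is only a greatly simplified sketch of that shape, not the real block-layer code; all fake_* names are invented for illustration.

#include <stdbool.h>

struct fake_request {
	int  disk_id;   /* which disk this tag's request belongs to */
	bool busy;      /* tag currently allocated? */
};

struct fake_tagset {
	unsigned int nr_tags;        /* e.g. can_queue-sized, shared by all disks */
	struct fake_request *rqs;    /* one slot per tag */
};

/*
 * Each diskstats/stat read ends up counting in-flight requests roughly like
 * this: scan ALL tags in the shared set and keep the busy ones that belong
 * to the disk being queried.  With a large tag space, 24+ devices and
 * frequent sampling, this O(nr_tags) walk per disk per sample is the cost.
 */
static unsigned int count_in_flight(const struct fake_tagset *tags, int disk_id)
{
	unsigned int tag, busy = 0;

	for (tag = 0; tag < tags->nr_tags; tag++)
		if (tags->rqs[tag].busy && tags->rqs[tag].disk_id == disk_id)
			busy++;

	return busy;
}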
> > > > > > >
> > > > > > > Ming - After changing iostat to 0, I see the performance issue
> > > > > > > is resolved.
> > > > > > >
> > > > > > > Below is perf top output after iostats = 0
> > > > > > >
> > > > > > >
> > > > > > > 23.45% [kernel] [k] bt_iter
> > > > > > > 2.27% [kernel] [k] blk_mq_queue_tag_busy_iter
> > > > > > > 2.18% [kernel] [k] _find_next_bit
> > > > > > > 2.06% [megaraid_sas] [k] complete_cmd_fusion
> > > > > > > 1.87% [kernel] [k] clflush_cache_range
> > > > > > > 1.70% [kernel] [k] dma_pte_clear_level
> > > > > > > 1.56% [kernel] [k] __domain_mapping
> > > > > > > 1.55% [kernel] [k] sbitmap_queue_clear
> > > > > > > 1.30% [kernel] [k] gup_pgd_range
> > > > > >
> > > > > > Hi Kashyap,
> > > > > >
> > > > > > Thanks for your test and update.
> > > > > >
> > > > > > Looks like blk_mq_queue_tag_busy_iter() is still sampled by perf
> > > > > > even though iostats is disabled, and I guess there may be
> > > > > > utilities which are reading iostats a bit frequently.
> > > > >
> > > > > I will be doing some more testing and post you my findings.
> > > >
> > > > I will find some time this weekend to see if I can cook a patch to
> > > > address this issue of io accounting.
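
For readers following the thread, one direction such an io-accounting patch could take (purely an illustrative sketch under my own assumptions, not Ming's actual series) is counter-based accounting: bump a per-CPU counter on submit, another on completion, and sum them when the stats are read, so the stats path never has to iterate the shared tag bitmap.

/*
 * Illustrative sketch only -- NOT the actual io-accounting patches.
 * MAX_CPUS and all names here are invented for the example.
 */
#define MAX_CPUS 256

struct inflight_counters {
	long submitted[MAX_CPUS];
	long completed[MAX_CPUS];
};

/* called on the submitting CPU when a request is started */
static void account_start(struct inflight_counters *c, int cpu)
{
	c->submitted[cpu]++;
}

/* called on the completing CPU when a request finishes */
static void account_done(struct inflight_counters *c, int cpu)
{
	c->completed[cpu]++;
}

/* diskstats read: O(nr_cpus) sum instead of an O(nr_tags) bitmap walk */
static long inflight(const struct inflight_counters *c)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < MAX_CPUS; cpu++)
		sum += c->submitted[cpu] - c->completed[cpu];

	return sum;
}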
> > >
> > > Hi Kashyap,
> > >
> > > Please test the top 5 patches in the following tree to see if
> > > megaraid_sas's performance is OK:
> > >
> > > https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2
> > >
> > > This tree is made by adding these 5 patches against patchset V2.
> > >
> >
> > Ming -
> > I applied the 5 patches on top of V2 and the behavior is still unchanged.
> > Below is the perf top data. (1000K IOPS)
> >
> > 34.58% [kernel] [k] bt_iter
> > 2.96% [kernel] [k] sbitmap_any_bit_set
> > 2.77% [kernel] [k] bt_iter_global_tags
> > 1.75% [megaraid_sas] [k] complete_cmd_fusion
> > 1.62% [kernel] [k] sbitmap_queue_clear
> > 1.62% [kernel] [k] _raw_spin_lock
> > 1.51% [kernel] [k] blk_mq_run_hw_queue
> > 1.45% [kernel] [k] gup_pgd_range
> > 1.31% [kernel] [k] irq_entries_start
> > 1.29% fio [.] __fio_gettime
> > 1.13% [kernel] [k] _raw_spin_lock_irqsave
> > 0.95% [kernel] [k] native_queued_spin_lock_slowpath
> > 0.92% [kernel] [k] scsi_queue_rq
> > 0.91% [kernel] [k] blk_mq_run_hw_queues
> > 0.85% [kernel] [k] blk_mq_get_request
> > 0.81% [kernel] [k] switch_mm_irqs_off
> > 0.78% [megaraid_sas] [k] megasas_build_io_fusion
> > 0.77% [kernel] [k] __schedule
> > 0.73% [kernel] [k] update_load_avg
> > 0.69% [kernel] [k] fput
> > 0.65% [kernel] [k] scsi_dispatch_cmd
> > 0.64% fio [.] fio_libaio_event
> > 0.53% [kernel] [k] do_io_submit
> > 0.52% [kernel] [k] read_tsc
> > 0.51% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
> > 0.51% [kernel] [k] scsi_softirq_done
> > 0.50% [kernel] [k] kobject_put
> > 0.50% [kernel] [k] cpuidle_enter_state
> > 0.49% [kernel] [k] native_write_msr
> > 0.48% fio [.] io_completed
> >
> > Below is perf top data with iostat=0 (1400K IOPS)
> >
> > 4.87% [kernel] [k] sbitmap_any_bit_set
> > 2.93% [kernel] [k] _raw_spin_lock
> > 2.84% [megaraid_sas] [k] complete_cmd_fusion
> > 2.38% [kernel] [k] irq_entries_start
> > 2.36% [kernel] [k] gup_pgd_range
> > 2.35% [kernel] [k] blk_mq_run_hw_queue
> > 2.30% [kernel] [k] sbitmap_queue_clear
> > 2.01% fio [.] __fio_gettime
> > 1.78% [kernel] [k] _raw_spin_lock_irqsave
> > 1.51% [kernel] [k] scsi_queue_rq
> > 1.43% [kernel] [k] blk_mq_run_hw_queues
> > 1.36% [kernel] [k] fput
> > 1.32% [kernel] [k] __schedule
> > 1.31% [kernel] [k] switch_mm_irqs_off
> > 1.29% [kernel] [k] update_load_avg
> > 1.25% [megaraid_sas] [k] megasas_build_io_fusion
> > 1.22% [kernel] [k] native_queued_spin_lock_slowpath
> > 1.03% [kernel] [k] scsi_dispatch_cmd
> > 1.03% [kernel] [k] blk_mq_get_request
> > 0.91% fio [.] fio_libaio_event
> > 0.89% [kernel] [k] scsi_softirq_done
> > 0.87% [kernel] [k] kobject_put
> > 0.86% [kernel] [k] cpuidle_enter_state
> > 0.84% fio [.] io_completed
> > 0.83% [kernel] [k] do_io_submit
> > 0.83% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
> > 0.83% [kernel] [k] __switch_to
> > 0.82% [kernel] [k] read_tsc
> > 0.80% [kernel] [k] native_write_msr
> > 0.76% [kernel] [k] aio_comp
> >
> >
> > Perf data without V2 patch applied. (1600K IOPS)
> >
> > 5.97% [megaraid_sas] [k] complete_cmd_fusion
> > 5.24% [kernel] [k] bt_iter
> > 3.28% [kernel] [k] _raw_spin_lock
> > 2.98% [kernel] [k] irq_entries_start
> > 2.29% fio [.] __fio_gettime
> > 2.04% [kernel] [k] scsi_queue_rq
> > 1.92% [megaraid_sas] [k] megasas_build_io_fusion
> > 1.61% [kernel] [k] switch_mm_irqs_off
> > 1.59% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
> > 1.41% [kernel] [k] scsi_dispatch_cmd
> > 1.33% [kernel] [k] scsi_softirq_done
> > 1.18% [kernel] [k] gup_pgd_range
> > 1.18% [kernel] [k] blk_mq_complete_request
> > 1.13% [kernel] [k] blk_mq_free_request
> > 1.05% [kernel] [k] do_io_submit
> > 1.04% [kernel] [k] _find_next_bit
> > 1.02% [kernel] [k] blk_mq_get_request
> > 0.95% [megaraid_sas] [k] megasas_build_ldio_fusion
> > 0.95% [kernel] [k] scsi_dec_host_busy
> > 0.89% fio [.] get_io_u
> > 0.88% [kernel] [k] entry_SYSCALL_64
> > 0.84% [megaraid_sas] [k] megasas_queue_command
> > 0.79% [kernel] [k] native_write_msr
> > 0.77% [kernel] [k] read_tsc
> > 0.73% [kernel] [k] _raw_spin_lock_irqsave
> > 0.73% fio [.] fio_libaio_commit
> > 0.72% [kernel] [k] kmem_cache_alloc
> > 0.72% [kernel] [k] blkdev_direct_IO
> > 0.69% [megaraid_sas] [k] MR_GetPhyParams
> > 0.68% [kernel] [k] blk_mq_dequeue_f
>
> The above data is very helpful to understand the issue, great thanks!
>
> With this patchset V2 and the 5 patches, if iostats is set to 0, IOPS is
> 1400K, but 1600K IOPS can be reached without all these patches with
> iostats as 1.
>
> BTW, could you share us what the machine is? ARM64? I saw ARM64's cache
> coherence performance is bad before. In the dual-socket system (each socket
> has 8 x86 CPU cores) I tested, only ~0.5% IOPS drop can be observed after
> the 5 patches are applied on V2 in the null_blk test, as described in the
> commit log.
I am using Intel Skylake/Lewisburg/Purley.
>
> Looks like it means a single sbitmap can't perform well in the MQ case, in
> which there will be many more concurrent submissions and completions. In
> the case of a single hw queue (current linus tree), one hctx->run_work only
> allows one __blk_mq_run_hw_queue() running in 'async' mode, and reply
> queues are used in a round-robin way, which may cause contention on the
> single sbitmap too; especially, io accounting may consume a bit more CPU,
> which I guess may contribute some to the CPU lockup.
>
> Could you run your test without V2 patches by setting 'iostats' as 0?
Tested without V2 patch set. Iostat=1. IOPS = 1600K
5.93% [megaraid_sas] [k] complete_cmd_fusion
5.34% [kernel] [k] bt_iter
3.23% [kernel] [k] _raw_spin_lock
2.92% [kernel] [k] irq_entries_start
2.57% fio [.] __fio_gettime
2.10% [kernel] [k] scsi_queue_rq
1.98% [megaraid_sas] [k] megasas_build_io_fusion
1.93% [kernel] [k] switch_mm_irqs_off
1.79% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
1.45% [kernel] [k] scsi_softirq_done
1.42% [kernel] [k] scsi_dispatch_cmd
1.23% [kernel] [k] blk_mq_complete_request
1.11% [megaraid_sas] [k] megasas_build_ldio_fusion
1.11% [kernel] [k] gup_pgd_range
1.08% [kernel] [k] blk_mq_free_request
1.03% [kernel] [k] do_io_submit
1.02% [kernel] [k] _find_next_bit
1.00% [kernel] [k] scsi_dec_host_busy
0.94% [kernel] [k] blk_mq_get_request
0.93% [megaraid_sas] [k] megasas_queue_command
0.92% [kernel] [k] native_write_msr
0.85% fio [.] get_io_u
0.83% [kernel] [k] entry_SYSCALL_64
0.83% [kernel] [k] _raw_spin_lock_irqsave
0.82% [kernel] [k] read_tsc
0.81% [sd_mod] [k] sd_init_command
0.67% [kernel] [k] kmem_cache_alloc
0.63% [kernel] [k] memset_erms
0.63% [kernel] [k] aio_read_events
0.62% [kernel] [k] blkdev_dir
Tested without V2 patch set. Iostat=0. IOPS = 1600K
5.79% [megaraid_sas] [k] complete_cmd_fusion
3.28% [kernel] [k] _raw_spin_lock
3.28% [kernel] [k] irq_entries_start
2.10% [kernel] [k] scsi_queue_rq
1.96% fio [.] __fio_gettime
1.85% [megaraid_sas] [k] megasas_build_io_fusion
1.68% [megaraid_sas] [k] megasas_build_and_issue_cmd_fusion
1.36% [kernel] [k] gup_pgd_range
1.36% [kernel] [k] scsi_dispatch_cmd
1.28% [kernel] [k] do_io_submit
1.25% [kernel] [k] switch_mm_irqs_off
1.20% [kernel] [k] blk_mq_free_request
1.18% [megaraid_sas] [k] megasas_build_ldio_fusion
1.11% [kernel] [k] dput
1.07% [kernel] [k] scsi_softirq_done
1.07% fio [.] get_io_u
1.07% [kernel] [k] scsi_dec_host_busy
1.02% [kernel] [k] blk_mq_get_request
0.96% [sd_mod] [k] sd_init_command
0.92% [kernel] [k] entry_SYSCALL_64
0.89% [kernel] [k] blk_mq_make_request
0.87% [kernel] [k] blkdev_direct_IO
0.84% [kernel] [k] blk_mq_complete_request
0.78% [kernel] [k] _raw_spin_lock_irqsave
0.77% [kernel] [k] lookup_ioctx
0.76% [megaraid_sas] [k] MR_GetPhyParams
0.75% [kernel] [k] blk_mq_dequeue_from_ctx
0.75% [kernel] [k] memset_erms
0.74% [kernel] [k] kmem_cache_alloc
0.72% [megaraid_sas] [k] megasas_queue_comman
> and could you share us what the .can_queue is in this HBA?
can_queue = 8072. In my test I used --iodepth=128 for 12 SCSI devices (R0
volumes), so fio will only push 1536 outstanding commands.
>
> >
> >
> > > If possible, please provide us the performance data without these
> > > patches and with these patches, together with the perf trace.
> > >
> > > The top 5 patches are for addressing the io accounting issue, which
> > > should be the main reason for your performance drop, and even the
> > > lockup in megaraid_sas's ISR, IMO.
> >
> > I think the performance drop is a different issue, maybe a side effect of
> > the patch set. Even if we fix this perf issue, the CPU lockup is a
> > completely different issue.
>
> The performance drop is caused by the global data structure of sbitmap
> which is accessed from all CPUs concurrently.
>
> > Regarding the CPU lockup, there was a similar discussion and folks found
> > irq_poll to be a good method to resolve the lockup. Not sure why the NVMe
> > driver did not opt for irq_poll, but there was extensive discussion, and
> > I am also
>
> NVMe's hw queues won't use host wide tags, so no such issue.
>
> > seeing the CPU lockup mainly because multiple completion/reply queues are
> > tied to a single CPU. We have the weighting method in irq_poll to quit the
> > ISR, and that is the way we can avoid the lockup.
> > http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html
>
> This patch can make sure that one request is always completed on the
> submission CPU, but contention on the global sbitmap is too big and
> causes a performance drop.
>
> Now this looks like a really interesting topic for discussion.
>
>
> Thanks,
> Ming
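
As a footnote on the irq_poll point above: the general pattern is to mask the vector in the hard interrupt handler, hand the reply-ring walk to irq_poll, and let the budget ('weight') bound how much completion work one CPU does per pass. The sketch below shows only that shape; the my_* structures and MY_IRQPOLL_WEIGHT are invented for illustration and are not the megaraid_sas driver's real code.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define MY_IRQPOLL_WEIGHT 32	/* budget: max completions per poll pass */

struct my_reply_queue {
	struct irq_poll iop;
	/* ... hardware reply-ring state, MSI-x vector, etc. ... */
};

/* consume one entry from the reply ring; returns 0 when the ring is empty */
static int my_process_one_reply(struct my_reply_queue *rq);

static int my_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct my_reply_queue *rq = container_of(iop, struct my_reply_queue, iop);
	int done = 0;

	while (done < budget && my_process_one_reply(rq))
		done++;

	if (done < budget) {
		/* ring drained: leave polling mode and unmask the vector */
		irq_poll_complete(iop);
		/* re-enable the MSI-x vector here */
	}

	return done;
}

static irqreturn_t my_isr(int irq, void *data)
{
	struct my_reply_queue *rq = data;

	/*
	 * Mask the vector and defer the ring walk to irq_poll, so a single
	 * CPU never sits in hardirq context draining an unbounded backlog.
	 */
	irq_poll_sched(&rq->iop);

	return IRQ_HANDLED;
}

/* at init time: irq_poll_init(&rq->iop, MY_IRQPOLL_WEIGHT, my_irqpoll_handler); */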