From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 Kernel <x86@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
iommu@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>,
Lu Baolu <baolu.lu@linux.intel.com>,
kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>,
Joerg Roedel <joro@8bytes.org>, "H. Peter Anvin" <hpa@zytor.com>,
Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
Paul Luse <paul.e.luse@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Raj Ashok <ashok.raj@intel.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
maz@kernel.org, seanjc@google.com,
Robin Murphy <robin.murphy@arm.com>,
jacob.jun.pan@linux.intel.com
Subject: Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
Date: Mon, 12 Feb 2024 12:13:46 -0800
Message-ID: <20240212121346.0f8870a7@jacob-builder>
In-Reply-To: <2aa290eb-ec4b-43b1-87db-4df8ccbeaa37@kernel.dk>
Hi Jens,
On Mon, 12 Feb 2024 11:36:42 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/12/24 11:27 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> On 2/9/24 10:43 AM, Jacob Pan wrote:
> >>> Hi Jens,
> >>>
> >>> On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>>> Hi Jacob,
> >>>>
> >>>> I gave this a quick spin, using 4 gen2 optane drives. Basic test,
> >>>> just IOPS bound on the drive, and using 1 thread per drive for IO.
> >>>> Random reads, using io_uring.
> >>>>
> >>>> For reference, using polled IO:
> >>>>
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >>>> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>>>
> >>>> which is about 5.1M/drive, which is what they can deliver.
> >>>>
> >>>> Before your patches, I see:
> >>>>
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >>>> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>>>
> >>>> at 2.82M ints/sec. With the patches, I see:
> >>>>
> >>>> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >>>> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>>>
> >>>> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >>>> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >>>> the PMN count increasing in /proc/interrupts.
> >>>>
> >>> The ints/sec reduction is not as high as I expected either, especially
> >>> at this high rate, which means there is not enough coalescing going on
> >>> to get the full performance benefit.
> >>
> >> Right, it means that we're getting pretty decent commands-per-int
> >> coalescing already. I added another drive and repeated, here's that
> >> one:
> >>
> >> IOPS w/polled: 25.7M IOPS
> >>
> >> Stock kernel:
> >>
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> >> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> >>
> >> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> >>
> >> Patched kernel:
> >>
> >> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> >> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> >>
> >> at the same interrupt rate. So not a reduction, but slightly higher
> >> perf. Maybe we're reaping more commands on average per interrupt.
> >>
> >> Anyway, not a lot of interesting data there, just figured I'd re-run it
> >> with the added drive.
> >>
> >>> The opportunity for IRQ coalescing also depends on how long the
> >>> driver's hardirq handler executes. In the posted MSI demux loop, it
> >>> does not wait for more MSIs to arrive before exiting the pending IRQ
> >>> polling loop. So if the hardirq handler finishes very quickly, it may
> >>> not coalesce as much. Perhaps we need to find more "useful" work to
> >>> do to maximize the window for coalescing.
> >>>
> >>> I am not familiar with the optane driver; I need to look into how its
> >>> hardirq handler works. I have only tested NVMe gen5 in terms of
> >>> storage IO, where I saw a 30-50% ints/sec reduction at an even lower
> >>> IRQ rate (200k/sec).
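To expand on what I meant by the demux loop above: the posted MSI
notification handler essentially snapshots and clears the pending bits in
the posted interrupt descriptor, then dispatches each pending vector,
something along these lines (simplified pseudo-C, not the actual patch
code; the pir64 field and handle_one_vector() names are stand-ins):

        static void demux_pending_posted_msis(struct pi_desc *pid,
                                              struct pt_regs *regs)
        {
                u64 pir_copy[4];
                int i, bit;

                /* Snapshot and clear the 256-bit pending field atomically. */
                for (i = 0; i < 4; i++)
                        pir_copy[i] = xchg(&pid->pir64[i], 0);

                /*
                 * Dispatch every vector that was pending. Note there is no
                 * waiting here for additional MSIs before returning.
                 */
                for_each_set_bit(bit, (unsigned long *)pir_copy, 256)
                        handle_one_vector(bit, regs);
        }

So the coalescing window is only as long as the handlers themselves run.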
> >>
> >> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> >> cheap - for as long as there are CQEs in the completion ring, it'll
> >> reap them and complete them. That does mean that if we get an IRQ and
> >> there's more than one entry to complete, we will do all of them. No IRQ
> >> coalescing is configured (nvme kind of sucks for that...), but optane
> >> media is much faster than flash, so that may be a difference.
> >>
> > Yeah, I also checked the driver code; it seems to just wake up the
> > threaded handler.
>
> That only happens if you're using threaded interrupts, which is not the
> default as it's much slower. What happens for the normal case is that we
> init a batch, and then poll the CQ ring for completions, and then add
> them to the completion batch. Once no more are found, we complete the
> batch.
>
Thanks for the explanation.
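For my own notes, that non-threaded path reads roughly like the sketch
below (condensed and written from memory rather than copied from
drivers/nvme/host/pci.c, so names like nvme_poll_cq, DEFINE_IO_COMP_BATCH
and nvme_pci_complete_batch may not match the current tree exactly):

        static irqreturn_t nvme_irq(int irq, void *data)
        {
                struct nvme_queue *nvmeq = data;
                DEFINE_IO_COMP_BATCH(iob);   /* init an empty completion batch */

                /* Poll the CQ ring: every CQE found is added to the batch. */
                if (nvme_poll_cq(nvmeq, &iob)) {
                        /* Ring drained: complete the whole batch at once. */
                        if (!rq_list_empty(iob.req_list))
                                nvme_pci_complete_batch(&iob);
                        return IRQ_HANDLED;
                }
                return IRQ_NONE;
        }

So whatever has accumulated in the CQ ring by the time the interrupt is
taken gets reaped and completed in one pass, which lines up with the
~5.8 IOs per interrupt you measured above.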
> You're not using threaded interrupts, are you?
No, I didn't set the "use_threaded_interrupts" module parameter.
>
> > For the record, here is my setup and performance data for 4 Samsung
> > disks. IOPS increased from 1.6M per disk to 2.1M. One difference I
> > noticed is that on my setup the IRQ rate goes up with this patch
> > instead of down, e.g. BEFORE: 185545/sec/vector, AFTER:
> > 220128/sec/vector.
>
> I'm surprised the rates are that low and, if they are, that posted MSI
> makes a difference. Usually what I've seen, when IRQ is slower than
> polling, is that interrupt delivery is unreasonably slow on that
> architecture or machine. But ~200k/sec isn't that high at all.
>
> > [global]
> > bs=4k
> > direct=1
> > norandommap
> > ioengine=libaio
> > randrepeat=0
> > readwrite=randread
> > group_reporting
> > time_based
> > iodepth=64
> > exitall
> > random_generator=tausworthe64
> > runtime=30
> > ramp_time=3
> > numjobs=8
> > group_reporting=1
> >
> > #cpus_allowed_policy=shared
> > cpus_allowed_policy=split
> > [disk_nvme6n1_thread_1]
> > filename=/dev/nvme6n1
> > cpus_allowed=0-7
> > [disk_nvme5n1_thread_2]
> > filename=/dev/nvme5n1
> > cpus_allowed=8-15
> > [disk_nvme4n1_thread_3]
> > filename=/dev/nvme4n1
> > cpus_allowed=16-23
> > [disk_nvme3n1_thread_4]
> > filename=/dev/nvme3n1
> > cpus_allowed=24-31
>
> For better performance, I'd change that ioengine=libaio to:
>
> ioengine=io_uring
> fixedbufs=1
> registerfiles=1
>
> Particularly fixedbufs makes a big difference, as a big cycle consumer
> is mapping/unmapping pages from the application space into the kernel
> for O_DIRECT. With fixedbufs=1, this is done once and we just reuse the
> buffers. At least for my runs, this is ~15% of the systime for doing IO.
> It also removes the page referencing, which isn't as big a consumer, but
> still noticeable.
>
Indeed, the system time goes down significantly. I got the following
with the posted MSI patch applied:

Before (libaio):
  read: IOPS=8925k, BW=34.0GiB/s (36.6GB/s)(1021GiB/30001msec)
  user 3m25.156s
  sys  11m16.785s

After (io_uring engine, fixedbufs):
  read: IOPS=8811k, BW=33.6GiB/s (36.1GB/s)(1008GiB/30002msec)
  user 2m56.255s
  sys  8m56.378s

There is no gain in IOPS, just a reduction in CPU utilization. Both are
an improvement over libaio without the posted MSI patch.
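As I understand it, fixedbufs/registerfiles correspond to io_uring's
registered buffers and registered files, i.e. roughly the liburing usage
below (a made-up minimal sketch doing one 4k read, no error handling,
just to show where the one-time registration happens):

        #define _GNU_SOURCE
        #include <liburing.h>
        #include <sys/uio.h>
        #include <fcntl.h>
        #include <stdlib.h>

        int main(void)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                void *buf;
                int fd = open("/dev/nvme6n1", O_RDONLY | O_DIRECT);

                posix_memalign(&buf, 4096, 4096);
                struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

                io_uring_queue_init(64, &ring, 0);
                /* One-time setup, paid once instead of per IO: */
                io_uring_register_buffers(&ring, &iov, 1);  /* cf. fixedbufs=1 */
                io_uring_register_files(&ring, &fd, 1);     /* cf. registerfiles=1 */

                sqe = io_uring_get_sqe(&ring);
                /* fd index 0 in the registered file table, buffer index 0. */
                io_uring_prep_read_fixed(sqe, 0, buf, 4096, 0, 0);
                sqe->flags |= IOSQE_FIXED_FILE;
                io_uring_submit(&ring);

                io_uring_wait_cqe(&ring, &cqe);
                io_uring_cqe_seen(&ring, cqe);
                io_uring_queue_exit(&ring);
                return 0;
        }

That matches the saving you describe: the buffer pages are mapped once at
registration time, so the per-IO pinning and page referencing for
O_DIRECT go away.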
> Anyway, side quest, but I think you'll find this considerably reduces
> overhead / improves performance. Also makes it so that you can compare
> with polled IO on nvme, which aio can't do. You'd just add hipri=1 as an
> option for that (with a side note that you need to configure nvme poll
> queues, see the poll_queues parameter).
>
> On my box, all the NVMe devices seem to be on node1, not node0, which
> looks like where the CPUs you are using live. Might be worth checking
> and adjusting your CPU domains for each drive? I also tend to get better
> performance by taking the CPU scheduler out of the picture, e.g. just
> pin each job to a single CPU rather than a range. It's just one
> process/thread anyway, so really no point in giving it options here.
> It'll help reduce variability too, which can be a pain in the butt to
> deal with.
>
Much faster with poll_queues=32 (32 jobs):
read: IOPS=13.0M, BW=49.6GiB/s (53.3GB/s)(1489GiB/30001msec)
user 2m29.177s
sys 15m7.022s
Observed no IRQ counts from NVMe.
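(For completeness, the polled setup per your note was the nvme driver
loaded with poll queues enabled plus polled completions requested in the
job file, roughly:

  # nvme module parameter (or nvme.poll_queues=32 on the kernel cmdline)
  poll_queues=32

  # added to the fio [global] section above
  hipri=1

I assume fio then sets up the rings for polled rather than
interrupt-driven completions, which would match seeing no NVMe interrupt
counts during the run.)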
Thanks,
Jacob