From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 Kernel <x86@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
iommu@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>,
Lu Baolu <baolu.lu@linux.intel.com>,
kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>,
Joerg Roedel <joro@8bytes.org>, "H. Peter Anvin" <hpa@zytor.com>,
Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
Paul Luse <paul.e.luse@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Raj Ashok <ashok.raj@intel.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
maz@kernel.org, seanjc@google.com,
Robin Murphy <robin.murphy@arm.com>,
jacob.jun.pan@linux.intel.com
Subject: Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
Date: Mon, 12 Feb 2024 10:27:42 -0800 [thread overview]
Message-ID: <20240212102742.34e1e2c2@jacob-builder> (raw)
In-Reply-To: <9285b29c-6556-46db-b0bb-7a85ad40d725@kernel.dk>
Hi Jens,
On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts,
> >>
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate. Which means not enough coalescing going on to get the
> > performance benefits.
>
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
>
> IOPS w/polled: 25.7M IOPS
>
> Stock kernel:
>
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>
> Patched kernel:
>
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
>
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
>
> > The opportunity of IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before exiting the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps, we need to find more "useful" work to do to
> > maximize the window for coalescing.
> >
> > I am not familiar with the optane driver; I need to look into how its
> > hardirq handler works. I have only tested NVMe gen5 for storage IO; I
> > saw a 30-50% ints/sec reduction at an even lower IRQ rate (200k/sec).
>
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
>
Yeah, I also checked the driver code; it seems to just wake up the threaded
handler.
For the record, here are my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M to 2.1M per disk. One difference I noticed is that
per-vector IRQ throughput improved rather than being reduced with this patch
on my setup, e.g.
BEFORE: 185545 ints/sec/vector
AFTER:  220128 ints/sec/vector
CPU: pinned at the highest non-turbo frequency (P1 = 2.7GHz here; may differ
on your system):
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done
PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]
LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM not supported
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+
LnkSta: Speed 32GT/s (ok), Width x4(ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis
NVME setup:
nvme5n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme6n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme3n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme4n1 SAMSUNG MZWLO1T9HCJR-00A07
FIO:
[global]
bs=4k
direct=1
norandommap
ioengine=libaio
randrepeat=0
readwrite=randread
group_reporting
time_based
iodepth=64
exitall
random_generator=tausworthe64
runtime=30
ramp_time=3
numjobs=8
#cpus_allowed_policy=shared
cpus_allowed_policy=split
[disk_nvme6n1_thread_1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme5n1_thread_2]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme4n1_thread_3]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme3n1_thread_4]
filename=/dev/nvme3n1
cpus_allowed=24-31
iostat w/o posted MSI patch, v6.8-rc1:
Device     tps        kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read
nvme3c3n1 1615525.00 6462100.00 0.00 0.00 6462100
nvme4c4n1 1615471.00 6461884.00 0.00 0.00 6461884
nvme5c5n1 1615602.00 6462408.00 0.00 0.00 6462408
nvme6c6n1 1614637.00 6458544.00 0.00 0.00 6458544
irqtop (delta 1 sec.)
IRQ TOTAL DELTA NAME
800 6290026 185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76
797 6279554 185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73
799 6281627 185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75
802 6285742 185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78
... ... similar irq rate for all 32 vectors
iostat w/ posted MSI patch:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme3c3n1 2184313.00 8737256.00 0.00 0.00 8737256 0 0
nvme4c4n1 2184241.00 8736972.00 0.00 0.00 8736972 0 0
nvme5c5n1 2184269.00 8737080.00 0.00 0.00 8737080 0 0
nvme6c6n1 2184003.00 8736012.00 0.00 0.00 8736012 0 0
irqtop w/ posted MSI patch:
IRQ TOTAL DELTA NAME
PMN 5230078416 5502657 Posted MSI notification event
423 138068935 220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80
425 138057654 219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82
426 138101745 219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83
... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted interrupt notifications (PMN) / total MSIs
= 5502657 / (220128 * 32) ≈ 0.78, i.e. roughly 78%.
Thanks,
Jacob