From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 Kernel <x86@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
iommu@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>,
Lu Baolu <baolu.lu@linux.intel.com>,
kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>,
Joerg Roedel <joro@8bytes.org>, "H. Peter Anvin" <hpa@zytor.com>,
Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
Paul Luse <paul.e.luse@intel.com>,
Dan Williams <dan.j.williams@intel.com>,
Raj Ashok <ashok.raj@intel.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
maz@kernel.org, seanjc@google.com,
Robin Murphy <robin.murphy@arm.com>,
jacob.jun.pan@linux.intel.com
Subject: Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
Date: Mon, 12 Feb 2024 10:27:42 -0800
Message-ID: <20240212102742.34e1e2c2@jacob-builder>
In-Reply-To: <9285b29c-6556-46db-b0bb-7a85ad40d725@kernel.dk>
Hi Jens,
On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> >
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite at the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts,
> >>
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate. Which means not enough coalescing going on to get the
> > performance benefits.
>
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
>
> IOPS w/polled: 25.7M IOPS
>
> Stock kernel:
>
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
>
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
>
> Patched kernel:
>
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
>
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
>
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
>
> > The opportunity of IRQ coalescing is also dependent on how long the
> > driver's hardirq handler executes. In the posted MSI demux loop, it does
> > not wait for more MSIs to come before exiting the pending IRQ polling
> > loop. So if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps, we need to find more "useful" work to do to
> > maximize the window for coalescing.
> >
> > I am not familiar with the optane driver and need to look into how its
> > hardirq handler works. I have only tested NVMe gen5 in terms of storage IO;
> > I saw a 30-50% ints/sec reduction at an even lower IRQ rate (200k/sec).
>
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
>
Yeah, I also checked the driver code; it seems to just wake up the threaded
handler.
For the record, here is my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
the per-vector IRQ rate went up rather than down with this patch on my setup,
e.g. BEFORE: 185545/sec/vector
     AFTER:  220128/sec/vector
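To put the before/after figures side by side, here is a small sketch that computes the relative change in per-disk IOPS and per-vector IRQ rate. The inputs are copied from the iostat (nvme3c3n1 tps) and irqtop (top vector DELTA) output further down in this mail; the helper name `pct_change` is only for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Per-disk IOPS: iostat tps for nvme3c3n1 without and with the patch.
iops_before, iops_after = 1615525, 2184313
# Per-vector IRQ rate: irqtop DELTA over 1 second, top vector in each run.
irq_before, irq_after = 185545, 220128

iops_gain = pct_change(iops_before, iops_after)
irq_gain = pct_change(irq_before, irq_after)

print(f"IOPS per disk:   {iops_gain:+.1f}%")
print(f"IRQs per vector: {irq_gain:+.1f}%")
```

With these inputs, IOPS grows by a larger percentage than the per-vector IRQ rate, which suggests more completions are reaped per device interrupt on the patched kernel. This only compares one disk and one vector from each run, so treat it as a rough cross-check.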
CPU: pinned to the highest non-turbo frequency (may be different on yours).
echo "Set CPU frequency P1 2.7GHz"
# Pin min and max to 2.7GHz on CPUs 0-127 so the frequency cannot scale.
for i in $(seq 0 127); do
    echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq
    echo 2700000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq
done
PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]
LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM notsupported
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+
LnkSta: Speed 32GT/s (ok), Width x4(ok)
LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis
NVME setup:
nvme5n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme6n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme3n1 SAMSUNG MZWLO1T9HCJR-00A07
nvme4n1 SAMSUNG MZWLO1T9HCJR-00A07
FIO:
[global]
bs=4k
direct=1
norandommap
ioengine=libaio
randrepeat=0
readwrite=randread
group_reporting
time_based
iodepth=64
exitall
random_generator=tausworthe64
runtime=30
ramp_time=3
numjobs=8
group_reporting=1
#cpus_allowed_policy=shared
cpus_allowed_policy=split
[disk_nvme6n1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme5n1]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme4n1]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme3n1]
filename=/dev/nvme3n1
cpus_allowed=24-31
iostat w/o posted MSI patch, v6.8-rc1:
Device            tps    kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read
nvme3c3n1  1615525.00   6462100.00       0.00       0.00  6462100
nvme4c4n1  1615471.00   6461884.00       0.00       0.00  6461884
nvme5c5n1  1615602.00   6462408.00       0.00       0.00  6462408
nvme6c6n1  1614637.00   6458544.00       0.00       0.00  6458544
irqtop (delta 1 sec.)
IRQ TOTAL DELTA NAME
800 6290026 185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76
797 6279554 185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73
799 6281627 185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75
802 6285742 185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78
... ... similar irq rate for all 32 vectors
iostat w/ posted MSI patch:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme3c3n1 2184313.00 8737256.00 0.00 0.00 8737256 0 0
nvme4c4n1 2184241.00 8736972.00 0.00 0.00 8736972 0 0
nvme5c5n1 2184269.00 8737080.00 0.00 0.00 8737080 0 0
nvme6c6n1 2184003.00 8736012.00 0.00 0.00 8736012 0 0
irqtop w/ posted MSI patch:
IRQ TOTAL DELTA NAME
PMN 5230078416 5502657 Posted MSI notification event
423 138068935 220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80
425 138057654 219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82
426 138101745 219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83
... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted MSI notifications (PMN) / total MSIs = 78%
i.e. 5.50M PMN/sec / (0.22M MSIs/sec/vector * 32 vectors) = 0.78125
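The ratio above can be reproduced from the raw irqtop deltas. This is a minimal sketch, assuming all 32 vectors fire at the rate of the top vector shown (irqtop reports them as "similar"); the function name `coalescing_ratio` is hypothetical and not part of the patch set.

```python
def coalescing_ratio(pmn_per_sec: float,
                     msi_per_vector_per_sec: float,
                     vectors: int) -> float:
    """Posted MSI notifications delivered per device MSI (lower = more coalescing)."""
    total_msis = msi_per_vector_per_sec * vectors
    return pmn_per_sec / total_msis


# irqtop DELTA values (1 second window) with the posted MSI patch applied.
pmn_delta = 5502657   # "PMN  Posted MSI notification event"
msi_delta = 220128    # nvme4q80; assumed representative of all vectors
num_vectors = 32

ratio = coalescing_ratio(pmn_delta, msi_delta, num_vectors)
print(f"PMN / total MSIs = {ratio:.3f}")
```

A ratio below 1.0 means that several device MSIs were handled under a single notification interrupt; a ratio of 1.0 would mean no coalescing at all.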
Thanks,
Jacob