From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 Kernel <x86@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	iommu@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>,
	Joerg Roedel <joro@8bytes.org>, "H. Peter Anvin" <hpa@zytor.com>,
	Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
	Paul Luse <paul.e.luse@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Raj Ashok <ashok.raj@intel.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	maz@kernel.org, seanjc@google.com,
	Robin Murphy <robin.murphy@arm.com>,
	jacob.jun.pan@linux.intel.com
Subject: Re: [PATCH 00/15] Coalesced Interrupt Delivery with posted MSI
Date: Mon, 12 Feb 2024 10:27:42 -0800
Message-ID: <20240212102742.34e1e2c2@jacob-builder>
In-Reply-To: <9285b29c-6556-46db-b0bb-7a85ad40d725@kernel.dk>

Hi Jens,

On Fri, 9 Feb 2024 13:31:17 -0700, Jens Axboe <axboe@kernel.dk> wrote:

> On 2/9/24 10:43 AM, Jacob Pan wrote:
> > Hi Jens,
> > 
> > On Thu, 8 Feb 2024 08:34:55 -0700, Jens Axboe <axboe@kernel.dk> wrote:
> >   
> >> Hi Jacob,
> >>
> >> I gave this a quick spin, using 4 gen2 optane drives. Basic test, just
> >> IOPS bound on the drive, and using 1 thread per drive for IO. Random
> >> reads, using io_uring.
> >>
> >> For reference, using polled IO:
> >>
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.36M, BW=9.94GiB/s, IOS/call=31/31
> >> IOPS=20.37M, BW=9.95GiB/s, IOS/call=31/31
> >>
> >> which is about 5.1M/drive, which is what they can deliver.
> >>
> >> Before your patches, I see:
> >>
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.38M, BW=7.02GiB/s, IOS/call=32/31
> >> IOPS=14.37M, BW=7.02GiB/s, IOS/call=32/32
> >>
> >> at 2.82M ints/sec. With the patches, I see:
> >>
> >> IOPS=14.73M, BW=7.19GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=32/31
> >> IOPS=14.90M, BW=7.27GiB/s, IOS/call=31/32
> >>
> >> at 2.34M ints/sec. So a nice reduction in interrupt rate, though not
> >> quite to the extent I expected. Booted with 'posted_msi' and I do see
> >> posted interrupts increasing in the PMN in /proc/interrupts.
> >>  
> > The ints/sec reduction is not as high as I expected either, especially
> > at this high rate, which means there is not enough coalescing going on
> > to get the performance benefits.
> 
> Right, it means that we're getting pretty decent commands-per-int
> coalescing already. I added another drive and repeated, here's that one:
> 
> IOPS w/polled: 25.7M IOPS
> 
> Stock kernel:
> 
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> IOPS=21.44M, BW=10.47GiB/s, IOS/call=32/32
> IOPS=21.41M, BW=10.45GiB/s, IOS/call=32/32
> 
> at ~3.7M ints/sec, or about 5.8 IOPS / int on average.
> 
> Patched kernel:
> 
> IOPS=21.90M, BW=10.69GiB/s, IOS/call=31/32
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/31
> IOPS=21.89M, BW=10.69GiB/s, IOS/call=32/32
> 
> at the same interrupt rate. So not a reduction, but slightly higher
> perf. Maybe we're reaping more commands on average per interrupt.
> 
> Anyway, not a lot of interesting data there, just figured I'd re-run it
> with the added drive.
> 
> > The opportunity for IRQ coalescing also depends on how long the
> > driver's hardirq handler executes. The posted MSI demux loop does not
> > wait for more MSIs to arrive before exiting the pending IRQ polling
> > loop, so if the hardirq handler finishes very quickly, it may not
> > coalesce as much. Perhaps we need to find more "useful" work to do to
> > maximize the window for coalescing.
> > 
> > I am not familiar with the optane driver and need to look into how its
> > hardirq handler works. I have only tested NVMe gen5 in terms of storage
> > IO; I saw a 30-50% ints/sec reduction at an even lower IRQ rate (200k/sec).
> 
> It's just an nvme device, so it's the nvme driver. The IRQ side is very
> cheap - for as long as there are CQEs in the completion ring, it'll reap
> them and complete them. That does mean that if we get an IRQ and there's
> more than one entry to complete, we will do all of them. No IRQ
> coalescing is configured (nvme kind of sucks for that...), but optane
> media is much faster than flash, so that may be a difference.
> 
Yeah, I also checked the driver code; it seems to just wake up the threaded
handler.
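
To illustrate the point quoted above, that coalescing depends on how long
each hardirq handler runs, here is a toy model in Python. It is only a
sketch and not the kernel code: the function name, the arrival times and
the handler costs are all made up for illustration. It assumes one
notification is raised whenever an MSI arrives while the CPU is idle, and
that the loop keeps handling MSIs that are already pending but never waits
for new ones.

```python
def count_notifications(arrivals_ns, handler_ns):
    """Toy model: count notifications needed to deliver all MSIs.

    arrivals_ns: sorted MSI arrival times (hypothetical values).
    handler_ns:  cost of one hardirq handler (hypothetical value).
    """
    notifications = 0
    handled = 0
    now = 0
    total = len(arrivals_ns)
    while handled < total:
        # CPU is idle until the next MSI arrives and raises a notification.
        now = max(now, arrivals_ns[handled])
        notifications += 1
        # Demux loop: handle everything already pending, but do not wait.
        while handled < total and arrivals_ns[handled] <= now:
            now += handler_ns
            handled += 1
    return notifications


# Two bursts of three MSIs each, 10 ns apart within a burst.
bursty = [0, 10, 20, 1000, 1010, 1020]
slow_handler = count_notifications(bursty, handler_ns=15)
fast_handler = count_notifications(bursty, handler_ns=5)
print(slow_handler, fast_handler)
```

With the slower handler, each burst is absorbed by a single notification;
with the faster one, the loop exits before the next MSI arrives, so every
MSI needs its own notification.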

For the record, here is my setup and performance data for 4 Samsung disks.
IOPS increased from 1.6M per disk to 2.1M. One difference I noticed is that
IRQ throughput increased instead of decreasing with this patch on my setup,
e.g. BEFORE: 185545/sec/vector
     AFTER:  220128/sec/vector
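
As a quick sanity check on those figures, the relative changes can be
worked out in Python. The per-vector IRQ rates are the two numbers above;
the tps values are the nvme3c3n1 rows from the iostat output further down.
The helper name is mine, and this is just arithmetic on the reported
numbers, not a new measurement.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Per-vector IRQ rate (irqtop delta over 1 sec).
irq_before, irq_after = 185545, 220128
# Per-disk tps for nvme3c3n1 (iostat).
tps_before, tps_after = 1615525, 2184313

irq_gain = pct_change(irq_before, irq_after)
tps_gain = pct_change(tps_before, tps_after)
print(f"IRQ rate per vector: {irq_gain:+.1f}%")
print(f"IOPS per disk:       {tps_gain:+.1f}%")
```

So IOPS per disk grows by roughly a third, while the per-vector IRQ rate
grows by a bit under a fifth.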

CPU: (highest non-turbo freq, maybe different on yours).
echo "Set CPU frequency P1 2.7GHz"                                                                      
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_max_freq ;done
for i in `seq 0 1 127`; do  echo 2700000 >  /sys/devices/system/cpu/cpu$i/cpufreq/scaling_min_freq ;done

PCI:
[root@emr-bkc posted_msi_tests]# lspci -vv -nn -s 0000:64:00.0|grep -e Lnk -e Sam -e nvme                                                   
64:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM174X [144d:a826] (prog-if 02 [NVM Express]) 
        Subsystem: Samsung Electronics Co Ltd Device [144d:aa0a]                                                                            
                LnkCap: Port #0, Speed 32GT/s, Width x4, ASPM notsupported                                                                 
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled-CommClk+                                                                     
                LnkSta: Speed 32GT/s (ok), Width x4(ok)                                                                                    
                LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS
                LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis

NVME setup:                                            
nvme5n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme6n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme3n1       SAMSUNG MZWLO1T9HCJR-00A07                    
nvme4n1       SAMSUNG MZWLO1T9HCJR-00A07                    

FIO:
[global]                      
bs=4k                         
direct=1                      
norandommap                   
ioengine=libaio               
randrepeat=0                  
readwrite=randread            
group_reporting               
time_based                    
iodepth=64                    
exitall                       
random_generator=tausworthe64 
runtime=30                    
ramp_time=3                   
numjobs=8                     
group_reporting=1             
                              
#cpus_allowed_policy=shared   
cpus_allowed_policy=split     
[disk_nvme6n1]
filename=/dev/nvme6n1
cpus_allowed=0-7
[disk_nvme5n1]
filename=/dev/nvme5n1
cpus_allowed=8-15
[disk_nvme4n1]
filename=/dev/nvme4n1
cpus_allowed=16-23
[disk_nvme3n1]
filename=/dev/nvme3n1
cpus_allowed=24-31

iostat w/o posted MSI patch, v6.8-rc1:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read
nvme3c3n1     1615525.00   6462100.00         0.00         0.00    6462100
nvme4c4n1     1615471.00   6461884.00         0.00         0.00    6461884
nvme5c5n1     1615602.00   6462408.00         0.00         0.00    6462408
nvme6c6n1     1614637.00   6458544.00         0.00         0.00    6458544

irqtop (delta 1 sec.)					
           IRQ           TOTAL          DELTA NAME                                      							
           800         6290026         185545 IR-PCI-MSIX-0000:65:00.0 76-edge nvme5q76							
           797         6279554         185295 IR-PCI-MSIX-0000:65:00.0 73-edge nvme5q73							
           799         6281627         185200 IR-PCI-MSIX-0000:65:00.0 75-edge nvme5q75							
           802         6285742         185185 IR-PCI-MSIX-0000:65:00.0 78-edge nvme5q78							
	... ... similar irq rate for all 32 vectors

iostat w/ posted MSI patch:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd						
nvme3c3n1     2184313.00   8737256.00         0.00         0.00    8737256          0          0						
nvme4c4n1     2184241.00   8736972.00         0.00         0.00    8736972          0          0						
nvme5c5n1     2184269.00   8737080.00         0.00         0.00    8737080          0          0						
nvme6c6n1     2184003.00   8736012.00         0.00         0.00    8736012          0          0						
						
irqtop w/ posted MSI patch:
           IRQ           TOTAL           DELTA NAME                                     							
           PMN      5230078416         5502657 Posted MSI notification event            							
           423       138068935          220128 IR-PCI-MSIX-0000:64:00.0 80-edge nvme4q80							
           425       138057654          219963 IR-PCI-MSIX-0000:64:00.0 82-edge nvme4q82							
           426       138101745          219890 IR-PCI-MSIX-0000:64:00.0 83-edge nvme4q83							
	... ... similar irq rate for all 32 vectors
IRQ coalescing ratio: posted MSI notifications (PMN) / total MSIs = 78%
5.50M / (0.22M * 32) = ~0.78
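
The same ratio can be recomputed from the unrounded irqtop deltas. This is
a small Python sketch; it assumes, as noted above, that all 32 vectors run
at about the same rate as the first one listed, and the variable names are
only illustrative.

```python
# Deltas over 1 sec from the irqtop output with the posted MSI patch.
pmn_per_sec = 5502657             # posted MSI notification events
msi_per_vector_per_sec = 220128   # one representative NVMe vector
num_vectors = 32                  # assumption: all vectors at a similar rate

total_msi_per_sec = msi_per_vector_per_sec * num_vectors
coalescing_ratio = pmn_per_sec / total_msi_per_sec
print(f"total MSIs/sec: {total_msi_per_sec}")
print(f"PMN / total MSIs: {coalescing_ratio:.3f}")
```

In other words, on this setup roughly 78 notifications are taken for every
100 device MSIs delivered.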


Thanks,

Jacob
