[PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI
@ 2023-11-12  4:16 Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code Jacob Pan
                   ` (12 more replies)
  0 siblings, 13 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Hi Thomas and all,

This patch set is aimed to improve IRQ throughput on Intel Xeon by making use of
posted interrupts.

There is a session at LPC2023 IOMMU/VFIO/PCI MC where I will present this
topic. I get this RFC code out for review and discussion but some work is still
in progress.

https://lpc.events/event/17/sessions/172/#20231115

Background
==========
On modern x86 server SoCs, interrupt remapping (IR) is required and turned
on by default to support X2APIC. Two interrupt remapping modes can be supported
by IOMMU:

- Remappable 	(host)
- Posted	(guest only so far)

With remappable mode, the device MSI to CPU process is a HW flow without system
software touch points, it roughly goes as follows:

1.	Devices issue interrupt requests with writes to 0xFEEx_xxxx
2.	The system agent accepts and remaps/translates the IRQ
3.	Upon receiving the translation response, the system agent notifies the
destination CPU with the translated MSI
4.	CPU's local APIC accepts interrupts into its IRR/ISR registers
5.	Interrupt delivered through IDT (MSI vector)

The above process can be inefficient under high IRQ rates. The notifications in
step #3 are often unnecessary when the destination CPU is already overwhelmed
with handling bursts of IRQs. On some architectures, such as Intel Xeon, step #3
is also expensive and requires strong ordering w.r.t DMA. As a result, slower
IRQ rates can become a limiting factor for DMA I/O performance.

For example, on Intel Xeon Sapphire Rapids SoC, as more NVMe disks are attached
to the same socket, FIO (libaio engine) performance per disk drops quickly.

# of disks  	2  	4  	8
-------------------------------------
IOPS(million)  	1.991	1.136  	0.834
(NVMe Gen 5 Samsung PM174x)

With posted mode in interrupt remapping, the interrupt flow is divided into two
parts: posting (storing pending IRQ vector information in memory) and CPU
notification.

The above remappable IRQ flow becomes the following (1 and 2 unchanged):
3.	Notifies the destination CPU with a notification vector
	- IOMMU suppresses CPU notification
	- IOMMU atomic swap IRQ status to memory (PID)
4.	CPU's local APIC accepts the notification interrupt into its IRR/ISR
	registers
5.	Interrupt delivered through IDT (notification vector handler)
	System SW allows new notifications.
(The above flow is not in Linux today since we only use posted mode for VM)

Note that the system software can now suppress CPU notifications at runtime as
needed. This allows the system software to coalesce CPU notifications and in
turn, improve IRQ throughput and DMA performance.

Consider the following scenario when MSIs arrive at a CPU in high-frequency
bursts:

Time ----------------------------------------------------------------------->
    	^ ^ ^       	^ ^ ^ ^     	^   	^
MSIs	A B C       	D E F G     	H   	I

RI  	N  N'  N'     	N  N'  N'  N'  	N   	N

PI  	N           	N           	N   	N

RI: remappable interrupt;  PI:  posted interrupt;
N: interrupt notification, N': superfluous interrupt notification

With remappable interrupt (row titled RI), every MSI generates a notification
event to the CPU.

With posted interrupts enabled in this patchset (row titled PI), CPU
notifications are coalesced during IRQ bursts. N' are eliminated in the flow
above. We refer to this mechanism Coalesced Interrupt Delivery (CID).

Post interrupts have existed for a long time, they have been used for
virtualization where MSIs from directly assigned devices can be delivered to
the guest kernel without VMM intervention. On x86 Intel platforms, posted
interrupts can be used on the host as well. Posted interrupt descriptor (PID)
address is in host physical address.

This patchset enables a new usage of posted interrupts on existing (and new
hardware) for host kernel device MSIs. It is referred to as Posted MSIs
throughout this patch set.

Performance (with this patch set):
==================================
Test #1.

FIO libaio (million IOPS/sec/disk) Gen 5 NVMe Samsung PM174x disks on a single
socket, Intel Xeon Sapphire Rapids.

#disks	Before		After		%Gain
---------------------------------------------
8	0.834		1.943		132%
4	1.136		2.023		78%

Test #2.

Two dedicated workqueues from two Intel Data Streaming Accelerator (DSA)
PCI devices, pin IRQ affinity of the two interrupts to a single CPU.

				Before		After		%Gain
DSA memfill (mil IRQs/sec)	5.157		8.987		74%

DMA throughput has similar improvements.

At lower IRQ rate (< 1 million/second), no performance benefits nor regression
observed so far.

Implementation choices:
======================
- Transparent to the device drivers

- System-wide option instead of per-device or per-IRQ opt-in, i.e. once enabled
  all device MSIs are posted. The benefit is that we only need to change IR
  irq_chip and domain layer. No change to PCI MSI.
  Exceptions are: IOAPIC, HPET, and VT-d's own IRQs

- Limit the number of polling/demuxing loops per CPU notification event

- Only change Intel-IR in IRQ domain hierarchy VECTOR->INTEL-IR->PCI-MSI,

- X86 Intel only so far, can be extended to other architectures with posted
  interrupt support (ARM and AMD), RFC.

- Bare metal only, no posted interrupt capable virtual IOMMU.

Changes and implications (moving from remappable to posted mode)
===============================
1. All MSI vectors are multiplexed into a single notification vector for each
CPU MSI vectors are then de-multiplexed by SW, no IDT delivery for MSIs

2. Losing the following features compared to the remappable mode (AFAIK, none of
the below matters for device MSIs)
- Control of delivery mode, e.g. NMI for MSIs
- No logical destinations, posted interrupt destination is x2APIC
  physical APIC ID
- No per vector stack, since all MSI vectors are multiplexed into one

Runtime changes
===============
The IRQ runtime behavior has changed with this patch, here is a pseudo trace
comparison for 3 MSIs of different vectors arriving in a burst. A system vector
interrupt (e.g. timer) arrives randomly.

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(timer)

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
interrupt(timer)
        process_softirq()

With posted MSI (as pointed out by Thomas Gleixner), both high-priority
interrupts (system interrupt vectors) and softIRQs are blocked during MSI vector
demux loop. Some can be timing sensitive.

Here are the options I have attempted or still working on:

1. Use self-IPI to invoke MSI vector handler but that took away the majority of
the performance benefits.

2. Limit the # of demuxing loops, this is implemented in this patch. Note that
today, we already allow one low priority MSI to block system interrupts. System
vector can preempt MSI vectors without waiting for EOI but we have IRQ disabled
in the ISR.

Performance data (on DSA with MEMFILL) also shows that coalescing more than 3
loops yields diminishing benefits. Therefore, the max loops for coalescing is
set to 3 in this patch.
	MaxLoop		IRQ/sec		bandwidth Mbps
-------------------------------------------------------------------------
	2		6157107 		25219
	3		6226611 		25504
	4		6557081 		26857
	5		6629683 		27155
	6		6662425 		27289

3. limit the time that system interrupts can be blocked (WIP).

4. Make posted MSI notification vector preemptable (WIP)
Chose notification vector with lower priority class bit[7:4] than other system
vectors such that it can be preempted by the system interrupts without waiting
for EOI.

interrupt
    irq_enter()
	local_irq_enable()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	local_irq_disable()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

This is a more intrusive change, my limited understanding is that we do not
allow nested IRQ due to the concern of overflowing the IRQ stack.

But with posted MSI, all MSI vectors are multiplexed into one vector, stack size
should not be a concern. No device MSIs are delivered to the CPU directly
anymore. Alternatively, post MSI vector can use another IST entry.

I appreciate any suggestion in addressing this issue.

In addition, posted MSI uses atomic xchg from both CPU and IOMMU. Compared to
remappable mode, there may be additional cache line ownership contention over
PID. However, we have not observed performance regression at lower IRQ rates.
At high interrupt rate, posted mode always wins.

Testing:
========

The following tests have been performed and continue to be evaluated.
- IRQ affinity change, migration
- CPU offlining
- Multi vector coalescing
- Low IRQ rate, general no-harm test
- VM device assignment
- General no harm test, performance regressions have not been observed for low
IRQ rate workload.

With the patch, a new entry in /proc/interrupts is added.
cat /proc/interrupts | grep PMN
PMN:         13868907 Posted MSI notification event

No change to the device MSI accounting.

A new INTEL-IR-POST irq_chip is visible at IRQ debugfs, e.g.
domain:  IR-PCI-MSIX-0000:6f:01.0-12
 hwirq:   0x8
 chip:    IR-PCI-MSIX-0000:6f:01.0
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-12-13
     hwirq:   0x90000
     chip:    INTEL-IR-POST /* For posted MSIs */
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x65
         chip:    APIC

Acknowledgment
==============

- Rajesh Sankaran and Ashok Raj for the original idea

- Thomas Gleixner for reviewing and guiding the upstream direction of PoC
patches. Help correct my many misunderstandings of the IRQ subsystem.

- Jie J Yan(Jeff), Sebastien Lemarie, and Dan Liang for performance evaluation
with NVMe and network workload

- Bernice Zhang and Scott Morris for functional validation

- Michael Prinke helped me understand how VT-d HW works

- Sanjay Kumar for providing the DSA IRQ test suite

Thanks,

Jacob

Jacob Pan (11):
  x86: Move posted interrupt descriptor out of vmx code
  x86: Add a Kconfig option for posted MSI
  x86: Reserved a per CPU IDT vector for posted MSIs
  iommu/vt-d: Add helper and flag to check/disable posted MSI
  x86/irq: Unionize PID.PIR for 64bit access w/o casting
  x86/irq: Add helpers for checking Intel PID
  x86/irq: Factor out calling ISR from common_interrupt
  x86/irq: Install posted MSI notification handler
  x86/irq: Handle potential lost IRQ during migration and CPU offline
  iommu/vt-d: Add an irq_chip for posted MSIs
  iommu/vt-d: Enable posted mode for device MSIs

Thomas Gleixner (2):
  x86/irq: Set up per host CPU posted interrupt descriptors
  iommu/vt-d: Add a helper to retrieve PID address

 arch/x86/Kconfig                     |  10 ++
 arch/x86/include/asm/apic.h          |   1 +
 arch/x86/include/asm/hardirq.h       |   6 ++
 arch/x86/include/asm/idtentry.h      |   3 +
 arch/x86/include/asm/irq_remapping.h |  11 ++
 arch/x86/include/asm/irq_vectors.h   |  15 ++-
 arch/x86/include/asm/posted_intr.h   | 139 +++++++++++++++++++++++++
 arch/x86/kernel/apic/io_apic.c       |   2 +-
 arch/x86/kernel/apic/vector.c        |  13 ++-
 arch/x86/kernel/cpu/common.c         |   3 +
 arch/x86/kernel/idt.c                |   3 +
 arch/x86/kernel/irq.c                | 147 ++++++++++++++++++++++++---
 arch/x86/kvm/vmx/posted_intr.h       |  93 +----------------
 arch/x86/kvm/vmx/vmx.c               |   1 +
 arch/x86/kvm/vmx/vmx.h               |   2 +-
 drivers/iommu/intel/irq_remapping.c  | 103 ++++++++++++++++++-
 drivers/iommu/irq_remapping.c        |  17 ++++
 17 files changed, 456 insertions(+), 113 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

-- 
2.25.1

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 16:33   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI Jacob Pan
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

To prepare native usage of posted interrupt, move PID declaration out of
VMX code such that they can be shared.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 97 ++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/posted_intr.h     | 93 +---------------------------
 arch/x86/kvm/vmx/vmx.c             |  1 +
 arch/x86/kvm/vmx/vmx.h             |  2 +-
 4 files changed, 100 insertions(+), 93 deletions(-)
 create mode 100644 arch/x86/include/asm/posted_intr.h

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
new file mode 100644
index 000000000000..9f2fa38fa57b
--- /dev/null
+++ b/arch/x86/include/asm/posted_intr.h
@@ -0,0 +1,97 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_POSTED_INTR_H
+#define _X86_POSTED_INTR_H
+
+#define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
+#define PID_TABLE_ENTRY_VALID 1
+
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		struct {
+				/* bit 256 - Outstanding Notification */
+			u16	on	: 1,
+				/* bit 257 - Suppress Notification */
+				sn	: 1,
+				/* bit 271:258 - Reserved */
+				rsvd_1	: 14;
+				/* bit 279:272 - Notification Vector */
+			u8	nv;
+				/* bit 287:280 - Reserved */
+			u8	rsvd_2;
+				/* bit 319:288 - Notification Destination */
+			u32	ndst;
+		};
+		u64 control;
+	};
+	u32 rsvd[6];
+} __aligned(64);
+
+static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
+{
+	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_SN,
+		(unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_set_on(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_ON,
+		(unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_ON,
+		(unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_SN,
+		(unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
+#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..6b2a0226257e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -1,98 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __KVM_X86_VMX_POSTED_INTR_H
 #define __KVM_X86_VMX_POSTED_INTR_H
-
-#define POSTED_INTR_ON  0
-#define POSTED_INTR_SN  1
-
-#define PID_TABLE_ENTRY_VALID 1
-
-/* Posted-Interrupt Descriptor */
-struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
-	union {
-		struct {
-				/* bit 256 - Outstanding Notification */
-			u16	on	: 1,
-				/* bit 257 - Suppress Notification */
-				sn	: 1,
-				/* bit 271:258 - Reserved */
-				rsvd_1	: 14;
-				/* bit 279:272 - Notification Vector */
-			u8	nv;
-				/* bit 287:280 - Reserved */
-			u8	rsvd_2;
-				/* bit 319:288 - Notification Destination */
-			u32	ndst;
-		};
-		u64 control;
-	};
-	u32 rsvd[6];
-} __aligned(64);
-
-static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
-{
-	return test_and_clear_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
-{
-	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
-}
-
-static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
-{
-	return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
-}
-
-static inline void pi_set_sn(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_set_on(struct pi_desc *pi_desc)
-{
-	set_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_on(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_ON,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_sn(struct pi_desc *pi_desc)
-{
-	clear_bit(POSTED_INTR_SN,
-		(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_on(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_ON,
-			(unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_sn(struct pi_desc *pi_desc)
-{
-	return test_bit(POSTED_INTR_SN,
-			(unsigned long *)&pi_desc->control);
-}
+#include <asm/posted_intr.h>
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..d54fa0e06c70 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
 #include "vmx.h"
 #include "x86.h"
 #include "smm.h"
+#include "posted_intr.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..817b76794ee1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -7,10 +7,10 @@
 #include <asm/kvm.h>
 #include <asm/intel_pt.h>
 #include <asm/perf_event.h>
+#include <asm/posted_intr.h>
 
 #include "capabilities.h"
 #include "../kvm_cache_regs.h"
-#include "posted_intr.h"
 #include "vmcs.h"
 #include "vmx_ops.h"
 #include "../cpuid.h"
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 16:35   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs Jacob Pan
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

This option will be used to support delivering MSIs as posted
interrupts. Interrupt remapping is required.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..f16882ddb390 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -463,6 +463,16 @@ config X86_X2APIC
 
 	  If you don't know what to do here, say N.
 
+config X86_POSTED_MSI
+	bool "Enable MSI and MSI-x delivery by posted interrupts"
+	depends on X86_X2APIC && X86_64 && IRQ_REMAP
+	help
+	  This enables MSIs that are under IRQ remapping to be delivered as posted
+	  interrupts to the host kernel. IRQ throughput can potentially be improved
+	  by coalescing CPU notifications during high frequency IRQ bursts.
+
+	  If you don't know what to do here, say N.
+
 config X86_MPPARSE
 	bool "Enable MPS table" if ACPI
 	default y
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 16:47   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI Jacob Pan
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Under posted MSIs, all device MSIs are multiplexed into a single CPU
notification vector. MSI handlers will be de-multiplexed at run-time by
system software without IDT delivery.

This vector has a priority class below the rest of the system vectors.

Potentially, external vector number space for MSIs can be expanded to
the entire 0-256 range.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/irq_vectors.h | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 3a19904c2db6..077ca38f5a91 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -99,9 +99,22 @@
 
 #define LOCAL_TIMER_VECTOR		0xec
 
+/*
+ * Posted interrupt notification vector for all device MSIs delivered to
+ * the host kernel.
+ *
+ * Choose lower priority class bit [7:4] than other system vectors such
+ * that it can be preempted by the system interrupts.
+ *
+ * It is also higher than all external vectors but it should not matter
+ * in that external vectors for posted MSIs are in a different number space.
+ */
+#define POSTED_MSI_NOTIFICATION_VECTOR	0xdf
 #define NR_VECTORS			 256
 
-#ifdef CONFIG_X86_LOCAL_APIC
+#ifdef X86_POSTED_MSI
+#define FIRST_SYSTEM_VECTOR		POSTED_MSI_NOTIFICATION_VECTOR
+#elif defined(CONFIG_X86_LOCAL_APIC)
 #define FIRST_SYSTEM_VECTOR		LOCAL_TIMER_VECTOR
 #else
 #define FIRST_SYSTEM_VECTOR		NR_VECTORS
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (2 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 16:49   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 05/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Allow command line opt-out posted MSI under CONFIG_X86_POSTED_MSI=y.
And add a helper function for testing if posted MSI is supported on the
CPU side.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/irq_remapping.h | 11 +++++++++++
 drivers/iommu/irq_remapping.c        | 17 +++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 7a2ed154a5e1..706f58900962 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -50,6 +50,17 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
 	return x86_vector_domain;
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern unsigned int posted_msi_off;
+
+static inline bool posted_msi_supported(void)
+{
+	return !posted_msi_off && irq_remapping_cap(IRQ_POSTING_CAP);
+}
+#else
+static inline bool posted_msi_supported(void) { return false; };
+#endif
+
 #else  /* CONFIG_IRQ_REMAP */
 
 static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 83314b9d8f38..00de6963bb07 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -24,6 +24,23 @@ int no_x2apic_optout;
 
 int disable_irq_post = 0;
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+unsigned int posted_msi_off;
+
+static int __init cmdl_posted_msi_off(char *str)
+{
+	int value = 0;
+
+	get_option(&str, &value);
+	posted_msi_off = value;
+
+	return 1;
+}
+
+__setup("posted_msi_off=", cmdl_posted_msi_off);
+#endif
+
 static int disable_irq_remap;
 static struct irq_remap_ops *remap_ops;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 05/13] x86/irq: Set up per host CPU posted interrupt descriptors
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (3 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

From: Thomas Gleixner <tglx@linutronix.de>

To support posted MSIs, create a posted interrupt descriptor (PID) for each
host CPU. Later on, when setting up IRQ CPU affinity, IOMMU's interrupt
remapping table entry (IRTE) will point to the physical address of the
matching CPU's PID.

Each PID is initialized with the owner CPU's physical APICID as the
destination.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/posted_intr.h |  7 +++++++
 arch/x86/kernel/cpu/common.c       |  3 +++
 arch/x86/kernel/irq.c              | 13 +++++++++++++
 4 files changed, 26 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 66837b8c67f1..72c6a084dba3 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -48,6 +48,9 @@ typedef struct {
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
 
+#ifdef CONFIG_X86_POSTED_MSI
+DECLARE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+#endif
 #define __ARCH_IRQ_STAT
 
 #define inc_irq_stat(member)	this_cpu_inc(irq_stat.member)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 9f2fa38fa57b..2cd9ac1af835 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -94,4 +94,11 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+#ifdef CONFIG_X86_POSTED_MSI
+extern void intel_posted_msi_init(void);
+
+#else
+static inline void intel_posted_msi_init(void) {};
+
+#endif /* X86_POSTED_MSI */
 #endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4e5ffc8b0e46..08b2d1560f8b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -65,6 +65,7 @@
 #include <asm/set_memory.h>
 #include <asm/traps.h>
 #include <asm/sev.h>
+#include <asm/posted_intr.h>
 
 #include "cpu.h"
 
@@ -2266,6 +2267,8 @@ void cpu_init(void)
 		barrier();
 
 		x2apic_setup();
+
+		intel_posted_msi_init();
 	}
 
 	mmgrab(&init_mm);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 11761c124545..fd4d664d81bb 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -22,6 +22,8 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/thermal.h>
+#include <asm/posted_intr.h>
+#include <asm/irq_remapping.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -334,6 +336,17 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 }
 #endif
 
+#ifdef CONFIG_X86_POSTED_MSI
+
+/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
+DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_interrupt_desc);
+
+void intel_posted_msi_init(void)
+{
+	this_cpu_write(posted_interrupt_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
+	this_cpu_write(posted_interrupt_desc.ndst, this_cpu_read(x86_cpu_to_apicid));
+}
+#endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (4 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 05/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 16:51   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID Jacob Pan
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Make PIR field into u64 such that atomic xchg64 can be used without ugly
casting.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 2cd9ac1af835..3af00f5395e4 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -9,7 +9,10 @@
 
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
-	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		u32 pir[8];     /* Posted interrupt requested */
+		u64 pir_l[4];
+	};
 	union {
 		struct {
 				/* bit 256 - Outstanding Notification */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (5 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 19:02   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 08/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Intel posted interrupt descriptor (PID) stores pending interrupts in its
posted interrupt requests (PIR) bitmap.

Add helper functions to check individual vector status and the entire bitmap.

They are used for interrupt migration and runtime demultiplexing posted MSI
vectors.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h | 31 ++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 3af00f5395e4..12a4fa3ff60e 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -98,9 +98,40 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 }
 
 #ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
+ * own interrupts. Here we do not distinguish them since those vector bits in
+ * PIR will always be zero.
+ */
+static inline bool is_pi_pending_this_cpu(unsigned int vector)
+{
+	struct pi_desc *pid;
+
+	if (WARN_ON(vector > NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
+		return false;
+
+	pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	return (pid->pir[vector >> 5] & (1 << (vector % 32)));
+}
+
+static inline bool is_pir_pending(struct pi_desc *pid)
+{
+	int i;
+
+	for (i = 0; i < 4; i++) {
+		if (pid->pir_l[i])
+			return true;
+	}
+
+	return false;
+}
+
 extern void intel_posted_msi_init(void);
 
 #else
+static inline bool is_pi_pending_this_cpu(unsigned int vector) {return false; }
+
 static inline void intel_posted_msi_init(void) {};
 
 #endif /* X86_POSTED_MSI */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 08/13] x86/irq: Factor out calling ISR from common_interrupt
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (6 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Prepare for calling external IRQ handlers directly from the posted MSI
demultiplexing loop. Extract the common code with common interrupt to
avoid code duplication.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/irq.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index fd4d664d81bb..0bffe8152385 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -242,18 +242,10 @@ static __always_inline void handle_irq(struct irq_desc *desc,
 		__handle_irq(desc, regs);
 }
 
-/*
- * common_interrupt() handles all normal device IRQ's (the special SMP
- * cross-CPU interrupts have their own entry points).
- */
-DEFINE_IDTENTRY_IRQ(common_interrupt)
+static __always_inline void call_irq_handler(int vector, struct pt_regs *regs)
 {
-	struct pt_regs *old_regs = set_irq_regs(regs);
 	struct irq_desc *desc;
 
-	/* entry code tells RCU that we're not quiescent.  Check it. */
-	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
-
 	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
@@ -268,7 +260,20 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
 		}
 	}
+}
+
+/*
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
+ */
+DEFINE_IDTENTRY_IRQ(common_interrupt)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	/* entry code tells RCU that we're not quiescent.  Check it. */
+	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
+	call_irq_handler(vector, regs);
 	set_irq_regs(old_regs);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (7 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 08/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-11-15 12:42   ` Peter Zijlstra
                     ` (2 more replies)
  2023-11-12  4:16 ` [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline Jacob Pan
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

All MSI vectors are multiplexed into a single notification vector when
posted MSI is enabled. It is the responsibility of the notification
vector handler to demultiplex MSI vectors. In this handler, for each
pending bit, MSI vector handlers are dispatched without IDT delivery.

For example, the interrupt flow will change as follows:
(3 MSIs of different vectors arrive in a a high frequency burst)

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

Except for the leading MSI, CPU notifications are skipped/coalesced.

For MSIs arrive at a low frequency, the demultiplexing loop does not
wait for more interrupts to coalesce. Therefore, there's no additional
latency other than the processing time.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/hardirq.h  |  3 ++
 arch/x86/include/asm/idtentry.h |  3 ++
 arch/x86/kernel/idt.c           |  3 ++
 arch/x86/kernel/irq.c           | 91 +++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 72c6a084dba3..6c8daa7518eb 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#ifdef CONFIG_X86_POSTED_MSI
+	unsigned int posted_msi_notification_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 05fd175cec7d..f756e761e7c0 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -644,6 +644,9 @@ DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR,		sysvec_error_interrupt);
 DECLARE_IDTENTRY_SYSVEC(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt);
 DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
 DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
+# ifdef CONFIG_X86_POSTED_MSI
+DECLARE_IDTENTRY_SYSVEC(POSTED_MSI_NOTIFICATION_VECTOR,	sysvec_posted_msi_notification);
+# endif
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index b786d48f5a0f..d5840d777469 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -159,6 +159,9 @@ static const __initconst struct idt_data apic_idts[] = {
 # endif
 	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
 	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_POSTED_MSI
+	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
+# endif
 #endif
 };
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 0bffe8152385..786c2c8330f4 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_POSTED_MSI
+	seq_printf(p, "%*s: ", prec, "PMN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->posted_msi_notification_count);
+	seq_puts(p, "  Posted MSI notification event\n");
 #endif
 	return 0;
 }
@@ -351,6 +358,90 @@ void intel_posted_msi_init(void)
 	this_cpu_write(posted_interrupt_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
 	this_cpu_write(posted_interrupt_desc.ndst, this_cpu_read(x86_cpu_to_apicid));
 }
+
+static __always_inline inline void handle_pending_pir(struct pi_desc *pid, struct pt_regs *regs)
+{
+	int i, vec = FIRST_EXTERNAL_VECTOR;
+	u64 pir_copy[4];
+
+	/*
+	 * Make a copy of PIR which contains IRQ pending bits for vectors,
+	 * then invoke IRQ handlers for each pending vector.
+	 * If any new interrupts were posted while we are processing, will
+	 * do again before allowing new notifications. The idea is to
+	 * minimize the number of the expensive notifications if IRQs come
+	 * in a high frequency burst.
+	 */
+	for (i = 0; i < 4; i++)
+		pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);
+
+	/*
+	 * Ideally, we should start from the high order bits set in the PIR
+	 * since each bit represents a vector. Higher order bit position means
+	 * the vector has higher priority. But external vectors are allocated
+	 * based on availability not priority.
+	 *
+	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
+	 * allows higher priority system interrupt to get in between.
+	 */
+	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
+		call_irq_handler(vec, regs);
+
+}
+
+/*
+ * Performance data shows that 3 is good enough to harvest 90+% of the benefit
+ * on high IRQ rate workload.
+ * Alternatively, could make this tunable, use 3 as default.
+ */
+#define MAX_POSTED_MSI_COALESCING_LOOP 3
+
+/*
+ * For MSIs that are delivered as posted interrupts, the CPU notifications
+ * can be coalesced if the MSIs arrive in high frequency bursts.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pi_desc *pid;
+	int i = 0;
+
+	pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	inc_irq_stat(posted_msi_notification_count);
+	irq_enter();
+
+	while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
+		handle_pending_pir(pid, regs);
+
+		/*
+		 * If there are new interrupts posted in PIR, do again. If
+		 * nothing pending, no need to wait for more interrupts.
+		 */
+		if (is_pir_pending(pid))
+			continue;
+		else
+			break;
+	}
+
+	/*
+	 * Clear outstanding notification bit to allow new IRQ notifications,
+	 * do this last to maximize the window of interrupt coalescing.
+	 */
+	pi_clear_on(pid);
+
+	/*
+	 * There could be a race of PI notification and the clearing of ON bit,
+	 * process PIR bits one last time such that handling the new interrupts
+	 * are not delayed until the next IRQ.
+	 */
+	if (unlikely(is_pir_pending(pid)))
+		handle_pending_pir(pid, regs);
+
+	apic_eoi();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
 #endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (8 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 20:09   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

Though IRTE modification for IRQ affinity change is a atomic operation,
it does not guarantee the timing of IRQ posting at PID.

considered the following scenario:
	Device		system agent		iommu		memory 		CPU/LAPIC
1	FEEX_XXXX
2			Interrupt request
3						Fetch IRTE	->
4						->Atomic Swap PID.PIR(vec)
						Push to Global Observable(GO)
5						if (ON*)
	i						done;*
						else
6							send a notification ->

* ON: outstanding notification, 1 will suppress new notifications

If IRQ affinity change happens between 3 and 5 in IOMMU, old CPU's PIR could
have pending bit set for the vector being moved. We must check PID.PIR
to prevent the lost of interrupts.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/kernel/apic/vector.c |  8 +++++++-
 arch/x86/kernel/irq.c         | 20 +++++++++++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 319448d87b99..14fc33cfdb37 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -19,6 +19,7 @@
 #include <asm/apic.h>
 #include <asm/i8259.h>
 #include <asm/desc.h>
+#include <asm/posted_intr.h>
 #include <asm/irq_remapping.h>
 
 #include <asm/trace/irq_vectors.h>
@@ -978,9 +979,14 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 		 * Do not check IRR when called from lapic_offline(), because
 		 * fixup_irqs() was just called to scan IRR for set bits and
 		 * forward them to new destination CPUs via IPIs.
+		 *
+		 * If the vector to be cleaned is delivered as posted intr,
+		 * it is possible that the interrupt has been posted but
+		 * not made to the IRR due to coalesced notifications.
+		 * Therefore, check PIR to see if the interrupt was posted.
 		 */
 		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
-		if (irr & (1U << (vector % 32))) {
+		if (irr & (1U << (vector % 32)) || is_pi_pending_this_cpu(vector)) {
 			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
 			rearm = true;
 			continue;
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 786c2c8330f4..7732cb9bbf0c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -444,11 +444,26 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
 }
 #endif /* X86_POSTED_MSI */
 
+/*
+ * Check if a given vector is pending in APIC IRR or PIR if posted interrupt
+ * is enabled for coalesced interrupt delivery (CID).
+ */
+static inline bool is_vector_pending(unsigned int vector)
+{
+	unsigned int irr;
+
+	irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
+	if (irr  & (1 << (vector % 32)))
+		return true;
+
+	return is_pi_pending_this_cpu(vector);
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
 void fixup_irqs(void)
 {
-	unsigned int irr, vector;
+	unsigned int vector;
 	struct irq_desc *desc;
 	struct irq_data *data;
 	struct irq_chip *chip;
@@ -475,8 +490,7 @@ void fixup_irqs(void)
 		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
 			continue;
 
-		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
-		if (irr  & (1 << (vector % 32))) {
+		if (is_vector_pending(vector)) {
 			desc = __this_cpu_read(vector_irq[vector]);
 
 			raw_spin_lock(&desc->lock);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (9 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 20:15   ` Thomas Gleixner
  2023-12-06 20:44   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
  2023-11-12  4:16 ` [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
  12 siblings, 2 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

With posted MSIs, end of interrupt is handled by the notification
handler. Each MSI handler does not go through local APIC IRR, ISR
processing. There's no need to do apic_eoi() in those handlers.

Add a new acpi_ack_irq_no_eoi() for the posted MSI IR chip. At runtime
the call trace looks like:

__sysvec_posted_msi_notification() {
  irq_chip_ack_parent() {
    apic_ack_irq_no_eoi();
  }
  handle_irq_event() {
    handle_irq_event_percpu() {
       driver_handler()
    }
  }

IO-APIC IR is excluded the from posted MSI, we need to make sure it
still performs EOI.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/apic.h         |  1 +
 arch/x86/kernel/apic/io_apic.c      |  2 +-
 arch/x86/kernel/apic/vector.c       |  5 ++++
 drivers/iommu/intel/irq_remapping.c | 38 ++++++++++++++++++++++++++++-
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 5af4ec1a0f71..a88015d5638b 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -485,6 +485,7 @@ static inline void apic_setup_apic_calls(void) { }
 #endif /* CONFIG_X86_LOCAL_APIC */
 
 extern void apic_ack_irq(struct irq_data *data);
+extern void apic_ack_irq_no_eoi(struct irq_data *data);
 
 static inline bool lapic_vector_set_in_irr(unsigned int vector)
 {
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 00da6cf6b07d..ca398ee9075b 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1993,7 +1993,7 @@ static struct irq_chip ioapic_ir_chip __read_mostly = {
 	.irq_startup		= startup_ioapic_irq,
 	.irq_mask		= mask_ioapic_irq,
 	.irq_unmask		= unmask_ioapic_irq,
-	.irq_ack		= irq_chip_ack_parent,
+	.irq_ack		= apic_ack_irq,
 	.irq_eoi		= ioapic_ir_ack_level,
 	.irq_set_affinity	= ioapic_set_affinity,
 	.irq_retrigger		= irq_chip_retrigger_hierarchy,
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 14fc33cfdb37..01223ac4f57a 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -911,6 +911,11 @@ void apic_ack_irq(struct irq_data *irqd)
 	apic_eoi();
 }
 
+void apic_ack_irq_no_eoi(struct irq_data *irqd)
+{
+	irq_move_irq(irqd);
+}
+
 void apic_ack_edge(struct irq_data *irqd)
 {
 	irq_complete_move(irqd_cfg(irqd));
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 29b9e55dcf26..f2870d3c8313 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1233,6 +1233,42 @@ static struct irq_chip intel_ir_chip = {
 	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
 };
 
+/*
+ * With posted MSIs, all vectors are multiplexed into a single notification
+ * vector. Devices MSIs are then dispatched in a demux loop where
+ * EOIs can be coalesced as well.
+ *
+ * IR chip "INTEL-IR-POST" does not do EOI on ACK instead letting posted
+ * interrupt notification handler to perform EOI.
+ *
+ * For the example below, 3 MSIs are coalesced in one CPU notification. Only
+ * one apic_eoi() is needed.
+ *
+ * __sysvec_posted_msi_notification() {
+ * irq_enter()
+ *	handle_edge_irq()
+ *		irq_chip_ack_parent()
+ *			apic_ack_irq_no_eoi()
+ *		handle_irq()
+ *	handle_edge_irq()
+ *		irq_chip_ack_parent()
+ *			apic_ack_irq_no_eoi()
+ *		handle_irq()
+ *	handle_edge_irq()
+ *		irq_chip_ack_parent()
+ *			apic_ack_irq_no_eoi()
+ *		handle_irq()
+ *	apic_eoi()
+ * irq_exit()
+ */
+static struct irq_chip intel_ir_chip_post_msi = {
+	.name			= "INTEL-IR-POST",
+	.irq_ack		= apic_ack_irq_no_eoi,
+	.irq_set_affinity	= intel_ir_set_affinity,
+	.irq_compose_msi_msg	= intel_ir_compose_msi_msg,
+	.irq_set_vcpu_affinity	= intel_ir_set_vcpu_affinity,
+};
+
 static void fill_msi_msg(struct msi_msg *msg, u32 index, u32 subhandle)
 {
 	memset(msg, 0, sizeof(*msg));
@@ -1361,7 +1397,7 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
 
 		irq_data->hwirq = (index << 16) + i;
 		irq_data->chip_data = ird;
-		irq_data->chip = &intel_ir_chip;
+		irq_data->chip = posted_msi_supported() ? &intel_ir_chip_post_msi : &intel_ir_chip;
 		intel_irq_remapping_prepare_irte(ird, irq_cfg, info, index, i);
 		irq_set_status_flags(virq + i, IRQ_MOVE_PCNTXT);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (10 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 20:19   ` Thomas Gleixner
  2023-11-12  4:16 ` [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

From: Thomas Gleixner <tglx@linutronix.de>

When programming IRTE for posted mode, we need to retrieve the physical
address of the posted interrupt descriptor (PID) that belongs to it's
target CPU.

This per CPU PID has already been set up during cpu_init().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/iommu/intel/irq_remapping.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index f2870d3c8313..971e6c37002f 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1125,6 +1125,15 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.reenable		= reenable_irq_remapping,
 	.enable_faulting	= enable_drhd_fault_handling,
 };
+#ifdef CONFIG_X86_POSTED_MSI
+
+static u64 get_pi_desc_addr(struct irq_data *irqd)
+{
+	int cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
+
+	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
+}
+#endif
 
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs
  2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
                   ` (11 preceding siblings ...)
  2023-11-12  4:16 ` [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
@ 2023-11-12  4:16 ` Jacob Pan
  2023-12-06 20:26   ` Thomas Gleixner
  12 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-12  4:16 UTC (permalink / raw)
  To: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

With posted MSI feature is enabled on the CPU side, iommu IRTEs for
device MSIs can be allocated, activated, and programed in posted mode.
This means that IRTEs are linked with their respective PIDs of the
target CPU.

Excluding the following:
- legacy devices IOAPIC, HPET (may be needed for booting, not a source
of high MSIs)
- VT-d's own IRQs (not remappable).

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 arch/x86/include/asm/posted_intr.h  |  1 +
 drivers/iommu/intel/irq_remapping.c | 55 ++++++++++++++++++++++++++---
 2 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 12a4fa3ff60e..c6d245f53225 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _X86_POSTED_INTR_H
 #define _X86_POSTED_INTR_H
+#include <asm/irq_vectors.h>
 
 #define POSTED_INTR_ON  0
 #define POSTED_INTR_SN  1
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 971e6c37002f..1b88846d5338 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -19,6 +19,7 @@
 #include <asm/cpu.h>
 #include <asm/irq_remapping.h>
 #include <asm/pci-direct.h>
+#include <asm/posted_intr.h>
 
 #include "iommu.h"
 #include "../irq_remapping.h"
@@ -49,6 +50,7 @@ struct irq_2_iommu {
 	u16 sub_handle;
 	u8  irte_mask;
 	enum irq_mode mode;
+	bool posted_msi;
 };
 
 struct intel_ir_data {
@@ -1118,6 +1120,14 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	irte->redir_hint = 1;
 }
 
+static void prepare_irte_posted(struct irte *irte)
+{
+	memset(irte, 0, sizeof(*irte));
+
+	irte->present = 1;
+	irte->p_pst = 1;
+}
+
 struct irq_remap_ops intel_irq_remap_ops = {
 	.prepare		= intel_prepare_irq_remapping,
 	.enable			= intel_enable_irq_remapping,
@@ -1125,6 +1135,7 @@ struct irq_remap_ops intel_irq_remap_ops = {
 	.reenable		= reenable_irq_remapping,
 	.enable_faulting	= enable_drhd_fault_handling,
 };
+
 #ifdef CONFIG_X86_POSTED_MSI
 
 static u64 get_pi_desc_addr(struct irq_data *irqd)
@@ -1133,6 +1144,29 @@ static u64 get_pi_desc_addr(struct irq_data *irqd)
 
 	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
 }
+
+static void intel_ir_reconfigure_irte_posted(struct irq_data *irqd)
+{
+	struct intel_ir_data *ir_data = irqd->chip_data;
+	struct irte *irte = &ir_data->irte_entry;
+	struct irte irte_pi;
+	u64 pid_addr;
+
+	pid_addr = get_pi_desc_addr(irqd);
+
+	memset(&irte_pi, 0, sizeof(irte_pi));
+
+	/* The shared IRTE already be set up as posted during alloc_irte */
+	dmar_copy_shared_irte(&irte_pi, irte);
+
+	irte_pi.pda_l = (pid_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
+	irte_pi.pda_h = (pid_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);
+
+	modify_irte(&ir_data->irq_2_iommu, &irte_pi);
+}
+
+#else
+static inline void intel_ir_reconfigure_irte_posted(struct irq_data *irqd) {}
 #endif
 
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
@@ -1148,8 +1182,9 @@ static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
 	irte->vector = cfg->vector;
 	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
 
-	/* Update the hardware only if the interrupt is in remapped mode. */
-	if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
+	if (ir_data->irq_2_iommu.posted_msi)
+		intel_ir_reconfigure_irte_posted(irqd);
+	else if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
 		modify_irte(&ir_data->irq_2_iommu, irte);
 }
 
@@ -1203,7 +1238,7 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 	struct intel_ir_data *ir_data = data->chip_data;
 	struct vcpu_data *vcpu_pi_info = info;
 
-	/* stop posting interrupts, back to remapping mode */
+	/* stop posting interrupts, back to the default mode */
 	if (!vcpu_pi_info) {
 		modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
 	} else {
@@ -1300,10 +1335,14 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 {
 	struct irte *irte = &data->irte_entry;
 
-	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+	if (data->irq_2_iommu.mode == IRQ_POSTING)
+		prepare_irte_posted(irte);
+	else
+		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
 
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
+		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
 		/* Set source-id of interrupt request */
 		set_ioapic_sid(irte, info->devid);
 		apic_printk(APIC_VERBOSE, KERN_DEBUG "IOAPIC[%d]: Set IRTE entry (P:%d FPD:%d Dst_Mode:%d Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X Avail:%X Vector:%02X Dest:%08X SID:%04X SQ:%X SVT:%X)\n",
@@ -1315,10 +1354,18 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 		sub_handle = info->ioapic.pin;
 		break;
 	case X86_IRQ_ALLOC_TYPE_HPET:
+		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
 		set_hpet_sid(irte, info->devid);
 		break;
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
+		if (posted_msi_supported()) {
+			prepare_irte_posted(irte);
+			data->irq_2_iommu.posted_msi = 1;
+		} else {
+			prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+		}
+
 		set_msi_sid(irte,
 			    pci_real_dma_dev(msi_desc_to_pci_dev(info->desc)));
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
@ 2023-11-15 12:42   ` Peter Zijlstra
  2023-11-15 20:05     ` Jacob Pan
  2023-11-15 12:56   ` Peter Zijlstra
  2023-12-06 19:14   ` Thomas Gleixner
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2023-11-15 12:42 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy

On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:

> +static __always_inline inline void handle_pending_pir(struct pi_desc *pid, struct pt_regs *regs)
> +{
> +	int i, vec = FIRST_EXTERNAL_VECTOR;
> +	u64 pir_copy[4];
> +
> +	/*
> +	 * Make a copy of PIR which contains IRQ pending bits for vectors,
> +	 * then invoke IRQ handlers for each pending vector.
> +	 * If any new interrupts were posted while we are processing, will
> +	 * do again before allowing new notifications. The idea is to
> +	 * minimize the number of the expensive notifications if IRQs come
> +	 * in a high frequency burst.
> +	 */
> +	for (i = 0; i < 4; i++)
> +		pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);

Might as well use arch_xchg() and save the atomic64_t casting.

> +
> +	/*
> +	 * Ideally, we should start from the high order bits set in the PIR
> +	 * since each bit represents a vector. Higher order bit position means
> +	 * the vector has higher priority. But external vectors are allocated
> +	 * based on availability not priority.
> +	 *
> +	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
> +	 * allows higher priority system interrupt to get in between.
> +	 */
> +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> +		call_irq_handler(vec, regs);
> +
> +}

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
  2023-11-15 12:42   ` Peter Zijlstra
@ 2023-11-15 12:56   ` Peter Zijlstra
  2023-11-15 20:04     ` Jacob Pan
  2023-12-06 19:50     ` Thomas Gleixner
  2023-12-06 19:14   ` Thomas Gleixner
  2 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2023-11-15 12:56 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy

On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:

> +static __always_inline inline void handle_pending_pir(struct pi_desc *pid, struct pt_regs *regs)
> +{

__always_inline means that... (A)

> +	int i, vec = FIRST_EXTERNAL_VECTOR;
> +	u64 pir_copy[4];
> +
> +	/*
> +	 * Make a copy of PIR which contains IRQ pending bits for vectors,
> +	 * then invoke IRQ handlers for each pending vector.
> +	 * If any new interrupts were posted while we are processing, will
> +	 * do again before allowing new notifications. The idea is to
> +	 * minimize the number of the expensive notifications if IRQs come
> +	 * in a high frequency burst.
> +	 */
> +	for (i = 0; i < 4; i++)
> +		pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);
> +
> +	/*
> +	 * Ideally, we should start from the high order bits set in the PIR
> +	 * since each bit represents a vector. Higher order bit position means
> +	 * the vector has higher priority. But external vectors are allocated
> +	 * based on availability not priority.
> +	 *
> +	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
> +	 * allows higher priority system interrupt to get in between.
> +	 */
> +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> +		call_irq_handler(vec, regs);
> +
> +}
> +
> +/*
> + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> + * on high IRQ rate workload.
> + * Alternatively, could make this tunable, use 3 as default.
> + */
> +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> +
> +/*
> + * For MSIs that are delivered as posted interrupts, the CPU notifications
> + * can be coalesced if the MSIs arrive in high frequency bursts.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> +{
> +	struct pt_regs *old_regs = set_irq_regs(regs);
> +	struct pi_desc *pid;
> +	int i = 0;
> +
> +	pid = this_cpu_ptr(&posted_interrupt_desc);
> +
> +	inc_irq_stat(posted_msi_notification_count);
> +	irq_enter();
> +
> +	while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
> +		handle_pending_pir(pid, regs);
> +
> +		/*
> +		 * If there are new interrupts posted in PIR, do again. If
> +		 * nothing pending, no need to wait for more interrupts.
> +		 */
> +		if (is_pir_pending(pid))

So this reads those same 4 words we xchg in handle_pending_pir(), right?

> +			continue;
> +		else
> +			break;
> +	}
> +
> +	/*
> +	 * Clear outstanding notification bit to allow new IRQ notifications,
> +	 * do this last to maximize the window of interrupt coalescing.
> +	 */
> +	pi_clear_on(pid);
> +
> +	/*
> +	 * There could be a race of PI notification and the clearing of ON bit,
> +	 * process PIR bits one last time such that handling the new interrupts
> +	 * are not delayed until the next IRQ.
> +	 */
> +	if (unlikely(is_pir_pending(pid)))
> +		handle_pending_pir(pid, regs);

(A) ... we get _two_ copies of that thing in this function. Does that
make sense ?

> +
> +	apic_eoi();
> +	irq_exit();
> +	set_irq_regs(old_regs);
> +}
>  #endif /* X86_POSTED_MSI */

Would it not make more sense to write things something like:

bool handle_pending_pir()
{
	bool handled = false;
	u64 pir_copy[4];

	for (i = 0; i < 4; i++) {
		if (!pid-pir_l[i]) {
			pir_copy[i] = 0;
			continue;
		}

		pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
		handled |= true;
	}

	if (!handled)
		return handled;

	for_each_set_bit()
		....

	return handled.
}

sysvec_posted_blah_blah()
{
	bool done = false;
	bool handled;

	for (;;) {
		handled = handle_pending_pir();
		if (done)
			break;
		if (!handled || ++loops > MAX_LOOPS) {
			pi_clear_on(pid);
			/* once more after clear_on */
			done = true;
		}
	}
}


Hmm?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-15 12:56   ` Peter Zijlstra
@ 2023-11-15 20:04     ` Jacob Pan
  2023-11-15 20:25       ` Peter Zijlstra
  2023-12-06 19:50     ` Thomas Gleixner
  1 sibling, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-11-15 20:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Peter,

On Wed, 15 Nov 2023 13:56:24 +0100, Peter Zijlstra <peterz@infradead.org>
wrote:

> On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:
> 
> > +static __always_inline inline void handle_pending_pir(struct pi_desc
> > *pid, struct pt_regs *regs) +{  
> 
> __always_inline means that... (A)
> 
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	u64 pir_copy[4];
> > +
> > +	/*
> > +	 * Make a copy of PIR which contains IRQ pending bits for
> > vectors,
> > +	 * then invoke IRQ handlers for each pending vector.
> > +	 * If any new interrupts were posted while we are processing,
> > will
> > +	 * do again before allowing new notifications. The idea is to
> > +	 * minimize the number of the expensive notifications if IRQs
> > come
> > +	 * in a high frequency burst.
> > +	 */
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = raw_atomic64_xchg((atomic64_t
> > *)&pid->pir_l[i], 0); +
> > +	/*
> > +	 * Ideally, we should start from the high order bits set in
> > the PIR
> > +	 * since each bit represents a vector. Higher order bit
> > position means
> > +	 * the vector has higher priority. But external vectors are
> > allocated
> > +	 * based on availability not priority.
> > +	 *
> > +	 * EOI is included in the IRQ handlers call to apic_ack_irq,
> > which
> > +	 * allows higher priority system interrupt to get in between.
> > +	 */
> > +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> > +		call_irq_handler(vec, regs);
> > +
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the
> > benefit
> > + * on high IRQ rate workload.
> > + * Alternatively, could make this tunable, use 3 as default.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU
> > notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */
> > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> > +{
> > +	struct pt_regs *old_regs = set_irq_regs(regs);
> > +	struct pi_desc *pid;
> > +	int i = 0;
> > +
> > +	pid = this_cpu_ptr(&posted_interrupt_desc);
> > +
> > +	inc_irq_stat(posted_msi_notification_count);
> > +	irq_enter();
> > +
> > +	while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
> > +		handle_pending_pir(pid, regs);
> > +
> > +		/*
> > +		 * If there are new interrupts posted in PIR, do
> > again. If
> > +		 * nothing pending, no need to wait for more
> > interrupts.
> > +		 */
> > +		if (is_pir_pending(pid))  
> 
> So this reads those same 4 words we xchg in handle_pending_pir(), right?
> 
> > +			continue;
> > +		else
> > +			break;
> > +	}
> > +
> > +	/*
> > +	 * Clear outstanding notification bit to allow new IRQ
> > notifications,
> > +	 * do this last to maximize the window of interrupt coalescing.
> > +	 */
> > +	pi_clear_on(pid);
> > +
> > +	/*
> > +	 * There could be a race of PI notification and the clearing
> > of ON bit,
> > +	 * process PIR bits one last time such that handling the new
> > interrupts
> > +	 * are not delayed until the next IRQ.
> > +	 */
> > +	if (unlikely(is_pir_pending(pid)))
> > +		handle_pending_pir(pid, regs);  
> 
> (A) ... we get _two_ copies of that thing in this function. Does that
> make sense ?
> 
> > +
> > +	apic_eoi();
> > +	irq_exit();
> > +	set_irq_regs(old_regs);
> > +}
> >  #endif /* X86_POSTED_MSI */  
> 
> Would it not make more sense to write things something like:
> 
it is a great idea, we can save expensive xchg if pir[i] is 0. But I have
to tweak a little to let it perform better.

> bool handle_pending_pir()
> {
> 	bool handled = false;
> 	u64 pir_copy[4];
> 
> 	for (i = 0; i < 4; i++) {
> 		if (!pid-pir_l[i]) {
> 			pir_copy[i] = 0;
> 			continue;
> 		}
> 
> 		pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
we are interleaving cacheline read and xchg. So made it to

	for (i = 0; i < 4; i++) {
		pir_copy[i] = pid->pir_l[i];
	}

	for (i = 0; i < 4; i++) {
		if (pir_copy[i]) {
			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
			handled = true;
		}
	}

With DSA MEMFILL test just one queue one MSI, we are saving 3 xchg per loop.
Here is the performance comparison in IRQ rate:

Original RFC 9.29 m/sec, 
Optimized in your email 8.82m/sec,
Tweaked above: 9.54m/s

I need to test with more MSI vectors spreading out to all 4 u64. I suspect
the benefit will decrease since we need to do both read and xchg for
non-zero entries.

> 		handled |= true;
> 	}


> 
> 	if (!handled)
> 		return handled;
> 
> 	for_each_set_bit()
> 		....
> 
> 	return handled.
> }
> 
> sysvec_posted_blah_blah()
> {
> 	bool done = false;
> 	bool handled;
> 
> 	for (;;) {
> 		handled = handle_pending_pir();
> 		if (done)
> 			break;
> 		if (!handled || ++loops > MAX_LOOPS) {
> 			pi_clear_on(pid);
> 			/* once more after clear_on */
> 			done = true;
> 		}
> 	}
> }
> 
> 
> Hmm?




Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-15 12:42   ` Peter Zijlstra
@ 2023-11-15 20:05     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-11-15 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Peter,

On Wed, 15 Nov 2023 13:42:21 +0100, Peter Zijlstra <peterz@infradead.org>
wrote:

> On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:
> 
> > +static __always_inline inline void handle_pending_pir(struct pi_desc
> > *pid, struct pt_regs *regs) +{
> > +	int i, vec = FIRST_EXTERNAL_VECTOR;
> > +	u64 pir_copy[4];
> > +
> > +	/*
> > +	 * Make a copy of PIR which contains IRQ pending bits for
> > vectors,
> > +	 * then invoke IRQ handlers for each pending vector.
> > +	 * If any new interrupts were posted while we are processing,
> > will
> > +	 * do again before allowing new notifications. The idea is to
> > +	 * minimize the number of the expensive notifications if IRQs
> > come
> > +	 * in a high frequency burst.
> > +	 */
> > +	for (i = 0; i < 4; i++)
> > +		pir_copy[i] = raw_atomic64_xchg((atomic64_t
> > *)&pid->pir_l[i], 0);  
> 
> Might as well use arch_xchg() and save the atomic64_t casting.
will do

> 
>  [...]  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-15 20:04     ` Jacob Pan
@ 2023-11-15 20:25       ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2023-11-15 20:25 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Thomas Gleixner, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy

On Wed, Nov 15, 2023 at 12:04:01PM -0800, Jacob Pan wrote:

> we are interleaving cacheline read and xchg. So made it to

Hmm, I wasn't expecting that to be a problem, but sure.

> 	for (i = 0; i < 4; i++) {
> 		pir_copy[i] = pid->pir_l[i];
> 	}
> 
> 	for (i = 0; i < 4; i++) {
> 		if (pir_copy[i]) {
> 			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
> 			handled = true;
> 		}
> 	}
> 
> With DSA MEMFILL test just one queue one MSI, we are saving 3 xchg per loop.
> Here is the performance comparison in IRQ rate:
> 
> Original RFC 9.29 m/sec, 
> Optimized in your email 8.82m/sec,
> Tweaked above: 9.54m/s
> 
> I need to test with more MSI vectors spreading out to all 4 u64. I suspect
> the benefit will decrease since we need to do both read and xchg for
> non-zero entries.

Ah, but performance was not the reason I suggested this. Code
compactness and clarity was.

Possibly using less xchg is just a bonus :-)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-11-12  4:16 ` [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code Jacob Pan
@ 2023-12-06 16:33   ` Thomas Gleixner
  2023-12-08  4:54     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 16:33 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> +/* Posted-Interrupt Descriptor */
> +struct pi_desc {
> +	u32 pir[8];     /* Posted interrupt requested */
> +	union {
> +		struct {
> +				/* bit 256 - Outstanding Notification */
> +			u16	on	: 1,
> +				/* bit 257 - Suppress Notification */
> +				sn	: 1,
> +				/* bit 271:258 - Reserved */
> +				rsvd_1	: 14;
> +				/* bit 279:272 - Notification Vector */
> +			u8	nv;
> +				/* bit 287:280 - Reserved */
> +			u8	rsvd_2;
> +				/* bit 319:288 - Notification Destination */
> +			u32	ndst;

This mixture of bitfields and types is weird and really not intuitive:

/* Posted-Interrupt Descriptor */
struct pi_desc {
	/* Posted interrupt requested */
	u32			pir[8];

	union {
		struct {
				/* bit 256 - Outstanding Notification */
			u64	on	:  1,
				/* bit 257 - Suppress Notification */
				sn	:  1,
				/* bit 271:258 - Reserved */
					: 14,
				/* bit 279:272 - Notification Vector */
				nv	:  8,
				/* bit 287:280 - Reserved */
					:  8,
				/* bit 319:288 - Notification Destination */
				ndst	: 32;
		};
		u64		control;
	};
	u32			rsvd[6];
} __aligned(64);

Hmm?

> +static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
> +{
> +	return test_and_set_bit(POSTED_INTR_ON,
> +			(unsigned long *)&pi_desc->control);

Please get rid of those line breaks.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI
  2023-11-12  4:16 ` [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI Jacob Pan
@ 2023-12-06 16:35   ` Thomas Gleixner
  2023-12-09 21:24     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 16:35 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> This option will be used to support delivering MSIs as posted
> interrupts. Interrupt remapping is required.

The last sentence does not make sense.

> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/Kconfig | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..f16882ddb390 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -463,6 +463,16 @@ config X86_X2APIC
>  
>  	  If you don't know what to do here, say N.
>  
> +config X86_POSTED_MSI
> +	bool "Enable MSI and MSI-x delivery by posted interrupts"
> +	depends on X86_X2APIC && X86_64 && IRQ_REMAP
> +	help
> +	  This enables MSIs that are under IRQ remapping to be delivered as posted

s/IRQ/interrupt/

This is text and not Xitter.


> +	  interrupts to the host kernel. IRQ throughput can potentially be improved
> +	  by coalescing CPU notifications during high frequency IRQ bursts.
> +
> +	  If you don't know what to do here, say N.
> +
>  config X86_MPPARSE
>  	bool "Enable MPS table" if ACPI
>  	default y

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs
  2023-11-12  4:16 ` [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs Jacob Pan
@ 2023-12-06 16:47   ` Thomas Gleixner
  2023-12-09 21:53     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 16:47 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:

$Subject: x86/vector: Reserve ...

> Under posted MSIs, all device MSIs are multiplexed into a single CPU

Under?

> notification vector. MSI handlers will be de-multiplexed at run-time by
> system software without IDT delivery.
>
> This vector has a priority class below the rest of the system vectors.

Why?

> Potentially, external vector number space for MSIs can be expanded to
> the entire 0-256 range.

Don't even mention this. It's wishful thinking and has absolutely
nothing to do with the patch at hand.

> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/include/asm/irq_vectors.h | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 3a19904c2db6..077ca38f5a91 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -99,9 +99,22 @@
>  
>  #define LOCAL_TIMER_VECTOR		0xec
>  
> +/*
> + * Posted interrupt notification vector for all device MSIs delivered to
> + * the host kernel.
> + *
> + * Choose lower priority class bit [7:4] than other system vectors such
> + * that it can be preempted by the system interrupts.

That's future music and I'm not convinced at all that we want to allow
nested interrupts with all their implications. Stack depth is the least
of the worries here. There are enough other assumptions about interrupts
not nesting in Linux.

> + * It is also higher than all external vectors but it should not matter
> + * in that external vectors for posted MSIs are in a different number space.

This whole priority muck is pointless. The kernel never used it and will
never use it.

> + */
> +#define POSTED_MSI_NOTIFICATION_VECTOR	0xdf

So this just wants to go into the regular system vector number space
until there is a conclusion whether we can and want to allow nested
interrupts. Premature optimization is just creating more confusion than
value.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI
  2023-11-12  4:16 ` [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI Jacob Pan
@ 2023-12-06 16:49   ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 16:49 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> Allow command line opt-out posted MSI under CONFIG_X86_POSTED_MSI=y.
> And add a helper function for testing if posted MSI is supported on the
> CPU side.

That's backwards. You want command line opt-in first and not enforce
this posted muck on everyone including RT which will regress and suffer
from that.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting
  2023-11-12  4:16 ` [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
@ 2023-12-06 16:51   ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 16:51 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> Make PIR field into u64 such that atomic xchg64 can be used without ugly
> casting.

Make PIR field into... That's not a sentence.


> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/include/asm/posted_intr.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
> index 2cd9ac1af835..3af00f5395e4 100644
> --- a/arch/x86/include/asm/posted_intr.h
> +++ b/arch/x86/include/asm/posted_intr.h
> @@ -9,7 +9,10 @@
>  
>  /* Posted-Interrupt Descriptor */
>  struct pi_desc {
> -	u32 pir[8];     /* Posted interrupt requested */
> +	union {
> +		u32 pir[8];     /* Posted interrupt requested */
> +		u64 pir_l[4];

pir_l is really not intuitive. What's wrong with spelling the type out
in the name: pir64[4] ?

> +	};
>  	union {
>  		struct {
>  				/* bit 256 - Outstanding Notification */

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID
  2023-11-12  4:16 ` [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID Jacob Pan
@ 2023-12-06 19:02   ` Thomas Gleixner
  2024-01-26 23:31     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 19:02 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:

That 'Intel PID' in the subject line sucks. What's wrong with writing
things out?

       x86/irq: Add accessors for posted interrupt descriptors

Hmm?

> Intel posted interrupt descriptor (PID) stores pending interrupts in its
> posted interrupt requests (PIR) bitmap.
>
> Add helper functions to check individual vector status and the entire bitmap.
>
> They are used for interrupt migration and runtime demultiplexing posted MSI
> vectors.

This is all backwards.

  Posted interrupts are controlled by and pending interrupts are marked in
  the posted interrupt descriptor. The upcoming support for host side
  posted interrupts requires accessors to check for pending vectors.

  Add ....

>  #ifdef CONFIG_X86_POSTED_MSI
> +/*
> + * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
> + * own interrupts. Here we do not distinguish them since those vector bits in
> + * PIR will always be zero.
> + */
> +static inline bool is_pi_pending_this_cpu(unsigned int vector)

Can you please use a proper name space pi_.....() instead of this
is_...() muck which is horrible to grep for. It's documented ....

> +{
> +	struct pi_desc *pid;
> +
> +	if (WARN_ON(vector > NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
> +		return false;

Haha. So much about your 'can use the full vector space' dreams .... And
WARN_ON_ONCE() please.

> +
> +	pid = this_cpu_ptr(&posted_interrupt_desc);

Also this can go into the declaration line.

> +
> +	return (pid->pir[vector >> 5] & (1 << (vector % 32)));

  __test_bit() perhaps?

> +}

> +static inline bool is_pir_pending(struct pi_desc *pid)
> +{
> +	int i;
> +
> +	for (i = 0; i < 4; i++) {
> +		if (pid->pir_l[i])
> +			return true;
> +	}
> +
> +	return false;

This is required because pi_is_pir_empty() is checking the other way
round, right?

> +}
> +
>  extern void intel_posted_msi_init(void);
>  
>  #else
> +static inline bool is_pi_pending_this_cpu(unsigned int vector) {return false; }

lacks space before 'return'

> +
>  static inline void intel_posted_msi_init(void) {};
>  
>  #endif /* X86_POSTED_MSI */

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
  2023-11-15 12:42   ` Peter Zijlstra
  2023-11-15 12:56   ` Peter Zijlstra
@ 2023-12-06 19:14   ` Thomas Gleixner
  2 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 19:14 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> +	/*
> +	 * Ideally, we should start from the high order bits set in the PIR
> +	 * since each bit represents a vector. Higher order bit position means
> +	 * the vector has higher priority. But external vectors are allocated
> +	 * based on availability not priority.
> +	 *
> +	 * EOI is included in the IRQ handlers call to apic_ack_irq, which
> +	 * allows higher priority system interrupt to get in between.

What? This does not make sense.

_IF_ we go there then

     1) The EOI must be explicit in sysvec_posted_msi_notification()

     2) Interrupt enabling must happen explicit at a dedicated place in
        sysvec_posted_msi_notification()

        You _CANNOT_ run all the device handlers with interrupts
        enabled.

Please remove all traces of non-working wishful thinking from this series.

> +	 */
> +	for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)

Why does this need to check up to vector 255? The vector space does not
magially expand just because of posted interrupts, really. At least not
without major modifications to the vector management.

> +		call_irq_handler(vec, regs);
> +

Stray newline.

> +}

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-11-15 12:56   ` Peter Zijlstra
  2023-11-15 20:04     ` Jacob Pan
@ 2023-12-06 19:50     ` Thomas Gleixner
  2023-12-08  4:46       ` Jacob Pan
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 19:50 UTC (permalink / raw)
  To: Peter Zijlstra, Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, seanjc, Robin Murphy

On Wed, Nov 15 2023 at 13:56, Peter Zijlstra wrote:
>
> Would it not make more sense to write things something like:
>
> bool handle_pending_pir()
> {
> 	bool handled = false;
> 	u64 pir_copy[4];
>
> 	for (i = 0; i < 4; i++) {
> 		if (!pid-pir_l[i]) {
> 			pir_copy[i] = 0;
> 			continue;
> 		}
>
> 		pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
> 		handled |= true;
> 	}
>
> 	if (!handled)
> 		return handled;
>
> 	for_each_set_bit()
> 		....
>
> 	return handled.
> }

I don't understand what the whole copy business is about. It's
absolutely not required.

static bool handle_pending_pir(unsigned long *pir)
{
        unsigned int idx, vec;
	bool handled = false;
        unsigned long pend;
        
        for (idx = 0; offs < 4; idx++) {
                if (!pir[idx])
                	continue;
		pend = arch_xchg(pir + idx, 0);
                for_each_set_bit(vec, &pend, 64)
			call_irq_handler(vec + idx * 64, NULL);
                handled = true;
	}
        return handled;
}

No?

> sysvec_posted_blah_blah()
> {
> 	bool done = false;
> 	bool handled;
>
> 	for (;;) {
> 		handled = handle_pending_pir();
> 		if (done)
> 			break;
> 		if (!handled || ++loops > MAX_LOOPS) {

That does one loop too many. Should be ++loops == MAX_LOOPS. No?

> 			pi_clear_on(pid);
> 			/* once more after clear_on */
> 			done = true;
> 		}
> 	}
> }
>
>
> Hmm?

I think that can be done less convoluted.

{
	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
	struct pt_regs *old_regs = set_irq_regs(regs);
        int loops;

	for (loops = 0;;) {
        	bool handled = handle_pending_pir((unsigned long)pid->pir);

                if (++loops > MAX_LOOPS)
                	break;

                if (!handled || loops == MAX_LOOPS) {
                	pi_clear_on(pid);
                        /* Break the loop after handle_pending_pir()! */
                        loops = MAX_LOOPS;
                }
	}

	...
	set_irq_regs(old_regs);
}

Hmm? :)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline
  2023-11-12  4:16 ` [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline Jacob Pan
@ 2023-12-06 20:09   ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 20:09 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> Though IRTE modification for IRQ affinity change is a atomic operation,
> it does not guarantee the timing of IRQ posting at PID.

No acronyms please.

> considered the following scenario:
> 	Device		system agent		iommu		memory 		CPU/LAPIC
> 1	FEEX_XXXX
> 2			Interrupt request
> 3						Fetch IRTE	->
> 4						->Atomic Swap PID.PIR(vec)
> 						Push to Global Observable(GO)
> 5						if (ON*)
> 	i						done;*
> 						else
> 6							send a notification ->
>
> * ON: outstanding notification, 1 will suppress new notifications
>
> If IRQ affinity change happens between 3 and 5 in IOMMU, old CPU's PIR could
> have pending bit set for the vector being moved. We must check PID.PIR
> to prevent the lost of interrupts.

We must check nothing. We must ensure that the code is correct, right?

> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  arch/x86/kernel/apic/vector.c |  8 +++++++-
>  arch/x86/kernel/irq.c         | 20 +++++++++++++++++---
>  2 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 319448d87b99..14fc33cfdb37 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -19,6 +19,7 @@
>  #include <asm/apic.h>
>  #include <asm/i8259.h>
>  #include <asm/desc.h>
> +#include <asm/posted_intr.h>
>  #include <asm/irq_remapping.h>
>  
>  #include <asm/trace/irq_vectors.h>
> @@ -978,9 +979,14 @@ static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
>  		 * Do not check IRR when called from lapic_offline(), because
>  		 * fixup_irqs() was just called to scan IRR for set bits and
>  		 * forward them to new destination CPUs via IPIs.
> +		 *
> +		 * If the vector to be cleaned is delivered as posted intr,
> +		 * it is possible that the interrupt has been posted but
> +		 * not made to the IRR due to coalesced notifications.

not made to?

> +		 * Therefore, check PIR to see if the interrupt was posted.
>  		 */
>  		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
> -		if (irr & (1U << (vector % 32))) {
> +		if (irr & (1U << (vector % 32)) || is_pi_pending_this_cpu(vector)) {

The comment above this code clearly explains what check_irr is
about. Why would the PIR pending check have different rules? Just
because its PIR, right?

>  
> +/*
> + * Check if a given vector is pending in APIC IRR or PIR if posted interrupt
> + * is enabled for coalesced interrupt delivery (CID).
> + */
> +static inline bool is_vector_pending(unsigned int vector)
> +{
> +	unsigned int irr;
> +
> +	irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
> +	if (irr  & (1 << (vector % 32)))
> +		return true;
> +
> +	return is_pi_pending_this_cpu(vector);
> +}

Why is this outside of the #ifdef region? Just because there was space
to put it, right?

And of course we need the same thing open coded in two locations.

What's wrong with using this inline function in __vector_cleanup() too?

	if (check_irr && vector_is_pending(vector)) {
        	pr_warn_once(...);
                ....
        }

That would make the logic of __vector_cleanup() correct _AND_ share the
code.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2023-11-12  4:16 ` [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
@ 2023-12-06 20:15   ` Thomas Gleixner
  2024-01-26 23:31     ` Jacob Pan
  2023-12-06 20:44   ` Thomas Gleixner
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 20:15 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> With posted MSIs, end of interrupt is handled by the notification
> handler. Each MSI handler does not go through local APIC IRR, ISR
> processing. There's no need to do apic_eoi() in those handlers.
>
> Add a new acpi_ack_irq_no_eoi() for the posted MSI IR chip. At runtime
> the call trace looks like:
>
> __sysvec_posted_msi_notification() {
>   irq_chip_ack_parent() {
>     apic_ack_irq_no_eoi();
>   }

Huch? There is something missing here to make sense.

>   handle_irq_event() {
>     handle_irq_event_percpu() {
>        driver_handler()
>     }
>   }
>
> IO-APIC IR is excluded the from posted MSI, we need to make sure it
> still performs EOI.

We need to make the code correct and write changelogs which make
sense. This sentence makes no sense whatsoever.

What has the IO-APIC to do with posted MSIs?

It's a different interrupt chip hierarchy, no?

> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 00da6cf6b07d..ca398ee9075b 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -1993,7 +1993,7 @@ static struct irq_chip ioapic_ir_chip __read_mostly = {
>  	.irq_startup		= startup_ioapic_irq,
>  	.irq_mask		= mask_ioapic_irq,
>  	.irq_unmask		= unmask_ioapic_irq,
> -	.irq_ack		= irq_chip_ack_parent,
> +	.irq_ack		= apic_ack_irq,

Why?

>  	.irq_eoi		= ioapic_ir_ack_level,
>  	.irq_set_affinity	= ioapic_set_affinity,
>  	.irq_retrigger		= irq_chip_retrigger_hierarchy,
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 14fc33cfdb37..01223ac4f57a 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -911,6 +911,11 @@ void apic_ack_irq(struct irq_data *irqd)
>  	apic_eoi();
>  }
>  
> +void apic_ack_irq_no_eoi(struct irq_data *irqd)
> +{
> +	irq_move_irq(irqd);
> +}
> +

The exact purpose of that function is to invoke irq_move_irq() which is
a completely pointless exercise for interrupts which are remapped.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address
  2023-11-12  4:16 ` [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
@ 2023-12-06 20:19   ` Thomas Gleixner
  2024-01-26 23:30     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 20:19 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> When programming IRTE for posted mode, we need to retrieve the
> physical

we need .... I surely did not write this changelog.

> address of the posted interrupt descriptor (PID) that belongs to it's
> target CPU.
>
> This per CPU PID has already been set up during cpu_init().

This information is useful because?

> +static u64 get_pi_desc_addr(struct irq_data *irqd)
> +{
> +	int cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));

The effective affinity mask is magically correct when this is called?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs
  2023-11-12  4:16 ` [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
@ 2023-12-06 20:26   ` Thomas Gleixner
  2023-12-13 22:00     ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 20:26 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
>  #ifdef CONFIG_X86_POSTED_MSI
>  
>  static u64 get_pi_desc_addr(struct irq_data *irqd)
> @@ -1133,6 +1144,29 @@ static u64 get_pi_desc_addr(struct irq_data *irqd)
>  
>  	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
>  }
> +
> +static void intel_ir_reconfigure_irte_posted(struct irq_data *irqd)
> +{
> +	struct intel_ir_data *ir_data = irqd->chip_data;
> +	struct irte *irte = &ir_data->irte_entry;
> +	struct irte irte_pi;
> +	u64 pid_addr;
> +
> +	pid_addr = get_pi_desc_addr(irqd);
> +
> +	memset(&irte_pi, 0, sizeof(irte_pi));
> +
> +	/* The shared IRTE already be set up as posted during alloc_irte */

-ENOPARSE

> +	dmar_copy_shared_irte(&irte_pi, irte);
> +
> +	irte_pi.pda_l = (pid_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
> +	irte_pi.pda_h = (pid_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);
> +
> +	modify_irte(&ir_data->irq_2_iommu, &irte_pi);
> +}
> +
> +#else
> +static inline void intel_ir_reconfigure_irte_posted(struct irq_data *irqd) {}
>  #endif
>  
>  static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
> @@ -1148,8 +1182,9 @@ static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
>  	irte->vector = cfg->vector;
>  	irte->dest_id = IRTE_DEST(cfg->dest_apicid);
>  
> -	/* Update the hardware only if the interrupt is in remapped mode. */
> -	if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
> +	if (ir_data->irq_2_iommu.posted_msi)
> +		intel_ir_reconfigure_irte_posted(irqd);
> +	else if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
>  		modify_irte(&ir_data->irq_2_iommu, irte);
>  }
>  
> @@ -1203,7 +1238,7 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
>  	struct intel_ir_data *ir_data = data->chip_data;
>  	struct vcpu_data *vcpu_pi_info = info;
>  
> -	/* stop posting interrupts, back to remapping mode */
> +	/* stop posting interrupts, back to the default mode */
>  	if (!vcpu_pi_info) {
>  		modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
>  	} else {
> @@ -1300,10 +1335,14 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
>  {
>  	struct irte *irte = &data->irte_entry;
>  
> -	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
> +	if (data->irq_2_iommu.mode == IRQ_POSTING)
> +		prepare_irte_posted(irte);
> +	else
> +		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
>  
>  	switch (info->type) {
>  	case X86_IRQ_ALLOC_TYPE_IOAPIC:
> +		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);

What? This is just wrong. Above you have:

> +	if (data->irq_2_iommu.mode == IRQ_POSTING)
> +		prepare_irte_posted(irte);
> +	else
> +		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);

Can you spot the fail?

>  		/* Set source-id of interrupt request */
>  		set_ioapic_sid(irte, info->devid);
>  		apic_printk(APIC_VERBOSE, KERN_DEBUG "IOAPIC[%d]: Set IRTE entry (P:%d FPD:%d Dst_Mode:%d Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X Avail:%X Vector:%02X Dest:%08X SID:%04X SQ:%X SVT:%X)\n",
> @@ -1315,10 +1354,18 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
>  		sub_handle = info->ioapic.pin;
>  		break;
>  	case X86_IRQ_ALLOC_TYPE_HPET:
> +		prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
>  		set_hpet_sid(irte, info->devid);
>  		break;
>  	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
>  	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
> +		if (posted_msi_supported()) {
> +			prepare_irte_posted(irte);
> +			data->irq_2_iommu.posted_msi = 1;
> +		} else {
> +			prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
> +		}

Here it gets even more hilarious.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2023-11-12  4:16 ` [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
  2023-12-06 20:15   ` Thomas Gleixner
@ 2023-12-06 20:44   ` Thomas Gleixner
  2023-12-13  3:42     ` Jacob Pan
  1 sibling, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-06 20:44 UTC (permalink / raw)
  To: Jacob Pan, LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen,
	Joerg Roedel, H. Peter Anvin, Borislav Petkov, Ingo Molnar
  Cc: Raj Ashok, Tian, Kevin, maz, peterz, seanjc, Robin Murphy,
	Jacob Pan

On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
>  static void fill_msi_msg(struct msi_msg *msg, u32 index, u32 subhandle)
>  {
>  	memset(msg, 0, sizeof(*msg));
> @@ -1361,7 +1397,7 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
>  
>  		irq_data->hwirq = (index << 16) + i;
>  		irq_data->chip_data = ird;
> -		irq_data->chip = &intel_ir_chip;
> +		irq_data->chip = posted_msi_supported() ? &intel_ir_chip_post_msi : &intel_ir_chip;

This is just wrong because you change the chip to posted for _ALL_
domains unconditionally.

The only domains which want this chip are the PCI/MSI domains. And those
are distinct from the domains which serve IO/APIC, HPET, no?

So you can set that chip only for PCI/MSI and just let IO/APIC, HPET
domains keep the original chip, which spares any modification of the
IO/APIC domain.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-12-06 19:50     ` Thomas Gleixner
@ 2023-12-08  4:46       ` Jacob Pan
  2023-12-08 11:52         ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-12-08  4:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, X86 Kernel, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Wed, Nov 15 2023 at 13:56, Peter Zijlstra wrote:
> >
> > Would it not make more sense to write things something like:
> >
> > bool handle_pending_pir()
> > {
> > 	bool handled = false;
> > 	u64 pir_copy[4];
> >
> > 	for (i = 0; i < 4; i++) {
> > 		if (!pid-pir_l[i]) {
> > 			pir_copy[i] = 0;
> > 			continue;
> > 		}
> >
> > 		pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
> > 		handled |= true;
> > 	}
> >
> > 	if (!handled)
> > 		return handled;
> >
> > 	for_each_set_bit()
> > 		....
> >
> > 	return handled.
> > }  
> 
> I don't understand what the whole copy business is about. It's
> absolutely not required.
> 
> static bool handle_pending_pir(unsigned long *pir)
> {
>         unsigned int idx, vec;
> 	bool handled = false;
>         unsigned long pend;
>         
>         for (idx = 0; offs < 4; idx++) {
>                 if (!pir[idx])
>                 	continue;
> 		pend = arch_xchg(pir + idx, 0);
>                 for_each_set_bit(vec, &pend, 64)
> 			call_irq_handler(vec + idx * 64, NULL);
>                 handled = true;
> 	}
>         return handled;
> }
> 
My thinking is the following:
The PIR cache line is contended by between CPU and IOMMU, where CPU can
access PIR much faster. Nevertheless, when IOMMU does atomic swap of the
PID (PIR included), L1 cache gets evicted. Subsequent CPU read or xchg will
deal with invalid cold cache.

By making a copy of PIR as quickly as possible and clearing PIR with xchg,
we minimized the chance that IOMMU does atomic swap in the middle.
Therefore, having less L1D misses.

In the code above, it does read, xchg, and call_irq_handler() in a loop
to handle the 4 64bit PIR bits at a time. IOMMU has a greater chance to do
atomic xchg on the PIR cache line while doing call_irq_handler(). Therefore,
it causes more L1D misses.

I might be missing something?

I tested the two versions below with my DSA memory fill test and measured
DMA bandwidth and perf cache misses:

#ifdef NO_PIR_COPY
static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	int i, vec;
	bool handled = false;
	unsigned long pending;

	for (i = 0; i < 4; i++) {
		if (!pir[i])
			continue;

		pending = arch_xchg(pir + i, 0);
		for_each_set_bit(vec, &pending, 64)
			call_irq_handler(i * 64 + vec, regs);
		handled = true;
	}

	return handled;
}
#else
static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	int i, vec = FIRST_EXTERNAL_VECTOR;
	bool handled = false;
	unsigned long pir_copy[4];

	for (i = 0; i < 4; i++)
		pir_copy[i] = pir[i];

	for (i = 0; i < 4; i++) {
		if (!pir_copy[i])
			continue;

		pir_copy[i] = arch_xchg(pir, 0);
		handled = true;
	}

	if (handled) {
		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
			call_irq_handler(vec, regs);
	}

	return handled;
}
#endif

DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	struct pi_desc *pid;
	int i = 0;

	pid = this_cpu_ptr(&posted_interrupt_desc);

	inc_irq_stat(posted_msi_notification_count);
	irq_enter();

	while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
		if (!handle_pending_pir(pid->pir64, regs))
			break;
	}

	/*
	 * Clear outstanding notification bit to allow new IRQ notifications,
	 * do this last to maximize the window of interrupt coalescing.
	 */
	pi_clear_on(pid);

	/*
	 * There could be a race of PI notification and the clearing of ON bit,
	 * process PIR bits one last time such that handling the new interrupts
	 * are not delayed until the next IRQ.
	 */
	handle_pending_pir(pid->pir64, regs);

	apic_eoi();
	irq_exit();
	set_irq_regs(old_regs);
}

Without PIR copy:

DMA memfill bandwidth: 4.944 Gbps
Performance counter stats for './run_intr.sh 512 30':                                                             
                                                                                                                   
    77,313,298,506      L1-dcache-loads                                               (79.98%)                     
         8,279,458      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.03%)                   
    41,654,221,245      L1-dcache-stores                                              (80.01%)                     
            10,476      LLC-load-misses           #    0.31% of all LL-cache accesses  (79.99%)                    
         3,332,748      LLC-loads                                                     (80.00%)                     
                                                                                                                   
      30.212055434 seconds time elapsed                                                                            
                                                                                                                   
       0.002149000 seconds user                                                                                    
      30.183292000 seconds sys
                        

With PIR copy:
DMA memfill bandwidth: 5.029 Gbps
Performance counter stats for './run_intr.sh 512 30':

    78,327,247,423      L1-dcache-loads                                               (80.01%)
         7,762,311      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.01%)
    42,203,221,466      L1-dcache-stores                                              (79.99%)
            23,691      LLC-load-misses           #    0.67% of all LL-cache accesses  (80.01%)
         3,561,890      LLC-loads                                                     (80.00%)

      30.201065706 seconds time elapsed

       0.005950000 seconds user
      30.167885000 seconds sys


> No?
> 
> > sysvec_posted_blah_blah()
> > {
> > 	bool done = false;
> > 	bool handled;
> >
> > 	for (;;) {
> > 		handled = handle_pending_pir();
> > 		if (done)
> > 			break;
> > 		if (!handled || ++loops > MAX_LOOPS) {  
> 
> That does one loop too many. Should be ++loops == MAX_LOOPS. No?
> 
> > 			pi_clear_on(pid);
> > 			/* once more after clear_on */
> > 			done = true;
> > 		}
> > 	}
> > }
> >
> >
> > Hmm?  
> 
> I think that can be done less convoluted.
> 
> {
> 	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
> 	struct pt_regs *old_regs = set_irq_regs(regs);
>         int loops;
> 
> 	for (loops = 0;;) {
>         	bool handled = handle_pending_pir((unsigned
> long)pid->pir);
> 
>                 if (++loops > MAX_LOOPS)
>                 	break;
> 
>                 if (!handled || loops == MAX_LOOPS) {
>                 	pi_clear_on(pid);
>                         /* Break the loop after handle_pending_pir()! */
>                         loops = MAX_LOOPS;
>                 }
> 	}
> 
> 	...
> 	set_irq_regs(old_regs);
> }
> 
> Hmm? :)


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-12-06 16:33   ` Thomas Gleixner
@ 2023-12-08  4:54     ` Jacob Pan
  2023-12-08  9:31       ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2023-12-08  4:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 17:33:28 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> > +/* Posted-Interrupt Descriptor */
> > +struct pi_desc {
> > +	u32 pir[8];     /* Posted interrupt requested */
> > +	union {
> > +		struct {
> > +				/* bit 256 - Outstanding Notification
> > */
> > +			u16	on	: 1,
> > +				/* bit 257 - Suppress Notification */
> > +				sn	: 1,
> > +				/* bit 271:258 - Reserved */
> > +				rsvd_1	: 14;
> > +				/* bit 279:272 - Notification Vector */
> > +			u8	nv;
> > +				/* bit 287:280 - Reserved */
> > +			u8	rsvd_2;
> > +				/* bit 319:288 - Notification
> > Destination */
> > +			u32	ndst;  
> 
> This mixture of bitfields and types is weird and really not intuitive:
> 
> /* Posted-Interrupt Descriptor */
> struct pi_desc {
> 	/* Posted interrupt requested */
> 	u32			pir[8];
> 
> 	union {
> 		struct {
> 				/* bit 256 - Outstanding Notification */
> 			u64	on	:  1,
> 				/* bit 257 - Suppress Notification */
> 				sn	:  1,
> 				/* bit 271:258 - Reserved */
> 					: 14,
> 				/* bit 279:272 - Notification Vector */
> 				nv	:  8,
> 				/* bit 287:280 - Reserved */
> 					:  8,
> 				/* bit 319:288 - Notification Destination
> */ ndst	: 32;
> 		};
> 		u64		control;
> 	};
> 	u32			rsvd[6];
> } __aligned(64);
> 
It seems bit-fields cannot pass type check. I got these compile error.

arch/x86/kernel/irq.c: In function ‘intel_posted_msi_init’:
./include/linux/percpu-defs.h:363:20: error: cannot take address of bit-field ‘nv’
  363 |  __verify_pcpu_ptr(&(variable));     \
      |                    ^
./include/linux/percpu-defs.h:219:47: note: in definition of macro ‘__verify_pcpu_ptr’
  219 |  const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
      |                                               ^~~
./include/linux/percpu-defs.h:490:34: note: in expansion of macro ‘__pcpu_size_call’
  490 | #define this_cpu_write(pcp, val) __pcpu_size_call(this_cpu_write_, pcp, val)
      |                                  ^~~~~~~~~~~~~~~~
arch/x86/kernel/irq.c:358:2: note: in expansion of macro ‘this_cpu_write’
  358 |  this_cpu_write(posted_interrupt_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
      |  ^~~~~~~~~~~~~~
./include/linux/percpu-defs.h:364:15: error: ‘sizeof’ applied to a bit-field
  364 |  switch(sizeof(variable)) {     \
> 
> > +static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
> > +{
> > +	return test_and_set_bit(POSTED_INTR_ON,
> > +			(unsigned long *)&pi_desc->control);  
> 
> Please get rid of those line breaks.
will do.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-12-08  4:54     ` Jacob Pan
@ 2023-12-08  9:31       ` Thomas Gleixner
  2023-12-08 23:21         ` Jacob Pan
  2023-12-09  0:28         ` Jacob Pan
  0 siblings, 2 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-08  9:31 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

On Thu, Dec 07 2023 at 20:54, Jacob Pan wrote:
> On Wed, 06 Dec 2023 17:33:28 +0100, Thomas Gleixner <tglx@linutronix.de>
> wrote:
>> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
>> 	u32			rsvd[6];
>> } __aligned(64);
>> 
> It seems bit-fields cannot pass type check. I got these compile error.

And why are you telling me that instead if simply fixing it?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-12-08  4:46       ` Jacob Pan
@ 2023-12-08 11:52         ` Thomas Gleixner
  2023-12-08 20:02           ` Jacob Pan
  2024-01-26 23:32           ` Jacob Pan
  0 siblings, 2 replies; 49+ messages in thread
From: Thomas Gleixner @ 2023-12-08 11:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Peter Zijlstra, LKML, X86 Kernel, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

On Thu, Dec 07 2023 at 20:46, Jacob Pan wrote:
> On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <tglx@linutronix.de>
> wrote:
>> I don't understand what the whole copy business is about. It's
>> absolutely not required.
>
> My thinking is the following:
> The PIR cache line is contended by between CPU and IOMMU, where CPU can
> access PIR much faster. Nevertheless, when IOMMU does atomic swap of the
> PID (PIR included), L1 cache gets evicted. Subsequent CPU read or xchg will
> deal with invalid cold cache.
>
> By making a copy of PIR as quickly as possible and clearing PIR with xchg,
> we minimized the chance that IOMMU does atomic swap in the middle.
> Therefore, having less L1D misses.
>
> In the code above, it does read, xchg, and call_irq_handler() in a loop
> to handle the 4 64bit PIR bits at a time. IOMMU has a greater chance to do
> atomic xchg on the PIR cache line while doing call_irq_handler(). Therefore,
> it causes more L1D misses.

That makes sense and if we go there it wants to be documented.

> Without PIR copy:
>
> DMA memfill bandwidth: 4.944 Gbps
> Performance counter stats for './run_intr.sh 512 30':                                                             
>                                                                                                                    
>     77,313,298,506      L1-dcache-loads                                               (79.98%)                     
>          8,279,458      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.03%)                   
>     41,654,221,245      L1-dcache-stores                                              (80.01%)                     
>             10,476      LLC-load-misses           #    0.31% of all LL-cache accesses  (79.99%)                    
>          3,332,748      LLC-loads                                                     (80.00%)                     
>                                                                                                                    
>       30.212055434 seconds time elapsed                                                                            
>                                                                                                                    
>        0.002149000 seconds user                                                                                    
>       30.183292000 seconds sys
>                         
>
> With PIR copy:
> DMA memfill bandwidth: 5.029 Gbps
> Performance counter stats for './run_intr.sh 512 30':
>
>     78,327,247,423      L1-dcache-loads                                               (80.01%)
>          7,762,311      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.01%)
>     42,203,221,466      L1-dcache-stores                                              (79.99%)
>             23,691      LLC-load-misses           #    0.67% of all LL-cache accesses  (80.01%)
>          3,561,890      LLC-loads                                                     (80.00%)
>
>       30.201065706 seconds time elapsed
>
>        0.005950000 seconds user
>       30.167885000 seconds sys

Interesting, though I'm not really convinced that this DMA memfill
microbenchmark resembles real work loads.

Did you test with something realistic, e.g. storage or networking, too?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-12-08 11:52         ` Thomas Gleixner
@ 2023-12-08 20:02           ` Jacob Pan
  2024-01-26 23:32           ` Jacob Pan
  1 sibling, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-08 20:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, X86 Kernel, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Thomas,

On Fri, 08 Dec 2023 12:52:49 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Thu, Dec 07 2023 at 20:46, Jacob Pan wrote:
> > On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <tglx@linutronix.de>
> > wrote:  
> >> I don't understand what the whole copy business is about. It's
> >> absolutely not required.  
> >
> > My thinking is the following:
> > The PIR cache line is contended by between CPU and IOMMU, where CPU can
> > access PIR much faster. Nevertheless, when IOMMU does atomic swap of the
> > PID (PIR included), L1 cache gets evicted. Subsequent CPU read or xchg
> > will deal with invalid cold cache.
> >
> > By making a copy of PIR as quickly as possible and clearing PIR with
> > xchg, we minimized the chance that IOMMU does atomic swap in the middle.
> > Therefore, having less L1D misses.
> >
> > In the code above, it does read, xchg, and call_irq_handler() in a loop
> > to handle the 4 64bit PIR bits at a time. IOMMU has a greater chance to
> > do atomic xchg on the PIR cache line while doing call_irq_handler().
> > Therefore, it causes more L1D misses.  
> 
> That makes sense and if we go there it wants to be documented.
will do. How about this explanation:
"
Posted interrupt descriptor (PID) fits in a cache line that is frequently
accessed by both CPU and IOMMU.

During posted MSI processing, the CPU needs to do 64-bit read and xchg for
checking and clearing posted interrupt request (PIR), a 256 bit field
within the PID. On the other side, IOMMU do atomic swaps of the entire
PID cache line when posting interrupts. The CPU can access the cache line
much faster than the IOMMU.

The cache line states after each operation are as follows:

CPU		IOMMU			PID Cache line state
-------------------------------------------------------------
read64					exclusive
lock xchg64				modified
		post/atomic swap	invalid
-------------------------------------------------------------
Note that PID cache line is evicted after each IOMMU interrupt posting.

The posted MSI demuxing loop is written to optimize the cache performance
based on the two considerations around the PID cache line:

1. Reduce L1 data cache miss by avoiding contention with IOMMU's interrupt
posting/atomic swap, a copy of PIR is used to dispatch interrupt handlers.

2. Keep the cache line state consistent as much as possible. e.g. when
making a copy and clearing the PIR (assuming non-zero PIR bits are present
in the entire PIR), do:
read, read, read, read, xchg, xchg, xchg, xchg
instead of:
read, xchg, read, xchg, read, xchg, read, xchg
"

> 
> > Without PIR copy:
> >
> > DMA memfill bandwidth: 4.944 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> > 
> >     77,313,298,506      L1-dcache-loads
> >               (79.98%) 8,279,458      L1-dcache-load-misses     #
> > 0.01% of all L1-dcache accesses  (80.03%) 41,654,221,245
> > L1-dcache-stores                                              (80.01%)
> > 10,476      LLC-load-misses           #    0.31% of all LL-cache
> > accesses  (79.99%) 3,332,748      LLC-loads
> >                         (80.00%) 30.212055434 seconds time elapsed
> > 
> >        0.002149000 seconds user
> > 30.183292000 seconds sys
> >                         
> >
> > With PIR copy:
> > DMA memfill bandwidth: 5.029 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> >     78,327,247,423      L1-dcache-loads
> >               (80.01%) 7,762,311      L1-dcache-load-misses     #
> > 0.01% of all L1-dcache accesses  (80.01%) 42,203,221,466
> > L1-dcache-stores                                              (79.99%)
> > 23,691      LLC-load-misses           #    0.67% of all LL-cache
> > accesses  (80.01%) 3,561,890      LLC-loads
> >                         (80.00%)
> >
> >       30.201065706 seconds time elapsed
> >
> >        0.005950000 seconds user
> >       30.167885000 seconds sys  
> 
> Interesting, though I'm not really convinced that this DMA memfill
> microbenchmark resembles real work loads.
> 
It is just a tool to get some quick experiments done, not realistic. Though
I am adding various knobs to make it more useful. e.g. adjustable interrupt
rate, delays in idxd hardirq handler.

> Did you test with something realistic, e.g. storage or networking, too?
> 
Not yet for this particular code, working on testing with FIO on Samsung
Gen5 NVMe disks. I am getting help from the people with the set up.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-12-08  9:31       ` Thomas Gleixner
@ 2023-12-08 23:21         ` Jacob Pan
  2023-12-09  0:28         ` Jacob Pan
  1 sibling, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-08 23:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Fri, 08 Dec 2023 10:31:20 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Thu, Dec 07 2023 at 20:54, Jacob Pan wrote:
> > On Wed, 06 Dec 2023 17:33:28 +0100, Thomas Gleixner <tglx@linutronix.de>
> > wrote:  
> >> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> >> 	u32			rsvd[6];
> >> } __aligned(64);
> >>   
> > It seems bit-fields cannot pass type check. I got these compile error.  
> 
> And why are you telling me that instead if simply fixing it?
My point is that I am not sure this change is worthwhile unless I don't do
the per CPU pointer check.

gcc cannot take bit-field address afaik. So the problem is that with this
bit-field change we don't have individual members anymore to check pointers.

e.g.
./include/linux/percpu-defs.h:363:20: error: cannot take address of
bit-field ‘nv’ 363 |  __verify_pcpu_ptr(&(variable));  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code
  2023-12-08  9:31       ` Thomas Gleixner
  2023-12-08 23:21         ` Jacob Pan
@ 2023-12-09  0:28         ` Jacob Pan
  1 sibling, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-09  0:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Fri, 08 Dec 2023 10:31:20 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Thu, Dec 07 2023 at 20:54, Jacob Pan wrote:
> > On Wed, 06 Dec 2023 17:33:28 +0100, Thomas Gleixner <tglx@linutronix.de>
> > wrote:  
> >> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> >> 	u32			rsvd[6];
> >> } __aligned(64);
> >>   
> > It seems bit-fields cannot pass type check. I got these compile error.  
> 
> And why are you telling me that instead if simply fixing it?
I guess we can fix it like this to use the new bitfields:

 void intel_posted_msi_init(void)
 {
-       this_cpu_write(posted_interrupt_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
-       this_cpu_write(posted_interrupt_desc.ndst, this_cpu_read(x86_cpu_to_apicid));
+       struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
+
+       pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
+       pid->ndst = this_cpu_read(x86_cpu_to_apicid);

It is init time, no IOMMU posting yet. So no need for atomics.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI
  2023-12-06 16:35   ` Thomas Gleixner
@ 2023-12-09 21:24     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-09 21:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 17:35:29 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> > This option will be used to support delivering MSIs as posted
> > interrupts. Interrupt remapping is required.  
> 
> The last sentence does not make sense.
will remove, superfluous statement.

> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  arch/x86/Kconfig | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 66bfabae8814..f16882ddb390 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -463,6 +463,16 @@ config X86_X2APIC
> >  
> >  	  If you don't know what to do here, say N.
> >  
> > +config X86_POSTED_MSI
> > +	bool "Enable MSI and MSI-x delivery by posted interrupts"
> > +	depends on X86_X2APIC && X86_64 && IRQ_REMAP
> > +	help
> > +	  This enables MSIs that are under IRQ remapping to be
> > delivered as posted  
> 
> s/IRQ/interrupt/
OK, will replace this and IRQs below.

> This is text and not Xitter.
> 
> 
> > +	  interrupts to the host kernel. IRQ throughput can
> > potentially be improved
> > +	  by coalescing CPU notifications during high frequency IRQ
> > bursts. +
> > +	  If you don't know what to do here, say N.
> > +
> >  config X86_MPPARSE
> >  	bool "Enable MPS table" if ACPI
> >  	default y  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs
  2023-12-06 16:47   ` Thomas Gleixner
@ 2023-12-09 21:53     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-09 21:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 17:47:07 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> 
> $Subject: x86/vector: Reserve ...
> 
> > Under posted MSIs, all device MSIs are multiplexed into a single CPU  
> 
> Under?
Will change to "When posted MSIs are enabled, "
 
> > notification vector. MSI handlers will be de-multiplexed at run-time by
> > system software without IDT delivery.
> >
> > This vector has a priority class below the rest of the system vectors.  
> 
> Why?
I was thinking system interrupt can preempt device posted MSIs. But if
nested interrupt is not an option, there is no need.

> > Potentially, external vector number space for MSIs can be expanded to
> > the entire 0-256 range.  
> 
> Don't even mention this. It's wishful thinking and has absolutely
> nothing to do with the patch at hand.
will remove.

> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  arch/x86/include/asm/irq_vectors.h | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/irq_vectors.h
> > b/arch/x86/include/asm/irq_vectors.h index 3a19904c2db6..077ca38f5a91
> > 100644 --- a/arch/x86/include/asm/irq_vectors.h
> > +++ b/arch/x86/include/asm/irq_vectors.h
> > @@ -99,9 +99,22 @@
> >  
> >  #define LOCAL_TIMER_VECTOR		0xec
> >  
> > +/*
> > + * Posted interrupt notification vector for all device MSIs delivered
> > to
> > + * the host kernel.
> > + *
> > + * Choose lower priority class bit [7:4] than other system vectors such
> > + * that it can be preempted by the system interrupts.  
> 
> That's future music and I'm not convinced at all that we want to allow
> nested interrupts with all their implications. Stack depth is the least
> of the worries here. There are enough other assumptions about interrupts
> not nesting in Linux.
> 
Then, should we allow limited interrupt priority inversion while processing
posted MSIs?

In the current code, without preemption, we effectively already allow one
low priority to block higher ones.

I don't know the other worries caused by nested interrupts, still
experimenting/studying, but here I am thinking it is just one-deep nesting.
Posted MSI notifications are not allowed to nest, so does other system
interrupts.

> > + * It is also higher than all external vectors but it should not matter
> > + * in that external vectors for posted MSIs are in a different number
> > space.  
> 
> This whole priority muck is pointless. The kernel never used it and will
> never use it.
OK. Perhaps I didn't make it clear, I am just trying to let system
interrupt, such as timer, to preempt posted MSI. Not TPR/PPR etc.

> > + */
> > +#define POSTED_MSI_NOTIFICATION_VECTOR	0xdf  
> 
> So this just wants to go into the regular system vector number space
> until there is a conclusion whether we can and want to allow nested
> interrupts. Premature optimization is just creating more confusion than
> value.
Make sense, for this patchset I didn't include the preemption patch since I
am not sure yet.
I should use the next system vector.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2023-12-06 20:44   ` Thomas Gleixner
@ 2023-12-13  3:42     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-13  3:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 21:44:02 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> >  static void fill_msi_msg(struct msi_msg *msg, u32 index, u32 subhandle)
> >  {
> >  	memset(msg, 0, sizeof(*msg));
> > @@ -1361,7 +1397,7 @@ static int intel_irq_remapping_alloc(struct
> > irq_domain *domain, 
> >  		irq_data->hwirq = (index << 16) + i;
> >  		irq_data->chip_data = ird;
> > -		irq_data->chip = &intel_ir_chip;
> > +		irq_data->chip = posted_msi_supported() ?
> > &intel_ir_chip_post_msi : &intel_ir_chip;  
> 
> This is just wrong because you change the chip to posted for _ALL_
> domains unconditionally.
> 
> The only domains which want this chip are the PCI/MSI domains. And those
> are distinct from the domains which serve IO/APIC, HPET, no?
> 
> So you can set that chip only for PCI/MSI and just let IO/APIC, HPET
> domains keep the original chip, which spares any modification of the
> IO/APIC domain.
> 
> 
make sense.
-               irq_data->chip = posted_msi_supported() ? &intel_ir_chip_post_msi : &intel_ir_chip;
+               if ((info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) && posted_msi_supported())
+                       irq_data->chip = &intel_ir_chip_post_msi;
+               else
+                       irq_data->chip = &intel_ir_chip;

Now in IRQ debugfs, I can see the correct IR chips for IOAPIC IRQs and
MSIs.

e.g


domain:  IO-APIC-8
 hwirq:   0x9
 chip:    IR-IO-APIC
  flags:   0x410
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-9-13
     hwirq:   0x80000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR


domain:  IR-PCI-MSI-0000:3d:00.4-11
 hwirq:   0x0
 chip:    IR-PCI-MSI-0000:3d:00.4
  flags:   0x430
             IRQCHIP_SKIP_SET_WAKE
             IRQCHIP_ONESHOT_SAFE
 parent:
    domain:  INTEL-IR-4-13
     hwirq:   0x0
     chip:    INTEL-IR-POST
      flags:   0x0
     parent:
        domain:  VECTOR


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs
  2023-12-06 20:26   ` Thomas Gleixner
@ 2023-12-13 22:00     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2023-12-13 22:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 21:26:55 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> >  #ifdef CONFIG_X86_POSTED_MSI
> >  
> >  static u64 get_pi_desc_addr(struct irq_data *irqd)
> > @@ -1133,6 +1144,29 @@ static u64 get_pi_desc_addr(struct irq_data
> > *irqd) 
> >  	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
> >  }
> > +
> > +static void intel_ir_reconfigure_irte_posted(struct irq_data *irqd)
> > +{
> > +	struct intel_ir_data *ir_data = irqd->chip_data;
> > +	struct irte *irte = &ir_data->irte_entry;
> > +	struct irte irte_pi;
> > +	u64 pid_addr;
> > +
> > +	pid_addr = get_pi_desc_addr(irqd);
> > +
> > +	memset(&irte_pi, 0, sizeof(irte_pi));
> > +
> > +	/* The shared IRTE already be set up as posted during
> > alloc_irte */  
> 
> -ENOPARSE
Will delete this. What I meant was that the shared IRTE has already been
setup as posted mode instead of remappable mode. So when we make a copy,
there is no need to change the mode.

> > +	dmar_copy_shared_irte(&irte_pi, irte);
> > +
> > +	irte_pi.pda_l = (pid_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL <<
> > PDA_LOW_BIT);
> > +	irte_pi.pda_h = (pid_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);
> > +
> > +	modify_irte(&ir_data->irq_2_iommu, &irte_pi);
> > +}
> > +
> > +#else
> > +static inline void intel_ir_reconfigure_irte_posted(struct irq_data
> > *irqd) {} #endif
> >  
> >  static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool
> > force) @@ -1148,8 +1182,9 @@ static void
> > intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
> > irte->vector = cfg->vector; irte->dest_id = IRTE_DEST(cfg->dest_apicid);
> >  
> > -	/* Update the hardware only if the interrupt is in remapped
> > mode. */
> > -	if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
> > +	if (ir_data->irq_2_iommu.posted_msi)
> > +		intel_ir_reconfigure_irte_posted(irqd);
> > +	else if (force || ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
> >  		modify_irte(&ir_data->irq_2_iommu, irte);
> >  }
> >  
> > @@ -1203,7 +1238,7 @@ static int intel_ir_set_vcpu_affinity(struct
> > irq_data *data, void *info) struct intel_ir_data *ir_data =
> > data->chip_data; struct vcpu_data *vcpu_pi_info = info;
> >  
> > -	/* stop posting interrupts, back to remapping mode */
> > +	/* stop posting interrupts, back to the default mode */
> >  	if (!vcpu_pi_info) {
> >  		modify_irte(&ir_data->irq_2_iommu,
> > &ir_data->irte_entry); } else {
> > @@ -1300,10 +1335,14 @@ static void
> > intel_irq_remapping_prepare_irte(struct intel_ir_data *data, {
> >  	struct irte *irte = &data->irte_entry;
> >  
> > -	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
> > +	if (data->irq_2_iommu.mode == IRQ_POSTING)
> > +		prepare_irte_posted(irte);
> > +	else
> > +		prepare_irte(irte, irq_cfg->vector,
> > irq_cfg->dest_apicid); 
> >  	switch (info->type) {
> >  	case X86_IRQ_ALLOC_TYPE_IOAPIC:
> > +		prepare_irte(irte, irq_cfg->vector,
> > irq_cfg->dest_apicid);  
> 
> What? This is just wrong. Above you have:
> 
> > +	if (data->irq_2_iommu.mode == IRQ_POSTING)
> > +		prepare_irte_posted(irte);
> > +	else
> > +		prepare_irte(irte, irq_cfg->vector,
> > irq_cfg->dest_apicid);  
> 
> Can you spot the fail?
My bad, I forgot to delete this.

It is probably easier just override the IRTE for the posted MSI case.
@@ -1274,6 +1354,11 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data 
*data,
                break;
        case X86_IRQ_ALLOC_TYPE_PCI_MSI:
        case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
+               if (posted_msi_supported()) {
+                       prepare_irte_posted(irte);
+                       data->irq_2_iommu.posted_msi = 1;
+               }
+

> 
> >  		/* Set source-id of interrupt request */
> >  		set_ioapic_sid(irte, info->devid);
> >  		apic_printk(APIC_VERBOSE, KERN_DEBUG "IOAPIC[%d]: Set
> > IRTE entry (P:%d FPD:%d Dst_Mode:%d Redir_hint:%d Trig_Mode:%d
> > Dlvry_Mode:%X Avail:%X Vector:%02X Dest:%08X SID:%04X SQ:%X SVT:%X)\n",
> > @@ -1315,10 +1354,18 @@ static void
> > intel_irq_remapping_prepare_irte(struct intel_ir_data *data, sub_handle
> > = info->ioapic.pin; break; case X86_IRQ_ALLOC_TYPE_HPET:
> > +		prepare_irte(irte, irq_cfg->vector,
> > irq_cfg->dest_apicid); set_hpet_sid(irte, info->devid);
> >  		break;
> >  	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
> >  	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
> > +		if (posted_msi_supported()) {
> > +			prepare_irte_posted(irte);
> > +			data->irq_2_iommu.posted_msi = 1;
> > +		} else {
> > +			prepare_irte(irte, irq_cfg->vector,
> > irq_cfg->dest_apicid);
> > +		}  
> 
> Here it gets even more hilarious.


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address
  2023-12-06 20:19   ` Thomas Gleixner
@ 2024-01-26 23:30     ` Jacob Pan
  2024-02-13  8:21       ` Thomas Gleixner
  0 siblings, 1 reply; 49+ messages in thread
From: Jacob Pan @ 2024-01-26 23:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 21:19:11 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > When programming IRTE for posted mode, we need to retrieve the
> > physical  
> 
> we need .... I surely did not write this changelog.
> 
Will delete this.

> > address of the posted interrupt descriptor (PID) that belongs to it's
> > target CPU.
> >
> > This per CPU PID has already been set up during cpu_init().  
> 
> This information is useful because?
ditto.

> > +static u64 get_pi_desc_addr(struct irq_data *irqd)
> > +{
> > +	int cpu =
> > cpumask_first(irq_data_get_effective_affinity_mask(irqd));  
> 
> The effective affinity mask is magically correct when this is called?
> 
My understanding is that remappable device MSIs have the following
hierarchy,e.g.

parent:                              
    domain:  INTEL-IR-5-13            
     hwirq:   0x20000                 
     chip:    INTEL-IR-POST           
      flags:   0x0                    
     parent:                          
        domain:  VECTOR            
         hwirq:   0x3c             
         chip:    APIC         

When irqs are allocated and activated, parents domain op is always called
first. Effective affinity mask is set up by the parent domain, i.e. VECTOR.
Example call stack for alloc:
	irq_data_update_effective_affinity
	apic_update_irq_cfg
	x86_vector_alloc_irqs
	intel_irq_remapping_alloc
	msi_domain_alloc

x86_vector_activate also changes the effective affinity mask before calling
intel_irq_remapping_activate() where a posted interrupt is configured for
its destination CPU.

At runtime, when IRQ affinity is changed by userspace Intel interrupt
remapping code also calls parent data/chip to update the effective affinity
map before changing IRTE.

intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
		      bool force)
{
	ret = parent->chip->irq_set_affinity(parent, mask, force);

...
}
Here the parent APIC chip does apic_set_affinity() which will set up
effective mask before posted MSI affinity change.

Maybe I missed some cases?

I will also add a check if the effective affinity mask is not set up.

static phys_addr_t get_pi_desc_addr(struct irq_data *irqd)
{
	int cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));

	if (WARN_ON(cpu >= nr_cpu_ids))
		return 0;

	return __pa(per_cpu_ptr(&posted_interrupt_desc, cpu));
}


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID
  2023-12-06 19:02   ` Thomas Gleixner
@ 2024-01-26 23:31     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2024-01-26 23:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 20:02:58 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> 
> That 'Intel PID' in the subject line sucks. What's wrong with writing
> things out?
> 
>        x86/irq: Add accessors for posted interrupt descriptors
> 
will do.

> Hmm?
> 
> > Intel posted interrupt descriptor (PID) stores pending interrupts in its
> > posted interrupt requests (PIR) bitmap.
> >
> > Add helper functions to check individual vector status and the entire
> > bitmap.
> >
> > They are used for interrupt migration and runtime demultiplexing posted
> > MSI vectors.  
> 
> This is all backwards.
> 
>   Posted interrupts are controlled by and pending interrupts are marked in
>   the posted interrupt descriptor. The upcoming support for host side
>   posted interrupts requires accessors to check for pending vectors.
> 
>   Add ....
> 
> >  #ifdef CONFIG_X86_POSTED_MSI
> > +/*
> > + * Not all external vectors are subject to interrupt remapping, e.g.
> > IOMMU's
> > + * own interrupts. Here we do not distinguish them since those vector
> > bits in
> > + * PIR will always be zero.
> > + */
> > +static inline bool is_pi_pending_this_cpu(unsigned int vector)  
> 
> Can you please use a proper name space pi_.....() instead of this
> is_...() muck which is horrible to grep for. It's documented ....
> 
good idea, will do.

> > +{
> > +	struct pi_desc *pid;
> > +
> > +	if (WARN_ON(vector > NR_VECTORS || vector <
> > FIRST_EXTERNAL_VECTOR))
> > +		return false;  
> 
> Haha. So much about your 'can use the full vector space' dreams .... And
> WARN_ON_ONCE() please.
> 
yes, will do. Not enough motivation for the full vector space.

> > +
> > +	pid = this_cpu_ptr(&posted_interrupt_desc);  
> 
> Also this can go into the declaration line.
will do

> 
> > +
> > +	return (pid->pir[vector >> 5] & (1 << (vector % 32)));  
> 
>   __test_bit() perhaps?
> 
> > +}  
> 
> > +static inline bool is_pir_pending(struct pi_desc *pid)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < 4; i++) {
> > +		if (pid->pir_l[i])
> > +			return true;
> > +	}
> > +
> > +	return false;  
> 
> This is required because pi_is_pir_empty() is checking the other way
> round, right?
> 
This function is not needed anymore in the next version. I was thinking
performance is better if we bail out while encountering the first set bit.

> > +}
> > +
> >  extern void intel_posted_msi_init(void);
> >  
> >  #else
> > +static inline bool is_pi_pending_this_cpu(unsigned int vector) {return
> > false; }  
> 
> lacks space before 'return'
> 
will fix.
> > +
> >  static inline void intel_posted_msi_init(void) {};
> >  
> >  #endif /* X86_POSTED_MSI */  


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs
  2023-12-06 20:15   ` Thomas Gleixner
@ 2024-01-26 23:31     ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2024-01-26 23:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Wed, 06 Dec 2023 21:15:24 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Sat, Nov 11 2023 at 20:16, Jacob Pan wrote:
> > With posted MSIs, end of interrupt is handled by the notification
> > handler. Each MSI handler does not go through local APIC IRR, ISR
> > processing. There's no need to do apic_eoi() in those handlers.
> >
> > Add a new acpi_ack_irq_no_eoi() for the posted MSI IR chip. At runtime
> > the call trace looks like:
> >
> > __sysvec_posted_msi_notification() {
> >   irq_chip_ack_parent() {
> >     apic_ack_irq_no_eoi();
> >   }  
> 
> Huch? There is something missing here to make sense.
Good point, I was too focused on eoi. The trace should be like

 * __sysvec_posted_msi_notification()
 *	irq_enter();
 *		handle_edge_irq()
 *			irq_chip_ack_parent()
 *				dummy(); // No EOI
 *			handle_irq_event()
 *				driver_handler()
 *	irq_enter();
 *		handle_edge_irq()
 *			irq_chip_ack_parent()
 *				dummy(); // No EOI
 *			handle_irq_event()
 *				driver_handler()
 *	irq_enter();
 *		handle_edge_irq()
 *			irq_chip_ack_parent()
 *				dummy(); // No EOI
 *			handle_irq_event()
 *				driver_handler()
 *	apic_eoi()
 * irq_exit()

> >   handle_irq_event() {
> >     handle_irq_event_percpu() {
> >        driver_handler()
> >     }
> >   }
> >
> > IO-APIC IR is excluded the from posted MSI, we need to make sure it
> > still performs EOI.  
> 
> We need to make the code correct and write changelogs which make
> sense. This sentence makes no sense whatsoever.
> 
> What has the IO-APIC to do with posted MSIs?
> 
> It's a different interrupt chip hierarchy, no?
Right, I should not modify IOAPIC chip. Just assign posted IR chip to
device MSI/x.

> > diff --git a/arch/x86/kernel/apic/io_apic.c
> > b/arch/x86/kernel/apic/io_apic.c index 00da6cf6b07d..ca398ee9075b 100644
> > --- a/arch/x86/kernel/apic/io_apic.c
> > +++ b/arch/x86/kernel/apic/io_apic.c
> > @@ -1993,7 +1993,7 @@ static struct irq_chip ioapic_ir_chip
> > __read_mostly = { .irq_startup		= startup_ioapic_irq,
> >  	.irq_mask		= mask_ioapic_irq,
> >  	.irq_unmask		= unmask_ioapic_irq,
> > -	.irq_ack		= irq_chip_ack_parent,
> > +	.irq_ack		= apic_ack_irq,  
> 
> Why?
ditto.

> 
> >  	.irq_eoi		= ioapic_ir_ack_level,
> >  	.irq_set_affinity	= ioapic_set_affinity,
> >  	.irq_retrigger		= irq_chip_retrigger_hierarchy,
> > diff --git a/arch/x86/kernel/apic/vector.c
> > b/arch/x86/kernel/apic/vector.c index 14fc33cfdb37..01223ac4f57a 100644
> > --- a/arch/x86/kernel/apic/vector.c
> > +++ b/arch/x86/kernel/apic/vector.c
> > @@ -911,6 +911,11 @@ void apic_ack_irq(struct irq_data *irqd)
> >  	apic_eoi();
> >  }
> >  
> > +void apic_ack_irq_no_eoi(struct irq_data *irqd)
> > +{
> > +	irq_move_irq(irqd);
> > +}
> > +  
> 
> The exact purpose of that function is to invoke irq_move_irq() which is
> a completely pointless exercise for interrupts which are remapped.

OK, I will replace this with a dummy .irq_ack() function.
Device MSIs do not have IRQD_SETAFFINITY_PENDING set.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler
  2023-12-08 11:52         ` Thomas Gleixner
  2023-12-08 20:02           ` Jacob Pan
@ 2024-01-26 23:32           ` Jacob Pan
  1 sibling, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2024-01-26 23:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, LKML, X86 Kernel, iommu, Lu Baolu, kvm,
	Dave Hansen, Joerg Roedel, H. Peter Anvin, Borislav Petkov,
	Ingo Molnar, Raj Ashok, Tian, Kevin, maz, seanjc, Robin Murphy,
	jacob.jun.pan

Hi Thomas,

On Fri, 08 Dec 2023 12:52:49 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> > Without PIR copy:
> >
> > DMA memfill bandwidth: 4.944 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> > 
> >     77,313,298,506      L1-dcache-loads
> >               (79.98%) 8,279,458      L1-dcache-load-misses     #
> > 0.01% of all L1-dcache accesses  (80.03%) 41,654,221,245
> > L1-dcache-stores                                              (80.01%)
> > 10,476      LLC-load-misses           #    0.31% of all LL-cache
> > accesses  (79.99%) 3,332,748      LLC-loads
> >                         (80.00%) 30.212055434 seconds time elapsed
> > 
> >        0.002149000 seconds user
> > 30.183292000 seconds sys
> >                         
> >
> > With PIR copy:
> > DMA memfill bandwidth: 5.029 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> >     78,327,247,423      L1-dcache-loads
> >               (80.01%) 7,762,311      L1-dcache-load-misses     #
> > 0.01% of all L1-dcache accesses  (80.01%) 42,203,221,466
> > L1-dcache-stores                                              (79.99%)
> > 23,691      LLC-load-misses           #    0.67% of all LL-cache
> > accesses  (80.01%) 3,561,890      LLC-loads
> >                         (80.00%)
> >
> >       30.201065706 seconds time elapsed
> >
> >        0.005950000 seconds user
> >       30.167885000 seconds sys  
> 
> Interesting, though I'm not really convinced that this DMA memfill
> microbenchmark resembles real work loads.
> 
> Did you test with something realistic, e.g. storage or networking, too?
I have done the following FIO test on NVME drives and not seeing any
meaningful differences in IOPS between the two implementations.

Here is my setup and results on 4 NVME drives connected to a x16 PCIe slot:

 +-[0000:62]-
 |           +-01.0-[63]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
 |           +-03.0-[64]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
 |           +-05.0-[65]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
 |           \-07.0-[66]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X


libaio, no PIR_COPY
======================================
fio-3.35                                                                                                           
Starting 512 processes                                                                                             
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]                                        
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=31559: Mon Jan  8 21:49:22 2024                          
  read: IOPS=8419k, BW=32.1GiB/s (34.5GB/s)(964GiB/30006msec)                                                      
    slat (nsec): min=1325, max=115807k, avg=42368.34, stdev=1517031.57                                             
    clat (usec): min=2, max=499085, avg=15139.97, stdev=25682.25                                                   
     lat (usec): min=68, max=499089, avg=15182.33, stdev=25709.81                                                  
    clat percentiles (usec):                                                                                       
     |  1.00th=[   734],  5.00th=[   783], 10.00th=[   816], 20.00th=[   857],                                     
     | 30.00th=[   906], 40.00th=[   971], 50.00th=[  1074], 60.00th=[  1369],                                     
     | 70.00th=[ 13042], 80.00th=[ 19792], 90.00th=[ 76022], 95.00th=[ 76022],                                     
     | 99.00th=[ 77071], 99.50th=[ 81265], 99.90th=[ 85459], 99.95th=[ 91751],                                     
     | 99.99th=[200279]                                                                                            
   bw (  MiB/s): min=18109, max=51859, per=100.00%, avg=32965.98, stdev=16.88, samples=14839                       
   iops        : min=4633413, max=13281470, avg=8439278.47, stdev=4324.70, samples=14839                           
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%                                                  
  lat (usec)   : 250=0.01%, 500=0.01%, 750=1.84%, 1000=41.96%                                                      
  lat (msec)   : 2=18.37%, 4=0.20%, 10=3.88%, 20=13.95%, 50=5.42%                                                  
  lat (msec)   : 100=14.33%, 250=0.02%, 500=0.01%                                                                  
  cpu          : usr=1.16%, sys=3.54%, ctx=4932752, majf=0, minf=192764                                            
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%                                     
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%                                    
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%                                    
     issued rwts: total=252616589,0,0,0 short=0,0,0,0 dropped=0,0,0,0                                              
     latency   : target=0, window=0, percentile=100.00%, depth=256                                                 
                                                                                                                   
Run status group 0 (all jobs):                                                                                     
   READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30006-30006msec 
                                                                                                                   
Disk stats (read/write):                                                                                           
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.31%                                                  
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.15%                                                  
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.06%                                                  
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.94%                                                  
                                                                                                                   
 Performance counter stats for 'system wide':                                                                      
                                                                                                                   
    22,985,903,515      L1-dcache-load-misses                                                   (42.86%)           
    22,989,992,126      L1-dcache-load-misses                                                   (57.14%)           
   751,228,710,993      L1-dcache-stores                                                        (57.14%)           
       465,033,820      LLC-load-misses                  #   18.27% of all LL-cache accesses    (57.15%)           
     2,545,570,669      LLC-loads                                                               (57.14%)           
     1,058,582,881      LLC-stores                                                              (28.57%)           
       326,135,823      LLC-store-misses                                                        (28.57%)           
                                                                                                                   
      32.045718194 seconds time elapsed                                                                       
-------------------------------------------
libaio with PIR_COPY
-------------------------------------------
fio-3.35                                                                                                           
Starting 512 processes                                                                                             
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]                                        
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=5103: Mon Jan  8 23:12:12 2024                           
  read: IOPS=8420k, BW=32.1GiB/s (34.5GB/s)(964GiB/30011msec)                                                      
    slat (nsec): min=1339, max=97021k, avg=42447.84, stdev=1442726.09                                              
    clat (usec): min=2, max=369410, avg=14820.01, stdev=24112.59                                                   
     lat (usec): min=69, max=369412, avg=14862.46, stdev=24139.33                                                  
    clat percentiles (usec):                                                                                       
     |  1.00th=[   717],  5.00th=[   783], 10.00th=[   824], 20.00th=[   873],                                     
     | 30.00th=[   930], 40.00th=[  1012], 50.00th=[  1172], 60.00th=[  8094],                                     
     | 70.00th=[ 14222], 80.00th=[ 18744], 90.00th=[ 76022], 95.00th=[ 76022],                                     
     | 99.00th=[ 76022], 99.50th=[ 78119], 99.90th=[ 81265], 99.95th=[ 81265],                                     
     | 99.99th=[135267]                                                                                            
   bw (  MiB/s): min=19552, max=62819, per=100.00%, avg=33774.56, stdev=31.02, samples=14540                       
   iops        : min=5005807, max=16089892, avg=8646500.17, stdev=7944.42, samples=14540                           
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%                                                  
  lat (usec)   : 250=0.01%, 500=0.01%, 750=2.50%, 1000=36.41%                                                      
  lat (msec)   : 2=17.39%, 4=0.27%, 10=5.83%, 20=18.94%, 50=5.59%                                                  
  lat (msec)   : 100=13.06%, 250=0.01%, 500=0.01%                                                                  
  cpu          : usr=1.20%, sys=3.74%, ctx=6758326, majf=0, minf=193128                                            
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%                                     
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%                                    
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%                                    
     issued rwts: total=252677827,0,0,0 short=0,0,0,0 dropped=0,0,0,0                                              
     latency   : target=0, window=0, percentile=100.00%, depth=256                                                 
                                                                                                                   
Run status group 0 (all jobs):                                                                                     
   READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30011-30011msec 
                                                                                                                   
Disk stats (read/write):                                                                                           
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.36%                                                  
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.18%                                                  
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.08%                                                  
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.96%                                                  
                                                                                                                   
 Performance counter stats for 'system wide':                                                                      
                                                                                                                   
    24,762,800,042      L1-dcache-load-misses                                                   (42.86%)           
    24,764,415,765      L1-dcache-load-misses                                                   (57.14%)           
   756,096,467,595      L1-dcache-stores                                                        (57.14%)           
       483,611,270      LLC-load-misses                  #   16.21% of all LL-cache accesses    (57.14%)           
     2,982,610,898      LLC-loads                                                               (57.14%)           
     1,283,077,818      LLC-stores                                                              (28.57%)           
       313,253,711      LLC-store-misses                                                        (28.57%)           
                                                                                                                   
      32.059810215 seconds time elapsed
                                       


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address
  2024-01-26 23:30     ` Jacob Pan
@ 2024-02-13  8:21       ` Thomas Gleixner
  2024-02-13 19:31         ` Jacob Pan
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Gleixner @ 2024-02-13  8:21 UTC (permalink / raw)
  To: Jacob Pan
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

On Fri, Jan 26 2024 at 15:30, Jacob Pan wrote:
> On Wed, 06 Dec 2023 21:19:11 +0100, Thomas Gleixner <tglx@linutronix.de>
> wrote:
>> > +static u64 get_pi_desc_addr(struct irq_data *irqd)
>> > +{
>> > +	int cpu =
>> > cpumask_first(irq_data_get_effective_affinity_mask(irqd));  
>> 
>> The effective affinity mask is magically correct when this is called?
>> 
> My understanding is that remappable device MSIs have the following
> hierarchy,e.g.

SNIP

> Here the parent APIC chip does apic_set_affinity() which will set up
> effective mask before posted MSI affinity change.
>
> Maybe I missed some cases?

The function is only used in intel_ir_reconfigure_irte_posted() in the
next patch, but it's generally available. So I asked that question
because if it's called in some other context then it's going to be not
guaranteed.

That also begs the question why this function exists in the first
place. This really can be part of intel_ir_reconfigure_irte_posted(),
which makes it clear what the context is, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address
  2024-02-13  8:21       ` Thomas Gleixner
@ 2024-02-13 19:31         ` Jacob Pan
  0 siblings, 0 replies; 49+ messages in thread
From: Jacob Pan @ 2024-02-13 19:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 Kernel, iommu, Lu Baolu, kvm, Dave Hansen, Joerg Roedel,
	H. Peter Anvin, Borislav Petkov, Ingo Molnar, Raj Ashok,
	Tian, Kevin, maz, peterz, seanjc, Robin Murphy, jacob.jun.pan

Hi Thomas,

On Tue, 13 Feb 2024 09:21:47 +0100, Thomas Gleixner <tglx@linutronix.de>
wrote:

> > Here the parent APIC chip does apic_set_affinity() which will set up
> > effective mask before posted MSI affinity change.
> >
> > Maybe I missed some cases?  
> 
> The function is only used in intel_ir_reconfigure_irte_posted() in the
> next patch, but it's generally available. So I asked that question
> because if it's called in some other context then it's going to be not
> guaranteed.
> 
> That also begs the question why this function exists in the first
> place. This really can be part of intel_ir_reconfigure_irte_posted(),
> which makes it clear what the context is, no?
Make sense, will fold it in next time.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2024-02-13 19:26 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-12  4:16 [PATCH RFC 00/13] Coalesced Interrupt Delivery with posted MSI Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 01/13] x86: Move posted interrupt descriptor out of vmx code Jacob Pan
2023-12-06 16:33   ` Thomas Gleixner
2023-12-08  4:54     ` Jacob Pan
2023-12-08  9:31       ` Thomas Gleixner
2023-12-08 23:21         ` Jacob Pan
2023-12-09  0:28         ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 02/13] x86: Add a Kconfig option for posted MSI Jacob Pan
2023-12-06 16:35   ` Thomas Gleixner
2023-12-09 21:24     ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 03/13] x86: Reserved a per CPU IDT vector for posted MSIs Jacob Pan
2023-12-06 16:47   ` Thomas Gleixner
2023-12-09 21:53     ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 04/13] iommu/vt-d: Add helper and flag to check/disable posted MSI Jacob Pan
2023-12-06 16:49   ` Thomas Gleixner
2023-11-12  4:16 ` [PATCH RFC 05/13] x86/irq: Set up per host CPU posted interrupt descriptors Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 06/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting Jacob Pan
2023-12-06 16:51   ` Thomas Gleixner
2023-11-12  4:16 ` [PATCH RFC 07/13] x86/irq: Add helpers for checking Intel PID Jacob Pan
2023-12-06 19:02   ` Thomas Gleixner
2024-01-26 23:31     ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 08/13] x86/irq: Factor out calling ISR from common_interrupt Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler Jacob Pan
2023-11-15 12:42   ` Peter Zijlstra
2023-11-15 20:05     ` Jacob Pan
2023-11-15 12:56   ` Peter Zijlstra
2023-11-15 20:04     ` Jacob Pan
2023-11-15 20:25       ` Peter Zijlstra
2023-12-06 19:50     ` Thomas Gleixner
2023-12-08  4:46       ` Jacob Pan
2023-12-08 11:52         ` Thomas Gleixner
2023-12-08 20:02           ` Jacob Pan
2024-01-26 23:32           ` Jacob Pan
2023-12-06 19:14   ` Thomas Gleixner
2023-11-12  4:16 ` [PATCH RFC 10/13] x86/irq: Handle potential lost IRQ during migration and CPU offline Jacob Pan
2023-12-06 20:09   ` Thomas Gleixner
2023-11-12  4:16 ` [PATCH RFC 11/13] iommu/vt-d: Add an irq_chip for posted MSIs Jacob Pan
2023-12-06 20:15   ` Thomas Gleixner
2024-01-26 23:31     ` Jacob Pan
2023-12-06 20:44   ` Thomas Gleixner
2023-12-13  3:42     ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 12/13] iommu/vt-d: Add a helper to retrieve PID address Jacob Pan
2023-12-06 20:19   ` Thomas Gleixner
2024-01-26 23:30     ` Jacob Pan
2024-02-13  8:21       ` Thomas Gleixner
2024-02-13 19:31         ` Jacob Pan
2023-11-12  4:16 ` [PATCH RFC 13/13] iommu/vt-d: Enable posted mode for device MSIs Jacob Pan
2023-12-06 20:26   ` Thomas Gleixner
2023-12-13 22:00     ` Jacob Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox