kvm.vger.kernel.org archive mirror
* [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
@ 2025-04-04 19:38 Sean Christopherson
  2025-04-04 19:38 ` [PATCH 01/67] KVM: SVM: Allocate IR data using atomic allocation Sean Christopherson
                   ` (71 more replies)
  0 siblings, 72 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
       general.  This needs more testing on AMD with device posted IRQs.

This applies on top of the small series that adds an enable_device_posted_irqs
module param (the prep work for that is also prep work for this):

   https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com

Fix a variety of bugs related to device posted IRQs, especially on the
AMD side, and clean up KVM's implementation, which IMO is in the running
for Most Convoluted Code in KVM.

Stating the obvious, this series is comically large.  I'm posting it as a
single series, at least for the first round of reviews, to build the
(mostly) full picture of the end goal (it's not the true end goal; there are
still more cleanups that can be done).  And because properly testing most
of the code would be futile until almost the end of the series (so. many.
bugs.).

Batch #1 (patches 1-10) fixes bugs of varying severity.

Batch #2 is mostly SVM specific:

 - Cleans up various warts and bugs in the IRTE tracking
 - Fixes AVIC to not reject large VMs (honor KVM's ABI)
 - Wires up AVIC to enable_ipiv to support disabling IPI virtualization while
   still utilizing device posted interrupts, and to work around erratum #1235.

Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
of the logic to common x86; only the code that needs to communicate with the
IOMMU is truly vendor specific.

Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.

Batch #5 adds WARNs and drops dead code after all the previous cleanups and
fixes (I don't want to add the WARNs earlier; I don't see any point in adding
WARNs in code that's known to be broken).

Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
event.

This series is well tested except for one notable gap: I was not able to
fully test the AMD IOMMU changes.  Long story short, getting upstream
kernels into our full test environments is practically infeasible.  And
exposing a device or VF on systems that are available to developers is a
bit of a mess.

The device the selftest (see the last patch) uses is an internal test VF
that's hosted on a smart NIC using non-production (test-only) firmware.
Unfortunately, only some of our developer systems have the right NIC, and
for unknown reasons I couldn't get the test firmware to install cleanly on
Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
I could test the KVM flows, but the IOMMU was having none of my attempts
to force enable APIC virtualization against its will.

Through hackery (see the penultimate patch), I was able to gain a decent
amount of confidence in the IOMMU changes (and the interface between KVM
and the IOMMU).

For initial development of the series, I also cobbled together a "mock"
IRQ bypass device, to allow testing in a VM.

  https://github.com/sean-jc/linux.git x86/mock_irqbypass_producer

Note, the diffstat is misleading due to the last two DO NOT MERGE patches
adding 1k+ LoC.  Without those, this series removes ~80 LoC (substantially
more if comments are ignored).

  21 files changed, 577 insertions(+), 655 deletions(-)

Maxim Levitsky (2):
  KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235

Sean Christopherson (65):
  KVM: SVM: Allocate IR data using atomic allocation
  KVM: x86: Reset IRTE to host control if *new* route isn't postable
  KVM: x86: Explicitly treat routing entry type changes as changes
  KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
  iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  iommu/amd: WARN if KVM attempts to set vCPU affinity without posted
    interrupts
  KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
  KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
  KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
    routing
  KVM: SVM: Drop pointless masking of default APIC base when setting
    V_APIC_BAR
  KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
    masks
  KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
    creation
  KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
    pages"
  KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  KVM: VMX: Move enable_ipiv knob to common x86
  KVM: VMX: Suppress PI notifications whenever the vCPU is put
  KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
    IRQ blocking
  iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
  KVM: SVM: Get vCPU info for IRTE using new routing entry
  KVM: SVM: Stop walking list of routing table entries when updating
    IRTE
  KVM: VMX: Stop walking list of routing table entries when updating
    IRTE
  KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  KVM: x86: Nullify irqfd->producer after updating IRTEs
  KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  KVM: x86: Move posted interrupt tracepoint to common code
  KVM: SVM: Clean up return handling in avic_pi_update_irte()
  iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
    structs
  KVM: Don't WARN if updating IRQ bypass route fails
  KVM: Fold kvm_arch_irqfd_route_changed() into
    kvm_arch_update_irqfd_routing()
  KVM: x86: Track irq_bypass_vcpu in common x86 code
  KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
    targeted
  KVM: x86: Don't update IRTE entries when old and new routes were !MSI
  KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
    metadata
  KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
  iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
  iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
    is inhibited
  KVM: SVM: Don't check for assigned device(s) when updating affinity
  KVM: SVM: Don't check for assigned device(s) when activating AVIC
  KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
  KVM: SVM: Process all IRTEs on affinity change even if one update
    fails
  KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
  KVM: x86: Drop superfluous "has assigned device" check in
    kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
    local APIC
  KVM: SVM: WARN if ir_list is non-empty at vCPU free
  KVM: x86: Decouple device assignment from IRQ bypass
  KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
    bypass
  KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
  iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
    support
  KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
  KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
  iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  KVM: SVM: Generate GA log IRQs only if the associated vCPU is
    blocking
  *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
  *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test

 arch/x86/include/asm/irq_remapping.h          |  17 +-
 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  20 +-
 arch/x86/include/asm/svm.h                    |  13 +-
 arch/x86/kvm/svm/avic.c                       | 707 ++++++++----------
 arch/x86/kvm/svm/svm.c                        |   6 +
 arch/x86/kvm/svm/svm.h                        |  24 +-
 arch/x86/kvm/trace.h                          |  19 +-
 arch/x86/kvm/vmx/capabilities.h               |   1 -
 arch/x86/kvm/vmx/main.c                       |   2 +-
 arch/x86/kvm/vmx/posted_intr.c                | 150 ++--
 arch/x86/kvm/vmx/posted_intr.h                |  11 +-
 arch/x86/kvm/vmx/vmx.c                        |   2 -
 arch/x86/kvm/x86.c                            | 124 ++-
 drivers/iommu/amd/amd_iommu_types.h           |   1 -
 drivers/iommu/amd/init.c                      |   8 +-
 drivers/iommu/amd/iommu.c                     | 171 +++--
 drivers/iommu/intel/irq_remapping.c           |  10 +-
 include/linux/amd-iommu.h                     |  25 +-
 include/linux/kvm_host.h                      |   9 +-
 include/linux/kvm_irqfd.h                     |   4 +
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../selftests/kvm/include/vfio_pci_util.h     | 149 ++++
 .../selftests/kvm/include/x86/processor.h     |  21 +
 .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
 tools/testing/selftests/kvm/mercury_device.h  | 118 +++
 tools/testing/selftests/kvm/vfio_irq_test.c   | 429 +++++++++++
 virt/kvm/eventfd.c                            |  22 +-
 28 files changed, 1610 insertions(+), 658 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
 create mode 100644 tools/testing/selftests/kvm/mercury_device.h
 create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c


base-commit: 5f9f498ea14ffe15390aa46fb85375e7c901bce3
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply	[flat|nested] 128+ messages in thread

* [PATCH 01/67] KVM: SVM: Allocate IR data using atomic allocation
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable Sean Christopherson
                   ` (70 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Allocate SVM's interrupt remapping metadata using GFP_ATOMIC, as
svm_ir_list_add() is called with IRQs disabled and irqfds.lock held
when kvm_irq_routing_update() reacts to GSI routing changes.

Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 901d8d2dc169..a961e6e67050 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -820,7 +820,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 	 * Allocating new amd_iommu_pi_data, which will get
 	 * add to the per-vcpu ir_list.
 	 */
-	ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_KERNEL_ACCOUNT);
+	ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_ATOMIC | __GFP_ACCOUNT);
 	if (!ir) {
 		ret = -ENOMEM;
 		goto out;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  2025-04-04 19:38 ` [PATCH 01/67] KVM: SVM: Allocate IR data using atomic allocation Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-11  8:08   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 03/67] KVM: x86: Explicitly treat routing entry type changes as changes Sean Christopherson
                   ` (69 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Restore an IRTE back to host control (remapped or posted MSI mode) if the
*new* GSI route prevents posting the IRQ directly to a vCPU, regardless of
the GSI routing type.  Updating the IRTE if and only if the new GSI is an
MSI results in KVM leaving the IRTE configured to post IRQs to a vCPU.

The dangling IRTE can result in interrupts being incorrectly delivered to
the guest, and in the worst case scenario can result in use-after-free,
e.g. if the VM is torn down, but the underlying host IRQ isn't freed.

Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts")
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c        | 61 ++++++++++++++++++----------------
 arch/x86/kvm/vmx/posted_intr.c | 28 ++++++----------
 2 files changed, 43 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a961e6e67050..ef08356fdb1c 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -896,6 +896,7 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
+	bool enable_remapped_mode = true;
 	int idx, ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -932,6 +933,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		    kvm_vcpu_apicv_active(&svm->vcpu)) {
 			struct amd_iommu_pi_data pi;
 
+			enable_remapped_mode = false;
+
 			/* Try to enable guest_mode in IRTE */
 			pi.base = __sme_set(page_to_phys(svm->avic_backing_page) &
 					    AVIC_HPA_MASK);
@@ -950,33 +953,6 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 			 */
 			if (!ret && pi.is_guest_mode)
 				svm_ir_list_add(svm, &pi);
-		} else {
-			/* Use legacy mode in IRTE */
-			struct amd_iommu_pi_data pi;
-
-			/**
-			 * Here, pi is used to:
-			 * - Tell IOMMU to use legacy mode for this interrupt.
-			 * - Retrieve ga_tag of prior interrupt remapping data.
-			 */
-			pi.prev_ga_tag = 0;
-			pi.is_guest_mode = false;
-			ret = irq_set_vcpu_affinity(host_irq, &pi);
-
-			/**
-			 * Check if the posted interrupt was previously
-			 * setup with the guest_mode by checking if the ga_tag
-			 * was cached. If so, we need to clean up the per-vcpu
-			 * ir_list.
-			 */
-			if (!ret && pi.prev_ga_tag) {
-				int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
-				struct kvm_vcpu *vcpu;
-
-				vcpu = kvm_get_vcpu_by_id(kvm, id);
-				if (vcpu)
-					svm_ir_list_del(to_svm(vcpu), &pi);
-			}
 		}
 
 		if (!ret && svm) {
@@ -991,7 +967,36 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		}
 	}
 
-	ret = 0;
+	if (enable_remapped_mode) {
+		/* Use legacy mode in IRTE */
+		struct amd_iommu_pi_data pi;
+
+		/**
+		 * Here, pi is used to:
+		 * - Tell IOMMU to use legacy mode for this interrupt.
+		 * - Retrieve ga_tag of prior interrupt remapping data.
+		 */
+		pi.prev_ga_tag = 0;
+		pi.is_guest_mode = false;
+		ret = irq_set_vcpu_affinity(host_irq, &pi);
+
+		/**
+		 * Check if the posted interrupt was previously
+		 * setup with the guest_mode by checking if the ga_tag
+		 * was cached. If so, we need to clean up the per-vcpu
+		 * ir_list.
+		 */
+		if (!ret && pi.prev_ga_tag) {
+			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
+			struct kvm_vcpu *vcpu;
+
+			vcpu = kvm_get_vcpu_by_id(kvm, id);
+			if (vcpu)
+				svm_ir_list_del(to_svm(vcpu), &pi);
+		}
+	} else {
+		ret = 0;
+	}
 out:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return ret;
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 16121d29dfd9..78ba3d638fe8 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -273,6 +273,7 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
+	bool enable_remapped_mode = true;
 	struct kvm_lapic_irq irq;
 	struct kvm_vcpu *vcpu;
 	struct vcpu_data vcpu_info;
@@ -311,21 +312,8 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 
 		kvm_set_msi_irq(kvm, e, &irq);
 		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
-		    !kvm_irq_is_postable(&irq)) {
-			/*
-			 * Make sure the IRTE is in remapped mode if
-			 * we don't handle it in posted mode.
-			 */
-			ret = irq_set_vcpu_affinity(host_irq, NULL);
-			if (ret < 0) {
-				printk(KERN_INFO
-				   "failed to back to remapped mode, irq: %u\n",
-				   host_irq);
-				goto out;
-			}
-
+		    !kvm_irq_is_postable(&irq))
 			continue;
-		}
 
 		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
 		vcpu_info.vector = irq.vector;
@@ -333,11 +321,12 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
 				vcpu_info.vector, vcpu_info.pi_desc_addr, set);
 
-		if (set)
-			ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
-		else
-			ret = irq_set_vcpu_affinity(host_irq, NULL);
+		if (!set)
+			continue;
 
+		enable_remapped_mode = false;
+
+		ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
 		if (ret < 0) {
 			printk(KERN_INFO "%s: failed to update PI IRTE\n",
 					__func__);
@@ -345,6 +334,9 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		}
 	}
 
+	if (enable_remapped_mode)
+		ret = irq_set_vcpu_affinity(host_irq, NULL);
+
 	ret = 0;
 out:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 03/67] KVM: x86: Explicitly treat routing entry type changes as changes
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  2025-04-04 19:38 ` [PATCH 01/67] KVM: SVM: Allocate IR data using atomic allocation Sean Christopherson
  2025-04-04 19:38 ` [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 04/67] KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer Sean Christopherson
                   ` (68 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-is.

Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9211344b20ae..f94f1217a087 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13615,7 +13615,8 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
 				  struct kvm_kernel_irq_routing_entry *new)
 {
-	if (new->type != KVM_IRQ_ROUTING_MSI)
+	if (old->type != KVM_IRQ_ROUTING_MSI ||
+	    new->type != KVM_IRQ_ROUTING_MSI)
 		return true;
 
 	return !!memcmp(&old->msi, &new->msi, sizeof(new->msi));
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 04/67] KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 03/67] KVM: x86: Explicitly treat routing entry type changes as changes Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE Sean Christopherson
                   ` (67 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is running.
The only lock held when a producer is added/removed is irqbypass's mutex.

Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f94f1217a087..dcc173852dc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13569,15 +13569,22 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
 	int ret;
 
-	irqfd->producer = prod;
 	kvm_arch_start_assignment(irqfd->kvm);
+
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = prod;
+
 	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
 					   prod->irq, irqfd->gsi, 1);
 	if (ret)
 		kvm_arch_end_assignment(irqfd->kvm);
 
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+
 	return ret;
 }
 
@@ -13587,9 +13594,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	int ret;
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
 
 	WARN_ON(irqfd->producer != prod);
-	irqfd->producer = NULL;
 
 	/*
 	 * When producer of consumer is unregistered, we change back to
@@ -13597,12 +13604,18 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	 * when the irq is masked/disabled or the consumer side (KVM
 	 * int this case doesn't want to receive the interrupts.
 	*/
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = NULL;
+
 	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
 					   prod->irq, irqfd->gsi, 0);
 	if (ret)
 		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
 		       " fails: %d\n", irqfd->consumer.token, ret);
 
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 04/67] KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-11  8:34   ` Sairaj Kodilkar
  2025-04-18 12:25   ` Vasant Hegde
  2025-04-04 19:38 ` [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts Sean Christopherson
                   ` (66 subsequent siblings)
  71 siblings, 2 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
invoked without use_vapic; lying to KVM about whether or not the IRTE was
configured to post IRQs is all kinds of bad.

Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index cd5116d8c3b2..b3a01b7757ee 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3850,7 +3850,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	 * we should not modify the IRTE
 	 */
 	if (!dev_data || !dev_data->use_vapic)
-		return 0;
+		return -EINVAL;
 
 	ir_data->cfg = irqd_cfg(data);
 	pi_data->ir_data = ir_data;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-11  8:28   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 07/67] KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added Sean Christopherson
                   ` (65 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
enabled, as KVM shouldn't try to enable posting when they're unsupported,
and the IOMMU driver darn well should only advertise posting support when
AMD_IOMMU_GUEST_IR_VAPIC() is true.

Note, KVM consumes is_guest_mode only on success.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b3a01b7757ee..4f69a37cf143 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	if (!dev_data || !dev_data->use_vapic)
 		return -EINVAL;
 
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
 	ir_data->cfg = irqd_cfg(data);
 	pi_data->ir_data = ir_data;
 
-	/* Note:
-	 * SVM tries to set up for VAPIC mode, but we are in
-	 * legacy mode. So, we force legacy mode instead.
-	 */
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)) {
-		pr_debug("%s: Fall back to using intr legacy remap\n",
-			 __func__);
-		pi_data->is_guest_mode = false;
-	}
-
 	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
 	if (pi_data->is_guest_mode) {
 		ir_data->ga_root_ptr = (pi_data->base >> 12);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 07/67] KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (5 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
                   ` (64 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that the AMD IOMMU doesn't signal success incorrectly, WARN if KVM
attempts to track an AMD IRTE entry without metadata.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ef08356fdb1c..1708ea55125a 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -796,12 +796,15 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 	struct amd_svm_iommu_ir *ir;
 	u64 entry;
 
+	if (WARN_ON_ONCE(!pi->ir_data))
+		return -EINVAL;
+
 	/**
 	 * In some cases, the existing irte is updated and re-set,
 	 * so we need to check here if it's already been * added
 	 * to the ir_list.
 	 */
-	if (pi->ir_data && (pi->prev_ga_tag != 0)) {
+	if (pi->prev_ga_tag) {
 		struct kvm *kvm = svm->vcpu.kvm;
 		u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag);
 		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (6 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 07/67] KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-11 10:57   ` Arun Kodilkar, Sairaj
  2025-04-04 19:38 ` [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
                   ` (63 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++--
 arch/x86/kvm/svm/avic.c         | 18 +++++++----------
 arch/x86/kvm/svm/svm.h          |  5 +++--
 arch/x86/kvm/vmx/posted_intr.c  | 19 ++++++++---------
 arch/x86/kvm/vmx/posted_intr.h  |  8 ++++++--
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++---------------
 include/linux/kvm_host.h        |  7 +++++--
 virt/kvm/eventfd.c              | 11 +++++-----
 8 files changed, 58 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6e8be274c089..54f3cf73329b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -294,6 +294,7 @@ enum x86_intercept_stage;
  */
 #define KVM_APIC_PV_EOI_PENDING	1
 
+struct kvm_kernel_irqfd;
 struct kvm_kernel_irq_routing_entry;
 
 /*
@@ -1828,8 +1829,9 @@ struct kvm_x86_ops {
 	void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
 	void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
 
-	int (*pi_update_irte)(struct kvm *kvm, unsigned int host_irq,
-			      uint32_t guest_irq, bool set);
+	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			      unsigned int host_irq, uint32_t guest_irq,
+			      struct kvm_kernel_irq_routing_entry *new);
 	void (*pi_start_assignment)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1708ea55125a..04dfd898ea8d 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -18,6 +18,7 @@
 #include <linux/hashtable.h>
 #include <linux/amd-iommu.h>
 #include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
 
 #include <asm/irq_remapping.h>
 
@@ -885,21 +886,14 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 	return 0;
 }
 
-/*
- * avic_pi_update_irte - set IRTE for Posted-Interrupts
- *
- * @kvm: kvm
- * @host_irq: host irq of the interrupt
- * @guest_irq: gsi of the interrupt
- * @set: set or unset PI
- * returns 0 on success, < 0 on failure
- */
-int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-			uint32_t guest_irq, bool set)
+int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			unsigned int host_irq, uint32_t guest_irq,
+			struct kvm_kernel_irq_routing_entry *new)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
 	bool enable_remapped_mode = true;
+	bool set = !!new;
 	int idx, ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -925,6 +919,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		if (e->type != KVM_IRQ_ROUTING_MSI)
 			continue;
 
+		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index d4490eaed55d..294d5594c724 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -731,8 +731,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void avic_vcpu_put(struct kvm_vcpu *vcpu);
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
-int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-			uint32_t guest_irq, bool set);
+int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			unsigned int host_irq, uint32_t guest_irq,
+			struct kvm_kernel_irq_routing_entry *new);
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 78ba3d638fe8..1b6b655a2b8a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -2,6 +2,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
 
 #include <asm/irq_remapping.h>
 #include <asm/cpu.h>
@@ -259,17 +260,9 @@ void vmx_pi_start_assignment(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
 }
 
-/*
- * vmx_pi_update_irte - set IRTE for Posted-Interrupts
- *
- * @kvm: kvm
- * @host_irq: host irq of the interrupt
- * @guest_irq: gsi of the interrupt
- * @set: set or unset PI
- * returns 0 on success, < 0 on failure
- */
-int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-		       uint32_t guest_irq, bool set)
+int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+		       unsigned int host_irq, uint32_t guest_irq,
+		       struct kvm_kernel_irq_routing_entry *new)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
@@ -277,6 +270,7 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 	struct kvm_lapic_irq irq;
 	struct kvm_vcpu *vcpu;
 	struct vcpu_data vcpu_info;
+	bool set = !!new;
 	int idx, ret = 0;
 
 	if (!vmx_can_use_vtd_pi(kvm))
@@ -294,6 +288,9 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
 		if (e->type != KVM_IRQ_ROUTING_MSI)
 			continue;
+
+		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+
 		/*
 		 * VT-d PI cannot support posting multicast/broadcast
 		 * interrupts to a vCPU, we still use interrupt remapping
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index ad9116a99bcc..a586d6aaf862 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -3,6 +3,9 @@
 #define __KVM_X86_VMX_POSTED_INTR_H
 
 #include <linux/bitmap.h>
+#include <linux/find.h>
+#include <linux/kvm_host.h>
+
 #include <asm/posted_intr.h>
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
@@ -10,8 +13,9 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
 void pi_wakeup_handler(void);
 void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
-int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-		       uint32_t guest_irq, bool set);
+int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+		       unsigned int host_irq, uint32_t guest_irq,
+		       struct kvm_kernel_irq_routing_entry *new);
 void vmx_pi_start_assignment(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcc173852dc5..23376fcd928c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13570,31 +13570,31 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
 	struct kvm *kvm = irqfd->kvm;
-	int ret;
+	int ret = 0;
 
 	kvm_arch_start_assignment(irqfd->kvm);
 
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = prod;
 
-	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
-					   prod->irq, irqfd->gsi, 1);
-	if (ret)
-		kvm_arch_end_assignment(irqfd->kvm);
-
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, &irqfd->irq_entry);
+		if (ret)
+			kvm_arch_end_assignment(irqfd->kvm);
+	}
 	spin_unlock_irq(&kvm->irqfds.lock);
 
-
 	return ret;
 }
 
 void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 				      struct irq_bypass_producer *prod)
 {
-	int ret;
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
 	struct kvm *kvm = irqfd->kvm;
+	int ret;
 
 	WARN_ON(irqfd->producer != prod);
 
@@ -13607,11 +13607,13 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = NULL;
 
-	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
-					   prod->irq, irqfd->gsi, 0);
-	if (ret)
-		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
-		       " fails: %d\n", irqfd->consumer.token, ret);
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, NULL);
+		if (ret)
+			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
+				irqfd->consumer.token, ret);
+	}
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 
@@ -13619,10 +13621,12 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
-				   uint32_t guest_irq, bool set)
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_x86_call(pi_update_irte)(kvm, host_irq, guest_irq, set);
+	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
+					    irqfd->gsi, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5438a1b446a6..2d9f3aeb766a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2383,6 +2383,8 @@ struct kvm_vcpu *kvm_get_running_vcpu(void);
 struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+struct kvm_kernel_irqfd;
+
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
@@ -2390,8 +2392,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
 void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
-int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
-				  uint32_t guest_irq, bool set);
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new);
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
 				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 249ba5b72e9b..ad71e3e4d1c3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -285,9 +285,9 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
 {
 }
 
-int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
-				struct kvm *kvm, unsigned int host_irq,
-				uint32_t guest_irq, bool set)
+int __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+					 struct kvm_kernel_irq_routing_entry *old,
+					 struct kvm_kernel_irq_routing_entry *new)
 {
 	return 0;
 }
@@ -619,9 +619,8 @@ void kvm_irq_routing_update(struct kvm *kvm)
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 		if (irqfd->producer &&
 		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
-			int ret = kvm_arch_update_irqfd_routing(
-					irqfd->kvm, irqfd->producer->irq,
-					irqfd->gsi, 1);
+			int ret = kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
+
 			WARN_ON(ret);
 		}
 #endif
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (7 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-11  7:47   ` Arun Kodilkar, Sairaj
  2025-04-04 19:38 ` [PATCH 10/67] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
                   ` (62 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure.  In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 49 ++++++++++++++++-----------------------
 include/linux/kvm_irqfd.h |  3 +++
 2 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 04dfd898ea8d..967618ba743a 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -774,27 +774,30 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	return ret;
 }
 
-static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
+static void svm_ir_list_del(struct vcpu_svm *svm,
+			    struct kvm_kernel_irqfd *irqfd,
+			    struct amd_iommu_pi_data *pi)
 {
 	unsigned long flags;
-	struct amd_svm_iommu_ir *cur;
+	struct kvm_kernel_irqfd *cur;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
-	list_for_each_entry(cur, &svm->ir_list, node) {
-		if (cur->data != pi->ir_data)
+	list_for_each_entry(cur, &svm->ir_list, vcpu_list) {
+		if (cur->irq_bypass_data != pi->ir_data)
 			continue;
-		list_del(&cur->node);
-		kfree(cur);
+		if (WARN_ON_ONCE(cur != irqfd))
+			continue;
+		list_del(&irqfd->vcpu_list);
 		break;
 	}
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
-static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
+static int svm_ir_list_add(struct vcpu_svm *svm,
+			   struct kvm_kernel_irqfd *irqfd,
+			   struct amd_iommu_pi_data *pi)
 {
-	int ret = 0;
 	unsigned long flags;
-	struct amd_svm_iommu_ir *ir;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
@@ -811,25 +814,14 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
 		struct vcpu_svm *prev_svm;
 
-		if (!prev_vcpu) {
-			ret = -EINVAL;
-			goto out;
-		}
+		if (!prev_vcpu)
+			return -EINVAL;
 
 		prev_svm = to_svm(prev_vcpu);
-		svm_ir_list_del(prev_svm, pi);
+		svm_ir_list_del(prev_svm, irqfd, pi);
 	}
 
-	/**
-	 * Allocating new amd_iommu_pi_data, which will get
-	 * add to the per-vcpu ir_list.
-	 */
-	ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_ATOMIC | __GFP_ACCOUNT);
-	if (!ir) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	ir->data = pi->ir_data;
+	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
@@ -844,10 +836,9 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
 
-	list_add(&ir->node, &svm->ir_list);
+	list_add(&irqfd->vcpu_list, &svm->ir_list);
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-out:
-	return ret;
+	return 0;
 }
 
 /*
@@ -951,7 +942,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			 * scheduling information in IOMMU irte.
 			 */
 			if (!ret && pi.is_guest_mode)
-				svm_ir_list_add(svm, &pi);
+				svm_ir_list_add(svm, irqfd, &pi);
 		}
 
 		if (!ret && svm) {
@@ -991,7 +982,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 			vcpu = kvm_get_vcpu_by_id(kvm, id);
 			if (vcpu)
-				svm_ir_list_del(to_svm(vcpu), &pi);
+				svm_ir_list_del(to_svm(vcpu), irqfd, &pi);
 		}
 	} else {
 		ret = 0;
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 8ad43692e3bb..6510a48e62aa 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -59,6 +59,9 @@ struct kvm_kernel_irqfd {
 	struct work_struct shutdown;
 	struct irq_bypass_consumer consumer;
 	struct irq_bypass_producer *producer;
+
+	struct list_head vcpu_list;
+	void *irq_bypass_data;
 };
 
 #endif /* __LINUX_KVM_IRQFD_H */
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 10/67] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (8 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
                   ` (61 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Delete the previous per-vCPU IRTE link prior to modifying the IRTE.  If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.

This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 61 +++++++++------------------------------
 include/linux/kvm_irqfd.h |  1 +
 2 files changed, 15 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 967618ba743a..02b6f0007436 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -81,6 +81,7 @@ bool x2avic_enabled;
 struct amd_svm_iommu_ir {
 	struct list_head node;	/* Used by SVM for per-vcpu ir_list */
 	void *data;		/* Storing pointer to struct amd_ir_data */
+	struct vcpu_svm *svm;
 };
 
 static void avic_activate_vmcb(struct vcpu_svm *svm)
@@ -774,23 +775,19 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	return ret;
 }
 
-static void svm_ir_list_del(struct vcpu_svm *svm,
-			    struct kvm_kernel_irqfd *irqfd,
-			    struct amd_iommu_pi_data *pi)
+static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 {
+	struct kvm_vcpu *vcpu = irqfd->irq_bypass_vcpu;
 	unsigned long flags;
-	struct kvm_kernel_irqfd *cur;
 
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-	list_for_each_entry(cur, &svm->ir_list, vcpu_list) {
-		if (cur->irq_bypass_data != pi->ir_data)
-			continue;
-		if (WARN_ON_ONCE(cur != irqfd))
-			continue;
-		list_del(&irqfd->vcpu_list);
-		break;
-	}
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+	if (!vcpu)
+		return;
+
+	spin_lock_irqsave(&to_svm(vcpu)->ir_list_lock, flags);
+	list_del(&irqfd->vcpu_list);
+	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
+
+	irqfd->irq_bypass_vcpu = NULL;
 }
 
 static int svm_ir_list_add(struct vcpu_svm *svm,
@@ -803,24 +800,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	/**
-	 * In some cases, the existing irte is updated and re-set,
-	 * so we need to check here if it's already been * added
-	 * to the ir_list.
-	 */
-	if (pi->prev_ga_tag) {
-		struct kvm *kvm = svm->vcpu.kvm;
-		u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag);
-		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
-		struct vcpu_svm *prev_svm;
-
-		if (!prev_vcpu)
-			return -EINVAL;
-
-		prev_svm = to_svm(prev_vcpu);
-		svm_ir_list_del(prev_svm, irqfd, pi);
-	}
-
+	irqfd->irq_bypass_vcpu = &svm->vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -912,6 +892,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
 
+		svm_ir_list_del(irqfd);
+
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
@@ -969,21 +951,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		pi.prev_ga_tag = 0;
 		pi.is_guest_mode = false;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
-
-		/**
-		 * Check if the posted interrupt was previously
-		 * setup with the guest_mode by checking if the ga_tag
-		 * was cached. If so, we need to clean up the per-vcpu
-		 * ir_list.
-		 */
-		if (!ret && pi.prev_ga_tag) {
-			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
-			struct kvm_vcpu *vcpu;
-
-			vcpu = kvm_get_vcpu_by_id(kvm, id);
-			if (vcpu)
-				svm_ir_list_del(to_svm(vcpu), irqfd, &pi);
-		}
 	} else {
 		ret = 0;
 	}
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 6510a48e62aa..361c07f4466d 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -60,6 +60,7 @@ struct kvm_kernel_irqfd {
 	struct irq_bypass_consumer consumer;
 	struct irq_bypass_producer *producer;
 
+	struct kvm_vcpu *irq_bypass_vcpu;
 	struct list_head vcpu_list;
 	void *irq_bypass_data;
 };
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (9 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 10/67] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-15 11:06   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 12/67] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
                   ` (60 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Delete the IRTE link from the previous vCPU irrespective of the new
routing state.  This is a glorified nop (only the ordering changes), as
both the "posting" and "remapped" mode paths pre-delete the link.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 02b6f0007436..e9ded2488a0b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -870,6 +870,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
 
+	/*
+	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
+	 * from the *previous* vCPU's list.
+	 */
+	svm_ir_list_del(irqfd);
+
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
 		 __func__, host_irq, guest_irq, set);
 
@@ -892,8 +898,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
 
-		svm_ir_list_del(irqfd);
-
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 12/67] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (10 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 13/67] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
                   ` (59 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Drop VMCB_AVIC_APIC_BAR_MASK; it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. it is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h | 2 --
 arch/x86/kvm/svm/avic.c    | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 9b7fa99ae951..9d3f17732ab4 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -254,8 +254,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 #define AVIC_DOORBELL_PHYSICAL_ID_MASK			GENMASK_ULL(11, 0)
 
-#define VMCB_AVIC_APIC_BAR_MASK				0xFFFFFFFFFF000ULL
-
 #define AVIC_UNACCEL_ACCESS_WRITE_MASK		1
 #define AVIC_UNACCEL_ACCESS_OFFSET_MASK		0xFF0
 #define AVIC_UNACCEL_ACCESS_VECTOR_MASK		0xFFFFFFFF
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e9ded2488a0b..69bf82fc7890 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -253,7 +253,7 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 	vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
 	vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
 	vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK;
-	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE & VMCB_AVIC_APIC_BAR_MASK;
+	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
 		avic_activate_vmcb(svm);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 13/67] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (11 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 12/67] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
                   ` (58 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Drop AVIC_HPA_MASK and all its users; the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).

All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.

Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code.  It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry.  And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.

The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:

  In some implementations, the bit used may be a physical address bit

Key word being "may".

Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h |  4 +---
 arch/x86/kvm/svm/avic.c    | 18 ++++++++++--------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 9d3f17732ab4..8b07939ef3b9 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -247,7 +247,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK		(1 << 31)
 
 #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK	GENMASK_ULL(11, 0)
-#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	(0xFFFFFFFFFFULL << 12)
+#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	GENMASK_ULL(51, 12)
 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK		(1ULL << 62)
 #define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK		(1ULL << 63)
 #define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK		(0xFFULL)
@@ -282,8 +282,6 @@ enum avic_ipi_failure_cause {
 static_assert((AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == AVIC_MAX_PHYSICAL_ID);
 static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_MAX_PHYSICAL_ID);
 
-#define AVIC_HPA_MASK	~((0xFFFULL << 52) | 0xFFF)
-
 #define SVM_SEV_FEAT_SNP_ACTIVE				BIT(0)
 #define SVM_SEV_FEAT_RESTRICTED_INJECTION		BIT(3)
 #define SVM_SEV_FEAT_ALTERNATE_INJECTION		BIT(4)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 69bf82fc7890..f04010f66595 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -250,9 +250,9 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
 	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
-	vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
-	vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
-	vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK;
+	vmcb->control.avic_backing_page = bpa;
+	vmcb->control.avic_logical_id = lpa;
+	vmcb->control.avic_physical_id = ppa;
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
@@ -310,9 +310,12 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	if (!entry)
 		return -EINVAL;
 
-	new_entry = __sme_set((page_to_phys(svm->avic_backing_page) &
-			      AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) |
-			      AVIC_PHYSICAL_ID_ENTRY_VALID_MASK);
+	/* Note, fls64() returns the bit position, +1. */
+	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
+		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
+
+	new_entry = __sme_set(page_to_phys(svm->avic_backing_page)) |
+		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(*entry, new_entry);
 
 	svm->avic_physical_id_cache = entry;
@@ -912,8 +915,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			enable_remapped_mode = false;
 
 			/* Try to enable guest_mode in IRTE */
-			pi.base = __sme_set(page_to_phys(svm->avic_backing_page) &
-					    AVIC_HPA_MASK);
+			pi.base = __sme_set(page_to_phys(svm->avic_backing_page));
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (12 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 13/67] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-15 11:11   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 15/67] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
                   ` (57 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f04010f66595..a1f4a08d35f5 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -243,14 +243,18 @@ int avic_vm_init(struct kvm *kvm)
 	return err;
 }
 
+static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
+{
+	return __sme_set(page_to_phys(svm->avic_backing_page));
+}
+
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
-	phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
 	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
 	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
-	vmcb->control.avic_backing_page = bpa;
+	vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
 	vmcb->control.avic_logical_id = lpa;
 	vmcb->control.avic_physical_id = ppa;
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
@@ -314,7 +318,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
 		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
 
-	new_entry = __sme_set(page_to_phys(svm->avic_backing_page)) |
+	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(*entry, new_entry);
 
@@ -854,7 +858,7 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
 		 irq.vector);
 	*svm = to_svm(vcpu);
-	vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page));
+	vcpu_info->pi_desc_addr = avic_get_backing_page_address(*svm);
 	vcpu_info->vector = irq.vector;
 
 	return 0;
@@ -915,7 +919,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			enable_remapped_mode = false;
 
 			/* Try to enable guest_mode in IRTE */
-			pi.base = __sme_set(page_to_phys(svm->avic_backing_page));
+			pi.base = avic_get_backing_page_address(svm);
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 15/67] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (13 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 16/67] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
                   ` (56 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
address of KVM's vAPIC page directly from the source.  Getting a physical
address from a kernel virtual address is not an expensive operation, and
getting the physical address from a struct page is *more* expensive for
CONFIG_SPARSEMEM=y kernels.  Regardless, none of the paths that consume
the address are hot paths, i.e. shaving cycles is not a priority.

Eliminating the "cache" means KVM doesn't have to worry about the cache
being invalid, which will simplify a future fix when dealing with vCPU IDs
that are too big.

WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
in-kernel local APIC.  avic_init_vcpu() bails early if the APIC is not
in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
been created, i.e. it should be impossible to reach
avic_init_backing_page() without the vAPIC being allocated.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 6 ++----
 arch/x86/kvm/svm/svm.h  | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a1f4a08d35f5..c8ba2ce4cfd8 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -245,7 +245,7 @@ int avic_vm_init(struct kvm *kvm)
 
 static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
 {
-	return __sme_set(page_to_phys(svm->avic_backing_page));
+	return __sme_set(__pa(svm->vcpu.arch.apic->regs));
 }
 
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
@@ -290,7 +290,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	    (id > X2AVIC_MAX_PHYSICAL_ID))
 		return -EINVAL;
 
-	if (!vcpu->arch.apic->regs)
+	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
 
 	if (kvm_apicv_activated(vcpu->kvm)) {
@@ -307,8 +307,6 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	svm->avic_backing_page = virt_to_page(vcpu->arch.apic->regs);
-
 	/* Setting AVIC backing page address in the phy APIC ID table */
 	entry = avic_get_physical_id_entry(vcpu, id);
 	if (!entry)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 294d5594c724..1cc4e145577c 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -301,7 +301,6 @@ struct vcpu_svm {
 
 	u32 ldr_reg;
 	u32 dfr_reg;
-	struct page *avic_backing_page;
 	u64 *avic_physical_id_cache;
 
 	/*
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 16/67] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (14 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 15/67] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
                   ` (55 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).

Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. it would be a breaking change if KVM were to ever try to enable AVIC
or x2AVIC by default.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  9 ++++++++-
 arch/x86/kvm/svm/avic.c         | 16 ++++++++++++++--
 arch/x86/kvm/svm/svm.h          |  3 ++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 54f3cf73329b..0583d8a9c8d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1304,6 +1304,12 @@ enum kvm_apicv_inhibit {
 	 */
 	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
 
+	/*
+	 * AVIC is disabled because the vCPU's APIC ID is beyond the max
+	 * supported by AVIC/x2AVIC, i.e. the vCPU is unaddressable.
+	 */
+	APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG,
+
 	NR_APICV_INHIBIT_REASONS,
 };
 
@@ -1322,7 +1328,8 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(IRQWIN),			\
 	__APICV_INHIBIT_REASON(PIT_REINJ),		\
 	__APICV_INHIBIT_REASON(SEV),			\
-	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
+	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
+	__APICV_INHIBIT_REASON(PHYSICAL_ID_TOO_BIG)
 
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c8ba2ce4cfd8..ba8dfc8a12f4 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -286,9 +286,21 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	int id = vcpu->vcpu_id;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	/*
+	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
+	 * hardware.  Do so immediately, i.e. don't defer the update via a
+	 * request, as avic_vcpu_load() expects to be called if and only if the
+	 * vCPU has fully initialized AVIC.  Immediately clear apicv_active,
+	 * as avic_vcpu_load() assumes avic_physical_id_cache is valid, i.e.
+	 * waiting until KVM_REQ_APICV_UPDATE is processed on the first KVM_RUN
+	 * will result in a NULL pointer dereference when loading the vCPU.
+	 */
 	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
-	    (id > X2AVIC_MAX_PHYSICAL_ID))
-		return -EINVAL;
+	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
+		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
+		vcpu->arch.apic->apicv_active = false;
+		return 0;
+	}
 
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1cc4e145577c..7af28802ebee 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -715,7 +715,8 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
 	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |	\
-	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED)	\
+	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED) |	\
+	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG)	\
 )
 
 bool avic_hardware_setup(void);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (15 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 16/67] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-15 11:16   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 18/67] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
                   ` (54 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check.  Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.

Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 47 +++++++++++++++++------------------------
 1 file changed, 19 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ba8dfc8a12f4..344541e418c3 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -265,35 +265,19 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 		avic_deactivate_vmcb(svm);
 }
 
-static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu,
-				       unsigned int index)
-{
-	u64 *avic_physical_id_table;
-	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
-
-	if ((!x2avic_enabled && index > AVIC_MAX_PHYSICAL_ID) ||
-	    (index > X2AVIC_MAX_PHYSICAL_ID))
-		return NULL;
-
-	avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page);
-
-	return &avic_physical_id_table[index];
-}
-
 static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 {
-	u64 *entry, new_entry;
-	int id = vcpu->vcpu_id;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 id = vcpu->vcpu_id;
+	u64 *table, new_entry;
 
 	/*
 	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
-	 * hardware.  Do so immediately, i.e. don't defer the update via a
-	 * request, as avic_vcpu_load() expects to be called if and only if the
-	 * vCPU has fully initialized AVIC.  Immediately clear apicv_active,
-	 * as avic_vcpu_load() assumes avic_physical_id_cache is valid, i.e.
-	 * waiting until KVM_REQ_APICV_UPDATE is processed on the first KVM_RUN
-	 * will result in a NULL pointer dereference when loading the vCPU.
+	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
+	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
+	 * avic_vcpu_load() expects to be called if and only if the vCPU has
+	 * fully initialized AVIC.
 	 */
 	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
 	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
@@ -302,6 +286,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
+	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
+		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
+
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
 
@@ -320,9 +307,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	}
 
 	/* Setting AVIC backing page address in the phy APIC ID table */
-	entry = avic_get_physical_id_entry(vcpu, id);
-	if (!entry)
-		return -EINVAL;
+	table = page_address(kvm_svm->avic_physical_id_table_page);
 
 	/* Note, fls64() returns the bit position, +1. */
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
@@ -330,9 +315,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
-	WRITE_ONCE(*entry, new_entry);
+	WRITE_ONCE(table[id], new_entry);
 
-	svm->avic_physical_id_cache = entry;
+	svm->avic_physical_id_cache = &table[id];
 
 	return 0;
 }
@@ -1018,6 +1003,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
 		return;
 
+	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+		return;
+
 	/*
 	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
 	 * is being scheduled in after being preempted.  The CPU entries in the
@@ -1058,6 +1046,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_preemption_disabled();
 
+	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+		return;
+
 	/*
 	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
 	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 18/67] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (16 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 19/67] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
                   ` (53 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity.  E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 49 ++++++++++++++---------------------------
 arch/x86/kvm/svm/svm.h  |  4 ++--
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 344541e418c3..ae6d2c00397f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -181,10 +181,8 @@ void avic_vm_destroy(struct kvm *kvm)
 	if (!enable_apicv)
 		return;
 
-	if (kvm_svm->avic_logical_id_table_page)
-		__free_page(kvm_svm->avic_logical_id_table_page);
-	if (kvm_svm->avic_physical_id_table_page)
-		__free_page(kvm_svm->avic_physical_id_table_page);
+	free_page((unsigned long)kvm_svm->avic_logical_id_table);
+	free_page((unsigned long)kvm_svm->avic_physical_id_table);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
 	hash_del(&kvm_svm->hnode);
@@ -197,27 +195,19 @@ int avic_vm_init(struct kvm *kvm)
 	int err = -ENOMEM;
 	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	struct kvm_svm *k2;
-	struct page *p_page;
-	struct page *l_page;
 	u32 vm_id;
 
 	if (!enable_apicv)
 		return 0;
 
-	/* Allocating physical APIC ID table (4KB) */
-	p_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!p_page)
+	kvm_svm->avic_physical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (!kvm_svm->avic_physical_id_table)
 		goto free_avic;
 
-	kvm_svm->avic_physical_id_table_page = p_page;
-
-	/* Allocating logical APIC ID table (4KB) */
-	l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!l_page)
+	kvm_svm->avic_logical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (!kvm_svm->avic_logical_id_table)
 		goto free_avic;
 
-	kvm_svm->avic_logical_id_table_page = l_page;
-
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
 	vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK;
@@ -251,12 +241,10 @@ static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
-	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
 	vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
-	vmcb->control.avic_logical_id = lpa;
-	vmcb->control.avic_physical_id = ppa;
+	vmcb->control.avic_logical_id = __sme_set(__pa(kvm_svm->avic_logical_id_table));
+	vmcb->control.avic_physical_id = __sme_set(__pa(kvm_svm->avic_physical_id_table));
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
@@ -270,7 +258,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 id = vcpu->vcpu_id;
-	u64 *table, new_entry;
+	u64 new_entry;
 
 	/*
 	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
@@ -286,8 +274,8 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
-	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
-		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
+	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE ||
+		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE);
 
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
@@ -306,18 +294,16 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	/* Setting AVIC backing page address in the phy APIC ID table */
-	table = page_address(kvm_svm->avic_physical_id_table_page);
-
 	/* Note, fls64() returns the bit position, +1. */
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
 		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
 
+	/* Setting AVIC backing page address in the phy APIC ID table */
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
-	WRITE_ONCE(table[id], new_entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
-	svm->avic_physical_id_cache = &table[id];
+	svm->avic_physical_id_cache = &kvm_svm->avic_physical_id_table[id];
 
 	return 0;
 }
@@ -451,7 +437,7 @@ static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source
 		if (apic_x2apic_mode(source))
 			avic_logical_id_table = NULL;
 		else
-			avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page);
+			avic_logical_id_table = kvm_svm->avic_logical_id_table;
 
 		/*
 		 * AVIC is inhibited if vCPUs aren't mapped 1:1 with logical
@@ -553,7 +539,6 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu)
 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
-	u32 *logical_apic_id_table;
 	u32 cluster, index;
 
 	ldr = GET_APIC_LOGICAL_ID(ldr);
@@ -574,9 +559,7 @@ static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 		return NULL;
 	index += (cluster << 2);
 
-	logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page);
-
-	return &logical_apic_id_table[index];
+	return &kvm_svm->avic_logical_id_table[index];
 }
 
 static void avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7af28802ebee..4c83b6b73714 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -119,8 +119,8 @@ struct kvm_svm {
 
 	/* Struct members for AVIC */
 	u32 avic_vm_id;
-	struct page *avic_logical_id_table_page;
-	struct page *avic_physical_id_table_page;
+	u32 *avic_logical_id_table;
+	u64 *avic_physical_id_table;
 	struct hlist_node hnode;
 
 	struct kvm_sev_info sev_info;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 19/67] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (17 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 18/67] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 20/67] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
                   ` (52 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly.  Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 27 +++++++++++++++------------
 arch/x86/kvm/svm/svm.h  |  1 -
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ae6d2c00397f..c4e6c97b736f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -303,8 +303,6 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
-	svm->avic_physical_id_cache = &kvm_svm->avic_physical_id_table[id];
-
 	return 0;
 }
 
@@ -779,13 +777,16 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct kvm_kernel_irqfd *irqfd,
 			   struct amd_iommu_pi_data *pi)
 {
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	unsigned long flags;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	irqfd->irq_bypass_vcpu = &svm->vcpu;
+	irqfd->irq_bypass_vcpu = vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -796,7 +797,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	 * will update the pCPU info when the vCPU is awakened and/or scheduled in.
 	 * See also avic_vcpu_load().
 	 */
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
@@ -976,17 +977,18 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	u64 entry;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
+	u64 entry;
 
 	lockdep_assert_preemption_disabled();
 
 	if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
 		return;
 
-	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
 	/*
@@ -1008,14 +1010,14 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
-	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
@@ -1023,13 +1025,14 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	u64 entry;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
+	u64 entry;
 
 	lockdep_assert_preemption_disabled();
 
-	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
 	/*
@@ -1039,7 +1042,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
 	 * recursively.
 	 */
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
@@ -1058,7 +1061,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 4c83b6b73714..e223e57f7def 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -301,7 +301,6 @@ struct vcpu_svm {
 
 	u32 ldr_reg;
 	u32 dfr_reg;
-	u64 *avic_physical_id_cache;
 
 	/*
 	 * Per-vcpu list of struct amd_svm_iommu_ir:
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 20/67] KVM: VMX: Move enable_ipiv knob to common x86
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (18 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 19/67] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 21/67] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
                   ` (51 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Move enable_ipiv to common x86 so that it can be reused by SVM to control
IPI virtualization when AVIC is enabled.  SVM doesn't actually provide a
way to truly disable IPI virtualization, but KVM can get close enough by
skipping the necessary table programming.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/capabilities.h | 1 -
 arch/x86/kvm/vmx/vmx.c          | 2 --
 arch/x86/kvm/x86.c              | 3 +++
 4 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0583d8a9c8d4..85f45fc5156d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1932,6 +1932,7 @@ struct kvm_arch_async_pf {
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
+extern bool __read_mostly enable_ipiv;
 extern bool __read_mostly enable_device_posted_irqs;
 extern struct kvm_x86_ops kvm_x86_ops;
 
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..5316c27f6099 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -15,7 +15,6 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
-extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ac7f1df612e8..56b68db345a7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -111,8 +111,6 @@ static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, 0444);
 
 module_param(enable_apicv, bool, 0444);
-
-bool __read_mostly enable_ipiv = true;
 module_param(enable_ipiv, bool, 0444);
 
 module_param(enable_device_posted_irqs, bool, 0444);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23376fcd928c..52d8d0635603 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -227,6 +227,9 @@ EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
 bool __read_mostly enable_apicv = true;
 EXPORT_SYMBOL_GPL(enable_apicv);
 
+bool __read_mostly enable_ipiv = true;
+EXPORT_SYMBOL_GPL(enable_ipiv);
+
 bool __read_mostly enable_device_posted_irqs = true;
 EXPORT_SYMBOL_GPL(enable_device_posted_irqs);
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 21/67] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (19 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 20/67] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 22/67] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
                   ` (50 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

From: Maxim Levitsky <mlevitsk@redhat.com>

Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning.  SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.

To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.

Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.

Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate a VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.

Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 32 ++++++++++++++++++++++++++------
 arch/x86/kvm/svm/svm.c  |  2 ++
 arch/x86/kvm/svm/svm.h  |  9 +++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c4e6c97b736f..eea362cd415d 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -301,6 +301,13 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	/* Setting AVIC backing page address in the phy APIC ID table */
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
+	svm->avic_physical_id_entry = new_entry;
+
+	/*
+	 * Initialize the real table, as vCPUs must have a valid entry in order
+	 * for broadcast IPIs to function correctly (broadcast IPIs ignore
+	 * invalid entries, i.e. aren't guaranteed to generate a VM-Exit).
+	 */
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
 	return 0;
@@ -778,8 +785,6 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct amd_iommu_pi_data *pi)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
-	struct kvm *kvm = vcpu->kvm;
-	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	unsigned long flags;
 	u64 entry;
 
@@ -797,7 +802,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	 * will update the pCPU info when the vCPU is awakened and/or scheduled in.
 	 * See also avic_vcpu_load().
 	 */
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
@@ -1010,14 +1015,26 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
+	svm->avic_physical_id_entry = entry;
+
+	/*
+	 * If IPI virtualization is disabled, clear IsRunning when updating the
+	 * actual Physical ID table, so that the CPU never sees IsRunning=1.
+	 * Keep the APIC ID up-to-date in the entry to minimize the chances of
+	 * things going sideways if hardware peeks at the ID.
+	 */
+	if (!enable_ipiv)
+		entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+
 	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
@@ -1042,7 +1059,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
 	 * recursively.
 	 */
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
@@ -1061,7 +1078,10 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+	svm->avic_physical_id_entry = entry;
+
+	if (enable_ipiv)
+		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index def76e63562d..43c4933d7da6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -230,6 +230,7 @@ module_param(tsc_scaling, int, 0444);
  */
 static bool avic;
 module_param(avic, bool, 0444);
+module_param(enable_ipiv, bool, 0444);
 
 module_param(enable_device_posted_irqs, bool, 0444);
 
@@ -5440,6 +5441,7 @@ static __init int svm_hardware_setup(void)
 	enable_apicv = avic = avic && avic_hardware_setup();
 
 	if (!enable_apicv) {
+		enable_ipiv = false;
 		svm_x86_ops.vcpu_blocking = NULL;
 		svm_x86_ops.vcpu_unblocking = NULL;
 		svm_x86_ops.vcpu_get_apicv_inhibit_reasons = NULL;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e223e57f7def..6ad0aa86f78d 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -302,6 +302,15 @@ struct vcpu_svm {
 	u32 ldr_reg;
 	u32 dfr_reg;
 
+	/*
+	 * This is essentially a shadow of the vCPU's actual entry in the
+	 * Physical ID table that is programmed into the VMCB, i.e. that is
+	 * seen by the CPU.  If IPI virtualization is disabled, IsRunning is
+	 * only ever set in the shadow, i.e. is never propagated to the "real"
+	 * table, so that hardware never sees IsRunning=1.
+	 */
+	u64 avic_physical_id_entry;
+
 	/*
 	 * Per-vcpu list of struct amd_svm_iommu_ir:
 	 * This is used mainly to store interrupt remapping information used
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 22/67] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (20 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 21/67] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 23/67] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
                   ` (49 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

From: Maxim Levitsky <mlevitsk@redhat.com>

Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.

The absence of the VM-Exit means KVM doesn't wake (or trigger a nested
VM-Exit for) the target vCPU(s) of the IPI, which can result in hung vCPUs,
unbounded delays in L2 execution, etc.

To work around the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1.  As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.

Disabling IPI virtualization does carry a performance penalty, but
benchmarking shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.

Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.
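The gating logic from the patch can be sketched as a pure predicate.  This is
an illustrative helper (`ipiv_allowed` is not a kernel function); the only
load-bearing fact from the patch is that 0x17 is the CPU *family* affected by
erratum #1235.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * AMD Family 17h (Zen1/Zen2) is subject to erratum #1235, so IPI
 * virtualization is force-disabled there regardless of the module param.
 */
static bool ipiv_allowed(bool enable_ipiv, unsigned int cpu_family)
{
	return enable_ipiv && cpu_family != 0x17;
}
```

Note the patch applies this unconditionally in avic_hardware_setup(), i.e.
the user cannot override the workaround.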

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index eea362cd415d..aba3f9d2ad02 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1199,6 +1199,14 @@ bool avic_hardware_setup(void)
 	if (x2avic_enabled)
 		pr_info("x2AVIC enabled\n");
 
+	/*
+	 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
+	 * due to erratum 1235, which results in missed GA log events and thus
+	 * missed wake events for blocking vCPUs due to the CPU failing to see
+	 * a software update to clear IsRunning.
+	 */
+	enable_ipiv = enable_ipiv && boot_cpu_data.x86 != 0x17;
+
 	amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
 
 	return true;
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 23/67] KVM: VMX: Suppress PI notifications whenever the vCPU is put
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (21 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 22/67] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 24/67] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
                   ` (48 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU
is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't
do anything in response to a notification IRQ that arrives in the host,
nor does KVM rely on the Outstanding Notification (PID.ON) flag when the
vCPU is unloaded.  And, the cost of scanning the PIR to manually set PID.ON
when loading the vCPU is quite small, especially relative to the cost of
loading (and unloading) a vCPU.

On the flip side, leaving SN clear means a notification for the vCPU will
result in a spurious IRQ for the pCPU, even if the vCPU task is scheduled
out, running in userspace, etc.  Even worse, if the pCPU is running a
different vCPU, the spurious IRQ could trigger posted interrupt processing
for the wrong vCPU, which is technically an architectural violation, as
bits set in the PIR aren't supposed to be propagated to the vIRR until a
notification IRQ is received.

The saving grace of the current behavior is that hardware sends
notification interrupts if and only if PID.ON=0, i.e. only the first
posted interrupt for a vCPU will trigger a spurious IRQ (for each window
where the vCPU is unloaded).

Ideally, KVM would suppress notifications before enabling IRQs in the
VM-Exit path, but KVM relies on PID.ON as an indicator that there is a posted
interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there
is no way to ask hardware to set PID.ON, but not generate an interrupt.
That could be solved by using pi_has_pending_interrupt() instead of
checking only PID.ON, but it's not at all clear that would be a performance
win, as KVM would end up scanning the entire PIR whenever an interrupt
isn't pending.

And long term, the spurious IRQ window, i.e. where a vCPU is loaded with
IRQs enabled, can effectively be made smaller for hot paths by moving
performance critical VM-Exit handlers into the fastpath, i.e. by never
enabling IRQs for hot path VM-Exits.
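The new put-time decision reduces to a three-input predicate, visible in the
diff below.  This sketch models it in isolation; `pick_pi_put_action` and the
enum are illustrative names, not KVM symbols.

```c
#include <assert.h>
#include <stdbool.h>

enum pi_put_action {
	ENABLE_WAKEUP_HANDLER,	/* notification IRQ must wake the vCPU */
	SUPPRESS_NOTIFICATIONS,	/* set PID.SN=1; sync PIR on next load */
};

/*
 * Only a vCPU that is genuinely blocking (not merely preempted) with guest
 * IRQs enabled needs the wakeup handler; in every other case the vCPU task
 * is still runnable or can't take the IRQ, so notifications are suppressed.
 */
static enum pi_put_action pick_pi_put_action(bool preempted, bool blocking,
					     bool irqs_blocked)
{
	if (!preempted && blocking && !irqs_blocked)
		return ENABLE_WAKEUP_HANDLER;
	return SUPPRESS_NOTIFICATIONS;
}
```

This is exactly the if/else restructuring the patch performs: the old code
only set SN for `preempted`, leaving SN clear (and spurious IRQs possible)
for all other put paths.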

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 1b6b655a2b8a..00818ca30ee0 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -70,13 +70,10 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	/*
 	 * If the vCPU wasn't on the wakeup list and wasn't migrated, then the
 	 * full update can be skipped as neither the vector nor the destination
-	 * needs to be changed.
+	 * needs to be changed.  Clear SN even if there is no assigned device,
+	 * again for simplicity.
 	 */
 	if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
-		/*
-		 * Clear SN if it was set due to being preempted.  Again, do
-		 * this even if there is no assigned device for simplicity.
-		 */
 		if (pi_test_and_clear_sn(pi_desc))
 			goto after_clear_sn;
 		return;
@@ -200,15 +197,22 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 	if (!vmx_needs_pi_wakeup(vcpu))
 		return;
 
-	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+	/*
+	 * If the vCPU is blocking with IRQs enabled and ISN'T being preempted,
+	 * enable the wakeup handler so that notification IRQ wakes the vCPU as
+	 * expected.  There is no need to enable the wakeup handler if the vCPU
+	 * is preempted between setting its wait state and manually scheduling
+	 * out, as the task is still runnable, i.e. doesn't need a wake event
+	 * from KVM to be scheduled in.
+	 *
+	 * If the wakeup handler isn't being enabled, Suppress Notifications as
+	 * the cost of propagating PIR.IRR to PID.ON is negligible compared to
+	 * the cost of a spurious IRQ, and vCPU put/load is a slow path.
+	 */
+	if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&
+	    !vmx_interrupt_blocked(vcpu))
 		pi_enable_wakeup_handler(vcpu);
-
-	/*
-	 * Set SN when the vCPU is preempted.  Note, the vCPU can both be seen
-	 * as blocking and preempted, e.g. if it's preempted between setting
-	 * its wait state and manually scheduling out.
-	 */
-	if (vcpu->preempted)
+	else
 		pi_set_sn(pi_desc);
 }
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 24/67] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (22 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 23/67] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
                   ` (47 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Add a comment to explain why KVM clears IsRunning when putting a vCPU,
even though leaving IsRunning=1 would be okay from a functional perspective.
Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
fast as to induce a 50%+ loss in performance.

Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index aba3f9d2ad02..60e6e82fe41f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1133,19 +1133,24 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
 
-       /*
-        * Unload the AVIC when the vCPU is about to block, _before_
-        * the vCPU actually blocks.
-        *
-        * Any IRQs that arrive before IsRunning=0 will not cause an
-        * incomplete IPI vmexit on the source, therefore vIRR will also
-        * be checked by kvm_vcpu_check_block() before blocking.  The
-        * memory barrier implicit in set_current_state orders writing
-        * IsRunning=0 before reading the vIRR.  The processor needs a
-        * matching memory barrier on interrupt delivery between writing
-        * IRR and reading IsRunning; the lack of this barrier might be
-        * the cause of errata #1235).
-        */
+	/*
+	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
+	 * actually blocks.
+	 *
+	 * Note, any IRQs that arrive before IsRunning=0 will not cause an
+	 * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
+	 * this by checking vIRR one last time before blocking.  The memory
+	 * barrier implicit in set_current_state orders writing IsRunning=0
+	 * before reading the vIRR.  The processor needs a matching memory
+	 * barrier on interrupt delivery between writing IRR and reading
+	 * IsRunning; the lack of this barrier might be the cause of erratum #1235.
+	 *
+	 * Set IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
+	 * doesn't need to detect events for scheduling purposes.  The doorbell
+	 * used to signal running vCPUs cannot be blocked, i.e. will perturb the
+	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
+	 * to the vCPU while it's scheduled out.
+	 */
 	avic_vcpu_put(vcpu);
 }
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (23 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 24/67] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-18 12:24   ` Vasant Hegde
  2025-04-04 19:38 ` [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
                   ` (46 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer.  KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
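The redundancy argument can be made concrete.  In this sketch the struct
layouts are simplified stand-ins (only the two fields at issue), not the real
`amd_iommu_pi_data`/`vcpu_data` definitions; the `>> 12` matches the page
shift used by amd_ir_set_vcpu_affinity().

```c
#include <assert.h>
#include <stdint.h>

struct vcpu_data_sim {
	uint64_t pi_desc_addr;	/* backing page / posted IRQ descriptor */
};

struct pi_data_sim {
	uint64_t base;		/* the field being deleted */
	struct vcpu_data_sim *vcpu_data;
};

/* After the patch: derive the GA root pointer from pi_desc_addr alone. */
static uint64_t ga_root_ptr(const struct pi_data_sim *pi)
{
	return pi->vcpu_data->pi_desc_addr >> 12;
}
```

Since KVM always wrote both fields from the same backing-page address, the
old `pi->base >> 12` and the new expression are equal by construction.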

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 7 +++++--
 drivers/iommu/amd/iommu.c | 2 +-
 include/linux/amd-iommu.h | 1 -
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 60e6e82fe41f..9024b9fbca53 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -902,8 +902,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 			enable_remapped_mode = false;
 
-			/* Try to enable guest_mode in IRTE */
-			pi.base = avic_get_backing_page_address(svm);
+			/*
+			 * Try to enable guest_mode in IRTE.  Note, the address
+			 * of the vCPU's AVIC backing page is passed to the
+			 * IOMMU via vcpu_info->pi_desc_addr.
+			 */
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4f69a37cf143..635774642b89 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3860,7 +3860,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 
 	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
 	if (pi_data->is_guest_mode) {
-		ir_data->ga_root_ptr = (pi_data->base >> 12);
+		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 062fbd4c9b77..4f433ef39188 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -20,7 +20,6 @@ struct amd_iommu;
 struct amd_iommu_pi_data {
 	u32 ga_tag;
 	u32 prev_ga_tag;
-	u64 base;
 	bool is_guest_mode;
 	struct vcpu_data *vcpu_data;
 	void *ir_data;
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (24 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-08 16:57   ` Paolo Bonzini
  2025-04-18 12:25   ` Vasant Hegde
  2025-04-04 19:38 ` [PATCH 27/67] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
                   ` (45 subsequent siblings)
  71 siblings, 2 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c             |  2 --
 drivers/iommu/amd/amd_iommu_types.h |  1 -
 drivers/iommu/amd/iommu.c           | 10 ----------
 include/linux/amd-iommu.h           |  1 -
 4 files changed, 14 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 9024b9fbca53..7f0f6a9cd2e8 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -943,9 +943,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		/**
 		 * Here, pi is used to:
 		 * - Tell IOMMU to use legacy mode for this interrupt.
-		 * - Retrieve ga_tag of prior interrupt remapping data.
 		 */
-		pi.prev_ga_tag = 0;
 		pi.is_guest_mode = false;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
 	} else {
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 23caea22f8dc..319a1b650b3b 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -1060,7 +1060,6 @@ struct irq_2_irte {
 };
 
 struct amd_ir_data {
-	u32 cached_ga_tag;
 	struct amd_iommu *iommu;
 	struct irq_2_irte irq_2_irte;
 	struct msi_msg msi_entry;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 635774642b89..3c40bc9980b7 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3858,23 +3858,13 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	ir_data->cfg = irqd_cfg(data);
 	pi_data->ir_data = ir_data;
 
-	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
 	if (pi_data->is_guest_mode) {
 		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
-		if (!ret)
-			ir_data->cached_ga_tag = pi_data->ga_tag;
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
-
-		/*
-		 * This communicates the ga_tag back to the caller
-		 * so that it can do all the necessary clean up.
-		 */
-		if (!ret)
-			ir_data->cached_ga_tag = 0;
 	}
 
 	return ret;
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 4f433ef39188..deeefc92a5cf 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -19,7 +19,6 @@ struct amd_iommu;
  */
 struct amd_iommu_pi_data {
 	u32 ga_tag;
-	u32 prev_ga_tag;
 	bool is_guest_mode;
 	struct vcpu_data *vcpu_data;
 	void *ir_data;
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 27/67] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (25 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 28/67] KVM: SVM: Get vCPU info for IRTE using new routing entry Sean Christopherson
                   ` (44 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 15 +++------------
 drivers/iommu/amd/iommu.c | 10 +++++++---
 2 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 7f0f6a9cd2e8..9c789c288314 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -936,19 +936,10 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		}
 	}
 
-	if (enable_remapped_mode) {
-		/* Use legacy mode in IRTE */
-		struct amd_iommu_pi_data pi;
-
-		/**
-		 * Here, pi is used to:
-		 * - Tell IOMMU to use legacy mode for this interrupt.
-		 */
-		pi.is_guest_mode = false;
-		ret = irq_set_vcpu_affinity(host_irq, &pi);
-	} else {
+	if (enable_remapped_mode)
+		ret = irq_set_vcpu_affinity(host_irq, NULL);
+	else
 		ret = 0;
-	}
 out:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return ret;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3c40bc9980b7..08c4fa31da5d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3835,7 +3835,6 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 {
 	int ret;
 	struct amd_iommu_pi_data *pi_data = vcpu_info;
-	struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
 	struct amd_ir_data *ir_data = data->chip_data;
 	struct irq_2_irte *irte_info = &ir_data->irq_2_irte;
 	struct iommu_dev_data *dev_data;
@@ -3856,9 +3855,14 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 		return -EINVAL;
 
 	ir_data->cfg = irqd_cfg(data);
-	pi_data->ir_data = ir_data;
 
-	if (pi_data->is_guest_mode) {
+	if (pi_data) {
+		struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
+
+		pi_data->ir_data = ir_data;
+
+		WARN_ON_ONCE(!pi_data->is_guest_mode);
+
 		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 28/67] KVM: SVM: Get vCPU info for IRTE using new routing entry
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (26 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 27/67] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
                   ` (43 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Explicitly get the vCPU information for a GSI routing entry from the new
(or current) entry provided by common KVM.  This is subtly a nop, as KVM
allows at most one MSI per GSI, i.e. the for-loop can only ever process
one entry, and that entry is the new/current entry.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 9c789c288314..eb6017b01c5f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -855,7 +855,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
 	bool enable_remapped_mode = true;
-	bool set = !!new;
 	int idx, ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -868,7 +867,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	svm_ir_list_del(irqfd);
 
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, set);
+		 __func__, host_irq, guest_irq, !!new);
 
 	idx = srcu_read_lock(&kvm->irq_srcu);
 	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
@@ -896,7 +895,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * 3. APIC virtualization is disabled for the vcpu.
 		 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
 		 */
-		if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
+		if (new && !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
 		    kvm_vcpu_apicv_active(&svm->vcpu)) {
 			struct amd_iommu_pi_data pi;
 
@@ -927,7 +926,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		if (!ret && svm) {
 			trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
 						 e->gsi, vcpu_info.vector,
-						 vcpu_info.pi_desc_addr, set);
+						 vcpu_info.pi_desc_addr, !!new);
 		}
 
 		if (ret < 0) {
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (27 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 28/67] KVM: SVM: Get vCPU info for IRTE using new routing entry Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-08 16:56   ` Paolo Bonzini
  2025-04-04 19:38 ` [PATCH 30/67] KVM: VMX: " Sean Christopherson
                   ` (42 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that KVM SVM simply uses the provided routing entry, stop walking the
routing table to find that entry.  KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs.
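
The invariant this relies on can be illustrated with a toy, non-kernel
sketch (all names here are hypothetical, not KVM's): if a GSI maps to at
most one MSI route, a per-GSI "list" holds zero or one entry, so a direct
lookup is equivalent to walking the list.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the invariant: a second MSI route on the same GSI is
 * rejected at setup time, so lookup never has to iterate.
 */
#define NR_GSIS 8

struct msi_route {
	unsigned int gsi;
	unsigned int vector;
	int valid;
};

static struct msi_route routes[NR_GSIS];

static int set_route(unsigned int gsi, unsigned int vector)
{
	if (gsi >= NR_GSIS || routes[gsi].valid)
		return -1;	/* at most one MSI per GSI */
	routes[gsi] = (struct msi_route){ .gsi = gsi, .vector = vector,
					  .valid = 1 };
	return 0;
}

static struct msi_route *get_route(unsigned int gsi)
{
	if (gsi >= NR_GSIS || !routes[gsi].valid)
		return NULL;
	return &routes[gsi];
}
```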

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 106 ++++++++++++++++------------------------
 1 file changed, 43 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index eb6017b01c5f..685a7b01194b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -852,10 +852,10 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_kernel_irq_routing_entry *new)
 {
-	struct kvm_kernel_irq_routing_entry *e;
-	struct kvm_irq_routing_table *irq_rt;
 	bool enable_remapped_mode = true;
-	int idx, ret = 0;
+	struct vcpu_data vcpu_info;
+	struct vcpu_svm *svm = NULL;
+	int ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
@@ -869,70 +869,51 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
 		 __func__, host_irq, guest_irq, !!new);
 
-	idx = srcu_read_lock(&kvm->irq_srcu);
-	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+	/**
+	 * Here, we setup with legacy mode in the following cases:
+	 * 1. When cannot target interrupt to a specific vcpu.
+	 * 2. Unsetting posted interrupt.
+	 * 3. APIC virtualization is disabled for the vcpu.
+	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
+	 */
+	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
+	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
+	    kvm_vcpu_apicv_active(&svm->vcpu)) {
+		struct amd_iommu_pi_data pi;
 
-	if (guest_irq >= irq_rt->nr_rt_entries ||
-		hlist_empty(&irq_rt->map[guest_irq])) {
-		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
-			     guest_irq, irq_rt->nr_rt_entries);
-		goto out;
-	}
+		enable_remapped_mode = false;
 
-	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
-		struct vcpu_data vcpu_info;
-		struct vcpu_svm *svm = NULL;
-
-		if (e->type != KVM_IRQ_ROUTING_MSI)
-			continue;
-
-		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+		/*
+		 * Try to enable guest_mode in IRTE.  Note, the address
+		 * of the vCPU's AVIC backing page is passed to the
+		 * IOMMU via vcpu_info->pi_desc_addr.
+		 */
+		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+					     svm->vcpu.vcpu_id);
+		pi.is_guest_mode = true;
+		pi.vcpu_data = &vcpu_info;
+		ret = irq_set_vcpu_affinity(host_irq, &pi);
 
 		/**
-		 * Here, we setup with legacy mode in the following cases:
-		 * 1. When cannot target interrupt to a specific vcpu.
-		 * 2. Unsetting posted interrupt.
-		 * 3. APIC virtualization is disabled for the vcpu.
-		 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
+		 * Here, we successfully setting up vcpu affinity in
+		 * IOMMU guest mode. Now, we need to store the posted
+		 * interrupt information in a per-vcpu ir_list so that
+		 * we can reference to them directly when we update vcpu
+		 * scheduling information in IOMMU irte.
 		 */
-		if (new && !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
-		    kvm_vcpu_apicv_active(&svm->vcpu)) {
-			struct amd_iommu_pi_data pi;
-
-			enable_remapped_mode = false;
-
-			/*
-			 * Try to enable guest_mode in IRTE.  Note, the address
-			 * of the vCPU's AVIC backing page is passed to the
-			 * IOMMU via vcpu_info->pi_desc_addr.
-			 */
-			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-						     svm->vcpu.vcpu_id);
-			pi.is_guest_mode = true;
-			pi.vcpu_data = &vcpu_info;
-			ret = irq_set_vcpu_affinity(host_irq, &pi);
-
-			/**
-			 * Here, we successfully setting up vcpu affinity in
-			 * IOMMU guest mode. Now, we need to store the posted
-			 * interrupt information in a per-vcpu ir_list so that
-			 * we can reference to them directly when we update vcpu
-			 * scheduling information in IOMMU irte.
-			 */
-			if (!ret && pi.is_guest_mode)
-				svm_ir_list_add(svm, irqfd, &pi);
-		}
-
-		if (!ret && svm) {
-			trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
-						 e->gsi, vcpu_info.vector,
-						 vcpu_info.pi_desc_addr, !!new);
-		}
-
-		if (ret < 0) {
-			pr_err("%s: failed to update PI IRTE\n", __func__);
-			goto out;
-		}
+		if (!ret)
+			ret = svm_ir_list_add(svm, irqfd, &pi);
+	}
+
+	if (!ret && svm) {
+		trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
+					 guest_irq, vcpu_info.vector,
+					 vcpu_info.pi_desc_addr, !!new);
+	}
+
+	if (ret < 0) {
+		pr_err("%s: failed to update PI IRTE\n", __func__);
+		goto out;
 	}
 
 	if (enable_remapped_mode)
@@ -940,7 +921,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	else
 		ret = 0;
 out:
-	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return ret;
 }
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 30/67] KVM: VMX: Stop walking list of routing table entries when updating IRTE
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (28 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-08 17:00   ` Paolo Bonzini
  2025-04-04 19:38 ` [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
                   ` (41 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that KVM provides the to-be-updated routing entry, stop walking the
routing table to find that entry.  KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs, i.e. the for-loop can only process one entry.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 100 +++++++++++----------------------
 1 file changed, 33 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 00818ca30ee0..786912cee3f8 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -268,78 +268,44 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
 		       struct kvm_kernel_irq_routing_entry *new)
 {
-	struct kvm_kernel_irq_routing_entry *e;
-	struct kvm_irq_routing_table *irq_rt;
-	bool enable_remapped_mode = true;
 	struct kvm_lapic_irq irq;
 	struct kvm_vcpu *vcpu;
 	struct vcpu_data vcpu_info;
-	bool set = !!new;
-	int idx, ret = 0;
 
 	if (!vmx_can_use_vtd_pi(kvm))
 		return 0;
 
-	idx = srcu_read_lock(&kvm->irq_srcu);
-	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
-	if (guest_irq >= irq_rt->nr_rt_entries ||
-	    hlist_empty(&irq_rt->map[guest_irq])) {
-		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
-			     guest_irq, irq_rt->nr_rt_entries);
-		goto out;
-	}
-
-	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
-		if (e->type != KVM_IRQ_ROUTING_MSI)
-			continue;
-
-		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
-
-		/*
-		 * VT-d PI cannot support posting multicast/broadcast
-		 * interrupts to a vCPU, we still use interrupt remapping
-		 * for these kind of interrupts.
-		 *
-		 * For lowest-priority interrupts, we only support
-		 * those with single CPU as the destination, e.g. user
-		 * configures the interrupts via /proc/irq or uses
-		 * irqbalance to make the interrupts single-CPU.
-		 *
-		 * We will support full lowest-priority interrupt later.
-		 *
-		 * In addition, we can only inject generic interrupts using
-		 * the PI mechanism, refuse to route others through it.
-		 */
-
-		kvm_set_msi_irq(kvm, e, &irq);
-		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
-		    !kvm_irq_is_postable(&irq))
-			continue;
-
-		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
-		vcpu_info.vector = irq.vector;
-
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
-				vcpu_info.vector, vcpu_info.pi_desc_addr, set);
-
-		if (!set)
-			continue;
-
-		enable_remapped_mode = false;
-
-		ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
-		if (ret < 0) {
-			printk(KERN_INFO "%s: failed to update PI IRTE\n",
-					__func__);
-			goto out;
-		}
-	}
-
-	if (enable_remapped_mode)
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
-
-	ret = 0;
-out:
-	srcu_read_unlock(&kvm->irq_srcu, idx);
-	return ret;
+	/*
+	 * VT-d PI cannot support posting multicast/broadcast
+	 * interrupts to a vCPU, we still use interrupt remapping
+	 * for these kind of interrupts.
+	 *
+	 * For lowest-priority interrupts, we only support
+	 * those with single CPU as the destination, e.g. user
+	 * configures the interrupts via /proc/irq or uses
+	 * irqbalance to make the interrupts single-CPU.
+	 *
+	 * We will support full lowest-priority interrupt later.
+	 *
+	 * In addition, we can only inject generic interrupts using
+	 * the PI mechanism, refuse to route others through it.
+	 */
+	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
+		goto do_remapping;
+
+	kvm_set_msi_irq(kvm, new, &irq);
+
+	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+	    !kvm_irq_is_postable(&irq))
+		goto do_remapping;
+
+	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
+	vcpu_info.vector = irq.vector;
+
+	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
+
+	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+do_remapping:
+	return irq_set_vcpu_affinity(host_irq, NULL);
 }
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (29 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 30/67] KVM: VMX: " Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-23 15:21   ` Francesco Lavra
  2025-04-04 19:38 ` [PATCH 32/67] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
                   ` (40 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX.
The only SVM specific information it provides is the AVIC backing page, and
that can be trivially retrieved by its sole caller.

No functional change intended.
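
As a hedged, standalone sketch of the refactor pattern (hypothetical names,
not the kernel's): the generic helper returns only vendor-neutral data, and
the lone caller derives the vendor-specific piece itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vcpu { unsigned long backing_page_pa; };

/* Generic: identify the target and vector; nothing vendor-specific. */
static int resolve_target(struct toy_vcpu *candidate, unsigned int vec,
			  struct toy_vcpu **vcpu, unsigned int *vector)
{
	if (!candidate)
		return -1;
	*vcpu = candidate;
	*vector = vec;
	return 0;
}

/* Caller: fetch the vendor-only detail (here, a backing page) itself. */
static unsigned long caller_backing_page(struct toy_vcpu *candidate)
{
	struct toy_vcpu *vcpu;
	unsigned int vector;

	if (resolve_target(candidate, 0x30, &vcpu, &vector))
		return 0;
	return vcpu->backing_page_pa;
}
```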

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 685a7b01194b..ea6eae72b941 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -825,14 +825,14 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
  */
 static int
 get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		 struct vcpu_data *vcpu_info, struct vcpu_svm **svm)
+		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
 {
 	struct kvm_lapic_irq irq;
-	struct kvm_vcpu *vcpu = NULL;
+	*vcpu = NULL;
 
 	kvm_set_msi_irq(kvm, e, &irq);
 
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
 	    !kvm_irq_is_postable(&irq)) {
 		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
 			 __func__, irq.vector);
@@ -841,8 +841,6 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 
 	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
 		 irq.vector);
-	*svm = to_svm(vcpu);
-	vcpu_info->pi_desc_addr = avic_get_backing_page_address(*svm);
 	vcpu_info->vector = irq.vector;
 
 	return 0;
@@ -854,7 +852,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 {
 	bool enable_remapped_mode = true;
 	struct vcpu_data vcpu_info;
-	struct vcpu_svm *svm = NULL;
+	struct kvm_vcpu *vcpu = NULL;
 	int ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -876,20 +874,21 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 * 3. APIC virtualization is disabled for the vcpu.
 	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
 	 */
-	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
-	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
-	    kvm_vcpu_apicv_active(&svm->vcpu)) {
+	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
+	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
+	    kvm_vcpu_apicv_active(vcpu)) {
 		struct amd_iommu_pi_data pi;
 
 		enable_remapped_mode = false;
 
+		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
+
 		/*
 		 * Try to enable guest_mode in IRTE.  Note, the address
 		 * of the vCPU's AVIC backing page is passed to the
 		 * IOMMU via vcpu_info->pi_desc_addr.
 		 */
-		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-					     svm->vcpu.vcpu_id);
+		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
 		pi.is_guest_mode = true;
 		pi.vcpu_data = &vcpu_info;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
@@ -902,11 +901,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * scheduling information in IOMMU irte.
 		 */
 		if (!ret)
-			ret = svm_ir_list_add(svm, irqfd, &pi);
+			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
 	}
 
-	if (!ret && svm) {
-		trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
+	if (!ret && vcpu) {
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
 					 guest_irq, vcpu_info.vector,
 					 vcpu_info.pi_desc_addr, !!new);
 	}
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 32/67] KVM: x86: Nullify irqfd->producer after updating IRTEs
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (30 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
                   ` (39 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Nullify irqfd->producer (when it's going away) _after_ updating IRTEs so
that the producer can be queried during the update.
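
The ordering fix can be sketched outside the kernel (illustrative names,
not KVM's API): clear the producer pointer only after the IRTE update, so
the update path can still dereference it.

```c
#include <assert.h>
#include <stddef.h>

struct producer { int irq; };
struct irqfd { struct producer *producer; int last_seen_irq; };

static int update_irte(struct irqfd *irqfd)
{
	/* The update path may need producer state (e.g. the host IRQ). */
	if (!irqfd->producer)
		return -1;
	irqfd->last_seen_irq = irqfd->producer->irq;
	return 0;
}

static int del_producer(struct irqfd *irqfd)
{
	int ret = update_irte(irqfd);	/* update while producer is valid */

	irqfd->producer = NULL;		/* ...then drop the reference */
	return ret;
}
```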

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52d8d0635603..b8b259847d05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13608,7 +13608,6 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	 * int this case doesn't want to receive the interrupts.
 	*/
 	spin_lock_irq(&kvm->irqfds.lock);
-	irqfd->producer = NULL;
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
 		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
@@ -13617,6 +13616,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
 				irqfd->consumer.token, ret);
 	}
+	irqfd->producer = NULL;
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (31 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 32/67] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-08 17:30   ` Paolo Bonzini
  2025-04-24  4:39   ` Sairaj Kodilkar
  2025-04-04 19:38 ` [PATCH 34/67] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
                   ` (38 subsequent siblings)
  71 siblings, 2 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Hoist the logic for identifying the target vCPU for a posted interrupt
into common x86.  The code is functionally identical between Intel and
AMD.
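
A minimal sketch of the decision being hoisted (toy types and names, not
the kernel's): resolve the posted-interrupt target once in common code,
handing vendor code either a target vCPU or NULL to mean "use remapped
mode".

```c
#include <assert.h>
#include <stddef.h>

struct toy_vcpu { int id; };

static struct toy_vcpu *pi_target(struct toy_vcpu *single_dest,
				  int is_postable)
{
	/*
	 * Multicast/broadcast IRQs (no single destination) and
	 * non-postable delivery modes (SMI, INIT, ...) force
	 * remapped mode, signalled by a NULL target.
	 */
	if (!single_dest || !is_postable)
		return NULL;
	return single_dest;
}
```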

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/svm/avic.c         | 83 ++++++++-------------------------
 arch/x86/kvm/svm/svm.h          |  3 +-
 arch/x86/kvm/vmx/posted_intr.c  | 56 ++++++----------------
 arch/x86/kvm/vmx/posted_intr.h  |  3 +-
 arch/x86/kvm/x86.c              | 46 +++++++++++++++---
 6 files changed, 81 insertions(+), 113 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 85f45fc5156d..cb98d8d3c6c2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1838,7 +1838,8 @@ struct kvm_x86_ops {
 
 	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			      unsigned int host_irq, uint32_t guest_irq,
-			      struct kvm_kernel_irq_routing_entry *new);
+			      struct kvm_kernel_irq_routing_entry *new,
+			      struct kvm_vcpu *vcpu, u32 vector);
 	void (*pi_start_assignment)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ea6eae72b941..666f518340a7 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -812,52 +812,13 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	return 0;
 }
 
-/*
- * Note:
- * The HW cannot support posting multicast/broadcast
- * interrupts to a vCPU. So, we still use legacy interrupt
- * remapping for these kind of interrupts.
- *
- * For lowest-priority interrupts, we only support
- * those with single CPU as the destination, e.g. user
- * configures the interrupts via /proc/irq or uses
- * irqbalance to make the interrupts single-CPU.
- */
-static int
-get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
-{
-	struct kvm_lapic_irq irq;
-	*vcpu = NULL;
-
-	kvm_set_msi_irq(kvm, e, &irq);
-
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
-	    !kvm_irq_is_postable(&irq)) {
-		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
-			 __func__, irq.vector);
-		return -1;
-	}
-
-	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
-		 irq.vector);
-	vcpu_info->vector = irq.vector;
-
-	return 0;
-}
-
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
-			struct kvm_kernel_irq_routing_entry *new)
+			struct kvm_kernel_irq_routing_entry *new,
+			struct kvm_vcpu *vcpu, u32 vector)
 {
-	bool enable_remapped_mode = true;
-	struct vcpu_data vcpu_info;
-	struct kvm_vcpu *vcpu = NULL;
 	int ret = 0;
 
-	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
-		return 0;
-
 	/*
 	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
 	 * from the *previous* vCPU's list.
@@ -865,7 +826,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	svm_ir_list_del(irqfd);
 
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, !!new);
+		 __func__, host_irq, guest_irq, !!vcpu);
 
 	/**
 	 * Here, we setup with legacy mode in the following cases:
@@ -874,23 +835,23 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 * 3. APIC virtualization is disabled for the vcpu.
 	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
 	 */
-	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
-	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
-	    kvm_vcpu_apicv_active(vcpu)) {
-		struct amd_iommu_pi_data pi;
-
-		enable_remapped_mode = false;
-
-		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
-
+	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
 		/*
 		 * Try to enable guest_mode in IRTE.  Note, the address
 		 * of the vCPU's AVIC backing page is passed to the
 		 * IOMMU via vcpu_info->pi_desc_addr.
 		 */
-		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
-		pi.is_guest_mode = true;
-		pi.vcpu_data = &vcpu_info;
+		struct vcpu_data vcpu_info = {
+			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
+			.vector = vector,
+		};
+
+		struct amd_iommu_pi_data pi = {
+			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
+			.is_guest_mode = true,
+			.vcpu_data = &vcpu_info,
+		};
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
 
 		/**
@@ -902,12 +863,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		if (!ret)
 			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-	}
 
-	if (!ret && vcpu) {
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
-					 guest_irq, vcpu_info.vector,
-					 vcpu_info.pi_desc_addr, !!new);
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+					 vector, vcpu_info.pi_desc_addr, true);
+	} else {
+		ret = irq_set_vcpu_affinity(host_irq, NULL);
 	}
 
 	if (ret < 0) {
@@ -915,10 +875,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		goto out;
 	}
 
-	if (enable_remapped_mode)
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
-	else
-		ret = 0;
+	ret = 0;
 out:
 	return ret;
 }
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6ad0aa86f78d..5ce240085ee0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -741,7 +741,8 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
-			struct kvm_kernel_irq_routing_entry *new);
+			struct kvm_kernel_irq_routing_entry *new,
+			struct kvm_vcpu *vcpu, u32 vector);
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 786912cee3f8..fd5f6a125614 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -266,46 +266,20 @@ void vmx_pi_start_assignment(struct kvm *kvm)
 
 int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
-		       struct kvm_kernel_irq_routing_entry *new)
+		       struct kvm_kernel_irq_routing_entry *new,
+		       struct kvm_vcpu *vcpu, u32 vector)
 {
-	struct kvm_lapic_irq irq;
-	struct kvm_vcpu *vcpu;
-	struct vcpu_data vcpu_info;
-
-	if (!vmx_can_use_vtd_pi(kvm))
-		return 0;
-
-	/*
-	 * VT-d PI cannot support posting multicast/broadcast
-	 * interrupts to a vCPU, we still use interrupt remapping
-	 * for these kind of interrupts.
-	 *
-	 * For lowest-priority interrupts, we only support
-	 * those with single CPU as the destination, e.g. user
-	 * configures the interrupts via /proc/irq or uses
-	 * irqbalance to make the interrupts single-CPU.
-	 *
-	 * We will support full lowest-priority interrupt later.
-	 *
-	 * In addition, we can only inject generic interrupts using
-	 * the PI mechanism, refuse to route others through it.
-	 */
-	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
-		goto do_remapping;
-
-	kvm_set_msi_irq(kvm, new, &irq);
-
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
-	    !kvm_irq_is_postable(&irq))
-		goto do_remapping;
-
-	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
-	vcpu_info.vector = irq.vector;
-
-	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
-
-	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
-do_remapping:
-	return irq_set_vcpu_affinity(host_irq, NULL);
+	if (vcpu) {
+		struct vcpu_data vcpu_info = {
+			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
+			.vector = vector,
+		};
+
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
+
+		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+	} else {
+		return irq_set_vcpu_affinity(host_irq, NULL);
+	}
 }
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index a586d6aaf862..ee3e19e976ac 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -15,7 +15,8 @@ void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
-		       struct kvm_kernel_irq_routing_entry *new);
+		       struct kvm_kernel_irq_routing_entry *new,
+		       struct kvm_vcpu *vcpu, u32 vector);
 void vmx_pi_start_assignment(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b8b259847d05..0ab818bba743 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13567,6 +13567,43 @@ bool kvm_arch_has_irq_bypass(void)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_irq_bypass);
 
+static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
+			      struct kvm_kernel_irq_routing_entry *old,
+			      struct kvm_kernel_irq_routing_entry *new)
+{
+	struct kvm *kvm = irqfd->kvm;
+	struct kvm_vcpu *vcpu = NULL;
+	struct kvm_lapic_irq irq;
+
+	if (!irqchip_in_kernel(kvm) ||
+	    !kvm_arch_has_irq_bypass() ||
+	    !kvm_arch_has_assigned_device(kvm))
+		return 0;
+
+	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
+		kvm_set_msi_irq(kvm, new, &irq);
+
+		/*
+		 * Force remapped mode if hardware doesn't support posting the
+		 * virtual interrupt to a vCPU.  Only IRQs are postable (NMIs,
+		 * SMIs, etc. are not), and neither AMD nor Intel IOMMUs support
+		 * posting multicast/broadcast IRQs.  If the interrupt can't be
+		 * posted, the device MSI needs to be routed to the host so that
+		 * the guest's desired interrupt can be synthesized by KVM.
+		 *
+		 * This means that KVM can only post lowest-priority interrupts
+		 * if they have a single CPU as the destination, e.g. only if
+		 * the guest has affined the interrupt to a single vCPU.
+		 */
+		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+		    !kvm_irq_is_postable(&irq))
+			vcpu = NULL;
+	}
+
+	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
+					    irqfd->gsi, new, vcpu, irq.vector);
+}
+
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 				      struct irq_bypass_producer *prod)
 {
@@ -13581,8 +13618,7 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	irqfd->producer = prod;
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, &irqfd->irq_entry);
+		ret = kvm_pi_update_irte(irqfd, NULL, &irqfd->irq_entry);
 		if (ret)
 			kvm_arch_end_assignment(irqfd->kvm);
 	}
@@ -13610,8 +13646,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, NULL);
+		ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry, NULL);
 		if (ret)
 			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
 				irqfd->consumer.token, ret);
@@ -13628,8 +13663,7 @@ int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				  struct kvm_kernel_irq_routing_entry *old,
 				  struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
-					    irqfd->gsi, new);
+	return kvm_pi_update_irte(irqfd, old, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 34/67] KVM: x86: Move posted interrupt tracepoint to common code
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (32 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 35/67] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
                   ` (37 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Move the pi_irte_update tracepoint to common x86, and call it whenever the
IRTE is modified.  Tracing only the modifications that result in an IRQ
being posted to a vCPU makes the tracepoint useless for debugging.

Drop the vendor specific address; plumbing that into common code isn't
worth the trouble, as the address is meaningless without a whole pile of
other information that isn't provided in any tracepoint.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c        |  6 ------
 arch/x86/kvm/trace.h           | 19 +++++++------------
 arch/x86/kvm/vmx/posted_intr.c |  3 ---
 arch/x86/kvm/x86.c             | 12 +++++++++---
 4 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 666f518340a7..dcfe908f5b98 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -825,9 +825,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	svm_ir_list_del(irqfd);
 
-	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, !!vcpu);
-
 	/**
 	 * Here, we setup with legacy mode in the following cases:
 	 * 1. When cannot target interrupt to a specific vcpu.
@@ -863,9 +860,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		if (!ret)
 			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-					 vector, vcpu_info.pi_desc_addr, true);
 	} else {
 		ret = irq_set_vcpu_affinity(host_irq, NULL);
 	}
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index ccda95e53f62..be4f55c23ec7 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1089,37 +1089,32 @@ TRACE_EVENT(kvm_smm_transition,
  * Tracepoint for VT-d posted-interrupts and AMD-Vi Guest Virtual APIC.
  */
 TRACE_EVENT(kvm_pi_irte_update,
-	TP_PROTO(unsigned int host_irq, unsigned int vcpu_id,
-		 unsigned int gsi, unsigned int gvec,
-		 u64 pi_desc_addr, bool set),
-	TP_ARGS(host_irq, vcpu_id, gsi, gvec, pi_desc_addr, set),
+	TP_PROTO(unsigned int host_irq, struct kvm_vcpu *vcpu,
+		 unsigned int gsi, unsigned int gvec, bool set),
+	TP_ARGS(host_irq, vcpu, gsi, gvec, set),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	host_irq	)
-		__field(	unsigned int,	vcpu_id		)
+		__field(	int,		vcpu_id		)
 		__field(	unsigned int,	gsi		)
 		__field(	unsigned int,	gvec		)
-		__field(	u64,		pi_desc_addr	)
 		__field(	bool,		set		)
 	),
 
 	TP_fast_assign(
 		__entry->host_irq	= host_irq;
-		__entry->vcpu_id	= vcpu_id;
+		__entry->vcpu_id	= vcpu ? vcpu->vcpu_id : -1;
 		__entry->gsi		= gsi;
 		__entry->gvec		= gvec;
-		__entry->pi_desc_addr	= pi_desc_addr;
 		__entry->set		= set;
 	),
 
-	TP_printk("PI is %s for irq %u, vcpu %u, gsi: 0x%x, "
-		  "gvec: 0x%x, pi_desc_addr: 0x%llx",
+	TP_printk("PI is %s for irq %u, vcpu %d, gsi: 0x%x, gvec: 0x%x",
 		  __entry->set ? "enabled and being updated" : "disabled",
 		  __entry->host_irq,
 		  __entry->vcpu_id,
 		  __entry->gsi,
-		  __entry->gvec,
-		  __entry->pi_desc_addr)
+		  __entry->gvec)
 );
 
 /*
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index fd5f6a125614..baf627839498 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -275,9 +275,6 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.vector = vector,
 		};
 
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
-
 		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
 	} else {
 		return irq_set_vcpu_affinity(host_irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ab818bba743..a20d461718cc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13571,9 +13571,11 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			      struct kvm_kernel_irq_routing_entry *old,
 			      struct kvm_kernel_irq_routing_entry *new)
 {
+	unsigned int host_irq = irqfd->producer->irq;
 	struct kvm *kvm = irqfd->kvm;
 	struct kvm_vcpu *vcpu = NULL;
 	struct kvm_lapic_irq irq;
+	int r;
 
 	if (!irqchip_in_kernel(kvm) ||
 	    !kvm_arch_has_irq_bypass() ||
@@ -13600,8 +13602,13 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			vcpu = NULL;
 	}
 
-	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
-					    irqfd->gsi, new, vcpu, irq.vector);
+	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
+					 new, vcpu, irq.vector);
+	if (r)
+		return r;
+
+	trace_kvm_pi_irte_update(host_irq, vcpu, irqfd->gsi, irq.vector, !!vcpu);
+	return 0;
 }
 
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
@@ -14074,7 +14081,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
-EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 35/67] KVM: SVM: Clean up return handling in avic_pi_update_irte()
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (33 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 34/67] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 36/67] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
                   ` (36 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Clean up the return paths for avic_pi_update_irte() now that the
refactoring dust has settled.

Opportunistically drop the pr_err() on IRTE update failures.  Logging that
a failure occurred without _any_ context is quite useless.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index dcfe908f5b98..4382ab2eaea6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -817,8 +817,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			struct kvm_kernel_irq_routing_entry *new,
 			struct kvm_vcpu *vcpu, u32 vector)
 {
-	int ret = 0;
-
 	/*
 	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
 	 * from the *previous* vCPU's list.
@@ -848,8 +846,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.is_guest_mode = true,
 			.vcpu_data = &vcpu_info,
 		};
+		int ret;
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
+		if (ret)
+			return ret;
 
 		/**
 		 * Here, we successfully setting up vcpu affinity in
@@ -858,20 +859,9 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		if (!ret)
-			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-	} else {
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
+		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
 	}
-
-	if (ret < 0) {
-		pr_err("%s: failed to update PI IRTE\n", __func__);
-		goto out;
-	}
-
-	ret = 0;
-out:
-	return ret;
+	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
 static inline int
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 36/67] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (34 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 35/67] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 37/67] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
                   ` (35 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures.  Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.
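The handoff pattern after the split — each vendor passes its own struct through
the opaque void* of irq_set_vcpu_affinity(), and only that vendor's driver casts
it back — can be sketched as follows (struct layouts mirror the patch;
amd_set_vcpu_affinity()/amd_demo() are simplified stand-ins for the AMD IOMMU
driver, not its real code):

```c
#include <assert.h>
#include <stdint.h>

struct amd_iommu_pi_data {
	uint64_t vapic_addr;	/* Physical address of the vCPU's vAPIC. */
	uint32_t ga_tag;
	uint32_t vector;	/* Guest vector of the interrupt. */
	int is_guest_mode;
	void *ir_data;
};

struct intel_iommu_pi_data {
	uint64_t pi_desc_addr;	/* Physical address of PI Descriptor. */
	uint32_t vector;	/* Guest vector of the interrupt. */
};

static uint64_t amd_ga_root_ptr;

/* AMD driver side: consumes only its own vendor struct. */
static int amd_set_vcpu_affinity(void *info)
{
	struct amd_iommu_pi_data *pi_data = info;

	if (!pi_data)
		return 0;	/* NULL info => back to remapped mode */

	amd_ga_root_ptr = pi_data->vapic_addr >> 12;
	return 0;
}

/* Build the vendor struct KVM-side and hand it through the void* boundary. */
static uint64_t amd_demo(uint64_t vapic_addr)
{
	struct amd_iommu_pi_data pi_data = {
		.vapic_addr = vapic_addr,
		.is_guest_mode = 1,
	};

	amd_set_vcpu_affinity(&pi_data);
	return amd_ga_root_ptr;
}
```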

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h | 15 ++++++++++++++-
 arch/x86/kvm/svm/avic.c              | 21 ++++++++-------------
 arch/x86/kvm/vmx/posted_intr.c       |  4 ++--
 drivers/iommu/amd/iommu.c            | 12 ++++--------
 drivers/iommu/intel/irq_remapping.c  | 10 +++++-----
 include/linux/amd-iommu.h            | 12 ------------
 6 files changed, 33 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 5036f13ab69f..2dbc9cb61c2f 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -26,7 +26,20 @@ enum {
 	IRQ_REMAP_X2APIC_MODE,
 };
 
-struct vcpu_data {
+/*
+ * This is mainly used to communicate information back-and-forth
+ * between SVM and IOMMU for setting up and tearing down posted
+ * interrupt
+ */
+struct amd_iommu_pi_data {
+	u64 vapic_addr;		/* Physical address of the vCPU's vAPIC. */
+	u32 ga_tag;
+	u32 vector;		/* Guest vector of the interrupt */
+	bool is_guest_mode;
+	void *ir_data;
+};
+
+struct intel_iommu_pi_data {
 	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
 	u32 vector;		/* Guest vector of the interrupt */
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4382ab2eaea6..355673f95b70 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -832,23 +832,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
 		/*
-		 * Try to enable guest_mode in IRTE.  Note, the address
-		 * of the vCPU's AVIC backing page is passed to the
-		 * IOMMU via vcpu_info->pi_desc_addr.
+		 * Try to enable guest_mode in IRTE.
 		 */
-		struct vcpu_data vcpu_info = {
-			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
-			.vector = vector,
-		};
-
-		struct amd_iommu_pi_data pi = {
-			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
+		struct amd_iommu_pi_data pi_data = {
+			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+					     vcpu->vcpu_id),
 			.is_guest_mode = true,
-			.vcpu_data = &vcpu_info,
+			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
+			.vector = vector,
 		};
 		int ret;
 
-		ret = irq_set_vcpu_affinity(host_irq, &pi);
+		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
 
@@ -859,7 +854,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
+		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index baf627839498..2958b631fde8 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -270,12 +270,12 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       struct kvm_vcpu *vcpu, u32 vector)
 {
 	if (vcpu) {
-		struct vcpu_data vcpu_info = {
+		struct intel_iommu_pi_data pi_data = {
 			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
 			.vector = vector,
 		};
 
-		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+		return irq_set_vcpu_affinity(host_irq, &pi_data);
 	} else {
 		return irq_set_vcpu_affinity(host_irq, NULL);
 	}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 08c4fa31da5d..bc6f7eb2f04b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3831,10 +3831,10 @@ int amd_iommu_deactivate_guest_mode(void *data)
 }
 EXPORT_SYMBOL(amd_iommu_deactivate_guest_mode);
 
-static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
+static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 {
 	int ret;
-	struct amd_iommu_pi_data *pi_data = vcpu_info;
+	struct amd_iommu_pi_data *pi_data = info;
 	struct amd_ir_data *ir_data = data->chip_data;
 	struct irq_2_irte *irte_info = &ir_data->irq_2_irte;
 	struct iommu_dev_data *dev_data;
@@ -3857,14 +3857,10 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	ir_data->cfg = irqd_cfg(data);
 
 	if (pi_data) {
-		struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
-
 		pi_data->ir_data = ir_data;
 
-		WARN_ON_ONCE(!pi_data->is_guest_mode);
-
-		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
-		ir_data->ga_vector = vcpu_pi_info->vector;
+		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
+		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
 	} else {
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index ad795c772f21..8ccec30e5f45 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1236,10 +1236,10 @@ static void intel_ir_compose_msi_msg(struct irq_data *irq_data,
 static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 {
 	struct intel_ir_data *ir_data = data->chip_data;
-	struct vcpu_data *vcpu_pi_info = info;
+	struct intel_iommu_pi_data *pi_data = info;
 
 	/* stop posting interrupts, back to the default mode */
-	if (!vcpu_pi_info) {
+	if (!pi_data) {
 		modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
 	} else {
 		struct irte irte_pi;
@@ -1257,10 +1257,10 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		/* Update the posted mode fields */
 		irte_pi.p_pst = 1;
 		irte_pi.p_urgent = 0;
-		irte_pi.p_vector = vcpu_pi_info->vector;
-		irte_pi.pda_l = (vcpu_pi_info->pi_desc_addr >>
+		irte_pi.p_vector = pi_data->vector;
+		irte_pi.pda_l = (pi_data->pi_desc_addr >>
 				(32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
-		irte_pi.pda_h = (vcpu_pi_info->pi_desc_addr >> 32) &
+		irte_pi.pda_h = (pi_data->pi_desc_addr >> 32) &
 				~(-1UL << PDA_HIGH_BIT);
 
 		modify_irte(&ir_data->irq_2_iommu, &irte_pi);
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index deeefc92a5cf..99b4fa9a0296 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -12,18 +12,6 @@
 
 struct amd_iommu;
 
-/*
- * This is mainly used to communicate information back-and-forth
- * between SVM and IOMMU for setting up and tearing down posted
- * interrupt
- */
-struct amd_iommu_pi_data {
-	u32 ga_tag;
-	bool is_guest_mode;
-	struct vcpu_data *vcpu_data;
-	void *ir_data;
-};
-
 #ifdef CONFIG_AMD_IOMMU
 
 struct task_struct;
-- 
2.49.0.504.g3bcea36a83-goog




* [PATCH 37/67] KVM: Don't WARN if updating IRQ bypass route fails
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (35 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 36/67] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 38/67] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
                   ` (34 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't bother WARNing if updating an IRTE route fails now that vendor code
provides much more precise WARNs.  The generic WARN doesn't provide enough
information to actually debug the problem, and has obviously done nothing
to surface the myriad bugs in KVM's implementation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c       |  8 ++++----
 include/linux/kvm_host.h |  6 +++---
 virt/kvm/eventfd.c       | 15 ++++++---------
 3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a20d461718cc..c2c102f23fa7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13666,11 +13666,11 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				   struct kvm_kernel_irq_routing_entry *old,
+				   struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_pi_update_irte(irqfd, old, new);
+	kvm_pi_update_irte(irqfd, old, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2d9f3aeb766a..7e8f5cb4fc9a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2392,9 +2392,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
 void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new);
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				   struct kvm_kernel_irq_routing_entry *old,
+				   struct kvm_kernel_irq_routing_entry *new);
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
 				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index ad71e3e4d1c3..7ccdaa4071c8 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -285,11 +285,11 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
 {
 }
 
-int __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-					 struct kvm_kernel_irq_routing_entry *old,
-					 struct kvm_kernel_irq_routing_entry *new)
+void __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+					  struct kvm_kernel_irq_routing_entry *old,
+					  struct kvm_kernel_irq_routing_entry *new)
 {
-	return 0;
+
 }
 
 bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
@@ -618,11 +618,8 @@ void kvm_irq_routing_update(struct kvm *kvm)
 
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
 		if (irqfd->producer &&
-		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
-			int ret = kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
-
-			WARN_ON(ret);
-		}
+		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry))
+			kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
 #endif
 	}
 
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 38/67] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (36 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 37/67] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 39/67] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
                   ` (33 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing().
Calling arch code to know whether or not to call arch code is absurd.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c       | 15 +++++----------
 include/linux/kvm_host.h |  2 --
 virt/kvm/eventfd.c       | 10 +---------
 3 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c2c102f23fa7..36d4a9ed144d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13670,19 +13670,14 @@ void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new)
 {
+	if (old->type == KVM_IRQ_ROUTING_MSI &&
+	    new->type == KVM_IRQ_ROUTING_MSI &&
+	    !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
+		return;
+
 	kvm_pi_update_irte(irqfd, old, new);
 }
 
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
-{
-	if (old->type != KVM_IRQ_ROUTING_MSI ||
-	    new->type != KVM_IRQ_ROUTING_MSI)
-		return true;
-
-	return !!memcmp(&old->msi, &new->msi, sizeof(new->msi));
-}
-
 bool kvm_vector_hashing_enabled(void)
 {
 	return vector_hashing;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e8f5cb4fc9a..d1a41c40ae79 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2395,8 +2395,6 @@ void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
 void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new);
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
-				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
 
 #ifdef CONFIG_HAVE_KVM_INVALID_WAKEUPS
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7ccdaa4071c8..b9810c3654f5 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -291,13 +291,6 @@ void __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 {
 
 }
-
-bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
-				struct kvm_kernel_irq_routing_entry *old,
-				struct kvm_kernel_irq_routing_entry *new)
-{
-	return true;
-}
 #endif
 
 static int
@@ -617,8 +610,7 @@ void kvm_irq_routing_update(struct kvm *kvm)
 		irqfd_update(kvm, irqfd);
 
 #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
-		if (irqfd->producer &&
-		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry))
+		if (irqfd->producer)
 			kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
 #endif
 	}
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 39/67] KVM: x86: Track irq_bypass_vcpu in common x86 code
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (37 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 38/67] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 40/67] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
                   ` (32 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted
IRQ, in common x86 code.  This will allow for additional consolidation of
the SVM and VMX code.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 4 ----
 arch/x86/kvm/x86.c      | 7 ++++++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 355673f95b70..bd1fcf2ea1e5 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -776,22 +776,18 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_lock_irqsave(&to_svm(vcpu)->ir_list_lock, flags);
 	list_del(&irqfd->vcpu_list);
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
-
-	irqfd->irq_bypass_vcpu = NULL;
 }
 
 static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct kvm_kernel_irqfd *irqfd,
 			   struct amd_iommu_pi_data *pi)
 {
-	struct kvm_vcpu *vcpu = &svm->vcpu;
 	unsigned long flags;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	irqfd->irq_bypass_vcpu = vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36d4a9ed144d..0d9bd8535f61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13604,8 +13604,13 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 
 	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
 					 new, vcpu, irq.vector);
-	if (r)
+	if (r) {
+		WARN_ON_ONCE(irqfd->irq_bypass_vcpu && !vcpu);
+		irqfd->irq_bypass_vcpu = NULL;
 		return r;
+	}
+
+	irqfd->irq_bypass_vcpu = vcpu;
 
 	trace_kvm_pi_irte_update(host_irq, vcpu, irqfd->gsi, irq.vector, !!vcpu);
 	return 0;
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 40/67] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (38 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 39/67] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 41/67] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
                   ` (31 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't "reconfigure" an IRTE into host controlled mode when it's already in
that state, i.e. if KVM's GSI routing changes but the IRQ wasn't and still
isn't being posted to a vCPU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d9bd8535f61..8325a908fa25 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13602,6 +13602,9 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			vcpu = NULL;
 	}
 
+	if (!irqfd->irq_bypass_vcpu && !vcpu)
+		return 0;
+
 	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
 					 new, vcpu, irq.vector);
 	if (r) {
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 41/67] KVM: x86: Don't update IRTE entries when old and new routes were !MSI
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (39 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 40/67] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 42/67] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
                   ` (30 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Skip the entirety of IRTE updates on a GSI routing change if neither the
old nor the new routing is for an MSI, i.e. if neither routing setup
allows for posting to a vCPU.  If the IRTE isn't already host controlled,
KVM has bigger problems.
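Combined with the existing identical-MSI check, the resulting filter in
kvm_arch_update_irqfd_routing() reduces to a small predicate, sketched here
with hypothetical types (route/needs_irte_update() are stand-ins, not the
kernel's kvm_kernel_irq_routing_entry handling):

```c
#include <assert.h>

enum route_type { ROUTE_NONE, ROUTE_MSI, ROUTE_OTHER };

struct route {
	enum route_type type;
	unsigned int msi_data;	/* stand-in for the full MSI config */
};

/* Returns 1 if the IRTE must be updated for the old -> new transition. */
static int needs_irte_update(const struct route *old, const struct route *new)
{
	/* Neither routing can post to a vCPU: IRTE stays host controlled. */
	if (old->type != ROUTE_MSI && new->type != ROUTE_MSI)
		return 0;

	/* Identical MSI configuration: nothing to reprogram. */
	if (old->type == ROUTE_MSI && new->type == ROUTE_MSI &&
	    old->msi_data == new->msi_data)
		return 0;

	return 1;
}

/* Convenience wrapper for exercising the predicate. */
static int demo(enum route_type ot, unsigned int od,
		enum route_type nt, unsigned int nd)
{
	struct route o = { ot, od };
	struct route n = { nt, nd };

	return needs_irte_update(&o, &n);
}
```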

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8325a908fa25..0dc3b45cb664 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13678,6 +13678,10 @@ void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new)
 {
+	if (new->type != KVM_IRQ_ROUTING_MSI &&
+	    old->type != KVM_IRQ_ROUTING_MSI)
+		return;
+
 	if (old->type == KVM_IRQ_ROUTING_MSI &&
 	    new->type == KVM_IRQ_ROUTING_MSI &&
 	    !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 42/67] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (40 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 41/67] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 43/67] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
                   ` (29 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up
and doesn't provide the necessary metadata.  Returning an error up the
stack without actually handling the error is useless and confusing.
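The error-handling shape this adopts — if the IOMMU reports success but hands
back no IR metadata, undo the change rather than return an unhandled error —
can be sketched as a toy model (set_vcpu_affinity()/update_irte() and the -5
return are simplified stand-ins for irq_set_vcpu_affinity() and -EIO):

```c
#include <assert.h>
#include <stddef.h>

struct pi_data {
	void *ir_data;	/* metadata the IOMMU is expected to fill in */
};

static int guest_mode;

/* Stand-in for irq_set_vcpu_affinity(); "fill" models a well-behaved IOMMU. */
static int set_vcpu_affinity(struct pi_data *pi, int fill)
{
	guest_mode = pi != NULL;
	if (pi && fill)
		pi->ir_data = &guest_mode;	/* arbitrary non-NULL metadata */
	return 0;
}

static int update_irte(int iommu_ok)
{
	struct pi_data pi_data = { NULL };
	int ret = set_vcpu_affinity(&pi_data, iommu_ok);

	if (ret)
		return ret;

	/* No IR metadata: revert to remapped mode instead of limping along. */
	if (!pi_data.ir_data) {
		set_vcpu_affinity(NULL, 0);
		return -5;	/* stand-in for -EIO */
	}
	return 0;
}
```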

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index bd1fcf2ea1e5..22fa49fc9717 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -778,16 +778,13 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
 }
 
-static int svm_ir_list_add(struct vcpu_svm *svm,
-			   struct kvm_kernel_irqfd *irqfd,
-			   struct amd_iommu_pi_data *pi)
+static void svm_ir_list_add(struct vcpu_svm *svm,
+			    struct kvm_kernel_irqfd *irqfd,
+			    struct amd_iommu_pi_data *pi)
 {
 	unsigned long flags;
 	u64 entry;
 
-	if (WARN_ON_ONCE(!pi->ir_data))
-		return -EINVAL;
-
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -805,7 +802,6 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 
 	list_add(&irqfd->vcpu_list, &svm->ir_list);
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-	return 0;
 }
 
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
@@ -843,6 +839,16 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		if (ret)
 			return ret;
 
+		/*
+		 * Revert to legacy mode if the IOMMU didn't provide metadata
+		 * for the IRTE, which KVM needs to keep the IRTE up-to-date,
+		 * e.g. if the vCPU is migrated or AVIC is disabled.
+		 */
+		if (WARN_ON_ONCE(!pi_data.ir_data)) {
+			irq_set_vcpu_affinity(host_irq, NULL);
+			return -EIO;
+		}
+
 		/**
 		 * Here, we successfully setting up vcpu affinity in
 		 * IOMMU guest mode. Now, we need to store the posted
@@ -850,7 +856,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		return 0;
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
-- 
2.49.0.504.g3bcea36a83-goog
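[Editor's note]  The revert-on-failure flow above can be sketched in plain C.
Everything below (irte_set_posted(), irte_set_remapped(), the struct) is a toy
stand-in for illustration only, not KVM's or the IOMMU driver's actual API;
the point is the shape of the fix: when the lower layer "succeeds" but fails
to hand back the metadata the caller needs, actively restore a known-good
state instead of returning an error nobody up the stack can act on.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two IRTE modes discussed in the patch. */
enum irte_mode { IRTE_REMAPPED, IRTE_POSTED };

struct irte {
	enum irte_mode mode;
	void *ir_data;	/* metadata the IOMMU driver is expected to fill in */
};

/* Mimics irq_set_vcpu_affinity(host_irq, NULL): fall back to remapped mode. */
static void irte_set_remapped(struct irte *irte)
{
	irte->mode = IRTE_REMAPPED;
	irte->ir_data = NULL;
}

/*
 * If the "driver" didn't provide ir_data, revert to remapped mode and only
 * then report the error; the IRTE is never left half-configured.
 */
static int irte_set_posted(struct irte *irte, void *ir_data)
{
	irte->mode = IRTE_POSTED;
	irte->ir_data = ir_data;

	if (!irte->ir_data) {
		irte_set_remapped(irte);
		return -EIO;
	}
	return 0;
}
```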


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 43/67] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (41 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 42/67] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-04 19:38 ` [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
                   ` (28 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that svm_ir_list_add() isn't overloaded with all manner of weird
things, fold it into avic_pi_update_irte(), and more importantly take
ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
that's shoved into the IRTE is fresh.  While preemption (and IRQs) is
disabled on the task performing the IRTE update, courtesy of irqfds.lock,
that task doesn't hold the vCPU's mutex, i.e. the vCPU can be scheduled
in/out on a different pCPU at any time, and so preemption being disabled
on the updating task is irrelevant.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 22fa49fc9717..4dbbb5a6cacc 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -778,32 +778,6 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
 }
 
-static void svm_ir_list_add(struct vcpu_svm *svm,
-			    struct kvm_kernel_irqfd *irqfd,
-			    struct amd_iommu_pi_data *pi)
-{
-	unsigned long flags;
-	u64 entry;
-
-	irqfd->irq_bypass_data = pi->ir_data;
-
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	/*
-	 * Update the target pCPU for IOMMU doorbells if the vCPU is running.
-	 * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
-	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
-	 * See also avic_vcpu_load().
-	 */
-	entry = svm->avic_physical_id_entry;
-	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-				    true, pi->ir_data);
-
-	list_add(&irqfd->vcpu_list, &svm->ir_list);
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-}
-
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_kernel_irq_routing_entry *new,
@@ -833,8 +807,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
 		};
+		struct vcpu_svm *svm = to_svm(vcpu);
+		u64 entry;
 		int ret;
 
+		/*
+		 * Prevent the vCPU from being scheduled out or migrated until
+		 * the IRTE is updated and its metadata has been added to the
+		 * list of IRQs being posted to the vCPU, to ensure the IRTE
+		 * isn't programmed with stale pCPU/IsRunning information.
+		 */
+		guard(spinlock_irqsave)(&svm->ir_list_lock);
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
@@ -849,14 +833,19 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			return -EIO;
 		}
 
-		/**
-		 * Here, we successfully setting up vcpu affinity in
-		 * IOMMU guest mode. Now, we need to store the posted
-		 * interrupt information in a per-vcpu ir_list so that
-		 * we can reference to them directly when we update vcpu
-		 * scheduling information in IOMMU irte.
+		/*
+		 * Update the target pCPU for IOMMU doorbells if the vCPU is
+		 * running.  If the vCPU is NOT running, i.e. is blocking or
+		 * scheduled out, KVM will update the pCPU info when the vCPU
+		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
 		 */
-		svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		entry = svm->avic_physical_id_entry;
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
+					    true, pi_data.ir_data);
+
+		irqfd->irq_bypass_data = pi_data.ir_data;
+		list_add(&irqfd->vcpu_list, &svm->ir_list);
 		return 0;
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
-- 
2.49.0.504.g3bcea36a83-goog
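[Editor's note]  The guard(spinlock_irqsave)(...) used in this patch comes
from the kernel's linux/cleanup.h and is built on the GCC/Clang "cleanup"
variable attribute: the unlock runs automatically when the guard variable
goes out of scope, on every return path.  Below is a minimal userspace
sketch of that mechanism using a fake int "lock" (fake_lock, GUARD() and
critical_section() are all invented for illustration):

```c
static int fake_lock;			/* 1 = "held", 0 = released */

/* Cleanup callback: receives a pointer to the guard variable. */
static void fake_unlock(int **l)
{
	**l = 0;			/* runs when the guard leaves scope */
}

/* Poor man's guard(): acquire now, release automatically at end of scope. */
#define GUARD(l) \
	int *g __attribute__((cleanup(fake_unlock), unused)) = (*(l) = 1, (l))

static int critical_section(int fail_early)
{
	GUARD(&fake_lock);

	if (fail_early)
		return -1;	/* early return: cleanup still drops the lock */

	return 0;		/* normal return: cleanup drops it here too */
}
```

This is exactly why the patch can simply `return ret;` after
irq_set_vcpu_affinity() fails without an explicit unlock.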



* [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (42 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 43/67] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
@ 2025-04-04 19:38 ` Sean Christopherson
  2025-04-08 12:26   ` Joerg Roedel
  2025-04-04 19:39 ` [PATCH 45/67] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
                   ` (27 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running.  amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregious KVM bug.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 11 +++++------
 drivers/iommu/amd/iommu.c |  6 ++++--
 include/linux/amd-iommu.h |  6 ++----
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4dbbb5a6cacc..3fcec297e3e3 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -842,7 +842,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		entry = svm->avic_physical_id_entry;
 		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-					    true, pi_data.ir_data);
+					    pi_data.ir_data);
 
 		irqfd->irq_bypass_data = pi_data.ir_data;
 		list_add(&irqfd->vcpu_list, &svm->ir_list);
@@ -851,8 +851,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static inline int
-avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
+static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 {
 	int ret = 0;
 	struct amd_svm_iommu_ir *ir;
@@ -871,7 +870,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 		return 0;
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
-		ret = amd_iommu_update_ga(cpu, r, ir->data);
+		ret = amd_iommu_update_ga(cpu, ir->data);
 		if (ret)
 			return ret;
 	}
@@ -933,7 +932,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -973,7 +972,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+	avic_update_iommu_vcpu_affinity(vcpu, -1);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	svm->avic_physical_id_entry = entry;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index bc6f7eb2f04b..ba3a1a403cb2 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3957,7 +3957,7 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 	return 0;
 }
 
-int amd_iommu_update_ga(int cpu, bool is_run, void *data)
+int amd_iommu_update_ga(int cpu, void *data)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3974,8 +3974,10 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
 					APICID_TO_IRTE_DEST_LO(cpu);
 		entry->hi.fields.destination =
 					APICID_TO_IRTE_DEST_HI(cpu);
+		entry->lo.fields_vapic.is_run = true;
+	} else {
+		entry->lo.fields_vapic.is_run = false;
 	}
-	entry->lo.fields_vapic.is_run = is_run;
 
 	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 				ir_data->irq_2_irte.index, entry);
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 99b4fa9a0296..fe0e16ffe0e5 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -30,8 +30,7 @@ static inline void amd_iommu_detect(void) { }
 /* IOMMU AVIC Function */
 extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
-extern int
-amd_iommu_update_ga(int cpu, bool is_run, void *data);
+extern int amd_iommu_update_ga(int cpu, void *data);
 
 extern int amd_iommu_activate_guest_mode(void *data);
 extern int amd_iommu_deactivate_guest_mode(void *data);
@@ -44,8 +43,7 @@ amd_iommu_register_ga_log_notifier(int (*notifier)(u32))
 	return 0;
 }
 
-static inline int
-amd_iommu_update_ga(int cpu, bool is_run, void *data)
+static inline int amd_iommu_update_ga(int cpu, void *data)
 {
 	return 0;
 }
-- 
2.49.0.504.g3bcea36a83-goog
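[Editor's note]  The refactor above collapses the (cpu, is_run) pair into a
single parameter with a sentinel: a valid pCPU (cpu >= 0) implies IsRun=1,
and cpu < 0 means "not running", in which case the stale destination is
deliberately left untouched.  A minimal sketch of that encoding (the struct
and field names are illustrative stand-ins for the real irte_ga fields):

```c
#include <stdbool.h>

/* Toy stand-in for the destination/IsRun fields of an IRTE. */
struct irte_fields {
	int dest;
	bool is_run;
};

static void update_ga(struct irte_fields *e, int cpu)
{
	if (cpu >= 0) {
		e->dest = cpu;		/* valid pCPU => running */
		e->is_run = true;
	} else {
		e->is_run = false;	/* destination intentionally stale */
	}
}
```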



* [PATCH 45/67] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (43 preceding siblings ...)
  2025-04-04 19:38 ` [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 46/67] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
                   ` (26 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.

Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 59 +++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index ba3a1a403cb2..4fdf1502be69 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3775,6 +3775,38 @@ static const struct irq_domain_ops amd_ir_domain_ops = {
 	.deactivate = irq_remapping_deactivate,
 };
 
+static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
+{
+	if (cpu >= 0) {
+		entry->lo.fields_vapic.destination =
+					APICID_TO_IRTE_DEST_LO(cpu);
+		entry->hi.fields.destination =
+					APICID_TO_IRTE_DEST_HI(cpu);
+		entry->lo.fields_vapic.is_run = true;
+	} else {
+		entry->lo.fields_vapic.is_run = false;
+	}
+}
+
+int amd_iommu_update_ga(int cpu, void *data)
+{
+	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
+	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
+
+	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
+	    !entry || !entry->lo.fields_vapic.guest_mode)
+		return 0;
+
+	if (!ir_data->iommu)
+		return -ENODEV;
+
+	__amd_iommu_update_ga(entry, cpu);
+
+	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
+				ir_data->irq_2_irte.index, entry);
+}
+EXPORT_SYMBOL(amd_iommu_update_ga);
+
 int amd_iommu_activate_guest_mode(void *data)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
@@ -3956,31 +3988,4 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 
 	return 0;
 }
-
-int amd_iommu_update_ga(int cpu, void *data)
-{
-	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
-	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
-
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
-		return 0;
-
-	if (!ir_data->iommu)
-		return -ENODEV;
-
-	if (cpu >= 0) {
-		entry->lo.fields_vapic.destination =
-					APICID_TO_IRTE_DEST_LO(cpu);
-		entry->hi.fields.destination =
-					APICID_TO_IRTE_DEST_HI(cpu);
-		entry->lo.fields_vapic.is_run = true;
-	} else {
-		entry->lo.fields_vapic.is_run = false;
-	}
-
-	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
-				ir_data->irq_2_irte.index, entry);
-}
-EXPORT_SYMBOL(amd_iommu_update_ga);
 #endif
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 46/67] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (44 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 45/67] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 47/67] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
                   ` (25 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity.  Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.

This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h |  1 +
 arch/x86/kvm/svm/avic.c              | 26 ++++++++++++++------------
 drivers/iommu/amd/iommu.c            |  6 ++++--
 include/linux/amd-iommu.h            |  4 ++--
 4 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 2dbc9cb61c2f..4c75a17632f6 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -35,6 +35,7 @@ struct amd_iommu_pi_data {
 	u64 vapic_addr;		/* Physical address of the vCPU's vAPIC. */
 	u32 ga_tag;
 	u32 vector;		/* Guest vector of the interrupt */
+	int cpu;
 	bool is_guest_mode;
 	void *ir_data;
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 3fcec297e3e3..086139e85242 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -735,6 +735,7 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 
 static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 {
+	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
 	int ret = 0;
 	unsigned long flags;
 	struct amd_svm_iommu_ir *ir;
@@ -754,7 +755,7 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
 		if (activate)
-			ret = amd_iommu_activate_guest_mode(ir->data);
+			ret = amd_iommu_activate_guest_mode(ir->data, apic_id);
 		else
 			ret = amd_iommu_deactivate_guest_mode(ir->data);
 		if (ret)
@@ -819,6 +820,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		guard(spinlock_irqsave)(&svm->ir_list_lock);
 
+		/*
+		 * Update the target pCPU for IOMMU doorbells if the vCPU is
+		 * running.  If the vCPU is NOT running, i.e. is blocking or
+		 * scheduled out, KVM will update the pCPU info when the vCPU
+		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
+		 */
+		entry = svm->avic_physical_id_entry;
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+		else
+			pi_data.cpu = -1;
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
@@ -833,17 +846,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			return -EIO;
 		}
 
-		/*
-		 * Update the target pCPU for IOMMU doorbells if the vCPU is
-		 * running.  If the vCPU is NOT running, i.e. is blocking or
-		 * scheduled out, KVM will update the pCPU info when the vCPU
-		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
-		 */
-		entry = svm->avic_physical_id_entry;
-		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-					    pi_data.ir_data);
-
 		irqfd->irq_bypass_data = pi_data.ir_data;
 		list_add(&irqfd->vcpu_list, &svm->ir_list);
 		return 0;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4fdf1502be69..b0b4c5ca16a8 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3807,7 +3807,7 @@ int amd_iommu_update_ga(int cpu, void *data)
 }
 EXPORT_SYMBOL(amd_iommu_update_ga);
 
-int amd_iommu_activate_guest_mode(void *data)
+int amd_iommu_activate_guest_mode(void *data, int cpu)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3828,6 +3828,8 @@ int amd_iommu_activate_guest_mode(void *data)
 	entry->hi.fields.vector            = ir_data->ga_vector;
 	entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;
 
+	__amd_iommu_update_ga(entry, cpu);
+
 	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 			      ir_data->irq_2_irte.index, entry);
 }
@@ -3894,7 +3896,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-		ret = amd_iommu_activate_guest_mode(ir_data);
+		ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
 	}
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index fe0e16ffe0e5..c9f2df0c4596 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -32,7 +32,7 @@ extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
 extern int amd_iommu_update_ga(int cpu, void *data);
 
-extern int amd_iommu_activate_guest_mode(void *data);
+extern int amd_iommu_activate_guest_mode(void *data, int cpu);
 extern int amd_iommu_deactivate_guest_mode(void *data);
 
 #else /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
@@ -48,7 +48,7 @@ static inline int amd_iommu_update_ga(int cpu, void *data)
 	return 0;
 }
 
-static inline int amd_iommu_activate_guest_mode(void *data)
+static inline int amd_iommu_activate_guest_mode(void *data, int cpu)
 {
 	return 0;
 }
-- 
2.49.0.504.g3bcea36a83-goog
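[Editor's note]  The pi_data.cpu computation added by this patch derives the
target pCPU (or the "not running" sentinel, -1) from the vCPU's physical ID
table entry in a single step, so the IRTE can be written with destination and
IsRun together rather than in two passes.  A sketch of that extraction; the
mask values here are illustrative only (the real AVIC_PHYSICAL_ID_ENTRY_*
masks live in arch/x86/include/asm/svm.h):

```c
#include <stdint.h>

/* Illustrative stand-ins for the AVIC physical ID entry masks. */
#define IS_RUNNING_MASK		(1ull << 62)
#define HOST_PHYSICAL_ID_MASK	0xffull

/* Extract the target pCPU, or -1 if the vCPU isn't running. */
static int entry_to_cpu(uint64_t entry)
{
	if (entry & IS_RUNNING_MASK)
		return (int)(entry & HOST_PHYSICAL_ID_MASK);
	return -1;
}
```

The result feeds straight into the cpu-sentinel convention established by the
previous patch: a single IRTE write gets both the destination and IsRun right.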



* [PATCH 47/67] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (45 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 46/67] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 48/67] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
                   ` (24 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode.  KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 16 ++++++----------
 drivers/iommu/amd/iommu.c |  5 ++++-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 086139e85242..04bc1aa88dcc 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -790,21 +790,17 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	svm_ir_list_del(irqfd);
 
-	/**
-	 * Here, we setup with legacy mode in the following cases:
-	 * 1. When cannot target interrupt to a specific vcpu.
-	 * 2. Unsetting posted interrupt.
-	 * 3. APIC virtualization is disabled for the vcpu.
-	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
-	 */
-	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
+	if (vcpu) {
 		/*
-		 * Try to enable guest_mode in IRTE.
+		 * Try to enable guest_mode in IRTE, unless AVIC is inhibited,
+		 * in which case configure the IRTE for legacy mode, but track
+		 * the IRTE metadata so that it can be converted to guest mode
+		 * if AVIC is enabled/uninhibited in the future.
 		 */
 		struct amd_iommu_pi_data pi_data = {
 			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 					     vcpu->vcpu_id),
-			.is_guest_mode = true,
+			.is_guest_mode = kvm_vcpu_apicv_active(vcpu),
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
 		};
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b0b4c5ca16a8..a881fad027fd 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3896,7 +3896,10 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-		ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+		if (pi_data->is_guest_mode)
+			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+		else
+			ret = amd_iommu_deactivate_guest_mode(ir_data);
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
 	}
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 48/67] KVM: SVM: Don't check for assigned device(s) when updating affinity
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (46 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 47/67] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 49/67] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
                   ` (23 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't bother checking if a VM has an assigned device when updating AVIC
vCPU affinity; querying ir_list is just as cheap, and nothing prevents the
update from racing with changes in device assignment.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 04bc1aa88dcc..fc06bb9cad88 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -857,9 +857,6 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 
 	lockdep_assert_held(&svm->ir_list_lock);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm))
-		return 0;
-
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
 	 * interrupt remapping table entry targeting this vcpu.
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 49/67] KVM: SVM: Don't check for assigned device(s) when activating AVIC
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (47 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 48/67] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 50/67] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
                   ` (22 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't short-circuit IRTE updating when (de)activating AVIC based on the
VM having assigned devices, as nothing prevents AVIC (de)activation from
racing with device (de)assignment.  And from a performance perspective,
bailing early when there is no assigned device doesn't add much, as
ir_list_lock will never be contended if there's no assigned device.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index fc06bb9cad88..620772e07993 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -741,9 +741,6 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm))
-		return 0;
-
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
 	 * interrupt remapping table entry targeting this vcpu.
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 50/67] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (48 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 49/67] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 51/67] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
                   ` (21 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if (de)activating "guest mode" for an IRTE entry fails, as modifying
an IRTE should only fail if KVM is buggy, e.g. has stale metadata.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 620772e07993..5f5022d12b1b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -733,10 +733,9 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	avic_handle_ldr_update(vcpu);
 }
 
-static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
+static void avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 {
 	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	int ret = 0;
 	unsigned long flags;
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -752,15 +751,12 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
 		if (activate)
-			ret = amd_iommu_activate_guest_mode(ir->data, apic_id);
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, apic_id));
 		else
-			ret = amd_iommu_deactivate_guest_mode(ir->data);
-		if (ret)
-			break;
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
 	}
 out:
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-	return ret;
 }
 
 static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 51/67] KVM: SVM: Process all IRTEs on affinity change even if one update fails
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (49 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 50/67] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 52/67] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
                   ` (20 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

When updating IRTE GA fields, keep processing all other IRTEs if an update
fails, as not updating later entries risks making a bad situation worse.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 5f5022d12b1b..5544e8e88926 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -857,12 +857,10 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 	if (list_empty(&svm->ir_list))
 		return 0;
 
-	list_for_each_entry(ir, &svm->ir_list, node) {
+	list_for_each_entry(ir, &svm->ir_list, node)
 		ret = amd_iommu_update_ga(cpu, ir->data);
-		if (ret)
-			return ret;
-	}
-	return 0;
+
+	return ret;
 }
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
-- 
2.49.0.504.g3bcea36a83-goog
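[Editor's note]  The loop change above can be reduced to a toy example: the
early `break` on failure is gone, so every entry is visited, and (matching
the patch literally) `ret` ends up holding the result of the *final* update.
All names below (fake_update(), update_all(), visited) are invented for
illustration:

```c
static int visited;	/* instrumented so the behavior is observable */

static int fake_update(int id)
{
	visited++;
	return id < 0 ? -1 : 0;		/* negative ids "fail" */
}

static int update_all(const int *ids, int n)
{
	int ret = 0;

	/*
	 * Mirrors the patched list_for_each_entry(): no early break, so a
	 * failing entry no longer prevents later entries from being updated.
	 */
	for (int i = 0; i < n; i++)
		ret = fake_update(ids[i]);

	return ret;
}
```

Note that an intermediate failure is overwritten by a later success; that's
acceptable here because the very next patch in the series stops returning the
value altogether in favor of WARNing at the call site.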



* [PATCH 52/67] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (50 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 51/67] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 53/67] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
                   ` (19 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if updating GA information for an IRTE entry fails, as modifying an
IRTE should only fail if KVM is buggy, e.g. has stale metadata, and
because returning an error that is always ignored is pointless.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 5544e8e88926..a932eba1f42c 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -842,9 +842,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
+static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 {
-	int ret = 0;
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -855,12 +854,10 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 	 * interrupt remapping table entry targeting this vcpu.
 	 */
 	if (list_empty(&svm->ir_list))
-		return 0;
+		return;
 
 	list_for_each_entry(ir, &svm->ir_list, node)
-		ret = amd_iommu_update_ga(cpu, ir->data);
-
-	return ret;
+		WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
 }
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 53/67] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (51 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 52/67] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 54/67] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
                   ` (18 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't bother checking if the VM has an assigned device when updating
IRTE entries.  kvm_arch_irq_bypass_add_producer() explicitly increments
the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly
decrements the count before invoking kvm_pi_update_irte(), and
kvm_irq_routing_update() only updates IRTE entries if there's an active
IRQ bypass producer.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0dc3b45cb664..513307952089 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13577,9 +13577,7 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) ||
-	    !kvm_arch_has_irq_bypass() ||
-	    !kvm_arch_has_assigned_device(kvm))
+	if (!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
 
 	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 54/67] KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (52 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 53/67] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 55/67] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
                   ` (17 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the
code is only reachable if the VM already has an IRQ bypass producer (see
kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(),
which, stating the obvious, are called if and only if KVM enables its IRQ
bypass hooks.

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 513307952089..d05bffef88b7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13577,7 +13577,7 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass())
+	if (!irqchip_in_kernel(kvm) || WARN_ON_ONCE(!kvm_arch_has_irq_bypass()))
 		return 0;
 
 	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 55/67] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (53 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 54/67] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 56/67] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
                   ` (16 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC,
as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any
and all postable IRQs to an APIC-less VM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d05bffef88b7..49c3360eb4e8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13577,8 +13577,8 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) || WARN_ON_ONCE(!kvm_arch_has_irq_bypass()))
-		return 0;
+	if (WARN_ON_ONCE(!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass()))
+		return -EINVAL;
 
 	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
 		kvm_set_msi_irq(kvm, new, &irq);
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 56/67] KVM: SVM: WARN if ir_list is non-empty at vCPU free
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (54 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 55/67] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 57/67] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
                   ` (15 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is
freed with ir_list entries, i.e. if KVM leaves a dangling IRTE.

Initialize the per-vCPU interrupt remapping list and its lock even if AVIC
is disabled so that the WARN doesn't hit false positives (and so that KVM
doesn't need to call into AVIC code for a simple sanity check).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 5 +++--
 arch/x86/kvm/svm/svm.c  | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a932eba1f42c..d2cbb7ac91f4 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -713,6 +713,9 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 	int ret;
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 
+	INIT_LIST_HEAD(&svm->ir_list);
+	spin_lock_init(&svm->ir_list_lock);
+
 	if (!enable_apicv || !irqchip_in_kernel(vcpu->kvm))
 		return 0;
 
@@ -720,8 +723,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 	if (ret)
 		return ret;
 
-	INIT_LIST_HEAD(&svm->ir_list);
-	spin_lock_init(&svm->ir_list_lock);
 	svm->dfr_reg = APIC_DFR_FLAT;
 
 	return ret;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 43c4933d7da6..71b52ad13577 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1499,6 +1499,8 @@ static void svm_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	WARN_ON_ONCE(!list_empty(&svm->ir_list));
+
 	/*
 	 * The vmcb page can be recycled, causing a false negative in
 	 * svm_vcpu_load(). So, ensure that no logical CPU has this
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 57/67] KVM: x86: Decouple device assignment from IRQ bypass
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (55 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 56/67] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 58/67] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
                   ` (14 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Use a dedicated counter to track the number of IRQs that can utilize IRQ
bypass instead of piggybacking on the assigned device count.  As evidenced by
commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"),
it's possible for a device to be able to post IRQs to a vCPU without said
device being assigned to a VM.

Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment
to avoid regressing the MMIO stale data mitigation.  KVM is abusing the
assigned device count when applying mmio_stale_data_clear, and it's not at
all clear if vDPA devices rely on this behavior.  This will hopefully be
cleaned up in the future, as the number of assigned devices is a terrible
heuristic for detecting if a VM has access to host MMIO.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +-
 arch/x86/include/asm/kvm_host.h    |  3 ++-
 arch/x86/kvm/vmx/main.c            |  2 +-
 arch/x86/kvm/vmx/posted_intr.c     | 16 ++++++++++------
 arch/x86/kvm/vmx/posted_intr.h     |  2 +-
 arch/x86/kvm/x86.c                 | 12 +++++++++---
 6 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 823c0434bbad..435b9b76e464 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -111,7 +111,7 @@ KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
 KVM_X86_OP_OPTIONAL(vcpu_blocking)
 KVM_X86_OP_OPTIONAL(vcpu_unblocking)
 KVM_X86_OP_OPTIONAL(pi_update_irte)
-KVM_X86_OP_OPTIONAL(pi_start_assignment)
+KVM_X86_OP_OPTIONAL(pi_start_bypass)
 KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
 KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
 KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cb98d8d3c6c2..88b842cd8959 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1372,6 +1372,7 @@ struct kvm_arch {
 	atomic_t noncoherent_dma_count;
 #define __KVM_HAVE_ARCH_ASSIGNED_DEVICE
 	atomic_t assigned_device_count;
+	unsigned long nr_possible_bypass_irqs;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
@@ -1840,7 +1841,7 @@ struct kvm_x86_ops {
 			      unsigned int host_irq, uint32_t guest_irq,
 			      struct kvm_kernel_irq_routing_entry *new,
 			      struct kvm_vcpu *vcpu, u32 vector);
-	void (*pi_start_assignment)(struct kvm *kvm);
+	void (*pi_start_bypass)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 43ee9ed11291..95371f26ce20 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -133,7 +133,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.nested_ops = &vmx_nested_ops,
 
 	.pi_update_irte = vmx_pi_update_irte,
-	.pi_start_assignment = vmx_pi_start_assignment,
+	.pi_start_bypass = vmx_pi_start_bypass,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 2958b631fde8..457a5b21c9d3 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -132,8 +132,13 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 
 static bool vmx_can_use_vtd_pi(struct kvm *kvm)
 {
+	/*
+	 * Note, reading the number of possible bypass IRQs can race with a
+	 * bypass IRQ being attached to the VM.  vmx_pi_start_bypass() ensures
+	 * blocking vCPUs will see an elevated count or get KVM_REQ_UNBLOCK.
+	 */
 	return irqchip_in_kernel(kvm) && kvm_arch_has_irq_bypass() &&
-	       kvm_arch_has_assigned_device(kvm);
+	       READ_ONCE(kvm->arch.nr_possible_bypass_irqs);
 }
 
 /*
@@ -251,12 +256,11 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
 
 
 /*
- * Bail out of the block loop if the VM has an assigned
- * device, but the blocking vCPU didn't reconfigure the
- * PI.NV to the wakeup vector, i.e. the assigned device
- * came along after the initial check in vmx_vcpu_pi_put().
+ * Kick all vCPUs when the first possible bypass IRQ is attached to a VM, as
+ * blocking vCPUs may be scheduled out without reconfiguring PID.NV to the wakeup
+ * vector, i.e. if the bypass IRQ came along after vmx_vcpu_pi_put().
  */
-void vmx_pi_start_assignment(struct kvm *kvm)
+void vmx_pi_start_bypass(struct kvm *kvm)
 {
 	if (!kvm_arch_has_irq_bypass())
 		return;
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index ee3e19e976ac..c3f12a35a7c1 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -17,7 +17,7 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
 		       struct kvm_kernel_irq_routing_entry *new,
 		       struct kvm_vcpu *vcpu, u32 vector);
-void vmx_pi_start_assignment(struct kvm *kvm);
+void vmx_pi_start_bypass(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 49c3360eb4e8..fec43d6a2b63 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13511,8 +13511,7 @@ bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
 
 void kvm_arch_start_assignment(struct kvm *kvm)
 {
-	if (atomic_inc_return(&kvm->arch.assigned_device_count) == 1)
-		kvm_x86_call(pi_start_assignment)(kvm);
+	atomic_inc(&kvm->arch.assigned_device_count);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);
 
@@ -13630,10 +13629,15 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = prod;
 
+	if (!kvm->arch.nr_possible_bypass_irqs++)
+		kvm_x86_call(pi_start_bypass)(kvm);
+
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
 		ret = kvm_pi_update_irte(irqfd, NULL, &irqfd->irq_entry);
-		if (ret)
+		if (ret) {
+			kvm->arch.nr_possible_bypass_irqs--;
 			kvm_arch_end_assignment(irqfd->kvm);
+		}
 	}
 	spin_unlock_irq(&kvm->irqfds.lock);
 
@@ -13666,6 +13670,8 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	}
 	irqfd->producer = NULL;
 
+	kvm->arch.nr_possible_bypass_irqs--;
+
 	spin_unlock_irq(&kvm->irqfds.lock);
 
 
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 58/67] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (56 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 57/67] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 59/67] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
                   ` (13 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are
disabled, to make it obvious that the logic is a sanity check, and so that
a bug related to nr_possible_bypass_irqs is more likely to cause noisy
failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 457a5b21c9d3..29804dfa826c 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -262,7 +262,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
  */
 void vmx_pi_start_bypass(struct kvm *kvm)
 {
-	if (!kvm_arch_has_irq_bypass())
+	if (WARN_ON_ONCE(!vmx_can_use_vtd_pi(kvm)))
 		return;
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 59/67] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (57 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 58/67] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 60/67] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
                   ` (12 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to
find and kick vCPUs when a device posted interrupt serves as a wake event.
Lookups on a vCPU index are O(fast) (not sure what xa_load() actually
provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match
its index.

Unlike the Physical APIC Table, which is accessed by hardware when
virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_
use APIC IDs to fill the Physical APIC Table, but KVM has free rein over
the format/meaning of the GA tag.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d2cbb7ac91f4..d567d62463ac 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -29,36 +29,39 @@
 #include "svm.h"
 
 /*
- * Encode the arbitrary VM ID and the vCPU's default APIC ID, i.e the vCPU ID,
- * into the GATag so that KVM can retrieve the correct vCPU from a GALog entry
- * if an interrupt can't be delivered, e.g. because the vCPU isn't running.
+ * Encode the arbitrary VM ID and the vCPU's _index_ into the GATag so that
+ * KVM can retrieve the correct vCPU from a GALog entry if an interrupt can't
+ * be delivered, e.g. because the vCPU isn't running.  Use the vCPU's index
+ * instead of its ID (a.k.a. its default APIC ID), as KVM is guaranteed a fast
+ * lookup on the index, whereas vCPUs whose index doesn't match their ID need
+ * to walk the entire xarray of vCPUs in the worst case scenario.
  *
- * For the vCPU ID, use however many bits are currently allowed for the max
+ * For the vCPU index, use however many bits are currently allowed for the max
  * guest physical APIC ID (limited by the size of the physical ID table), and
  * use whatever bits remain to assign arbitrary AVIC IDs to VMs.  Note, the
  * size of the GATag is defined by hardware (32 bits), but is an opaque value
  * as far as hardware is concerned.
  */
-#define AVIC_VCPU_ID_MASK		AVIC_PHYSICAL_MAX_INDEX_MASK
+#define AVIC_VCPU_IDX_MASK		AVIC_PHYSICAL_MAX_INDEX_MASK
 
 #define AVIC_VM_ID_SHIFT		HWEIGHT32(AVIC_PHYSICAL_MAX_INDEX_MASK)
 #define AVIC_VM_ID_MASK			(GENMASK(31, AVIC_VM_ID_SHIFT) >> AVIC_VM_ID_SHIFT)
 
 #define AVIC_GATAG_TO_VMID(x)		((x >> AVIC_VM_ID_SHIFT) & AVIC_VM_ID_MASK)
-#define AVIC_GATAG_TO_VCPUID(x)		(x & AVIC_VCPU_ID_MASK)
+#define AVIC_GATAG_TO_VCPUIDX(x)	(x & AVIC_VCPU_IDX_MASK)
 
-#define __AVIC_GATAG(vm_id, vcpu_id)	((((vm_id) & AVIC_VM_ID_MASK) << AVIC_VM_ID_SHIFT) | \
-					 ((vcpu_id) & AVIC_VCPU_ID_MASK))
-#define AVIC_GATAG(vm_id, vcpu_id)					\
+#define __AVIC_GATAG(vm_id, vcpu_idx)	((((vm_id) & AVIC_VM_ID_MASK) << AVIC_VM_ID_SHIFT) | \
+					 ((vcpu_idx) & AVIC_VCPU_IDX_MASK))
+#define AVIC_GATAG(vm_id, vcpu_idx)					\
 ({									\
-	u32 ga_tag = __AVIC_GATAG(vm_id, vcpu_id);			\
+	u32 ga_tag = __AVIC_GATAG(vm_id, vcpu_idx);			\
 									\
-	WARN_ON_ONCE(AVIC_GATAG_TO_VCPUID(ga_tag) != (vcpu_id));	\
+	WARN_ON_ONCE(AVIC_GATAG_TO_VCPUIDX(ga_tag) != (vcpu_idx));	\
 	WARN_ON_ONCE(AVIC_GATAG_TO_VMID(ga_tag) != (vm_id));		\
 	ga_tag;								\
 })
 
-static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_ID_MASK) == -1u);
+static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_IDX_MASK) == -1u);
 
 static bool force_avic;
 module_param_unsafe(force_avic, bool, 0444);
@@ -148,16 +151,16 @@ int avic_ga_log_notifier(u32 ga_tag)
 	struct kvm_svm *kvm_svm;
 	struct kvm_vcpu *vcpu = NULL;
 	u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
-	u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
+	u32 vcpu_idx = AVIC_GATAG_TO_VCPUIDX(ga_tag);
 
-	pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id);
-	trace_kvm_avic_ga_log(vm_id, vcpu_id);
+	pr_debug("SVM: %s: vm_id=%#x, vcpu_idx=%#x\n", __func__, vm_id, vcpu_idx);
+	trace_kvm_avic_ga_log(vm_id, vcpu_idx);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
 	hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
 		if (kvm_svm->avic_vm_id != vm_id)
 			continue;
-		vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
+		vcpu = kvm_get_vcpu(&kvm_svm->kvm, vcpu_idx);
 		break;
 	}
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
@@ -793,7 +796,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		struct amd_iommu_pi_data pi_data = {
 			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-					     vcpu->vcpu_id),
+					     vcpu->vcpu_idx),
 			.is_guest_mode = kvm_vcpu_apicv_active(vcpu),
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 60/67] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (58 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 59/67] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 61/67] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
                   ` (11 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a881fad027fd..2e016b98fa1b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3793,8 +3793,10 @@ int amd_iommu_update_ga(int cpu, void *data)
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry || !entry->lo.fields_vapic.guest_mode)
 		return 0;
 
 	if (!ir_data->iommu)
@@ -3813,7 +3815,10 @@ int amd_iommu_activate_guest_mode(void *data, int cpu)
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 	u64 valid;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || !entry)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry)
 		return 0;
 
 	valid = entry->lo.fields_vapic.valid;
@@ -3842,8 +3847,10 @@ int amd_iommu_deactivate_guest_mode(void *data)
 	struct irq_cfg *cfg = ir_data->cfg;
 	u64 valid;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry || !entry->lo.fields_vapic.guest_mode)
 		return 0;
 
 	valid = entry->lo.fields_remap.valid;
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 61/67] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (59 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 60/67] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
                   ` (10 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in
anticipation of moving the __avic_vcpu_{load,put}() calls into the
critical section, and because having a one-off helper with a name that's
easily confused with avic_pi_update_irte() is unnecessary.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 48 ++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d567d62463ac..0425cc374a79 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -737,32 +737,6 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	avic_handle_ldr_update(vcpu);
 }
 
-static void avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
-{
-	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	unsigned long flags;
-	struct amd_svm_iommu_ir *ir;
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	/*
-	 * Here, we go through the per-vcpu ir_list to update all existing
-	 * interrupt remapping table entry targeting this vcpu.
-	 */
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	if (list_empty(&svm->ir_list))
-		goto out;
-
-	list_for_each_entry(ir, &svm->ir_list, node) {
-		if (activate)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, apic_id));
-		else
-			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
-	}
-out:
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-}
-
 static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_vcpu *vcpu = irqfd->irq_bypass_vcpu;
@@ -998,6 +972,10 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	bool activated = kvm_vcpu_apicv_active(vcpu);
+	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct amd_svm_iommu_ir *ir;
+	unsigned long flags;
 
 	if (!enable_apicv)
 		return;
@@ -1009,7 +987,23 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	else
 		avic_vcpu_put(vcpu);
 
-	avic_set_pi_irte_mode(vcpu, activated);
+	/*
+	 * Here, we go through the per-vcpu ir_list to update all existing
+	 * interrupt remapping table entry targeting this vcpu.
+	 */
+	spin_lock_irqsave(&svm->ir_list_lock, flags);
+
+	if (list_empty(&svm->ir_list))
+		goto out;
+
+	list_for_each_entry(ir, &svm->ir_list, node) {
+		if (activated)
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, apic_id));
+		else
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
+	}
+out:
+	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (60 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 61/67] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-08 17:51   ` Paolo Bonzini
  2025-04-04 19:39 ` [PATCH 63/67] KVM: SVM: Consolidate IRTE update " Sean Christopherson
                   ` (9 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Don't query a vCPU's blocking status when toggling AVIC on/off; barring
KVM bugs, the vCPU can't be blocking when refreshing AVIC controls.  And if
there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the
correct state is desirable, i.e. well worth any overhead in a buggy
scenario.

Isolating the "real" load/put flows will allow moving the IOMMU IRTE
(de)activation logic from avic_refresh_apicv_exec_ctrl() to
avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
physical ID entry and its IRTEs in a common path, under a single critical
section of ir_list_lock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 65 +++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 0425cc374a79..d5fa915d0827 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -838,7 +838,7 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 		WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
 }
 
-void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
@@ -854,16 +854,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
-	/*
-	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
-	 * is being scheduled in after being preempted.  The CPU entries in the
-	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
-	 * If the vCPU was migrated, its new CPU value will be stuffed when the
-	 * vCPU unblocks.
-	 */
-	if (kvm_vcpu_is_blocking(vcpu))
-		return;
-
 	/*
 	 * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
 	 * _currently_ have assigned devices, as that can change.  Holding
@@ -898,31 +888,33 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
-void avic_vcpu_put(struct kvm_vcpu *vcpu)
+void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	/*
+	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
+	 * is being scheduled in after being preempted.  The CPU entries in the
+	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
+	 * If the vCPU was migrated, its new CPU value will be stuffed when the
+	 * vCPU unblocks.
+	 */
+	if (kvm_vcpu_is_blocking(vcpu))
+		return;
+
+	__avic_vcpu_load(vcpu, cpu);
+}
+
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
-	u64 entry;
+	u64 entry = svm->avic_physical_id_entry;
 
 	lockdep_assert_preemption_disabled();
 
 	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
-	/*
-	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
-	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
-	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
-	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
-	 * recursively.
-	 */
-	entry = svm->avic_physical_id_entry;
-
-	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
-	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
-		return;
-
 	/*
 	 * Take and hold the per-vCPU interrupt remapping lock while updating
 	 * the Physical ID entry even though the lock doesn't protect against
@@ -942,7 +934,24 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+}
 
+void avic_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
+	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
+	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
+	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
+	 * recursively.
+	 */
+	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
+
+	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+		return;
+
+	__avic_vcpu_put(vcpu);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -983,9 +992,9 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	avic_refresh_virtual_apic_mode(vcpu);
 
 	if (activated)
-		avic_vcpu_load(vcpu, vcpu->cpu);
+		__avic_vcpu_load(vcpu, vcpu->cpu);
 	else
-		avic_vcpu_put(vcpu);
+		__avic_vcpu_put(vcpu);
 
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 63/67] KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (61 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
                   ` (8 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into
__avic_vcpu_{load,put}(), and add a param to the helpers to communicate
whether or not AVIC is being toggled, i.e. whether the IRTEs need a "full"
update or just a quick update to set the CPU and IsRun.
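The resulting three-way choice inside avic_update_iommu_vcpu_affinity() can
be summarized with a toy dispatcher (labels stand in for the real IOMMU
calls; the names here are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for amd_iommu_update_ga(), amd_iommu_activate_guest_mode(),
 * and amd_iommu_deactivate_guest_mode(), respectively. */
enum toy_irte_action {
	IRTE_QUICK_UPDATE,	/* AVIC state unchanged: just refresh CPU + IsRun */
	IRTE_ACTIVATE,		/* AVIC toggled on: full IRTE update */
	IRTE_DEACTIVATE,	/* AVIC toggled off: back to legacy remapping */
};

/* cpu >= 0 means the vCPU is being loaded, cpu < 0 means it is being put. */
static enum toy_irte_action toy_irte_action(bool toggle_avic, int cpu)
{
	if (!toggle_avic)
		return IRTE_QUICK_UPDATE;

	return cpu >= 0 ? IRTE_ACTIVATE : IRTE_DEACTIVATE;
}
```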

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 55 ++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d5fa915d0827..c896f00f901c 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -820,7 +820,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
+static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
+					    bool toggle_avic)
 {
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -834,11 +835,17 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 	if (list_empty(&svm->ir_list))
 		return;
 
-	list_for_each_entry(ir, &svm->ir_list, node)
-		WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
+	list_for_each_entry(ir, &svm->ir_list, node) {
+		if (!toggle_avic)
+			WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
+		else if (cpu >= 0)
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu));
+		else
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
+	}
 }
 
-static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
@@ -883,7 +890,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -900,10 +907,10 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_vcpu_is_blocking(vcpu))
 		return;
 
-	__avic_vcpu_load(vcpu, cpu);
+	__avic_vcpu_load(vcpu, cpu, false);
 }
 
-static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -925,7 +932,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1);
+	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	svm->avic_physical_id_entry = entry;
@@ -951,7 +958,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
 		return;
 
-	__avic_vcpu_put(vcpu);
+	__avic_vcpu_put(vcpu, false);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -980,39 +987,15 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
 
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
-	bool activated = kvm_vcpu_apicv_active(vcpu);
-	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct amd_svm_iommu_ir *ir;
-	unsigned long flags;
-
 	if (!enable_apicv)
 		return;
 
 	avic_refresh_virtual_apic_mode(vcpu);
 
-	if (activated)
-		__avic_vcpu_load(vcpu, vcpu->cpu);
+	if (kvm_vcpu_apicv_active(vcpu))
+		__avic_vcpu_load(vcpu, vcpu->cpu, true);
 	else
-		__avic_vcpu_put(vcpu);
-
-	/*
-	 * Here, we go through the per-vcpu ir_list to update all existing
-	 * interrupt remapping table entry targeting this vcpu.
-	 */
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	if (list_empty(&svm->ir_list))
-		goto out;
-
-	list_for_each_entry(ir, &svm->ir_list, node) {
-		if (activated)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, apic_id));
-		else
-			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
-	}
-out:
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+		__avic_vcpu_put(vcpu, true);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (62 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 63/67] KVM: SVM: Consolidate IRTE update " Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-09 11:56   ` Joao Martins
  2025-04-04 19:39 ` [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPU is blocking Sean Christopherson
                   ` (7 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts.  KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.
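The intended IRTE behavior can be condensed into a toy model (the struct and
helper below are illustrative stand-ins, not the kernel's real irte_ga
layout): when a target pCPU is known, the IRTE posts directly and suppresses
GA log interrupts; otherwise IsRun is cleared and GA log generation follows
the caller's request.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the IRTE fields touched by __amd_iommu_update_ga(). */
struct toy_irte {
	int destination;	/* target pCPU for direct posting */
	bool is_run;		/* IsRun: vCPU is running, post directly */
	bool ga_log_intr;	/* generate a GA log interrupt on delivery */
};

static void toy_update_ga(struct toy_irte *e, int cpu, bool ga_log_intr)
{
	if (cpu >= 0) {
		/* vCPU is running: post directly, no GA log IRQ needed. */
		e->destination = cpu;
		e->is_run = true;
		e->ga_log_intr = false;
	} else {
		/* vCPU not running: honor the caller's request. */
		e->is_run = false;
		e->ga_log_intr = ga_log_intr;
	}
}
```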

Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.

Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that they match amd_iommu_activate_guest_mode().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h |  1 +
 arch/x86/kvm/svm/avic.c              | 10 ++++++----
 drivers/iommu/amd/iommu.c            | 17 ++++++++++-------
 include/linux/amd-iommu.h            |  9 ++++-----
 4 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 4c75a17632f6..5a0d42464d44 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -36,6 +36,7 @@ struct amd_iommu_pi_data {
 	u32 ga_tag;
 	u32 vector;		/* Guest vector of the interrupt */
 	int cpu;
+	bool ga_log_intr;
 	bool is_guest_mode;
 	void *ir_data;
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c896f00f901c..1466e66cca6c 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -794,10 +794,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
 		 */
 		entry = svm->avic_physical_id_entry;
-		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
 			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-		else
+		} else {
 			pi_data.cpu = -1;
+			pi_data.ga_log_intr = true;
+		}
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
@@ -837,9 +839,9 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
 		if (!toggle_avic)
-			WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
+			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, true));
 		else if (cpu >= 0)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu));
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, true));
 		else
 			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
 	}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 2e016b98fa1b..27b03e718980 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3775,7 +3775,8 @@ static const struct irq_domain_ops amd_ir_domain_ops = {
 	.deactivate = irq_remapping_deactivate,
 };
 
-static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
+static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
+				  bool ga_log_intr)
 {
 	if (cpu >= 0) {
 		entry->lo.fields_vapic.destination =
@@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
 		entry->hi.fields.destination =
 					APICID_TO_IRTE_DEST_HI(cpu);
 		entry->lo.fields_vapic.is_run = true;
+		entry->lo.fields_vapic.ga_log_intr = false;
 	} else {
 		entry->lo.fields_vapic.is_run = false;
+		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
 	}
 }
 
-int amd_iommu_update_ga(int cpu, void *data)
+int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3802,14 +3805,14 @@ int amd_iommu_update_ga(int cpu, void *data)
 	if (!ir_data->iommu)
 		return -ENODEV;
 
-	__amd_iommu_update_ga(entry, cpu);
+	__amd_iommu_update_ga(entry, cpu, ga_log_intr);
 
 	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 				ir_data->irq_2_irte.index, entry);
 }
 EXPORT_SYMBOL(amd_iommu_update_ga);
 
-int amd_iommu_activate_guest_mode(void *data, int cpu)
+int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3828,12 +3831,11 @@ int amd_iommu_activate_guest_mode(void *data, int cpu)
 
 	entry->lo.fields_vapic.valid       = valid;
 	entry->lo.fields_vapic.guest_mode  = 1;
-	entry->lo.fields_vapic.ga_log_intr = 1;
 	entry->hi.fields.ga_root_ptr       = ir_data->ga_root_ptr;
 	entry->hi.fields.vector            = ir_data->ga_vector;
 	entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;
 
-	__amd_iommu_update_ga(entry, cpu);
+	__amd_iommu_update_ga(entry, cpu, ga_log_intr);
 
 	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 			      ir_data->irq_2_irte.index, entry);
@@ -3904,7 +3906,8 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		if (pi_data->is_guest_mode)
-			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu,
+							    pi_data->ga_log_intr);
 		else
 			ret = amd_iommu_deactivate_guest_mode(ir_data);
 	} else {
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index c9f2df0c4596..8cced632ecd0 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -30,9 +30,8 @@ static inline void amd_iommu_detect(void) { }
 /* IOMMU AVIC Function */
 extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
-extern int amd_iommu_update_ga(int cpu, void *data);
-
-extern int amd_iommu_activate_guest_mode(void *data, int cpu);
+extern int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr);
+extern int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr);
 extern int amd_iommu_deactivate_guest_mode(void *data);
 
 #else /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
@@ -43,12 +42,12 @@ amd_iommu_register_ga_log_notifier(int (*notifier)(u32))
 	return 0;
 }
 
-static inline int amd_iommu_update_ga(int cpu, void *data)
+static inline int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr)
 {
 	return 0;
 }
 
-static inline int amd_iommu_activate_guest_mode(void *data, int cpu)
+static inline int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr)
 {
 	return 0;
 }
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPU is blocking
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (63 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-08 17:53   ` Paolo Bonzini
  2025-04-04 19:39 ` [PATCH 66/67] *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support Sean Christopherson
                   ` (6 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Configure IRTEs to generate GA log interrupts for device posted IRQs that
hit non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event.  If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).

Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.
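The cached-vs-hardware split can be sketched as follows (a simplified model
with made-up names; the real code also conditions the table write on
enable_ipiv and does all of this under ir_list_lock):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_HOST_PHYSICAL_ID_MASK	((1ULL << 12) - 1)	/* GENMASK_ULL(11, 0) */
#define ENTRY_GA_LOG_INTR		(1ULL << 61)		/* synthetic, KVM-only */
#define ENTRY_IS_RUNNING		(1ULL << 62)

/*
 * Sketch of the "put" path: the hardware-visible table never sees the
 * synthetic GA_LOG_INTR bit; only KVM's cached copy of the entry does.
 */
static uint64_t toy_avic_put_entry(uint64_t entry, bool is_blocking,
				   uint64_t *hw_table_entry)
{
	entry &= ~ENTRY_IS_RUNNING;
	*hw_table_entry = entry;	/* bit 61 is always clear here */

	if (is_blocking)
		entry |= ENTRY_GA_LOG_INTR;

	return entry;			/* KVM's cached copy */
}
```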

Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h |  7 ++++++
 arch/x86/kvm/svm/avic.c    | 49 +++++++++++++++++++++++++++-----------
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 8b07939ef3b9..be6e833bf92c 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -246,6 +246,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define AVIC_LOGICAL_ID_ENTRY_VALID_BIT			31
 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK		(1 << 31)
 
+/*
+ * GA_LOG_INTR is a synthetic flag that's never propagated to hardware-visible
+ * tables.  GA_LOG_INTR is set if the vCPU needs device posted IRQs to generate
+ * GA log interrupts to wake the vCPU (because it's blocking or about to block).
+ */
+#define AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR		BIT_ULL(61)
+
 #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK	GENMASK_ULL(11, 0)
 #define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	GENMASK_ULL(51, 12)
 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK		(1ULL << 62)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1466e66cca6c..0d2a17a74be6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -798,7 +798,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 		} else {
 			pi_data.cpu = -1;
-			pi_data.ga_log_intr = true;
+			pi_data.ga_log_intr = entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
 		}
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
@@ -823,7 +823,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 }
 
 static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
-					    bool toggle_avic)
+					    bool toggle_avic, bool ga_log_intr)
 {
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -839,9 +839,9 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
 		if (!toggle_avic)
-			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, true));
+			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, ga_log_intr));
 		else if (cpu >= 0)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, true));
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, ga_log_intr));
 		else
 			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
 	}
@@ -875,7 +875,8 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
 	entry = svm->avic_physical_id_entry;
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
-	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+	entry &= ~(AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK |
+		   AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
@@ -892,7 +893,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -912,7 +913,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	__avic_vcpu_load(vcpu, cpu, false);
 }
 
-static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
+			    bool is_blocking)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -934,14 +936,28 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic);
+	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic, is_blocking);
 
+	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
+
+	/*
+	 * Keep the previous APIC ID in the entry so that a rogue doorbell from
+	 * hardware is at least restricted to a CPU associated with the vCPU.
+	 */
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	svm->avic_physical_id_entry = entry;
 
 	if (enable_ipiv)
 		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
+	/*
+	 * Note!  Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
+	 * it's a synthetic flag that usurps an unused, should-be-zero bit.
+	 */
+	if (is_blocking)
+		entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
+
+	svm->avic_physical_id_entry = entry;
+
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
@@ -957,10 +973,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
-	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
-		return;
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)) {
+		if (WARN_ON_ONCE(!kvm_vcpu_is_blocking(vcpu)))
+			return;
 
-	__avic_vcpu_put(vcpu, false);
+		if (!(WARN_ON_ONCE(!(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR))))
+			return;
+	}
+
+	__avic_vcpu_put(vcpu, false, kvm_vcpu_is_blocking(vcpu));
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -997,7 +1018,7 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	if (kvm_vcpu_apicv_active(vcpu))
 		__avic_vcpu_load(vcpu, vcpu->cpu, true);
 	else
-		__avic_vcpu_put(vcpu, true);
+		__avic_vcpu_put(vcpu, true, true);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
@@ -1023,7 +1044,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
 	 * to the vCPU while it's scheduled out.
 	 */
-	avic_vcpu_put(vcpu);
+	__avic_vcpu_put(vcpu, false, true);
 }
 
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
-- 
2.49.0.504.g3bcea36a83-goog



* [PATCH 66/67] *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (64 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPU is blocking Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-04 19:39 ` [PATCH 67/67] *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test Sean Christopherson
                   ` (5 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Hack the IOMMU half of AMD device posted IRQ support to allow testing a
decent chunk of the related code on systems with AVIC capable CPUs, but no
IOMMU virtual APIC support.  E.g. some Milan CPUs allow enabling AVIC even
though it's not advertised as being supported, but the IOMMU unfortunately
doesn't allow the same shenanigans.

Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 76 ++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/svm/svm.c    |  2 ++
 drivers/iommu/amd/init.c  |  8 +++--
 drivers/iommu/amd/iommu.c | 50 +++++++++++++++++++++++++-
 4 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 0d2a17a74be6..425674e1a04c 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -28,6 +28,8 @@
 #include "irq.h"
 #include "svm.h"
 
+#include "../../../drivers/iommu/amd/amd_iommu_types.h"
+
 /*
  * Encode the arbitrary VM ID and the vCPU's _index_ into the GATag so that
  * KVM can retrieve the correct vCPU from a GALog entry if an interrupt can't
@@ -141,11 +143,7 @@ static void avic_deactivate_vmcb(struct vcpu_svm *svm)
 	svm_set_x2apic_msr_interception(svm, true);
 }
 
-/* Note:
- * This function is called from IOMMU driver to notify
- * SVM to schedule in a particular vCPU of a particular VM.
- */
-int avic_ga_log_notifier(u32 ga_tag)
+static struct kvm_vcpu *avic_ga_log_get_vcpu(u32 ga_tag)
 {
 	unsigned long flags;
 	struct kvm_svm *kvm_svm;
@@ -165,6 +163,17 @@ int avic_ga_log_notifier(u32 ga_tag)
 	}
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
+	return vcpu;
+}
+
+/* Note:
+ * This function is called from IOMMU driver to notify
+ * SVM to schedule in a particular vCPU of a particular VM.
+ */
+int avic_ga_log_notifier(u32 ga_tag)
+{
+	struct kvm_vcpu *vcpu = avic_ga_log_get_vcpu(ga_tag);
+
 	/* Note:
 	 * At this point, the IOMMU should have already set the pending
 	 * bit in the vAPIC backing page. So, we just need to schedule
@@ -750,6 +759,8 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
 }
 
+extern struct amd_iommu_pi_data amd_iommu_fake_irte;
+
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_kernel_irq_routing_entry *new,
@@ -1055,6 +1066,58 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
 	avic_vcpu_load(vcpu, vcpu->cpu);
 }
 
+static void avic_pi_handler(void)
+{
+	struct amd_iommu_pi_data pi;
+	struct kvm_vcpu *vcpu;
+
+	memcpy(&pi, &amd_iommu_fake_irte, sizeof(pi));
+
+	if (!pi.is_guest_mode) {
+		pr_warn("IRQ %u arrived with !is_guest_mode\n", pi.vector);
+		return;
+	}
+
+	vcpu = avic_ga_log_get_vcpu(pi.ga_tag);
+	if (!vcpu) {
+		pr_warn("No vCPU for IRQ %u\n", pi.vector);
+		return;
+	}
+	WARN_ON_ONCE(pi.vapic_addr << 12 != avic_get_backing_page_address(to_svm(vcpu)));
+
+	/*
+	 * When updating a vCPU's IRTE, the fake posted IRQ can race with the
+	 * IRTE update.  Take ir_list_lock so that the IRQ can be processed
+	 * atomically.  In real hardware, the IOMMU will complete IRQ delivery
+	 * before accepting the new IRTE.
+	 */
+	guard(spinlock_irqsave)(&to_svm(vcpu)->ir_list_lock);
+
+	if (amd_iommu_fake_irte.ga_tag != pi.ga_tag) {
+		WARN_ON_ONCE(amd_iommu_fake_irte.is_guest_mode);
+		return;
+	}
+
+	memcpy(&pi, &amd_iommu_fake_irte, sizeof(pi));
+
+#if 0
+	pr_warn("In PI handler, guest = %u, cpu = %d, tag = %x, intr = %u, vector = %u\n",
+		pi.is_guest_mode, pi.cpu,
+		pi.ga_tag, pi.ga_log_intr, pi.vector);
+#endif
+
+	if (!pi.is_guest_mode)
+		return;
+
+	kvm_lapic_set_irr(pi.vector, vcpu->arch.apic);
+	smp_mb__after_atomic();
+
+	if (pi.cpu >= 0)
+		avic_ring_doorbell(vcpu);
+	else if (pi.ga_log_intr)
+		avic_ga_log_notifier(pi.ga_tag);
+}
+
 /*
  * Note:
  * - The module param avic enable both xAPIC and x2APIC mode.
@@ -1107,5 +1170,8 @@ bool avic_hardware_setup(void)
 
 	amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
 
+	pr_warn("Register AVIC PI wakeup handler\n");
+	kvm_set_posted_intr_wakeup_handler(avic_pi_handler);
+
 	return true;
 }
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 71b52ad13577..b8adeb87e800 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1122,6 +1122,8 @@ static void svm_hardware_unsetup(void)
 {
 	int cpu;
 
+	kvm_set_posted_intr_wakeup_handler(NULL);
+
 	sev_hardware_unsetup();
 
 	for_each_possible_cpu(cpu)
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index cb536d372b12..28cc8552ca95 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -2863,8 +2863,12 @@ static void enable_iommus_vapic(void)
 			return;
 	}
 
-	if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) &&
-	    !check_feature(FEATURE_GAM_VAPIC)) {
+	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir))
+		return;
+
+	if (!check_feature(FEATURE_GAM_VAPIC)) {
+		pr_warn("IOMMU lacks GAM_VAPIC, fudging IRQ posting\n");
+		amd_iommu_irq_ops.capability |= (1 << IRQ_POSTING_CAP);
 		amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY_GA;
 		return;
 	}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 27b03e718980..f2bd262330fa 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3775,6 +3775,15 @@ static const struct irq_domain_ops amd_ir_domain_ops = {
 	.deactivate = irq_remapping_deactivate,
 };
 
+struct amd_iommu_pi_data amd_iommu_fake_irte;
+EXPORT_SYMBOL_GPL(amd_iommu_fake_irte);
+
+static bool amd_iommu_fudge_pi(void)
+{
+	return irq_remapping_cap(IRQ_POSTING_CAP) &&
+	       !AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir);
+}
+
 static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
 				  bool ga_log_intr)
 {
@@ -3796,6 +3805,12 @@ int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr)
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 
+	if (amd_iommu_fudge_pi()) {
+		amd_iommu_fake_irte.cpu = cpu;
+		amd_iommu_fake_irte.ga_log_intr = ga_log_intr;
+		return 0;
+	}
+
 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
 		return -EINVAL;
 
@@ -3818,6 +3833,26 @@ int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr)
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 	u64 valid;
 
+	if (amd_iommu_fudge_pi()) {
+		if (WARN_ON_ONCE(!entry->lo.fields_remap.valid))
+			return -EINVAL;
+
+		if (WARN_ON_ONCE(entry->lo.fields_remap.int_type != APIC_DELIVERY_MODE_FIXED))
+			return -EINVAL;
+
+		amd_iommu_fake_irte.cpu = cpu;
+		amd_iommu_fake_irte.vapic_addr = ir_data->ga_root_ptr;
+		amd_iommu_fake_irte.vector = ir_data->ga_vector;
+		amd_iommu_fake_irte.ga_tag = ir_data->ga_tag;
+		amd_iommu_fake_irte.ga_log_intr = ga_log_intr;
+		amd_iommu_fake_irte.is_guest_mode = true;
+
+		entry->hi.fields.vector = POSTED_INTR_WAKEUP_VECTOR;
+
+		return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
+				      ir_data->irq_2_irte.index, entry);
+	}
+
 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
 		return -EINVAL;
 
@@ -3849,12 +3884,18 @@ int amd_iommu_deactivate_guest_mode(void *data)
 	struct irq_cfg *cfg = ir_data->cfg;
 	u64 valid;
 
+	if (amd_iommu_fudge_pi() && entry) {
+		memset(&amd_iommu_fake_irte, 0, sizeof(amd_iommu_fake_irte));
+		goto fudge;
+	}
+
 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
 		return -EINVAL;
 
 	if (!entry || !entry->lo.fields_vapic.guest_mode)
 		return 0;
 
+fudge:
 	valid = entry->lo.fields_remap.valid;
 
 	entry->lo.val = 0;
@@ -3891,12 +3932,19 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 	 * This device has never been set up for guest mode.
 	 * we should not modify the IRTE
 	 */
-	if (!dev_data || !dev_data->use_vapic)
+	if (!dev_data)
+		return -EINVAL;
+
+	if (amd_iommu_fudge_pi())
+		goto fudge;
+
+	if (!dev_data->use_vapic)
 		return -EINVAL;
 
 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
 		return -EINVAL;
 
+fudge:
 	ir_data->cfg = irqd_cfg(data);
 
 	if (pi_data) {
-- 
2.49.0.504.g3bcea36a83-goog


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 67/67] *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (65 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 66/67] *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support Sean Christopherson
@ 2025-04-04 19:39 ` Sean Christopherson
  2025-04-08 12:44 ` [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Joerg Roedel
                   ` (4 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-04 19:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../selftests/kvm/include/vfio_pci_util.h     | 149 ++++++
 .../selftests/kvm/include/x86/processor.h     |  21 +
 .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 ++++++++
 tools/testing/selftests/kvm/mercury_device.h  | 118 +++++
 tools/testing/selftests/kvm/vfio_irq_test.c   | 429 ++++++++++++++++++
 6 files changed, 920 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
 create mode 100644 tools/testing/selftests/kvm/mercury_device.h
 create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index f773f8f99249..8f017b858d4b 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -15,6 +15,7 @@ LIBKVM += lib/sparsebit.c
 LIBKVM += lib/test_util.c
 LIBKVM += lib/ucall_common.c
 LIBKVM += lib/userfaultfd_util.c
+LIBKVM += lib/vfio_pci_util.c
 
 LIBKVM_STRING += lib/string_override.c
 
@@ -133,6 +134,7 @@ TEST_GEN_PROGS_x86 += mmu_stress_test
 TEST_GEN_PROGS_x86 += rseq_test
 TEST_GEN_PROGS_x86 += set_memory_region_test
 TEST_GEN_PROGS_x86 += steal_time
+TEST_GEN_PROGS_x86 += vfio_irq_test
 TEST_GEN_PROGS_x86 += kvm_binary_stats_test
 TEST_GEN_PROGS_x86 += system_counter_offset_test
 TEST_GEN_PROGS_x86 += pre_fault_memory_test
diff --git a/tools/testing/selftests/kvm/include/vfio_pci_util.h b/tools/testing/selftests/kvm/include/vfio_pci_util.h
new file mode 100644
index 000000000000..2a697dcb741e
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/vfio_pci_util.h
@@ -0,0 +1,149 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+
+#ifndef SELFTEST_KVM_VFIO_UTIL_H
+#define SELFTEST_KVM_VFIO_UTIL_H
+
+#include <linux/pci_regs.h>
+#include <linux/vfio.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+
+struct vfio_pci_dev {
+	int fd;
+	int group_fd;
+	int container_fd;
+};
+
+struct vfio_pci_dev *__vfio_pci_init(const char *bdf, unsigned long iommu_type);
+void vfio_pci_free(struct vfio_pci_dev *dev);
+
+static inline struct vfio_pci_dev *vfio_pci_init(const char *bdf)
+{
+	return __vfio_pci_init(bdf, VFIO_TYPE1v2_IOMMU);
+}
+
+#define __vfio_ioctl(vfio_fd, cmd, arg)				\
+({								\
+	__kvm_ioctl(vfio_fd, cmd, arg);				\
+})
+
+#define vfio_ioctl(vfio_fd, cmd, arg)				\
+({								\
+	int ret = __vfio_ioctl(vfio_fd, cmd, arg);		\
+								\
+	TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret));	\
+})
+
+static inline uint32_t vfio_pci_get_nr_irqs(struct vfio_pci_dev *dev,
+					    uint32_t irq_type)
+{
+	struct vfio_irq_info irq_info = {
+		.argsz = sizeof(struct vfio_irq_info),
+		.index = irq_type,
+	};
+
+	vfio_ioctl(dev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+
+	TEST_ASSERT(irq_info.flags & VFIO_IRQ_INFO_EVENTFD,
+		    "eventfd signalling unsupported by IRQ type '%u'", irq_type);
+	return irq_info.count;
+}
+
+static inline uint32_t vfio_pci_get_nr_msi_irqs(struct vfio_pci_dev *dev)
+{
+	return vfio_pci_get_nr_irqs(dev, VFIO_PCI_MSI_IRQ_INDEX);
+}
+
+static inline uint32_t vfio_pci_get_nr_msix_irqs(struct vfio_pci_dev *dev)
+{
+	return vfio_pci_get_nr_irqs(dev, VFIO_PCI_MSIX_IRQ_INDEX);
+}
+
+static inline void __vfio_pci_irq_eventfd(struct vfio_pci_dev *dev, int eventfd,
+					  uint32_t irq_type, uint32_t set)
+{
+	struct {
+		struct vfio_irq_set vfio;
+		uint32_t eventfd;
+	} buffer = {};
+
+	memset(&buffer, 0, sizeof(buffer));
+	buffer.vfio.argsz = sizeof(buffer);
+	buffer.vfio.flags = set | VFIO_IRQ_SET_ACTION_TRIGGER;
+	buffer.vfio.index = irq_type;
+	buffer.vfio.count = 1;
+	buffer.eventfd = eventfd;
+
+	vfio_ioctl(dev->fd, VFIO_DEVICE_SET_IRQS, &buffer.vfio);
+}
+
+static inline void vfio_pci_assign_irq_eventfd(struct vfio_pci_dev *dev,
+					       int eventfd, uint32_t irq_type)
+{
+	__vfio_pci_irq_eventfd(dev, eventfd, irq_type, VFIO_IRQ_SET_DATA_EVENTFD);
+}
+
+static inline void vfio_pci_assign_msix(struct vfio_pci_dev *dev, int eventfd)
+{
+	vfio_pci_assign_irq_eventfd(dev, eventfd, VFIO_PCI_MSIX_IRQ_INDEX);
+}
+
+static inline void vfio_pci_release_irq_eventfds(struct vfio_pci_dev *dev,
+						 uint32_t irq_type)
+{
+	struct vfio_irq_set vfio = {
+		.argsz = sizeof(struct vfio_irq_set),
+		.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+		.index = irq_type,
+		.count = 0,
+	};
+
+	vfio_ioctl(dev->fd, VFIO_DEVICE_SET_IRQS, &vfio);
+}
+
+static inline void vfio_pci_release_msix(struct vfio_pci_dev *dev)
+{
+	vfio_pci_release_irq_eventfds(dev, VFIO_PCI_MSIX_IRQ_INDEX);
+}
+
+static inline void vfio_pci_send_irq_eventfd(struct vfio_pci_dev *dev,
+					     int eventfd, uint32_t irq_type)
+{
+	__vfio_pci_irq_eventfd(dev, eventfd, irq_type, VFIO_IRQ_SET_DATA_NONE);
+}
+
+static inline void vfio_pci_send_msix(struct vfio_pci_dev *dev, int eventfd)
+{
+	vfio_pci_send_irq_eventfd(dev, eventfd, VFIO_PCI_MSIX_IRQ_INDEX);
+}
+
+void *vfio_pci_map_bar(struct vfio_pci_dev *dev, unsigned int bar_idx,
+		       uint64_t *size);
+
+void vfio_pci_read_config_data(struct vfio_pci_dev *dev, size_t offset,
+			       size_t size, void *data);
+
+static inline uint16_t vfio_pci_config_read_u16(struct vfio_pci_dev *dev,
+						size_t offset)
+{
+	uint16_t val;
+
+	vfio_pci_read_config_data(dev, offset, sizeof(val), &val);
+	return le16toh(val);
+}
+
+static inline uint16_t vfio_pci_get_vendor_id(struct vfio_pci_dev *dev)
+{
+	return vfio_pci_config_read_u16(dev, PCI_VENDOR_ID);
+}
+
+static inline uint16_t vfio_pci_get_device_id(struct vfio_pci_dev *dev)
+{
+	return vfio_pci_config_read_u16(dev, PCI_DEVICE_ID);
+}
+
+#endif /* SELFTEST_KVM_VFIO_UTIL_H */
diff --git a/tools/testing/selftests/kvm/include/x86/processor.h b/tools/testing/selftests/kvm/include/x86/processor.h
index 32ab6ca7ec32..251dcc074503 100644
--- a/tools/testing/selftests/kvm/include/x86/processor.h
+++ b/tools/testing/selftests/kvm/include/x86/processor.h
@@ -19,6 +19,27 @@
 #include "kvm_util.h"
 #include "ucall_common.h"
 
+
+static inline void writel(uint32_t val, volatile void *addr)
+{
+	*(volatile uint32_t *)addr = val;
+}
+
+static inline uint32_t readl(volatile void *addr)
+{
+	return *(volatile uint32_t *)addr;
+}
+
+static inline void writeq(uint64_t val, volatile void *addr)
+{
+	*(volatile uint64_t *)addr = val;
+}
+
+static inline uint64_t readq(volatile void *addr)
+{
+	return *(volatile uint64_t *)addr;
+}
+
 extern bool host_cpu_is_intel;
 extern bool host_cpu_is_amd;
 extern uint64_t guest_tsc_khz;
diff --git a/tools/testing/selftests/kvm/lib/vfio_pci_util.c b/tools/testing/selftests/kvm/lib/vfio_pci_util.c
new file mode 100644
index 000000000000..878d91be2212
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/vfio_pci_util.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <poll.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libgen.h>
+#include <endian.h>
+#include <sys/ioctl.h>
+#include <linux/mman.h>
+#include <asm/barrier.h>
+#include <sys/eventfd.h>
+#include <linux/limits.h>
+
+#include <linux/vfio.h>
+#include <linux/pci_regs.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "vfio_pci_util.h"
+
+#define VFIO_DEV_PATH	"/dev/vfio/vfio"
+#define PCI_SYSFS_PATH	"/sys/bus/pci/devices/"
+
+void *vfio_pci_map_bar(struct vfio_pci_dev *dev, unsigned int bar_idx,
+		       uint64_t *size)
+{
+	struct vfio_region_info info = {
+		.argsz = sizeof(struct vfio_region_info),
+		.index = bar_idx,
+	};
+	int fd = dev->fd;
+	void *bar;
+	int prot;
+
+	TEST_ASSERT(bar_idx <= VFIO_PCI_BAR5_REGION_INDEX,
+		    "Invalid BAR index: %d", bar_idx);
+
+	/* Currently only support the cases where the BAR can be mmap-ed */
+	vfio_ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &info);
+	TEST_ASSERT(info.flags & VFIO_REGION_INFO_FLAG_MMAP,
+		    "BAR%d doesn't support mmap", bar_idx);
+
+	TEST_ASSERT(info.flags & VFIO_REGION_INFO_FLAG_READ,
+		    "BAR%d doesn't support read?", bar_idx);
+
+	prot = PROT_READ;
+	if (info.flags & VFIO_REGION_INFO_FLAG_WRITE)
+		prot |= PROT_WRITE;
+
+	bar = mmap(NULL, info.size, prot, MAP_FILE | MAP_SHARED, fd, info.offset);
+	TEST_ASSERT(bar != MAP_FAILED, "mmap(BAR%d) failed", bar_idx);
+
+	*size = info.size;
+	return bar;
+}
+
+/*
+ * Read the PCI config space data
+ *
+ * Input Args:
+ *   vfio_pci: Pointer to struct vfio_pci_dev
+ *   config: The config space field's offset to read from (eg: PCI_VENDOR_ID)
+ *   size: The size to read from the config region (could be one or more fields).
+ *   data: Pointer to the region where the read data is to be copied into
+ *
+ *  The data returned is in little-endian format, which is the standard for PCI config space.
+ */
+void vfio_pci_read_config_data(struct vfio_pci_dev *dev, size_t offset,
+			       size_t size, void *data)
+{
+	struct vfio_region_info info = {
+		.argsz = sizeof(struct vfio_region_info),
+		.index = VFIO_PCI_CONFIG_REGION_INDEX,
+	};
+	int ret;
+
+	vfio_ioctl(dev->fd, VFIO_DEVICE_GET_REGION_INFO, &info);
+
+	TEST_ASSERT(offset + size <= PCI_CFG_SPACE_EXP_SIZE,
+		    "Requested config (%lu) and size (%lu) is out of bounds (%u)",
+		    offset, size, PCI_CFG_SPACE_EXP_SIZE);
+
+	ret = pread(dev->fd, data, size, info.offset + offset);
+	TEST_ASSERT(ret == size, "Failed to read the PCI config: 0x%lx\n", offset);
+}
+
+static unsigned int vfio_pci_get_group_from_dev(const char *bdf)
+{
+	char dev_iommu_group_path[PATH_MAX] = {0};
+	unsigned int pci_dev_sysfs_path_len;
+	char *pci_dev_sysfs_path;
+	unsigned int group;
+	int ret;
+
+	pci_dev_sysfs_path_len = strlen(PCI_SYSFS_PATH) + strlen("DDDD:BB:DD.F/iommu_group") + 1;
+
+	pci_dev_sysfs_path = calloc(1, pci_dev_sysfs_path_len);
+	TEST_ASSERT(pci_dev_sysfs_path, "Insufficient memory for pci dev sysfs path");
+
+	snprintf(pci_dev_sysfs_path, pci_dev_sysfs_path_len,
+		 "%s%s/iommu_group", PCI_SYSFS_PATH, bdf);
+
+	ret = readlink(pci_dev_sysfs_path, dev_iommu_group_path,
+		       sizeof(dev_iommu_group_path));
+	TEST_ASSERT(ret != -1, "Failed to get IOMMU group for device: %s", bdf);
+
+	ret = sscanf(basename(dev_iommu_group_path), "%u", &group);
+	TEST_ASSERT(ret == 1, "Failed to get IOMMU group for device: %s", bdf);
+
+	free(pci_dev_sysfs_path);
+	return group;
+}
+
+static void vfio_pci_setup_group(struct vfio_pci_dev *dev, const char *bdf)
+{
+	char group_path[32];
+	struct vfio_group_status group_status = {
+	    .argsz = sizeof(group_status),
+	};
+	int group;
+
+	group = vfio_pci_get_group_from_dev(bdf);
+	snprintf(group_path, sizeof(group_path), "/dev/vfio/%d", group);
+
+	dev->group_fd = open(group_path, O_RDWR);
+	TEST_ASSERT(dev->group_fd >= 0,
+		    "Failed to open the VFIO group %d for device: %s\n", group, bdf);
+
+	__vfio_ioctl(dev->group_fd, VFIO_GROUP_GET_STATUS, &group_status);
+	TEST_ASSERT(group_status.flags & VFIO_GROUP_FLAGS_VIABLE,
+		    "Group %d for device %s not viable.  Ensure all devices are bound to vfio-pci",
+		    group, bdf);
+
+	vfio_ioctl(dev->group_fd, VFIO_GROUP_SET_CONTAINER, &dev->container_fd);
+}
+
+static void vfio_pci_set_iommu(struct vfio_pci_dev *dev, unsigned long iommu_type)
+{
+	TEST_ASSERT_EQ(__vfio_ioctl(dev->container_fd, VFIO_CHECK_EXTENSION, (void *)iommu_type), 1);
+	vfio_ioctl(dev->container_fd, VFIO_SET_IOMMU, (void *)iommu_type);
+}
+
+static void vfio_pci_open_device(struct vfio_pci_dev *dev, const char *bdf)
+{
+	struct vfio_device_info dev_info = {
+		.argsz = sizeof(dev_info),
+	};
+
+	dev->fd = __vfio_ioctl(dev->group_fd, VFIO_GROUP_GET_DEVICE_FD, bdf);
+	TEST_ASSERT(dev->fd >= 0, "Failed to get the device fd\n");
+
+	vfio_ioctl(dev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
+
+	TEST_ASSERT(!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET),
+		    "If VFIO tries to reset the VF, it will fail.");
+
+	/* Require at least all BAR regions and the config space. */
+	TEST_ASSERT(dev_info.num_regions >= VFIO_PCI_CONFIG_REGION_INDEX,
+		    "Required number regions not supported (%d) for device: %s",
+		    dev_info.num_regions, bdf);
+
+	/* Check for at least VFIO_PCI_MSIX_IRQ_INDEX irqs */
+	TEST_ASSERT(dev_info.num_irqs >= VFIO_PCI_MSIX_IRQ_INDEX,
+		    "MSI-X IRQs (%d) not supported for device: %s",
+		    dev_info.num_irqs, bdf);
+}
+
+/* bdf: PCI device's Domain:Bus:Device:Function in "DDDD:BB:DD.F" format */
+struct vfio_pci_dev *__vfio_pci_init(const char *bdf, unsigned long iommu_type)
+{
+	struct vfio_pci_dev *dev;
+	int vfio_version;
+
+	TEST_ASSERT(bdf, "PCI BDF not supplied\n");
+
+	dev = calloc(1, sizeof(*dev));
+	TEST_ASSERT(dev, "Insufficient memory for vfio_pci_dev");
+
+	dev->container_fd = open_path_or_exit(VFIO_DEV_PATH, O_RDWR);
+
+	vfio_version = __vfio_ioctl(dev->container_fd, VFIO_GET_API_VERSION, NULL);
+	TEST_REQUIRE(vfio_version == VFIO_API_VERSION);
+
+
+	vfio_pci_setup_group(dev, bdf);
+	vfio_pci_set_iommu(dev, iommu_type);
+	vfio_pci_open_device(dev, bdf);
+
+	return dev;
+}
+
+void vfio_pci_free(struct vfio_pci_dev *dev)
+{
+	close(dev->fd);
+	vfio_ioctl(dev->group_fd, VFIO_GROUP_UNSET_CONTAINER, NULL);
+
+	close(dev->group_fd);
+	close(dev->container_fd);
+
+	free(dev);
+}
diff --git a/tools/testing/selftests/kvm/mercury_device.h b/tools/testing/selftests/kvm/mercury_device.h
new file mode 100644
index 000000000000..fd4a3a5bac25
--- /dev/null
+++ b/tools/testing/selftests/kvm/mercury_device.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+
+#ifndef SELFTEST_KVM_MERCURY_DEVICE_H
+#define SELFTEST_KVM_MERCURY_DEVICE_H
+
+#include "processor.h"
+#include "test_util.h"
+
+#define MERCURY_VENDOR_ID	0x1ae0
+#define MERCURY_DEVICE_ID	0x0050
+
+/* The base registers of the mercury device begin at the below offset from BAR0 */
+#define MERCURY_BASE_OFFSET	(768 * 1024)
+
+#define MERCURY_MSIX_VECTOR	0
+#define MERCURY_MSIX_COUNT	1 /* Currently, only 1 vector is assigned to mercury */
+
+#define MERCURY_DMA_MAX_BUF_SIZE_BYTES		SZ_8K
+#define MERCURY_DMA_MEMCPY_MAX_BUF_SIZE_BYTES	SZ_1G
+
+/* Mercury device accepts the DMA size as double-word (4-bytes) */
+#define MERCURY_DMA_SIZE_STRIDE			4
+
+#define MERCURY_ABI_VERSION	0
+
+/* Register Offsets relative to MERCURY_BASE_OFFSET */
+/* Unless otherwise specified, all the registers are 32-bits */
+#define MERCURY_REG_VERSION		0x0	/* Read-only */
+#define MERCURY_REG_COMMAND		0x04	/* Write-only */
+#define MERCURY_REG_STATUS		0x08	/* Read-only, 64-bit register */
+#define MERCURY_REG_DMA_SRC_ADDR	0x10	/* Read/Write, 64-bit register */
+#define MERCURY_REG_DMA_DEST_ADDR	0x18	/* Read/Write, 64-bit register */
+#define MERCURY_REG_DMA_DW_LEN		0x20	/* Read/Write */
+#define MERCURY_REG_SCRATCH_REG0	0x24	/* Read/Write */
+#define MERCURY_REG_SCRATCH_REG1	0x1000	/* Read/Write */
+
+/* Bit positions of the STATUS register */
+enum mercury_status_bit {
+	MERCURY_STATUS_BIT_READY = 0,
+	MERCURY_STATUS_BIT_DMA_FROM_DEV_COMPLETE = 1,
+	MERCURY_STATUS_BIT_DMA_TO_DEV_COMPLETE = 2,
+	MERCURY_STATUS_BIT_DMA_MEMCPY_COMPLETE = 3,
+	MERCURY_STATUS_BIT_FORCE_INTERRUPT = 4,
+	MERCURY_STATUS_BIT_INVAL_DMA_SIZE = 5,
+	MERCURY_STATUS_BIT_DMA_ERROR = 6,
+	MERCURY_STATUS_BIT_CMD_ERR_INVAL_CMD = 7,
+	MERCURY_STATUS_BIT_CMD_ERR_DEV_NOT_READY = 8,
+};
+
+/* List of mercury commands that can be written into MERCURY_REG_COMMAND register */
+enum mercury_command {
+	MERCURY_COMMAND_RESET = 0,
+	MERCURY_COMMAND_TRIGGER_DMA_FROM_DEV = 1,
+	MERCURY_COMMAND_TRIGGER_DMA_TO_DEV = 2,
+	MERCURY_COMMAND_TRIGGER_DMA_MEMCPY = 3,
+	MERCURY_COMMAND_FORCE_INTERRUPT = 4,
+};
+
+static inline void mercury_write_reg64(void *bar0, uint32_t reg_off, uint64_t val)
+{
+	void *reg = bar0 + MERCURY_BASE_OFFSET + reg_off;
+
+	writeq(val, reg);
+}
+
+static inline void mercury_write_reg32(void *bar0, uint32_t reg_off, uint32_t val)
+{
+	void *reg = bar0 + MERCURY_BASE_OFFSET + reg_off;
+
+	writel(val, reg);
+}
+
+static inline uint32_t mercury_read_reg32(void *bar0, uint32_t reg_off)
+{
+	void *reg = bar0 + MERCURY_BASE_OFFSET + reg_off;
+
+	return readl(reg);
+}
+
+static inline uint64_t mercury_read_reg64(void *bar0, uint32_t reg_off)
+{
+	void *reg = bar0 + MERCURY_BASE_OFFSET + reg_off;
+
+	return readq(reg);
+}
+
+static inline uint64_t mercury_get_status(void *bar0)
+{
+	return mercury_read_reg64(bar0, MERCURY_REG_STATUS);
+}
+
+static inline void mercury_issue_command(void *bar0, enum mercury_command cmd)
+{
+	mercury_write_reg32(bar0, MERCURY_REG_COMMAND, cmd);
+}
+
+static inline void mercury_issue_reset(void *bar0)
+{
+	mercury_issue_command(bar0, MERCURY_COMMAND_RESET);
+}
+
+static inline void mercury_force_irq(void *bar0)
+{
+	mercury_issue_command(bar0, MERCURY_COMMAND_FORCE_INTERRUPT);
+}
+
+static inline void mercury_set_dma_size(void *bar0, size_t sz_bytes)
+{
+	/* Convert the DMA size from bytes to DWORDS, as accepted by the device */
+	size_t sz_dwords = sz_bytes / MERCURY_DMA_SIZE_STRIDE;
+
+	mercury_write_reg32(bar0, MERCURY_REG_DMA_DW_LEN, sz_dwords);
+}
+
+#endif /* SELFTEST_KVM_MERCURY_DEVICE_H */
diff --git a/tools/testing/selftests/kvm/vfio_irq_test.c b/tools/testing/selftests/kvm/vfio_irq_test.c
new file mode 100644
index 000000000000..1cdc6fee9e9a
--- /dev/null
+++ b/tools/testing/selftests/kvm/vfio_irq_test.c
@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "apic.h"
+#include "processor.h"
+#include "test_util.h"
+#include "kvm_util.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <semaphore.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <stdint.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/sysinfo.h>
+#include <time.h>
+
+#include <sys/eventfd.h>
+
+#include "vfio_pci_util.h"
+#include "mercury_device.h"
+
+#define MERCURY_GSI		32
+#define MERCURY_IRQ_VECTOR	0x80
+
+#define MERCURY_BAR0_GPA	0xc0000000ul
+#define MERCURY_BAR0_SLOT	10
+
+/* Shared variables. */
+static bool do_guest_irq = true;
+
+/* Guest-only variables, shared across vCPUs. */
+static int irqs_received;
+static int irqs_sent;
+
+/* Host-only variables, shared across threads. */
+static cpu_set_t possible_mask;
+static int min_cpu, max_cpu;
+static bool done;
+static struct kvm_vcpu *target_vcpu;
+static sem_t do_irq;
+
+static bool x2apic;
+
+static void guest_irq_handler(struct ex_regs *regs)
+{
+	WRITE_ONCE(irqs_received, irqs_received + 1);
+
+	if (x2apic)
+		x2apic_write_reg(APIC_EOI, 0);
+	else
+		xapic_write_reg(APIC_EOI, 0);
+}
+
+static void guest_nmi_handler(struct ex_regs *regs)
+{
+	WRITE_ONCE(irqs_received, irqs_received + 1);
+}
+
+#define GUEST_VERIFY_IRQS()							\
+do {										\
+	int __received;								\
+										\
+	__received = READ_ONCE(irqs_received);					\
+	__GUEST_ASSERT(__received == irqs_sent,					\
+			"Sent %u IRQ, received %u IRQs", irqs_sent, __received);\
+} while (0)
+
+#define GUEST_WAIT_FOR_IRQ()	\
+do {				\
+	safe_halt();		\
+	GUEST_VERIFY_IRQS();	\
+	cli();			\
+} while (0)
+
+static void guest_code(uint32_t vcpu_id)
+{
+	/* GPA is identity mapped. */
+	void *mercury_bar0 = (void *)MERCURY_BAR0_GPA;
+	uint64_t status;
+	int i;
+
+	cli();
+
+	if (x2apic) {
+		x2apic_enable();
+		GUEST_ASSERT(x2apic_read_reg(APIC_ID) == vcpu_id);
+	} else {
+		xapic_enable();
+		GUEST_ASSERT(xapic_read_reg(APIC_ID) >> 24 == vcpu_id);
+	}
+
+	if (vcpu_id == 0) {
+		irqs_sent++;
+		GUEST_ASSERT(READ_ONCE(do_guest_irq));
+		mercury_issue_reset(mercury_bar0);
+		GUEST_WAIT_FOR_IRQ();
+
+		status = mercury_get_status(mercury_bar0);
+		__GUEST_ASSERT(status & BIT(MERCURY_STATUS_BIT_READY),
+			"Expected device ready after reset");
+		GUEST_SYNC(irqs_received);
+	}
+
+	for ( ; !READ_ONCE(done); ) {
+		irqs_sent++;
+		if (READ_ONCE(do_guest_irq))
+			mercury_force_irq(mercury_bar0);
+		GUEST_WAIT_FOR_IRQ();
+		GUEST_SYNC(irqs_received);
+	}
+
+	sti_nop();
+
+	for (i = 0; i < 1000; i++) {
+		mercury_force_irq(mercury_bar0);
+		cpu_relax();
+	}
+
+	GUEST_VERIFY_IRQS();
+	GUEST_SYNC(irqs_received);
+}
+
+static void *irq_worker(void *mercury_bar0)
+{
+	struct kvm_vcpu *vcpu;
+
+	for (;;) {
+		sem_wait(&do_irq);
+
+		if (READ_ONCE(done))
+			break;
+
+		vcpu = READ_ONCE(target_vcpu);
+		while (!vcpu_get_stat(vcpu, blocking))
+			cpu_relax();
+
+		mercury_force_irq(mercury_bar0);
+	}
+	return NULL;
+}
+
+static int next_cpu(int cpu)
+{
+	/*
+	 * Advance to the next CPU, skipping those that weren't in the original
+	 * affinity set.  Sadly, there is no CPU_SET_FOR_EACH, and cpu_set_t's
+	 * data storage is considered as opaque.  Note, if this task is pinned
+	 * to a small set of discontiguous CPUs, e.g. 2 and 1023, this loop will
+	 * burn a lot cycles and the test will take longer than normal to
+	 * complete.
+	 */
+	do {
+		cpu++;
+		if (cpu > max_cpu) {
+			cpu = min_cpu;
+			TEST_ASSERT(CPU_ISSET(cpu, &possible_mask),
+				    "Min CPU = %d must always be usable", cpu);
+			break;
+		}
+	} while (!CPU_ISSET(cpu, &possible_mask));
+
+	return cpu;
+}
+
+static void *migration_worker(void *__guest_tid)
+{
+	pid_t guest_tid = (pid_t)(unsigned long)__guest_tid;
+	cpu_set_t allowed_mask;
+	int r, i, cpu;
+
+	CPU_ZERO(&allowed_mask);
+
+	for (i = 0, cpu = min_cpu; !READ_ONCE(done); i++, cpu = next_cpu(cpu)) {
+		CPU_SET(cpu, &allowed_mask);
+
+		r = sched_setaffinity(guest_tid, sizeof(allowed_mask), &allowed_mask);
+		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
+			    errno, strerror(errno));
+
+		CPU_CLR(cpu, &allowed_mask);
+
+		usleep((i % 10) + 10);
+	}
+	return NULL;
+}
+
+static void calc_min_max_cpu(void)
+{
+	int i, cnt, nproc;
+
+	TEST_REQUIRE(CPU_COUNT(&possible_mask) >= 2);
+
+	/*
+	 * CPU_SET doesn't provide a FOR_EACH helper, get the min/max CPU that
+	 * this task is affined to in order to reduce the time spent querying
+	 * unusable CPUs, e.g. if this task is pinned to a small percentage of
+	 * total CPUs.
+	 */
+	nproc = get_nprocs_conf();
+	min_cpu = -1;
+	max_cpu = -1;
+	cnt = 0;
+
+	for (i = 0; i < nproc; i++) {
+		if (!CPU_ISSET(i, &possible_mask))
+			continue;
+		if (min_cpu == -1)
+			min_cpu = i;
+		max_cpu = i;
+		cnt++;
+	}
+
+	__TEST_REQUIRE(cnt >= 2, "Only one usable CPU, task migration not possible");
+}
+
+static void sanity_check_mercury_device(struct vfio_pci_dev *dev, void *bar0)
+{
+	uint16_t vendor_id, device_id;
+	uint32_t version;
+
+	vendor_id = vfio_pci_get_vendor_id(dev);
+	device_id = vfio_pci_get_device_id(dev);
+
+	TEST_ASSERT(vendor_id == MERCURY_VENDOR_ID &&
+		    device_id == MERCURY_DEVICE_ID,
+		    "Mercury vendor-id/device-id mismatch.  "
+		    "Expected vendor: 0x%04x, device: 0x%04x.  "
+		    "Got vendor: 0x%04x, device: 0x%04x",
+		    MERCURY_VENDOR_ID, MERCURY_DEVICE_ID,
+		    vendor_id, device_id);
+
+	version = mercury_read_reg32(bar0, MERCURY_REG_VERSION);
+	TEST_ASSERT_EQ(version, MERCURY_ABI_VERSION);
+}
+
+static void set_empty_routing(struct kvm_vm *vm, struct kvm_irq_routing *routing)
+{
+	routing->nr = 0;
+	routing->entries[0].gsi = MERCURY_GSI;
+	routing->entries[0].type = KVM_IRQ_ROUTING_IRQCHIP;
+	routing->entries[0].flags = 0;
+	routing->entries[0].u.msi.address_lo = 0;
+	routing->entries[0].u.msi.address_hi = 0;
+	routing->entries[0].u.msi.data = 0xfe;
+	vm_ioctl(vm, KVM_SET_GSI_ROUTING, routing);
+}
+
+static void set_gsi_dest(struct kvm_vcpu *vcpu, struct kvm_irq_routing *routing,
+			 bool do_nmi)
+{
+	routing->nr = 1;
+	routing->entries[0].gsi = MERCURY_GSI;
+	routing->entries[0].type = KVM_IRQ_ROUTING_MSI;
+	routing->entries[0].flags = 0;
+	routing->entries[0].u.msi.address_lo = (vcpu->id << 12);
+	routing->entries[0].u.msi.address_hi = 0;
+	if (do_nmi)
+		routing->entries[0].u.msi.data = NMI_VECTOR | (4 << 8);
+	else
+		routing->entries[0].u.msi.data = MERCURY_IRQ_VECTOR;
+	vm_ioctl(vcpu->vm, KVM_SET_GSI_ROUTING, routing);
+}
+
+static void vcpu_run_and_verify(struct kvm_vcpu *vcpu, int nr_irqs)
+{
+	struct ucall uc;
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+	TEST_ASSERT_EQ(uc.args[1], nr_irqs);
+}
+
+int main(int argc, char *argv[])
+{
+	bool migrate = false, nmi = false, async = false, empty = false;
+	pthread_t migration_thread, irq_thread;
+	struct kvm_irq_routing *routing;
+	struct vfio_pci_dev *dev;
+	struct kvm_vcpu *vcpus[2];
+	int opt, r, eventfd, i;
+	int nr_irqs = 10000;
+	struct kvm_vm *vm;
+	uint64_t bar_size;
+	char *bdf = NULL;
+	void *bar;
+
+	sem_init(&do_irq, 0, 0);
+
+	while ((opt = getopt(argc, argv, "had:ei:mnx")) != -1) {
+		switch (opt) {
+		case 'a':
+			async = true;
+			break;
+		case 'd':
+			bdf = strdup(optarg);
+			break;
+		case 'e':
+			empty = true;
+			break;
+		case 'i':
+			nr_irqs = atoi_positive("Number of IRQs", optarg);
+			break;
+		case 'm':
+			migrate = true;
+			break;
+		case 'n':
+			nmi = true;
+			break;
+		case 'x':
+			x2apic = false;
+			break;
+		case 'h':
+		default:
+			pr_info("Usage: %s [-h] <-d pci-bdf>\n\n", argv[0]);
+			pr_info("\t-d: PCI Domain, Bus, Device, Function in the format DDDD:BB:DD.F\n");
+			pr_info("\t-h: print this help screen\n");
+			exit(KSFT_SKIP);
+		}
+	}
+
+	__TEST_REQUIRE(bdf, "Required argument -d <pci-bdf> missing");
+
+	dev = vfio_pci_init(bdf);
+	bar = vfio_pci_map_bar(dev, VFIO_PCI_BAR0_REGION_INDEX, &bar_size);
+	sanity_check_mercury_device(dev, bar);
+
+	vm = vm_create_with_vcpus(ARRAY_SIZE(vcpus), guest_code, vcpus);
+	vm_install_exception_handler(vm, MERCURY_IRQ_VECTOR, guest_irq_handler);
+	vm_install_exception_handler(vm, NMI_VECTOR, guest_nmi_handler);
+
+	vcpu_args_set(vcpus[0], 1, 0);
+	vcpu_args_set(vcpus[1], 1, 1);
+
+	virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+
+	vm_set_user_memory_region(vm, MERCURY_BAR0_SLOT, 0, MERCURY_BAR0_GPA,
+				  bar_size, bar);
+	virt_map(vm, MERCURY_BAR0_GPA, MERCURY_BAR0_GPA,
+		 vm_calc_num_guest_pages(VM_MODE_DEFAULT, bar_size));
+
+	routing = kvm_gsi_routing_create();
+
+	eventfd = kvm_new_eventfd();
+	vfio_pci_assign_msix(dev, eventfd);
+	kvm_assign_irqfd(vm, MERCURY_GSI, eventfd);
+
+	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
+	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
+		    strerror(errno));
+
+	if (migrate) {
+		calc_min_max_cpu();
+
+		pthread_create(&migration_thread, NULL, migration_worker,
+			       (void *)(unsigned long)syscall(SYS_gettid));
+	}
+
+	if (nmi || async)
+		pthread_create(&irq_thread, NULL, irq_worker, bar);
+
+	set_gsi_dest(vcpus[0], routing, false);
+	vcpu_run_and_verify(vcpus[0], 1);
+
+#if 0
+	/*
+	 * Hack if the user wants to manually mess with interrupt routing while
+	 * the test is running, e.g. by modifying smp_affinity in the host.
+	 */
+	for (i = 1; i < nr_irqs; i++) {
+		usleep(1000 * 1000);
+		vcpu_run_and_verify(vcpus[0], i + 1);
+	}
+#endif
+
+	for (i = 1; i < nr_irqs; i++) {
+		struct kvm_vcpu *vcpu = vcpus[!!(i & BIT(1))];
+		const bool do_nmi = nmi && (i & BIT(2));
+		const bool do_empty = empty && (i & BIT(3));
+		const bool do_async = nmi || async;
+
+		if (do_empty)
+			set_empty_routing(vm, routing);
+
+		set_gsi_dest(vcpu, routing, do_nmi);
+
+		WRITE_ONCE(do_guest_irq, !do_async);
+		sync_global_to_guest(vm, do_guest_irq);
+
+		if (do_async) {
+			WRITE_ONCE(target_vcpu, vcpu);
+			sem_post(&do_irq);
+		}
+
+		vcpu_run_and_verify(vcpu, i + 1);
+	}
+
+	WRITE_ONCE(done, true);
+	sync_global_to_guest(vm, done);
+	sem_post(&do_irq);
+
+	for (i = 0; empty && i < ARRAY_SIZE(vcpus); i++) {
+		struct kvm_vcpu *vcpu = vcpus[i];
+
+		if (!i)
+			set_gsi_dest(vcpu, routing, false);
+		set_empty_routing(vm, routing);
+		vcpu_run_and_verify(vcpu, nr_irqs);
+	}
+
+	set_gsi_dest(vcpus[0], routing, false);
+
+	if (migrate)
+		pthread_join(migration_thread, NULL);
+
+	if (nmi || async)
+		pthread_join(irq_thread, NULL);
+
+	r = munmap(bar, bar_size);
+	TEST_ASSERT(!r, __KVM_SYSCALL_ERROR("munmap()", r));
+
+	vfio_pci_free(dev);
+
+	return 0;
+}
-- 
2.49.0.504.g3bcea36a83-goog




* Re: [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  2025-04-04 19:38 ` [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
@ 2025-04-08 12:26   ` Joerg Roedel
  0 siblings, 0 replies; 128+ messages in thread
From: Joerg Roedel @ 2025-04-08 12:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, David Woodhouse, Lu Baolu, kvm, iommu,
	linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

On Fri, Apr 04, 2025 at 12:38:59PM -0700, Sean Christopherson wrote:
> @@ -3974,8 +3974,10 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
>  					APICID_TO_IRTE_DEST_LO(cpu);
>  		entry->hi.fields.destination =
>  					APICID_TO_IRTE_DEST_HI(cpu);
> +		entry->lo.fields_vapic.is_run = true;
> +	} else {
> +		entry->lo.fields_vapic.is_run = false;
>  	}
> -	entry->lo.fields_vapic.is_run = is_run;

This change in the calling convention deserves a comment above the
function, describing that cpu < 0 marks the CPU as not running.

Regards,

	Joerg
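
The convention in question — a negative @cpu meaning "the vCPU is not running" — can be sketched outside the kernel as follows. The struct and function names here are simplified stand-ins for illustration, not the real IOMMU types (those live in drivers/iommu/amd/); the point is only that IsRun is inferred from the validity of the pCPU destination instead of being passed as a separate flag.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified stand-in for the IRTE fields touched by amd_iommu_update_ga();
 * the real union/bitfield layout is in drivers/iommu/amd/amd_iommu_types.h.
 */
struct fake_irte {
	int destination;	/* target pCPU for the posted interrupt */
	bool is_run;		/* IsRun: vCPU is running on 'destination' */
};

/*
 * A negative @cpu marks the vCPU as not running; the destination fields
 * are only updated when there is a valid pCPU to write.
 */
static void fake_update_ga(int cpu, struct fake_irte *entry)
{
	if (cpu >= 0) {
		entry->destination = cpu;
		entry->is_run = true;
	} else {
		entry->is_run = false;
	}
}
```

A comment of exactly this shape above the function would capture the convention Joerg asks for.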


* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (66 preceding siblings ...)
  2025-04-04 19:39 ` [PATCH 67/67] *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test Sean Christopherson
@ 2025-04-08 12:44 ` Joerg Roedel
  2025-04-09  8:30   ` Vasant Hegde
  2025-04-08 15:36 ` Paolo Bonzini
                   ` (3 subsequent siblings)
  71 siblings, 1 reply; 128+ messages in thread
From: Joerg Roedel @ 2025-04-08 12:44 UTC (permalink / raw)
  To: Sean Christopherson, suravee.suthikulpanit, vasant.hegde
  Cc: Paolo Bonzini, David Woodhouse, Lu Baolu, kvm, iommu,
	linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

Hey Sean,

On Fri, Apr 04, 2025 at 12:38:15PM -0700, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>        general.  This needs more testing on AMD with device posted IRQs.

Thanks for posting this, it fixes quite a few issues in the posted IRQ
implementation. I skimmed through the AMD IOMMU changes and besides some
small things didn't spot anything worrisome.

Adding Suravee and Vasant from AMD to this thread for deeper review
(also of the KVM parts) and testing.

Thanks,

	Joerg


* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (67 preceding siblings ...)
  2025-04-08 12:44 ` [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Joerg Roedel
@ 2025-04-08 15:36 ` Paolo Bonzini
  2025-04-08 17:13 ` David Matlack
                   ` (2 subsequent siblings)
  71 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 15:36 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:38, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>         general.  This needs more testing on AMD with device posted IRQs.
> 
> This applies on the small series that adds a enable_device_posted_irqs
> module param (the prep work for that is also prep work for this):
> 
>     https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com
> 
> Fix a variety of bugs related to device posted IRQs, especially on the
> AMD side, and clean up KVM's implementation, which IMO is in the running
> for Most Convoluted Code in KVM.
> 
> Stating the obvious, this series is comically large.  I'm posting it as a
> single series, at least for the first round of reviews, to build the
> (mostly) full picture of the end goal (it's not the true end goal; there's
> still more cleanups that can be done).  And because properly testing most
> of the code would be futile until almost the end of the series (so. many.
> bugs.).
> 
> Batch #1 (patches 1-10) fixes bugs of varying severity.

I started reviewing these; I guess patches 1-7 could be queued for 6.15?
And maybe also patch 2 from
https://lore.kernel.org/all/20250401161804.842968-1-seanjc@google.com/.

Paolo

> Batch #2 is mostly SVM specific:
> 
>   - Cleans up various warts and bugs in the IRTE tracking
>   - Fixes AVIC to not reject large VMs (honor KVM's ABI)
>   - Wire up AVIC to enable_ipiv to support disabling IPI virtualization while
>     still utilizing device posted interrupts, and to work around erratum #1235.
> 
> Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
> of the logic to common x86; only the code that needs to communicate with the
> IOMMU is truly vendor specific.
> 
> Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.
> 
> Batch #5 adds WARNs and drops dead code after all the previous cleanups and
> fixes (I don't want to add the WARNs earlier; I don't see any point in adding
> WARNs in code that's known to be broken).
> 
> Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
> IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
> event.
> 
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.
> 
> The device the selftest (see the last patch) uses is an internal test VF
> that's hosted on a smart NIC using non-production (test-only) firmware.
> Unfortunately, only some of our developer systems have the right NIC, and
> for unknown reasons I couldn't get the test firmware to install cleanly on
> Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> I could test the KVM flows, but the IOMMU was having none of my attempts
> to force enable APIC virtualization against its will.
> 
> Through hackery (see the penultimate patch), I was able to gain a decent
> amount of confidence in the IOMMU changes (and the interface between KVM
> and the IOMMU).
> 
> For initial development of the series, I also cobbled together a "mock"
> IRQ bypass device, to allow testing in a VM.
> 
>    https://github.com/sean-jc/linux.git x86/mock_irqbypass_producer
> 
> Note, the diffstat is misleading due to the last two DO NOT MERGE patches
> adding 1k+ LoC.  Without those, this series removes ~80 LoC (substantially
> more if comments are ignored).
> 
>    21 files changed, 577 insertions(+), 655 deletions(-)
> 
> Maxim Levitsky (2):
>    KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
>    KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
> 
> Sean Christopherson (65):
>    KVM: SVM: Allocate IR data using atomic allocation
>    KVM: x86: Reset IRTE to host control if *new* route isn't postable
>    KVM: x86: Explicitly treat routing entry type changes as changes
>    KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
>    iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
>    iommu/amd: WARN if KVM attempts to set vCPU affinity without posted
>      interrupts
>    KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added
>    KVM: x86: Pass new routing entries and irqfd when updating IRTEs
>    KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
>    KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
>    KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
>      routing
>    KVM: SVM: Drop pointless masking of default APIC base when setting
>      V_APIC_BAR
>    KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
>      masks
>    KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
>    KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
>    KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
>      creation
>    KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
>    KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
>      pages"
>    KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
>    KVM: VMX: Move enable_ipiv knob to common x86
>    KVM: VMX: Suppress PI notifications whenever the vCPU is put
>    KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
>      IRQ blocking
>    iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
>    iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
>    iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
>    KVM: SVM: Get vCPU info for IRTE using new routing entry
>    KVM: SVM: Stop walking list of routing table entries when updating
>      IRTE
>    KVM: VMX: Stop walking list of routing table entries when updating
>      IRTE
>    KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
>    KVM: x86: Nullify irqfd->producer after updating IRTEs
>    KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
>    KVM: x86: Move posted interrupt tracepoint to common code
>    KVM: SVM: Clean up return handling in avic_pi_update_irte()
>    iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
>      structs
>    KVM: Don't WARN if updating IRQ bypass route fails
>    KVM: Fold kvm_arch_irqfd_route_changed() into
>      kvm_arch_update_irqfd_routing()
>    KVM: x86: Track irq_bypass_vcpu in common x86 code
>    KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
>      targeted
>    KVM: x86: Don't update IRTE entries when old and new routes were !MSI
>    KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
>      metadata
>    KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
>    iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
>    iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
>    iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
>    iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
>      is inhibited
>    KVM: SVM: Don't check for assigned device(s) when updating affinity
>    KVM: SVM: Don't check for assigned device(s) when activating AVIC
>    KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
>    KVM: SVM: Process all IRTEs on affinity change even if one update
>      fails
>    KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
>    KVM: x86: Drop superfluous "has assigned device" check in
>      kvm_pi_update_irte()
>    KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
>    KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
>      local APIC
>    KVM: SVM: WARN if ir_list is non-empty at vCPU free
>    KVM: x86: Decouple device assignment from IRQ bypass
>    KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
>      bypass
>    KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
>    iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
>      support
>    KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
>    KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
>    KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
>    iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
>    KVM: SVM: Generate GA log IRQs only if the associated vCPU is
>      blocking
>    *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support
>    *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test
> 
>   arch/x86/include/asm/irq_remapping.h          |  17 +-
>   arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>   arch/x86/include/asm/kvm_host.h               |  20 +-
>   arch/x86/include/asm/svm.h                    |  13 +-
>   arch/x86/kvm/svm/avic.c                       | 707 ++++++++----------
>   arch/x86/kvm/svm/svm.c                        |   6 +
>   arch/x86/kvm/svm/svm.h                        |  24 +-
>   arch/x86/kvm/trace.h                          |  19 +-
>   arch/x86/kvm/vmx/capabilities.h               |   1 -
>   arch/x86/kvm/vmx/main.c                       |   2 +-
>   arch/x86/kvm/vmx/posted_intr.c                | 150 ++--
>   arch/x86/kvm/vmx/posted_intr.h                |  11 +-
>   arch/x86/kvm/vmx/vmx.c                        |   2 -
>   arch/x86/kvm/x86.c                            | 124 ++-
>   drivers/iommu/amd/amd_iommu_types.h           |   1 -
>   drivers/iommu/amd/init.c                      |   8 +-
>   drivers/iommu/amd/iommu.c                     | 171 +++--
>   drivers/iommu/intel/irq_remapping.c           |  10 +-
>   include/linux/amd-iommu.h                     |  25 +-
>   include/linux/kvm_host.h                      |   9 +-
>   include/linux/kvm_irqfd.h                     |   4 +
>   tools/testing/selftests/kvm/Makefile.kvm      |   2 +
>   .../selftests/kvm/include/vfio_pci_util.h     | 149 ++++
>   .../selftests/kvm/include/x86/processor.h     |  21 +
>   .../testing/selftests/kvm/lib/vfio_pci_util.c | 201 +++++
>   tools/testing/selftests/kvm/mercury_device.h  | 118 +++
>   tools/testing/selftests/kvm/vfio_irq_test.c   | 429 +++++++++++
>   virt/kvm/eventfd.c                            |  22 +-
>   28 files changed, 1610 insertions(+), 658 deletions(-)
>   create mode 100644 tools/testing/selftests/kvm/include/vfio_pci_util.h
>   create mode 100644 tools/testing/selftests/kvm/lib/vfio_pci_util.c
>   create mode 100644 tools/testing/selftests/kvm/mercury_device.h
>   create mode 100644 tools/testing/selftests/kvm/vfio_irq_test.c
> 
> 
> base-commit: 5f9f498ea14ffe15390aa46fb85375e7c901bce3



* Re: [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE
  2025-04-04 19:38 ` [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
@ 2025-04-08 16:56   ` Paolo Bonzini
  0 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 16:56 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:38, Sean Christopherson wrote:
> Now that KVM SVM simply uses the provided routing entry, stop walking the
> routing table to find that entry.  KVM, via setup_routing_entry() and
> sanity checked by kvm_get_msi_route(), disallows having a GSI configured
> to trigger multiple MSIs.

I would squash this with the previous patch.  It's not large when shown 
with -b, and the coexistence of "e" and "new" after patch 28 is weird.

Paolo

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c | 106 ++++++++++++++++------------------------
>   1 file changed, 43 insertions(+), 63 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index eb6017b01c5f..685a7b01194b 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -852,10 +852,10 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			unsigned int host_irq, uint32_t guest_irq,
>   			struct kvm_kernel_irq_routing_entry *new)
>   {
> -	struct kvm_kernel_irq_routing_entry *e;
> -	struct kvm_irq_routing_table *irq_rt;
>   	bool enable_remapped_mode = true;
> -	int idx, ret = 0;
> +	struct vcpu_data vcpu_info;
> +	struct vcpu_svm *svm = NULL;
> +	int ret = 0;
>   
>   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
>   		return 0;
> @@ -869,70 +869,51 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
>   		 __func__, host_irq, guest_irq, !!new);
>   
> -	idx = srcu_read_lock(&kvm->irq_srcu);
> -	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> +	/**
> +	 * Here, we setup with legacy mode in the following cases:
> +	 * 1. When cannot target interrupt to a specific vcpu.
> +	 * 2. Unsetting posted interrupt.
> +	 * 3. APIC virtualization is disabled for the vcpu.
> +	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
> +	 */
> +	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
> +	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
> +	    kvm_vcpu_apicv_active(&svm->vcpu)) {
> +		struct amd_iommu_pi_data pi;
>   
> -	if (guest_irq >= irq_rt->nr_rt_entries ||
> -		hlist_empty(&irq_rt->map[guest_irq])) {
> -		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
> -			     guest_irq, irq_rt->nr_rt_entries);
> -		goto out;
> -	}
> +		enable_remapped_mode = false;
>   
> -	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
> -		struct vcpu_data vcpu_info;
> -		struct vcpu_svm *svm = NULL;
> -
> -		if (e->type != KVM_IRQ_ROUTING_MSI)
> -			continue;
> -
> -		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> +		/*
> +		 * Try to enable guest_mode in IRTE.  Note, the address
> +		 * of the vCPU's AVIC backing page is passed to the
> +		 * IOMMU via vcpu_info->pi_desc_addr.
> +		 */
> +		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
> +					     svm->vcpu.vcpu_id);
> +		pi.is_guest_mode = true;
> +		pi.vcpu_data = &vcpu_info;
> +		ret = irq_set_vcpu_affinity(host_irq, &pi);
>   
>   		/**
> -		 * Here, we setup with legacy mode in the following cases:
> -		 * 1. When cannot target interrupt to a specific vcpu.
> -		 * 2. Unsetting posted interrupt.
> -		 * 3. APIC virtualization is disabled for the vcpu.
> -		 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
> +		 * Here, we successfully setting up vcpu affinity in
> +		 * IOMMU guest mode. Now, we need to store the posted
> +		 * interrupt information in a per-vcpu ir_list so that
> +		 * we can reference to them directly when we update vcpu
> +		 * scheduling information in IOMMU irte.
>   		 */
> -		if (new && !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
> -		    kvm_vcpu_apicv_active(&svm->vcpu)) {
> -			struct amd_iommu_pi_data pi;
> -
> -			enable_remapped_mode = false;
> -
> -			/*
> -			 * Try to enable guest_mode in IRTE.  Note, the address
> -			 * of the vCPU's AVIC backing page is passed to the
> -			 * IOMMU via vcpu_info->pi_desc_addr.
> -			 */
> -			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
> -						     svm->vcpu.vcpu_id);
> -			pi.is_guest_mode = true;
> -			pi.vcpu_data = &vcpu_info;
> -			ret = irq_set_vcpu_affinity(host_irq, &pi);
> -
> -			/**
> -			 * Here, we successfully setting up vcpu affinity in
> -			 * IOMMU guest mode. Now, we need to store the posted
> -			 * interrupt information in a per-vcpu ir_list so that
> -			 * we can reference to them directly when we update vcpu
> -			 * scheduling information in IOMMU irte.
> -			 */
> -			if (!ret && pi.is_guest_mode)
> -				svm_ir_list_add(svm, irqfd, &pi);
> -		}
> -
> -		if (!ret && svm) {
> -			trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
> -						 e->gsi, vcpu_info.vector,
> -						 vcpu_info.pi_desc_addr, !!new);
> -		}
> -
> -		if (ret < 0) {
> -			pr_err("%s: failed to update PI IRTE\n", __func__);
> -			goto out;
> -		}
> +		if (!ret)
> +			ret = svm_ir_list_add(svm, irqfd, &pi);
> +	}
> +
> +	if (!ret && svm) {
> +		trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
> +					 guest_irq, vcpu_info.vector,
> +					 vcpu_info.pi_desc_addr, !!new);
> +	}
> +
> +	if (ret < 0) {
> +		pr_err("%s: failed to update PI IRTE\n", __func__);
> +		goto out;
>   	}
>   
>   	if (enable_remapped_mode)
> @@ -940,7 +921,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	else
>   		ret = 0;
>   out:
> -	srcu_read_unlock(&kvm->irq_srcu, idx);
>   	return ret;
>   }
>   



* Re: [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  2025-04-04 19:38 ` [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
@ 2025-04-08 16:57   ` Paolo Bonzini
  2025-04-08 22:25     ` Sean Christopherson
  2025-04-18 12:25   ` Vasant Hegde
  1 sibling, 1 reply; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 16:57 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:38, Sean Christopherson wrote:
> Delete the amd_ir_data.prev_ga_tag field now that all usage is
> superfluous.

This can be moved much earlier (maybe even after patch 10 from a cursory 
look), can't it?  I'd do that to clarify what has been cleaned up at 
which point.

Paolo

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c             |  2 --
>   drivers/iommu/amd/amd_iommu_types.h |  1 -
>   drivers/iommu/amd/iommu.c           | 10 ----------
>   include/linux/amd-iommu.h           |  1 -
>   4 files changed, 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 9024b9fbca53..7f0f6a9cd2e8 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -943,9 +943,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		/**
>   		 * Here, pi is used to:
>   		 * - Tell IOMMU to use legacy mode for this interrupt.
> -		 * - Retrieve ga_tag of prior interrupt remapping data.
>   		 */
> -		pi.prev_ga_tag = 0;
>   		pi.is_guest_mode = false;
>   		ret = irq_set_vcpu_affinity(host_irq, &pi);
>   	} else {
> diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
> index 23caea22f8dc..319a1b650b3b 100644
> --- a/drivers/iommu/amd/amd_iommu_types.h
> +++ b/drivers/iommu/amd/amd_iommu_types.h
> @@ -1060,7 +1060,6 @@ struct irq_2_irte {
>   };
>   
>   struct amd_ir_data {
> -	u32 cached_ga_tag;
>   	struct amd_iommu *iommu;
>   	struct irq_2_irte irq_2_irte;
>   	struct msi_msg msi_entry;
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 635774642b89..3c40bc9980b7 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3858,23 +3858,13 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>   	ir_data->cfg = irqd_cfg(data);
>   	pi_data->ir_data = ir_data;
>   
> -	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
>   	if (pi_data->is_guest_mode) {
>   		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
>   		ir_data->ga_vector = vcpu_pi_info->vector;
>   		ir_data->ga_tag = pi_data->ga_tag;
>   		ret = amd_iommu_activate_guest_mode(ir_data);
> -		if (!ret)
> -			ir_data->cached_ga_tag = pi_data->ga_tag;
>   	} else {
>   		ret = amd_iommu_deactivate_guest_mode(ir_data);
> -
> -		/*
> -		 * This communicates the ga_tag back to the caller
> -		 * so that it can do all the necessary clean up.
> -		 */
> -		if (!ret)
> -			ir_data->cached_ga_tag = 0;
>   	}
>   
>   	return ret;
> diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
> index 4f433ef39188..deeefc92a5cf 100644
> --- a/include/linux/amd-iommu.h
> +++ b/include/linux/amd-iommu.h
> @@ -19,7 +19,6 @@ struct amd_iommu;
>    */
>   struct amd_iommu_pi_data {
>   	u32 ga_tag;
> -	u32 prev_ga_tag;
>   	bool is_guest_mode;
>   	struct vcpu_data *vcpu_data;
>   	void *ir_data;



* Re: [PATCH 30/67] KVM: VMX: Stop walking list of routing table entries when updating IRTE
  2025-04-04 19:38 ` [PATCH 30/67] KVM: VMX: " Sean Christopherson
@ 2025-04-08 17:00   ` Paolo Bonzini
  2025-05-20 20:36     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 17:00 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:38, Sean Christopherson wrote:
> Now that KVM provides the to-be-updated routing entry, stop walking the
> routing table to find that entry.  KVM, via setup_routing_entry() and
> sanity checked by kvm_get_msi_route(), disallows having a GSI configured
> to trigger multiple MSIs, i.e. the for-loop can only process one entry.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/vmx/posted_intr.c | 100 +++++++++++----------------------
>   1 file changed, 33 insertions(+), 67 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 00818ca30ee0..786912cee3f8 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -268,78 +268,44 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		       unsigned int host_irq, uint32_t guest_irq,
>   		       struct kvm_kernel_irq_routing_entry *new)
>   {
> -	struct kvm_kernel_irq_routing_entry *e;
> -	struct kvm_irq_routing_table *irq_rt;
> -	bool enable_remapped_mode = true;
>   	struct kvm_lapic_irq irq;
>   	struct kvm_vcpu *vcpu;
>   	struct vcpu_data vcpu_info;
> -	bool set = !!new;
> -	int idx, ret = 0;
>   
>   	if (!vmx_can_use_vtd_pi(kvm))
>   		return 0;
>   
> -	idx = srcu_read_lock(&kvm->irq_srcu);
> -	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> -	if (guest_irq >= irq_rt->nr_rt_entries ||
> -	    hlist_empty(&irq_rt->map[guest_irq])) {
> -		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
> -			     guest_irq, irq_rt->nr_rt_entries);
> -		goto out;
> -	}
> -
> -	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
> -		if (e->type != KVM_IRQ_ROUTING_MSI)
> -			continue;
> -
> -		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));

Alternatively, if you want to keep patches 28/29 separate, you could add 
this WARN_ON_ONCE to avic.c in the exact same place after checking 
e->type -- not so much for asserting purposes, but more to document 
what's going on for the reviewer.

Paolo



* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (68 preceding siblings ...)
  2025-04-08 15:36 ` Paolo Bonzini
@ 2025-04-08 17:13 ` David Matlack
  2025-05-23 23:52   ` David Matlack
  2025-04-18 13:01 ` David Woodhouse
  2025-05-15 12:08 ` Sairaj Kodilkar
  71 siblings, 1 reply; 128+ messages in thread
From: David Matlack @ 2025-04-08 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins

On Fri, Apr 4, 2025 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.
>
> The device the selftest (see the last patch) uses is an internel test VF
> that's hosted on a smart NIC using non-production (test-only) firmware.
> Unfortunately, only some of our developer systems have the right NIC, and
> for unknown reasons I couldn't get the test firmware to install cleanly on
> Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> I could test the KVM flows, but the IOMMU was having none of my attempts
> to force enable APIC virtualization against its will.

(Sean already knows this but just sharing for the broader visibility.)

I am working on a VFIO selftests framework and helper library that we
can link into the KVM selftests to make this kind of testing much
easier. It will include a driver framework so we can support testing
against different devices in a common way. Developers/companies can
carry their own out-of-tree drivers for non-standard/custom test
devices, e.g. the "Mercury device" used in this series.

I will send an RFC in the coming weeks. If/when my proposal is merged,
then I think we'll have a clean way to get the vfio_irq_test merged
upstream as well.


* Re: [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-04-04 19:38 ` [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
@ 2025-04-08 17:30   ` Paolo Bonzini
  2025-04-08 20:51     ` Sean Christopherson
  2025-04-24  4:39   ` Sairaj Kodilkar
  1 sibling, 1 reply; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 17:30 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:38, Sean Christopherson wrote:
> Hoist the logic for identifying the target vCPU for a posted interrupt
> into common x86.  The code is functionally identical between Intel and
> AMD.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  3 +-
>   arch/x86/kvm/svm/avic.c         | 83 ++++++++-------------------------
>   arch/x86/kvm/svm/svm.h          |  3 +-
>   arch/x86/kvm/vmx/posted_intr.c  | 56 ++++++----------------
>   arch/x86/kvm/vmx/posted_intr.h  |  3 +-
>   arch/x86/kvm/x86.c              | 46 +++++++++++++++---

Please use irq.c, since (for once) there is a file other than x86.c that 
can be used.

Bonus points for merging irq_comm.c into irq.c (IIRC irq_comm.c was 
"common" between ia64 and x86 :)).

Paolo

>   6 files changed, 81 insertions(+), 113 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 85f45fc5156d..cb98d8d3c6c2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1838,7 +1838,8 @@ struct kvm_x86_ops {
>   
>   	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			      unsigned int host_irq, uint32_t guest_irq,
> -			      struct kvm_kernel_irq_routing_entry *new);
> +			      struct kvm_kernel_irq_routing_entry *new,
> +			      struct kvm_vcpu *vcpu, u32 vector);
>   	void (*pi_start_assignment)(struct kvm *kvm);
>   	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
>   	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index ea6eae72b941..666f518340a7 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -812,52 +812,13 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
>   	return 0;
>   }
>   
> -/*
> - * Note:
> - * The HW cannot support posting multicast/broadcast
> - * interrupts to a vCPU. So, we still use legacy interrupt
> - * remapping for these kind of interrupts.
> - *
> - * For lowest-priority interrupts, we only support
> - * those with single CPU as the destination, e.g. user
> - * configures the interrupts via /proc/irq or uses
> - * irqbalance to make the interrupts single-CPU.
> - */
> -static int
> -get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
> -		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
> -{
> -	struct kvm_lapic_irq irq;
> -	*vcpu = NULL;
> -
> -	kvm_set_msi_irq(kvm, e, &irq);
> -
> -	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
> -	    !kvm_irq_is_postable(&irq)) {
> -		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
> -			 __func__, irq.vector);
> -		return -1;
> -	}
> -
> -	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
> -		 irq.vector);
> -	vcpu_info->vector = irq.vector;
> -
> -	return 0;
> -}
> -
>   int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			unsigned int host_irq, uint32_t guest_irq,
> -			struct kvm_kernel_irq_routing_entry *new)
> +			struct kvm_kernel_irq_routing_entry *new,
> +			struct kvm_vcpu *vcpu, u32 vector)
>   {
> -	bool enable_remapped_mode = true;
> -	struct vcpu_data vcpu_info;
> -	struct kvm_vcpu *vcpu = NULL;
>   	int ret = 0;
>   
> -	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> -		return 0;
> -
>   	/*
>   	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
>   	 * from the *previous* vCPU's list.
> @@ -865,7 +826,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	svm_ir_list_del(irqfd);
>   
>   	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
> -		 __func__, host_irq, guest_irq, !!new);
> +		 __func__, host_irq, guest_irq, !!vcpu);
>   
>   	/**
>   	 * Here, we setup with legacy mode in the following cases:
> @@ -874,23 +835,23 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	 * 3. APIC virtualization is disabled for the vcpu.
>   	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
>   	 */
> -	if (new && new && new->type == KVM_IRQ_ROUTING_MSI &&
> -	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
> -	    kvm_vcpu_apicv_active(vcpu)) {
> -		struct amd_iommu_pi_data pi;
> -
> -		enable_remapped_mode = false;
> -
> -		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
> -
> +	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
>   		/*
>   		 * Try to enable guest_mode in IRTE.  Note, the address
>   		 * of the vCPU's AVIC backing page is passed to the
>   		 * IOMMU via vcpu_info->pi_desc_addr.
>   		 */
> -		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
> -		pi.is_guest_mode = true;
> -		pi.vcpu_data = &vcpu_info;
> +		struct vcpu_data vcpu_info = {
> +			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
> +			.vector = vector,
> +		};
> +
> +		struct amd_iommu_pi_data pi = {
> +			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
> +			.is_guest_mode = true,
> +			.vcpu_data = &vcpu_info,
> +		};
> +
>   		ret = irq_set_vcpu_affinity(host_irq, &pi);
>   
>   		/**
> @@ -902,12 +863,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		 */
>   		if (!ret)
>   			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
> -	}
>   
> -	if (!ret && vcpu) {
> -		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
> -					 guest_irq, vcpu_info.vector,
> -					 vcpu_info.pi_desc_addr, !!new);
> +		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> +					 vector, vcpu_info.pi_desc_addr, true);
> +	} else {
> +		ret = irq_set_vcpu_affinity(host_irq, NULL);
>   	}
>   
>   	if (ret < 0) {
> @@ -915,10 +875,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		goto out;
>   	}
>   
> -	if (enable_remapped_mode)
> -		ret = irq_set_vcpu_affinity(host_irq, NULL);
> -	else
> -		ret = 0;
> +	ret = 0;
>   out:
>   	return ret;
>   }
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 6ad0aa86f78d..5ce240085ee0 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -741,7 +741,8 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
>   void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
>   int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			unsigned int host_irq, uint32_t guest_irq,
> -			struct kvm_kernel_irq_routing_entry *new);
> +			struct kvm_kernel_irq_routing_entry *new,
> +			struct kvm_vcpu *vcpu, u32 vector);
>   void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
>   void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
>   void avic_ring_doorbell(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 786912cee3f8..fd5f6a125614 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -266,46 +266,20 @@ void vmx_pi_start_assignment(struct kvm *kvm)
>   
>   int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		       unsigned int host_irq, uint32_t guest_irq,
> -		       struct kvm_kernel_irq_routing_entry *new)
> +		       struct kvm_kernel_irq_routing_entry *new,
> +		       struct kvm_vcpu *vcpu, u32 vector)
>   {
> -	struct kvm_lapic_irq irq;
> -	struct kvm_vcpu *vcpu;
> -	struct vcpu_data vcpu_info;
> -
> -	if (!vmx_can_use_vtd_pi(kvm))
> -		return 0;
> -
> -	/*
> -	 * VT-d PI cannot support posting multicast/broadcast
> -	 * interrupts to a vCPU, we still use interrupt remapping
> -	 * for these kind of interrupts.
> -	 *
> -	 * For lowest-priority interrupts, we only support
> -	 * those with single CPU as the destination, e.g. user
> -	 * configures the interrupts via /proc/irq or uses
> -	 * irqbalance to make the interrupts single-CPU.
> -	 *
> -	 * We will support full lowest-priority interrupt later.
> -	 *
> -	 * In addition, we can only inject generic interrupts using
> -	 * the PI mechanism, refuse to route others through it.
> -	 */
> -	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
> -		goto do_remapping;
> -
> -	kvm_set_msi_irq(kvm, new, &irq);
> -
> -	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> -	    !kvm_irq_is_postable(&irq))
> -		goto do_remapping;
> -
> -	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
> -	vcpu_info.vector = irq.vector;
> -
> -	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> -				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
> -
> -	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
> -do_remapping:
> -	return irq_set_vcpu_affinity(host_irq, NULL);
> +	if (vcpu) {
> +		struct vcpu_data vcpu_info = {
> +			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
> +			.vector = vector,
> +		};
> +
> +		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> +					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
> +
> +		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
> +	} else {
> +		return irq_set_vcpu_affinity(host_irq, NULL);
> +	}
>   }
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index a586d6aaf862..ee3e19e976ac 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -15,7 +15,8 @@ void __init pi_init_cpu(int cpu);
>   bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		       unsigned int host_irq, uint32_t guest_irq,
> -		       struct kvm_kernel_irq_routing_entry *new);
> +		       struct kvm_kernel_irq_routing_entry *new,
> +		       struct kvm_vcpu *vcpu, u32 vector);
>   void vmx_pi_start_assignment(struct kvm *kvm);
>   
>   static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b8b259847d05..0ab818bba743 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13567,6 +13567,43 @@ bool kvm_arch_has_irq_bypass(void)
>   }
>   EXPORT_SYMBOL_GPL(kvm_arch_has_irq_bypass);
>   
> +static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
> +			      struct kvm_kernel_irq_routing_entry *old,
> +			      struct kvm_kernel_irq_routing_entry *new)
> +{
> +	struct kvm *kvm = irqfd->kvm;
> +	struct kvm_vcpu *vcpu = NULL;
> +	struct kvm_lapic_irq irq;
> +
> +	if (!irqchip_in_kernel(kvm) ||
> +	    !kvm_arch_has_irq_bypass() ||
> +	    !kvm_arch_has_assigned_device(kvm))
> +		return 0;
> +
> +	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
> +		kvm_set_msi_irq(kvm, new, &irq);
> +
> +		/*
> +		 * Force remapped mode if hardware doesn't support posting the
> +		 * virtual interrupt to a vCPU.  Only IRQs are postable (NMIs,
> +		 * SMIs, etc. are not), and neither AMD nor Intel IOMMUs support
> +		 * posting multicast/broadcast IRQs.  If the interrupt can't be
> +		 * posted, the device MSI needs to be routed to the host so that
> +		 * the guest's desired interrupt can be synthesized by KVM.
> +		 *
> +		 * This means that KVM can only post lowest-priority interrupts
> +		 * if they have a single CPU as the destination, e.g. only if
> +		 * the guest has affined the interrupt to a single vCPU.
> +		 */
> +		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> +		    !kvm_irq_is_postable(&irq))
> +			vcpu = NULL;
> +	}
> +
> +	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
> +					    irqfd->gsi, new, vcpu, irq.vector);
> +}
> +
>   int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
>   				      struct irq_bypass_producer *prod)
>   {
> @@ -13581,8 +13618,7 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
>   	irqfd->producer = prod;
>   
>   	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> -		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> -						   irqfd->gsi, &irqfd->irq_entry);
> +		ret = kvm_pi_update_irte(irqfd, NULL, &irqfd->irq_entry);
>   		if (ret)
>   			kvm_arch_end_assignment(irqfd->kvm);
>   	}
> @@ -13610,8 +13646,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
>   	spin_lock_irq(&kvm->irqfds.lock);
>   
>   	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> -		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> -						   irqfd->gsi, NULL);
> +		ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry, NULL);
>   		if (ret)
>   			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
>   				irqfd->consumer.token, ret);
> @@ -13628,8 +13663,7 @@ int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
>   				  struct kvm_kernel_irq_routing_entry *old,
>   				  struct kvm_kernel_irq_routing_entry *new)
>   {
> -	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
> -					    irqfd->gsi, new);
> +	return kvm_pi_update_irte(irqfd, old, new);
>   }
>   
>   bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  2025-04-04 19:39 ` [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
@ 2025-04-08 17:51   ` Paolo Bonzini
  0 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 17:51 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:39, Sean Christopherson wrote:
> Don't query a vCPU's blocking status when toggling AVIC on/off; barring
> KVM bugs, the vCPU can't be blocking when refrecing AVIC controls.  And if

refrecing -> refreshing

Paolo

> there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the
> correct state is desirable, i.e. well worth any overhead in a buggy
> scenario.
> 
> Isolating the "real" load/put flows will allow moving the IOMMU IRTE
> (de)activation logic from avic_refresh_apicv_exec_ctrl() to
> avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
> physical ID entry and its IRTEs in a common path, under a single critical
> section of ir_list_lock.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c | 65 +++++++++++++++++++++++------------------
>   1 file changed, 37 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 0425cc374a79..d5fa915d0827 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -838,7 +838,7 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
>   		WARN_ON_ONCE(amd_iommu_update_ga(cpu, ir->data));
>   }
>   
> -void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	int h_physical_id = kvm_cpu_get_apicid(cpu);
> @@ -854,16 +854,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
>   		return;
>   
> -	/*
> -	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
> -	 * is being scheduled in after being preempted.  The CPU entries in the
> -	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
> -	 * If the vCPU was migrated, its new CPU value will be stuffed when the
> -	 * vCPU unblocks.
> -	 */
> -	if (kvm_vcpu_is_blocking(vcpu))
> -		return;
> -
>   	/*
>   	 * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
>   	 * _currently_ have assigned devices, as that can change.  Holding
> @@ -898,31 +888,33 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>   }
>   
> -void avic_vcpu_put(struct kvm_vcpu *vcpu)
> +void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	/*
> +	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
> +	 * is being scheduled in after being preempted.  The CPU entries in the
> +	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
> +	 * If the vCPU was migrated, its new CPU value will be stuffed when the
> +	 * vCPU unblocks.
> +	 */
> +	if (kvm_vcpu_is_blocking(vcpu))
> +		return;
> +
> +	__avic_vcpu_load(vcpu, cpu);
> +}
> +
> +static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	struct vcpu_svm *svm = to_svm(vcpu);
>   	unsigned long flags;
> -	u64 entry;
> +	u64 entry = svm->avic_physical_id_entry;
>   
>   	lockdep_assert_preemption_disabled();
>   
>   	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
>   		return;
>   
> -	/*
> -	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
> -	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
> -	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
> -	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
> -	 * recursively.
> -	 */
> -	entry = svm->avic_physical_id_entry;
> -
> -	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
> -	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> -		return;
> -
>   	/*
>   	 * Take and hold the per-vCPU interrupt remapping lock while updating
>   	 * the Physical ID entry even though the lock doesn't protect against
> @@ -942,7 +934,24 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>   		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
>   
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> +}
>   
> +void avic_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
> +	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
> +	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
> +	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
> +	 * recursively.
> +	 */
> +	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
> +
> +	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
> +	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> +		return;
> +
> +	__avic_vcpu_put(vcpu);
>   }
>   
>   void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
> @@ -983,9 +992,9 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>   	avic_refresh_virtual_apic_mode(vcpu);
>   
>   	if (activated)
> -		avic_vcpu_load(vcpu, vcpu->cpu);
> +		__avic_vcpu_load(vcpu, vcpu->cpu);
>   	else
> -		avic_vcpu_put(vcpu);
> +		__avic_vcpu_put(vcpu);
>   
>   	/*
>   	 * Here, we go through the per-vcpu ir_list to update all existing



* Re: [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
  2025-04-04 19:39 ` [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Sean Christopherson
@ 2025-04-08 17:53   ` Paolo Bonzini
  2025-04-08 21:31     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-08 17:53 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/4/25 21:39, Sean Christopherson wrote:
> Configure IRTEs to GA log interrupts for device posted IRQs that hit
> non-running vCPUs if and only if the target vCPU is blocking, i.e.
> actually needs a wake event.  If the vCPU has exited to userspace or was
> preempted, generating GA log entries and interrupts is wasteful and
> unnecessary, as the vCPU will be re-loaded and/or scheduled back in
> irrespective of the GA log notification (avic_ga_log_notifier() is just a
> fancy wrapper for kvm_vcpu_wake_up()).
> 
> Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
> track whether or not the vCPU's associated IRTEs are configured to
> generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
> never set the should-be-zero bit in tables that are used by hardware.
> Use a synthetic bit instead of a dedicated boolean to minimize the odds
> of messing up the locking, i.e. so that all the existing rules that apply
> to avic_physical_id_entry for IS_RUNNING are reused verbatim for
> GA_LOG_INTR.
> 
> Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
> phase, using kvm_vcpu_is_blocking() to track the need for notifications
> isn't a viable option.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/svm.h |  7 ++++++
>   arch/x86/kvm/svm/avic.c    | 49 +++++++++++++++++++++++++++-----------
>   2 files changed, 42 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 8b07939ef3b9..be6e833bf92c 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -246,6 +246,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>   #define AVIC_LOGICAL_ID_ENTRY_VALID_BIT			31
>   #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK		(1 << 31)
>   
> +/*
> + * GA_LOG_INTR is a synthetic flag that's never propagated to hardware-visible
> + * tables.  GA_LOG_INTR is set if the vCPU needs device posted IRQs to generate
> + * GA log interrupts to wake the vCPU (because it's blocking or about to block).
> + */
> +#define AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR		BIT_ULL(61)
> +
>   #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK	GENMASK_ULL(11, 0)
>   #define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	GENMASK_ULL(51, 12)
>   #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK		(1ULL << 62)
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 1466e66cca6c..0d2a17a74be6 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -798,7 +798,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
>   		} else {
>   			pi_data.cpu = -1;
> -			pi_data.ga_log_intr = true;
> +			pi_data.ga_log_intr = entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
>   		}
>   
>   		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
> @@ -823,7 +823,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   }
>   
>   static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
> -					    bool toggle_avic)
> +					    bool toggle_avic, bool ga_log_intr)
>   {
>   	struct amd_svm_iommu_ir *ir;
>   	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -839,9 +839,9 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
>   
>   	list_for_each_entry(ir, &svm->ir_list, node) {
>   		if (!toggle_avic)
> -			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, true));
> +			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, ga_log_intr));
>   		else if (cpu >= 0)
> -			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, true));
> +			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, ga_log_intr));
>   		else
>   			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(ir->data));
>   	}
> @@ -875,7 +875,8 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
>   	entry = svm->avic_physical_id_entry;
>   	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>   
> -	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
> +	entry &= ~(AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK |
> +		   AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
>   	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
>   	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
>   
> @@ -892,7 +893,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
>   
>   	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
>   
> -	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic);
> +	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
>   
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>   }
> @@ -912,7 +913,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	__avic_vcpu_load(vcpu, cpu, false);
>   }
>   
> -static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
> +static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
> +			    bool is_blocking)

What would it look like to use an enum { SCHED_OUT, SCHED_IN, 
ENABLE_AVIC, DISABLE_AVIC, START_BLOCKING } for both __avic_vcpu_put and 
__avic_vcpu_load's second argument?  Consecutive bools are ugly...

Paolo

>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -934,14 +936,28 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
>   	 */
>   	spin_lock_irqsave(&svm->ir_list_lock, flags);
>   
> -	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic);
> +	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic, is_blocking);
>   
> +	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
> +
> +	/*
> +	 * Keep the previous APIC ID in the entry so that a rogue doorbell from
> +	 * hardware is at least restricted to a CPU associated with the vCPU.
> +	 */
>   	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> -	svm->avic_physical_id_entry = entry;
>   
>   	if (enable_ipiv)
>   		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
>   
> +	/*
> +	 * Note!  Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
> +	 * it's a synthetic flag that usurps an unused, should-be-zero bit.
> +	 */
> +	if (is_blocking)
> +		entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
> +
> +	svm->avic_physical_id_entry = entry;
> +
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>   }
>   
> @@ -957,10 +973,15 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>   	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
>   
>   	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
> -	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> -		return;
> +	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)) {
> +		if (WARN_ON_ONCE(!kvm_vcpu_is_blocking(vcpu)))
> +			return;
>   
> -	__avic_vcpu_put(vcpu, false);
> +		if (!(WARN_ON_ONCE(!(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR))))
> +			return;
> +	}
> +
> +	__avic_vcpu_put(vcpu, false, kvm_vcpu_is_blocking(vcpu));
>   }
>   
>   void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
> @@ -997,7 +1018,7 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>   	if (kvm_vcpu_apicv_active(vcpu))
>   		__avic_vcpu_load(vcpu, vcpu->cpu, true);
>   	else
> -		__avic_vcpu_put(vcpu, true);
> +		__avic_vcpu_put(vcpu, true, true);
>   }
>   
>   void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
> @@ -1023,7 +1044,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
>   	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
>   	 * to the vCPU while it's scheduled out.
>   	 */
> -	avic_vcpu_put(vcpu);
> +	__avic_vcpu_put(vcpu, false, true);
>   }
>   
>   void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)



* Re: [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-04-08 17:30   ` Paolo Bonzini
@ 2025-04-08 20:51     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-08 20:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 08, 2025, Paolo Bonzini wrote:
> On 4/4/25 21:38, Sean Christopherson wrote:
> > Hoist the logic for identifying the target vCPU for a posted interrupt
> > into common x86.  The code is functionally identical between Intel and
> > AMD.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h |  3 +-
> >   arch/x86/kvm/svm/avic.c         | 83 ++++++++-------------------------
> >   arch/x86/kvm/svm/svm.h          |  3 +-
> >   arch/x86/kvm/vmx/posted_intr.c  | 56 ++++++----------------
> >   arch/x86/kvm/vmx/posted_intr.h  |  3 +-
> >   arch/x86/kvm/x86.c              | 46 +++++++++++++++---
> 
> Please use irq.c, since (for once) there is a file other than x86.c that can
> be used.

Hah, will do.  I honestly forget that irq.c and irq_comm.c exist on a regular
basis.

> Bonus points for merging irq_comm.c into irq.c (IIRC irq_comm.c was "common"
> between ia64 and x86 :)).

With pleasure :-)


* Re: [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
  2025-04-08 17:53   ` Paolo Bonzini
@ 2025-04-08 21:31     ` Sean Christopherson
  2025-04-09 10:34       ` Paolo Bonzini
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-08 21:31 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 08, 2025, Paolo Bonzini wrote:
> On 4/4/25 21:39, Sean Christopherson wrote:
> > @@ -892,7 +893,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
> >   	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
> > -	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic);
> > +	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
> >   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> >   }
> > @@ -912,7 +913,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >   	__avic_vcpu_load(vcpu, cpu, false);
> >   }
> > -static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
> > +static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
> > +			    bool is_blocking)
> 
> What would it look like to use an enum { SCHED_OUT, SCHED_IN, ENABLE_AVIC,
> DISABLE_AVIC, START_BLOCKING } for both __avic_vcpu_put and
> __avic_vcpu_load's second argument?

There's gotta be a way to make it look better than this code.  I gave a half-
hearted attempt at using an enum before posting, but wasn't able to come up with
anything decent.

Coming back to it with fresh eyes, what about this (full on-top diff below)?

enum avic_vcpu_action {
	AVIC_SCHED_IN		= 0,
	AVIC_SCHED_OUT		= 0,
	AVIC_START_BLOCKING	= BIT(0),

	AVIC_TOGGLE_ON_OFF	= BIT(1),
	AVIC_ACTIVATE		= AVIC_TOGGLE_ON_OFF,
	AVIC_DEACTIVATE		= AVIC_TOGGLE_ON_OFF,
};

AVIC_SCHED_IN and AVIC_SCHED_OUT are essentially syntactic sugar, as are
AVIC_ACTIVATE and AVIC_DEACTIVATE to a certain extent.  But it's much better than
booleans, and using a bitmask makes avic_update_iommu_vcpu_affinity() slightly
prettier.

> Consecutive bools are ugly...

Yeah, I hated it when I wrote it, and still hate it now.

And more error prone, e.g. the __avic_vcpu_put() call from avic_refresh_apicv_exec_ctrl()
should specify is_blocking=false, not true, as kvm_x86_ops.refresh_apicv_exec_ctrl()
should never be called while the vCPU is blocking.

---
 arch/x86/kvm/svm/avic.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 425674e1a04c..1752420c68aa 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -833,9 +833,20 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
+enum avic_vcpu_action {
+	AVIC_SCHED_IN		= 0,
+	AVIC_SCHED_OUT		= 0,
+	AVIC_START_BLOCKING	= BIT(0),
+
+	AVIC_TOGGLE_ON_OFF	= BIT(1),
+	AVIC_ACTIVATE		= AVIC_TOGGLE_ON_OFF,
+	AVIC_DEACTIVATE		= AVIC_TOGGLE_ON_OFF,
+};
+
 static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
-					    bool toggle_avic, bool ga_log_intr)
+					    enum avic_vcpu_action action)
 {
+	bool ga_log_intr = (action & AVIC_START_BLOCKING);
 	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -849,7 +860,7 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 		return;
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
-		if (!toggle_avic)
+		if (!(action & AVIC_TOGGLE_ON_OFF))
 			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, ga_log_intr));
 		else if (cpu >= 0)
 			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, ga_log_intr));
@@ -858,7 +869,8 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 	}
 }
 
-static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu,
+			     enum avic_vcpu_action action)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
@@ -904,7 +916,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, action);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -921,11 +933,10 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_vcpu_is_blocking(vcpu))
 		return;
 
-	__avic_vcpu_load(vcpu, cpu, false);
+	__avic_vcpu_load(vcpu, cpu, AVIC_SCHED_IN);
 }
 
-static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
-			    bool is_blocking)
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu, enum avic_vcpu_action action)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -947,7 +958,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic, is_blocking);
+	avic_update_iommu_vcpu_affinity(vcpu, -1, action);
 
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
 
@@ -964,7 +975,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
 	 * Note!  Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
 	 * it's a synthetic flag that usurps an unused, should-be-zero bit.
 	 */
-	if (is_blocking)
+	if (action & AVIC_START_BLOCKING)
 		entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
 
 	svm->avic_physical_id_entry = entry;
@@ -992,7 +1003,8 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 			return;
 	}
 
-	__avic_vcpu_put(vcpu, false, kvm_vcpu_is_blocking(vcpu));
+	__avic_vcpu_put(vcpu, kvm_vcpu_is_blocking(vcpu) ? AVIC_START_BLOCKING :
+							   AVIC_SCHED_OUT);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -1024,12 +1036,15 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	if (!enable_apicv)
 		return;
 
+	/* APICv should only be toggled on/off while the vCPU is running. */
+	WARN_ON_ONCE(kvm_vcpu_is_blocking(vcpu));
+
 	avic_refresh_virtual_apic_mode(vcpu);
 
 	if (kvm_vcpu_apicv_active(vcpu))
-		__avic_vcpu_load(vcpu, vcpu->cpu, true);
+		__avic_vcpu_load(vcpu, vcpu->cpu, AVIC_ACTIVATE);
 	else
-		__avic_vcpu_put(vcpu, true, true);
+		__avic_vcpu_put(vcpu, AVIC_DEACTIVATE);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
@@ -1055,7 +1070,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
 	 * to the vCPU while it's scheduled out.
 	 */
-	__avic_vcpu_put(vcpu, false, true);
+	__avic_vcpu_put(vcpu, AVIC_START_BLOCKING);
 }
 
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)

base-commit: fe5b44cf46d5444ff071bc2373fbe7b109a3f60b
-- 

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  2025-04-08 16:57   ` Paolo Bonzini
@ 2025-04-08 22:25     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-08 22:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 08, 2025, Paolo Bonzini wrote:
> On 4/4/25 21:38, Sean Christopherson wrote:
> > Delete the amd_ir_data.prev_ga_tag field now that all usage is
> > superfluous.
> 
> This can be moved much earlier (maybe even after patch 10 from a cursory
> look), can't it? 

Ya, I independently arrived at the same conclusion[*], specifically after

   KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE

[*] I was counting patches based on my local tree, which has three extra patches
    from the posted IRQs module param, and so initially thought the last dependency
    went away in patch 13.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-08 12:44 ` [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Joerg Roedel
@ 2025-04-09  8:30   ` Vasant Hegde
  0 siblings, 0 replies; 128+ messages in thread
From: Vasant Hegde @ 2025-04-09  8:30 UTC (permalink / raw)
  To: Joerg Roedel, Sean Christopherson, suravee.suthikulpanit
  Cc: Paolo Bonzini, David Woodhouse, Lu Baolu, kvm, iommu,
	linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

Joerg,


On 4/8/2025 6:14 PM, Joerg Roedel wrote:
> Hey Sean,
> 
> On Fri, Apr 04, 2025 at 12:38:15PM -0700, Sean Christopherson wrote:
>> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>>        general.  This needs more testing on AMD with device posted IRQs.
> 
> Thanks for posting this, it fixes quite some issues in the posted IRQ
> implemention. I skimmed through the AMD IOMMU changes and besides some
> small things didn't spot anything worrisome.
> 
> Adding Suravee and Vasant from AMD to this thread for deeper review
> (also of the KVM parts) and testing.

Sure. We will try to review it soon and will run some tests.

-Vasant


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
  2025-04-08 21:31     ` Sean Christopherson
@ 2025-04-09 10:34       ` Paolo Bonzini
  0 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-09 10:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On 4/8/25 23:31, Sean Christopherson wrote:
> On Tue, Apr 08, 2025, Paolo Bonzini wrote:
>> On 4/4/25 21:39, Sean Christopherson wrote:
>>> @@ -892,7 +893,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
>>>    	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
>>> -	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic);
>>> +	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
>>>    	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>>>    }
>>> @@ -912,7 +913,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>    	__avic_vcpu_load(vcpu, cpu, false);
>>>    }
>>> -static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic)
>>> +static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
>>> +			    bool is_blocking)
>>
>> What would it look like to use an enum { SCHED_OUT, SCHED_IN, ENABLE_AVIC,
>> DISABLE_AVIC, START_BLOCKING } for both __avic_vcpu_put and
>> __avic_vcpu_load's second argument?
> 
> There's gotta be a way to make it look better than this code.  I gave a half-
> hearted attempt at using an enum before posting, but wasn't able to come up with
> anything decent.
> 
> Coming back to it with fresh eyes, what about this (full on-top diff below)?
> 
> enum avic_vcpu_action {
> 	AVIC_SCHED_IN		= 0,
> 	AVIC_SCHED_OUT		= 0,
> 	AVIC_START_BLOCKING	= BIT(0),
> 
> 	AVIC_TOGGLE_ON_OFF	= BIT(1),
> 	AVIC_ACTIVATE		= AVIC_TOGGLE_ON_OFF,
> 	AVIC_DEACTIVATE		= AVIC_TOGGLE_ON_OFF,
> };
> 
> AVIC_SCHED_IN and AVIC_SCHED_OUT are essentially syntactic sugar, as are
> AVIC_ACTIVATE and AVIC_DEACTIVATE to a certain extent.  But it's much better than
> booleans, and using a bitmask makes avic_update_iommu_vcpu_affinity() slightly
> prettier.

Even just the bitmask at least makes it clear what is "true" and what is 
"false" (which is obvious but I never thought about it this way, you 
never stop learning).

You decide whether you prefer the syntactic sugar or not.

Paolo

>> Consecutive bools are ugly...
> 
> Yeah, I hated it when I wrote it, and still hate it now.
> 
> And more error prone, e.g. the __avic_vcpu_put() call from avic_refresh_apicv_exec_ctrl()
> should specify is_blocking=false, not true, as kvm_x86_ops.refresh_apicv_exec_ctrl()
> should never be called while the vCPU is blocking.
> 
> ---
>   arch/x86/kvm/svm/avic.c | 41 ++++++++++++++++++++++++++++-------------
>   1 file changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 425674e1a04c..1752420c68aa 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -833,9 +833,20 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	return irq_set_vcpu_affinity(host_irq, NULL);
>   }
>   
> +enum avic_vcpu_action {
> +	AVIC_SCHED_IN		= 0,
> +	AVIC_SCHED_OUT		= 0,
> +	AVIC_START_BLOCKING	= BIT(0),
> +
> +	AVIC_TOGGLE_ON_OFF	= BIT(1),
> +	AVIC_ACTIVATE		= AVIC_TOGGLE_ON_OFF,
> +	AVIC_DEACTIVATE		= AVIC_TOGGLE_ON_OFF,
> +};
> +
>   static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
> -					    bool toggle_avic, bool ga_log_intr)
> +					    enum avic_vcpu_action action)
>   {
> +	bool ga_log_intr = (action & AVIC_START_BLOCKING);
>   	struct amd_svm_iommu_ir *ir;
>   	struct vcpu_svm *svm = to_svm(vcpu);
>   
> @@ -849,7 +860,7 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
>   		return;
>   
>   	list_for_each_entry(ir, &svm->ir_list, node) {
> -		if (!toggle_avic)
> +		if (!(action & AVIC_TOGGLE_ON_OFF))
>   			WARN_ON_ONCE(amd_iommu_update_ga(ir->data, cpu, ga_log_intr));
>   		else if (cpu >= 0)
>   			WARN_ON_ONCE(amd_iommu_activate_guest_mode(ir->data, cpu, ga_log_intr));
> @@ -858,7 +869,8 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
>   	}
>   }
>   
> -static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
> +static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu,
> +			     enum avic_vcpu_action action)
>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	int h_physical_id = kvm_cpu_get_apicid(cpu);
> @@ -904,7 +916,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool toggle_avic)
>   
>   	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
>   
> -	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, toggle_avic, false);
> +	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, action);
>   
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>   }
> @@ -921,11 +933,10 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	if (kvm_vcpu_is_blocking(vcpu))
>   		return;
>   
> -	__avic_vcpu_load(vcpu, cpu, false);
> +	__avic_vcpu_load(vcpu, cpu, AVIC_SCHED_IN);
>   }
>   
> -static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
> -			    bool is_blocking)
> +static void __avic_vcpu_put(struct kvm_vcpu *vcpu, enum avic_vcpu_action action)
>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -947,7 +958,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
>   	 */
>   	spin_lock_irqsave(&svm->ir_list_lock, flags);
>   
> -	avic_update_iommu_vcpu_affinity(vcpu, -1, toggle_avic, is_blocking);
> +	avic_update_iommu_vcpu_affinity(vcpu, -1, action);
>   
>   	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
>   
> @@ -964,7 +975,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, bool toggle_avic,
>   	 * Note!  Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
>   	 * it's a synthetic flag that usurps an unused, should-be-zero bit.
>   	 */
> -	if (is_blocking)
> +	if (action & AVIC_START_BLOCKING)
>   		entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
>   
>   	svm->avic_physical_id_entry = entry;
> @@ -992,7 +1003,8 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>   			return;
>   	}
>   
> -	__avic_vcpu_put(vcpu, false, kvm_vcpu_is_blocking(vcpu));
> +	__avic_vcpu_put(vcpu, kvm_vcpu_is_blocking(vcpu) ? AVIC_START_BLOCKING :
> +							   AVIC_SCHED_OUT);
>   }
>   
>   void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
> @@ -1024,12 +1036,15 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>   	if (!enable_apicv)
>   		return;
>   
> +	/* APICv should only be toggled on/off while the vCPU is running. */
> +	WARN_ON_ONCE(kvm_vcpu_is_blocking(vcpu));
> +
>   	avic_refresh_virtual_apic_mode(vcpu);
>   
>   	if (kvm_vcpu_apicv_active(vcpu))
> -		__avic_vcpu_load(vcpu, vcpu->cpu, true);
> +		__avic_vcpu_load(vcpu, vcpu->cpu, AVIC_ACTIVATE);
>   	else
> -		__avic_vcpu_put(vcpu, true, true);
> +		__avic_vcpu_put(vcpu, AVIC_DEACTIVATE);
>   }
>   
>   void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
> @@ -1055,7 +1070,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
>   	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
>   	 * to the vCPU while it's scheduled out.
>   	 */
> -	__avic_vcpu_put(vcpu, false, true);
> +	__avic_vcpu_put(vcpu, AVIC_START_BLOCKING);
>   }
>   
>   void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
> 
> base-commit: fe5b44cf46d5444ff071bc2373fbe7b109a3f60b


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-04 19:39 ` [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
@ 2025-04-09 11:56   ` Joao Martins
  2025-04-10 15:45     ` Sean Christopherson
  2025-04-18 12:17     ` Vasant Hegde
  0 siblings, 2 replies; 128+ messages in thread
From: Joao Martins @ 2025-04-09 11:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Vasant Hegde,
	Joerg Roedel, David Woodhouse, Lu Baolu, Paolo Bonzini

On 04/04/2025 20:39, Sean Christopherson wrote:
> Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
> not an IRTE is configured to generate GA log interrupts.  KVM only needs a
> notification if the target vCPU is blocking, so the vCPU can be awakened.
> If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
> set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
> task is scheduled back in, i.e. KVM doesn't need a notification.
> 
> Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
> from the KVM changes insofar as possible.
> 
> Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
> so that they match amd_iommu_activate_guest_mode().

Unfortunately I think this patch and the next one might be riding on the
assumption that amd_iommu_update_ga() is always cheap :( -- see below.

I didn't spot anything else flawed in the series though, just this one. I would
suggest holding off on this and the next one, while progressing with the rest of
the series.

> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 2e016b98fa1b..27b03e718980 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
> +				  bool ga_log_intr)
>  {
>  	if (cpu >= 0) {
>  		entry->lo.fields_vapic.destination =
> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>  		entry->hi.fields.destination =
>  					APICID_TO_IRTE_DEST_HI(cpu);
>  		entry->lo.fields_vapic.is_run = true;
> +		entry->lo.fields_vapic.ga_log_intr = false;
>  	} else {
>  		entry->lo.fields_vapic.is_run = false;
> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>  	}
>  }
>

isRun, Destination and GATag are not cached. Quoting the update from a few years
back (page 93 of IOMMU spec dated Feb 2025):

| When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
| IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
| are not cached by the IOMMU. Modifications to these fields do not require an
| invalidation of the Interrupt Remapping Table.

This is the reason we were able to get rid of the IOMMU invalidation in
amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
Besides the lock contention that was observed at the time, we were seeing stalls
in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.

Now, this change above is incorrect as is. To make it correct, you will need
to XOR against the previous content of IRTE::ga_log_intr, and if the bit
changed, re-add an invalidation command via iommu_flush_irt_and_complete().
The latter is what I am worried will reintroduce the above problems :(

The invalidation command (which has a completion barrier to serialize
invalidation execution) takes some time in h/w, and will make all your vCPUs
contend on the IRQ table lock (as is). Even assuming you somehow move the
invalidation outside the lock, you will contend on the iommu lock (for the
command queue); or, best case, assuming no locks at all (which I am not sure
is possible), you will need to wait for the command to complete before you
can progress with entering/exiting.

Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
invalidation command (which would be good news!). Adding Suravee/Vasant here.

It's a nice trick how you would leverage this in SVM, but do you have
measurements that corroborate its introduction? How many unnecessary GALog
entries were you able to avoid with this trick say on a workload that would
exercise this (like netperf 1byte RR test that sleeps and wakes up a lot) ?

I should also mention that there's an alternative to GALog (on Genoa or more
recent), namely GAPPI support, whereby an entry is generated only under a
rarer condition. Quoting the relevant excerpts below:

| This mode is enabled by setting MMIO Offset 0018h[GAPPIEn]=1. Under this
| mode, guest interrupts (IRTE[GuestMode]=1) update the vAPIC backing page IRR
| status as normal.

| In GAPPI mode, a GAPPI interrupt is generated if all of the guest IRR bits
| were previously clear prior to the last IRR update. This indicates the new
| interrupt is the first pending interrupt to the
| vCPU. The GAPPI interrupt is used to signal Hypervisor software to schedule
| one or more vCPUs to execute pending interrupts.

| Implementations may not be able to perfectly determine if all of the IRR bits
| were previously clear prior to updating the vAPIC backing page to set IRR.
| Spurious interrupts may be generated as a
| result. Software must be designed to handle this possibility

Page 99, "2.2.5.5 Guest APIC Physical Processor Interrupt",
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-09 11:56   ` Joao Martins
@ 2025-04-10 15:45     ` Sean Christopherson
  2025-04-10 17:13       ` Joao Martins
  2025-04-18 12:17     ` Vasant Hegde
  1 sibling, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-10 15:45 UTC (permalink / raw)
  To: Joao Martins
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Vasant Hegde,
	Joerg Roedel, David Woodhouse, Lu Baolu, Paolo Bonzini

On Wed, Apr 09, 2025, Joao Martins wrote:
> On 04/04/2025 20:39, Sean Christopherson wrote:
> I would suggest holding off on this and the next one, while progressing with
> the rest of the series.

Agreed, though I think there's a "pure win" alternative that can be safely
implemented (but it definitely should be done separately).

If HLT-exiting is disabled for the VM, and the VM doesn't have access to the
various paravirtual features that can put it into a synthetic HLT state (PV async
#PF and/or Xen support), then I'm pretty sure GALogIntr can be disabled entirely,
i.e. disabled during the initial irq_set_vcpu_affinity() and never enabled.  KVM
doesn't emulate HLT via its full emulator for AMD (just non-unrestricted Intel
guests), so I'm pretty sure there would be no need for KVM to ever wake a vCPU in
response to a device interrupt.

> > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > index 2e016b98fa1b..27b03e718980 100644
> > --- a/drivers/iommu/amd/iommu.c
> > +++ b/drivers/iommu/amd/iommu.c
> > -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
> > +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
> > +				  bool ga_log_intr)
> >  {
> >  	if (cpu >= 0) {
> >  		entry->lo.fields_vapic.destination =
> > @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
> >  		entry->hi.fields.destination =
> >  					APICID_TO_IRTE_DEST_HI(cpu);
> >  		entry->lo.fields_vapic.is_run = true;
> > +		entry->lo.fields_vapic.ga_log_intr = false;
> >  	} else {
> >  		entry->lo.fields_vapic.is_run = false;
> > +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
> >  	}
> >  }
> >
> 
> isRun, Destination and GATag are not cached. Quoting the update from a few years
> back (page 93 of IOMMU spec dated Feb 2025):
> 
> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
> | IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
> | are not cached by the IOMMU. Modifications to these fields do not require an
> | invalidation of the Interrupt Remapping Table.

Ooh, that's super helpful info.  Any objections to me adding verbose comments to
explain the effective rules for amd_iommu_update_ga()?

> This is the reason we were able to get rid of the IOMMU invalidation in
> amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
> Besides the lock contention that was observed at the time, we were seeing stalls
> in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.
> 
> Now this change above is incorrect as is and to make it correct: you will need
> xor with the previous content of the IRTE::ga_log_intr and then if it changes
> then you re-add back an invalidation command via
> iommu_flush_irt_and_complete()). The latter is what I am worried will
> reintroduce these above problem :(

Ya, the need to flush definitely changes things.

> The invalidation command (which has a completion barrier to serialize
> invalidation execution) takes some time in h/w, and will make all your vcpus
> content on the irq table lock (as is). Even assuming you somehow move the
> invalidation outside the lock, you will content on the iommu lock (for the
> command queue) or best case assuming no locks (which I am not sure it is
> possible) you will need to wait for the command to complete until you can
> progress forward with entering/exiting.
> 
> Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
> invalidation command (which would be good news!). Adding Suravee/Vasant here.
> 
> It's a nice trick how you would leverage this in SVM, but do you have
> measurements that corroborate its introduction? How many unnecessary GALog
> entries were you able to avoid with this trick say on a workload that would
> exercise this (like netperf 1byte RR test that sleeps and wakes up a lot) ?

I didn't do any measurements; I assumed the opportunistic toggling of GALogIntr
would be "free".

There might be optimizations that could be done, but I think the better solution
is to simply disable GALogIntr when it's not needed.  That'd limit the benefits
to select setups, but trying to optimize IRQ bypass for VMs that are CPU-overcommitted,
i.e. can't do native HLT, seems rather pointless.

> I should also mention that there's an alternative to GALog (on Genoa or more
> recent), namely GAPPI support, whereby an entry is generated only under a
> rarer condition. Quoting the relevant excerpts below:
> 
> | This mode is enabled by setting MMIO Offset 0018h[GAPPIEn]=1. Under this
> | mode, guest interrupts (IRTE[GuestMode]=1) update the vAPIC backing page IRR
> | status as normal.
> 
> | In GAPPI mode, a GAPPI interrupt is generated if all of the guest IRR bits
> | were previously clear prior to the last IRR update. This indicates the new
> | interrupt is the first pending interrupt to the
> | vCPU. The GAPPI interrupt is used to signal Hypervisor software to schedule
> | one or more vCPUs to execute pending interrupts.
> 
> | Implementations may not be able to perfectly determine if all of the IRR bits
> | were previously clear prior to updating the vAPIC backing page to set IRR.
> | Spurious interrupts may be generated as a
> | result. Software must be designed to handle this possibility

I saw this as well.  I'm curious if enabling GAPPI mode affects IOMMU interrupt
delivery latency/throughput due to having to scrape the entire IRR.

> Page 99, "2.2.5.5 Guest APIC Physical Processor Interrupt",
> https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-10 15:45     ` Sean Christopherson
@ 2025-04-10 17:13       ` Joao Martins
  2025-04-10 17:29         ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Joao Martins @ 2025-04-10 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Vasant Hegde,
	Joerg Roedel, David Woodhouse, Lu Baolu, Paolo Bonzini

On 10/04/2025 16:45, Sean Christopherson wrote:
> On Wed, Apr 09, 2025, Joao Martins wrote:
>> On 04/04/2025 20:39, Sean Christopherson wrote:
>> I would suggest holding off on this and the next one, while progressing with
>> the rest of the series.
> 
> Agreed, though I think there's a "pure win" alternative that can be safely
> implemented (but it definitely should be done separately).
> 
> If HLT-exiting is disabled for the VM, and the VM doesn't have access to the
> various paravirtual features that can put it into a synthetic HLT state (PV async
> #PF and/or Xen support), then I'm pretty sure GALogIntr can be disabled entirely,
> i.e. disabled during the initial irq_set_vcpu_affinity() and never enabled.  KVM
> doesn't emulate HLT via its full emulator for AMD (just non-unrestricted Intel
> guests), so I'm pretty sure there would be no need for KVM to ever wake a vCPU in
> response to a device interrupt.
> 

IRQ affinity changes already rewrite a significant portion of the IRTE, and
they happen on a slow path that performs an invalidation anyway, so doing this
via irq_set_vcpu_affinity() is definitely safe.

But even with HLT exits disabled, there's still preemption, though? I guess
that's a bit rarer if it's conditional on whether HLT exiting is enabled and
whether there's only a single task running.

>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index 2e016b98fa1b..27b03e718980 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
>>> +				  bool ga_log_intr)
>>>  {
>>>  	if (cpu >= 0) {
>>>  		entry->lo.fields_vapic.destination =
>>> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>>  		entry->hi.fields.destination =
>>>  					APICID_TO_IRTE_DEST_HI(cpu);
>>>  		entry->lo.fields_vapic.is_run = true;
>>> +		entry->lo.fields_vapic.ga_log_intr = false;
>>>  	} else {
>>>  		entry->lo.fields_vapic.is_run = false;
>>> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>>>  	}
>>>  }
>>>
>>
>> isRun, Destination and GATag are not cached. Quoting the update from a few years
>> back (page 93 of IOMMU spec dated Feb 2025):
>>
>> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
>> | IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
>> | are not cached by the IOMMU. Modifications to these fields do not require an
>> | invalidation of the Interrupt Remapping Table.
> 
> Ooh, that's super helpful info.  Any objections to me adding verbose comments to
> explain the effective rules for amd_iommu_update_ga()?
> 
That's a great addition; it should have been there from the beginning when we
added the cacheability of guest-mode IRTEs into the mix.

>> This is the reason we were able to get rid of the IOMMU invalidation in
>> amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
>> Besides the lock contention that was observed at the time, we were seeing stalls
>> in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.
>>
>> Now this change above is incorrect as is and to make it correct: you will need
>> xor with the previous content of the IRTE::ga_log_intr and then if it changes
>> then you re-add back an invalidation command via
>> iommu_flush_irt_and_complete()). The latter is what I am worried will
>> reintroduce these above problem :(
> 
> Ya, the need to flush definitely changes things.
> 
>> The invalidation command (which has a completion barrier to serialize
>> invalidation execution) takes some time in h/w, and will make all your vcpus
>> content on the irq table lock (as is). Even assuming you somehow move the
>> invalidation outside the lock, you will content on the iommu lock (for the
>> command queue) or best case assuming no locks (which I am not sure it is
>> possible) you will need to wait for the command to complete until you can
>> progress forward with entering/exiting.
>>
>> Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
>> invalidation command (which would be good news!). Adding Suravee/Vasant here.
>>
>> It's a nice trick how you would leverage this in SVM, but do you have
>> measurements that corroborate its introduction? How many unnecessary GALog
>> entries were you able to avoid with this trick say on a workload that would
>> exercise this (like netperf 1byte RR test that sleeps and wakes up a lot) ?
> 
> I didn't do any measurements; I assumed the opportunistic toggling of GALogIntr
> would be "free".
> 
> There might be optimizations that could be done, but I think the better solution
> is to simply disable GALogIntr when it's not needed.  That'd limit the benefits
> to select setups, but trying to optimize IRQ bypass for VMs that are CPU-overcommitted,
> i.e. can't do native HLT, seems rather pointless.
> 
/me nods

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-10 17:13       ` Joao Martins
@ 2025-04-10 17:29         ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-10 17:29 UTC (permalink / raw)
  To: Joao Martins
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Vasant Hegde,
	Joerg Roedel, David Woodhouse, Lu Baolu, Paolo Bonzini

On Thu, Apr 10, 2025, Joao Martins wrote:
> On 10/04/2025 16:45, Sean Christopherson wrote:
> > On Wed, Apr 09, 2025, Joao Martins wrote:
> >> On 04/04/2025 20:39, Sean Christopherson wrote:
> >> I would suggest holding off on this and the next one, while progressing with
> >> the rest of the series.
> > 
> > Agreed, though I think there's a "pure win" alternative that can be safely
> > implemented (but it definitely should be done separately).
> > 
> > If HLT-exiting is disabled for the VM, and the VM doesn't have access to the
> > various paravirtual features that can put it into a synthetic HLT state (PV async
> > #PF and/or Xen support), then I'm pretty sure GALogIntr can be disabled entirely,
> > i.e. disabled during the initial irq_set_vcpu_affinity() and never enabled.  KVM
> > doesn't emulate HLT via its full emulator for AMD (just non-unrestricted Intel
> > guests), so I'm pretty sure there would be no need for KVM to ever wake a vCPU in
> > response to a device interrupt.
> > 
> 
> Done via IRQ affinity changes already a significant portion of the IRTE and it's
> already on a slowpath that performs an invalidation, so via
> irq_set_vcpu_affinity is definitely safe.
> 
> But even with HLT exits disabled; there's still preemption though?

Even with involuntary preemption (which would be nonsensical to pair with HLT
passthrough), KVM doesn't rely on the GALogIntr to schedule in the vCPU task.

The _only_ use of the notification is to wake the task and make it runnable.  If
the vCPU task is already runnable, when and where the task is run is fully
controlled by the scheduler (and/or userspace).

> But I guess that's a bit more rare if it's conditional to HLT exiting being
> enabled or not, and whether there's only a single task running.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  2025-04-04 19:38 ` [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
@ 2025-04-11  7:47   ` Arun Kodilkar, Sairaj
  2025-04-11 14:32     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Arun Kodilkar, Sairaj @ 2025-04-11  7:47 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
> structure and GSI routing instead of dynamically allocating a separate
> data structure.  In addition to eliminating an atomic allocation, this
> will allow hoisting much of the IRTE update logic to common x86.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c   | 49 ++++++++++++++++-----------------------
>   include/linux/kvm_irqfd.h |  3 +++
>   2 files changed, 23 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 04dfd898ea8d..967618ba743a 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -774,27 +774,30 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
>   	return ret;
>   }
>   
> -static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
> +static void svm_ir_list_del(struct vcpu_svm *svm,
> +			    struct kvm_kernel_irqfd *irqfd,
> +			    struct amd_iommu_pi_data *pi)
>   {
>   	unsigned long flags;
> -	struct amd_svm_iommu_ir *cur;
> +	struct kvm_kernel_irqfd *cur;
>   
>   	spin_lock_irqsave(&svm->ir_list_lock, flags);
> -	list_for_each_entry(cur, &svm->ir_list, node) {
> -		if (cur->data != pi->ir_data)
> +	list_for_each_entry(cur, &svm->ir_list, vcpu_list) {
> +		if (cur->irq_bypass_data != pi->ir_data)
>   			continue;
> -		list_del(&cur->node);
> -		kfree(cur);
> +		if (WARN_ON_ONCE(cur != irqfd))
> +			continue;
> +		list_del(&irqfd->vcpu_list);
>   		break;
>   	}
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
>   }
>   
> -static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
> +static int svm_ir_list_add(struct vcpu_svm *svm,
> +			   struct kvm_kernel_irqfd *irqfd,
> +			   struct amd_iommu_pi_data *pi)
>   {
> -	int ret = 0;
>   	unsigned long flags;
> -	struct amd_svm_iommu_ir *ir;
>   	u64 entry;
>   
>   	if (WARN_ON_ONCE(!pi->ir_data))
> @@ -811,25 +814,14 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>   		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
>   		struct vcpu_svm *prev_svm;
>   
> -		if (!prev_vcpu) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> +		if (!prev_vcpu)
> +			return -EINVAL;
>   
>   		prev_svm = to_svm(prev_vcpu);
> -		svm_ir_list_del(prev_svm, pi);
> +		svm_ir_list_del(prev_svm, irqfd, pi);
>   	}
>   
> -	/**
> -	 * Allocating new amd_iommu_pi_data, which will get
> -	 * add to the per-vcpu ir_list.
> -	 */
> -	ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_ATOMIC | __GFP_ACCOUNT);
> -	if (!ir) {
> -		ret = -ENOMEM;
> -		goto out;
> -	}
> -	ir->data = pi->ir_data;
> +	irqfd->irq_bypass_data = pi->ir_data;
>   
>   	spin_lock_irqsave(&svm->ir_list_lock, flags);
>   
> @@ -844,10 +836,9 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>   		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
>   				    true, pi->ir_data);
>   
> -	list_add(&ir->node, &svm->ir_list);
> +	list_add(&irqfd->vcpu_list, &svm->ir_list);
>   	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> -out:
> -	return ret;
> +	return 0;
>   }
>   
>   /*
> @@ -951,7 +942,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			 * scheduling information in IOMMU irte.
>   			 */
>   			if (!ret && pi.is_guest_mode)
> -				svm_ir_list_add(svm, &pi);
> +				svm_ir_list_add(svm, irqfd, &pi);
>   		}
>   
>   		if (!ret && svm) {
> @@ -991,7 +982,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   
>   			vcpu = kvm_get_vcpu_by_id(kvm, id);
>   			if (vcpu)
> -				svm_ir_list_del(to_svm(vcpu), &pi);
> +				svm_ir_list_del(to_svm(vcpu), irqfd, &pi);
>   		}
>   	} else {
>   		ret = 0;
> diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
> index 8ad43692e3bb..6510a48e62aa 100644
> --- a/include/linux/kvm_irqfd.h
> +++ b/include/linux/kvm_irqfd.h
> @@ -59,6 +59,9 @@ struct kvm_kernel_irqfd {
>   	struct work_struct shutdown;
>   	struct irq_bypass_consumer consumer;
>   	struct irq_bypass_producer *producer;
> +
> +	struct list_head vcpu_list;
> +	void *irq_bypass_data;
>   };
>   
>   #endif /* __LINUX_KVM_IRQFD_H */

Hi Sean,
You missed updating the functions avic_set_pi_irte_mode and
avic_update_iommu_vcpu_affinity, which iterate over the ir_list.

Regards
Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable
  2025-04-04 19:38 ` [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable Sean Christopherson
@ 2025-04-11  8:08   ` Sairaj Kodilkar
  2025-04-11 14:16     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11  8:08 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Vasant Hegde, Naveen N Rao

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Restore an IRTE back to host control (remapped or posted MSI mode) if the
> *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of
> the GSI routing type.  Updating the IRTE if and only if the new GSI is an
> MSI results in KVM leaving an IRTE posting to a vCPU.
> 
> The dangling IRTE can result in interrupts being incorrectly delivered to
> the guest, and in the worst case scenario can result in use-after-free,
> e.g. if the VM is torn down, but the underlying host IRQ isn't freed.
> 
> Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts")
> Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c        | 61 ++++++++++++++++++----------------
>   arch/x86/kvm/vmx/posted_intr.c | 28 ++++++----------
>   2 files changed, 43 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index a961e6e67050..ef08356fdb1c 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -896,6 +896,7 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   {
>   	struct kvm_kernel_irq_routing_entry *e;
>   	struct kvm_irq_routing_table *irq_rt;
> +	bool enable_remapped_mode = true;
>   	int idx, ret = 0;
>   
>   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> @@ -932,6 +933,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   		    kvm_vcpu_apicv_active(&svm->vcpu)) {
>   			struct amd_iommu_pi_data pi;
>   
> +			enable_remapped_mode = false;
> +
>   			/* Try to enable guest_mode in IRTE */
>   			pi.base = __sme_set(page_to_phys(svm->avic_backing_page) &
>   					    AVIC_HPA_MASK);
> @@ -950,33 +953,6 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   			 */
>   			if (!ret && pi.is_guest_mode)
>   				svm_ir_list_add(svm, &pi);
> -		} else {
> -			/* Use legacy mode in IRTE */
> -			struct amd_iommu_pi_data pi;
> -
> -			/**
> -			 * Here, pi is used to:
> -			 * - Tell IOMMU to use legacy mode for this interrupt.
> -			 * - Retrieve ga_tag of prior interrupt remapping data.
> -			 */
> -			pi.prev_ga_tag = 0;
> -			pi.is_guest_mode = false;
> -			ret = irq_set_vcpu_affinity(host_irq, &pi);
> -
> -			/**
> -			 * Check if the posted interrupt was previously
> -			 * setup with the guest_mode by checking if the ga_tag
> -			 * was cached. If so, we need to clean up the per-vcpu
> -			 * ir_list.
> -			 */
> -			if (!ret && pi.prev_ga_tag) {
> -				int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
> -				struct kvm_vcpu *vcpu;
> -
> -				vcpu = kvm_get_vcpu_by_id(kvm, id);
> -				if (vcpu)
> -					svm_ir_list_del(to_svm(vcpu), &pi);
> -			}
>   		}
>   
>   		if (!ret && svm) {
> @@ -991,7 +967,36 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   		}
>   	}
>   
> -	ret = 0;
> +	if (enable_remapped_mode) {
> +		/* Use legacy mode in IRTE */
> +		struct amd_iommu_pi_data pi;
> +
> +		/**
> +		 * Here, pi is used to:
> +		 * - Tell IOMMU to use legacy mode for this interrupt.
> +		 * - Retrieve ga_tag of prior interrupt remapping data.
> +		 */
> +		pi.prev_ga_tag = 0;
> +		pi.is_guest_mode = false;
> +		ret = irq_set_vcpu_affinity(host_irq, &pi);
> +
> +		/**
> +		 * Check if the posted interrupt was previously
> +		 * setup with the guest_mode by checking if the ga_tag
> +		 * was cached. If so, we need to clean up the per-vcpu
> +		 * ir_list.
> +		 */
> +		if (!ret && pi.prev_ga_tag) {
> +			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
> +			struct kvm_vcpu *vcpu;
> +
> +			vcpu = kvm_get_vcpu_by_id(kvm, id);
> +			if (vcpu)
> +				svm_ir_list_del(to_svm(vcpu), &pi);
> +		}
> +	} else {
> +		ret = 0;
> +	}

Hi Sean,
I think you can remove this else branch and the "ret = 0", because code
reaches this point only when irq_set_vcpu_affinity() is successful,
ensuring that ret is already 0.

.../...

Regards
Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-04 19:38 ` [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts Sean Christopherson
@ 2025-04-11  8:28   ` Sairaj Kodilkar
  2025-04-11 14:10     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11  8:28 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Naveen N Rao, Vasant Hegde

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
> enabled, as KVM shouldn't try to enable posting when they're unsupported,
> and the IOMMU driver darn well should only advertise posting support when
> AMD_IOMMU_GUEST_IR_VAPIC() is true.
> 
> Note, KVM consumes is_guest_mode only on success.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   drivers/iommu/amd/iommu.c | 13 +++----------
>   1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index b3a01b7757ee..4f69a37cf143 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>   	if (!dev_data || !dev_data->use_vapic)
>   		return -EINVAL;
>   
> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> +		return -EINVAL;
> +

Hi Sean,
'dev_data->use_vapic' is always zero when the AMD IOMMU uses legacy
interrupts, i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
Hence you can remove this additional check.

Regards
Sairaj Kodilkar

>   	ir_data->cfg = irqd_cfg(data);
>   	pi_data->ir_data = ir_data;
>   
> -	/* Note:
> -	 * SVM tries to set up for VAPIC mode, but we are in
> -	 * legacy mode. So, we force legacy mode instead.
> -	 */
> -	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)) {
> -		pr_debug("%s: Fall back to using intr legacy remap\n",
> -			 __func__);
> -		pi_data->is_guest_mode = false;
> -	}
> -
>   	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
>   	if (pi_data->is_guest_mode) {
>   		ir_data->ga_root_ptr = (pi_data->base >> 12);


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-04 19:38 ` [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE Sean Christopherson
@ 2025-04-11  8:34   ` Sairaj Kodilkar
  2025-04-11 14:05     ` Sean Christopherson
  2025-04-18 12:25   ` Vasant Hegde
  1 sibling, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11  8:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Vasant Hegde, Naveen N Rao, Suthikulpanit, Suravee

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
> invoked without use_vapic; lying to KVM about whether or not the IRTE was
> configured to post IRQs is all kinds of bad.
> 
> Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   drivers/iommu/amd/iommu.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index cd5116d8c3b2..b3a01b7757ee 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3850,7 +3850,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>   	 * we should not modify the IRTE
>   	 */
>   	if (!dev_data || !dev_data->use_vapic)
> -		return 0;
> +		return -EINVAL;
>   

Hi Sean,
you can update the following functions as well to return an error when
the IOMMU is using legacy interrupt mode.
1. amd_iommu_update_ga
2. amd_iommu_activate_guest_mode
3. amd_iommu_deactivate_guest_mode

Currently these functions return 0 to the kvm layer when they fail to
set the IRTE.

Regards
Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  2025-04-04 19:38 ` [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
@ 2025-04-11 10:57   ` Arun Kodilkar, Sairaj
  2025-04-11 14:01     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Arun Kodilkar, Sairaj @ 2025-04-11 10:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Naveen N Rao, Vasant Hegde

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> When updating IRTEs in response to a GSI routing or IRQ bypass change,
> pass the new/current routing information along with the associated irqfd.
> This will allow KVM x86 to harden, simplify, and deduplicate its code.
> 
> Since adding/removing a bypass producer is now conveniently protected with
> irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
> use the routing information cached in the irqfd instead of looking up
> the information in the current GSI routing tables.
> 
> Opportunistically convert an existing printk() to pr_info() and put its
> string onto a single line (old code that strictly adhered to 80 chars).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  6 ++++--
>   arch/x86/kvm/svm/avic.c         | 18 +++++++----------
>   arch/x86/kvm/svm/svm.h          |  5 +++--
>   arch/x86/kvm/vmx/posted_intr.c  | 19 ++++++++---------
>   arch/x86/kvm/vmx/posted_intr.h  |  8 ++++++--
>   arch/x86/kvm/x86.c              | 36 ++++++++++++++++++---------------
>   include/linux/kvm_host.h        |  7 +++++--
>   virt/kvm/eventfd.c              | 11 +++++-----
>   8 files changed, 58 insertions(+), 52 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6e8be274c089..54f3cf73329b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -294,6 +294,7 @@ enum x86_intercept_stage;
>    */
>   #define KVM_APIC_PV_EOI_PENDING	1
>   
> +struct kvm_kernel_irqfd;
>   struct kvm_kernel_irq_routing_entry;
>   
>   /*
> @@ -1828,8 +1829,9 @@ struct kvm_x86_ops {
>   	void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
>   	void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
>   
> -	int (*pi_update_irte)(struct kvm *kvm, unsigned int host_irq,
> -			      uint32_t guest_irq, bool set);
> +	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> +			      unsigned int host_irq, uint32_t guest_irq,
> +			      struct kvm_kernel_irq_routing_entry *new);
>   	void (*pi_start_assignment)(struct kvm *kvm);
>   	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
>   	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 1708ea55125a..04dfd898ea8d 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -18,6 +18,7 @@
>   #include <linux/hashtable.h>
>   #include <linux/amd-iommu.h>
>   #include <linux/kvm_host.h>
> +#include <linux/kvm_irqfd.h>
>   
>   #include <asm/irq_remapping.h>
>   
> @@ -885,21 +886,14 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
>   	return 0;
>   }
>   
> -/*
> - * avic_pi_update_irte - set IRTE for Posted-Interrupts
> - *
> - * @kvm: kvm
> - * @host_irq: host irq of the interrupt
> - * @guest_irq: gsi of the interrupt
> - * @set: set or unset PI
> - * returns 0 on success, < 0 on failure
> - */
> -int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> -			uint32_t guest_irq, bool set)
> +int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> +			unsigned int host_irq, uint32_t guest_irq,
> +			struct kvm_kernel_irq_routing_entry *new)
>   {
>   	struct kvm_kernel_irq_routing_entry *e;
>   	struct kvm_irq_routing_table *irq_rt;
>   	bool enable_remapped_mode = true;
> +	bool set = !!new;
>   	int idx, ret = 0;
>   
>   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> @@ -925,6 +919,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   		if (e->type != KVM_IRQ_ROUTING_MSI)
>   			continue;
>   
> +		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> +
> 

Hi Sean,

In kvm_irq_routing_update(), it's possible that there are multiple
entries in the `kvm_irq_routing_table`, and `irqfd_update()` ends up
setting the new entry's type to 0 instead of copying the entry:

if (n_entries == 1)
     irqfd->irq_entry = *e;
else
     irqfd->irq_entry.type = 0;

Since irqfd_update() did not copy the entry to irqfd->irq_entry, the
"new" entry will not match the entry "e" obtained from irq_rt, which can
trigger a spurious WARN_ON.

Let me know if I am missing something here.

Regards
Sairaj Kodilkar

>  		/**
>   		 * Here, we setup with legacy mode in the following cases:
>   		 * 1. When cannot target interrupt to a specific vcpu.
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index d4490eaed55d..294d5594c724 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -731,8 +731,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   void avic_vcpu_put(struct kvm_vcpu *vcpu);
>   void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
>   void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
> -int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> -			uint32_t guest_irq, bool set);
> +int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> +			unsigned int host_irq, uint32_t guest_irq,
> +			struct kvm_kernel_irq_routing_entry *new);
>   void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
>   void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
>   void avic_ring_doorbell(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 78ba3d638fe8..1b6b655a2b8a 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -2,6 +2,7 @@
>   #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>   
>   #include <linux/kvm_host.h>
> +#include <linux/kvm_irqfd.h>
>   
>   #include <asm/irq_remapping.h>
>   #include <asm/cpu.h>
> @@ -259,17 +260,9 @@ void vmx_pi_start_assignment(struct kvm *kvm)
>   	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
>   }
>   
> -/*
> - * vmx_pi_update_irte - set IRTE for Posted-Interrupts
> - *
> - * @kvm: kvm
> - * @host_irq: host irq of the interrupt
> - * @guest_irq: gsi of the interrupt
> - * @set: set or unset PI
> - * returns 0 on success, < 0 on failure
> - */
> -int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> -		       uint32_t guest_irq, bool set)
> +int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> +		       unsigned int host_irq, uint32_t guest_irq,
> +		       struct kvm_kernel_irq_routing_entry *new)
>   {
>   	struct kvm_kernel_irq_routing_entry *e;
>   	struct kvm_irq_routing_table *irq_rt;
> @@ -277,6 +270,7 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   	struct kvm_lapic_irq irq;
>   	struct kvm_vcpu *vcpu;
>   	struct vcpu_data vcpu_info;
> +	bool set = !!new;
>   	int idx, ret = 0;
>   
>   	if (!vmx_can_use_vtd_pi(kvm))
> @@ -294,6 +288,9 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>   	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
>   		if (e->type != KVM_IRQ_ROUTING_MSI)
>   			continue;
> +
> +		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> +
>   		/*
>   		 * VT-d PI cannot support posting multicast/broadcast
>   		 * interrupts to a vCPU, we still use interrupt remapping
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index ad9116a99bcc..a586d6aaf862 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -3,6 +3,9 @@
>   #define __KVM_X86_VMX_POSTED_INTR_H
>   
>   #include <linux/bitmap.h>
> +#include <linux/find.h>
> +#include <linux/kvm_host.h>
> +
>   #include <asm/posted_intr.h>
>   
>   void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
> @@ -10,8 +13,9 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
>   void pi_wakeup_handler(void);
>   void __init pi_init_cpu(int cpu);
>   bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
> -int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> -		       uint32_t guest_irq, bool set);
> +int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> +		       unsigned int host_irq, uint32_t guest_irq,
> +		       struct kvm_kernel_irq_routing_entry *new);
>   void vmx_pi_start_assignment(struct kvm *kvm);
>   
>   static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dcc173852dc5..23376fcd928c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13570,31 +13570,31 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
>   	struct kvm_kernel_irqfd *irqfd =
>   		container_of(cons, struct kvm_kernel_irqfd, consumer);
>   	struct kvm *kvm = irqfd->kvm;
> -	int ret;
> +	int ret = 0;
>   
>   	kvm_arch_start_assignment(irqfd->kvm);
>   
>   	spin_lock_irq(&kvm->irqfds.lock);
>   	irqfd->producer = prod;
>   
> -	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
> -					   prod->irq, irqfd->gsi, 1);
> -	if (ret)
> -		kvm_arch_end_assignment(irqfd->kvm);
> -
> +	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> +		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> +						   irqfd->gsi, &irqfd->irq_entry);
> +		if (ret)
> +			kvm_arch_end_assignment(irqfd->kvm);
> +	}
>   	spin_unlock_irq(&kvm->irqfds.lock);
>   
> -
>   	return ret;
>   }
>   
>   void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
>   				      struct irq_bypass_producer *prod)
>   {
> -	int ret;
>   	struct kvm_kernel_irqfd *irqfd =
>   		container_of(cons, struct kvm_kernel_irqfd, consumer);
>   	struct kvm *kvm = irqfd->kvm;
> +	int ret;
>   
>   	WARN_ON(irqfd->producer != prod);
>   
> @@ -13607,11 +13607,13 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
>   	spin_lock_irq(&kvm->irqfds.lock);
>   	irqfd->producer = NULL;
>   
> -	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
> -					   prod->irq, irqfd->gsi, 0);
> -	if (ret)
> -		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
> -		       " fails: %d\n", irqfd->consumer.token, ret);
> +	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> +		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> +						   irqfd->gsi, NULL);
> +		if (ret)
> +			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
> +				irqfd->consumer.token, ret);
> +	}
>   
>   	spin_unlock_irq(&kvm->irqfds.lock);
>   
> @@ -13619,10 +13621,12 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
>   	kvm_arch_end_assignment(irqfd->kvm);
>   }
>   
> -int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
> -				   uint32_t guest_irq, bool set)
> +int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
> +				  struct kvm_kernel_irq_routing_entry *old,
> +				  struct kvm_kernel_irq_routing_entry *new)
>   {
> -	return kvm_x86_call(pi_update_irte)(kvm, host_irq, guest_irq, set);
> +	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
> +					    irqfd->gsi, new);
>   }
>   
>   bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 5438a1b446a6..2d9f3aeb766a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2383,6 +2383,8 @@ struct kvm_vcpu *kvm_get_running_vcpu(void);
>   struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
>   
>   #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
> +struct kvm_kernel_irqfd;
> +
>   bool kvm_arch_has_irq_bypass(void);
>   int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
>   			   struct irq_bypass_producer *);
> @@ -2390,8 +2392,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
>   			   struct irq_bypass_producer *);
>   void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
>   void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
> -int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
> -				  uint32_t guest_irq, bool set);
> +int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
> +				  struct kvm_kernel_irq_routing_entry *old,
> +				  struct kvm_kernel_irq_routing_entry *new);
>   bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
>   				  struct kvm_kernel_irq_routing_entry *);
>   #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 249ba5b72e9b..ad71e3e4d1c3 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -285,9 +285,9 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
>   {
>   }
>   
> -int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
> -				struct kvm *kvm, unsigned int host_irq,
> -				uint32_t guest_irq, bool set)
> +int __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
> +					 struct kvm_kernel_irq_routing_entry *old,
> +					 struct kvm_kernel_irq_routing_entry *new)
>   {
>   	return 0;
>   }
> @@ -619,9 +619,8 @@ void kvm_irq_routing_update(struct kvm *kvm)
>   #ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
>   		if (irqfd->producer &&
>   		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
> -			int ret = kvm_arch_update_irqfd_routing(
> -					irqfd->kvm, irqfd->producer->irq,
> -					irqfd->gsi, 1);
> +			int ret = kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
> +
>   			WARN_ON(ret);
>   		}
>   #endif


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  2025-04-11 10:57   ` Arun Kodilkar, Sairaj
@ 2025-04-11 14:01     ` Sean Christopherson
  2025-04-11 17:22       ` Sairaj Kodilkar
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 14:01 UTC (permalink / raw)
  To: Sairaj Arun Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao, Vasant Hegde

On Fri, Apr 11, 2025, Arun Kodilkar, Sairaj wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > +int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> > +			unsigned int host_irq, uint32_t guest_irq,
> > +			struct kvm_kernel_irq_routing_entry *new)
> >   {
> >   	struct kvm_kernel_irq_routing_entry *e;
> >   	struct kvm_irq_routing_table *irq_rt;
> >   	bool enable_remapped_mode = true;
> > +	bool set = !!new;
> >   	int idx, ret = 0;
> >   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> > @@ -925,6 +919,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> >   		if (e->type != KVM_IRQ_ROUTING_MSI)
> >   			continue;
> > +		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> > +
> > 
> 
> Hi Sean,
> 
> In kvm_irq_routing_update() function, its possible that there are
> multiple entries in the `kvm_irq_routing_table`,

Not if one of them is an MSI.  In setup_routing_entry():

	/*
	 * Do not allow GSI to be mapped to the same irqchip more than once.
	 * Allow only one to one mapping between GSI and non-irqchip routing.
	 */
	hlist_for_each_entry(ei, &rt->map[gsi], link)
		if (ei->type != KVM_IRQ_ROUTING_IRQCHIP ||
		    ue->type != KVM_IRQ_ROUTING_IRQCHIP ||
		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
			return -EINVAL;

> and `irqfd_update()` ends up setting up the new entry type to 0 instead of
> copying the entry.
> 
> if (n_entries == 1)
>     irqfd->irq_entry = *e;
> else
>     irqfd->irq_entry.type = 0;
> 
> Since irqfd_update() did not copy the entry to irqfd->entries, the "new"
> will not match entry "e" obtained from irq_rt, which can trigger a false
> WARN_ON.

And since there can only be one MSI, if there are multiple routing entries, then
the WARN won't be reached thanks to the continue that's just above:

		if (e->type != KVM_IRQ_ROUTING_MSI)
			continue;

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-11  8:34   ` Sairaj Kodilkar
@ 2025-04-11 14:05     ` Sean Christopherson
  2025-04-11 17:02       ` Sairaj Kodilkar
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 14:05 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Vasant Hegde, Naveen N Rao, Suravee Suthikulpanit

On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
> > invoked without use_vapic; lying to KVM about whether or not the IRTE was
> > configured to post IRQs is all kinds of bad.
> > 
> > Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   drivers/iommu/amd/iommu.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > index cd5116d8c3b2..b3a01b7757ee 100644
> > --- a/drivers/iommu/amd/iommu.c
> > +++ b/drivers/iommu/amd/iommu.c
> > @@ -3850,7 +3850,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> >   	 * we should not modify the IRTE
> >   	 */
> >   	if (!dev_data || !dev_data->use_vapic)
> > -		return 0;
> > +		return -EINVAL;
> 
> Hi Sean,
> you can update following functions as well to return error when
> IOMMU is using legacy interrupt mode.
> 1. amd_iommu_update_ga
> 2. amd_iommu_activate_guest_mode
> 3. amd_iommu_deactivate_guest_mode

Heh, I'm well aware, and this series gets there eventually (the end product WARNs
and returns an error in all three functions).  I fixed amd_ir_set_vcpu_affinity()
early in the series because it's the initial API that KVM will use to configure
an IRTE for posting to a vCPU.  I.e. to reach the other helpers, KVM would need
to ignore the error returned by amd_ir_set_vcpu_affinity().

> Currently these functions return 0 to the kvm layer when they fail to
> set the IRTE.
> 
> Regards
> Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-11  8:28   ` Sairaj Kodilkar
@ 2025-04-11 14:10     ` Sean Christopherson
  2025-04-11 17:03       ` Sairaj Kodilkar
                         ` (2 more replies)
  0 siblings, 3 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 14:10 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao, Vasant Hegde

On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
> > enabled, as KVM shouldn't try to enable posting when they're unsupported,
> > and the IOMMU driver darn well should only advertise posting support when
> > AMD_IOMMU_GUEST_IR_VAPIC() is true.
> > 
> > Note, KVM consumes is_guest_mode only on success.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   drivers/iommu/amd/iommu.c | 13 +++----------
> >   1 file changed, 3 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > index b3a01b7757ee..4f69a37cf143 100644
> > --- a/drivers/iommu/amd/iommu.c
> > +++ b/drivers/iommu/amd/iommu.c
> > @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> >   	if (!dev_data || !dev_data->use_vapic)
> >   		return -EINVAL;
> > +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> > +		return -EINVAL;
> > +
> 
> Hi Sean,
> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
> Hence you can remove this additional check.

Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
IRQ posting is unsupported, and that would make this consistent with the end
behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().

	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
		return -EINVAL;

	if (ir_data->iommu == NULL)
		return -EINVAL;

	dev_data = search_dev_data(ir_data->iommu, irte_info->devid);

	/* Note:
	 * This device has never been set up for guest mode.
	 * we should not modify the IRTE
	 */
	if (!dev_data || !dev_data->use_vapic)
		return -EINVAL;

I'd like to keep the WARN so that someone will notice if KVM screws up.

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable
  2025-04-11  8:08   ` Sairaj Kodilkar
@ 2025-04-11 14:16     ` Sean Christopherson
  2025-04-15 11:36       ` Paolo Bonzini
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 14:16 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Vasant Hegde, Naveen N Rao

On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > @@ -991,7 +967,36 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
> >   		}
> >   	}
> > -	ret = 0;
> > +	if (enable_remapped_mode) {
> > +		/* Use legacy mode in IRTE */
> > +		struct amd_iommu_pi_data pi;
> > +
> > +		/**
> > +		 * Here, pi is used to:
> > +		 * - Tell IOMMU to use legacy mode for this interrupt.
> > +		 * - Retrieve ga_tag of prior interrupt remapping data.
> > +		 */
> > +		pi.prev_ga_tag = 0;
> > +		pi.is_guest_mode = false;
> > +		ret = irq_set_vcpu_affinity(host_irq, &pi);
> > +
> > +		/**
> > +		 * Check if the posted interrupt was previously
> > +		 * setup with the guest_mode by checking if the ga_tag
> > +		 * was cached. If so, we need to clean up the per-vcpu
> > +		 * ir_list.
> > +		 */
> > +		if (!ret && pi.prev_ga_tag) {
> > +			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
> > +			struct kvm_vcpu *vcpu;
> > +
> > +			vcpu = kvm_get_vcpu_by_id(kvm, id);
> > +			if (vcpu)
> > +				svm_ir_list_del(to_svm(vcpu), &pi);
> > +		}
> > +	} else {
> > +		ret = 0;
> > +	}
> 
> Hi Sean,
> I think you can remove this else and "ret = 0". Because Code will come to
> this point when irq_set_vcpu_affinity() is successful, ensuring that ret is
> 0.

Ah, nice, because of this:

		if (ret < 0) {
			pr_err("%s: failed to update PI IRTE\n", __func__);
			goto out;
		}

However, looking at this again, I'm very tempted to simply leave the "ret = 0;"
that's already there so as to minimize the change.  It'll get cleaned up later on
no matter what, so safety for LTS kernels is the driving factor as of this patch.

Paolo, any preference?

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  2025-04-11  7:47   ` Arun Kodilkar, Sairaj
@ 2025-04-11 14:32     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 14:32 UTC (permalink / raw)
  To: Sairaj Arun Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

On Fri, Apr 11, 2025, Arun Kodilkar, Sairaj wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
> > index 8ad43692e3bb..6510a48e62aa 100644
> > --- a/include/linux/kvm_irqfd.h
> > +++ b/include/linux/kvm_irqfd.h
> > @@ -59,6 +59,9 @@ struct kvm_kernel_irqfd {
> >   	struct work_struct shutdown;
> >   	struct irq_bypass_consumer consumer;
> >   	struct irq_bypass_producer *producer;
> > +
> > +	struct list_head vcpu_list;
> > +	void *irq_bypass_data;
> >   };
> >   #endif /* __LINUX_KVM_IRQFD_H */
> 
> Hi Sean,
> You missed to update the functions avic_set_pi_irte_mode and
> avic_update_iommu_vcpu_affinity, which iterate over the ir_list.

Well bugger, I did indeed.  And now I'm questioning my (hacky) testing, as I don't
see how avic_update_iommu_vcpu_affinity() survived.

Oh, wow.  This is disgustingly hilarious.  By dumb luck, the offset of the data
pointer relative to the list_head structure is the same in amd_svm_iommu_ir and
kvm_kernel_irqfd.  And the usage in avic_set_pi_irte_mode() and
avic_update_iommu_vcpu_affinity() only ever touches the data, not "svm".

So even though the structure is completely wrong, the math works out and
avic_set_pi_irte_mode() and avic_update_iommu_vcpu_affinity() unknowingly pass in
irq_bypass_data, and all is well.

struct amd_svm_iommu_ir {
	struct list_head node;	/* Used by SVM for per-vcpu ir_list */
	void *data;		/* Storing pointer to struct amd_ir_data */
	struct vcpu_svm *svm;
};


struct kvm_kernel_irqfd {
	...

	struct kvm_vcpu *irq_bypass_vcpu;
	struct list_head vcpu_list;
	void *irq_bypass_data;
};

Great catch!  And thanks for the reviews!

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-11 14:05     ` Sean Christopherson
@ 2025-04-11 17:02       ` Sairaj Kodilkar
  2025-04-11 19:30         ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11 17:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Vasant Hegde, Naveen N Rao, Suravee Suthikulpanit



On 4/11/2025 7:35 PM, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
>>> invoked without use_vapic; lying to KVM about whether or not the IRTE was
>>> configured to post IRQs is all kinds of bad.
>>>
>>> Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    drivers/iommu/amd/iommu.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index cd5116d8c3b2..b3a01b7757ee 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3850,7 +3850,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>>>    	 * we should not modify the IRTE
>>>    	 */
>>>    	if (!dev_data || !dev_data->use_vapic)
>>> -		return 0;
>>> +		return -EINVAL;
>>
>> Hi Sean,
>> you can update following functions as well to return error when
>> IOMMU is using legacy interrupt mode.
>> 1. amd_iommu_update_ga
>> 2. amd_iommu_activate_guest_mode
>> 3. amd_iommu_deactivate_guest_mode
> 
> Heh, I'm well aware, and this series gets there eventually (the end product WARNs
> and returns an error in all three functions).  I fixed amd_ir_set_vcpu_affinity()
> early in the series because it's the initial API that KVM will use to configure
> an IRTE for posting to a vCPU.  I.e. to reach the other helpers, KVM would need
> to ignore the error returned by amd_ir_set_vcpu_affinity().
> 

Ohh sorry about that. Since I was reviewing the patches sequentially, I
had not yet come across those changes.

Regards
Sairaj Kodilkar

>> Currently these functions return 0 to the kvm layer when they fail to
>> set the IRTE.
>>
>> Regards
>> Sairaj Kodilkar


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-11 14:10     ` Sean Christopherson
@ 2025-04-11 17:03       ` Sairaj Kodilkar
  2025-04-15 11:42       ` Paolo Bonzini
  2025-04-15 17:48       ` Vasant Hegde
  2 siblings, 0 replies; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11 17:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao, Vasant Hegde



On 4/11/2025 7:40 PM, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
>>> enabled, as KVM shouldn't try to enable posting when they're unsupported,
>>> and the IOMMU driver darn well should only advertise posting support when
>>> AMD_IOMMU_GUEST_IR_VAPIC() is true.
>>>
>>> Note, KVM consumes is_guest_mode only on success.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    drivers/iommu/amd/iommu.c | 13 +++----------
>>>    1 file changed, 3 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index b3a01b7757ee..4f69a37cf143 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>>>    	if (!dev_data || !dev_data->use_vapic)
>>>    		return -EINVAL;
>>> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
>>> +		return -EINVAL;
>>> +
>>
>> Hi Sean,
>> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
>> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
>> Hence you can remove this additional check.
> 
> Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
> IRQ posting is unsupported, and that would make this consistent with the end
> behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().
> 
> 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> 		return -EINVAL;
> 
> 	if (ir_data->iommu == NULL)
> 		return -EINVAL;
> 
> 	dev_data = search_dev_data(ir_data->iommu, irte_info->devid);
> 
> 	/* Note:
> 	 * This device has never been set up for guest mode.
> 	 * we should not modify the IRTE
> 	 */
> 	if (!dev_data || !dev_data->use_vapic)
> 		return -EINVAL;
> 
> I'd like to keep the WARN so that someone will notice if KVM screws up.

Yeah, makes sense. Let's move it above.

Thanks
Sairaj


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs
  2025-04-11 14:01     ` Sean Christopherson
@ 2025-04-11 17:22       ` Sairaj Kodilkar
  0 siblings, 0 replies; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-11 17:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao, Vasant Hegde



On 4/11/2025 7:31 PM, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Arun Kodilkar, Sairaj wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> +int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>>> +			unsigned int host_irq, uint32_t guest_irq,
>>> +			struct kvm_kernel_irq_routing_entry *new)
>>>    {
>>>    	struct kvm_kernel_irq_routing_entry *e;
>>>    	struct kvm_irq_routing_table *irq_rt;
>>>    	bool enable_remapped_mode = true;
>>> +	bool set = !!new;
>>>    	int idx, ret = 0;
>>>    	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
>>> @@ -925,6 +919,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>>>    		if (e->type != KVM_IRQ_ROUTING_MSI)
>>>    			continue;
>>> +		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
>>> +
>>>
>>
>> Hi Sean,
>>
>> In kvm_irq_routing_update() function, its possible that there are
>> multiple entries in the `kvm_irq_routing_table`,
> 
> Not if one of them is an MSI.  In setup_routing_entry():
> 
> 	/*
> 	 * Do not allow GSI to be mapped to the same irqchip more than once.
> 	 * Allow only one to one mapping between GSI and non-irqchip routing.
> 	 */
> 	hlist_for_each_entry(ei, &rt->map[gsi], link)
> 		if (ei->type != KVM_IRQ_ROUTING_IRQCHIP ||
> 		    ue->type != KVM_IRQ_ROUTING_IRQCHIP ||
> 		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
> 			return -EINVAL;
> 
>> and `irqfd_update()` ends up setting up the new entry type to 0 instead of
>> copying the entry.
>>
>> if (n_entries == 1)
>>      irqfd->irq_entry = *e;
>> else
>>      irqfd->irq_entry.type = 0;
>>
>> Since irqfd_update() did not copy the entry to irqfd->entries, the "new"
>> will not match entry "e" obtained from irq_rt, which can trigger a false
>> WARN_ON.
> 
> And since there can only be one MSI, if there are multiple routing entries, then
> the WARN won't be reached thanks to the continue that's just above:
> 
> 		if (e->type != KVM_IRQ_ROUTING_MSI)
> 			continue;
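
[Editorial sketch] The rule quoted from setup_routing_entry() can be modeled as a toy function. This is a simplified illustration, not the kernel implementation; in particular it ignores the third check (rejecting two irqchip entries on the same irqchip), which only makes the real rule stricter:

```python
IRQCHIP = "irqchip"
MSI = "msi"

def can_add(existing_types, new_type):
    """Simplified model of the quoted setup_routing_entry() rule:
    a GSI may hold multiple irqchip entries, but any non-irqchip
    entry (e.g. MSI) must be a one-to-one mapping."""
    for t in existing_types:
        if t != IRQCHIP or new_type != IRQCHIP:
            return False
    return True

# A GSI that already has an entry can never gain an MSI entry, and an
# MSI entry can never gain a sibling...
assert can_add([IRQCHIP], MSI) is False
assert can_add([MSI], IRQCHIP) is False
# ...so if multiple routing entries exist for a GSI, none of them is an
# MSI, and the "e->type != KVM_IRQ_ROUTING_MSI -> continue" path skips
# them all before the WARN is reached.
assert can_add([IRQCHIP], IRQCHIP) is True
```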

Thanks, I understand it now. I did not see the complete code, hence the
confusion. Sorry about that.

Regards
Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-11 17:02       ` Sairaj Kodilkar
@ 2025-04-11 19:30         ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-11 19:30 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Vasant Hegde, Naveen N Rao, Suravee Suthikulpanit

On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> 
> 
> On 4/11/2025 7:35 PM, Sean Christopherson wrote:
> > On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> > > On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > > > Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
> > > > invoked without use_vapic; lying to KVM about whether or not the IRTE was
> > > > configured to post IRQs is all kinds of bad.
> > > > 
> > > > Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >    drivers/iommu/amd/iommu.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > > > index cd5116d8c3b2..b3a01b7757ee 100644
> > > > --- a/drivers/iommu/amd/iommu.c
> > > > +++ b/drivers/iommu/amd/iommu.c
> > > > @@ -3850,7 +3850,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> > > >    	 * we should not modify the IRTE
> > > >    	 */
> > > >    	if (!dev_data || !dev_data->use_vapic)
> > > > -		return 0;
> > > > +		return -EINVAL;
> > > 
> > > Hi Sean,
> > > you can update following functions as well to return error when
> > > IOMMU is using legacy interrupt mode.
> > > 1. amd_iommu_update_ga
> > > 2. amd_iommu_activate_guest_mode
> > > 3. amd_iommu_deactivate_guest_mode
> > 
> > Heh, I'm well aware, and this series gets there eventually (the end product WARNs
> > and returns an error in all three functions).  I fixed amd_ir_set_vcpu_affinity()
> > early in the series because it's the initial API that KVM will use to configure
> > an IRTE for posting to a vCPU.  I.e. to reach the other helpers, KVM would need
> > to ignore the error returned by amd_ir_set_vcpu_affinity().
> > 
> 
> Ohh sorry about that. Since I was reviewing patches sequentially, I did
> come across those changes.

No worries, I wrote most of these patches and I can barely keep track of what all
is happening in this series.  :-)

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
  2025-04-04 19:38 ` [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
@ 2025-04-15 11:06   ` Sairaj Kodilkar
  2025-04-15 14:55     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-15 11:06 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Delete the IRTE link from the previous vCPU irrespective of the new
> routing state.  This is a glorified nop (only the ordering changes), as
> both the "posting" and "remapped" mode paths pre-delete the link.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 02b6f0007436..e9ded2488a0b 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -870,6 +870,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
>   		return 0;
>   
> +	/*
> +	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
> +	 * from the *previous* vCPU's list.
> +	 */
> +	svm_ir_list_del(irqfd);
> +
>   	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
>   		 __func__, host_irq, guest_irq, set);
>   
> @@ -892,8 +898,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   
>   		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
>   
> -		svm_ir_list_del(irqfd);
> -
>   		/**
>   		 * Here, we setup with legacy mode in the following cases:
>   		 * 1. When cannot target interrupt to a specific vcpu.

Hi Sean,
Why not combine patches 10 and 11? Is there a reason to separate
the changes?

Regards
Sairaj Kodilkar

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  2025-04-04 19:38 ` [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
@ 2025-04-15 11:11   ` Sairaj Kodilkar
  2025-04-15 14:57     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-15 11:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
Hi Sean,

> Add a helper to get the physical address of the AVIC backing page, both
> to deduplicate code and to prepare for getting the address directly from
> apic->regs, at which point it won't be all that obvious that the address
> in question is what SVM calls the AVIC backing page.
> 
> No functional change intended.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index f04010f66595..a1f4a08d35f5 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -243,14 +243,18 @@ int avic_vm_init(struct kvm *kvm)
>   	return err;
>   }
>   
> +static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
> +{
> +	return __sme_set(page_to_phys(svm->avic_backing_page));
> +}
> +

Why not introduce a generic helper like...

static phys_addr_t page_to_phys_sme_set(struct page *page)
{
	return __sme_set(page_to_phys(page));
}

and use it for avic_logical_id_table_page and
avic_physical_id_table_page as well.
		
Regards
Sairaj
>   void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
>   {
>   	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
> -	phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
>   	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
>   	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
>   
> -	vmcb->control.avic_backing_page = bpa;
> +	vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
>   	vmcb->control.avic_logical_id = lpa;
>   	vmcb->control.avic_physical_id = ppa;
>   	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
> @@ -314,7 +318,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>   	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
>   		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
>   
> -	new_entry = __sme_set(page_to_phys(svm->avic_backing_page)) |
> +	new_entry = avic_get_backing_page_address(svm) |
>   		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
>   	WRITE_ONCE(*entry, new_entry);
>   
> @@ -854,7 +858,7 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
>   	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
>   		 irq.vector);
>   	*svm = to_svm(vcpu);
> -	vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page));
> +	vcpu_info->pi_desc_addr = avic_get_backing_page_address(*svm);
>   	vcpu_info->vector = irq.vector;
>   
>   	return 0;
> @@ -915,7 +919,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			enable_remapped_mode = false;
>   
>   			/* Try to enable guest_mode in IRTE */
> -			pi.base = __sme_set(page_to_phys(svm->avic_backing_page));
> +			pi.base = avic_get_backing_page_address(svm);
>   			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
>   						     svm->vcpu.vcpu_id);
>   			pi.is_guest_mode = true;


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-04-04 19:38 ` [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
@ 2025-04-15 11:16   ` Sairaj Kodilkar
  0 siblings, 0 replies; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-15 11:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Drop avic_get_physical_id_entry()'s compatibility check on the incoming
> ID, as its sole caller, avic_init_backing_page(), performs the exact same
> check.  Drop avic_get_physical_id_entry() entirely as the only remaining
> functionality is getting the address of the Physical ID table, and
> accessing the array without an immediate bounds check is kludgy.
> 
> Opportunistically add a compile-time assertion to ensure the vcpu_id can't
> result in a bounds overflow, e.g. if KVM (really) messed up a maximum
> physical ID #define, as well as run-time assertions so that a NULL pointer
> dereference is morphed into a safer WARN().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/svm/avic.c | 47 +++++++++++++++++------------------------
>   1 file changed, 19 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index ba8dfc8a12f4..344541e418c3 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -265,35 +265,19 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
>   		avic_deactivate_vmcb(svm);
>   }
>   
> -static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu,
> -				       unsigned int index)
> -{
> -	u64 *avic_physical_id_table;
> -	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
> -
> -	if ((!x2avic_enabled && index > AVIC_MAX_PHYSICAL_ID) ||
> -	    (index > X2AVIC_MAX_PHYSICAL_ID))
> -		return NULL;
> -
> -	avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page);
> -
> -	return &avic_physical_id_table[index];
> -}
> -
>   static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>   {
> -	u64 *entry, new_entry;
> -	int id = vcpu->vcpu_id;
> +	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>   	struct vcpu_svm *svm = to_svm(vcpu);
> +	u32 id = vcpu->vcpu_id;
> +	u64 *table, new_entry;
>   
>   	/*
>   	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> -	 * hardware.  Do so immediately, i.e. don't defer the update via a
> -	 * request, as avic_vcpu_load() expects to be called if and only if the
> -	 * vCPU has fully initialized AVIC.  Immediately clear apicv_active,
> -	 * as avic_vcpu_load() assumes avic_physical_id_cache is valid, i.e.
> -	 * waiting until KVM_REQ_APICV_UPDATE is processed on the first KVM_RUN
> -	 * will result in an NULL pointer deference when loading the vCPU.
> +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> +	 * fully initialized AVIC.
>   	 */

Hi Sean,
I think the above change to the comment belongs in patch 16.

Regards
Sairaj
>   	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
>   	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> @@ -302,6 +286,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>   		return 0;
>   	}
>   
> +	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
> +		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
> +
>   	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
>   		return -EINVAL;
>   
> @@ -320,9 +307,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>   	}
>   
>   	/* Setting AVIC backing page address in the phy APIC ID table */
> -	entry = avic_get_physical_id_entry(vcpu, id);
> -	if (!entry)
> -		return -EINVAL;
> +	table = page_address(kvm_svm->avic_physical_id_table_page);
>   
>   	/* Note, fls64() returns the bit position, +1. */
>   	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
> @@ -330,9 +315,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>   
>   	new_entry = avic_get_backing_page_address(svm) |
>   		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
> -	WRITE_ONCE(*entry, new_entry);
> +	WRITE_ONCE(table[id], new_entry);
>   
> -	svm->avic_physical_id_cache = entry;
> +	svm->avic_physical_id_cache = &table[id];
>   
>   	return 0;
>   }
> @@ -1018,6 +1003,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
>   		return;
>   
> +	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
> +		return;
> +
>   	/*
>   	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
>   	 * is being scheduled in after being preempted.  The CPU entries in the
> @@ -1058,6 +1046,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>   
>   	lockdep_assert_preemption_disabled();
>   
> +	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
> +		return;
> +
>   	/*
>   	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
>   	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable
  2025-04-11 14:16     ` Sean Christopherson
@ 2025-04-15 11:36       ` Paolo Bonzini
  0 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-15 11:36 UTC (permalink / raw)
  To: Sean Christopherson, Sairaj Kodilkar
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack, Vasant Hegde,
	Naveen N Rao

On 4/11/25 16:16, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> @@ -991,7 +967,36 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
>>>    		}
>>>    	}
>>> -	ret = 0;
>>> +	if (enable_remapped_mode) {
>>> +		/* Use legacy mode in IRTE */
>>> +		struct amd_iommu_pi_data pi;
>>> +
>>> +		/**
>>> +		 * Here, pi is used to:
>>> +		 * - Tell IOMMU to use legacy mode for this interrupt.
>>> +		 * - Retrieve ga_tag of prior interrupt remapping data.
>>> +		 */
>>> +		pi.prev_ga_tag = 0;
>>> +		pi.is_guest_mode = false;
>>> +		ret = irq_set_vcpu_affinity(host_irq, &pi);
>>> +
>>> +		/**
>>> +		 * Check if the posted interrupt was previously
>>> +		 * setup with the guest_mode by checking if the ga_tag
>>> +		 * was cached. If so, we need to clean up the per-vcpu
>>> +		 * ir_list.
>>> +		 */
>>> +		if (!ret && pi.prev_ga_tag) {
>>> +			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
>>> +			struct kvm_vcpu *vcpu;
>>> +
>>> +			vcpu = kvm_get_vcpu_by_id(kvm, id);
>>> +			if (vcpu)
>>> +				svm_ir_list_del(to_svm(vcpu), &pi);
>>> +		}
>>> +	} else {
>>> +		ret = 0;
>>> +	}
>>
>> Hi Sean,
>> I think you can remove this else and "ret = 0". Because Code will come to
>> this point when irq_set_vcpu_affinity() is successful, ensuring that ret is
>> 0.
> 
> Ah, nice, because of this:
> 
> 		if (ret < 0) {
> 			pr_err("%s: failed to update PI IRTE\n", __func__);
> 			goto out;
> 		}
> 
> However, looking at this again, I'm very tempted to simply leave the "ret = 0;"
> that's already there so as to minimize the change.  It'll get cleaned up later on
> no matter what, so safety for LTS kernels is the driving factor as of this patch.
> 
> Paolo, any preference?

If you mean squashing this in:

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ef08356fdb1c..8e09f6ae98fd 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -967,6 +967,7 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
                 }
         }
  
+       ret = 0;
         if (enable_remapped_mode) {
                 /* Use legacy mode in IRTE */
                 struct amd_iommu_pi_data pi;
@@ -994,8 +995,6 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
                         if (vcpu)
                                 svm_ir_list_del(to_svm(vcpu), &pi);
                 }
-       } else {
-               ret = 0;
         }
  out:
         srcu_read_unlock(&kvm->irq_srcu, idx);

to undo the moving of "ret = 0", that's a good idea yes.

Paolo


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-11 14:10     ` Sean Christopherson
  2025-04-11 17:03       ` Sairaj Kodilkar
@ 2025-04-15 11:42       ` Paolo Bonzini
  2025-04-15 17:48       ` Vasant Hegde
  2 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-15 11:42 UTC (permalink / raw)
  To: Sean Christopherson, Sairaj Kodilkar
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack, Naveen N Rao,
	Vasant Hegde

On 4/11/25 16:10, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
>>> enabled, as KVM shouldn't try to enable posting when they're unsupported,
>>> and the IOMMU driver darn well should only advertise posting support when
>>> AMD_IOMMU_GUEST_IR_VAPIC() is true.
>>>
>>> Note, KVM consumes is_guest_mode only on success.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>    drivers/iommu/amd/iommu.c | 13 +++----------
>>>    1 file changed, 3 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index b3a01b7757ee..4f69a37cf143 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>>>    	if (!dev_data || !dev_data->use_vapic)
>>>    		return -EINVAL;
>>> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
>>> +		return -EINVAL;
>>> +
>>
>> Hi Sean,
>> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
>> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
>> Hence you can remove this additional check.
> 
> Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
> IRQ posting is unsupported, and that would make this consistent with the end
> behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().
> 
> 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> 		return -EINVAL;
> 
> 	if (ir_data->iommu == NULL)
> 		return -EINVAL;
> 
> 	dev_data = search_dev_data(ir_data->iommu, irte_info->devid);
> 
> 	/* Note:
> 	 * This device has never been set up for guest mode.
> 	 * we should not modify the IRTE
> 	 */
> 	if (!dev_data || !dev_data->use_vapic)
> 		return -EINVAL;
> 
> I'd like to keep the WARN so that someone will notice if KVM screws up.

Makes sense, avic_pi_update_irte() returns way before it has the 
occasion to call irq_set_vcpu_affinity().

Paolo



* Re: [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
  2025-04-15 11:06   ` Sairaj Kodilkar
@ 2025-04-15 14:55     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-15 14:55 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 15, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > Delete the IRTE link from the previous vCPU irrespective of the new
> > routing state.  This is a glorified nop (only the ordering changes), as
> > both the "posting" and "remapped" mode paths pre-delete the link.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/svm/avic.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 02b6f0007436..e9ded2488a0b 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -870,6 +870,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> >   	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> >   		return 0;
> > +	/*
> > +	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
> > +	 * from the *previous* vCPU's list.
> > +	 */
> > +	svm_ir_list_del(irqfd);
> > +
> >   	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
> >   		 __func__, host_irq, guest_irq, set);
> > @@ -892,8 +898,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> >   		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> > -		svm_ir_list_del(irqfd);
> > -
> >   		/**
> >   		 * Here, we setup with legacy mode in the following cases:
> >   		 * 1. When cannot target interrupt to a specific vcpu.
> 
> Hi Sean,
> Why not combine patch 10 and patch 11? Is there a reason to separate
> the changes?

To provide distinct bisection points if one (or both) changes introduces a bug.

Patch 10, "Delete IRTE link from previous vCPU before setting new IRTE", is a
non-trivial change in how KVM tracks per-vCPU IRTEs.

This patch is also a somewhat non-trivial change, in that it removes IRTEs from
the per-vCPU list even when the new routing isn't an MSI.

Ah, but the changelog for this patch is wrong (I wrote a number of the changelogs
several months after I wrote the code, ugh).  Either that or I've now confused
myself.  I'll stare at this a bit more and rewrite the changelog unless current
me is the one that's confused.


* Re: [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  2025-04-15 11:11   ` Sairaj Kodilkar
@ 2025-04-15 14:57     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-15 14:57 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 15, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Hi Sean,
> 
> > Add a helper to get the physical address of the AVIC backing page, both
> > to deduplicate code and to prepare for getting the address directly from
> > apic->regs, at which point it won't be all that obvious that the address
> > in question is what SVM calls the AVIC backing page.
> > 
> > No functional change intended.
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/svm/avic.c | 14 +++++++++-----
> >   1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index f04010f66595..a1f4a08d35f5 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -243,14 +243,18 @@ int avic_vm_init(struct kvm *kvm)
> >   	return err;
> >   }
> > +static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
> > +{
> > +	return __sme_set(page_to_phys(svm->avic_backing_page));
> > +}
> > +
> 
> Maybe introduce a generic function like...
> 
> static phys_addr_t page_to_phys_sme_set(struct page *page)
> {
> 	return __sme_set(page_to_phys(page));
> }
> 
> and use it for avic_logical_id_table_page and
> avic_physical_id_table_page as well.

Because subsequent commits remove the "struct page" tracking (it's suboptimal
and confusing), and I don't want to encourage that bad pattern in the future.


* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-11 14:10     ` Sean Christopherson
  2025-04-11 17:03       ` Sairaj Kodilkar
  2025-04-15 11:42       ` Paolo Bonzini
@ 2025-04-15 17:48       ` Vasant Hegde
  2025-04-15 22:04         ` Sean Christopherson
  2 siblings, 1 reply; 128+ messages in thread
From: Vasant Hegde @ 2025-04-15 17:48 UTC (permalink / raw)
  To: Sean Christopherson, Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao

On 4/11/2025 7:40 PM, Sean Christopherson wrote:
> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
>>> enabled, as KVM shouldn't try to enable posting when they're unsupported,
>>> and the IOMMU driver darn well should only advertise posting support when
>>> AMD_IOMMU_GUEST_IR_VAPIC() is true.
>>>
>>> Note, KVM consumes is_guest_mode only on success.
>>>
>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>> ---
>>>   drivers/iommu/amd/iommu.c | 13 +++----------
>>>   1 file changed, 3 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>> index b3a01b7757ee..4f69a37cf143 100644
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>>>   	if (!dev_data || !dev_data->use_vapic)
>>>   		return -EINVAL;
>>> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
>>> +		return -EINVAL;
>>> +
>>
>> Hi Sean,
>> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
>> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
>> Hence you can remove this additional check.
> 
> Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
> IRQ posting is unsupported, and that would make this consistent with the end
> behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().
> 
> 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))

Note that this is a global IOMMU-level check, while dev_data->use_vapic is per
device. We set the per-device flag while attaching the device to a domain,
based on the IOMMU domain type and IOMMU vAPIC support.

How about adding a WARN_ON based on dev_data->use_vapic, so that we can catch
if something went wrong on the IOMMU side as well?

-Vasant






* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-15 17:48       ` Vasant Hegde
@ 2025-04-15 22:04         ` Sean Christopherson
  2025-04-16  9:47           ` Sairaj Kodilkar
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-15 22:04 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: Sairaj Kodilkar, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Naveen N Rao

On Tue, Apr 15, 2025, Vasant Hegde wrote:
> On 4/11/2025 7:40 PM, Sean Christopherson wrote:
> > On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
> >> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> >>> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
> >>> enabled, as KVM shouldn't try to enable posting when they're unsupported,
> >>> and the IOMMU driver darn well should only advertise posting support when
> >>> AMD_IOMMU_GUEST_IR_VAPIC() is true.
> >>>
> >>> Note, KVM consumes is_guest_mode only on success.
> >>>
> >>> Signed-off-by: Sean Christopherson <seanjc@google.com>
> >>> ---
> >>>   drivers/iommu/amd/iommu.c | 13 +++----------
> >>>   1 file changed, 3 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> >>> index b3a01b7757ee..4f69a37cf143 100644
> >>> --- a/drivers/iommu/amd/iommu.c
> >>> +++ b/drivers/iommu/amd/iommu.c
> >>> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
> >>>   	if (!dev_data || !dev_data->use_vapic)
> >>>   		return -EINVAL;
> >>> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> >>> +		return -EINVAL;
> >>> +
> >>
> >> Hi Sean,
> >> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
> >> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
> >> Hence you can remove this additional check.
> > 
> > Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
> > IRQ posting is unsupported, and that would make this consistent with the end
> > behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().
> > 
> > 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
> 
> Note that this is a global IOMMU-level check, while dev_data->use_vapic is per
> device. We set the per-device flag while attaching the device to a domain,
> based on the IOMMU domain type and IOMMU vAPIC support.
> 
> How about adding a WARN_ON based on dev_data->use_vapic, so that we can catch
> if something went wrong on the IOMMU side as well?

It's not clear to me that a WARN_ON(!dev_data->use_vapic) would be free of false
positives.  AFAICT, the producers (e.g. VFIO) don't check whether or not a device
supports posting interrupts, and KVM definitely doesn't check.  And KVM is also
tolerant of irq_set_vcpu_affinity() failures, specifically for this type of
situation, so unfortunately I don't know that the IOMMU side of the world can
safely WARN.


* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-15 22:04         ` Sean Christopherson
@ 2025-04-16  9:47           ` Sairaj Kodilkar
  2025-04-17 17:37             ` Paolo Bonzini
  0 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-16  9:47 UTC (permalink / raw)
  To: Sean Christopherson, Vasant Hegde
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack,
	Naveen N Rao

On 4/16/2025 3:34 AM, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Vasant Hegde wrote:
>> On 4/11/2025 7:40 PM, Sean Christopherson wrote:
>>> On Fri, Apr 11, 2025, Sairaj Kodilkar wrote:
>>>> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
>>>>> WARN if KVM attempts to set vCPU affinity when posted interrupts aren't
>>>>> enabled, as KVM shouldn't try to enable posting when they're unsupported,
>>>>> and the IOMMU driver darn well should only advertise posting support when
>>>>> AMD_IOMMU_GUEST_IR_VAPIC() is true.
>>>>>
>>>>> Note, KVM consumes is_guest_mode only on success.
>>>>>
>>>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>>>> ---
>>>>>    drivers/iommu/amd/iommu.c | 13 +++----------
>>>>>    1 file changed, 3 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>>>> index b3a01b7757ee..4f69a37cf143 100644
>>>>> --- a/drivers/iommu/amd/iommu.c
>>>>> +++ b/drivers/iommu/amd/iommu.c
>>>>> @@ -3852,19 +3852,12 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>>>>>    	if (!dev_data || !dev_data->use_vapic)
>>>>>    		return -EINVAL;
>>>>> +	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
>>>>> +		return -EINVAL;
>>>>> +
>>>>
>>>> Hi Sean,
>>>> 'dev_data->use_vapic' is always zero when AMD IOMMU uses legacy
>>>> interrupts i.e. when AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) is 0.
>>>> Hence you can remove this additional check.
>>>
>>> Hmm, or move it above?  KVM should never call amd_ir_set_vcpu_affinity() if
>>> IRQ posting is unsupported, and that would make this consistent with the end
>>> behavior of amd_iommu_update_ga() and amd_iommu_{de,}activate_guest_mode().
>>>
>>> 	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
>>
>> Note that this is a global IOMMU-level check, while dev_data->use_vapic is per
>> device. We set the per-device flag while attaching the device to a domain,
>> based on the IOMMU domain type and IOMMU vAPIC support.
>>
>> How about adding a WARN_ON based on dev_data->use_vapic, so that we can catch
>> if something went wrong on the IOMMU side as well?
> 
> It's not clear to me that a WARN_ON(!dev_data->use_vapic) would be free of false
> positives.  AFAICT, the producers (e.g. VFIO) don't check whether or not a device
> supports posting interrupts, and KVM definitely doesn't check.  And KVM is also
> tolerant of irq_set_vcpu_affinity() failures, specifically for this type of
> situation, so unfortunately I don't know that the IOMMU side of the world can
> safely WARN.

Hi Sean,
I think it is safe to have this WARN_ON(!dev_data->use_vapic) without
any false positives. The IOMMU driver sets dev_data->use_vapic only when
the device is in an UNMANAGED domain, and it is 0 if the device is in any
other domain (DMA, DMA_FQ, IDENTITY).

We have a bigger problem on the VFIO side if we hit this WARN_ON(),
as the device is not in an UNMANAGED domain.

Regards
Sairaj Kodilkar


* Re: [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts
  2025-04-16  9:47           ` Sairaj Kodilkar
@ 2025-04-17 17:37             ` Paolo Bonzini
  0 siblings, 0 replies; 128+ messages in thread
From: Paolo Bonzini @ 2025-04-17 17:37 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Sean Christopherson, Vasant Hegde, Joerg Roedel, David Woodhouse,
	Lu Baolu, kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack, Naveen N Rao

On Wed, Apr 16, 2025 at 11:47 AM Sairaj Kodilkar <sarunkod@amd.com> wrote:
> I think it is safe to have this WARN_ON(!dev_data->use_vapic) without
> any false positives. IOMMU driver sets the dev_data->use_vapic only when
> the device is in UNMANAGE_DOMAIN and it is 0 if the device is in any
> other domain (DMA, DMA_FQ, IDENTITY).
>
> We have a bigger problem from the VFIO side if we hit this WARN_ON()
> as device is not in a UNMANGED_DOMAIN.

This does seem safe, but it's more of a VFIO/iommu change than a KVM
one, so I'm not too comfortable with merging it myself; please submit
it as a separate patch.

Paolo



* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-09 11:56   ` Joao Martins
  2025-04-10 15:45     ` Sean Christopherson
@ 2025-04-18 12:17     ` Vasant Hegde
  2025-04-18 18:48       ` Sean Christopherson
  1 sibling, 1 reply; 128+ messages in thread
From: Vasant Hegde @ 2025-04-18 12:17 UTC (permalink / raw)
  To: Joao Martins, Sean Christopherson
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Joerg Roedel,
	David Woodhouse, Lu Baolu, Paolo Bonzini

Hi Joao, Sean,


On 4/9/2025 5:26 PM, Joao Martins wrote:
> On 04/04/2025 20:39, Sean Christopherson wrote:
>> Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
>> not an IRTE is configured to generate GA log interrupts.  KVM only needs a
>> notification if the target vCPU is blocking, so the vCPU can be awakened.
>> If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
>> set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
>> task is scheduled back in, i.e. KVM doesn't need a notification.
>>
>> Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
>> from the KVM changes insofar as possible.
>>
>> Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
> so that they match amd_iommu_activate_guest_mode().
> 
> Unfortunately I think this patch and the next one might be riding on the
> assumption that amd_iommu_update_ga() is always cheap :( -- see below.
> 
> I didn't spot anything else flawed in the series though, just this one. I would
> suggest holding off on this and the next one, while progressing with the rest of
> the series.
> 
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 2e016b98fa1b..27b03e718980 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
>> +				  bool ga_log_intr)
>>  {
>>  	if (cpu >= 0) {
>>  		entry->lo.fields_vapic.destination =
>> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>  		entry->hi.fields.destination =
>>  					APICID_TO_IRTE_DEST_HI(cpu);
>>  		entry->lo.fields_vapic.is_run = true;
>> +		entry->lo.fields_vapic.ga_log_intr = false;
>>  	} else {
>>  		entry->lo.fields_vapic.is_run = false;
>> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>>  	}
>>  }
>>
> 
> isRun, Destination and GATag are not cached. Quoting the update from a few years
> back (page 93 of IOMMU spec dated Feb 2025):
> 
> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
> | IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
> | are not cached by the IOMMU. Modifications to these fields do not require an
> | invalidation of the Interrupt Remapping Table.
> 
> This is the reason we were able to get rid of the IOMMU invalidation in
> amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
> Besides the lock contention that was observed at the time, we were seeing stalls
> in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.
> 
> Now this change above is incorrect as is and to make it correct: you will need
> xor with the previous content of the IRTE::ga_log_intr and then if it changes
> then you re-add back an invalidation command via
> iommu_flush_irt_and_complete()). The latter is what I am worried will
> reintroduce these above problem :(
> 
> The invalidation command (which has a completion barrier to serialize
> invalidation execution) takes some time in h/w, and will make all your vcpus
> contend on the irq table lock (as is). Even assuming you somehow move the
> invalidation outside the lock, you will contend on the iommu lock (for the
> command queue) or, best case, assuming no locks (which I am not sure is
> possible) you will need to wait for the command to complete until you can
> progress forward with entering/exiting.
> 
> Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
> invalidation command (which would be good news!). Adding Suravee/Vasant here.

I have checked with the HW architects: it's not cached, so we don't need an
invalidation after updating the GALogIntr field in the IRTE.


-Vasant




* Re: [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  2025-04-04 19:38 ` [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
@ 2025-04-18 12:24   ` Vasant Hegde
  0 siblings, 0 replies; 128+ messages in thread
From: Vasant Hegde @ 2025-04-18 12:24 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
> GA root pointer.  KVM is the only source of amd_iommu_pi_data.base, and
> KVM's one and only path for writing amd_iommu_pi_data.base computes the
> exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
> and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
> valid, i.e. amd_iommu_pi_data.base is fully redundant.
> 
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>

-Vasant


> ---
>  arch/x86/kvm/svm/avic.c   | 7 +++++--
>  drivers/iommu/amd/iommu.c | 2 +-
>  include/linux/amd-iommu.h | 1 -
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 60e6e82fe41f..9024b9fbca53 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -902,8 +902,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>  
>  			enable_remapped_mode = false;
>  
> -			/* Try to enable guest_mode in IRTE */
> -			pi.base = avic_get_backing_page_address(svm);
> +			/*
> +			 * Try to enable guest_mode in IRTE.  Note, the address
> +			 * of the vCPU's AVIC backing page is passed to the
> +			 * IOMMU via vcpu_info->pi_desc_addr.
> +			 */
>  			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
>  						     svm->vcpu.vcpu_id);
>  			pi.is_guest_mode = true;
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 4f69a37cf143..635774642b89 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3860,7 +3860,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
>  
>  	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
>  	if (pi_data->is_guest_mode) {
> -		ir_data->ga_root_ptr = (pi_data->base >> 12);
> +		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
>  		ir_data->ga_vector = vcpu_pi_info->vector;
>  		ir_data->ga_tag = pi_data->ga_tag;
>  		ret = amd_iommu_activate_guest_mode(ir_data);
> diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
> index 062fbd4c9b77..4f433ef39188 100644
> --- a/include/linux/amd-iommu.h
> +++ b/include/linux/amd-iommu.h
> @@ -20,7 +20,6 @@ struct amd_iommu;
>  struct amd_iommu_pi_data {
>  	u32 ga_tag;
>  	u32 prev_ga_tag;
> -	u64 base;
>  	bool is_guest_mode;
>  	struct vcpu_data *vcpu_data;
>  	void *ir_data;



* Re: [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  2025-04-04 19:38 ` [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
  2025-04-08 16:57   ` Paolo Bonzini
@ 2025-04-18 12:25   ` Vasant Hegde
  1 sibling, 0 replies; 128+ messages in thread
From: Vasant Hegde @ 2025-04-18 12:25 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Delete the amd_ir_data.prev_ga_tag field now that all usage is
> superfluous.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>

-Vasant






* Re: [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE
  2025-04-04 19:38 ` [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE Sean Christopherson
  2025-04-11  8:34   ` Sairaj Kodilkar
@ 2025-04-18 12:25   ` Vasant Hegde
  1 sibling, 0 replies; 128+ messages in thread
From: Vasant Hegde @ 2025-04-18 12:25 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Return -EINVAL instead of success if amd_ir_set_vcpu_affinity() is
> invoked without use_vapic; lying to KVM about whether or not the IRTE was
> configured to post IRQs is all kinds of bad.
> 
> Fixes: d98de49a53e4 ("iommu/amd: Enable vAPIC interrupt remapping mode by default")
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>

-Vasant






* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (69 preceding siblings ...)
  2025-04-08 17:13 ` David Matlack
@ 2025-04-18 13:01 ` David Woodhouse
  2025-04-18 16:22   ` Sean Christopherson
  2025-05-15 12:08 ` Sairaj Kodilkar
  71 siblings, 1 reply; 128+ messages in thread
From: David Woodhouse @ 2025-04-18 13:01 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack


On Fri, 2025-04-04 at 12:38 -0700, Sean Christopherson wrote:
> 
> This series is well tested except for one notable gap: I was not able to
> fully test the AMD IOMMU changes.  Long story short, getting upstream
> kernels into our full test environments is practically infeasible.  And
> exposing a device or VF on systems that are available to developers is a
> bit of a mess.

If I can make AMD bare-metal "instances" available to you, would that help?

Separately, I'd quite like to see the eventfd→MSI delivery linkage not
use the IRQ routing table at all, and not need a GSI# assigned. Doing
it that way is just a scaling and performance issue.

I recently looked through the irqfd code and came to the conclusion
that it wouldn't be hard to add a new user API which allows us to
simply configure the kvm_irq_routing_entry to be delivered when a given
eventfd fires, without using the table.

I haven't had a chance to look hard, hopefully your rework doesn't make
that any less feasible...



* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-18 13:01 ` David Woodhouse
@ 2025-04-18 16:22   ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-18 16:22 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, Joerg Roedel, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On Fri, Apr 18, 2025, David Woodhouse wrote:
> On Fri, 2025-04-04 at 12:38 -0700, Sean Christopherson wrote:
> > 
> > This series is well tested except for one notable gap: I was not able to
> > fully test the AMD IOMMU changes.  Long story short, getting upstream
> > kernels into our full test environments is practically infeasible.  And
> > exposing a device or VF on systems that are available to developers is a
> > bit of a mess.
> 
> If I can make AMD bare-metal "instances" available to you, would that help?

Probably not, my main limitation is time, not lack of hardware.

I'm confident I can get a functional AMD test setup internally, it'll just take
a bit more time/effort (there are other people working on the testing front; I'm
hoping if I wait a bit, someone will solve the hiccups for me).

I'd been holding this series since ~October of last year, precisely due to lack
of bandwidth to configure a working test environment.  I felt that I got far
enough in testing that the odds of something being _really_ broken are small,
and didn't want to delay posting for potentially multiple more months as I assume
other folks in the community already have readily available test setups.

And no matter what, I want to get more thorough testing on a broader range of
hardware, e.g. from Intel and AMD in particular, before this gets merged, so in
the end I don't think me getting access to different hardware would move the
needle much.

Though I appreciate the offer :-)

> Separately, I'd quite like to see the eventfd→MSI delivery linkage not
> use the IRQ routing table at all, and not need a GSI# assigned. Doing
> it that way is just a scaling and performance issue.
> 
> I recently looked through the irqfd code and came to the conclusion
> that it wouldn't be hard to add a new user API which allows us to
> simply configure the kvm_irq_routing_entry to be delivered when a given
> eventfd fires, without using the table.

Yeah, especially if we gated the functionality on a per-VM capability.  That way
kvm_irq_routing_update() could completely skip processing irqfds.  At that point,
other than the new uAPI, I think it's just irqfd_inject() and the resample code
that needs to be modified.

> I haven't had a chance to look hard, hopefully your rework doesn't make
> that any less feasible...

Quite the opposite, it should make it much, much easier.  Currently, both
vmx_pi_update_irte() and avic_pi_update_irte() pull the GSI's routing entry from
kvm->irq_routing.

After this rework, irqfd->irq_entry is explicitly passed into 
kvm_arch_update_irqfd_routing(), i.e. it removes two of the gnarliest paths that
expect irqfd to go through the standard routing table.


* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-18 12:17     ` Vasant Hegde
@ 2025-04-18 18:48       ` Sean Christopherson
  2025-04-23 10:21         ` Joao Martins
  0 siblings, 1 reply; 128+ messages in thread
From: Sean Christopherson @ 2025-04-18 18:48 UTC (permalink / raw)
  To: Vasant Hegde
  Cc: Joao Martins, kvm, iommu, linux-kernel, Maxim Levitsky,
	David Matlack, Alejandro Jimenez, Suravee Suthikulpanit,
	Joerg Roedel, David Woodhouse, Lu Baolu, Paolo Bonzini

On Fri, Apr 18, 2025, Vasant Hegde wrote:
> On 4/9/2025 5:26 PM, Joao Martins wrote:
> > On 04/04/2025 20:39, Sean Christopherson wrote:
> >> Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
> >> not an IRTE is configured to generate GA log interrupts.  KVM only needs a
> >> notification if the target vCPU is blocking, so the vCPU can be awakened.
> >> If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
> >> set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
> >> task is scheduled back in, i.e. KVM doesn't need a notification.
> >>
> >> Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
> >> from the KVM changes insofar as possible.
> >>
> >> Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
> >> so that they match amd_iommu_activate_guest_mode().
> > 
> > Unfortunately I think this patch and the next one might be riding on the
> > assumption that amd_iommu_update_ga() is always cheap :( -- see below.
> > 
> > I didn't spot anything else flawed in the series though, just this one. I would
> > suggest holding off on this and the next one, while progressing with the rest of
> > the series.
> > 
> >> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> >> index 2e016b98fa1b..27b03e718980 100644
> >> --- a/drivers/iommu/amd/iommu.c
> >> +++ b/drivers/iommu/amd/iommu.c
> >> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
> >> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
> >> +				  bool ga_log_intr)
> >>  {
> >>  	if (cpu >= 0) {
> >>  		entry->lo.fields_vapic.destination =
> >> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
> >>  		entry->hi.fields.destination =
> >>  					APICID_TO_IRTE_DEST_HI(cpu);
> >>  		entry->lo.fields_vapic.is_run = true;
> >> +		entry->lo.fields_vapic.ga_log_intr = false;
> >>  	} else {
> >>  		entry->lo.fields_vapic.is_run = false;
> >> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
> >>  	}
> >>  }
> >>
> > 
> > isRun, Destination and GATag are not cached. Quoting the update from a few years
> > back (page 93 of IOMMU spec dated Feb 2025):
> > 
> > | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
> > | IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
> > | are not cached by the IOMMU. Modifications to these fields do not require an
> > | invalidation of the Interrupt Remapping Table.
> > 
> > This is the reason we were able to get rid of the IOMMU invalidation in
> > amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
> > Besides the lock contention that was observed at the time, we were seeing stalls
> > in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.
> > 
> > Now this change above is incorrect as is, and to make it correct you will need
> > to XOR with the previous content of IRTE::ga_log_intr, and then if it changed
> > re-add an invalidation command via iommu_flush_irt_and_complete().  The
> > latter is what I am worried will reintroduce the above problems :(
> > 
> > The invalidation command (which has a completion barrier to serialize
> > invalidation execution) takes some time in h/w, and will make all your vcpus
> > contend on the irq table lock (as is). Even assuming you somehow move the
> > invalidation outside the lock, you will contend on the iommu lock (for the
> > command queue), or best case assuming no locks (which I am not sure is
> > possible) you will need to wait for the command to complete until you can
> > progress forward with entering/exiting.
> > 
> > Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
> > invalidation command (which would be good news!). Adding Suravee/Vasant here.
> 
> I have checked with HW architects.  It's not cached, so we don't need an
> invalidation after updating the GALogIntr field in the IRTE.

Woot!  Better to be lucky than good :-)
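The update being discussed can be modeled in userspace (this is not the driver's struct irte_ga; fields and names are illustrative): per the thread, IsRun, Destination, GATag, and, as confirmed above, GALogIntr are not cached by the IOMMU, so flipping them needs no IRT invalidation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the IRTE fields discussed above. */
struct irte_model {
	uint32_t destination;
	bool is_run;
	bool ga_log_intr;
};

static void model_update_ga(struct irte_model *e, int cpu, bool ga_log_intr)
{
	if (cpu >= 0) {
		e->destination = (uint32_t)cpu;
		e->is_run = true;
		/* A running vCPU gets doorbells directly; no GA log IRQ. */
		e->ga_log_intr = false;
	} else {
		e->is_run = false;
		/* Request a GA log interrupt only if the caller (KVM) needs a
		 * wakeup, i.e. the target vCPU is blocking. */
		e->ga_log_intr = ga_log_intr;
	}
}
```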


* Re: [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-04-18 18:48       ` Sean Christopherson
@ 2025-04-23 10:21         ` Joao Martins
  0 siblings, 0 replies; 128+ messages in thread
From: Joao Martins @ 2025-04-23 10:21 UTC (permalink / raw)
  To: Sean Christopherson, Vasant Hegde
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, David Matlack,
	Alejandro Jimenez, Suravee Suthikulpanit, Joerg Roedel,
	David Woodhouse, Lu Baolu, Paolo Bonzini

On 18/04/2025 19:48, Sean Christopherson wrote:
> On Fri, Apr 18, 2025, Vasant Hegde wrote:
>> On 4/9/2025 5:26 PM, Joao Martins wrote:
>>> On 04/04/2025 20:39, Sean Christopherson wrote:
>>>> Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
>>>> not an IRTE is configured to generate GA log interrupts.  KVM only needs a
>>>> notification if the target vCPU is blocking, so the vCPU can be awakened.
>>>> If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
>>>> set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
>>>> task is scheduled back in, i.e. KVM doesn't need a notification.
>>>>
>>>> Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
>>>> from the KVM changes insofar as possible.
>>>>
>>>> Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
>>>> so that they match amd_iommu_activate_guest_mode().
>>>
>>> Unfortunately I think this patch and the next one might be riding on the
>>> assumption that amd_iommu_update_ga() is always cheap :( -- see below.
>>>
>>> I didn't spot anything else flawed in the series though, just this one. I would
>>> suggest holding off on this and the next one, while progressing with the rest of
>>> the series.
>>>
>>>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>>>> index 2e016b98fa1b..27b03e718980 100644
>>>> --- a/drivers/iommu/amd/iommu.c
>>>> +++ b/drivers/iommu/amd/iommu.c
>>>> -static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>>> +static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
>>>> +				  bool ga_log_intr)
>>>>  {
>>>>  	if (cpu >= 0) {
>>>>  		entry->lo.fields_vapic.destination =
>>>> @@ -3783,12 +3784,14 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
>>>>  		entry->hi.fields.destination =
>>>>  					APICID_TO_IRTE_DEST_HI(cpu);
>>>>  		entry->lo.fields_vapic.is_run = true;
>>>> +		entry->lo.fields_vapic.ga_log_intr = false;
>>>>  	} else {
>>>>  		entry->lo.fields_vapic.is_run = false;
>>>> +		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
>>>>  	}
>>>>  }
>>>>
>>>
>>> isRun, Destination and GATag are not cached. Quoting the update from a few years
>>> back (page 93 of IOMMU spec dated Feb 2025):
>>>
>>> | When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
>>> | IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
>>> | are not cached by the IOMMU. Modifications to these fields do not require an
>>> | invalidation of the Interrupt Remapping Table.
>>>
>>> This is the reason we were able to get rid of the IOMMU invalidation in
>>> amd_iommu_update_ga() ... which sped up vmexit/vmenter flow with iommu avic.
>>> Besides the lock contention that was observed at the time, we were seeing stalls
>>> in this path with enough vCPUs IIRC; CCing Alejandro to keep me honest.
>>>
>>> Now this change above is incorrect as is, and to make it correct you will need
>>> to XOR with the previous content of IRTE::ga_log_intr, and then if it changed
>>> re-add an invalidation command via iommu_flush_irt_and_complete().  The
>>> latter is what I am worried will reintroduce the above problems :(
>>>
>>> The invalidation command (which has a completion barrier to serialize
>>> invalidation execution) takes some time in h/w, and will make all your vcpus
>>> contend on the irq table lock (as is). Even assuming you somehow move the
>>> invalidation outside the lock, you will contend on the iommu lock (for the
>>> command queue), or best case assuming no locks (which I am not sure is
>>> possible) you will need to wait for the command to complete until you can
>>> progress forward with entering/exiting.
>>>
>>> Unless the GALogIntr bit is somehow also not cached too which wouldn't need the
>>> invalidation command (which would be good news!). Adding Suravee/Vasant here.
>>
>> I have checked with HW architects.  It's not cached, so we don't need an
>> invalidation after updating the GALogIntr field in the IRTE.
> 
> Woot!  Better to be lucky than good :-)

Probably worth using this thread's Message-ID as a Link: tag while the IOMMU
manual isn't yet up to date with this information.  That usually takes a while
to be formalized.


* Re: [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  2025-04-04 19:38 ` [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
@ 2025-04-23 15:21   ` Francesco Lavra
  2025-04-23 15:55     ` Sean Christopherson
  0 siblings, 1 reply; 128+ messages in thread
From: Francesco Lavra @ 2025-04-23 15:21 UTC (permalink / raw)
  To: seanjc
  Cc: baolu.lu, dmatlack, dwmw2, iommu, joao.m.martins, joro, kvm,
	linux-kernel, mlevitsk, pbonzini

On 2025-04-04 at 19:38, Sean Christopherson wrote:
> @@ -876,20 +874,21 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd
> *irqfd, struct kvm *kvm,
>  	 * 3. APIC virtualization is disabled for the vcpu.
>  	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
>  	 */
> -	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
> -	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
> -	    kvm_vcpu_apicv_active(&svm->vcpu)) {
> +	if (new && new && new->type == KVM_IRQ_ROUTING_MSI &&

The `&& new` part is redundant.


* Re: [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  2025-04-23 15:21   ` Francesco Lavra
@ 2025-04-23 15:55     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-23 15:55 UTC (permalink / raw)
  To: Francesco Lavra
  Cc: baolu.lu, dmatlack, dwmw2, iommu, joao.m.martins, joro, kvm,
	linux-kernel, mlevitsk, pbonzini

On Wed, Apr 23, 2025, Francesco Lavra wrote:
> On 2025-04-04 at 19:38, Sean Christopherson wrote:
> > @@ -876,20 +874,21 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd
> > *irqfd, struct kvm *kvm,
> >  	 * 3. APIC virtualization is disabled for the vcpu.
> >  	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
> >  	 */
> > -	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
> > -	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
> > -	    kvm_vcpu_apicv_active(&svm->vcpu)) {
> > +	if (new && new && new->type == KVM_IRQ_ROUTING_MSI &&
> 
> The `&& new` part is redundant.

Ha, good job me.  Better safe than sorry?  :-)


* Re: [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-04-04 19:38 ` [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
  2025-04-08 17:30   ` Paolo Bonzini
@ 2025-04-24  4:39   ` Sairaj Kodilkar
  2025-04-24 14:13     ` Sean Christopherson
  1 sibling, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-04-24  4:39 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu
  Cc: kvm, iommu, linux-kernel, Maxim Levitsky, Joao Martins,
	David Matlack

On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> Hoist the logic for identifying the target vCPU for a posted interrupt
> into common x86.  The code is functionally identical between Intel and
> AMD.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  3 +-
>   arch/x86/kvm/svm/avic.c         | 83 ++++++++-------------------------
>   arch/x86/kvm/svm/svm.h          |  3 +-
>   arch/x86/kvm/vmx/posted_intr.c  | 56 ++++++----------------
>   arch/x86/kvm/vmx/posted_intr.h  |  3 +-
>   arch/x86/kvm/x86.c              | 46 +++++++++++++++---
>   6 files changed, 81 insertions(+), 113 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 85f45fc5156d..cb98d8d3c6c2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1838,7 +1838,8 @@ struct kvm_x86_ops {
>   
>   	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			      unsigned int host_irq, uint32_t guest_irq,
> -			      struct kvm_kernel_irq_routing_entry *new);
> +			      struct kvm_kernel_irq_routing_entry *new,
> +			      struct kvm_vcpu *vcpu, u32 vector);
>   	void (*pi_start_assignment)(struct kvm *kvm);
>   	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
>   	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index ea6eae72b941..666f518340a7 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -812,52 +812,13 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
>   	return 0;
>   }
>   
> -/*
> - * Note:
> - * The HW cannot support posting multicast/broadcast
> - * interrupts to a vCPU. So, we still use legacy interrupt
> - * remapping for these kind of interrupts.
> - *
> - * For lowest-priority interrupts, we only support
> - * those with single CPU as the destination, e.g. user
> - * configures the interrupts via /proc/irq or uses
> - * irqbalance to make the interrupts single-CPU.
> - */
> -static int
> -get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
> -		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
> -{
> -	struct kvm_lapic_irq irq;
> -	*vcpu = NULL;
> -
> -	kvm_set_msi_irq(kvm, e, &irq);
> -
> -	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
> -	    !kvm_irq_is_postable(&irq)) {
> -		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
> -			 __func__, irq.vector);
> -		return -1;
> -	}
> -
> -	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
> -		 irq.vector);
> -	vcpu_info->vector = irq.vector;
> -
> -	return 0;
> -}
> -
>   int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			unsigned int host_irq, uint32_t guest_irq,
> -			struct kvm_kernel_irq_routing_entry *new)
> +			struct kvm_kernel_irq_routing_entry *new,
> +			struct kvm_vcpu *vcpu, u32 vector)
>   {
> -	bool enable_remapped_mode = true;
> -	struct vcpu_data vcpu_info;
> -	struct kvm_vcpu *vcpu = NULL;
>   	int ret = 0;
>   
> -	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
> -		return 0;
> -
>   	/*
>   	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
>   	 * from the *previous* vCPU's list.
> @@ -865,7 +826,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	svm_ir_list_del(irqfd);
>   
>   	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
> -		 __func__, host_irq, guest_irq, !!new);
> +		 __func__, host_irq, guest_irq, !!vcpu);
>   
>   	/**
>   	 * Here, we setup with legacy mode in the following cases:
> @@ -874,23 +835,23 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   	 * 3. APIC virtualization is disabled for the vcpu.
>   	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
>   	 */
> -	if (new && new && new->type == KVM_IRQ_ROUTING_MSI &&
> -	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
> -	    kvm_vcpu_apicv_active(vcpu)) {
> -		struct amd_iommu_pi_data pi;
> -
> -		enable_remapped_mode = false;
> -
> -		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
> -
> +	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
>   		/*
>   		 * Try to enable guest_mode in IRTE.  Note, the address
>   		 * of the vCPU's AVIC backing page is passed to the
>   		 * IOMMU via vcpu_info->pi_desc_addr.
>   		 */
> -		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
> -		pi.is_guest_mode = true;
> -		pi.vcpu_data = &vcpu_info;
> +		struct vcpu_data vcpu_info = {
> +			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
> +			.vector = vector,
> +		};
> +
> +		struct amd_iommu_pi_data pi = {
> +			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
> +			.is_guest_mode = true,
> +			.vcpu_data = &vcpu_info,
> +		};
> +
>   		ret = irq_set_vcpu_affinity(host_irq, &pi);
>   
>   		/**
> @@ -902,12 +863,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		 */
>   		if (!ret)
>   			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
> -	}
>   
> -	if (!ret && vcpu) {
> -		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
> -					 guest_irq, vcpu_info.vector,
> -					 vcpu_info.pi_desc_addr, !!new);
> +		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> +					 vector, vcpu_info.pi_desc_addr, true);
> +	} else {
> +		ret = irq_set_vcpu_affinity(host_irq, NULL);
>   	}
>   
>   	if (ret < 0) {
> @@ -915,10 +875,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		goto out;
>   	}
>   
> -	if (enable_remapped_mode)
> -		ret = irq_set_vcpu_affinity(host_irq, NULL);
> -	else
> -		ret = 0;
> +	ret = 0;
>   out:
>   	return ret;
>   }
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 6ad0aa86f78d..5ce240085ee0 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -741,7 +741,8 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
>   void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
>   int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   			unsigned int host_irq, uint32_t guest_irq,
> -			struct kvm_kernel_irq_routing_entry *new);
> +			struct kvm_kernel_irq_routing_entry *new,
> +			struct kvm_vcpu *vcpu, u32 vector);
>   void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
>   void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
>   void avic_ring_doorbell(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 786912cee3f8..fd5f6a125614 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -266,46 +266,20 @@ void vmx_pi_start_assignment(struct kvm *kvm)
>   
>   int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		       unsigned int host_irq, uint32_t guest_irq,
> -		       struct kvm_kernel_irq_routing_entry *new)
> +		       struct kvm_kernel_irq_routing_entry *new,
> +		       struct kvm_vcpu *vcpu, u32 vector)
>   {
> -	struct kvm_lapic_irq irq;
> -	struct kvm_vcpu *vcpu;
> -	struct vcpu_data vcpu_info;
> -
> -	if (!vmx_can_use_vtd_pi(kvm))
> -		return 0;
> -
> -	/*
> -	 * VT-d PI cannot support posting multicast/broadcast
> -	 * interrupts to a vCPU, we still use interrupt remapping
> -	 * for these kind of interrupts.
> -	 *
> -	 * For lowest-priority interrupts, we only support
> -	 * those with single CPU as the destination, e.g. user
> -	 * configures the interrupts via /proc/irq or uses
> -	 * irqbalance to make the interrupts single-CPU.
> -	 *
> -	 * We will support full lowest-priority interrupt later.
> -	 *
> -	 * In addition, we can only inject generic interrupts using
> -	 * the PI mechanism, refuse to route others through it.
> -	 */
> -	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
> -		goto do_remapping;
> -
> -	kvm_set_msi_irq(kvm, new, &irq);
> -
> -	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> -	    !kvm_irq_is_postable(&irq))
> -		goto do_remapping;
> -
> -	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
> -	vcpu_info.vector = irq.vector;
> -
> -	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> -				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
> -
> -	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
> -do_remapping:
> -	return irq_set_vcpu_affinity(host_irq, NULL);
> +	if (vcpu) {
> +		struct vcpu_data vcpu_info = {
> +			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
> +			.vector = vector,
> +		};
> +
> +		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
> +					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
> +
> +		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
> +	} else {
> +		return irq_set_vcpu_affinity(host_irq, NULL);
> +	}
>   }
> diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
> index a586d6aaf862..ee3e19e976ac 100644
> --- a/arch/x86/kvm/vmx/posted_intr.h
> +++ b/arch/x86/kvm/vmx/posted_intr.h
> @@ -15,7 +15,8 @@ void __init pi_init_cpu(int cpu);
>   bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
>   		       unsigned int host_irq, uint32_t guest_irq,
> -		       struct kvm_kernel_irq_routing_entry *new);
> +		       struct kvm_kernel_irq_routing_entry *new,
> +		       struct kvm_vcpu *vcpu, u32 vector);
>   void vmx_pi_start_assignment(struct kvm *kvm);
>   
>   static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b8b259847d05..0ab818bba743 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13567,6 +13567,43 @@ bool kvm_arch_has_irq_bypass(void)
>   }
>   EXPORT_SYMBOL_GPL(kvm_arch_has_irq_bypass);
>   
> +static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
> +			      struct kvm_kernel_irq_routing_entry *old,

the argument 'old' is redundant in this function.

Regards
Sairaj Kodilkar

> +			      struct kvm_kernel_irq_routing_entry *new)
> +{
> +	struct kvm *kvm = irqfd->kvm;
> +	struct kvm_vcpu *vcpu = NULL;
> +	struct kvm_lapic_irq irq;
> +
> +	if (!irqchip_in_kernel(kvm) ||
> +	    !kvm_arch_has_irq_bypass() ||
> +	    !kvm_arch_has_assigned_device(kvm))
> +		return 0;
> +
> +	if (new && new->type == KVM_IRQ_ROUTING_MSI) {
> +		kvm_set_msi_irq(kvm, new, &irq);
> +
> +		/*
> +		 * Force remapped mode if hardware doesn't support posting the
> +		 * virtual interrupt to a vCPU.  Only IRQs are postable (NMIs,
> +		 * SMIs, etc. are not), and neither AMD nor Intel IOMMUs support
> +		 * posting multicast/broadcast IRQs.  If the interrupt can't be
> +		 * posted, the device MSI needs to be routed to the host so that
> +		 * the guest's desired interrupt can be synthesized by KVM.
> +		 *
> +		 * This means that KVM can only post lowest-priority interrupts
> +		 * if they have a single CPU as the destination, e.g. only if
> +		 * the guest has affined the interrupt to a single vCPU.
> +		 */
> +		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
> +		    !kvm_irq_is_postable(&irq))
> +			vcpu = NULL;
> +	}
> +
> +	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
> +					    irqfd->gsi, new, vcpu, irq.vector);
> +}
> +
>   int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
>   				      struct irq_bypass_producer *prod)
>   {
> @@ -13581,8 +13618,7 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
>   	irqfd->producer = prod;
>   
>   	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> -		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> -						   irqfd->gsi, &irqfd->irq_entry);
> +		ret = kvm_pi_update_irte(irqfd, NULL, &irqfd->irq_entry);
>   		if (ret)
>   			kvm_arch_end_assignment(irqfd->kvm);
>   	}
> @@ -13610,8 +13646,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
>   	spin_lock_irq(&kvm->irqfds.lock);
>   
>   	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
> -		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
> -						   irqfd->gsi, NULL);
> +		ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry, NULL);
>   		if (ret)
>   			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
>   				irqfd->consumer.token, ret);
> @@ -13628,8 +13663,7 @@ int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
>   				  struct kvm_kernel_irq_routing_entry *old,
>   				  struct kvm_kernel_irq_routing_entry *new)
>   {
> -	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
> -					    irqfd->gsi, new);
> +	return kvm_pi_update_irte(irqfd, old, new);
>   }
>   
>   bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,



* Re: [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-04-24  4:39   ` Sairaj Kodilkar
@ 2025-04-24 14:13     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-04-24 14:13 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins, David Matlack

On Thu, Apr 24, 2025, Sairaj Kodilkar wrote:
> On 4/5/2025 1:08 AM, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b8b259847d05..0ab818bba743 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13567,6 +13567,43 @@ bool kvm_arch_has_irq_bypass(void)
> >   }
> >   EXPORT_SYMBOL_GPL(kvm_arch_has_irq_bypass);
> > +static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
> > +			      struct kvm_kernel_irq_routing_entry *old,
> 
> the argument 'old' is redundant in this function.

Ooh, and @new to kvm_x86_ops.pi_update_irte is also unused.  I'll get rid of them
both.  I went through multiple iterations of hacking to figure out how to dedup
the code, and (obviously) missed a few things when tidying up after the fact.

Good eyes, and thanks again for the reviews!

P.S. Please trim your replies.


* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (70 preceding siblings ...)
  2025-04-18 13:01 ` David Woodhouse
@ 2025-05-15 12:08 ` Sairaj Kodilkar
  2025-05-15 22:05   ` Sean Christopherson
  71 siblings, 1 reply; 128+ messages in thread
From: Sairaj Kodilkar @ 2025-05-15 12:08 UTC (permalink / raw)
  To: seanjc
  Cc: baolu.lu, dmatlack, dwmw2, iommu, joao.m.martins, joro, kvm,
	linux-kernel, mlevitsk, pbonzini, vasant.hegde,
	suravee.suthikulpanit, naveen.rao, Sairaj Kodilkar

Hi Sean,

We ran a few tests with the following setup:

* Turin system with 2P, 192 cores each (SMT enabled, Total 768)
* 4 NVMEs of size 1.7 attached to a single IOMMU
* Total RAM 247 GiB
* Qemu version : 9.1.93
* Guest kernel : 6.14-rc7
* FIO random reads with 4K blocksize and libaio

With above setup we measured the Guest nvme interrupts, IOPS, GALOG interrupts
and GALOG entries for 60 seconds with and without your changes.

Here are the results,

                          VCPUS = 32, Jobs per NVME = 8
==============================================================================================
                             w/o Sean's patches           w/ Sean's patches     Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               123,922,860                 124,559,110              0.51%
IOPS (in kilo)                            4,795                       4,796              0.04%
GALOG Interrupts                         40,245                         164            -99.59%
GALOG entries                            42,040                         169            -99.60%
----------------------------------------------------------------------------------------------


                VCPUS = 64, Jobs per NVME = 16
==============================================================================================
                             w/o Sean's patches           w/ Sean's patches     Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               99,483,339                   99,800,056             0.32% 
IOPS (in kilo)                           4,791                        4,798             0.15% 
GALOG Interrupts                        47,599                       11,634           -75.56% 
GALOG entries                           48,899                       11,923           -75.62%
----------------------------------------------------------------------------------------------


                VCPUS = 192, Jobs per NVME = 48
==============================================================================================
                             w/o Sean's patches          w/ Sean's patches      Percent change
----------------------------------------------------------------------------------------------
Guest Nvme interrupts               76,750,310                  78,066,512               1.71%
IOPS (in kilo)                           4,751                       4,749              -0.04%
GALOG Interrupts                        56,621                      54,732              -3.34%
GALOG entries                           59,579                      56,215              -5.65%
----------------------------------------------------------------------------------------------
 

The results show that the patches have a significant impact on the number of GA
log interrupts at lower vCPU counts (32 and 64), while providing similar IOPS
and guest NVMe interrupt rates (i.e. the patches do not regress).

Along with the performance evaluation, we ran sanity tests with AVIC, x2AVIC,
and the kernel selftests.  All tests look good.

For AVIC related patches:
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>

Regards
Sairaj Kodilkar



* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-05-15 12:08 ` Sairaj Kodilkar
@ 2025-05-15 22:05   ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-05-15 22:05 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: baolu.lu, dmatlack, dwmw2, iommu, joao.m.martins, joro, kvm,
	linux-kernel, mlevitsk, pbonzini, vasant.hegde,
	suravee.suthikulpanit, naveen.rao

On Thu, May 15, 2025, Sairaj Kodilkar wrote:
> Hi Sean,
> 
> We ran a few tests with the following setup:

A few!!?!?  This is awesome!  Thank you, I greatly appreciate the testing!

> * Turin system with 2P, 192 cores each (SMT enabled, Total 768)
> * 4 NVMEs of size 1.7 attached to a single IOMMU
> * Total RAM 247 GiB
> * Qemu version : 9.1.93
> * Guest kernel : 6.14-rc7
> * FIO random reads with 4K blocksize and libaio
> 
> With above setup we measured the Guest nvme interrupts, IOPS, GALOG interrupts
> and GALOG entries for 60 seconds with and without your changes.
> 
> Here are the results,
> 
>                           VCPUS = 32, Jobs per NVME = 8
> ==============================================================================================
>                              w/o Sean's patches           w/ Sean's patches     Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               123,922,860                 124,559,110              0.51%
> IOPS (in kilo)                            4,795                       4,796              0.04%
> GALOG Interrupts                         40,245                         164            -99.59%
> GALOG entries                            42,040                         169            -99.60%
> ----------------------------------------------------------------------------------------------
> 
> 
>                 VCPUS = 64, Jobs per NVME = 16
> ==============================================================================================
>                              w/o Sean's patches           w/ Sean's patches     Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               99,483,339                   99,800,056             0.32% 
> IOPS (in kilo)                           4,791                        4,798             0.15% 
> GALOG Interrupts                        47,599                       11,634           -75.56% 
> GALOG entries                           48,899                       11,923           -75.62%
> ----------------------------------------------------------------------------------------------
> 
> 
>                 VCPUS = 192, Jobs per NVME = 48
> ==============================================================================================
>                              w/o Sean's patches          w/ Sean's patches      Percent change
> ----------------------------------------------------------------------------------------------
> Guest Nvme interrupts               76,750,310                  78,066,512               1.71%
> IOPS (in kilo)                           4,751                       4,749              -0.04%
> GALOG Interrupts                        56,621                      54,732              -3.34%
> GALOG entries                           59,579                      56,215              -5.65%
> ----------------------------------------------------------------------------------------------
>  
> 
> The results show that the patches have a significant impact on the number of
> GA-log (posted) interrupts at lower vCPU counts (32 and 64), while IOPS and the
> guest NVMe interrupt rate stay essentially flat (i.e. the patches do not regress).
> 
> Along with the performance evaluation, we ran sanity tests with AVIC,
> x2AVIC, and the kernel selftests.  All tests look good.
> 
> For AVIC related patches:
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> 
> Regards
> Sairaj Kodilkar
> 

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 30/67] KVM: VMX: Stop walking list of routing table entries when updating IRTE
  2025-04-08 17:00   ` Paolo Bonzini
@ 2025-05-20 20:36     ` Sean Christopherson
  0 siblings, 0 replies; 128+ messages in thread
From: Sean Christopherson @ 2025-05-20 20:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Joerg Roedel, David Woodhouse, Lu Baolu, kvm, iommu, linux-kernel,
	Maxim Levitsky, Joao Martins, David Matlack

On Tue, Apr 08, 2025, Paolo Bonzini wrote:
> On 4/4/25 21:38, Sean Christopherson wrote:
> > Now that KVM provides the to-be-updated routing entry, stop walking the
> > routing table to find that entry.  KVM, via setup_routing_entry() and
> > sanity checked by kvm_get_msi_route(), disallows having a GSI configured
> > to trigger multiple MSIs, i.e. the for-loop can only process one entry.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/kvm/vmx/posted_intr.c | 100 +++++++++++----------------------
> >   1 file changed, 33 insertions(+), 67 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> > index 00818ca30ee0..786912cee3f8 100644
> > --- a/arch/x86/kvm/vmx/posted_intr.c
> > +++ b/arch/x86/kvm/vmx/posted_intr.c
> > @@ -268,78 +268,44 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
> >   		       unsigned int host_irq, uint32_t guest_irq,
> >   		       struct kvm_kernel_irq_routing_entry *new)
> >   {
> > -	struct kvm_kernel_irq_routing_entry *e;
> > -	struct kvm_irq_routing_table *irq_rt;
> > -	bool enable_remapped_mode = true;
> >   	struct kvm_lapic_irq irq;
> >   	struct kvm_vcpu *vcpu;
> >   	struct vcpu_data vcpu_info;
> > -	bool set = !!new;
> > -	int idx, ret = 0;
> >   	if (!vmx_can_use_vtd_pi(kvm))
> >   		return 0;
> > -	idx = srcu_read_lock(&kvm->irq_srcu);
> > -	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
> > -	if (guest_irq >= irq_rt->nr_rt_entries ||
> > -	    hlist_empty(&irq_rt->map[guest_irq])) {
> > -		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
> > -			     guest_irq, irq_rt->nr_rt_entries);
> > -		goto out;
> > -	}
> > -
> > -	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
> > -		if (e->type != KVM_IRQ_ROUTING_MSI)
> > -			continue;
> > -
> > -		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
> 
> Alternatively, if you want to keep patches 28/29 separate, you could add
> this WARN_ON_ONCE to avic.c in the exact same place after checking e->type
> -- not so much for asserting purposes, but more to document what's going on
> for the reviewer.

FWIW, AVIC already has the same WARN; they were both added by "KVM: x86: Pass new
routing entries and irqfd when updating IRTEs".

That said, I agree that squashing 28/29 is the way to go, especially since I didn't
isolate the changes for VMX (I've no idea why I did for SVM but not VMX).


* Re: [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support
  2025-04-08 17:13 ` David Matlack
@ 2025-05-23 23:52   ` David Matlack
  0 siblings, 0 replies; 128+ messages in thread
From: David Matlack @ 2025-05-23 23:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, David Woodhouse, Lu Baolu, kvm,
	iommu, linux-kernel, Maxim Levitsky, Joao Martins

On Tue, Apr 8, 2025 at 10:13 AM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Apr 4, 2025 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > This series is well tested except for one notable gap: I was not able to
> > fully test the AMD IOMMU changes.  Long story short, getting upstream
> > kernels into our full test environments is practically infeasible.  And
> > exposing a device or VF on systems that are available to developers is a
> > bit of a mess.
> >
> > The device the selftest (see the last patch) uses is an internal test VF
> > that's hosted on a smart NIC using non-production (test-only) firmware.
> > Unfortunately, only some of our developer systems have the right NIC, and
> > for unknown reasons I couldn't get the test firmware to install cleanly on
> > Rome systems.  I was able to get it functional on Milan (and Intel CPUs),
> > but APIC virtualization is disabled on Milan.  Thanks to KVM's force_avic
> > I could test the KVM flows, but the IOMMU was having none of my attempts
> > to force enable APIC virtualization against its will.
>
> (Sean already knows this but just sharing for the broader visibility.)
>
> I am working on a VFIO selftests framework and helper library that we
> can link into the KVM selftests to make this kind of testing much
> easier. It will support a driver framework so we can support testing
> against different devices in a common way. Developers/companies can
> carry their own out-of-tree drivers for non-standard/custom test
> devices, e.g. the "Mercury device" used in this series.
>
> I will send an RFC in the coming weeks. If/when my proposal is merged,
> then I think we'll have a clean way to get the vfio_irq_test merged
> upstream as well.

This RFC can be found here:
https://lore.kernel.org/kvm/20250523233018.1702151-1-dmatlack@google.com/


end of thread, other threads:[~2025-05-23 23:52 UTC | newest]

Thread overview: 128+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-04 19:38 [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
2025-04-04 19:38 ` [PATCH 01/67] KVM: SVM: Allocate IR data using atomic allocation Sean Christopherson
2025-04-04 19:38 ` [PATCH 02/67] KVM: x86: Reset IRTE to host control if *new* route isn't postable Sean Christopherson
2025-04-11  8:08   ` Sairaj Kodilkar
2025-04-11 14:16     ` Sean Christopherson
2025-04-15 11:36       ` Paolo Bonzini
2025-04-04 19:38 ` [PATCH 03/67] KVM: x86: Explicitly treat routing entry type changes as changes Sean Christopherson
2025-04-04 19:38 ` [PATCH 04/67] KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer Sean Christopherson
2025-04-04 19:38 ` [PATCH 05/67] iommu/amd: Return an error if vCPU affinity is set for non-vCPU IRTE Sean Christopherson
2025-04-11  8:34   ` Sairaj Kodilkar
2025-04-11 14:05     ` Sean Christopherson
2025-04-11 17:02       ` Sairaj Kodilkar
2025-04-11 19:30         ` Sean Christopherson
2025-04-18 12:25   ` Vasant Hegde
2025-04-04 19:38 ` [PATCH 06/67] iommu/amd: WARN if KVM attempts to set vCPU affinity without posted interrupts Sean Christopherson
2025-04-11  8:28   ` Sairaj Kodilkar
2025-04-11 14:10     ` Sean Christopherson
2025-04-11 17:03       ` Sairaj Kodilkar
2025-04-15 11:42       ` Paolo Bonzini
2025-04-15 17:48       ` Vasant Hegde
2025-04-15 22:04         ` Sean Christopherson
2025-04-16  9:47           ` Sairaj Kodilkar
2025-04-17 17:37             ` Paolo Bonzini
2025-04-04 19:38 ` [PATCH 07/67] KVM: SVM: WARN if an invalid posted interrupt IRTE entry is added Sean Christopherson
2025-04-04 19:38 ` [PATCH 08/67] KVM: x86: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
2025-04-11 10:57   ` Arun Kodilkar, Sairaj
2025-04-11 14:01     ` Sean Christopherson
2025-04-11 17:22       ` Sairaj Kodilkar
2025-04-04 19:38 ` [PATCH 09/67] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
2025-04-11  7:47   ` Arun Kodilkar, Sairaj
2025-04-11 14:32     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 10/67] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
2025-04-04 19:38 ` [PATCH 11/67] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
2025-04-15 11:06   ` Sairaj Kodilkar
2025-04-15 14:55     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 12/67] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
2025-04-04 19:38 ` [PATCH 13/67] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
2025-04-04 19:38 ` [PATCH 14/67] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
2025-04-15 11:11   ` Sairaj Kodilkar
2025-04-15 14:57     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 15/67] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
2025-04-04 19:38 ` [PATCH 16/67] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
2025-04-04 19:38 ` [PATCH 17/67] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
2025-04-15 11:16   ` Sairaj Kodilkar
2025-04-04 19:38 ` [PATCH 18/67] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
2025-04-04 19:38 ` [PATCH 19/67] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
2025-04-04 19:38 ` [PATCH 20/67] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
2025-04-04 19:38 ` [PATCH 21/67] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
2025-04-04 19:38 ` [PATCH 22/67] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
2025-04-04 19:38 ` [PATCH 23/67] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
2025-04-04 19:38 ` [PATCH 24/67] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
2025-04-04 19:38 ` [PATCH 25/67] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
2025-04-18 12:24   ` Vasant Hegde
2025-04-04 19:38 ` [PATCH 26/67] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
2025-04-08 16:57   ` Paolo Bonzini
2025-04-08 22:25     ` Sean Christopherson
2025-04-18 12:25   ` Vasant Hegde
2025-04-04 19:38 ` [PATCH 27/67] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
2025-04-04 19:38 ` [PATCH 28/67] KVM: SVM: Get vCPU info for IRTE using new routing entry Sean Christopherson
2025-04-04 19:38 ` [PATCH 29/67] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
2025-04-08 16:56   ` Paolo Bonzini
2025-04-04 19:38 ` [PATCH 30/67] KVM: VMX: " Sean Christopherson
2025-04-08 17:00   ` Paolo Bonzini
2025-05-20 20:36     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 31/67] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
2025-04-23 15:21   ` Francesco Lavra
2025-04-23 15:55     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 32/67] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
2025-04-04 19:38 ` [PATCH 33/67] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
2025-04-08 17:30   ` Paolo Bonzini
2025-04-08 20:51     ` Sean Christopherson
2025-04-24  4:39   ` Sairaj Kodilkar
2025-04-24 14:13     ` Sean Christopherson
2025-04-04 19:38 ` [PATCH 34/67] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
2025-04-04 19:38 ` [PATCH 35/67] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
2025-04-04 19:38 ` [PATCH 36/67] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
2025-04-04 19:38 ` [PATCH 37/67] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
2025-04-04 19:38 ` [PATCH 38/67] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
2025-04-04 19:38 ` [PATCH 39/67] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
2025-04-04 19:38 ` [PATCH 40/67] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
2025-04-04 19:38 ` [PATCH 41/67] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
2025-04-04 19:38 ` [PATCH 42/67] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
2025-04-04 19:38 ` [PATCH 43/67] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
2025-04-04 19:38 ` [PATCH 44/67] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
2025-04-08 12:26   ` Joerg Roedel
2025-04-04 19:39 ` [PATCH 45/67] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
2025-04-04 19:39 ` [PATCH 46/67] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
2025-04-04 19:39 ` [PATCH 47/67] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
2025-04-04 19:39 ` [PATCH 48/67] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
2025-04-04 19:39 ` [PATCH 49/67] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
2025-04-04 19:39 ` [PATCH 50/67] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
2025-04-04 19:39 ` [PATCH 51/67] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
2025-04-04 19:39 ` [PATCH 52/67] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
2025-04-04 19:39 ` [PATCH 53/67] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
2025-04-04 19:39 ` [PATCH 54/67] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
2025-04-04 19:39 ` [PATCH 55/67] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
2025-04-04 19:39 ` [PATCH 56/67] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
2025-04-04 19:39 ` [PATCH 57/67] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
2025-04-04 19:39 ` [PATCH 58/67] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
2025-04-04 19:39 ` [PATCH 59/67] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
2025-04-04 19:39 ` [PATCH 60/67] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
2025-04-04 19:39 ` [PATCH 61/67] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
2025-04-04 19:39 ` [PATCH 62/67] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
2025-04-08 17:51   ` Paolo Bonzini
2025-04-04 19:39 ` [PATCH 63/67] KVM: SVM: Consolidate IRTE update " Sean Christopherson
2025-04-04 19:39 ` [PATCH 64/67] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
2025-04-09 11:56   ` Joao Martins
2025-04-10 15:45     ` Sean Christopherson
2025-04-10 17:13       ` Joao Martins
2025-04-10 17:29         ` Sean Christopherson
2025-04-18 12:17     ` Vasant Hegde
2025-04-18 18:48       ` Sean Christopherson
2025-04-23 10:21         ` Joao Martins
2025-04-04 19:39 ` [PATCH 65/67] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Sean Christopherson
2025-04-08 17:53   ` Paolo Bonzini
2025-04-08 21:31     ` Sean Christopherson
2025-04-09 10:34       ` Paolo Bonzini
2025-04-04 19:39 ` [PATCH 66/67] *** DO NOT MERGE *** iommu/amd: Hack to fake IRQ posting support Sean Christopherson
2025-04-04 19:39 ` [PATCH 67/67] *** DO NOT MERGE *** KVM: selftests: WIP posted interrupts test Sean Christopherson
2025-04-08 12:44 ` [PATCH 00/67] KVM: iommu: Overhaul device posted IRQs support Joerg Roedel
2025-04-09  8:30   ` Vasant Hegde
2025-04-08 15:36 ` Paolo Bonzini
2025-04-08 17:13 ` David Matlack
2025-05-23 23:52   ` David Matlack
2025-04-18 13:01 ` David Woodhouse
2025-04-18 16:22   ` Sean Christopherson
2025-05-15 12:08 ` Sairaj Kodilkar
2025-05-15 22:05   ` Sean Christopherson
