[PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support
@ 2025-06-11 22:45 Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
                   ` (62 more replies)
  0 siblings, 63 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Marc/Oliver,

Patch 1 is an arm64 fix that I'm guessing you'll want to grab for 6.16.
Assuming that's the case, I'll make sure this series lands on top of
kvm/master (or maybe an -rc?) at the appropriate point in time.  Though if you
can grab the patch sooner than later, that'd be super helpful :-)

Oh, and the other patches are of interest to arm64 are:

  [PATCH v3 32/62] KVM: Don't WARN if updating IRQ bypass route fails
  [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()

In theory, I _think_ those could be moved earlier so that there aren't
multi-arch patches buried in a massive x86-centric series, but I really don't
want to try and re-disentangle x86's posted interrupt mess at this point.


TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
       general.

This applies on the series to add CONFIG_KVM_IOAPIC (and to kill irq_comm.c):

  https://lore.kernel.org/all/20250611213557.294358-1-seanjc@google.com

Fix a variety of bugs related to device posted IRQs, especially on the
AMD side, and clean up KVM's implementation (this series actually removes
more code than it adds).

Batch #1 is new in this version, and consists of two aforementioned arm64
changes.

Batch #2 is mostly SVM specific:

 - Cleans up various warts and bugs in the IRTE tracking
 - Fixes AVIC to not reject large VMs (honor KVM's ABI)
 - Wire up AVIC to enable_ipiv to support disabling IPI virtualization while
   still utilizing device posted interrupts, and to workaround erratum #1235.

Batch #3 overhauls the guts of IRQ bypass in KVM, and moves the vast majority
of the logic to common x86; only the code that needs to communicate with the
IOMMU is truly vendor specific.

Batch #4 is more SVM/AVIC cleanups that are made possible by batch #3.

Batch #5 adds WARNs and drops dead code after all the previous cleanups and
fixes (I don't want to add the WARNs earlier; I don't see any point in adding
WARNs in code that's known to be broken).

Batch #6 is yet more SVM/AVIC cleanups, with the specific goal of configuring
IRTEs to generate GA log interrupts if and only if KVM actually needs a wake
event.

v3:
 - Rebase on kvm/next to pick up relevant arm64 irqfd routing changes, and
   account for arm64 as appropriate.
 - Fix a suspiciously similar bug in arm64's version of
   kvm_arch_irqfd_route_changed().
 - Add a patch to rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq().

v2:
 - https://lore.kernel.org/all/20250523010004.3240643-1-seanjc@google.com
 - Drop patches that were already merged.
 - Move code into irq.c, not x86.c. [Paolo]
 - Collect review/testing tags. [Sairaj, Vasant]
 - Sqaush fixup for a comment that was added in the prior patch. [Sairaj]
 - Rewrote the changelog for "Delete IRTE link from previous vCPU irrespective
   of new routing". [Sairaj]
 - Actually drop "struct amd_svm_iommu_ir" and all usage in "Track per-vCPU
   IRTEs using kvm_kernel_irqfd structure" (the previous version was getting
   hilarious lucky with struct offsets). [Sairaj]
 - Drop unused params from kvm_pi_update_irte() and pi_update_irte(). [Sairaj]
 - Document the rules and behavior of amd_iommu_update_ga(). [Joerg]
 - Fix a changelog typo. [Paolo]
 - Document that GALogIntr isn't cached, i.e. can be safely updated without
   an invalidation. [Joao, Vasant]
 - Rework avic_vcpu_{load,put}() to use an enumerated parameter instead of a
   series of booleans. [Paolo]
 - Drop a redundant "&& new". [Francesco]
 - Drop the *** DO NOT MERGE *** testing hack patches.

v1: https://lore.kernel.org/all/20250404193923.1413163-1-seanjc@google.com

Maxim Levitsky (2):
  KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235

Sean Christopherson (60):
  KVM: arm64: Explicitly treat routing entry type changes as changes
  KVM: arm64: WARN if unmapping vLPI fails
  KVM: Pass new routing entries and irqfd when updating IRTEs
  KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
  iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  KVM: SVM: Delete IRTE link from previous vCPU irrespective of new
    routing
  KVM: SVM: Drop pointless masking of default APIC base when setting
    V_APIC_BAR
  KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA
    masks
  KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU
    creation
  KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  KVM: SVM: Track AVIC tables as natively sized pointers, not "struct
    pages"
  KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  KVM: VMX: Move enable_ipiv knob to common x86
  KVM: VMX: Suppress PI notifications whenever the vCPU is put
  KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores
    IRQ blocking
  iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
  KVM: SVM: Stop walking list of routing table entries when updating
    IRTE
  KVM: VMX: Stop walking list of routing table entries when updating
    IRTE
  KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c
  KVM: x86: Nullify irqfd->producer after updating IRTEs
  KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  KVM: x86: Move posted interrupt tracepoint to common code
  KVM: SVM: Clean up return handling in avic_pi_update_irte()
  iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel
    structs
  KVM: Don't WARN if updating IRQ bypass route fails
  KVM: Fold kvm_arch_irqfd_route_changed() into
    kvm_arch_update_irqfd_routing()
  KVM: x86: Track irq_bypass_vcpu in common x86 code
  KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being
    targeted
  KVM: x86: Don't update IRTE entries when old and new routes were !MSI
  KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR
    metadata
  KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify
  iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
  iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
  iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC
    is inhibited
  KVM: SVM: Don't check for assigned device(s) when updating affinity
  KVM: SVM: Don't check for assigned device(s) when activating AVIC
  KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
  KVM: SVM: Process all IRTEs on affinity change even if one update
    fails
  KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
  KVM: x86: Drop superfluous "has assigned device" check in
    kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
  KVM: x86: WARN if IRQ bypass routing is updated without in-kernel
    local APIC
  KVM: SVM: WARN if ir_list is non-empty at vCPU free
  KVM: x86: Decouple device assignment from IRQ bypass
  KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ
    bypass
  KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
  iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC
    support
  KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
  KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
  iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  KVM: SVM: Generate GA log IRQs only if the associated vCPUs is
    blocking
  KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()

 arch/arm64/kvm/arm.c                 |  19 +-
 arch/arm64/kvm/vgic/vgic-v4.c        |  10 +-
 arch/x86/include/asm/irq_remapping.h |  17 +-
 arch/x86/include/asm/kvm-x86-ops.h   |   2 +-
 arch/x86/include/asm/kvm_host.h      |  23 +-
 arch/x86/include/asm/svm.h           |  13 +-
 arch/x86/kvm/irq.c                   | 152 +++++-
 arch/x86/kvm/svm/avic.c              | 702 ++++++++++++---------------
 arch/x86/kvm/svm/svm.c               |   4 +
 arch/x86/kvm/svm/svm.h               |  32 +-
 arch/x86/kvm/trace.h                 |  19 +-
 arch/x86/kvm/vmx/capabilities.h      |   1 -
 arch/x86/kvm/vmx/main.c              |   2 +-
 arch/x86/kvm/vmx/posted_intr.c       | 140 ++----
 arch/x86/kvm/vmx/posted_intr.h       |  10 +-
 arch/x86/kvm/vmx/vmx.c               |   2 -
 arch/x86/kvm/x86.c                   |  90 +---
 drivers/iommu/amd/amd_iommu_types.h  |   1 -
 drivers/iommu/amd/iommu.c            | 125 +++--
 drivers/iommu/intel/irq_remapping.c  |  10 +-
 include/kvm/arm_vgic.h               |   2 +-
 include/linux/amd-iommu.h            |  25 +-
 include/linux/kvm_host.h             |   9 +-
 include/linux/kvm_irqfd.h            |   4 +
 virt/kvm/eventfd.c                   |  22 +-
 25 files changed, 691 insertions(+), 745 deletions(-)


base-commit: 06880162469d702d052e5d51b49a24e43f182af8
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply	[flat|nested] 100+ messages in thread

* [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 19:43   ` Oliver Upton
  2025-06-19 12:36   ` (subset) " Marc Zyngier
  2025-06-11 22:45 ` [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails Sean Christopherson
                   ` (61 subsequent siblings)
  62 siblings, 2 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-

Note, the same bug was fixed in x86 by commit bcda70c56f3e ("KVM: x86:
Explicitly treat routing entry type changes as changes").

Fixes: 4bf3693d36af ("KVM: arm64: Unmap vLPIs affected by changes to GSI routing information")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index de2b4e9c9f9f..38a91bb5d4c7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2764,7 +2764,8 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
 				  struct kvm_kernel_irq_routing_entry *new)
 {
-	if (new->type != KVM_IRQ_ROUTING_MSI)
+	if (old->type != KVM_IRQ_ROUTING_MSI ||
+	    new->type != KVM_IRQ_ROUTING_MSI)
 		return true;
 
 	return memcmp(&old->msi, &new->msi, sizeof(new->msi));
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-12 11:59   ` Marc Zyngier
  2025-06-11 22:45 ` [PATCH v3 03/62] KVM: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
                   ` (60 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if unmapping a vLPI in kvm_vgic_v4_unset_forwarding() fails, as
failure means an IRTE has likely been left in a bad state, i.e. IRQs
won't go where they should.  WARNing in arch code will eventually allow
dropping all the plumbing a similar WARN in kvm_irq_routing_update(),
and thus avoid having to plumb back a return code just to WARN.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/vgic/vgic-v4.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index 193946108192..86e54cefc237 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -546,6 +546,7 @@ int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
 		atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
 		irq->hw = false;
 		ret = its_unmap_vlpi(host_irq);
+		WARN_ON_ONCE(ret);
 	}
 
 	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 03/62] KVM: Pass new routing entries and irqfd when updating IRTEs
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 04/62] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
                   ` (59 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c            |  7 ++++---
 arch/x86/include/asm/kvm_host.h |  6 ++++--
 arch/x86/kvm/svm/avic.c         | 18 +++++++----------
 arch/x86/kvm/svm/svm.h          |  5 +++--
 arch/x86/kvm/vmx/posted_intr.c  | 19 ++++++++---------
 arch/x86/kvm/vmx/posted_intr.h  |  8 ++++++--
 arch/x86/kvm/x86.c              | 36 ++++++++++++++++++---------------
 include/linux/kvm_host.h        |  7 +++++--
 virt/kvm/eventfd.c              | 11 +++++-----
 9 files changed, 62 insertions(+), 55 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 38a91bb5d4c7..a9a39e0375f7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2771,8 +2771,9 @@ bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
 	return memcmp(&old->msi, &new->msi, sizeof(new->msi));
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
-				  uint32_t guest_irq, bool set)
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new)
 {
 	/*
 	 * Remapping the vLPI requires taking the its_lock mutex to resolve
@@ -2781,7 +2782,7 @@ int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
 	 *
 	 * Unmap the vLPI and fall back to software LPI injection.
 	 */
-	return kvm_vgic_v4_unset_forwarding(kvm, host_irq);
+	return kvm_vgic_v4_unset_forwarding(irqfd->kvm, irqfd->producer->irq);
 }
 
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *cons)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 21ccb122ab76..2a6ef1398da7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -296,6 +296,7 @@ enum x86_intercept_stage;
  */
 #define KVM_APIC_PV_EOI_PENDING	1
 
+struct kvm_kernel_irqfd;
 struct kvm_kernel_irq_routing_entry;
 
 /*
@@ -1844,8 +1845,9 @@ struct kvm_x86_ops {
 	void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
 	void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
 
-	int (*pi_update_irte)(struct kvm *kvm, unsigned int host_irq,
-			      uint32_t guest_irq, bool set);
+	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			      unsigned int host_irq, uint32_t guest_irq,
+			      struct kvm_kernel_irq_routing_entry *new);
 	void (*pi_start_assignment)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 7338879d1c0c..adacf00d6664 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -18,6 +18,7 @@
 #include <linux/hashtable.h>
 #include <linux/amd-iommu.h>
 #include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
 
 #include <asm/irq_remapping.h>
 
@@ -885,21 +886,14 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 	return 0;
 }
 
-/*
- * avic_pi_update_irte - set IRTE for Posted-Interrupts
- *
- * @kvm: kvm
- * @host_irq: host irq of the interrupt
- * @guest_irq: gsi of the interrupt
- * @set: set or unset PI
- * returns 0 on success, < 0 on failure
- */
-int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-			uint32_t guest_irq, bool set)
+int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			unsigned int host_irq, uint32_t guest_irq,
+			struct kvm_kernel_irq_routing_entry *new)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
 	bool enable_remapped_mode = true;
+	bool set = !!new;
 	int idx, ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -925,6 +919,8 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 		if (e->type != KVM_IRQ_ROUTING_MSI)
 			continue;
 
+		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e6f3c6a153a0..b35fce30d923 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -736,8 +736,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void avic_vcpu_put(struct kvm_vcpu *vcpu);
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
-int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-			uint32_t guest_irq, bool set);
+int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+			unsigned int host_irq, uint32_t guest_irq,
+			struct kvm_kernel_irq_routing_entry *new);
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5c615e5845bf..110fb19848ab 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -2,6 +2,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
 
 #include <asm/irq_remapping.h>
 #include <asm/cpu.h>
@@ -294,17 +295,9 @@ void vmx_pi_start_assignment(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
 }
 
-/*
- * vmx_pi_update_irte - set IRTE for Posted-Interrupts
- *
- * @kvm: kvm
- * @host_irq: host irq of the interrupt
- * @guest_irq: gsi of the interrupt
- * @set: set or unset PI
- * returns 0 on success, < 0 on failure
- */
-int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-		       uint32_t guest_irq, bool set)
+int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+		       unsigned int host_irq, uint32_t guest_irq,
+		       struct kvm_kernel_irq_routing_entry *new)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_routing_table *irq_rt;
@@ -312,6 +305,7 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 	struct kvm_lapic_irq irq;
 	struct kvm_vcpu *vcpu;
 	struct vcpu_data vcpu_info;
+	bool set = !!new;
 	int idx, ret = 0;
 
 	if (!vmx_can_use_vtd_pi(kvm))
@@ -329,6 +323,9 @@ int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
 		if (e->type != KVM_IRQ_ROUTING_MSI)
 			continue;
+
+		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+
 		/*
 		 * VT-d PI cannot support posting multicast/broadcast
 		 * interrupts to a vCPU, we still use interrupt remapping
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 80499ea0e674..a94afcb55f7f 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -3,6 +3,9 @@
 #define __KVM_X86_VMX_POSTED_INTR_H
 
 #include <linux/bitmap.h>
+#include <linux/find.h>
+#include <linux/kvm_host.h>
+
 #include <asm/posted_intr.h>
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
@@ -11,8 +14,9 @@ void pi_wakeup_handler(void);
 void __init pi_init_cpu(int cpu);
 void pi_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
-int vmx_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
-		       uint32_t guest_irq, bool set);
+int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+		       unsigned int host_irq, uint32_t guest_irq,
+		       struct kvm_kernel_irq_routing_entry *new);
 void vmx_pi_start_assignment(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28a20f0aa3dd..93711a5ef272 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13507,31 +13507,31 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
 	struct kvm *kvm = irqfd->kvm;
-	int ret;
+	int ret = 0;
 
 	kvm_arch_start_assignment(irqfd->kvm);
 
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = prod;
 
-	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
-					   prod->irq, irqfd->gsi, 1);
-	if (ret)
-		kvm_arch_end_assignment(irqfd->kvm);
-
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, &irqfd->irq_entry);
+		if (ret)
+			kvm_arch_end_assignment(irqfd->kvm);
+	}
 	spin_unlock_irq(&kvm->irqfds.lock);
 
-
 	return ret;
 }
 
 void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 				      struct irq_bypass_producer *prod)
 {
-	int ret;
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
 	struct kvm *kvm = irqfd->kvm;
+	int ret;
 
 	WARN_ON(irqfd->producer != prod);
 
@@ -13544,11 +13544,13 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = NULL;
 
-	ret = kvm_x86_call(pi_update_irte)(irqfd->kvm,
-					   prod->irq, irqfd->gsi, 0);
-	if (ret)
-		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
-		       " fails: %d\n", irqfd->consumer.token, ret);
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, NULL);
+		if (ret)
+			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
+				irqfd->consumer.token, ret);
+	}
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 
@@ -13556,10 +13558,12 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
-				   uint32_t guest_irq, bool set)
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_x86_call(pi_update_irte)(kvm, host_irq, guest_irq, set);
+	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
+					    irqfd->gsi, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b5575d0b574..a4160c1c0c6b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2401,6 +2401,8 @@ struct kvm_vcpu *kvm_get_running_vcpu(void);
 struct kvm_vcpu * __percpu *kvm_get_running_vcpus(void);
 
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
+struct kvm_kernel_irqfd;
+
 bool kvm_arch_has_irq_bypass(void);
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
@@ -2408,8 +2410,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
 void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
-int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
-				  uint32_t guest_irq, bool set);
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new);
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
 				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 11e5d1e3f12e..85581550dc8d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -285,9 +285,9 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
 {
 }
 
-int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
-				struct kvm *kvm, unsigned int host_irq,
-				uint32_t guest_irq, bool set)
+int __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+					 struct kvm_kernel_irq_routing_entry *old,
+					 struct kvm_kernel_irq_routing_entry *new)
 {
 	return 0;
 }
@@ -619,9 +619,8 @@ void kvm_irq_routing_update(struct kvm *kvm)
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
 		if (irqfd->producer &&
 		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
-			int ret = kvm_arch_update_irqfd_routing(
-					irqfd->kvm, irqfd->producer->irq,
-					irqfd->gsi, 1);
+			int ret = kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
+
 			WARN_ON(ret);
 		}
 #endif
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 04/62] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 03/62] KVM: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 05/62] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
                   ` (58 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure.  In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.

Cc: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 71 +++++++++++++++------------------------
 arch/x86/kvm/svm/svm.h    | 10 +++---
 include/linux/kvm_irqfd.h |  3 ++
 3 files changed, 36 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index adacf00d6664..d33c01379421 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -75,14 +75,6 @@ static bool next_vm_id_wrapped = 0;
 static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
 bool x2avic_enabled;
 
-/*
- * This is a wrapper of struct amd_iommu_ir_data.
- */
-struct amd_svm_iommu_ir {
-	struct list_head node;	/* Used by SVM for per-vcpu ir_list */
-	void *data;		/* Storing pointer to struct amd_ir_data */
-};
-
 static void avic_activate_vmcb(struct vcpu_svm *svm)
 {
 	struct vmcb *vmcb = svm->vmcb01.ptr;
@@ -746,8 +738,8 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 {
 	int ret = 0;
 	unsigned long flags;
-	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct kvm_kernel_irqfd *irqfd;
 
 	if (!kvm_arch_has_assigned_device(vcpu->kvm))
 		return 0;
@@ -761,11 +753,11 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	if (list_empty(&svm->ir_list))
 		goto out;
 
-	list_for_each_entry(ir, &svm->ir_list, node) {
+	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
 		if (activate)
-			ret = amd_iommu_activate_guest_mode(ir->data);
+			ret = amd_iommu_activate_guest_mode(irqfd->irq_bypass_data);
 		else
-			ret = amd_iommu_deactivate_guest_mode(ir->data);
+			ret = amd_iommu_deactivate_guest_mode(irqfd->irq_bypass_data);
 		if (ret)
 			break;
 	}
@@ -774,27 +766,30 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	return ret;
 }
 
-static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
+static void svm_ir_list_del(struct vcpu_svm *svm,
+			    struct kvm_kernel_irqfd *irqfd,
+			    struct amd_iommu_pi_data *pi)
 {
 	unsigned long flags;
-	struct amd_svm_iommu_ir *cur;
+	struct kvm_kernel_irqfd *cur;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
-	list_for_each_entry(cur, &svm->ir_list, node) {
-		if (cur->data != pi->ir_data)
+	list_for_each_entry(cur, &svm->ir_list, vcpu_list) {
+		if (cur->irq_bypass_data != pi->ir_data)
 			continue;
-		list_del(&cur->node);
-		kfree(cur);
+		if (WARN_ON_ONCE(cur != irqfd))
+			continue;
+		list_del(&irqfd->vcpu_list);
 		break;
 	}
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
-static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
+static int svm_ir_list_add(struct vcpu_svm *svm,
+			   struct kvm_kernel_irqfd *irqfd,
+			   struct amd_iommu_pi_data *pi)
 {
-	int ret = 0;
 	unsigned long flags;
-	struct amd_svm_iommu_ir *ir;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
@@ -811,25 +806,14 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
 		struct vcpu_svm *prev_svm;
 
-		if (!prev_vcpu) {
-			ret = -EINVAL;
-			goto out;
-		}
+		if (!prev_vcpu)
+			return -EINVAL;
 
 		prev_svm = to_svm(prev_vcpu);
-		svm_ir_list_del(prev_svm, pi);
+		svm_ir_list_del(prev_svm, irqfd, pi);
 	}
 
-	/**
-	 * Allocating new amd_iommu_pi_data, which will get
-	 * add to the per-vcpu ir_list.
-	 */
-	ir = kzalloc(sizeof(struct amd_svm_iommu_ir), GFP_ATOMIC | __GFP_ACCOUNT);
-	if (!ir) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	ir->data = pi->ir_data;
+	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
@@ -844,10 +828,9 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
 
-	list_add(&ir->node, &svm->ir_list);
+	list_add(&irqfd->vcpu_list, &svm->ir_list);
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-out:
-	return ret;
+	return 0;
 }
 
 /*
@@ -951,7 +934,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			 * scheduling information in IOMMU irte.
 			 */
 			if (!ret && pi.is_guest_mode)
-				svm_ir_list_add(svm, &pi);
+				svm_ir_list_add(svm, irqfd, &pi);
 		}
 
 		if (!ret && svm) {
@@ -992,7 +975,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 			vcpu = kvm_get_vcpu_by_id(kvm, id);
 			if (vcpu)
-				svm_ir_list_del(to_svm(vcpu), &pi);
+				svm_ir_list_del(to_svm(vcpu), irqfd, &pi);
 		}
 	}
 out:
@@ -1004,8 +987,8 @@ static inline int
 avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 {
 	int ret = 0;
-	struct amd_svm_iommu_ir *ir;
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct kvm_kernel_irqfd *irqfd;
 
 	lockdep_assert_held(&svm->ir_list_lock);
 
@@ -1019,8 +1002,8 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 	if (list_empty(&svm->ir_list))
 		return 0;
 
-	list_for_each_entry(ir, &svm->ir_list, node) {
-		ret = amd_iommu_update_ga(cpu, r, ir->data);
+	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+		ret = amd_iommu_update_ga(cpu, r, irqfd->irq_bypass_data);
 		if (ret)
 			return ret;
 	}
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index b35fce30d923..cc27877d69ae 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -310,10 +310,12 @@ struct vcpu_svm {
 	u64 *avic_physical_id_cache;
 
 	/*
-	 * Per-vcpu list of struct amd_svm_iommu_ir:
-	 * This is used mainly to store interrupt remapping information used
-	 * when update the vcpu affinity. This avoids the need to scan for
-	 * IRTE and try to match ga_tag in the IOMMU driver.
+	 * Per-vCPU list of irqfds that are eligible to post IRQs directly to
+	 * the vCPU (a.k.a. device posted IRQs, a.k.a. IRQ bypass).  The list
+	 * is used to reconfigure IRTEs when the vCPU is loaded/put (to set the
+	 * target pCPU), when AVIC is toggled on/off (to (de)activate bypass),
+	 * and if the irqfd becomes ineligible for posting (to put the IRTE
+	 * back into remapped mode).
 	 */
 	struct list_head ir_list;
 	spinlock_t ir_list_lock;
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 8ad43692e3bb..6510a48e62aa 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -59,6 +59,9 @@ struct kvm_kernel_irqfd {
 	struct work_struct shutdown;
 	struct irq_bypass_consumer consumer;
 	struct irq_bypass_producer *producer;
+
+	struct list_head vcpu_list;
+	void *irq_bypass_data;
 };
 
 #endif /* __LINUX_KVM_IRQFD_H */
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 05/62] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 04/62] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 06/62] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
                   ` (57 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Delete the previous per-vCPU IRTE link prior to modifying the IRTE.  If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.

This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 60 +++++++++------------------------------
 include/linux/kvm_irqfd.h |  1 +
 2 files changed, 14 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d33c01379421..ed7374f0bd5a 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -766,23 +766,19 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	return ret;
 }
 
-static void svm_ir_list_del(struct vcpu_svm *svm,
-			    struct kvm_kernel_irqfd *irqfd,
-			    struct amd_iommu_pi_data *pi)
+static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 {
+	struct kvm_vcpu *vcpu = irqfd->irq_bypass_vcpu;
 	unsigned long flags;
-	struct kvm_kernel_irqfd *cur;
 
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-	list_for_each_entry(cur, &svm->ir_list, vcpu_list) {
-		if (cur->irq_bypass_data != pi->ir_data)
-			continue;
-		if (WARN_ON_ONCE(cur != irqfd))
-			continue;
-		list_del(&irqfd->vcpu_list);
-		break;
-	}
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+	if (!vcpu)
+		return;
+
+	spin_lock_irqsave(&to_svm(vcpu)->ir_list_lock, flags);
+	list_del(&irqfd->vcpu_list);
+	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
+
+	irqfd->irq_bypass_vcpu = NULL;
 }
 
 static int svm_ir_list_add(struct vcpu_svm *svm,
@@ -795,24 +791,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	/**
-	 * In some cases, the existing irte is updated and re-set,
-	 * so we need to check here if it's already been * added
-	 * to the ir_list.
-	 */
-	if (pi->prev_ga_tag) {
-		struct kvm *kvm = svm->vcpu.kvm;
-		u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag);
-		struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
-		struct vcpu_svm *prev_svm;
-
-		if (!prev_vcpu)
-			return -EINVAL;
-
-		prev_svm = to_svm(prev_vcpu);
-		svm_ir_list_del(prev_svm, irqfd, pi);
-	}
-
+	irqfd->irq_bypass_vcpu = &svm->vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -904,6 +883,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
 
+		svm_ir_list_del(irqfd);
+
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
@@ -962,21 +943,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		pi.prev_ga_tag = 0;
 		pi.is_guest_mode = false;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
-
-		/**
-		 * Check if the posted interrupt was previously
-		 * setup with the guest_mode by checking if the ga_tag
-		 * was cached. If so, we need to clean up the per-vcpu
-		 * ir_list.
-		 */
-		if (!ret && pi.prev_ga_tag) {
-			int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
-			struct kvm_vcpu *vcpu;
-
-			vcpu = kvm_get_vcpu_by_id(kvm, id);
-			if (vcpu)
-				svm_ir_list_del(to_svm(vcpu), irqfd, &pi);
-		}
 	}
 out:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 6510a48e62aa..361c07f4466d 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -60,6 +60,7 @@ struct kvm_kernel_irqfd {
 	struct irq_bypass_consumer consumer;
 	struct irq_bypass_producer *producer;
 
+	struct kvm_vcpu *irq_bypass_vcpu;
 	struct list_head vcpu_list;
 	void *irq_bypass_data;
 };
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 06/62] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 05/62] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 07/62] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
                   ` (56 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c             |  2 --
 drivers/iommu/amd/amd_iommu_types.h |  1 -
 drivers/iommu/amd/iommu.c           | 10 ----------
 include/linux/amd-iommu.h           |  2 +-
 4 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ed7374f0bd5a..4e8380d2f017 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -938,9 +938,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		/**
 		 * Here, pi is used to:
 		 * - Tell IOMMU to use legacy mode for this interrupt.
-		 * - Retrieve ga_tag of prior interrupt remapping data.
 		 */
-		pi.prev_ga_tag = 0;
 		pi.is_guest_mode = false;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
 	}
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 5089b58e528a..57a96f3e7b84 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -1060,7 +1060,6 @@ struct irq_2_irte {
 };
 
 struct amd_ir_data {
-	u32 cached_ga_tag;
 	struct amd_iommu *iommu;
 	struct irq_2_irte irq_2_irte;
 	struct msi_msg msi_entry;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index f34209b08b4c..f23635b062f0 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3887,23 +3887,13 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	ir_data->cfg = irqd_cfg(data);
 	pi_data->ir_data = ir_data;
 
-	pi_data->prev_ga_tag = ir_data->cached_ga_tag;
 	if (pi_data->is_guest_mode) {
 		ir_data->ga_root_ptr = (pi_data->base >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
-		if (!ret)
-			ir_data->cached_ga_tag = pi_data->ga_tag;
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
-
-		/*
-		 * This communicates the ga_tag back to the caller
-		 * so that it can do all the necessary clean up.
-		 */
-		if (!ret)
-			ir_data->cached_ga_tag = 0;
 	}
 
 	return ret;
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 062fbd4c9b77..1f9b13d803c5 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -19,8 +19,8 @@ struct amd_iommu;
  */
 struct amd_iommu_pi_data {
 	u32 ga_tag;
-	u32 prev_ga_tag;
 	u64 base;
+
 	bool is_guest_mode;
 	struct vcpu_data *vcpu_data;
 	void *ir_data;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 07/62] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (5 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 06/62] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
                   ` (55 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Delete the IRTE link from the previous vCPU irrespective of the new
routing state, i.e. even if the IRTE won't be configured to post IRQs to a
vCPU.  Whether or not the new route is postable as no bearing on the *old*
route.  Failure to delete the link can result in KVM incorrectly updating
the IRTE, e.g. if the "old" vCPU is scheduled in/out.

Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4e8380d2f017..c981ce764b45 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -861,6 +861,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
 
+	/*
+	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
+	 * from the *previous* vCPU's list.
+	 */
+	svm_ir_list_del(irqfd);
+
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
 		 __func__, host_irq, guest_irq, set);
 
@@ -883,8 +889,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
 
-		svm_ir_list_del(irqfd);
-
 		/**
 		 * Here, we setup with legacy mode in the following cases:
 		 * 1. When cannot target interrupt to a specific vcpu.
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (6 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 07/62] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 14:15   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
                   ` (54 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h | 2 --
 arch/x86/kvm/svm/avic.c    | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index ad954a1a6656..89a666952b01 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -260,8 +260,6 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 #define AVIC_DOORBELL_PHYSICAL_ID_MASK			GENMASK_ULL(11, 0)
 
-#define VMCB_AVIC_APIC_BAR_MASK				0xFFFFFFFFFF000ULL
-
 #define AVIC_UNACCEL_ACCESS_WRITE_MASK		1
 #define AVIC_UNACCEL_ACCESS_OFFSET_MASK		0xFF0
 #define AVIC_UNACCEL_ACCESS_VECTOR_MASK		0xFFFFFFFF
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c981ce764b45..5344ae76c590 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -244,7 +244,7 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 	vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
 	vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
 	vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK;
-	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE & VMCB_AVIC_APIC_BAR_MASK;
+	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
 		avic_activate_vmcb(svm);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (7 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 14:37   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
                   ` (53 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).

All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.

Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code.  It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry.  And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.

The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:

  In some implementations, the bit used may be a physical address bit

Key word being "may".

Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h |  4 +---
 arch/x86/kvm/svm/avic.c    | 18 ++++++++++--------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 89a666952b01..36f67c69ea66 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -253,7 +253,7 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK		(1 << 31)
 
 #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK	GENMASK_ULL(11, 0)
-#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	(0xFFFFFFFFFFULL << 12)
+#define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	GENMASK_ULL(51, 12)
 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK		(1ULL << 62)
 #define AVIC_PHYSICAL_ID_ENTRY_VALID_MASK		(1ULL << 63)
 #define AVIC_PHYSICAL_ID_TABLE_SIZE_MASK		(0xFFULL)
@@ -288,8 +288,6 @@ enum avic_ipi_failure_cause {
 static_assert((AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == AVIC_MAX_PHYSICAL_ID);
 static_assert((X2AVIC_MAX_PHYSICAL_ID & AVIC_PHYSICAL_MAX_INDEX_MASK) == X2AVIC_MAX_PHYSICAL_ID);
 
-#define AVIC_HPA_MASK	~((0xFFFULL << 52) | 0xFFF)
-
 #define SVM_SEV_FEAT_SNP_ACTIVE				BIT(0)
 #define SVM_SEV_FEAT_RESTRICTED_INJECTION		BIT(3)
 #define SVM_SEV_FEAT_ALTERNATE_INJECTION		BIT(4)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 5344ae76c590..4b882148f2c0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -241,9 +241,9 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
 	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
-	vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
-	vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
-	vmcb->control.avic_physical_id = ppa & AVIC_HPA_MASK;
+	vmcb->control.avic_backing_page = bpa;
+	vmcb->control.avic_logical_id = lpa;
+	vmcb->control.avic_physical_id = ppa;
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
@@ -301,9 +301,12 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	if (!entry)
 		return -EINVAL;
 
-	new_entry = __sme_set((page_to_phys(svm->avic_backing_page) &
-			      AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) |
-			      AVIC_PHYSICAL_ID_ENTRY_VALID_MASK);
+	/* Note, fls64() returns the bit position, +1. */
+	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
+		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
+
+	new_entry = __sme_set(page_to_phys(svm->avic_backing_page)) |
+		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(*entry, new_entry);
 
 	svm->avic_physical_id_cache = entry;
@@ -903,8 +906,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			enable_remapped_mode = false;
 
 			/* Try to enable guest_mode in IRTE */
-			pi.base = __sme_set(page_to_phys(svm->avic_backing_page) &
-					    AVIC_HPA_MASK);
+			pi.base = __sme_set(page_to_phys(svm->avic_backing_page));
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (8 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 14:38   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
                   ` (52 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4b882148f2c0..c36f7db9252e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -234,14 +234,18 @@ int avic_vm_init(struct kvm *kvm)
 	return err;
 }
 
+static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
+{
+	return __sme_set(page_to_phys(svm->avic_backing_page));
+}
+
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
-	phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
 	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
 	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
-	vmcb->control.avic_backing_page = bpa;
+	vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
 	vmcb->control.avic_logical_id = lpa;
 	vmcb->control.avic_physical_id = ppa;
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
@@ -305,7 +309,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
 		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
 
-	new_entry = __sme_set(page_to_phys(svm->avic_backing_page)) |
+	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(*entry, new_entry);
 
@@ -845,7 +849,7 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
 		 irq.vector);
 	*svm = to_svm(vcpu);
-	vcpu_info->pi_desc_addr = __sme_set(page_to_phys((*svm)->avic_backing_page));
+	vcpu_info->pi_desc_addr = avic_get_backing_page_address(*svm);
 	vcpu_info->vector = irq.vector;
 
 	return 0;
@@ -906,7 +910,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			enable_remapped_mode = false;
 
 			/* Try to enable guest_mode in IRTE */
-			pi.base = __sme_set(page_to_phys(svm->avic_backing_page));
+			pi.base = avic_get_backing_page_address(svm);
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (9 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 14:44   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
                   ` (51 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
address of KVM's vAPIC page directly from the source.  Getting a physical
address from a kernel virtual address is not an expensive operation, and
getting the physical address from a struct page is *more* expensive for
CONFIG_SPARSEMEM=y kernels.  Regardless, none of the paths that consume
the address are hot paths, i.e. shaving cycles is not a priority.

Eliminating the "cache" means KVM doesn't have to worry about the cache
being invalid, which will simplify a future fix when dealing with vCPU IDs
that are too big.

WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
in-kernel local APIC.  avic_init_vcpu() bails early if the APIC is not
in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
been created, i.e. it should be impossible to reach
avic_init_backing_page() without the vAPIC being allocated.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 6 ++----
 arch/x86/kvm/svm/svm.h  | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c36f7db9252e..ab228872a19b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -236,7 +236,7 @@ int avic_vm_init(struct kvm *kvm)
 
 static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
 {
-	return __sme_set(page_to_phys(svm->avic_backing_page));
+	return __sme_set(__pa(svm->vcpu.arch.apic->regs));
 }
 
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
@@ -281,7 +281,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	    (id > X2AVIC_MAX_PHYSICAL_ID))
 		return -EINVAL;
 
-	if (!vcpu->arch.apic->regs)
+	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
 
 	if (kvm_apicv_activated(vcpu->kvm)) {
@@ -298,8 +298,6 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	svm->avic_backing_page = virt_to_page(vcpu->arch.apic->regs);
-
 	/* Setting AVIC backing page address in the phy APIC ID table */
 	entry = avic_get_physical_id_entry(vcpu, id);
 	if (!entry)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index cc27877d69ae..1585288200f4 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -306,7 +306,6 @@ struct vcpu_svm {
 
 	u32 ldr_reg;
 	u32 dfr_reg;
-	struct page *avic_backing_page;
 	u64 *avic_physical_id_cache;
 
 	/*
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (10 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-17 14:25   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
                   ` (50 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).

Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  9 ++++++++-
 arch/x86/kvm/svm/avic.c         | 14 ++++++++++++--
 arch/x86/kvm/svm/svm.h          |  3 ++-
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2a6ef1398da7..a9b709db7c59 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1314,6 +1314,12 @@ enum kvm_apicv_inhibit {
 	 */
 	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
 
+	/*
+	 * AVIC is disabled because the vCPU's APIC ID is beyond the max
+	 * supported by AVIC/x2AVIC, i.e. the vCPU is unaddressable.
+	 */
+	APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG,
+
 	NR_APICV_INHIBIT_REASONS,
 };
 
@@ -1332,7 +1338,8 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(IRQWIN),			\
 	__APICV_INHIBIT_REASON(PIT_REINJ),		\
 	__APICV_INHIBIT_REASON(SEV),			\
-	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
+	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
+	__APICV_INHIBIT_REASON(PHYSICAL_ID_TOO_BIG)
 
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ab228872a19b..f0a74b102c57 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	int id = vcpu->vcpu_id;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	/*
+	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
+	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
+	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
+	 * avic_vcpu_load() expects to be called if and only if the vCPU has
+	 * fully initialized AVIC.
+	 */
 	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
-	    (id > X2AVIC_MAX_PHYSICAL_ID))
-		return -EINVAL;
+	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
+		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
+		vcpu->arch.apic->apicv_active = false;
+		return 0;
+	}
 
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 1585288200f4..71e3c003580e 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -722,7 +722,8 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
 	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |	\
 	BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |	\
-	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED)	\
+	BIT(APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED) |	\
+	BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG)	\
 )
 
 bool avic_hardware_setup(void);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (11 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-17 14:49   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
                   ` (49 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check.  Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.

Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f0a74b102c57..948bab48083b 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -256,26 +256,12 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 		avic_deactivate_vmcb(svm);
 }
 
-static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu,
-				       unsigned int index)
-{
-	u64 *avic_physical_id_table;
-	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
-
-	if ((!x2avic_enabled && index > AVIC_MAX_PHYSICAL_ID) ||
-	    (index > X2AVIC_MAX_PHYSICAL_ID))
-		return NULL;
-
-	avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page);
-
-	return &avic_physical_id_table[index];
-}
-
 static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 {
-	u64 *entry, new_entry;
-	int id = vcpu->vcpu_id;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
+	u32 id = vcpu->vcpu_id;
+	u64 *table, new_entry;
 
 	/*
 	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
@@ -291,6 +277,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
+	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
+		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
+
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
 
@@ -309,9 +298,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	}
 
 	/* Setting AVIC backing page address in the phy APIC ID table */
-	entry = avic_get_physical_id_entry(vcpu, id);
-	if (!entry)
-		return -EINVAL;
+	table = page_address(kvm_svm->avic_physical_id_table_page);
 
 	/* Note, fls64() returns the bit position, +1. */
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
@@ -319,9 +306,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
-	WRITE_ONCE(*entry, new_entry);
+	WRITE_ONCE(table[id], new_entry);
 
-	svm->avic_physical_id_cache = entry;
+	svm->avic_physical_id_cache = &table[id];
 
 	return 0;
 }
@@ -1004,6 +991,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
 		return;
 
+	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+		return;
+
 	/*
 	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
 	 * is being scheduled in after being preempted.  The CPU entries in the
@@ -1044,6 +1034,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_preemption_disabled();
 
+	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+		return;
+
 	/*
 	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
 	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (12 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-17 15:01   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
                   ` (48 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity.  E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 49 ++++++++++++++---------------------------
 arch/x86/kvm/svm/svm.h  |  4 ++--
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 948bab48083b..bf18b0b643d9 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -172,10 +172,8 @@ void avic_vm_destroy(struct kvm *kvm)
 	if (!enable_apicv)
 		return;
 
-	if (kvm_svm->avic_logical_id_table_page)
-		__free_page(kvm_svm->avic_logical_id_table_page);
-	if (kvm_svm->avic_physical_id_table_page)
-		__free_page(kvm_svm->avic_physical_id_table_page);
+	free_page((unsigned long)kvm_svm->avic_logical_id_table);
+	free_page((unsigned long)kvm_svm->avic_physical_id_table);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
 	hash_del(&kvm_svm->hnode);
@@ -188,27 +186,19 @@ int avic_vm_init(struct kvm *kvm)
 	int err = -ENOMEM;
 	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	struct kvm_svm *k2;
-	struct page *p_page;
-	struct page *l_page;
 	u32 vm_id;
 
 	if (!enable_apicv)
 		return 0;
 
-	/* Allocating physical APIC ID table (4KB) */
-	p_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!p_page)
+	kvm_svm->avic_physical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (!kvm_svm->avic_physical_id_table)
 		goto free_avic;
 
-	kvm_svm->avic_physical_id_table_page = p_page;
-
-	/* Allocating logical APIC ID table (4KB) */
-	l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!l_page)
+	kvm_svm->avic_logical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+	if (!kvm_svm->avic_logical_id_table)
 		goto free_avic;
 
-	kvm_svm->avic_logical_id_table_page = l_page;
-
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
 	vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK;
@@ -242,12 +232,10 @@ static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
 void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
-	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
 
 	vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
-	vmcb->control.avic_logical_id = lpa;
-	vmcb->control.avic_physical_id = ppa;
+	vmcb->control.avic_logical_id = __sme_set(__pa(kvm_svm->avic_logical_id_table));
+	vmcb->control.avic_physical_id = __sme_set(__pa(kvm_svm->avic_physical_id_table));
 	vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
 
 	if (kvm_apicv_activated(svm->vcpu.kvm))
@@ -261,7 +249,7 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 id = vcpu->vcpu_id;
-	u64 *table, new_entry;
+	u64 new_entry;
 
 	/*
 	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
@@ -277,8 +265,8 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
-	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
-		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
+	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE ||
+		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE);
 
 	if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
 		return -EINVAL;
@@ -297,18 +285,16 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
-	/* Setting AVIC backing page address in the phy APIC ID table */
-	table = page_address(kvm_svm->avic_physical_id_table_page);
-
 	/* Note, fls64() returns the bit position, +1. */
 	BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
 		     fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
 
+	/* Setting AVIC backing page address in the phy APIC ID table */
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
-	WRITE_ONCE(table[id], new_entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
-	svm->avic_physical_id_cache = &table[id];
+	svm->avic_physical_id_cache = &kvm_svm->avic_physical_id_table[id];
 
 	return 0;
 }
@@ -442,7 +428,7 @@ static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source
 		if (apic_x2apic_mode(source))
 			avic_logical_id_table = NULL;
 		else
-			avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page);
+			avic_logical_id_table = kvm_svm->avic_logical_id_table;
 
 		/*
 		 * AVIC is inhibited if vCPUs aren't mapped 1:1 with logical
@@ -544,7 +530,6 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu)
 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
-	u32 *logical_apic_id_table;
 	u32 cluster, index;
 
 	ldr = GET_APIC_LOGICAL_ID(ldr);
@@ -565,9 +550,7 @@ static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 		return NULL;
 	index += (cluster << 2);
 
-	logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page);
-
-	return &logical_apic_id_table[index];
+	return &kvm_svm->avic_logical_id_table[index];
 }
 
 static void avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 71e3c003580e..ec5d77d42a49 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -123,8 +123,8 @@ struct kvm_svm {
 
 	/* Struct members for AVIC */
 	u32 avic_vm_id;
-	struct page *avic_logical_id_table_page;
-	struct page *avic_physical_id_table_page;
+	u32 *avic_logical_id_table;
+	u64 *avic_physical_id_table;
 	struct hlist_node hnode;
 
 	struct kvm_sev_info sev_info;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (13 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-19 11:09   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 16/62] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
                   ` (47 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly.  Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 27 +++++++++++++++------------
 arch/x86/kvm/svm/svm.h  |  1 -
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index bf18b0b643d9..0c0be274d29e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -294,8 +294,6 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
-	svm->avic_physical_id_cache = &kvm_svm->avic_physical_id_table[id];
-
 	return 0;
 }
 
@@ -770,13 +768,16 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct kvm_kernel_irqfd *irqfd,
 			   struct amd_iommu_pi_data *pi)
 {
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	unsigned long flags;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	irqfd->irq_bypass_vcpu = &svm->vcpu;
+	irqfd->irq_bypass_vcpu = vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -787,7 +788,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
 	 * See also avic_vcpu_load().
 	 */
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
@@ -964,17 +965,18 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	u64 entry;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
+	u64 entry;
 
 	lockdep_assert_preemption_disabled();
 
 	if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
 		return;
 
-	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
 	/*
@@ -996,14 +998,14 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
-	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
@@ -1011,13 +1013,14 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	u64 entry;
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
+	u64 entry;
 
 	lockdep_assert_preemption_disabled();
 
-	if (WARN_ON_ONCE(!svm->avic_physical_id_cache))
+	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
 	/*
@@ -1027,7 +1030,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
 	 * recursively.
 	 */
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
@@ -1046,7 +1049,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
+	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ec5d77d42a49..f225d0bed152 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -306,7 +306,6 @@ struct vcpu_svm {
 
 	u32 ldr_reg;
 	u32 dfr_reg;
-	u64 *avic_physical_id_cache;
 
 	/*
 	 * Per-vCPU list of irqfds that are eligible to post IRQs directly to
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 16/62] KVM: VMX: Move enable_ipiv knob to common x86
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (14 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
                   ` (46 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Move enable_ipiv to common x86 so that it can be reused by SVM to control
IPI virtualization when AVIC is enabled.  SVM doesn't actually provide a
way to truly disable IPI virtualization, but KVM can get close enough by
skipping the necessary table programming.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/capabilities.h | 1 -
 arch/x86/kvm/vmx/vmx.c          | 2 --
 arch/x86/kvm/x86.c              | 3 +++
 4 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a9b709db7c59..cba82d7a701d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1950,6 +1950,7 @@ struct kvm_arch_async_pf {
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
+extern bool __read_mostly enable_ipiv;
 extern bool __read_mostly enable_device_posted_irqs;
 extern struct kvm_x86_ops kvm_x86_ops;
 
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..5316c27f6099 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -15,7 +15,6 @@ extern bool __read_mostly enable_ept;
 extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
-extern bool __read_mostly enable_ipiv;
 extern int __read_mostly pt_mode;
 
 #define PT_MODE_SYSTEM		0
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ff00ae9f05a..f79604bc0127 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -112,8 +112,6 @@ static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, 0444);
 
 module_param(enable_apicv, bool, 0444);
-
-bool __read_mostly enable_ipiv = true;
 module_param(enable_ipiv, bool, 0444);
 
 module_param(enable_device_posted_irqs, bool, 0444);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93711a5ef272..f1b0dbce9a8b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -226,6 +226,9 @@ EXPORT_SYMBOL_GPL(allow_smaller_maxphyaddr);
 bool __read_mostly enable_apicv = true;
 EXPORT_SYMBOL_GPL(enable_apicv);
 
+bool __read_mostly enable_ipiv = true;
+EXPORT_SYMBOL_GPL(enable_ipiv);
+
 bool __read_mostly enable_device_posted_irqs = true;
 EXPORT_SYMBOL_GPL(enable_device_posted_irqs);
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (15 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 16/62] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-19 11:31   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
                   ` (45 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

From: Maxim Levitsky <mlevitsk@redhat.com>

Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning.  SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.

To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.

Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.

Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.

Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 32 ++++++++++++++++++++++++++------
 arch/x86/kvm/svm/svm.c  |  2 ++
 arch/x86/kvm/svm/svm.h  |  8 ++++++++
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 0c0be274d29e..48c737e1200a 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -292,6 +292,13 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
 	/* Setting AVIC backing page address in the phy APIC ID table */
 	new_entry = avic_get_backing_page_address(svm) |
 		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
+	svm->avic_physical_id_entry = new_entry;
+
+	/*
+	 * Initialize the real table, as vCPUs must have a valid entry in order
+	 * for broadcast IPIs to function correctly (broadcast IPIs ignore
+	 * invalid entries, i.e. aren't guaranteed to generate a VM-Exit).
+	 */
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
 
 	return 0;
@@ -769,8 +776,6 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct amd_iommu_pi_data *pi)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
-	struct kvm *kvm = vcpu->kvm;
-	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	unsigned long flags;
 	u64 entry;
 
@@ -788,7 +793,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
 	 * See also avic_vcpu_load().
 	 */
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
 				    true, pi->ir_data);
@@ -998,14 +1003,26 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
+	svm->avic_physical_id_entry = entry;
+
+	/*
+	 * If IPI virtualization is disabled, clear IsRunning when updating the
+	 * actual Physical ID table, so that the CPU never sees IsRunning=1.
+	 * Keep the APIC ID up-to-date in the entry to minimize the chances of
+	 * things going sideways if hardware peeks at the ID.
+	 */
+	if (!enable_ipiv)
+		entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+
 	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
@@ -1030,7 +1047,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
 	 * recursively.
 	 */
-	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
+	entry = svm->avic_physical_id_entry;
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
@@ -1049,7 +1066,10 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+	svm->avic_physical_id_entry = entry;
+
+	if (enable_ipiv)
+		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0ad1a6d4fb6d..56d11f7b4bef 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -231,6 +231,7 @@ module_param(tsc_scaling, int, 0444);
  */
 static bool avic;
 module_param(avic, bool, 0444);
+module_param(enable_ipiv, bool, 0444);
 
 module_param(enable_device_posted_irqs, bool, 0444);
 
@@ -5594,6 +5595,7 @@ static __init int svm_hardware_setup(void)
 	enable_apicv = avic = avic && avic_hardware_setup();
 
 	if (!enable_apicv) {
+		enable_ipiv = false;
 		svm_x86_ops.vcpu_blocking = NULL;
 		svm_x86_ops.vcpu_unblocking = NULL;
 		svm_x86_ops.vcpu_get_apicv_inhibit_reasons = NULL;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f225d0bed152..939ff0e35a2b 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -307,6 +307,14 @@ struct vcpu_svm {
 	u32 ldr_reg;
 	u32 dfr_reg;
 
+	/* This is essentially a shadow of the vCPU's actual entry in the
+	 * Physical ID table that is programmed into the VMCB, i.e. that is
+	 * seen by the CPU.  If IPI virtualization is disabled, IsRunning is
+	 * only ever set in the shadow, i.e. is never propagated to the "real"
+	 * table, so that hardware never sees IsRunning=1.
+	 */
+	u64 avic_physical_id_entry;
+
 	/*
 	 * Per-vCPU list of irqfds that are eligible to post IRQs directly to
 	 * the vCPU (a.k.a. device posted IRQs, a.k.a. IRQ bypass).  The list
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (16 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-23 14:05   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 19/62] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
                   ` (44 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

From: Maxim Levitsky <mlevitsk@redhat.com>

Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.

The absence of the VM-Exit leads to KVM not waking (or triggering nested
VM-Exit) of the target vCPU(s) of the IPI, which can lead to hung vCPUs,
unbounded delays in L2 execution, etc.

To workaround the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1.  As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.

Disabling IPI virtualization does carry a performance penalty, but
benchmarkng shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.

Note, when virtualizaing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 48c737e1200a..bf8b59556373 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1187,6 +1187,14 @@ bool avic_hardware_setup(void)
 	if (x2avic_enabled)
 		pr_info("x2AVIC enabled\n");
 
+	/*
+	 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
+	 * due to erratum 1235, which results in missed GA log events and thus
+	 * missed wake events for blocking vCPUs due to the CPU failing to see
+	 * a software update to clear IsRunning.
+	 */
+	enable_ipiv = enable_ipiv && boot_cpu_data.x86 != 0x17;
+
 	amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
 
 	return true;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 19/62] KVM: VMX: Suppress PI notifications whenever the vCPU is put
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (17 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
                   ` (43 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU
is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't
do anything in response to a notification IRQ that arrives in the host,
nor does KVM rely on the Outstanding Notification (PID.ON) flag when the
vCPU is unloaded.  And, the cost of scanning the PIR to manually set PID.ON
when loading the vCPU is quite small, especially relative to the cost of
loading (and unloading) a vCPU.

On the flip side, leaving SN clear means a notification for the vCPU will
result in a spurious IRQ for the pCPU, even if vCPU task is scheduled out,
running in userspace, etc.  Even worse, if the pCPU is running a different
vCPU, the spurious IRQ could trigger posted interrupt processing for the
wrong vCPU, which is technically a violation of the architecture, as
setting bits in PIR aren't supposed to be propagated to the vIRR until a
notification IRQ is received.

The saving grace of the current behavior is that hardware sends
notification interrupts if and only if PID.ON=0, i.e. only the first
posted interrupt for a vCPU will trigger a spurious IRQ (for each window
where the vCPU is unloaded).

Ideally, KVM would suppress notifications before enabling IRQs in the
VM-Exit, but KVM relies on PID.ON as an indicator that there is a posted
interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there
is no way to ask hardware to set PID.ON, but not generate an interrupt.
That could be solved by using pi_has_pending_interrupt() instead of
checking only PID.ON, but it's not at all clear that would be a performance
win, as KVM would end up scanning the entire PIR whenever an interrupt
isn't pending.

And long term, the spurious IRQ window, i.e. where a vCPU is loaded with
IRQs enabled, can effectively be made smaller for hot paths by moving
performance critical VM-Exit handlers into the fastpath, i.e. by never
enabling IRQs for hot path VM-Exits.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 110fb19848ab..d4826a6b674f 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -73,13 +73,10 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	/*
 	 * If the vCPU wasn't on the wakeup list and wasn't migrated, then the
 	 * full update can be skipped as neither the vector nor the destination
-	 * needs to be changed.
+	 * needs to be changed.  Clear SN even if there is no assigned device,
+	 * again for simplicity.
 	 */
 	if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
-		/*
-		 * Clear SN if it was set due to being preempted.  Again, do
-		 * this even if there is no assigned device for simplicity.
-		 */
 		if (pi_test_and_clear_sn(pi_desc))
 			goto after_clear_sn;
 		return;
@@ -225,17 +222,23 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 	if (!vmx_needs_pi_wakeup(vcpu))
 		return;

-	if (kvm_vcpu_is_blocking(vcpu) &&
+	/*
+	 * If the vCPU is blocking with IRQs enabled and ISN'T being preempted,
+	 * enable the wakeup handler so that notification IRQ wakes the vCPU as
+	 * expected.  There is no need to enable the wakeup handler if the vCPU
+	 * is preempted between setting its wait state and manually scheduling
+	 * out, as the task is still runnable, i.e. doesn't need a wake event
+	 * from KVM to be scheduled in.
+	 *
+	 * If the wakeup handler isn't being enabled, Suppress Notifications as
+	 * the cost of propagating PIR.IRR to PID.ON is negligible compared to
+	 * the cost of a spurious IRQ, and vCPU put/load is a slow path.
+	 */
+	if (!vcpu->preempted && kvm_vcpu_is_blocking(vcpu) &&
 	    ((is_td_vcpu(vcpu) && tdx_interrupt_allowed(vcpu)) ||
 	     (!is_td_vcpu(vcpu) && !vmx_interrupt_blocked(vcpu))))
 		pi_enable_wakeup_handler(vcpu);
-
-	/*
-	 * Set SN when the vCPU is preempted.  Note, the vCPU can both be seen
-	 * as blocking and preempted, e.g. if it's preempted between setting
-	 * its wait state and manually scheduling out.
-	 */
-	if (vcpu->preempted)
+	else
 		pi_set_sn(pi_desc);
 }

-- 
2.50.0.rc1.591.g9c95f17f64-goog

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (18 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 19/62] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-23 15:54   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 21/62] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
                   ` (42 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Add a comment to explain why KVM clears IsRunning when putting a vCPU,
even though leaving IsRunning=1 would be ok from a functional perspective.
Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
fast as to induce a 50%+ loss in performance.

Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index bf8b59556373..3cf929ac117f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1121,19 +1121,24 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
 
-       /*
-        * Unload the AVIC when the vCPU is about to block, _before_
-        * the vCPU actually blocks.
-        *
-        * Any IRQs that arrive before IsRunning=0 will not cause an
-        * incomplete IPI vmexit on the source, therefore vIRR will also
-        * be checked by kvm_vcpu_check_block() before blocking.  The
-        * memory barrier implicit in set_current_state orders writing
-        * IsRunning=0 before reading the vIRR.  The processor needs a
-        * matching memory barrier on interrupt delivery between writing
-        * IRR and reading IsRunning; the lack of this barrier might be
-        * the cause of errata #1235).
-        */
+	/*
+	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
+	 * actually blocks.
+	 *
+	 * Note, any IRQs that arrive before IsRunning=0 will not cause an
+	 * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
+	 * this by checking vIRR one last time before blocking.  The memory
+	 * barrier implicit in set_current_state orders writing IsRunning=0
+	 * before reading the vIRR.  The processor needs a matching memory
+	 * barrier on interrupt delivery between writing IRR and reading
+	 * IsRunning; the lack of this barrier might be the cause of errata #1235).
+	 *
+	 * Clear IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
+	 * doesn't need to detect events for scheduling purposes.  The doorbell
+	 * used to signal running vCPUs cannot be blocked, i.e. will perturb the
+	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
+	 * to the vCPU while it's scheduled out.
+	 */
 	avic_vcpu_put(vcpu);
 }
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 21/62] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (19 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 22/62] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
                   ` (41 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer.  KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 7 +++++--
 drivers/iommu/amd/iommu.c | 2 +-
 include/linux/amd-iommu.h | 2 --
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 3cf929ac117f..461300bc5608 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -893,8 +893,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 
 			enable_remapped_mode = false;
 
-			/* Try to enable guest_mode in IRTE */
-			pi.base = avic_get_backing_page_address(svm);
+			/*
+			 * Try to enable guest_mode in IRTE.  Note, the address
+			 * of the vCPU's AVIC backing page is passed to the
+			 * IOMMU via vcpu_info->pi_desc_addr.
+			 */
 			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index f23635b062f0..512167f7aef4 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3888,7 +3888,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	pi_data->ir_data = ir_data;
 
 	if (pi_data->is_guest_mode) {
-		ir_data->ga_root_ptr = (pi_data->base >> 12);
+		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 1f9b13d803c5..deeefc92a5cf 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -19,8 +19,6 @@ struct amd_iommu;
  */
 struct amd_iommu_pi_data {
 	u32 ga_tag;
-	u64 base;
-
 	bool is_guest_mode;
 	struct vcpu_data *vcpu_data;
 	void *ir_data;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 22/62] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (20 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 21/62] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 23/62] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
                   ` (40 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a
vCPU" now that there's no need to communicate information back to KVM
about the previous vCPU (KVM does its own tracking).

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 16 ++++------------
 drivers/iommu/amd/iommu.c | 10 +++++++---
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 461300bc5608..6260bf3697ba 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -927,18 +927,10 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		}
 	}
 
-	ret = 0;
-	if (enable_remapped_mode) {
-		/* Use legacy mode in IRTE */
-		struct amd_iommu_pi_data pi;
-
-		/**
-		 * Here, pi is used to:
-		 * - Tell IOMMU to use legacy mode for this interrupt.
-		 */
-		pi.is_guest_mode = false;
-		ret = irq_set_vcpu_affinity(host_irq, &pi);
-	}
+	if (enable_remapped_mode)
+		ret = irq_set_vcpu_affinity(host_irq, NULL);
+	else
+		ret = 0;
 out:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return ret;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 512167f7aef4..5141507587e1 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3864,7 +3864,6 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 {
 	int ret;
 	struct amd_iommu_pi_data *pi_data = vcpu_info;
-	struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
 	struct amd_ir_data *ir_data = data->chip_data;
 	struct irq_2_irte *irte_info = &ir_data->irq_2_irte;
 	struct iommu_dev_data *dev_data;
@@ -3885,9 +3884,14 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 		return -EINVAL;
 
 	ir_data->cfg = irqd_cfg(data);
-	pi_data->ir_data = ir_data;
 
-	if (pi_data->is_guest_mode) {
+	if (pi_data) {
+		struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
+
+		pi_data->ir_data = ir_data;
+
+		WARN_ON_ONCE(!pi_data->is_guest_mode);
+
 		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
 		ir_data->ga_vector = vcpu_pi_info->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 23/62] KVM: SVM: Stop walking list of routing table entries when updating IRTE
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (21 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 22/62] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 24/62] KVM: VMX: " Sean Christopherson
                   ` (39 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Now that KVM explicitly passes the new/current GSI routing to
pi_update_irte(), simply use the provided routing entry and stop walking
the routing table to find that entry.  KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs.

I.e. this is subtly a glorified nop, as KVM allows at most one MSI per
GSI, the for-loop can only ever process one entry, and that entry is the
new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new
routing entries and irqfd when updating IRTEs" to ensure @new matches the
entry found in the routing table).

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 109 ++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 65 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 6260bf3697ba..a83769bb8123 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -843,11 +843,10 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_kernel_irq_routing_entry *new)
 {
-	struct kvm_kernel_irq_routing_entry *e;
-	struct kvm_irq_routing_table *irq_rt;
 	bool enable_remapped_mode = true;
-	bool set = !!new;
-	int idx, ret = 0;
+	struct vcpu_data vcpu_info;
+	struct vcpu_svm *svm = NULL;
+	int ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
@@ -859,72 +858,53 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	svm_ir_list_del(irqfd);
 
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, set);
+		 __func__, host_irq, guest_irq, !!new);
 
-	idx = srcu_read_lock(&kvm->irq_srcu);
-	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+	/**
+	 * Here, we setup with legacy mode in the following cases:
+	 * 1. When cannot target interrupt to a specific vcpu.
+	 * 2. Unsetting posted interrupt.
+	 * 3. APIC virtualization is disabled for the vcpu.
+	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
+	 */
+	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
+	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
+	    kvm_vcpu_apicv_active(&svm->vcpu)) {
+		struct amd_iommu_pi_data pi;
 
-	if (guest_irq >= irq_rt->nr_rt_entries ||
-		hlist_empty(&irq_rt->map[guest_irq])) {
-		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
-			     guest_irq, irq_rt->nr_rt_entries);
-		goto out;
-	}
+		enable_remapped_mode = false;
 
-	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
-		struct vcpu_data vcpu_info;
-		struct vcpu_svm *svm = NULL;
-
-		if (e->type != KVM_IRQ_ROUTING_MSI)
-			continue;
-
-		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
+		/*
+		 * Try to enable guest_mode in IRTE.  Note, the address
+		 * of the vCPU's AVIC backing page is passed to the
+		 * IOMMU via vcpu_info->pi_desc_addr.
+		 */
+		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+					     svm->vcpu.vcpu_id);
+		pi.is_guest_mode = true;
+		pi.vcpu_data = &vcpu_info;
+		ret = irq_set_vcpu_affinity(host_irq, &pi);
 
 		/**
-		 * Here, we setup with legacy mode in the following cases:
-		 * 1. When cannot target interrupt to a specific vcpu.
-		 * 2. Unsetting posted interrupt.
-		 * 3. APIC virtualization is disabled for the vcpu.
-		 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
+		 * Here, we successfully setting up vcpu affinity in
+		 * IOMMU guest mode. Now, we need to store the posted
+		 * interrupt information in a per-vcpu ir_list so that
+		 * we can reference to them directly when we update vcpu
+		 * scheduling information in IOMMU irte.
 		 */
-		if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
-		    kvm_vcpu_apicv_active(&svm->vcpu)) {
-			struct amd_iommu_pi_data pi;
-
-			enable_remapped_mode = false;
-
-			/*
-			 * Try to enable guest_mode in IRTE.  Note, the address
-			 * of the vCPU's AVIC backing page is passed to the
-			 * IOMMU via vcpu_info->pi_desc_addr.
-			 */
-			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-						     svm->vcpu.vcpu_id);
-			pi.is_guest_mode = true;
-			pi.vcpu_data = &vcpu_info;
-			ret = irq_set_vcpu_affinity(host_irq, &pi);
-
-			/**
-			 * Here, we successfully setting up vcpu affinity in
-			 * IOMMU guest mode. Now, we need to store the posted
-			 * interrupt information in a per-vcpu ir_list so that
-			 * we can reference to them directly when we update vcpu
-			 * scheduling information in IOMMU irte.
-			 */
-			if (!ret && pi.is_guest_mode)
-				svm_ir_list_add(svm, irqfd, &pi);
-		}
-
-		if (!ret && svm) {
-			trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
-						 e->gsi, vcpu_info.vector,
-						 vcpu_info.pi_desc_addr, set);
-		}
-
-		if (ret < 0) {
-			pr_err("%s: failed to update PI IRTE\n", __func__);
-			goto out;
-		}
+		if (!ret)
+			ret = svm_ir_list_add(svm, irqfd, &pi);
+	}
+
+	if (!ret && svm) {
+		trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
+					 guest_irq, vcpu_info.vector,
+					 vcpu_info.pi_desc_addr, !!new);
+	}
+
+	if (ret < 0) {
+		pr_err("%s: failed to update PI IRTE\n", __func__);
+		goto out;
 	}
 
 	if (enable_remapped_mode)
@@ -932,7 +912,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	else
 		ret = 0;
 out:
-	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return ret;
 }
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 24/62] KVM: VMX: Stop walking list of routing table entries when updating IRTE
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (22 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 23/62] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 25/62] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
                   ` (38 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Now that KVM provides the to-be-updated routing entry, stop walking the
routing table to find that entry.  KVM, via setup_routing_entry() and
sanity checked by kvm_get_msi_route(), disallows having a GSI configured
to trigger multiple MSIs, i.e. the for-loop can only process one entry.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 100 +++++++++++----------------------
 1 file changed, 33 insertions(+), 67 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index d4826a6b674f..e59eae11f476 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -302,78 +302,44 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
 		       struct kvm_kernel_irq_routing_entry *new)
 {
-	struct kvm_kernel_irq_routing_entry *e;
-	struct kvm_irq_routing_table *irq_rt;
-	bool enable_remapped_mode = true;
 	struct kvm_lapic_irq irq;
 	struct kvm_vcpu *vcpu;
 	struct vcpu_data vcpu_info;
-	bool set = !!new;
-	int idx, ret = 0;
 
 	if (!vmx_can_use_vtd_pi(kvm))
 		return 0;
 
-	idx = srcu_read_lock(&kvm->irq_srcu);
-	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
-	if (guest_irq >= irq_rt->nr_rt_entries ||
-	    hlist_empty(&irq_rt->map[guest_irq])) {
-		pr_warn_once("no route for guest_irq %u/%u (broken user space?)\n",
-			     guest_irq, irq_rt->nr_rt_entries);
-		goto out;
-	}
-
-	hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
-		if (e->type != KVM_IRQ_ROUTING_MSI)
-			continue;
-
-		WARN_ON_ONCE(new && memcmp(e, new, sizeof(*new)));
-
-		/*
-		 * VT-d PI cannot support posting multicast/broadcast
-		 * interrupts to a vCPU, we still use interrupt remapping
-		 * for these kind of interrupts.
-		 *
-		 * For lowest-priority interrupts, we only support
-		 * those with single CPU as the destination, e.g. user
-		 * configures the interrupts via /proc/irq or uses
-		 * irqbalance to make the interrupts single-CPU.
-		 *
-		 * We will support full lowest-priority interrupt later.
-		 *
-		 * In addition, we can only inject generic interrupts using
-		 * the PI mechanism, refuse to route others through it.
-		 */
-
-		kvm_set_msi_irq(kvm, e, &irq);
-		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
-		    !kvm_irq_is_postable(&irq))
-			continue;
-
-		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
-		vcpu_info.vector = irq.vector;
-
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
-				vcpu_info.vector, vcpu_info.pi_desc_addr, set);
-
-		if (!set)
-			continue;
-
-		enable_remapped_mode = false;
-
-		ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
-		if (ret < 0) {
-			printk(KERN_INFO "%s: failed to update PI IRTE\n",
-					__func__);
-			goto out;
-		}
-	}
-
-	if (enable_remapped_mode)
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
-
-	ret = 0;
-out:
-	srcu_read_unlock(&kvm->irq_srcu, idx);
-	return ret;
+	/*
+	 * VT-d PI cannot support posting multicast/broadcast
+	 * interrupts to a vCPU, we still use interrupt remapping
+	 * for these kind of interrupts.
+	 *
+	 * For lowest-priority interrupts, we only support
+	 * those with single CPU as the destination, e.g. user
+	 * configures the interrupts via /proc/irq or uses
+	 * irqbalance to make the interrupts single-CPU.
+	 *
+	 * We will support full lowest-priority interrupt later.
+	 *
+	 * In addition, we can only inject generic interrupts using
+	 * the PI mechanism, refuse to route others through it.
+	 */
+	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
+		goto do_remapping;
+
+	kvm_set_msi_irq(kvm, new, &irq);
+
+	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+	    !kvm_irq_is_postable(&irq))
+		goto do_remapping;
+
+	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
+	vcpu_info.vector = irq.vector;
+
+	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
+
+	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+do_remapping:
+	return irq_set_vcpu_affinity(host_irq, NULL);
 }
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 25/62] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (23 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 24/62] KVM: VMX: " Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 26/62] KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c Sean Christopherson
                   ` (37 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX.
The only SVM specific information it provides is the AVIC back page, and
that can be trivially retrieved by its sole caller.

No functional change intended.

Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a83769bb8123..3bbd565dcd0f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -816,14 +816,14 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
  */
 static int
 get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		 struct vcpu_data *vcpu_info, struct vcpu_svm **svm)
+		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
 {
 	struct kvm_lapic_irq irq;
-	struct kvm_vcpu *vcpu = NULL;
+	*vcpu = NULL;
 
 	kvm_set_msi_irq(kvm, e, &irq);
 
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
 	    !kvm_irq_is_postable(&irq)) {
 		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
 			 __func__, irq.vector);
@@ -832,8 +832,6 @@ get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 
 	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
 		 irq.vector);
-	*svm = to_svm(vcpu);
-	vcpu_info->pi_desc_addr = avic_get_backing_page_address(*svm);
 	vcpu_info->vector = irq.vector;
 
 	return 0;
@@ -845,7 +843,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 {
 	bool enable_remapped_mode = true;
 	struct vcpu_data vcpu_info;
-	struct vcpu_svm *svm = NULL;
+	struct kvm_vcpu *vcpu = NULL;
 	int ret = 0;
 
 	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
@@ -868,19 +866,20 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
 	 */
 	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
-	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &svm) &&
-	    kvm_vcpu_apicv_active(&svm->vcpu)) {
+	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
+	    kvm_vcpu_apicv_active(vcpu)) {
 		struct amd_iommu_pi_data pi;
 
 		enable_remapped_mode = false;
 
+		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
+
 		/*
 		 * Try to enable guest_mode in IRTE.  Note, the address
 		 * of the vCPU's AVIC backing page is passed to the
 		 * IOMMU via vcpu_info->pi_desc_addr.
 		 */
-		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-					     svm->vcpu.vcpu_id);
+		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
 		pi.is_guest_mode = true;
 		pi.vcpu_data = &vcpu_info;
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
@@ -893,11 +892,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * scheduling information in IOMMU irte.
 		 */
 		if (!ret)
-			ret = svm_ir_list_add(svm, irqfd, &pi);
+			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
 	}
 
-	if (!ret && svm) {
-		trace_kvm_pi_irte_update(host_irq, svm->vcpu.vcpu_id,
+	if (!ret && vcpu) {
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
 					 guest_irq, vcpu_info.vector,
 					 vcpu_info.pi_desc_addr, !!new);
 	}
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 26/62] KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (24 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 25/62] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 27/62] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
                   ` (36 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Move a bunch of IRQ routing and delivery APIs from x86.c to irq.c.  x86.c
has grown quite fat, and irq.c is the perfect landing spot.

Opportunistically rewrite kvm_arch_irq_bypass_del_producer()'s comment, as
the existing comment has several typos and is rather confusing.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c | 87 ---------------------------------------------
 2 files changed, 88 insertions(+), 87 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index a0b1499baf6e..b9d3ec72037c 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -11,6 +11,7 @@
 
 #include <linux/export.h>
 #include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
 
 #include "hyperv.h"
 #include "ioapic.h"
@@ -332,6 +333,18 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
 	return -EWOULDBLOCK;
 }
 
+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
+			bool line_status)
+{
+	if (!irqchip_in_kernel(kvm))
+		return -ENXIO;
+
+	irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
+					irq_event->irq, irq_event->level,
+					line_status);
+	return 0;
+}
+
 bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
 {
 	return irqchip_in_kernel(kvm);
@@ -495,6 +508,81 @@ void kvm_arch_irq_routing_update(struct kvm *kvm)
 		kvm_make_scan_ioapic_request(kvm);
 }
 
+int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
+				      struct irq_bypass_producer *prod)
+{
+	struct kvm_kernel_irqfd *irqfd =
+		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
+	int ret = 0;
+
+	kvm_arch_start_assignment(irqfd->kvm);
+
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = prod;
+
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, &irqfd->irq_entry);
+		if (ret)
+			kvm_arch_end_assignment(irqfd->kvm);
+	}
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+	return ret;
+}
+
+void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
+				      struct irq_bypass_producer *prod)
+{
+	struct kvm_kernel_irqfd *irqfd =
+		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
+	int ret;
+
+	WARN_ON(irqfd->producer != prod);
+
+	/*
+	 * If the producer of an IRQ that is currently being posted to a vCPU
+	 * is unregistered, change the associated IRTE back to remapped mode as
+	 * the IRQ has been released (or repurposed) by the device driver, i.e.
+	 * KVM must relinquish control of the IRTE.
+	 */
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = NULL;
+
+	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
+						   irqfd->gsi, NULL);
+		if (ret)
+			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
+				irqfd->consumer.token, ret);
+	}
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+
+	kvm_arch_end_assignment(irqfd->kvm);
+}
+
+int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				  struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new)
+{
+	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
+					    irqfd->gsi, new);
+}
+
+bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
+				  struct kvm_kernel_irq_routing_entry *new)
+{
+	if (old->type != KVM_IRQ_ROUTING_MSI ||
+	    new->type != KVM_IRQ_ROUTING_MSI)
+		return true;
+
+	return !!memcmp(&old->msi, &new->msi, sizeof(new->msi));
+}
+
 #ifdef CONFIG_KVM_IOAPIC
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f1b0dbce9a8b..5ba5e28532de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6420,18 +6420,6 @@ void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 		kvm_vcpu_kick(vcpu);
 }
 
-int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
-			bool line_status)
-{
-	if (!irqchip_in_kernel(kvm))
-		return -ENXIO;
-
-	irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
-					irq_event->irq, irq_event->level,
-					line_status);
-	return 0;
-}
-
 int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			    struct kvm_enable_cap *cap)
 {
@@ -13504,81 +13492,6 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
-int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
-				      struct irq_bypass_producer *prod)
-{
-	struct kvm_kernel_irqfd *irqfd =
-		container_of(cons, struct kvm_kernel_irqfd, consumer);
-	struct kvm *kvm = irqfd->kvm;
-	int ret = 0;
-
-	kvm_arch_start_assignment(irqfd->kvm);
-
-	spin_lock_irq(&kvm->irqfds.lock);
-	irqfd->producer = prod;
-
-	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, &irqfd->irq_entry);
-		if (ret)
-			kvm_arch_end_assignment(irqfd->kvm);
-	}
-	spin_unlock_irq(&kvm->irqfds.lock);
-
-	return ret;
-}
-
-void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
-				      struct irq_bypass_producer *prod)
-{
-	struct kvm_kernel_irqfd *irqfd =
-		container_of(cons, struct kvm_kernel_irqfd, consumer);
-	struct kvm *kvm = irqfd->kvm;
-	int ret;
-
-	WARN_ON(irqfd->producer != prod);
-
-	/*
-	 * When producer of consumer is unregistered, we change back to
-	 * remapped mode, so we can re-use the current implementation
-	 * when the irq is masked/disabled or the consumer side (KVM
-	 * int this case doesn't want to receive the interrupts.
-	*/
-	spin_lock_irq(&kvm->irqfds.lock);
-	irqfd->producer = NULL;
-
-	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, NULL);
-		if (ret)
-			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
-				irqfd->consumer.token, ret);
-	}
-
-	spin_unlock_irq(&kvm->irqfds.lock);
-
-
-	kvm_arch_end_assignment(irqfd->kvm);
-}
-
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
-{
-	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
-					    irqfd->gsi, new);
-}
-
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
-{
-	if (old->type != KVM_IRQ_ROUTING_MSI ||
-	    new->type != KVM_IRQ_ROUTING_MSI)
-		return true;
-
-	return !!memcmp(&old->msi, &new->msi, sizeof(new->msi));
-}
-
 bool kvm_vector_hashing_enabled(void)
 {
 	return vector_hashing;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 27/62] KVM: x86: Nullify irqfd->producer after updating IRTEs
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (25 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 26/62] KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 28/62] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
                   ` (35 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Nullify irqfd->producer (when it's going away) _after_ updating IRTEs so
that the producer can be queried during the update.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index b9d3ec72037c..09f7a5cdca7d 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -549,7 +549,6 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	 * KVM must relinquish control of the IRTE.
 	 */
 	spin_lock_irq(&kvm->irqfds.lock);
-	irqfd->producer = NULL;
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
 		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
@@ -558,10 +557,10 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
 				irqfd->consumer.token, ret);
 	}
+	irqfd->producer = NULL;
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 
-
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 28/62] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (26 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 27/62] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 29/62] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
                   ` (34 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Hoist the logic for identifying the target vCPU for a posted interrupt
into common x86.  The code is functionally identical between Intel and
AMD.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/irq.c              | 45 +++++++++++++++---
 arch/x86/kvm/svm/avic.c         | 82 ++++++++-------------------------
 arch/x86/kvm/svm/svm.h          |  2 +-
 arch/x86/kvm/vmx/posted_intr.c  | 55 ++++++----------------
 arch/x86/kvm/vmx/posted_intr.h  |  2 +-
 6 files changed, 75 insertions(+), 113 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cba82d7a701d..c722adfedd96 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1854,7 +1854,7 @@ struct kvm_x86_ops {
 
 	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			      unsigned int host_irq, uint32_t guest_irq,
-			      struct kvm_kernel_irq_routing_entry *new);
+			      struct kvm_vcpu *vcpu, u32 vector);
 	void (*pi_start_assignment)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 09f7a5cdca7d..5948aba9fdc0 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -508,6 +508,42 @@ void kvm_arch_irq_routing_update(struct kvm *kvm)
 		kvm_make_scan_ioapic_request(kvm);
 }
 
+static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
+			      struct kvm_kernel_irq_routing_entry *entry)
+{
+	struct kvm *kvm = irqfd->kvm;
+	struct kvm_vcpu *vcpu = NULL;
+	struct kvm_lapic_irq irq;
+
+	if (!irqchip_in_kernel(kvm) ||
+	    !kvm_arch_has_irq_bypass() ||
+	    !kvm_arch_has_assigned_device(kvm))
+		return 0;
+
+	if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
+		kvm_set_msi_irq(kvm, entry, &irq);
+
+		/*
+		 * Force remapped mode if hardware doesn't support posting the
+		 * virtual interrupt to a vCPU.  Only IRQs are postable (NMIs,
+		 * SMIs, etc. are not), and neither AMD nor Intel IOMMUs support
+		 * posting multicast/broadcast IRQs.  If the interrupt can't be
+		 * posted, the device MSI needs to be routed to the host so that
+		 * the guest's desired interrupt can be synthesized by KVM.
+		 *
+		 * This means that KVM can only post lowest-priority interrupts
+		 * if they have a single CPU as the destination, e.g. only if
+		 * the guest has affined the interrupt to a single vCPU.
+		 */
+		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+		    !kvm_irq_is_postable(&irq))
+			vcpu = NULL;
+	}
+
+	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
+					    irqfd->gsi, vcpu, irq.vector);
+}
+
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 				      struct irq_bypass_producer *prod)
 {
@@ -522,8 +558,7 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	irqfd->producer = prod;
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, &irqfd->irq_entry);
+		ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry);
 		if (ret)
 			kvm_arch_end_assignment(irqfd->kvm);
 	}
@@ -551,8 +586,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
-		ret = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, prod->irq,
-						   irqfd->gsi, NULL);
+		ret = kvm_pi_update_irte(irqfd, NULL);
 		if (ret)
 			pr_info("irq bypass consumer (token %p) unregistration fails: %d\n",
 				irqfd->consumer.token, ret);
@@ -568,8 +602,7 @@ int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				  struct kvm_kernel_irq_routing_entry *old,
 				  struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
-					    irqfd->gsi, new);
+	return kvm_pi_update_irte(irqfd, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 3bbd565dcd0f..14a1544af192 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -803,52 +803,12 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 	return 0;
 }
 
-/*
- * Note:
- * The HW cannot support posting multicast/broadcast
- * interrupts to a vCPU. So, we still use legacy interrupt
- * remapping for these kind of interrupts.
- *
- * For lowest-priority interrupts, we only support
- * those with single CPU as the destination, e.g. user
- * configures the interrupts via /proc/irq or uses
- * irqbalance to make the interrupts single-CPU.
- */
-static int
-get_pi_vcpu_info(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		 struct vcpu_data *vcpu_info, struct kvm_vcpu **vcpu)
-{
-	struct kvm_lapic_irq irq;
-	*vcpu = NULL;
-
-	kvm_set_msi_irq(kvm, e, &irq);
-
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, vcpu) ||
-	    !kvm_irq_is_postable(&irq)) {
-		pr_debug("SVM: %s: use legacy intr remap mode for irq %u\n",
-			 __func__, irq.vector);
-		return -1;
-	}
-
-	pr_debug("SVM: %s: use GA mode for irq %u\n", __func__,
-		 irq.vector);
-	vcpu_info->vector = irq.vector;
-
-	return 0;
-}
-
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
-			struct kvm_kernel_irq_routing_entry *new)
+			struct kvm_vcpu *vcpu, u32 vector)
 {
-	bool enable_remapped_mode = true;
-	struct vcpu_data vcpu_info;
-	struct kvm_vcpu *vcpu = NULL;
 	int ret = 0;
 
-	if (!kvm_arch_has_assigned_device(kvm) || !kvm_arch_has_irq_bypass())
-		return 0;
-
 	/*
 	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
 	 * from the *previous* vCPU's list.
@@ -856,7 +816,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	svm_ir_list_del(irqfd);
 
 	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, !!new);
+		 __func__, host_irq, guest_irq, !!vcpu);
 
 	/**
 	 * Here, we setup with legacy mode in the following cases:
@@ -865,23 +825,23 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 * 3. APIC virtualization is disabled for the vcpu.
 	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
 	 */
-	if (new && new->type == KVM_IRQ_ROUTING_MSI &&
-	    !get_pi_vcpu_info(kvm, new, &vcpu_info, &vcpu) &&
-	    kvm_vcpu_apicv_active(vcpu)) {
-		struct amd_iommu_pi_data pi;
-
-		enable_remapped_mode = false;
-
-		vcpu_info.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu));
-
+	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
 		/*
 		 * Try to enable guest_mode in IRTE.  Note, the address
 		 * of the vCPU's AVIC backing page is passed to the
 		 * IOMMU via vcpu_info->pi_desc_addr.
 		 */
-		pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id);
-		pi.is_guest_mode = true;
-		pi.vcpu_data = &vcpu_info;
+		struct vcpu_data vcpu_info = {
+			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
+			.vector = vector,
+		};
+
+		struct amd_iommu_pi_data pi = {
+			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
+			.is_guest_mode = true,
+			.vcpu_data = &vcpu_info,
+		};
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
 
 		/**
@@ -893,12 +853,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		if (!ret)
 			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-	}
 
-	if (!ret && vcpu) {
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id,
-					 guest_irq, vcpu_info.vector,
-					 vcpu_info.pi_desc_addr, !!new);
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+					 vector, vcpu_info.pi_desc_addr, true);
+	} else {
+		ret = irq_set_vcpu_affinity(host_irq, NULL);
 	}
 
 	if (ret < 0) {
@@ -906,10 +865,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		goto out;
 	}
 
-	if (enable_remapped_mode)
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
-	else
-		ret = 0;
+	ret = 0;
 out:
 	return ret;
 }
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 939ff0e35a2b..b5cd1927b009 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -747,7 +747,7 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
-			struct kvm_kernel_irq_routing_entry *new);
+			struct kvm_vcpu *vcpu, u32 vector);
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index e59eae11f476..3de767c5d6b2 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -300,46 +300,19 @@ void vmx_pi_start_assignment(struct kvm *kvm)
 
 int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
-		       struct kvm_kernel_irq_routing_entry *new)
+		       struct kvm_vcpu *vcpu, u32 vector)
 {
-	struct kvm_lapic_irq irq;
-	struct kvm_vcpu *vcpu;
-	struct vcpu_data vcpu_info;
-
-	if (!vmx_can_use_vtd_pi(kvm))
-		return 0;
-
-	/*
-	 * VT-d PI cannot support posting multicast/broadcast
-	 * interrupts to a vCPU, we still use interrupt remapping
-	 * for these kind of interrupts.
-	 *
-	 * For lowest-priority interrupts, we only support
-	 * those with single CPU as the destination, e.g. user
-	 * configures the interrupts via /proc/irq or uses
-	 * irqbalance to make the interrupts single-CPU.
-	 *
-	 * We will support full lowest-priority interrupt later.
-	 *
-	 * In addition, we can only inject generic interrupts using
-	 * the PI mechanism, refuse to route others through it.
-	 */
-	if (!new || new->type != KVM_IRQ_ROUTING_MSI)
-		goto do_remapping;
-
-	kvm_set_msi_irq(kvm, new, &irq);
-
-	if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
-	    !kvm_irq_is_postable(&irq))
-		goto do_remapping;
-
-	vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
-	vcpu_info.vector = irq.vector;
-
-	trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-				 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
-
-	return irq_set_vcpu_affinity(host_irq, &vcpu_info);
-do_remapping:
-	return irq_set_vcpu_affinity(host_irq, NULL);
+	if (vcpu) {
+		struct vcpu_data vcpu_info = {
+			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
+			.vector = vector,
+		};
+
+		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
+					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
+
+		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+	} else {
+		return irq_set_vcpu_affinity(host_irq, NULL);
+	}
 }
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index a94afcb55f7f..94ed66ea6249 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -16,7 +16,7 @@ void pi_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
-		       struct kvm_kernel_irq_routing_entry *new);
+		       struct kvm_vcpu *vcpu, u32 vector);
 void vmx_pi_start_assignment(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 29/62] KVM: x86: Move posted interrupt tracepoint to common code
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (27 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 28/62] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 30/62] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
                   ` (33 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Move the pi_irte_update tracepoint to common x86, and call it whenever the
IRTE is modified.  Tracing only the modifications that result in an IRQ
being posted to a vCPU makes the tracepoint useless for debugging.

Drop the vendor specific address; plumbing that into common code isn't
worth the trouble, as the address is meaningless without a whole pile of
other information that isn't provided in any tracepoint.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c             | 12 ++++++++++--
 arch/x86/kvm/svm/avic.c        |  6 ------
 arch/x86/kvm/trace.h           | 19 +++++++------------
 arch/x86/kvm/vmx/posted_intr.c |  3 ---
 arch/x86/kvm/x86.c             |  1 -
 5 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 5948aba9fdc0..ec13d1d7706d 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -511,9 +511,11 @@ void kvm_arch_irq_routing_update(struct kvm *kvm)
 static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			      struct kvm_kernel_irq_routing_entry *entry)
 {
+	unsigned int host_irq = irqfd->producer->irq;
 	struct kvm *kvm = irqfd->kvm;
 	struct kvm_vcpu *vcpu = NULL;
 	struct kvm_lapic_irq irq;
+	int r;
 
 	if (!irqchip_in_kernel(kvm) ||
 	    !kvm_arch_has_irq_bypass() ||
@@ -540,8 +542,13 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			vcpu = NULL;
 	}
 
-	return kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, irqfd->producer->irq,
-					    irqfd->gsi, vcpu, irq.vector);
+	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
+					 vcpu, irq.vector);
+	if (r)
+		return r;
+
+	trace_kvm_pi_irte_update(host_irq, vcpu, irqfd->gsi, irq.vector, !!vcpu);
+	return 0;
 }
 
 int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
@@ -595,6 +602,7 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 14a1544af192..d8d50b8f14bb 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -815,9 +815,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	svm_ir_list_del(irqfd);
 
-	pr_debug("SVM: %s: host_irq=%#x, guest_irq=%#x, set=%#x\n",
-		 __func__, host_irq, guest_irq, !!vcpu);
-
 	/**
 	 * Here, we setup with legacy mode in the following cases:
 	 * 1. When cannot target interrupt to a specific vcpu.
@@ -853,9 +850,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		if (!ret)
 			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-					 vector, vcpu_info.pi_desc_addr, true);
 	} else {
 		ret = irq_set_vcpu_affinity(host_irq, NULL);
 	}
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index ababdba2c186..57d79fd31df0 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1176,37 +1176,32 @@ TRACE_EVENT(kvm_smm_transition,
  * Tracepoint for VT-d posted-interrupts and AMD-Vi Guest Virtual APIC.
  */
 TRACE_EVENT(kvm_pi_irte_update,
-	TP_PROTO(unsigned int host_irq, unsigned int vcpu_id,
-		 unsigned int gsi, unsigned int gvec,
-		 u64 pi_desc_addr, bool set),
-	TP_ARGS(host_irq, vcpu_id, gsi, gvec, pi_desc_addr, set),
+	TP_PROTO(unsigned int host_irq, struct kvm_vcpu *vcpu,
+		 unsigned int gsi, unsigned int gvec, bool set),
+	TP_ARGS(host_irq, vcpu, gsi, gvec, set),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	host_irq	)
-		__field(	unsigned int,	vcpu_id		)
+		__field(	int,		vcpu_id		)
 		__field(	unsigned int,	gsi		)
 		__field(	unsigned int,	gvec		)
-		__field(	u64,		pi_desc_addr	)
 		__field(	bool,		set		)
 	),
 
 	TP_fast_assign(
 		__entry->host_irq	= host_irq;
-		__entry->vcpu_id	= vcpu_id;
+		__entry->vcpu_id	= vcpu ? vcpu->vcpu_id : -1;
 		__entry->gsi		= gsi;
 		__entry->gvec		= gvec;
-		__entry->pi_desc_addr	= pi_desc_addr;
 		__entry->set		= set;
 	),
 
-	TP_printk("PI is %s for irq %u, vcpu %u, gsi: 0x%x, "
-		  "gvec: 0x%x, pi_desc_addr: 0x%llx",
+	TP_printk("PI is %s for irq %u, vcpu %d, gsi: 0x%x, gvec: 0x%x",
 		  __entry->set ? "enabled and being updated" : "disabled",
 		  __entry->host_irq,
 		  __entry->vcpu_id,
 		  __entry->gsi,
-		  __entry->gvec,
-		  __entry->pi_desc_addr)
+		  __entry->gvec)
 );
 
 /*
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 3de767c5d6b2..687ffde3b61c 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -308,9 +308,6 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.vector = vector,
 		};
 
-		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, guest_irq,
-					 vcpu_info.vector, vcpu_info.pi_desc_addr, true);
-
 		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
 	} else {
 		return irq_set_vcpu_affinity(host_irq, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ba5e28532de..d3571c484790 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13891,7 +13891,6 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
-EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 30/62] KVM: SVM: Clean up return handling in avic_pi_update_irte()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (28 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 29/62] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 31/62] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
                   ` (32 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Clean up the return paths for avic_pi_update_irte() now that the
refactoring dust has settled.

Opportunistically drop the pr_err() on IRTE update failures.  Logging that
a failure occurred without _any_ context is quite useless.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d8d50b8f14bb..a0f3cdd2ea3f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -807,8 +807,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_vcpu *vcpu, u32 vector)
 {
-	int ret = 0;
-
 	/*
 	 * If the IRQ was affined to a different vCPU, remove the IRTE metadata
 	 * from the *previous* vCPU's list.
@@ -838,8 +836,11 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.is_guest_mode = true,
 			.vcpu_data = &vcpu_info,
 		};
+		int ret;
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi);
+		if (ret)
+			return ret;
 
 		/**
 		 * Here, we successfully setting up vcpu affinity in
@@ -848,20 +849,9 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		if (!ret)
-			ret = svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
-	} else {
-		ret = irq_set_vcpu_affinity(host_irq, NULL);
+		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
 	}
-
-	if (ret < 0) {
-		pr_err("%s: failed to update PI IRTE\n", __func__);
-		goto out;
-	}
-
-	ret = 0;
-out:
-	return ret;
+	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
 static inline int
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 31/62] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (29 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 30/62] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 32/62] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
                   ` (31 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Split the vcpu_data structure that serves as a handoff from KVM to IOMMU
drivers into vendor specific structures.  Overloading a single structure
makes the code hard to read and maintain, is *very* misleading as it
suggests that mixing vendors is actually supported, and bastardizing
Intel's posted interrupt descriptor address when AMD's IOMMU already has
its own structure is quite unnecessary.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h | 15 ++++++++++++++-
 arch/x86/kvm/svm/avic.c              | 21 ++++++++-------------
 arch/x86/kvm/vmx/posted_intr.c       |  4 ++--
 drivers/iommu/amd/iommu.c            | 12 ++++--------
 drivers/iommu/intel/irq_remapping.c  | 10 +++++-----
 include/linux/amd-iommu.h            | 12 ------------
 6 files changed, 33 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 5036f13ab69f..2dbc9cb61c2f 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -26,7 +26,20 @@ enum {
 	IRQ_REMAP_X2APIC_MODE,
 };
 
-struct vcpu_data {
+/*
+ * This is mainly used to communicate information back-and-forth
+ * between SVM and IOMMU for setting up and tearing down posted
+ * interrupt
+ */
+struct amd_iommu_pi_data {
+	u64 vapic_addr;		/* Physical address of the vCPU's vAPIC. */
+	u32 ga_tag;
+	u32 vector;		/* Guest vector of the interrupt */
+	bool is_guest_mode;
+	void *ir_data;
+};
+
+struct intel_iommu_pi_data {
 	u64 pi_desc_addr;	/* Physical address of PI Descriptor */
 	u32 vector;		/* Guest vector of the interrupt */
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a0f3cdd2ea3f..6085a629c5e6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -822,23 +822,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
 		/*
-		 * Try to enable guest_mode in IRTE.  Note, the address
-		 * of the vCPU's AVIC backing page is passed to the
-		 * IOMMU via vcpu_info->pi_desc_addr.
+		 * Try to enable guest_mode in IRTE.
 		 */
-		struct vcpu_data vcpu_info = {
-			.pi_desc_addr = avic_get_backing_page_address(to_svm(vcpu)),
-			.vector = vector,
-		};
-
-		struct amd_iommu_pi_data pi = {
-			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id, vcpu->vcpu_id),
+		struct amd_iommu_pi_data pi_data = {
+			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+					     vcpu->vcpu_id),
 			.is_guest_mode = true,
-			.vcpu_data = &vcpu_info,
+			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
+			.vector = vector,
 		};
 		int ret;
 
-		ret = irq_set_vcpu_affinity(host_irq, &pi);
+		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
 
@@ -849,7 +844,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi);
+		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 687ffde3b61c..3a23c30f73cb 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -303,12 +303,12 @@ int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       struct kvm_vcpu *vcpu, u32 vector)
 {
 	if (vcpu) {
-		struct vcpu_data vcpu_info = {
+		struct intel_iommu_pi_data pi_data = {
 			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
 			.vector = vector,
 		};
 
-		return irq_set_vcpu_affinity(host_irq, &vcpu_info);
+		return irq_set_vcpu_affinity(host_irq, &pi_data);
 	} else {
 		return irq_set_vcpu_affinity(host_irq, NULL);
 	}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5141507587e1..36749efcc781 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3860,10 +3860,10 @@ int amd_iommu_deactivate_guest_mode(void *data)
 }
 EXPORT_SYMBOL(amd_iommu_deactivate_guest_mode);
 
-static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
+static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 {
 	int ret;
-	struct amd_iommu_pi_data *pi_data = vcpu_info;
+	struct amd_iommu_pi_data *pi_data = info;
 	struct amd_ir_data *ir_data = data->chip_data;
 	struct irq_2_irte *irte_info = &ir_data->irq_2_irte;
 	struct iommu_dev_data *dev_data;
@@ -3886,14 +3886,10 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
 	ir_data->cfg = irqd_cfg(data);
 
 	if (pi_data) {
-		struct vcpu_data *vcpu_pi_info = pi_data->vcpu_data;
-
 		pi_data->ir_data = ir_data;
 
-		WARN_ON_ONCE(!pi_data->is_guest_mode);
-
-		ir_data->ga_root_ptr = (vcpu_pi_info->pi_desc_addr >> 12);
-		ir_data->ga_vector = vcpu_pi_info->vector;
+		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
+		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		ret = amd_iommu_activate_guest_mode(ir_data);
 	} else {
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 3bc2a03cceca..6165bb919520 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1244,10 +1244,10 @@ static void intel_ir_compose_msi_msg(struct irq_data *irq_data,
 static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 {
 	struct intel_ir_data *ir_data = data->chip_data;
-	struct vcpu_data *vcpu_pi_info = info;
+	struct intel_iommu_pi_data *pi_data = info;
 
 	/* stop posting interrupts, back to the default mode */
-	if (!vcpu_pi_info) {
+	if (!pi_data) {
 		__intel_ir_reconfigure_irte(data, true);
 	} else {
 		struct irte irte_pi;
@@ -1265,10 +1265,10 @@ static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		/* Update the posted mode fields */
 		irte_pi.p_pst = 1;
 		irte_pi.p_urgent = 0;
-		irte_pi.p_vector = vcpu_pi_info->vector;
-		irte_pi.pda_l = (vcpu_pi_info->pi_desc_addr >>
+		irte_pi.p_vector = pi_data->vector;
+		irte_pi.pda_l = (pi_data->pi_desc_addr >>
 				(32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
-		irte_pi.pda_h = (vcpu_pi_info->pi_desc_addr >> 32) &
+		irte_pi.pda_h = (pi_data->pi_desc_addr >> 32) &
 				~(-1UL << PDA_HIGH_BIT);
 
 		ir_data->irq_2_iommu.posted_vcpu = true;
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index deeefc92a5cf..99b4fa9a0296 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -12,18 +12,6 @@
 
 struct amd_iommu;
 
-/*
- * This is mainly used to communicate information back-and-forth
- * between SVM and IOMMU for setting up and tearing down posted
- * interrupt
- */
-struct amd_iommu_pi_data {
-	u32 ga_tag;
-	bool is_guest_mode;
-	struct vcpu_data *vcpu_data;
-	void *ir_data;
-};
-
 #ifdef CONFIG_AMD_IOMMU
 
 struct task_struct;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 32/62] KVM: Don't WARN if updating IRQ bypass route fails
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (30 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 31/62] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
                   ` (30 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't bother WARNing if updating an IRTE route fails now that vendor code
provides much more precise WARNs.  The generic WARN doesn't provide enough
information to actually debug the problem, and has obviously done nothing
to surface the myriad bugs in KVM x86's implementation.

Drop all of the associated return code plumbing that existed just so that
common KVM could WARN.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c          |  6 +++---
 arch/arm64/kvm/vgic/vgic-v4.c | 11 ++++-------
 arch/x86/kvm/irq.c            |  8 ++++----
 include/kvm/arm_vgic.h        |  2 +-
 include/linux/kvm_host.h      |  6 +++---
 virt/kvm/eventfd.c            | 15 ++++++---------
 6 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a9a39e0375f7..94fb2f096a20 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2771,9 +2771,9 @@ bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
 	return memcmp(&old->msi, &new->msi, sizeof(new->msi));
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				   struct kvm_kernel_irq_routing_entry *old,
+				   struct kvm_kernel_irq_routing_entry *new)
 {
 	/*
 	 * Remapping the vLPI requires taking the its_lock mutex to resolve
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index 86e54cefc237..5e59a3f5d4c9 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -527,29 +527,26 @@ static struct vgic_irq *__vgic_host_irq_get_vlpi(struct kvm *kvm, int host_irq)
 	return NULL;
 }
 
-int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
+void kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
 {
 	struct vgic_irq *irq;
 	unsigned long flags;
-	int ret = 0;
 
 	if (!vgic_supports_direct_msis(kvm))
-		return 0;
+		return;
 
 	irq = __vgic_host_irq_get_vlpi(kvm, host_irq);
 	if (!irq)
-		return 0;
+		return;
 
 	raw_spin_lock_irqsave(&irq->irq_lock, flags);
 	WARN_ON(irq->hw && irq->host_irq != host_irq);
 	if (irq->hw) {
 		atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
 		irq->hw = false;
-		ret = its_unmap_vlpi(host_irq);
-		WARN_ON_ONCE(ret);
+		WARN_ON_ONCE(its_unmap_vlpi(host_irq));
 	}
 
 	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
 	vgic_put_irq(kvm, irq);
-	return ret;
 }
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index ec13d1d7706d..adc250ca1171 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -606,11 +606,11 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	kvm_arch_end_assignment(irqfd->kvm);
 }
 
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				   struct kvm_kernel_irq_routing_entry *old,
+				   struct kvm_kernel_irq_routing_entry *new)
 {
-	return kvm_pi_update_irte(irqfd, new);
+	kvm_pi_update_irte(irqfd, new);
 }
 
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4a34f7f0a864..b2a04481de1a 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -434,7 +434,7 @@ struct kvm_kernel_irq_routing_entry;
 int kvm_vgic_v4_set_forwarding(struct kvm *kvm, int irq,
 			       struct kvm_kernel_irq_routing_entry *irq_entry);
 
-int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq);
+void kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq);
 
 int vgic_v4_load(struct kvm_vcpu *vcpu);
 void vgic_v4_commit(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4160c1c0c6b..8e74ac0f90b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2410,9 +2410,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *,
 			   struct irq_bypass_producer *);
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
 void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
-int kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-				  struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new);
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+				   struct kvm_kernel_irq_routing_entry *old,
+				   struct kvm_kernel_irq_routing_entry *new);
 bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
 				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 85581550dc8d..a4f80fe8a5f3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -285,11 +285,11 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
 {
 }
 
-int __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
-					 struct kvm_kernel_irq_routing_entry *old,
-					 struct kvm_kernel_irq_routing_entry *new)
+void __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+					  struct kvm_kernel_irq_routing_entry *old,
+					  struct kvm_kernel_irq_routing_entry *new)
 {
-	return 0;
+
 }
 
 bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
@@ -618,11 +618,8 @@ void kvm_irq_routing_update(struct kvm *kvm)
 
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
 		if (irqfd->producer &&
-		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
-			int ret = kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
-
-			WARN_ON(ret);
-		}
+		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry))
+			kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
 #endif
 	}
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (31 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 32/62] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-13 20:50   ` Oliver Upton
  2025-06-11 22:45 ` [PATCH v3 34/62] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
                   ` (29 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing().
Calling arch code to know whether or not to call arch code is absurd.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c     | 15 +++++----------
 arch/x86/kvm/irq.c       | 15 +++++----------
 include/linux/kvm_host.h |  2 --
 virt/kvm/eventfd.c       | 10 +---------
 4 files changed, 11 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 94fb2f096a20..04f2116927b1 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2761,20 +2761,15 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	kvm_vgic_v4_unset_forwarding(irqfd->kvm, prod->irq);
 }
 
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
-{
-	if (old->type != KVM_IRQ_ROUTING_MSI ||
-	    new->type != KVM_IRQ_ROUTING_MSI)
-		return true;
-
-	return memcmp(&old->msi, &new->msi, sizeof(new->msi));
-}
-
 void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new)
 {
+	if (old->type == KVM_IRQ_ROUTING_MSI &&
+	    new->type == KVM_IRQ_ROUTING_MSI &&
+	    !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
+		return;
+
 	/*
 	 * Remapping the vLPI requires taking the its_lock mutex to resolve
 	 * the new translation. We're in spinlock land at this point, so no
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index adc250ca1171..48134aebb541 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -610,19 +610,14 @@ void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new)
 {
+	if (old->type == KVM_IRQ_ROUTING_MSI &&
+	    new->type == KVM_IRQ_ROUTING_MSI &&
+	    !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
+		return;
+
 	kvm_pi_update_irte(irqfd, new);
 }
 
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *old,
-				  struct kvm_kernel_irq_routing_entry *new)
-{
-	if (old->type != KVM_IRQ_ROUTING_MSI ||
-	    new->type != KVM_IRQ_ROUTING_MSI)
-		return true;
-
-	return !!memcmp(&old->msi, &new->msi, sizeof(new->msi));
-}
-
 #ifdef CONFIG_KVM_IOAPIC
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8e74ac0f90b1..fb9ec06aa807 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2413,8 +2413,6 @@ void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
 void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new);
-bool kvm_arch_irqfd_route_changed(struct kvm_kernel_irq_routing_entry *,
-				  struct kvm_kernel_irq_routing_entry *);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
 
 #ifdef CONFIG_HAVE_KVM_INVALID_WAKEUPS
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a4f80fe8a5f3..defc2c04d241 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -291,13 +291,6 @@ void __weak kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 {
 
 }
-
-bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
-				struct kvm_kernel_irq_routing_entry *old,
-				struct kvm_kernel_irq_routing_entry *new)
-{
-	return true;
-}
 #endif
 
 static int
@@ -617,8 +610,7 @@ void kvm_irq_routing_update(struct kvm *kvm)
 		irqfd_update(kvm, irqfd);
 
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
-		if (irqfd->producer &&
-		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry))
+		if (irqfd->producer)
 			kvm_arch_update_irqfd_routing(irqfd, &old, &irqfd->irq_entry);
 #endif
 	}
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 34/62] KVM: x86: Track irq_bypass_vcpu in common x86 code
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (32 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 35/62] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
                   ` (28 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted
IRQ, in common x86 code.  This will allow for additional consolidation of
the SVM and VMX code.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c      | 7 ++++++-
 arch/x86/kvm/svm/avic.c | 4 ----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 48134aebb541..6447ea518d01 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -544,8 +544,13 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 
 	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
 					 vcpu, irq.vector);
-	if (r)
+	if (r) {
+		WARN_ON_ONCE(irqfd->irq_bypass_vcpu && !vcpu);
+		irqfd->irq_bypass_vcpu = NULL;
 		return r;
+	}
+
+	irqfd->irq_bypass_vcpu = vcpu;
 
 	trace_kvm_pi_irte_update(host_irq, vcpu, irqfd->gsi, irq.vector, !!vcpu);
 	return 0;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 6085a629c5e6..97b747e82012 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -767,22 +767,18 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_lock_irqsave(&to_svm(vcpu)->ir_list_lock, flags);
 	list_del(&irqfd->vcpu_list);
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
-
-	irqfd->irq_bypass_vcpu = NULL;
 }
 
 static int svm_ir_list_add(struct vcpu_svm *svm,
 			   struct kvm_kernel_irqfd *irqfd,
 			   struct amd_iommu_pi_data *pi)
 {
-	struct kvm_vcpu *vcpu = &svm->vcpu;
 	unsigned long flags;
 	u64 entry;
 
 	if (WARN_ON_ONCE(!pi->ir_data))
 		return -EINVAL;
 
-	irqfd->irq_bypass_vcpu = vcpu;
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 35/62] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (33 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 34/62] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 36/62] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
                   ` (27 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't "reconfigure" an IRTE into host controlled mode when it's already in
the state, i.e. if KVM's GSI routing changes but the IRQ wasn't and still
isn't being posted to a vCPU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 6447ea518d01..43e85ebc0d5b 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -542,6 +542,9 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 			vcpu = NULL;
 	}
 
+	if (!irqfd->irq_bypass_vcpu && !vcpu)
+		return 0;
+
 	r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
 					 vcpu, irq.vector);
 	if (r) {
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 36/62] KVM: x86: Don't update IRTE entries when old and new routes were !MSI
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (34 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 35/62] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 37/62] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
                   ` (26 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Skip the entirety of IRTE updates on a GSI routing change if neither the
old nor the new routing is for an MSI, i.e. if the neither routing setup
allows for posting to a vCPU.  If the IRTE isn't already host controlled,
KVM has bigger problems.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 43e85ebc0d5b..4119c1e880e7 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -618,6 +618,10 @@ void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
 				   struct kvm_kernel_irq_routing_entry *old,
 				   struct kvm_kernel_irq_routing_entry *new)
 {
+	if (new->type != KVM_IRQ_ROUTING_MSI &&
+	    old->type != KVM_IRQ_ROUTING_MSI)
+		return;
+
 	if (old->type == KVM_IRQ_ROUTING_MSI &&
 	    new->type == KVM_IRQ_ROUTING_MSI &&
 	    !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 37/62] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (35 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 36/62] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
                   ` (25 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up
and doesn't provide the necessary metadata.  Returning an error up the
stack without actually handling the error is useless and confusing.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 97b747e82012..f1e9f0dd43e8 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -769,16 +769,13 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
 }
 
-static int svm_ir_list_add(struct vcpu_svm *svm,
-			   struct kvm_kernel_irqfd *irqfd,
-			   struct amd_iommu_pi_data *pi)
+static void svm_ir_list_add(struct vcpu_svm *svm,
+			    struct kvm_kernel_irqfd *irqfd,
+			    struct amd_iommu_pi_data *pi)
 {
 	unsigned long flags;
 	u64 entry;
 
-	if (WARN_ON_ONCE(!pi->ir_data))
-		return -EINVAL;
-
 	irqfd->irq_bypass_data = pi->ir_data;
 
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
@@ -796,7 +793,6 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
 
 	list_add(&irqfd->vcpu_list, &svm->ir_list);
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-	return 0;
 }
 
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
@@ -833,6 +829,16 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		if (ret)
 			return ret;
 
+		/*
+		 * Revert to legacy mode if the IOMMU didn't provide metadata
+		 * for the IRTE, which KVM needs to keep the IRTE up-to-date,
+		 * e.g. if the vCPU is migrated or AVIC is disabled.
+		 */
+		if (WARN_ON_ONCE(!pi_data.ir_data)) {
+			irq_set_vcpu_affinity(host_irq, NULL);
+			return -EIO;
+		}
+
 		/**
 		 * Here, we successfully setting up vcpu affinity in
 		 * IOMMU guest mode. Now, we need to store the posted
@@ -840,7 +846,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * we can reference to them directly when we update vcpu
 		 * scheduling information in IOMMU irte.
 		 */
-		return svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		return 0;
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (36 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 37/62] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-17 15:42   ` Naveen N Rao
  2025-06-11 22:45 ` [PATCH v3 39/62] iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify Sean Christopherson
                   ` (24 subsequent siblings)
  62 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Now that svm_ir_list_add() isn't overloaded with all manner of weird
things, fold it into avic_pi_update_irte(), and more importantly take
ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
that's shoved into the IRTE is fresh.  While preemption (and IRQs) is
disabled on the task performing the IRTE update, thanks to irqfds.lock,
that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
is irrelevant.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f1e9f0dd43e8..4747fb09aca4 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -769,32 +769,6 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
 }
 
-static void svm_ir_list_add(struct vcpu_svm *svm,
-			    struct kvm_kernel_irqfd *irqfd,
-			    struct amd_iommu_pi_data *pi)
-{
-	unsigned long flags;
-	u64 entry;
-
-	irqfd->irq_bypass_data = pi->ir_data;
-
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	/*
-	 * Update the target pCPU for IOMMU doorbells if the vCPU is running.
-	 * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
-	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
-	 * See also avic_vcpu_load().
-	 */
-	entry = svm->avic_physical_id_entry;
-	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-				    true, pi->ir_data);
-
-	list_add(&irqfd->vcpu_list, &svm->ir_list);
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-}
-
 int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			unsigned int host_irq, uint32_t guest_irq,
 			struct kvm_vcpu *vcpu, u32 vector)
@@ -823,8 +797,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
 		};
+		struct vcpu_svm *svm = to_svm(vcpu);
+		u64 entry;
 		int ret;
 
+		/*
+		 * Prevent the vCPU from being scheduled out or migrated until
+		 * the IRTE is updated and its metadata has been added to the
+		 * list of IRQs being posted to the vCPU, to ensure the IRTE
+		 * isn't programmed with stale pCPU/IsRunning information.
+		 */
+		guard(spinlock_irqsave)(&svm->ir_list_lock);
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
@@ -839,14 +823,19 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			return -EIO;
 		}
 
-		/**
-		 * Here, we successfully setting up vcpu affinity in
-		 * IOMMU guest mode. Now, we need to store the posted
-		 * interrupt information in a per-vcpu ir_list so that
-		 * we can reference to them directly when we update vcpu
-		 * scheduling information in IOMMU irte.
+		/*
+		 * Update the target pCPU for IOMMU doorbells if the vCPU is
+		 * running.  If the vCPU is NOT running, i.e. is blocking or
+		 * scheduled out, KVM will update the pCPU info when the vCPU
+		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
 		 */
-		svm_ir_list_add(to_svm(vcpu), irqfd, &pi_data);
+		entry = svm->avic_physical_id_entry;
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
+					    true, pi_data.ir_data);
+
+		irqfd->irq_bypass_data = pi_data.ir_data;
+		list_add(&irqfd->vcpu_list, &svm->ir_list);
 		return 0;
 	}
 	return irq_set_vcpu_affinity(host_irq, NULL);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 39/62] iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (37 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 40/62] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
                   ` (23 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Add a comment to amd_iommu_update_ga() to document what fields it can
safely modify without issuing an invalidation of the IRTE, and to explain
its role in keeping GA IRTEs up-to-date.

Per page 93 of the IOMMU spec dated Feb 2025:

  When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and
  IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag],
  are not cached by the IOMMU. Modifications to these fields do not require an
  invalidation of the Interrupt Remapping Table.

Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 36749efcc781..5adc932b947e 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3986,6 +3986,18 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 	return 0;
 }
 
+/*
+ * Update the pCPU information for an IRTE that is configured to post IRQs to
+ * a vCPU, without issuing an IOMMU invalidation for the IRTE.
+ *
+ * This API is intended to be used when a vCPU is scheduled in/out (or stops
+ * running for any reason), to do a fast update of IsRun and (conditionally)
+ * Destination.
+ *
+ * Per the IOMMU spec, the Destination, IsRun, and GATag fields are not cached
+ * and thus don't require an invalidation to ensure the IOMMU consumes fresh
+ * information.
+ */
 int amd_iommu_update_ga(int cpu, bool is_run, void *data)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 40/62] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (38 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 39/62] iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 41/62] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
                   ` (22 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running.  amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 11 +++++------
 drivers/iommu/amd/iommu.c | 14 +++++++++-----
 include/linux/amd-iommu.h |  6 ++----
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 4747fb09aca4..c79648d96752 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -832,7 +832,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		entry = svm->avic_physical_id_entry;
 		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-					    true, pi_data.ir_data);
+					    pi_data.ir_data);
 
 		irqfd->irq_bypass_data = pi_data.ir_data;
 		list_add(&irqfd->vcpu_list, &svm->ir_list);
@@ -841,8 +841,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static inline int
-avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
+static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 {
 	int ret = 0;
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -861,7 +860,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 		return 0;
 
 	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
-		ret = amd_iommu_update_ga(cpu, r, irqfd->irq_bypass_data);
+		ret = amd_iommu_update_ga(cpu, irqfd->irq_bypass_data);
 		if (ret)
 			return ret;
 	}
@@ -923,7 +922,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -963,7 +962,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+	avic_update_iommu_vcpu_affinity(vcpu, -1);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	svm->avic_physical_id_entry = entry;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5adc932b947e..bb804bbc916b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3990,15 +3990,17 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
  * Update the pCPU information for an IRTE that is configured to post IRQs to
  * a vCPU, without issuing an IOMMU invalidation for the IRTE.
  *
- * This API is intended to be used when a vCPU is scheduled in/out (or stops
- * running for any reason), to do a fast update of IsRun and (conditionally)
- * Destination.
+ * If the vCPU is associated with a pCPU (@cpu >= 0), configure the Destination
+ * with the pCPU's APIC ID and set IsRun, else clear IsRun.  I.e. treat vCPUs
+ * that are associated with a pCPU as running.  This API is intended to be used
+ * when a vCPU is scheduled in/out (or stops running for any reason), to do a
+ * fast update of IsRun and (conditionally) Destination.
  *
  * Per the IOMMU spec, the Destination, IsRun, and GATag fields are not cached
  * and thus don't require an invalidation to ensure the IOMMU consumes fresh
  * information.
  */
-int amd_iommu_update_ga(int cpu, bool is_run, void *data)
+int amd_iommu_update_ga(int cpu, void *data)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -4015,8 +4017,10 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
 					APICID_TO_IRTE_DEST_LO(cpu);
 		entry->hi.fields.destination =
 					APICID_TO_IRTE_DEST_HI(cpu);
+		entry->lo.fields_vapic.is_run = true;
+	} else {
+		entry->lo.fields_vapic.is_run = false;
 	}
-	entry->lo.fields_vapic.is_run = is_run;
 
 	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 				ir_data->irq_2_irte.index, entry);
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 99b4fa9a0296..fe0e16ffe0e5 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -30,8 +30,7 @@ static inline void amd_iommu_detect(void) { }
 /* IOMMU AVIC Function */
 extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
-extern int
-amd_iommu_update_ga(int cpu, bool is_run, void *data);
+extern int amd_iommu_update_ga(int cpu, void *data);
 
 extern int amd_iommu_activate_guest_mode(void *data);
 extern int amd_iommu_deactivate_guest_mode(void *data);
@@ -44,8 +43,7 @@ amd_iommu_register_ga_log_notifier(int (*notifier)(u32))
 	return 0;
 }
 
-static inline int
-amd_iommu_update_ga(int cpu, bool is_run, void *data)
+static inline int amd_iommu_update_ga(int cpu, void *data)
 {
 	return 0;
 }
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 41/62] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (39 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 40/62] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 42/62] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
                   ` (21 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Split the guts of amd_iommu_update_ga() to a dedicated helper so that the
logic can be shared with flows that put the IRTE into posted mode.

Opportunistically move amd_iommu_update_ga() and its new helper above
amd_iommu_activate_guest_mode() so that it's all co-located.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 87 +++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 41 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index bb804bbc916b..15718b7b8bd4 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3804,6 +3804,52 @@ static const struct irq_domain_ops amd_ir_domain_ops = {
 	.deactivate = irq_remapping_deactivate,
 };
 
+static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
+{
+	if (cpu >= 0) {
+		entry->lo.fields_vapic.destination =
+					APICID_TO_IRTE_DEST_LO(cpu);
+		entry->hi.fields.destination =
+					APICID_TO_IRTE_DEST_HI(cpu);
+		entry->lo.fields_vapic.is_run = true;
+	} else {
+		entry->lo.fields_vapic.is_run = false;
+	}
+}
+
+/*
+ * Update the pCPU information for an IRTE that is configured to post IRQs to
+ * a vCPU, without issuing an IOMMU invalidation for the IRTE.
+ *
+ * If the vCPU is associated with a pCPU (@cpu >= 0), configure the Destination
+ * with the pCPU's APIC ID and set IsRun, else clear IsRun.  I.e. treat vCPUs
+ * that are associated with a pCPU as running.  This API is intended to be used
+ * when a vCPU is scheduled in/out (or stops running for any reason), to do a
+ * fast update of IsRun and (conditionally) Destination.
+ *
+ * Per the IOMMU spec, the Destination, IsRun, and GATag fields are not cached
+ * and thus don't require an invalidation to ensure the IOMMU consumes fresh
+ * information.
+ */
+int amd_iommu_update_ga(int cpu, void *data)
+{
+	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
+	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
+
+	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
+	    !entry || !entry->lo.fields_vapic.guest_mode)
+		return 0;
+
+	if (!ir_data->iommu)
+		return -ENODEV;
+
+	__amd_iommu_update_ga(entry, cpu);
+
+	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
+				ir_data->irq_2_irte.index, entry);
+}
+EXPORT_SYMBOL(amd_iommu_update_ga);
+
 int amd_iommu_activate_guest_mode(void *data)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
@@ -3985,45 +4031,4 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 
 	return 0;
 }
-
-/*
- * Update the pCPU information for an IRTE that is configured to post IRQs to
- * a vCPU, without issuing an IOMMU invalidation for the IRTE.
- *
- * If the vCPU is associated with a pCPU (@cpu >= 0), configure the Destination
- * with the pCPU's APIC ID and set IsRun, else clear IsRun.  I.e. treat vCPUs
- * that are associated with a pCPU as running.  This API is intended to be used
- * when a vCPU is scheduled in/out (or stops running for any reason), to do a
- * fast update of IsRun and (conditionally) Destination.
- *
- * Per the IOMMU spec, the Destination, IsRun, and GATag fields are not cached
- * and thus don't require an invalidation to ensure the IOMMU consumes fresh
- * information.
- */
-int amd_iommu_update_ga(int cpu, void *data)
-{
-	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
-	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
-
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
-		return 0;
-
-	if (!ir_data->iommu)
-		return -ENODEV;
-
-	if (cpu >= 0) {
-		entry->lo.fields_vapic.destination =
-					APICID_TO_IRTE_DEST_LO(cpu);
-		entry->hi.fields.destination =
-					APICID_TO_IRTE_DEST_HI(cpu);
-		entry->lo.fields_vapic.is_run = true;
-	} else {
-		entry->lo.fields_vapic.is_run = false;
-	}
-
-	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
-				ir_data->irq_2_irte.index, entry);
-}
-EXPORT_SYMBOL(amd_iommu_update_ga);
 #endif
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 42/62] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (40 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 41/62] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 43/62] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
                   ` (20 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity.  Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.

This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h |  1 +
 arch/x86/kvm/svm/avic.c              | 26 ++++++++++++++------------
 drivers/iommu/amd/iommu.c            |  6 ++++--
 include/linux/amd-iommu.h            |  4 ++--
 4 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 2dbc9cb61c2f..4c75a17632f6 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -35,6 +35,7 @@ struct amd_iommu_pi_data {
 	u64 vapic_addr;		/* Physical address of the vCPU's vAPIC. */
 	u32 ga_tag;
 	u32 vector;		/* Guest vector of the interrupt */
+	int cpu;
 	bool is_guest_mode;
 	void *ir_data;
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c79648d96752..16557328aa58 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -726,6 +726,7 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 
 static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 {
+	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
 	int ret = 0;
 	unsigned long flags;
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -745,7 +746,7 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 
 	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
 		if (activate)
-			ret = amd_iommu_activate_guest_mode(irqfd->irq_bypass_data);
+			ret = amd_iommu_activate_guest_mode(irqfd->irq_bypass_data, apic_id);
 		else
 			ret = amd_iommu_deactivate_guest_mode(irqfd->irq_bypass_data);
 		if (ret)
@@ -809,6 +810,18 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		guard(spinlock_irqsave)(&svm->ir_list_lock);
 
+		/*
+		 * Update the target pCPU for IOMMU doorbells if the vCPU is
+		 * running.  If the vCPU is NOT running, i.e. is blocking or
+		 * scheduled out, KVM will update the pCPU info when the vCPU
+		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
+		 */
+		entry = svm->avic_physical_id_entry;
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+		else
+			pi_data.cpu = -1;
+
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
 			return ret;
@@ -823,17 +836,6 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			return -EIO;
 		}
 
-		/*
-		 * Update the target pCPU for IOMMU doorbells if the vCPU is
-		 * running.  If the vCPU is NOT running, i.e. is blocking or
-		 * scheduled out, KVM will update the pCPU info when the vCPU
-		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
-		 */
-		entry = svm->avic_physical_id_entry;
-		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-			amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
-					    pi_data.ir_data);
-
 		irqfd->irq_bypass_data = pi_data.ir_data;
 		list_add(&irqfd->vcpu_list, &svm->ir_list);
 		return 0;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 15718b7b8bd4..718bd9604f71 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3850,7 +3850,7 @@ int amd_iommu_update_ga(int cpu, void *data)
 }
 EXPORT_SYMBOL(amd_iommu_update_ga);
 
-int amd_iommu_activate_guest_mode(void *data)
+int amd_iommu_activate_guest_mode(void *data, int cpu)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3871,6 +3871,8 @@ int amd_iommu_activate_guest_mode(void *data)
 	entry->hi.fields.vector            = ir_data->ga_vector;
 	entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;
 
+	__amd_iommu_update_ga(entry, cpu);
+
 	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 			      ir_data->irq_2_irte.index, entry);
 }
@@ -3937,7 +3939,7 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-		ret = amd_iommu_activate_guest_mode(ir_data);
+		ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
 	}
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index fe0e16ffe0e5..c9f2df0c4596 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -32,7 +32,7 @@ extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
 extern int amd_iommu_update_ga(int cpu, void *data);
 
-extern int amd_iommu_activate_guest_mode(void *data);
+extern int amd_iommu_activate_guest_mode(void *data, int cpu);
 extern int amd_iommu_deactivate_guest_mode(void *data);
 
 #else /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
@@ -48,7 +48,7 @@ static inline int amd_iommu_update_ga(int cpu, void *data)
 	return 0;
 }
 
-static inline int amd_iommu_activate_guest_mode(void *data)
+static inline int amd_iommu_activate_guest_mode(void *data, int cpu)
 {
 	return 0;
 }
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 43/62] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (41 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 42/62] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 44/62] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
                   ` (19 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode.  KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 16 ++++++----------
 drivers/iommu/amd/iommu.c |  5 ++++-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 16557328aa58..2e3a8fda0355 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -780,21 +780,17 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	 */
 	svm_ir_list_del(irqfd);
 
-	/**
-	 * Here, we setup with legacy mode in the following cases:
-	 * 1. When cannot target interrupt to a specific vcpu.
-	 * 2. Unsetting posted interrupt.
-	 * 3. APIC virtualization is disabled for the vcpu.
-	 * 4. IRQ has incompatible delivery mode (SMI, INIT, etc)
-	 */
-	if (vcpu && kvm_vcpu_apicv_active(vcpu)) {
+	if (vcpu) {
 		/*
-		 * Try to enable guest_mode in IRTE.
+		 * Try to enable guest_mode in IRTE, unless AVIC is inhibited,
+		 * in which case configure the IRTE for legacy mode, but track
+		 * the IRTE metadata so that it can be converted to guest mode
+		 * if AVIC is enabled/uninhibited in the future.
 		 */
 		struct amd_iommu_pi_data pi_data = {
 			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
 					     vcpu->vcpu_id),
-			.is_guest_mode = true,
+			.is_guest_mode = kvm_vcpu_apicv_active(vcpu),
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
 		};
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 718bd9604f71..becef69a306d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3939,7 +3939,10 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_root_ptr = (pi_data->vapic_addr >> 12);
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
-		ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+		if (pi_data->is_guest_mode)
+			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+		else
+			ret = amd_iommu_deactivate_guest_mode(ir_data);
 	} else {
 		ret = amd_iommu_deactivate_guest_mode(ir_data);
 	}
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 44/62] KVM: SVM: Don't check for assigned device(s) when updating affinity
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (42 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 43/62] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 45/62] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
                   ` (18 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't bother checking if a VM has an assigned device when updating AVIC
vCPU affinity, querying ir_list is just as cheap and nothing prevents
racing with changes in device assignment.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 2e3a8fda0355..dadd982b03c0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -847,9 +847,6 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 
 	lockdep_assert_held(&svm->ir_list_lock);
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm))
-		return 0;
-
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
 	 * interrupt remapping table entry targeting this vcpu.
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 45/62] KVM: SVM: Don't check for assigned device(s) when activating AVIC
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (43 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 44/62] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 46/62] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
                   ` (17 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't short-circuit IRTE updating when (de)activating AVIC based on the
VM having assigned devices, as nothing prevents AVIC (de)activation from
racing with device (de)assignment.  And from a performance perspective,
bailing early when there is no assigned device doesn't add much, as
ir_list_lock will never be contended if there's no assigned device.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index dadd982b03c0..ab7fb8950cc0 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -732,9 +732,6 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_kernel_irqfd *irqfd;
 
-	if (!kvm_arch_has_assigned_device(vcpu->kvm))
-		return 0;
-
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
 	 * interrupt remapping table entry targeting this vcpu.
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 46/62] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (44 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 45/62] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 47/62] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
                   ` (16 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if (de)activating "guest mode" for an IRTE entry fails as modifying
an IRTE should only fail if KVM is buggy, e.g. has stale metadata.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index ab7fb8950cc0..6048cd90e731 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -724,10 +724,9 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	avic_handle_ldr_update(vcpu);
 }
 
-static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
+static void avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 {
 	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	int ret = 0;
 	unsigned long flags;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_kernel_irqfd *irqfd;
@@ -742,16 +741,15 @@ static int avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 		goto out;
 
 	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+		void *data = irqfd->irq_bypass_data;
+
 		if (activate)
-			ret = amd_iommu_activate_guest_mode(irqfd->irq_bypass_data, apic_id);
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, apic_id));
 		else
-			ret = amd_iommu_deactivate_guest_mode(irqfd->irq_bypass_data);
-		if (ret)
-			break;
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
 	}
 out:
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-	return ret;
 }
 
 static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 47/62] KVM: SVM: Process all IRTEs on affinity change even if one update fails
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (45 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 46/62] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 48/62] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
                   ` (15 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

When updating IRTE GA fields, keep processing all other IRTEs if an update
fails, as not updating later entries risks making a bad situation worse.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 6048cd90e731..24e07f075646 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -849,12 +849,10 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 	if (list_empty(&svm->ir_list))
 		return 0;
 
-	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list)
 		ret = amd_iommu_update_ga(cpu, irqfd->irq_bypass_data);
-		if (ret)
-			return ret;
-	}
-	return 0;
+
+	return ret;
 }
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 48/62] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (46 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 47/62] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 49/62] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
                   ` (14 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if updating GA information for an IRTE entry fails as modifying an
IRTE should only fail if KVM is buggy, e.g. has stale metadata, and
because returning an error that is always ignored is pointless.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 24e07f075646..d1f7b35c1b02 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -834,9 +834,8 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
+static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 {
-	int ret = 0;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_kernel_irqfd *irqfd;
 
@@ -847,12 +846,10 @@ static inline int avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu
 	 * interrupt remapping table entry targeting this vcpu.
 	 */
 	if (list_empty(&svm->ir_list))
-		return 0;
+		return;
 
 	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list)
-		ret = amd_iommu_update_ga(cpu, irqfd->irq_bypass_data);
-
-	return ret;
+		WARN_ON_ONCE(amd_iommu_update_ga(cpu, irqfd->irq_bypass_data));
 }
 
 void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 49/62] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (47 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 48/62] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 50/62] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
                   ` (13 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't bother checking if the VM has an assigned device when updating
IRTE entries.  kvm_arch_irq_bypass_add_producer() explicitly increments
the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly
decrements the count before invoking kvm_pi_update_irte(), and
kvm_irq_routing_update() only updates IRTE entries if there's an active
IRQ bypass producer.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 4119c1e880e7..45cb9f1ee618 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -517,9 +517,7 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) ||
-	    !kvm_arch_has_irq_bypass() ||
-	    !kvm_arch_has_assigned_device(kvm))
+	if (!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass())
 		return 0;
 
 	if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 50/62] KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (48 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 49/62] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 51/62] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
                   ` (12 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the
code is only reachable if the VM already has an IRQ bypass producer (see
kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(),
which, stating the obvious, are called if and only if KVM enables its IRQ
bypass hooks.

Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 45cb9f1ee618..9c03a67fedbe 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -517,7 +517,7 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass())
+	if (!irqchip_in_kernel(kvm) || WARN_ON_ONCE(!kvm_arch_has_irq_bypass()))
 		return 0;
 
 	if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 51/62] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (49 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 50/62] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 52/62] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
                   ` (11 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC,
as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any
and all postable IRQs to an APIC-less VM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/irq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 9c03a67fedbe..02efa7162252 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -517,8 +517,8 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 	struct kvm_lapic_irq irq;
 	int r;
 
-	if (!irqchip_in_kernel(kvm) || WARN_ON_ONCE(!kvm_arch_has_irq_bypass()))
-		return 0;
+	if (WARN_ON_ONCE(!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass()))
+		return -EINVAL;
 
 	if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
 		kvm_set_msi_irq(kvm, entry, &irq);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 52/62] KVM: SVM: WARN if ir_list is non-empty at vCPU free
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (50 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 51/62] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 53/62] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
                   ` (10 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is
freed with ir_list entries, i.e. if KVM leaves a dangling IRTE.

Initialize the per-vCPU interrupt remapping list and its lock even if AVIC
is disabled so that the WARN doesn't hit false positives (and so that KVM
doesn't need to call into AVIC code for a simple sanity check).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 5 +++--
 arch/x86/kvm/svm/svm.c  | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d1f7b35c1b02..c55cbb0610b4 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -704,6 +704,9 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 	int ret;
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 
+	INIT_LIST_HEAD(&svm->ir_list);
+	spin_lock_init(&svm->ir_list_lock);
+
 	if (!enable_apicv || !irqchip_in_kernel(vcpu->kvm))
 		return 0;
 
@@ -711,8 +714,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 	if (ret)
 		return ret;
 
-	INIT_LIST_HEAD(&svm->ir_list);
-	spin_lock_init(&svm->ir_list_lock);
 	svm->dfr_reg = APIC_DFR_FLAT;
 
 	return ret;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 56d11f7b4bef..2cd991062acb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1497,6 +1497,8 @@ static void svm_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	WARN_ON_ONCE(!list_empty(&svm->ir_list));
+
 	svm_leave_nested(vcpu);
 	svm_free_nested(svm);
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 53/62] KVM: x86: Decouple device assignment from IRQ bypass
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (51 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 52/62] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 54/62] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
                   ` (9 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Use a dedicated counter to track the number of IRQs that can utilize IRQ
bypass instead of piggybacking the assigned device count.  As evidenced by
commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"),
it's possible for a device to be able to post IRQs to a vCPU without said
device being assigned to a VM.

Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment
to avoid regressing the MMIO stale data mitigation.  KVM is abusing the
assigned device count when applying mmio_stale_data_clear, and it's not at
all clear if vDPA devices rely on this behavior.  This will hopefully be
cleaned up in the future, as the number of assigned devices is a terrible
heuristic for detecting if a VM has access to host MMIO.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +-
 arch/x86/include/asm/kvm_host.h    |  4 +++-
 arch/x86/kvm/irq.c                 |  9 ++++++++-
 arch/x86/kvm/vmx/main.c            |  2 +-
 arch/x86/kvm/vmx/posted_intr.c     | 16 ++++++++++------
 arch/x86/kvm/vmx/posted_intr.h     |  2 +-
 arch/x86/kvm/x86.c                 |  3 +--
 7 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..8897f509860c 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -112,7 +112,7 @@ KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
 KVM_X86_OP_OPTIONAL(vcpu_blocking)
 KVM_X86_OP_OPTIONAL(vcpu_unblocking)
 KVM_X86_OP_OPTIONAL(pi_update_irte)
-KVM_X86_OP_OPTIONAL(pi_start_assignment)
+KVM_X86_OP_OPTIONAL(pi_start_bypass)
 KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
 KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
 KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c722adfedd96..01edcefbd937 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1382,6 +1382,8 @@ struct kvm_arch {
 	atomic_t noncoherent_dma_count;
 #define __KVM_HAVE_ARCH_ASSIGNED_DEVICE
 	atomic_t assigned_device_count;
+	unsigned long nr_possible_bypass_irqs;
+
 #ifdef CONFIG_KVM_IOAPIC
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
@@ -1855,7 +1857,7 @@ struct kvm_x86_ops {
 	int (*pi_update_irte)(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			      unsigned int host_irq, uint32_t guest_irq,
 			      struct kvm_vcpu *vcpu, u32 vector);
-	void (*pi_start_assignment)(struct kvm *kvm);
+	void (*pi_start_bypass)(struct kvm *kvm);
 	void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 02efa7162252..3e4338c6a712 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -570,10 +570,15 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 	spin_lock_irq(&kvm->irqfds.lock);
 	irqfd->producer = prod;
 
+	if (!kvm->arch.nr_possible_bypass_irqs++)
+		kvm_x86_call(pi_start_bypass)(kvm);
+
 	if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
 		ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry);
-		if (ret)
+		if (ret) {
+			kvm->arch.nr_possible_bypass_irqs--;
 			kvm_arch_end_assignment(irqfd->kvm);
+		}
 	}
 	spin_unlock_irq(&kvm->irqfds.lock);
 
@@ -606,6 +611,8 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	}
 	irqfd->producer = NULL;
 
+	kvm->arch.nr_possible_bypass_irqs--;
+
 	spin_unlock_irq(&kvm->irqfds.lock);
 
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..a986fc45145e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -1014,7 +1014,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.nested_ops = &vmx_nested_ops,
 
 	.pi_update_irte = vmx_pi_update_irte,
-	.pi_start_assignment = vmx_pi_start_assignment,
+	.pi_start_bypass = vmx_pi_start_bypass,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vt_op(set_hv_timer),
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 3a23c30f73cb..5671d59a6b6d 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -146,8 +146,13 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 
 static bool vmx_can_use_vtd_pi(struct kvm *kvm)
 {
+	/*
+	 * Note, reading the number of possible bypass IRQs can race with a
+	 * bypass IRQ being attached to the VM.  vmx_pi_start_bypass() ensures
+	 * blockng vCPUs will see an elevated count or get KVM_REQ_UNBLOCK.
+	 */
 	return irqchip_in_kernel(kvm) && kvm_arch_has_irq_bypass() &&
-	       kvm_arch_has_assigned_device(kvm);
+	       READ_ONCE(kvm->arch.nr_possible_bypass_irqs);
 }
 
 /*
@@ -285,12 +290,11 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
 
 
 /*
- * Bail out of the block loop if the VM has an assigned
- * device, but the blocking vCPU didn't reconfigure the
- * PI.NV to the wakeup vector, i.e. the assigned device
- * came along after the initial check in vmx_vcpu_pi_put().
+ * Kick all vCPUs when the first possible bypass IRQ is attached to a VM, as
+ * blocking vCPUs may scheduled out without reconfiguring PID.NV to the wakeup
+ * vector, i.e. if the bypass IRQ came along after vmx_vcpu_pi_put().
  */
-void vmx_pi_start_assignment(struct kvm *kvm)
+void vmx_pi_start_bypass(struct kvm *kvm)
 {
 	if (!kvm_arch_has_irq_bypass())
 		return;
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 94ed66ea6249..a4af39948cf0 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -17,7 +17,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int vmx_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		       unsigned int host_irq, uint32_t guest_irq,
 		       struct kvm_vcpu *vcpu, u32 vector);
-void vmx_pi_start_assignment(struct kvm *kvm);
+void vmx_pi_start_bypass(struct kvm *kvm);
 
 static inline int pi_find_highest_vector(struct pi_desc *pi_desc)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3571c484790..59391cb56674 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13440,8 +13440,7 @@ bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
 
 void kvm_arch_start_assignment(struct kvm *kvm)
 {
-	if (atomic_inc_return(&kvm->arch.assigned_device_count) == 1)
-		kvm_x86_call(pi_start_assignment)(kvm);
+	atomic_inc(&kvm->arch.assigned_device_count);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);
 
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 54/62] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (52 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 53/62] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 55/62] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
                   ` (8 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are
disabled, to make it obvious that the logic is a sanity check, and so that
a bug related to nr_possible_bypass_irqs is more like to cause noisy
failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 5671d59a6b6d..4a6d9a17da23 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -296,7 +296,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
  */
 void vmx_pi_start_bypass(struct kvm *kvm)
 {
-	if (!kvm_arch_has_irq_bypass())
+	if (WARN_ON_ONCE(!vmx_can_use_vtd_pi(kvm)))
 		return;
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 55/62] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (53 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 54/62] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:45 ` [PATCH v3 56/62] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
                   ` (7 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to
find and kick vCPUs when a device posted interrupt serves as a wake event.
Lookups on a vCPU index are O(fast) (not sure what xa_load() actually
provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match
its index.

Unlike the Physical APIC Table, which is accessed by hardware when
virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_
use APIC IDs to fill the Physical APIC Table, but KVM has free rein over
the format/meaning of the GA tag.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index c55cbb0610b4..bb74705d6cfd 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -29,36 +29,39 @@
 #include "svm.h"
 
 /*
- * Encode the arbitrary VM ID and the vCPU's default APIC ID, i.e the vCPU ID,
- * into the GATag so that KVM can retrieve the correct vCPU from a GALog entry
- * if an interrupt can't be delivered, e.g. because the vCPU isn't running.
+ * Encode the arbitrary VM ID and the vCPU's _index_ into the GATag so that
+ * KVM can retrieve the correct vCPU from a GALog entry if an interrupt can't
+ * be delivered, e.g. because the vCPU isn't running.  Use the vCPU's index
+ * instead of its ID (a.k.a. its default APIC ID), as KVM is guaranteed a fast
+ * lookup on the index, where as vCPUs whose index doesn't match their ID need
+ * to walk the entire xarray of vCPUs in the worst case scenario.
  *
- * For the vCPU ID, use however many bits are currently allowed for the max
+ * For the vCPU index, use however many bits are currently allowed for the max
  * guest physical APIC ID (limited by the size of the physical ID table), and
  * use whatever bits remain to assign arbitrary AVIC IDs to VMs.  Note, the
  * size of the GATag is defined by hardware (32 bits), but is an opaque value
  * as far as hardware is concerned.
  */
-#define AVIC_VCPU_ID_MASK		AVIC_PHYSICAL_MAX_INDEX_MASK
+#define AVIC_VCPU_IDX_MASK		AVIC_PHYSICAL_MAX_INDEX_MASK
 
 #define AVIC_VM_ID_SHIFT		HWEIGHT32(AVIC_PHYSICAL_MAX_INDEX_MASK)
 #define AVIC_VM_ID_MASK			(GENMASK(31, AVIC_VM_ID_SHIFT) >> AVIC_VM_ID_SHIFT)
 
 #define AVIC_GATAG_TO_VMID(x)		((x >> AVIC_VM_ID_SHIFT) & AVIC_VM_ID_MASK)
-#define AVIC_GATAG_TO_VCPUID(x)		(x & AVIC_VCPU_ID_MASK)
+#define AVIC_GATAG_TO_VCPUIDX(x)	(x & AVIC_VCPU_IDX_MASK)
 
-#define __AVIC_GATAG(vm_id, vcpu_id)	((((vm_id) & AVIC_VM_ID_MASK) << AVIC_VM_ID_SHIFT) | \
-					 ((vcpu_id) & AVIC_VCPU_ID_MASK))
-#define AVIC_GATAG(vm_id, vcpu_id)					\
+#define __AVIC_GATAG(vm_id, vcpu_idx)	((((vm_id) & AVIC_VM_ID_MASK) << AVIC_VM_ID_SHIFT) | \
+					 ((vcpu_idx) & AVIC_VCPU_IDX_MASK))
+#define AVIC_GATAG(vm_id, vcpu_idx)					\
 ({									\
-	u32 ga_tag = __AVIC_GATAG(vm_id, vcpu_id);			\
+	u32 ga_tag = __AVIC_GATAG(vm_id, vcpu_idx);			\
 									\
-	WARN_ON_ONCE(AVIC_GATAG_TO_VCPUID(ga_tag) != (vcpu_id));	\
+	WARN_ON_ONCE(AVIC_GATAG_TO_VCPUIDX(ga_tag) != (vcpu_idx));	\
 	WARN_ON_ONCE(AVIC_GATAG_TO_VMID(ga_tag) != (vm_id));		\
 	ga_tag;								\
 })
 
-static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_ID_MASK) == -1u);
+static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_IDX_MASK) == -1u);
 
 static bool force_avic;
 module_param_unsafe(force_avic, bool, 0444);
@@ -139,16 +142,16 @@ int avic_ga_log_notifier(u32 ga_tag)
 	struct kvm_svm *kvm_svm;
 	struct kvm_vcpu *vcpu = NULL;
 	u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
-	u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
+	u32 vcpu_idx = AVIC_GATAG_TO_VCPUIDX(ga_tag);
 
-	pr_debug("SVM: %s: vm_id=%#x, vcpu_id=%#x\n", __func__, vm_id, vcpu_id);
-	trace_kvm_avic_ga_log(vm_id, vcpu_id);
+	pr_debug("SVM: %s: vm_id=%#x, vcpu_idx=%#x\n", __func__, vm_id, vcpu_idx);
+	trace_kvm_avic_ga_log(vm_id, vcpu_idx);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
 	hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
 		if (kvm_svm->avic_vm_id != vm_id)
 			continue;
-		vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
+		vcpu = kvm_get_vcpu(&kvm_svm->kvm, vcpu_idx);
 		break;
 	}
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
@@ -785,7 +788,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 */
 		struct amd_iommu_pi_data pi_data = {
 			.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
-					     vcpu->vcpu_id),
+					     vcpu->vcpu_idx),
 			.is_guest_mode = kvm_vcpu_apicv_active(vcpu),
 			.vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
 			.vector = vector,
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 56/62] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (54 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 55/62] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
@ 2025-06-11 22:45 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 57/62] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
                   ` (6 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:45 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully
supported, as KVM should guard all such calls on IRQ posting being enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/iommu/amd/iommu.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index becef69a306d..926dcdfe08c8 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3836,8 +3836,10 @@ int amd_iommu_update_ga(int cpu, void *data)
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry || !entry->lo.fields_vapic.guest_mode)
 		return 0;
 
 	if (!ir_data->iommu)
@@ -3856,7 +3858,10 @@ int amd_iommu_activate_guest_mode(void *data, int cpu)
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
 	u64 valid;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || !entry)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry)
 		return 0;
 
 	valid = entry->lo.fields_vapic.valid;
@@ -3885,8 +3890,10 @@ int amd_iommu_deactivate_guest_mode(void *data)
 	struct irq_cfg *cfg = ir_data->cfg;
 	u64 valid;
 
-	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || !entry->lo.fields_vapic.guest_mode)
+	if (WARN_ON_ONCE(!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir)))
+		return -EINVAL;
+
+	if (!entry || !entry->lo.fields_vapic.guest_mode)
 		return 0;
 
 	valid = entry->lo.fields_remap.valid;
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 57/62] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (55 preceding siblings ...)
  2025-06-11 22:45 ` [PATCH v3 56/62] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 58/62] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
                   ` (5 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in
anticipation of moving the __avic_vcpu_{load,put}() calls into the
critical section, and because having a one-off helper with a name that's
easily confused with avic_pi_update_irte() is unnecessary.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 52 ++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index bb74705d6cfd..9ddec6f3ad41 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -728,34 +728,6 @@ void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	avic_handle_ldr_update(vcpu);
 }
 
-static void avic_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
-{
-	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	unsigned long flags;
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct kvm_kernel_irqfd *irqfd;
-
-	/*
-	 * Here, we go through the per-vcpu ir_list to update all existing
-	 * interrupt remapping table entry targeting this vcpu.
-	 */
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	if (list_empty(&svm->ir_list))
-		goto out;
-
-	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
-		void *data = irqfd->irq_bypass_data;
-
-		if (activate)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, apic_id));
-		else
-			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
-	}
-out:
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
-}
-
 static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_vcpu *vcpu = irqfd->irq_bypass_vcpu;
@@ -990,6 +962,10 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	bool activated = kvm_vcpu_apicv_active(vcpu);
+	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct kvm_kernel_irqfd *irqfd;
+	unsigned long flags;
 
 	if (!enable_apicv)
 		return;
@@ -1001,7 +977,25 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	else
 		avic_vcpu_put(vcpu);
 
-	avic_set_pi_irte_mode(vcpu, activated);
+	/*
+	 * Here, we go through the per-vcpu ir_list to update all existing
+	 * interrupt remapping table entry targeting this vcpu.
+	 */
+	spin_lock_irqsave(&svm->ir_list_lock, flags);
+
+	if (list_empty(&svm->ir_list))
+		goto out;
+
+	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+		void *data = irqfd->irq_bypass_data;
+
+		if (activated)
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, apic_id));
+		else
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
+	}
+out:
+	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 58/62] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (56 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 57/62] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 59/62] KVM: SVM: Consolidate IRTE update " Sean Christopherson
                   ` (4 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Don't query a vCPU's blocking status when toggling AVIC on/off; barring
KVM bugs, the vCPU can't be blocking when refreshing AVIC controls.  And
if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in
the correct state is desirable, i.e. well worth any overhead in a buggy
scenario.

Isolating the "real" load/put flows will allow moving the IOMMU IRTE
(de)activation logic from avic_refresh_apicv_exec_ctrl() to
avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
physical ID entry and its IRTEs in a common path, under a single critical
section of ir_list_lock.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 65 +++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 9ddec6f3ad41..1e6e5d1f6b4e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -828,7 +828,7 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 		WARN_ON_ONCE(amd_iommu_update_ga(cpu, irqfd->irq_bypass_data));
 }
 
-void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
@@ -844,16 +844,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
-	/*
-	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
-	 * is being scheduled in after being preempted.  The CPU entries in the
-	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
-	 * If the vCPU was migrated, its new CPU value will be stuffed when the
-	 * vCPU unblocks.
-	 */
-	if (kvm_vcpu_is_blocking(vcpu))
-		return;
-
 	/*
 	 * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
 	 * _currently_ have assigned devices, as that can change.  Holding
@@ -888,31 +878,33 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
-void avic_vcpu_put(struct kvm_vcpu *vcpu)
+void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	/*
+	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
+	 * is being scheduled in after being preempted.  The CPU entries in the
+	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
+	 * If the vCPU was migrated, its new CPU value will be stuffed when the
+	 * vCPU unblocks.
+	 */
+	if (kvm_vcpu_is_blocking(vcpu))
+		return;
+
+	__avic_vcpu_load(vcpu, cpu);
+}
+
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned long flags;
-	u64 entry;
+	u64 entry = svm->avic_physical_id_entry;
 
 	lockdep_assert_preemption_disabled();
 
 	if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >= PAGE_SIZE))
 		return;
 
-	/*
-	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
-	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
-	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
-	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
-	 * recursively.
-	 */
-	entry = svm->avic_physical_id_entry;
-
-	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
-	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
-		return;
-
 	/*
 	 * Take and hold the per-vCPU interrupt remapping lock while updating
 	 * the Physical ID entry even though the lock doesn't protect against
@@ -932,7 +924,24 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+}
 
+void avic_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Note, reading the Physical ID entry outside of ir_list_lock is safe
+	 * as only the pCPU that has loaded (or is loading) the vCPU is allowed
+	 * to modify the entry, and preemption is disabled.  I.e. the vCPU
+	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
+	 * recursively.
+	 */
+	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
+
+	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+		return;
+
+	__avic_vcpu_put(vcpu);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -973,9 +982,9 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	avic_refresh_virtual_apic_mode(vcpu);
 
 	if (activated)
-		avic_vcpu_load(vcpu, vcpu->cpu);
+		__avic_vcpu_load(vcpu, vcpu->cpu);
 	else
-		avic_vcpu_put(vcpu);
+		__avic_vcpu_put(vcpu);
 
 	/*
 	 * Here, we go through the per-vcpu ir_list to update all existing
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 59/62] KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (57 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 58/62] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 60/62] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
                   ` (3 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into
__avic_vcpu_{load,put}(), and add a param to the helpers to communicate
whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update,
or just a quick update to set the CPU and IsRun.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 85 ++++++++++++++++++++++-------------------
 1 file changed, 46 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1e6e5d1f6b4e..2e47559a4134 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -810,7 +810,28 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 	return irq_set_vcpu_affinity(host_irq, NULL);
 }
 
-static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
+enum avic_vcpu_action {
+	/*
+	 * There is no need to differentiate between activate and deactivate,
+	 * as KVM only refreshes AVIC state when the vCPU is scheduled in and
+	 * isn't blocking, i.e. the pCPU must always be (in)valid when AVIC is
+	 * being (de)activated.
+	 */
+	AVIC_TOGGLE_ON_OFF	= BIT(0),
+	AVIC_ACTIVATE		= AVIC_TOGGLE_ON_OFF,
+	AVIC_DEACTIVATE		= AVIC_TOGGLE_ON_OFF,
+
+	/*
+	 * No unique action is required to deal with a vCPU that stops/starts
+	 * running, as IRTEs are configured to generate GALog interrupts at all
+	 * times.
+	 */
+	AVIC_START_RUNNING	= 0,
+	AVIC_STOP_RUNNING	= 0,
+};
+
+static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
+					    enum avic_vcpu_action action)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_kernel_irqfd *irqfd;
@@ -824,11 +845,20 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 	if (list_empty(&svm->ir_list))
 		return;
 
-	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list)
-		WARN_ON_ONCE(amd_iommu_update_ga(cpu, irqfd->irq_bypass_data));
+	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+		void *data = irqfd->irq_bypass_data;
+
+		if (!(action & AVIC_TOGGLE_ON_OFF))
+			WARN_ON_ONCE(amd_iommu_update_ga(cpu, data));
+		else if (cpu >= 0)
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu));
+		else
+			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
+	}
 }
 
-static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu,
+			     enum avic_vcpu_action action)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
@@ -873,7 +903,7 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, action);
 
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
@@ -890,10 +920,10 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_vcpu_is_blocking(vcpu))
 		return;
 
-	__avic_vcpu_load(vcpu, cpu);
+	__avic_vcpu_load(vcpu, cpu, AVIC_START_RUNNING);
 }
 
-static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu, enum avic_vcpu_action action)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -915,7 +945,7 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 */
 	spin_lock_irqsave(&svm->ir_list_lock, flags);
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1);
+	avic_update_iommu_vcpu_affinity(vcpu, -1, action);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	svm->avic_physical_id_entry = entry;
@@ -941,7 +971,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
 		return;
 
-	__avic_vcpu_put(vcpu);
+	__avic_vcpu_put(vcpu, AVIC_STOP_RUNNING);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -970,41 +1000,18 @@ void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
 
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
-	bool activated = kvm_vcpu_apicv_active(vcpu);
-	int apic_id = kvm_cpu_get_apicid(vcpu->cpu);
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct kvm_kernel_irqfd *irqfd;
-	unsigned long flags;
-
 	if (!enable_apicv)
 		return;
 
+	/* APICv should only be toggled on/off while the vCPU is running. */
+	WARN_ON_ONCE(kvm_vcpu_is_blocking(vcpu));
+
 	avic_refresh_virtual_apic_mode(vcpu);
 
-	if (activated)
-		__avic_vcpu_load(vcpu, vcpu->cpu);
+	if (kvm_vcpu_apicv_active(vcpu))
+		__avic_vcpu_load(vcpu, vcpu->cpu, AVIC_ACTIVATE);
 	else
-		__avic_vcpu_put(vcpu);
-
-	/*
-	 * Here, we go through the per-vcpu ir_list to update all existing
-	 * interrupt remapping table entry targeting this vcpu.
-	 */
-	spin_lock_irqsave(&svm->ir_list_lock, flags);
-
-	if (list_empty(&svm->ir_list))
-		goto out;
-
-	list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
-		void *data = irqfd->irq_bypass_data;
-
-		if (activated)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, apic_id));
-		else
-			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
-	}
-out:
-	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+		__avic_vcpu_put(vcpu, AVIC_DEACTIVATE);
 }
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
@@ -1030,7 +1037,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
 	 * to the vCPU while it's scheduled out.
 	 */
-	avic_vcpu_put(vcpu);
+	__avic_vcpu_put(vcpu, AVIC_STOP_RUNNING);
 }
 
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 60/62] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (58 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 59/62] KVM: SVM: Consolidate IRTE update " Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 61/62] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Sean Christopherson
                   ` (2 subsequent siblings)
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts.  KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.

Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.

Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().

Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.

Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/irq_remapping.h |  1 +
 arch/x86/kvm/svm/avic.c              | 10 ++++++----
 drivers/iommu/amd/iommu.c            | 28 +++++++++++++++++-----------
 include/linux/amd-iommu.h            |  9 ++++-----
 4 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 4c75a17632f6..5a0d42464d44 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -36,6 +36,7 @@ struct amd_iommu_pi_data {
 	u32 ga_tag;
 	u32 vector;		/* Guest vector of the interrupt */
 	int cpu;
+	bool ga_log_intr;
 	bool is_guest_mode;
 	void *ir_data;
 };
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 2e47559a4134..e61ecc3514ea 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -784,10 +784,12 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 		 * is awakened and/or scheduled in.  See also avic_vcpu_load().
 		 */
 		entry = svm->avic_physical_id_entry;
-		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
+		if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
 			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-		else
+		} else {
 			pi_data.cpu = -1;
+			pi_data.ga_log_intr = true;
+		}
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
 		if (ret)
@@ -849,9 +851,9 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 		void *data = irqfd->irq_bypass_data;
 
 		if (!(action & AVIC_TOGGLE_ON_OFF))
-			WARN_ON_ONCE(amd_iommu_update_ga(cpu, data));
+			WARN_ON_ONCE(amd_iommu_update_ga(data, cpu, true));
 		else if (cpu >= 0)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu));
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu, true));
 		else
 			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
 	}
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 926dcdfe08c8..e79f583da36b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3804,7 +3804,8 @@ static const struct irq_domain_ops amd_ir_domain_ops = {
 	.deactivate = irq_remapping_deactivate,
 };
 
-static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
+static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu,
+				  bool ga_log_intr)
 {
 	if (cpu >= 0) {
 		entry->lo.fields_vapic.destination =
@@ -3812,8 +3813,10 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
 		entry->hi.fields.destination =
 					APICID_TO_IRTE_DEST_HI(cpu);
 		entry->lo.fields_vapic.is_run = true;
+		entry->lo.fields_vapic.ga_log_intr = false;
 	} else {
 		entry->lo.fields_vapic.is_run = false;
+		entry->lo.fields_vapic.ga_log_intr = ga_log_intr;
 	}
 }
 
@@ -3822,16 +3825,19 @@ static void __amd_iommu_update_ga(struct irte_ga *entry, int cpu)
  * a vCPU, without issuing an IOMMU invalidation for the IRTE.
  *
  * If the vCPU is associated with a pCPU (@cpu >= 0), configure the Destination
- * with the pCPU's APIC ID and set IsRun, else clear IsRun.  I.e. treat vCPUs
- * that are associated with a pCPU as running.  This API is intended to be used
- * when a vCPU is scheduled in/out (or stops running for any reason), to do a
- * fast update of IsRun and (conditionally) Destination.
+ * with the pCPU's APIC ID, set IsRun, and clear GALogIntr.  If the vCPU isn't
+ * associated with a pCPU (@cpu < 0), clear IsRun and set/clear GALogIntr based
+ * on input from the caller (e.g. KVM only requests GALogIntr when the vCPU is
+ * blocking and requires a notification wake event).  I.e. treat vCPUs that are
+ * associated with a pCPU as running.  This API is intended to be used when a
+ * vCPU is scheduled in/out (or stops running for any reason), to do a fast
+ * update of IsRun, GALogIntr, and (conditionally) Destination.
  *
  * Per the IOMMU spec, the Destination, IsRun, and GATag fields are not cached
  * and thus don't require an invalidation to ensure the IOMMU consumes fresh
  * information.
  */
-int amd_iommu_update_ga(int cpu, void *data)
+int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3845,14 +3851,14 @@ int amd_iommu_update_ga(int cpu, void *data)
 	if (!ir_data->iommu)
 		return -ENODEV;
 
-	__amd_iommu_update_ga(entry, cpu);
+	__amd_iommu_update_ga(entry, cpu, ga_log_intr);
 
 	return __modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 				ir_data->irq_2_irte.index, entry);
 }
 EXPORT_SYMBOL(amd_iommu_update_ga);
 
-int amd_iommu_activate_guest_mode(void *data, int cpu)
+int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr)
 {
 	struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
 	struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
@@ -3871,12 +3877,11 @@ int amd_iommu_activate_guest_mode(void *data, int cpu)
 
 	entry->lo.fields_vapic.valid       = valid;
 	entry->lo.fields_vapic.guest_mode  = 1;
-	entry->lo.fields_vapic.ga_log_intr = 1;
 	entry->hi.fields.ga_root_ptr       = ir_data->ga_root_ptr;
 	entry->hi.fields.vector            = ir_data->ga_vector;
 	entry->lo.fields_vapic.ga_tag      = ir_data->ga_tag;
 
-	__amd_iommu_update_ga(entry, cpu);
+	__amd_iommu_update_ga(entry, cpu, ga_log_intr);
 
 	return modify_irte_ga(ir_data->iommu, ir_data->irq_2_irte.devid,
 			      ir_data->irq_2_irte.index, entry);
@@ -3947,7 +3952,8 @@ static int amd_ir_set_vcpu_affinity(struct irq_data *data, void *info)
 		ir_data->ga_vector = pi_data->vector;
 		ir_data->ga_tag = pi_data->ga_tag;
 		if (pi_data->is_guest_mode)
-			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu);
+			ret = amd_iommu_activate_guest_mode(ir_data, pi_data->cpu,
+							    pi_data->ga_log_intr);
 		else
 			ret = amd_iommu_deactivate_guest_mode(ir_data);
 	} else {
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index c9f2df0c4596..8cced632ecd0 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -30,9 +30,8 @@ static inline void amd_iommu_detect(void) { }
 /* IOMMU AVIC Function */
 extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
-extern int amd_iommu_update_ga(int cpu, void *data);
-
-extern int amd_iommu_activate_guest_mode(void *data, int cpu);
+extern int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr);
+extern int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr);
 extern int amd_iommu_deactivate_guest_mode(void *data);
 
 #else /* defined(CONFIG_AMD_IOMMU) && defined(CONFIG_IRQ_REMAP) */
@@ -43,12 +42,12 @@ amd_iommu_register_ga_log_notifier(int (*notifier)(u32))
 	return 0;
 }
 
-static inline int amd_iommu_update_ga(int cpu, void *data)
+static inline int amd_iommu_update_ga(void *data, int cpu, bool ga_log_intr)
 {
 	return 0;
 }
 
-static inline int amd_iommu_activate_guest_mode(void *data, int cpu)
+static inline int amd_iommu_activate_guest_mode(void *data, int cpu, bool ga_log_intr)
 {
 	return 0;
 }
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 61/62] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (59 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 60/62] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-11 22:46 ` [PATCH v3 62/62] KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq() Sean Christopherson
  2025-06-24 19:38 ` [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Configure IRTEs to GA log interrupts for device posted IRQs that hit
non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event.  If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).

Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.

Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/svm.h |  7 +++++
 arch/x86/kvm/svm/avic.c    | 63 ++++++++++++++++++++++++++++++--------
 2 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 36f67c69ea66..ffc27f676243 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -252,6 +252,13 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 #define AVIC_LOGICAL_ID_ENTRY_VALID_BIT			31
 #define AVIC_LOGICAL_ID_ENTRY_VALID_MASK		(1 << 31)
 
+/*
+ * GA_LOG_INTR is a synthetic flag that's never propagated to hardware-visible
+ * tables.  GA_LOG_INTR is set if the vCPU needs device posted IRQs to generate
+ * GA log interrupts to wake the vCPU (because it's blocking or about to block).
+ */
+#define AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR		BIT_ULL(61)
+
 #define AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK	GENMASK_ULL(11, 0)
 #define AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK	GENMASK_ULL(51, 12)
 #define AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK		(1ULL << 62)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e61ecc3514ea..e4e1d169577f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -788,7 +788,7 @@ int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
 			pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 		} else {
 			pi_data.cpu = -1;
-			pi_data.ga_log_intr = true;
+			pi_data.ga_log_intr = entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
 		}
 
 		ret = irq_set_vcpu_affinity(host_irq, &pi_data);
@@ -825,16 +825,25 @@ enum avic_vcpu_action {
 
 	/*
 	 * No unique action is required to deal with a vCPU that stops/starts
-	 * running, as IRTEs are configured to generate GALog interrupts at all
-	 * times.
+	 * running.  A vCPU that starts running by definition stops blocking as
+	 * well, and a vCPU that stops running can't have been blocking, i.e.
+	 * doesn't need to toggle GALogIntr.
 	 */
 	AVIC_START_RUNNING	= 0,
 	AVIC_STOP_RUNNING	= 0,
+
+	/*
+	 * When a vCPU starts blocking, KVM needs to set the GALogIntr flag
+	 * int all associated IRTEs so that KVM can wake the vCPU if an IRQ is
+	 * sent to the vCPU.
+	 */
+	AVIC_START_BLOCKING	= BIT(1),
 };
 
 static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 					    enum avic_vcpu_action action)
 {
+	bool ga_log_intr = (action & AVIC_START_BLOCKING);
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct kvm_kernel_irqfd *irqfd;
 
@@ -851,9 +860,9 @@ static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
 		void *data = irqfd->irq_bypass_data;
 
 		if (!(action & AVIC_TOGGLE_ON_OFF))
-			WARN_ON_ONCE(amd_iommu_update_ga(data, cpu, true));
+			WARN_ON_ONCE(amd_iommu_update_ga(data, cpu, ga_log_intr));
 		else if (cpu >= 0)
-			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu, true));
+			WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu, ga_log_intr));
 		else
 			WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
 	}
@@ -888,7 +897,8 @@ static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu,
 	entry = svm->avic_physical_id_entry;
 	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
-	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+	entry &= ~(AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK |
+		   AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
@@ -949,12 +959,26 @@ static void __avic_vcpu_put(struct kvm_vcpu *vcpu, enum avic_vcpu_action action)
 
 	avic_update_iommu_vcpu_affinity(vcpu, -1, action);
 
+	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
+
+	/*
+	 * Keep the previous APIC ID in the entry so that a rogue doorbell from
+	 * hardware is at least restricted to a CPU associated with the vCPU.
+	 */
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	svm->avic_physical_id_entry = entry;
 
 	if (enable_ipiv)
 		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
 
+	/*
+	 * Note!  Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
+	 * it's a synthetic flag that usurps an unused should-be-zero bit.
+	 */
+	if (action & AVIC_START_BLOCKING)
+		entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
+
+	svm->avic_physical_id_entry = entry;
+
 	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
 }
 
@@ -969,11 +993,26 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	 */
 	u64 entry = to_svm(vcpu)->avic_physical_id_entry;
 
-	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
-	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
-		return;
+	/*
+	 * Nothing to do if IsRunning == '0' due to vCPU blocking, i.e. if the
+	 * vCPU is preempted while its in the process of blocking.  WARN if the
+	 * vCPU wasn't running and isn't blocking, KVM shouldn't attempt to put
+	 * the AVIC if it wasn't previously loaded.
+	 */
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)) {
+		if (WARN_ON_ONCE(!kvm_vcpu_is_blocking(vcpu)))
+			return;
 
-	__avic_vcpu_put(vcpu, AVIC_STOP_RUNNING);
+		/*
+		 * The vCPU was preempted while blocking, ensure its IRTEs are
+		 * configured to generate GA Log Interrupts.
+		 */
+		if (!(WARN_ON_ONCE(!(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR))))
+			return;
+	}
+
+	__avic_vcpu_put(vcpu, kvm_vcpu_is_blocking(vcpu) ? AVIC_START_BLOCKING :
+							   AVIC_STOP_RUNNING);
 }
 
 void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
@@ -1039,7 +1078,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
 	 * to the vCPU while it's scheduled out.
 	 */
-	__avic_vcpu_put(vcpu, AVIC_STOP_RUNNING);
+	__avic_vcpu_put(vcpu, AVIC_START_BLOCKING);
 }
 
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v3 62/62] KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (60 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 61/62] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Sean Christopherson
@ 2025-06-11 22:46 ` Sean Christopherson
  2025-06-24 19:38 ` [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-11 22:46 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, Sean Christopherson, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what
it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq()
with kvm_set_msi().

Opportunistically delete the public declaration and export, as they are
no longer used/needed.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ---
 arch/x86/kvm/irq.c              | 14 +++++++-------
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01edcefbd937..af08b1acba04 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2381,9 +2381,6 @@ bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu);
 bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			     struct kvm_vcpu **dest_vcpu);
 
-void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		     struct kvm_lapic_irq *irq);
-
 static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
 {
 	/* We can only post Fixed and LowPrio IRQs */
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 3e4338c6a712..83c91845389c 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -252,8 +252,9 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	return r;
 }
 
-void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
-		     struct kvm_lapic_irq *irq)
+static void kvm_msi_to_lapic_irq(struct kvm *kvm,
+				 struct kvm_kernel_irq_routing_entry *e,
+				 struct kvm_lapic_irq *irq)
 {
 	struct msi_msg msg = { .address_lo = e->msi.address_lo,
 			       .address_hi = e->msi.address_hi,
@@ -271,7 +272,6 @@ void kvm_set_msi_irq(struct kvm *kvm, struct kvm_kernel_irq_routing_entry *e,
 	irq->level = 1;
 	irq->shorthand = APIC_DEST_NOSHORT;
 }
-EXPORT_SYMBOL_GPL(kvm_set_msi_irq);
 
 static inline bool kvm_msi_route_invalid(struct kvm *kvm,
 		struct kvm_kernel_irq_routing_entry *e)
@@ -290,7 +290,7 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 	if (!level)
 		return -1;
 
-	kvm_set_msi_irq(kvm, e, &irq);
+	kvm_msi_to_lapic_irq(kvm, e, &irq);
 
 	return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
 }
@@ -313,7 +313,7 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
 		if (kvm_msi_route_invalid(kvm, e))
 			return -EINVAL;
 
-		kvm_set_msi_irq(kvm, e, &irq);
+		kvm_msi_to_lapic_irq(kvm, e, &irq);
 
 		if (kvm_irq_delivery_to_apic_fast(kvm, NULL, &irq, &r, NULL))
 			return r;
@@ -486,7 +486,7 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
 			if (entry->type != KVM_IRQ_ROUTING_MSI)
 				continue;
 
-			kvm_set_msi_irq(vcpu->kvm, entry, &irq);
+			kvm_msi_to_lapic_irq(vcpu->kvm, entry, &irq);
 
 			if (!irq.trig_mode)
 				continue;
@@ -521,7 +521,7 @@ static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
 		return -EINVAL;
 
 	if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
-		kvm_set_msi_irq(kvm, entry, &irq);
+		kvm_msi_to_lapic_irq(kvm, entry, &irq);
 
 		/*
 		 * Force remapped mode if hardware doesn't support posting the
-- 
2.50.0.rc1.591.g9c95f17f64-goog


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-11 22:45 ` [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails Sean Christopherson
@ 2025-06-12 11:59   ` Marc Zyngier
  2025-06-12 14:34     ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Marc Zyngier @ 2025-06-12 11:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Wed, 11 Jun 2025 23:45:05 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> WARN if unmapping a vLPI in kvm_vgic_v4_unset_forwarding() fails, as
> failure means an IRTE has likely been left in a bad state, i.e. IRQs
> won't go where they should.

I have no idea what an IRTE is. But not having an VLPI mapping for an
interrupt at the point where we're tearing down the forwarding is
pretty benign. IRQs *still* go where they should, and we don't lose
anything.

What it may mean is that userspace and the guest may have done things
in the wrong order for KVM to accelerate interrupt delivery. That's
silly, but also completely harmless.

	M.


-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-12 11:59   ` Marc Zyngier
@ 2025-06-12 14:34     ` Sean Christopherson
  2025-06-13 20:47       ` Oliver Upton
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-12 14:34 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Thu, Jun 12, 2025, Marc Zyngier wrote:
> On Wed, 11 Jun 2025 23:45:05 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > WARN if unmapping a vLPI in kvm_vgic_v4_unset_forwarding() fails, as
> > failure means an IRTE has likely been left in a bad state, i.e. IRQs
> > won't go where they should.
> 
> I have no idea what an IRTE is.

Sorry, x86 IOMMU terminology (Interrupt Remapping Table Entry).  I think the GIC
terminology would be ITS entry?  Or maybe ITS mapping?

> But not having an VLPI mapping for an interrupt at the point where we're
> tearing down the forwarding is pretty benign. IRQs *still* go where they
> should, and we don't lose anything.

This is the code I'm trying to refer to:

  static int its_vlpi_unmap(struct irq_data *d)
  {
	struct its_device *its_dev = irq_data_get_irq_chip_data(d);
	u32 event = its_get_event_id(d);

	if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))  <=== this shouldn't happen?
		return -EINVAL;

	/* Drop the virtual mapping */
	its_send_discard(its_dev, event);

	/* and restore the physical one */
	irqd_clr_forwarded_to_vcpu(d);
	its_send_mapti(its_dev, d->hwirq, event);
	lpi_update_config(d, 0xff, (lpi_prop_prio |
				    LPI_PROP_ENABLED |
				    LPI_PROP_GROUP1));

	/* Potentially unmap the VM from this ITS */
	its_unmap_vm(its_dev->its, its_dev->event_map.vm);

	/*
	 * Drop the refcount and make the device available again if
	 * this was the last VLPI.
	 */
	if (!--its_dev->event_map.nr_vlpis) {
		its_dev->event_map.vm = NULL;
		kfree(its_dev->event_map.vlpi_maps);
	}

	return 0;
  }

When called from kvm_vgic_v4_unset_forwarding()

	if (irq->hw) {
		atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
		irq->hw = false;
		ret = its_unmap_vlpi(host_irq);
	}

IIUC, irq->hw is set if and only if KVM has configured the IRQ to be fowarded
directly to a vCPU.  Based on this comment in its_map_vlpi(): 

	/*
	 * The host will never see that interrupt firing again, so it
	 * is vital that we don't do any lazy masking.
	 */

and this code in its_vlpi_map():


		/* Drop the physical mapping */
		its_send_discard(its_dev, event);

my understanding is that the associated physical IRQ will not be delivered to the
host while the IRQ is being forwarded to a vCPU.

irq->hw should only become true for MSIs (I'm crossing my fingers that SGIs aren't
in play here) if its_map_vlpi() succeeds:

	ret = its_map_vlpi(virq, &map);
	if (ret)
		goto out_unlock_irq;

	irq->hw		= true;
	irq->host_irq	= virq;
	atomic_inc(&map.vpe->vlpi_count);

and so if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(), then from KVM's
perspective, the worst case scenario is that an IRQ has been left in a forwarded
state, i.e. the physical mapping hasn't been reinstalled.

KVM already WARNs if kvm_vgic_v4_unset_forwarding() fails when KVM is reacting to
a routing change (this is the WARN I want to move into arch code so that
kvm_arch_update_irqfd_routing() doesn't plumb a pointless error code up the stack):

		if (irqfd->producer &&
		    kvm_arch_irqfd_route_changed(&old, &irqfd->irq_entry)) {
			int ret = kvm_arch_update_irqfd_routing(
					irqfd->kvm, irqfd->producer->irq,
					irqfd->gsi, 1);
			WARN_ON(ret);
		}

It's only the kvm_arch_irq_bypass_del_producer() case where KVM doesn't WARN.  If
that fails, then the IRQ has potentially been left in a forwarded state, despite
whatever driver "owns" the physical IRQ having removed its producer.  E.g. if VFIO
detaches its irqbypass producer and gives the device back to the host, then
whatever is using the device in the host won't receive IRQs as expected.

Looking at this again, its_free_ite() also WARNs on its_unmap_vlpi() failure, so
wouldn't it make sense to have its_unmap_vlpi() WARN if irq_set_vcpu_affinity()
fails?  The only possible failures are that the GIC doesn't have a v4 ITS (from
its_irq_set_vcpu_affinity()):

	/* Need a v4 ITS */
	if (!is_v4(its_dev->its))
		return -EINVAL;

	guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);

	/* Unmap request? */
	if (!info)
		return its_vlpi_unmap(d);

or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):

	if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
		return -EINVAL;

All of those failure scenario seem like warnable offences when KVM thinks it has
configured the IRQ to be forwarded to a vCPU.

E.g.

--
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 534049c7c94b..98630dae910d 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -758,7 +758,7 @@ static void its_free_ite(struct kvm *kvm, struct its_ite *ite)
        if (irq) {
                scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
                        if (irq->hw)
-                               WARN_ON(its_unmap_vlpi(ite->irq->host_irq));
+                               its_unmap_vlpi(ite->irq->host_irq);
 
                        irq->hw = false;
                }
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index 193946108192..911170d4a9c8 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -545,10 +545,10 @@ int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
        if (irq->hw) {
                atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
                irq->hw = false;
-               ret = its_unmap_vlpi(host_irq);
+               its_unmap_vlpi(host_irq);
        }
 
        raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
        vgic_put_irq(kvm, irq);
-       return ret;
+       return 0;
 }
diff --git a/drivers/irqchip/irq-gic-v4.c b/drivers/irqchip/irq-gic-v4.c
index 58c28895f8c4..8455b4a5fbb0 100644
--- a/drivers/irqchip/irq-gic-v4.c
+++ b/drivers/irqchip/irq-gic-v4.c
@@ -342,10 +342,10 @@ int its_get_vlpi(int irq, struct its_vlpi_map *map)
        return irq_set_vcpu_affinity(irq, &info);
 }
 
-int its_unmap_vlpi(int irq)
+void its_unmap_vlpi(int irq)
 {
        irq_clear_status_flags(irq, IRQ_DISABLE_UNLAZY);
-       return irq_set_vcpu_affinity(irq, NULL);
+       WARN_ON_ONCE(irq_set_vcpu_affinity(irq, NULL));
 }
 
 int its_prop_update_vlpi(int irq, u8 config, bool inv)
diff --git a/include/linux/irqchip/arm-gic-v4.h b/include/linux/irqchip/arm-gic-v4.h
index 7f1f11a5e4e4..0b0887099fd7 100644
--- a/include/linux/irqchip/arm-gic-v4.h
+++ b/include/linux/irqchip/arm-gic-v4.h
@@ -146,7 +146,7 @@ int its_commit_vpe(struct its_vpe *vpe);
 int its_invall_vpe(struct its_vpe *vpe);
 int its_map_vlpi(int irq, struct its_vlpi_map *map);
 int its_get_vlpi(int irq, struct its_vlpi_map *map);
-int its_unmap_vlpi(int irq);
+void its_unmap_vlpi(int irq);
 int its_prop_update_vlpi(int irq, u8 config, bool inv);
 int its_prop_update_vsgi(int irq, u8 priority, bool group);



^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR
  2025-06-11 22:45 ` [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
@ 2025-06-13 14:15   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-13 14:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:11PM -0700, Sean Christopherson wrote:
> Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
> theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
> and its only usage is to mask the default APIC base, which is 4KiB aligned
> and (obviously) a legal physical address.
> 
> No functional change intended.
> 
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/svm.h | 2 --
>  arch/x86/kvm/svm/avic.c    | 2 +-
>  2 files changed, 1 insertion(+), 3 deletions(-)

Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
  2025-06-11 22:45 ` [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
@ 2025-06-13 14:37   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-13 14:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

> KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
				      ^^^^^^^^^^^^^^^^
Just "physical addresses", or perhaps "HPAs"?

On Wed, Jun 11, 2025 at 03:45:12PM -0700, Sean Christopherson wrote:
> Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
> maximum theoretical physical address for x86-64 CPUs, as x86-64 is
> currently defined (going beyond PA52 would require an entirely new paging
> mode, which would arguably create a new, different architecture).
> 
> All usage in KVM masks the result of page_to_phys(), which on x86-64 is
> guaranteed to be 4KiB aligned and a legal physical address; if either of
> those requirements doesn't hold true, KVM has far bigger problems.
> 
> Drop masking the avic_backing_page with
> AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
> keep the macro even though it's unused in functional code.  It's a
> distinct architectural define, and having the definition in software
> helps visualize the layout of an entry.  And to be hyper-paranoid about
> MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
> maximum supported physical address stays in bounds.
> 
> The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
> SME's C-bit resides between bits 51:11; that holds true for current CPUs,
> but isn't required by AMD's architecture:
> 
>   In some implementations, the bit used may be a physical address bit
> 
> Key word being "may".
> 
> Opportunistically use the GENMASK_ULL() version for
> AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
> than a set of repeating Fs.

Makes sense.
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
  2025-06-11 22:45 ` [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
@ 2025-06-13 14:38   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-13 14:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:13PM -0700, Sean Christopherson wrote:
> Add a helper to get the physical address of the AVIC backing page, both
> to deduplicate code and to prepare for getting the address directly from
> apic->regs, at which point it won't be all that obvious that the address
> in question is what SVM calls the AVIC backing page.
> 
> No functional change intended.
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)

LGTM.
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
  2025-06-11 22:45 ` [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
@ 2025-06-13 14:44   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-13 14:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:14PM -0700, Sean Christopherson wrote:
> Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
> address of KVM's vAPIC page directly from the source.  Getting a physical
> address from a kernel virtual address is not an expensive operation, and
> getting the physical address from a struct page is *more* expensive for
> CONFIG_SPARSEMEM=y kernels.  Regardless, none of the paths that consume
> the address are hot paths, i.e. shaving cycles is not a priority.
> 
> Eliminating the "cache" means KVM doesn't have to worry about the cache
> being invalid, which will simplify a future fix when dealing with vCPU IDs
> that are too big.
> 
> WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
> in-kernel local APIC.  avic_init_vcpu() bails early if the APIC is not
> in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
> been created, i.e. it should be impossible to reach
> avic_init_backing_page() without the vAPIC being allocated.
> 
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 6 ++----
>  arch/x86/kvm/svm/svm.h  | 1 -
>  2 files changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes
  2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
@ 2025-06-13 19:43   ` Oliver Upton
  2025-06-19 12:36   ` (subset) " Marc Zyngier
  1 sibling, 0 replies; 100+ messages in thread
From: Oliver Upton @ 2025-06-13 19:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:04PM -0700, Sean Christopherson wrote:
> Explicitly treat type differences as GSI routing changes, as comparing MSI
> data between two entries could get a false negative, e.g. if userspace
> changed the type but left the type-specific data as-
> 
> Note, the same bug was fixed in x86 by commit bcda70c56f3e ("KVM: x86:
> Explicitly treat routing entry type changes as changes").

Yeah, I'll let you guess where I got the idea from...

> Fixes: 4bf3693d36af ("KVM: arm64: Unmap vLPIs affected by changes to GSI routing information")
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-12 14:34     ` Sean Christopherson
@ 2025-06-13 20:47       ` Oliver Upton
  2025-06-20 17:22         ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Oliver Upton @ 2025-06-13 20:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Thu, Jun 12, 2025 at 07:34:35AM -0700, Sean Christopherson wrote:
> On Thu, Jun 12, 2025, Marc Zyngier wrote:
> > On Wed, 11 Jun 2025 23:45:05 +0100,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > WARN if unmapping a vLPI in kvm_vgic_v4_unset_forwarding() fails, as
> > > failure means an IRTE has likely been left in a bad state, i.e. IRQs
> > > won't go where they should.
> > 
> > I have no idea what an IRTE is.
> 
> Sorry, x86 IOMMU terminology (Interrupt Remapping Table Entry).  I think the GIC
> terminology would be ITS entry?  Or maybe ITS mapping?

We tend to just call it a 'vLPI mapping', which under the hood implies
a couple other translations have been wired up as well (vPE + Device).

> > But not having an VLPI mapping for an interrupt at the point where we're
> > tearing down the forwarding is pretty benign. IRQs *still* go where they
> > should, and we don't lose anything.

The VM may not actually be getting torn down, though. The series of
fixes [*] we took for 6.16 addressed games that VMMs might be playing on
irqbypass for a live VM.

[*] https://lore.kernel.org/kvmarm/20250523194722.4066715-1-oliver.upton@linux.dev/

> All of those failure scenario seem like warnable offences when KVM thinks it has
> configured the IRQ to be forwarded to a vCPU.

I tend to agree here, especially considering how horribly fragile GICv4
has been in some systems. I know of a couple implementations where ITS
command failures and/or unmapped MSIs are fatal for the entire machine.
Debugging them has been a genuine pain in the ass.

WARN'ing when state tracking for vLPIs is out of whack would've made it
a little easier.

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
  2025-06-11 22:45 ` [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
@ 2025-06-13 20:50   ` Oliver Upton
  0 siblings, 0 replies; 100+ messages in thread
From: Oliver Upton @ 2025-06-13 20:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:36PM -0700, Sean Christopherson wrote:
> Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing().
> Calling arch code to know whether or not to call arch code is absurd.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-06-11 22:45 ` [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
@ 2025-06-17 14:25   ` Naveen N Rao
  2025-06-17 16:10     ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Naveen N Rao @ 2025-06-17 14:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
> an ID that is too big, but otherwise allow vCPU creation to succeed.
> Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
> that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
> than 254 (AVIC) or 511 (x2AVIC).
> 
> Alternatively, KVM could advertise an accurate value depending on which
> AVIC mode is in use, but that wouldn't really solve the underlying problem,
> e.g. would be a breaking change if KVM were to ever try and enable AVIC or
> x2AVIC by default.
> 
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  9 ++++++++-
>  arch/x86/kvm/svm/avic.c         | 14 ++++++++++++--
>  arch/x86/kvm/svm/svm.h          |  3 ++-
>  3 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 2a6ef1398da7..a9b709db7c59 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1314,6 +1314,12 @@ enum kvm_apicv_inhibit {
>  	 */
>  	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>  
> +	/*
> +	 * AVIC is disabled because the vCPU's APIC ID is beyond the max
> +	 * supported by AVIC/x2AVIC, i.e. the vCPU is unaddressable.
> +	 */
> +	APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG,
> +
>  	NR_APICV_INHIBIT_REASONS,
>  };
>  
> @@ -1332,7 +1338,8 @@ enum kvm_apicv_inhibit {
>  	__APICV_INHIBIT_REASON(IRQWIN),			\
>  	__APICV_INHIBIT_REASON(PIT_REINJ),		\
>  	__APICV_INHIBIT_REASON(SEV),			\
> -	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> +	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
> +	__APICV_INHIBIT_REASON(PHYSICAL_ID_TOO_BIG)
>  
>  struct kvm_arch {
>  	unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index ab228872a19b..f0a74b102c57 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  	int id = vcpu->vcpu_id;
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  
> +	/*
> +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> +	 * fully initialized AVIC.
> +	 */
>  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> -		return -EINVAL;
> +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> +		vcpu->arch.apic->apicv_active = false;

This bothers me a bit. kvm_create_lapic() does this:
          if (enable_apicv) {
                  apic->apicv_active = true;
                  kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
	  }

But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
is going to be a no-op.

This does not look to be a big deal given that kvm_create_lapic() itself 
is called just a bit before svm_vcpu_create() (which calls the above 
function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
isn't that much done before apicv_active is toggled.

But, this made me wonder if introducing a kvm_x86_op to check and 
enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
overall. Maybe even have it be the initialization point for APICv: 
apicv_init(), so we can invoke avic_init_vcpu() right away?


- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-06-11 22:45 ` [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
@ 2025-06-17 14:49   ` Naveen N Rao
  2025-06-17 16:33     ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Naveen N Rao @ 2025-06-17 14:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:16PM -0700, Sean Christopherson wrote:
>  static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  {
> -	u64 *entry, new_entry;
> -	int id = vcpu->vcpu_id;
> +	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	u32 id = vcpu->vcpu_id;
> +	u64 *table, new_entry;
>  
>  	/*
>  	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> @@ -291,6 +277,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  		return 0;
>  	}
>  
> +	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
> +		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
						    ^^^^^^^^^^^^^^
Renaming new_entry to just 'entry' and using sizeof(entry) makes this 
more readable for me. Otherwise, for this patch:
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>


As an aside, there are a few static asserts to validate some of the 
related macros. Can this also be a static_assert(), or is there is 
reason to prefer BUILD_BUG_ON()?


- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"
  2025-06-11 22:45 ` [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
@ 2025-06-17 15:01   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-17 15:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:17PM -0700, Sean Christopherson wrote:
> @@ -277,8 +265,8 @@ static int avic_init_backing_page(struct kvm_vcpu 
> *vcpu)
>  		return 0;
>  	}
>  
> -	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
> -		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
> +	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE ||
> +		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE);

Ah, you change to 'new_entry' here. Would be good to rename it to just 
'entry', but that's a minor nit. Ignore my comment on the previous 
patch.

For this patch:
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>


- Naveen

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
  2025-06-11 22:45 ` [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
@ 2025-06-17 15:42   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-17 15:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:41PM -0700, Sean Christopherson wrote:
> Now that svm_ir_list_add() isn't overloaded with all manner of weird
> things, fold it into avic_pi_update_irte(), and more importantly take
> ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
> that's shoved into the IRTE is fresh.  While preemption (and IRQs) is
> disabled on the task performing the IRTE update, thanks to irqfds.lock,
> that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
> is irrelevant.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 55 +++++++++++++++++------------------------
>  1 file changed, 22 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index f1e9f0dd43e8..4747fb09aca4 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -769,32 +769,6 @@ static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
>  	spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
>  }
>  
> -static void svm_ir_list_add(struct vcpu_svm *svm,
> -			    struct kvm_kernel_irqfd *irqfd,
> -			    struct amd_iommu_pi_data *pi)
> -{
> -	unsigned long flags;
> -	u64 entry;
> -
> -	irqfd->irq_bypass_data = pi->ir_data;
> -
> -	spin_lock_irqsave(&svm->ir_list_lock, flags);
> -
> -	/*
> -	 * Update the target pCPU for IOMMU doorbells if the vCPU is running.
> -	 * If the vCPU is NOT running, i.e. is blocking or scheduled out, KVM
> -	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
> -	 * See also avic_vcpu_load().
> -	 */
> -	entry = svm->avic_physical_id_entry;
> -	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
> -		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
> -				    true, pi->ir_data);
> -
> -	list_add(&irqfd->vcpu_list, &svm->ir_list);
> -	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> -}
> -

There are a few comments in avic_vcpu_load() and avic_vcpu_put() which 
still refer to svm_ir_list_add(). Would be good to update those.

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-06-17 14:25   ` Naveen N Rao
@ 2025-06-17 16:10     ` Sean Christopherson
  2025-06-18 14:33       ` Naveen N Rao
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-17 16:10 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Tue, Jun 17, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index ab228872a19b..f0a74b102c57 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> >  	int id = vcpu->vcpu_id;
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> >  
> > +	/*
> > +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> > +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> > +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> > +	 * fully initialized AVIC.
> > +	 */
> >  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> > -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> > -		return -EINVAL;
> > +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> > +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> > +		vcpu->arch.apic->apicv_active = false;
> 
> This bothers me a bit. kvm_create_lapic() does this:
>           if (enable_apicv) {
>                   apic->apicv_active = true;
>                   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> 	  }
> 
> But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
> is going to be a no-op.

That's fine, KVM_REQ_APICV_UPDATE is a nop in other scenarios, too.  I agree it's
not ideal, but this is a rather extreme edge case and a one-time slow path, so I
don't think it's worth doing anything special just to avoid KVM_REQ_APICV_UPDATE.

> This does not look to be a big deal given that kvm_create_lapic() itself 
> is called just a bit before svm_vcpu_create() (which calls the above 
> function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
> isn't that much done before apicv_active is toggled.
> 
> But, this made me wonder if introducing a kvm_x86_op to check and 
> enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
> overall

Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c is all
kinds of gross, but clearing apic->apicv_active in lapic.c to handle this particular
scenario is just as problematic, because then avic_init_backing_page() would need
to check kvm_vcpu_apicv_active() to determine whether or not to allocate the backing
page.  In a way, that's even worse, because setting apic->apicv_active by default
is purely an optimization, i.e. leaving it %false _should_ work as well, it would
just be suboptimal.  But if AVIC were to key off apic->apicv_active, that could
lead to KVM incorrectly skipping allocation of the AVIC backing page.

> Maybe even have it be the initialization point for APICv: 
> apicv_init(), so we can invoke avic_init_vcpu() right away?

I mostly like this idea (actually, I take that back; see below), but VMX throws
a big wrench in things.  Unlike SVM, VMX doesn't have a singular "enable APICv"
flag.  Rather, KVM considers "APICv" to be the combination of APIC-register
virtualization, virtual interrupt delivery, and posted interrupts.

Which is totally fine.  The big wrench is that the are other features that interact
with "APICv" and require allocations.  E.g.  the APIC access page isn't actually
tied to enable_apicv, it's tied to yet another VMX feature, "virtualize APIC
accesses" (not to be confused with APIC-register virtualization; don't blame me,
I didn't create the control knobs/names).

As a result, KVM may need to allocate the APIC access page (not to be confused
with the APIC *backing* page; again, don't blame me :-D) when enable_apicv=false,
and even more confusingly, NOT allocate the APIC access page when enable_apicv=true.

	if (cpu_need_virtualize_apic_accesses(vcpu)) {  <=== not an enable_apic check, *sigh*
		err = kvm_alloc_apic_access_page(vcpu->kvm);
		if (err)
			goto free_vmcs;
	}

And for both SVM and VMX, IPI virtualization adds another wrinkle, in that the
per-vCPU setup needs to fill an entry in a per-VM table.  And filling that entry
needs to happen at the very end of vCPU creation, e.g. so that the vCPU can't be
targeted until KVM is ready to run the vCPU.

Ouch.  And I'm pretty sure there's a use-after-free in the AVIC code.  If
svm_vcpu_alloc_msrpm() fails, the avic_physical_id_table[] will still have a pointer
to freed vAPIC page.

Thus, invoking avic_init_vcpu() "right away" is unfortunately flat out unsafe
(took me a while to realize that).

So while I 100% agree with your dislike of this patch, I think it's the least
awful band-aid, at least in the short term.

Longer term, I'd definitely be in favor of cleaning up the related flows, but I
think that's best done on top of this series, because I suspect it'll be somewhat
invasive.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-06-17 14:49   ` Naveen N Rao
@ 2025-06-17 16:33     ` Sean Christopherson
  2025-06-18 14:39       ` Naveen N Rao
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-17 16:33 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Tue, Jun 17, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:16PM -0700, Sean Christopherson wrote:
> >  static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> >  {
> > -	u64 *entry, new_entry;
> > -	int id = vcpu->vcpu_id;
> > +	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
> >  	struct vcpu_svm *svm = to_svm(vcpu);
> > +	u32 id = vcpu->vcpu_id;
> > +	u64 *table, new_entry;
> >  
> >  	/*
> >  	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > @@ -291,6 +277,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> >  		return 0;
> >  	}
> >  
> > +	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
> > +		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
> 						    ^^^^^^^^^^^^^^
> Renaming new_entry to just 'entry' and using sizeof(entry) makes this 
> more readable for me.

Good call, though I think it makes sense to do that on top so as to minimize the
churn in this patch.  I'll post a patch, unless you want the honors?

> Otherwise, for this patch:
> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
> 
> As an aside, there are a few static asserts to validate some of the 
> related macros. Can this also be a static_assert(), or is there is 
> reason to prefer BUILD_BUG_ON()?

For this particular assertion, static_assert() would be fine.  That said,
BUILD_BUG_ON() is slightly preferred in this context.

The advantage of BUILD_BUG_ON() is that it works so long as the condition is
compile-time constant, whereas static_assert() requires the condition to an
integer constant expression.  E.g. BUILD_BUG_ON() can be used so long as the
condition is eventually resolved to a constant, whereas static_assert() has
stricter requirements.

E.g. the fls64() assert below is fully resolved at compile time, but isn't a
purely constant expression, i.e. that one *needs* to be BUILD_BUG_ON().

--
arch/x86/kvm/svm/avic.c: In function ‘avic_init_backing_page’:
arch/x86/kvm/svm/avic.c:293:45: error: expression in static assertion is not constant
  293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
   78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
      |                                                        ^~~~
arch/x86/kvm/svm/avic.c:293:9: note: in expansion of macro ‘static_assert’
  293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
      |         ^~~~~~~~~~~~~
make[5]: *** [scripts/Makefile.build:203: arch/x86/kvm/svm/avic.o] Error 1
--

The downside of BUILD_BUG_ON() is that it can't be used at global scope, i.e.
needs to be called from a function.

As a result, when adding an assertion in a function, using BUILD_BUG_ON() is
slightly preferred, because it's less likely to break in the future.  E.g. if
X2AVIC_MAX_PHYSICAL_ID were changed to something that is a compile-time constant,
but for whatever reason isn't a pure integer constant.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-06-17 16:10     ` Sean Christopherson
@ 2025-06-18 14:33       ` Naveen N Rao
  2025-06-18 20:59         ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Naveen N Rao @ 2025-06-18 14:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Tue, Jun 17, 2025 at 09:10:10AM -0700, Sean Christopherson wrote:
> On Tue, Jun 17, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > index ab228872a19b..f0a74b102c57 100644
> > > --- a/arch/x86/kvm/svm/avic.c
> > > +++ b/arch/x86/kvm/svm/avic.c
> > > @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> > >  	int id = vcpu->vcpu_id;
> > >  	struct vcpu_svm *svm = to_svm(vcpu);
> > >  
> > > +	/*
> > > +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > > +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> > > +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> > > +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> > > +	 * fully initialized AVIC.
> > > +	 */
> > >  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> > > -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> > > -		return -EINVAL;
> > > +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> > > +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> > > +		vcpu->arch.apic->apicv_active = false;
> > 
> > This bothers me a bit. kvm_create_lapic() does this:
> >           if (enable_apicv) {
> >                   apic->apicv_active = true;
> >                   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> > 	  }
> > 
> > But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
> > is going to be a no-op.
> 
> That's fine, KVM_REQ_APICV_UPDATE is a nop in other scenarios, too.  I agree it's
> not ideal, but this is a rather extreme edge case and a one-time slow path, so I
> don't think it's worth doing anything special just to avoid KVM_REQ_APICV_UPDATE.
> 
> > This does not look to be a big deal given that kvm_create_lapic() 
> > is called just a bit before svm_vcpu_create() (which calls the above 
> > function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
> > isn't that much done before apicv_active is toggled.
> > 
> > But, this made me wonder if introducing a kvm_x86_op to check and 
> > enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
> > overall
> 
> Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c is all
> kinds of gross, but clearing apic->apicv_active in lapic.c to handle this particular
> scenario is just as problematic, because then avic_init_backing_page() would need
> to check kvm_vcpu_apicv_active() to determine whether or not to allocate the backing
> page.  In a way, that's even worse, because setting apic->apicv_active by default
> is purely an optimization, i.e. leaving it %false _should_ work as well, it would
> just be suboptimal.  But if AVIC were to key off apic->apicv_active, 
> that could
> lead to KVM incorrectly skipping allocation of the AVIC backing page.

I understand your concern about key'ing off apic->apicv_active - that 
would definitely require a thorough audit and does add complexity to 
this.

However, as far as I can see, after your current series, we no longer 
maintain a pointer to the AVIC backing page, but instead rely on the 
lapic-allocated page.

Were you referring to the APIC access page though? That is behind 
kvm_apicv_activated() today, which looks to be problematic if there are 
inhibits set during vcpu_create() and if those can be unset later?  
Shouldn't we be allocating the apic access page unconditionally here?

> 
> > Maybe even have it be the initialization point for APICv: 
> > apicv_init(), so we can invoke avic_init_vcpu() right away?
> 
> I mostly like this idea (actually, I take that back; see below), but VMX throws
> a big wrench in things.  Unlike SVM, VMX doesn't have a singular "enable APICv"
> flag.  Rather, KVM considers "APICv" to be the combination of APIC-register
> virtualization, virtual interrupt delivery, and posted interrupts.
> 
> Which is totally fine.  The big wrench is that the are other features that interact
> with "APICv" and require allocations.  E.g.  the APIC access page isn't actually
> tied to enable_apicv, it's tied to yet another VMX feature, "virtualize APIC
> accesses" (not to be confused with APIC-register virtualization; don't blame me,
> I didn't create the control knobs/names).
> 
> As a result, KVM may need to allocate the APIC access page (not to be confused
> with the APIC *backing* page; again, don't blame me :-D) when enable_apicv=false,
> and even more confusingly, NOT allocate the APIC access page when enable_apicv=true.
> 
> 	if (cpu_need_virtualize_apic_accesses(vcpu)) {  <=== not an enable_apic check, *sigh*
> 		err = kvm_alloc_apic_access_page(vcpu->kvm);
> 		if (err)
> 			goto free_vmcs;
> 	}

Thanks for the background and the details there -- I am going to need 
some time to go unpack all of that :)

> 
> And for both SVM and VMX, IPI virtualization adds another wrinkle, in that the
> per-vCPU setup needs to fill an entry in a per-VM table.  And filling that entry
> needs to happen at the very end of vCPU creation, e.g. so that the vCPU can't be
> targeted until KVM is ready to run the vCPU.
> 
> Ouch.  And I'm pretty sure there's a use-after-free in the AVIC code.  If
> svm_vcpu_alloc_msrpm() fails, the avic_physical_id_table[] will still have a pointer
> to freed vAPIC page.

Oops.

> 
> Thus, invoking avic_init_vcpu() "right away" is unfortunately flat out unsafe
> (took me a while to realize that).
> 
> So while I 100% agree with your dislike of this patch, I think it's the least
> awful band-aid, at least in the short term.
> 
> Longer term, I'd definitely be in favor of cleaning up the related flows, but I
> think that's best done on top of this series, because I suspect it'll be somewhat
> invasive.

Sure, that makes sense. For this patch:
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>


Thanks,
Naveen

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
  2025-06-17 16:33     ` Sean Christopherson
@ 2025-06-18 14:39       ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-18 14:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Tue, Jun 17, 2025 at 09:33:20AM -0700, Sean Christopherson wrote:
> On Tue, Jun 17, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:16PM -0700, Sean Christopherson wrote:
> > >  static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> > >  {
> > > -	u64 *entry, new_entry;
> > > -	int id = vcpu->vcpu_id;
> > > +	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
> > >  	struct vcpu_svm *svm = to_svm(vcpu);
> > > +	u32 id = vcpu->vcpu_id;
> > > +	u64 *table, new_entry;
> > >  
> > >  	/*
> > >  	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > > @@ -291,6 +277,9 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> > >  		return 0;
> > >  	}
> > >  
> > > +	BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE ||
> > > +		     (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(*table) > PAGE_SIZE);
> > 						    ^^^^^^^^^^^^^^
> > Renaming new_entry to just 'entry' and using sizeof(entry) makes this 
> > more readable for me.
> 
> Good call, though I think it makes sense to do that on top so as to minimize the
> churn in this patch.  I'll post a patch, unless you want the honors?

Not at all, please feel free to add a patch (or not, given that this 
will be a trivial change).

> 
> > Otherwise, for this patch:
> > Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
> > 
> > As an aside, there are a few static asserts to validate some of the 
> > related macros. Can this also be a static_assert(), or is there is 
> > reason to prefer BUILD_BUG_ON()?
> 
> For this particular assertion, static_assert() would be fine.  That said,
> BUILD_BUG_ON() is slightly preferred in this context.
> 
> The advantage of BUILD_BUG_ON() is that it works so long as the condition is
> compile-time constant, whereas static_assert() requires the condition to an
> integer constant expression.  E.g. BUILD_BUG_ON() can be used so long as the
> condition is eventually resolved to a constant, whereas static_assert() has
> stricter requirements.
> 
> E.g. the fls64() assert below is fully resolved at compile time, but isn't a
> purely constant expression, i.e. that one *needs* to be BUILD_BUG_ON().
> 
> --
> arch/x86/kvm/svm/avic.c: In function ‘avic_init_backing_page’:
> arch/x86/kvm/svm/avic.c:293:45: error: expression in static assertion is not constant
>   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
> include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
>    78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>       |                                                        ^~~~
> arch/x86/kvm/svm/avic.c:293:9: note: in expansion of macro ‘static_assert’
>   293 |         static_assert(__PHYSICAL_MASK_SHIFT <=
>       |         ^~~~~~~~~~~~~
> make[5]: *** [scripts/Makefile.build:203: arch/x86/kvm/svm/avic.o] Error 1
> --
> 
> The downside of BUILD_BUG_ON() is that it can't be used at global scope, i.e.
> needs to be called from a function.
> 
> As a result, when adding an assertion in a function, using BUILD_BUG_ON() is
> slightly preferred, because it's less likely to break in the future.  E.g. if
> X2AVIC_MAX_PHYSICAL_ID were changed to something that is a compile-time constant,
> but for whatever reason isn't a pure integer constant.

Understood, thanks for the explanation.


- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
  2025-06-18 14:33       ` Naveen N Rao
@ 2025-06-18 20:59         ` Sean Christopherson
  0 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-18 20:59 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 18, 2025, Naveen N Rao wrote:
> On Tue, Jun 17, 2025 at 09:10:10AM -0700, Sean Christopherson wrote:
> > Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c
> > is all kinds of gross, but clearing apic->apicv_active in lapic.c to handle
> > this particular scenario is just as problematic, because then
> > avic_init_backing_page() would need to check kvm_vcpu_apicv_active() to
> > determine whether or not to allocate the backing page.  In a way, that's
> > even worse, because setting apic->apicv_active by default is purely an
> > optimization, i.e. leaving it %false _should_ work as well, it would just
> > be suboptimal.  But if AVIC were to key off apic->apicv_active, that could
> > lead to KVM incorrectly skipping allocation of the AVIC backing page.
> 
> I understand your concern about key'ing off apic->apicv_active - that 
> would definitely require a thorough audit and does add complexity to 
> this.
> 
> However, as far as I can see, after your current series, we no longer 
> maintain a pointer to the AVIC backing page, but instead rely on the 
> lapic-allocated page.
> 
> Were you referring to the APIC access page though? 

Gah, yes.  I was hyper aware of the two things when typing up the response, and
still managed to screw up.  *sigh* :-)

> That is behind kvm_apicv_activated() today, which looks to be problematic if
> there are inhibits set during vcpu_create() and if those can be unset later?
> Shouldn't we be allocating the apic access page unconditionally here?

In theory, yes.  In practice, this guards against an unnecessary allocation for
SEV+ guests (see APICV_INHIBIT_REASON_SEV).

That said, I completely agree that checking kvm_apicv_activated() is weird and
sketchy.  Hopefully that can be cleaned up, too (but after this series).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
  2025-06-11 22:45 ` [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
@ 2025-06-19 11:09   ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-19 11:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:18PM -0700, Sean Christopherson wrote:
> Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
> the table directly.  Caching a pointer address is completely unnecessary
> for performance, and while the field technically caches the result of the
> pointer calculation, it's all too easy to misinterpret the name and think
> that the field somehow caches the _data_ in the table.
> 
> No functional change intended.
> 
> Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
> Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 27 +++++++++++++++------------
>  arch/x86/kvm/svm/svm.h  |  1 -
>  2 files changed, 15 insertions(+), 13 deletions(-)

Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-06-11 22:45 ` [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
@ 2025-06-19 11:31   ` Naveen N Rao
  2025-06-19 12:01     ` Naveen N Rao
  2025-06-20 14:39     ` Sean Christopherson
  0 siblings, 2 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-19 11:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:20PM -0700, Sean Christopherson wrote:
> From: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
> module param, by never setting IsRunning.  SVM doesn't provide a way to
> disable IPI virtualization in hardware, but by ensuring CPUs never see
> IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
> VM-Exit.

I think this is good to have regardless of the erratum. Not sure about VMX,
but does it make sense to intercept writes to the self-ipi MSR as well?

> 
> To avoid setting the real IsRunning bit, while still allowing KVM to use
> each vCPU's entry to update GA log entries, simply maintain a shadow of
> the entry, without propagating IsRunning updates to the real table when
> IPI virtualization is disabled.
> 
> Providing a way to effectively disable IPI virtualization will allow KVM
> to safely enable AVIC on hardware that is susceptible to erratum #1235,
> which causes hardware to sometimes fail to detect that the IsRunning bit
> has been cleared by software.
> 
> Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
> entries, i.e. won't generate VM-Exit if every entry is invalid, and so
> simply pointing the VMCB at a common dummy table won't work.
> 
> Alternatively, KVM could allocate a shadow of the entire table, but that'd
> be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
> additional 8 bytes of memory (vCPU structures are large enough that they
> are backed by order-N pages).
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> [sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 32 ++++++++++++++++++++++++++------
>  arch/x86/kvm/svm/svm.c  |  2 ++
>  arch/x86/kvm/svm/svm.h  |  8 ++++++++
>  3 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 0c0be274d29e..48c737e1200a 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -292,6 +292,13 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
>  	/* Setting AVIC backing page address in the phy APIC ID table */
>  	new_entry = avic_get_backing_page_address(svm) |
>  		    AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
> +	svm->avic_physical_id_entry = new_entry;
> +
> +	/*
> +	 * Initialize the real table, as vCPUs must have a valid entry in order
> +	 * for broadcast IPIs to function correctly (broadcast IPIs ignore
> +	 * invalid entries, i.e. aren't guaranteed to generate a VM-Exit).
> +	 */
>  	WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
>  
>  	return 0;
> @@ -769,8 +776,6 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
>  			   struct amd_iommu_pi_data *pi)
>  {
>  	struct kvm_vcpu *vcpu = &svm->vcpu;
> -	struct kvm *kvm = vcpu->kvm;
> -	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
>  	unsigned long flags;
>  	u64 entry;
>  
> @@ -788,7 +793,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm,
>  	 * will update the pCPU info when the vCPU awkened and/or scheduled in.
>  	 * See also avic_vcpu_load().
>  	 */
> -	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
> +	entry = svm->avic_physical_id_entry;
>  	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
>  		amd_iommu_update_ga(entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK,
>  				    true, pi->ir_data);
> @@ -998,14 +1003,26 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	 */
>  	spin_lock_irqsave(&svm->ir_list_lock, flags);
>  
> -	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
> +	entry = svm->avic_physical_id_entry;
>  	WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
>  
>  	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
>  	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
>  	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
>  
> +	svm->avic_physical_id_entry = entry;
> +
> +	/*
> +	 * If IPI virtualization is disabled, clear IsRunning when updating the
> +	 * actual Physical ID table, so that the CPU never sees IsRunning=1.
> +	 * Keep the APIC ID up-to-date in the entry to minimize the chances of
> +	 * things going sideways if hardware peeks at the ID.
> +	 */
> +	if (!enable_ipiv)
> +		entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> +
>  	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
> +
>  	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
>  
>  	spin_unlock_irqrestore(&svm->ir_list_lock, flags);
> @@ -1030,7 +1047,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
>  	 * recursively.
>  	 */
> -	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
> +	entry = svm->avic_physical_id_entry;
>  
>  	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
>  	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> @@ -1049,7 +1066,10 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
>  	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
>  
>  	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> -	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
> +	svm->avic_physical_id_entry = entry;
> +
> +	if (enable_ipiv)
> +		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);

If enable_ipiv is false, then isRunning bit will never be set and we 
would have bailed out earlier. So, the check for enable_ipiv can be 
dropped here (or converted into an assert).

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-06-19 11:31   ` Naveen N Rao
@ 2025-06-19 12:01     ` Naveen N Rao
  2025-06-20 14:39     ` Sean Christopherson
  1 sibling, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-19 12:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Thu, Jun 19, 2025 at 05:01:30PM +0530, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:20PM -0700, Sean Christopherson wrote:
> > @@ -1030,7 +1047,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
> >  	 * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
> >  	 * recursively.
> >  	 */
> > -	entry = READ_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id]);
> > +	entry = svm->avic_physical_id_entry;
> >  
> >  	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
> >  	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
> > @@ -1049,7 +1066,10 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
> >  	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
> >  
> >  	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
> > -	WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
> > +	svm->avic_physical_id_entry = entry;
> > +
> > +	if (enable_ipiv)
> > +		WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
> 
> If enable_ipiv is false, then isRunning bit will never be set and we 
> would have bailed out earlier. So, the check for enable_ipiv can be 
> dropped here (or converted into an assert).

Ignore this, I got this wrong, sorry. The earlier check is against the 
local copy of the physical ID table entry, which will indeed have 
isRunning set so this is all good.

- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: (subset) [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes
  2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
  2025-06-13 19:43   ` Oliver Upton
@ 2025-06-19 12:36   ` Marc Zyngier
  1 sibling, 0 replies; 100+ messages in thread
From: Marc Zyngier @ 2025-06-19 12:36 UTC (permalink / raw)
  To: Oliver Upton, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, Sean Christopherson
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Wed, 11 Jun 2025 15:45:04 -0700, Sean Christopherson wrote:
> Explicitly treat type differences as GSI routing changes, as comparing MSI
> data between two entries could get a false negative, e.g. if userspace
> changed the type but left the type-specific data as-
> 
> Note, the same bug was fixed in x86 by commit bcda70c56f3e ("KVM: x86:
> Explicitly treat routing entry type changes as changes").
> 
> [...]

Applied to fixes, thanks!

[01/62] KVM: arm64: Explicitly treat routing entry type changes as changes
        commit: 1fbe6861a6d9a942fb8ab8677ddf1ecb86b1af60

Cheers,

	M.
-- 
Without deviation from the norm, progress is not possible.



^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-06-19 11:31   ` Naveen N Rao
  2025-06-19 12:01     ` Naveen N Rao
@ 2025-06-20 14:39     ` Sean Christopherson
  2025-06-23 10:45       ` Naveen N Rao
  1 sibling, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-20 14:39 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Thu, Jun 19, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:20PM -0700, Sean Christopherson wrote:
> > From: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
> > module param, by never setting IsRunning.  SVM doesn't provide a way to
> > disable IPI virtualization in hardware, but by ensuring CPUs never see
> > IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
> > VM-Exit.
> 
> I think this is good to have regardless of the erratum. Not sure about VMX,
> but does it make sense to intercept writes to the self-ipi MSR as well?

That doesn't work for AVIC, i.e. if the guest is MMIO to access the virtual APIC.

Regardless, I don't see any reason to manually intercept self-IPIs when IPI
virtualization is disabled.  AFAIK, there's no need to do so for correctness,
and Intel's self-IPI virtualization isn't tied to IPI virtualization either.
Self-IPI virtualization is enabled by virtual interrupt delivery, which in turn
is enabled by KVM when enable_apicv is true:

  Self-IPI virtualization occurs only if the “virtual-interrupt delivery”
  VM-execution control is 1.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-13 20:47       ` Oliver Upton
@ 2025-06-20 17:22         ` Sean Christopherson
  2025-06-20 18:00           ` David Woodhouse
  2025-06-20 18:48           ` Oliver Upton
  0 siblings, 2 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-20 17:22 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 13, 2025, Oliver Upton wrote:
> On Thu, Jun 12, 2025 at 07:34:35AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 12, 2025, Marc Zyngier wrote:
> > > But not having an VLPI mapping for an interrupt at the point where we're
> > > tearing down the forwarding is pretty benign. IRQs *still* go where they
> > > should, and we don't lose anything.
> 
> The VM may not actually be getting torn down, though. The series of
> fixes [*] we took for 6.16 addressed games that VMMs might be playing on
> irqbypass for a live VM.
> 
> [*] https://lore.kernel.org/kvmarm/20250523194722.4066715-1-oliver.upton@linux.dev/
> 
> > All of those failure scenario seem like warnable offences when KVM thinks it has
> > configured the IRQ to be forwarded to a vCPU.
> 
> I tend to agree here, especially considering how horribly fragile GICv4
> has been in some systems. I know of a couple implementations where ITS
> command failures and/or unmapped MSIs are fatal for the entire machine.
> Debugging them has been a genuine pain in the ass.
> 
> WARN'ing when state tracking for vLPIs is out of whack would've made it
> a little easier.

Marc, does this look and read better?

I'd really, really like to get this sorted out asap, as it's the only thing
blocking the series, and I want to get the series into linux-next early next
week, before I go OOO for ~10 days.

--
From: Sean Christopherson <seanjc@google.com>
Date: Thu, 12 Jun 2025 16:51:47 -0700
Subject: [PATCH] KVM: arm64: WARN if unmapping a vLPI fails in any path

When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if
failure occurs when freeing an ITE.  If undoing vCPU affinity fails, then
odds are very good that vLPI state tracking has has gotten out of whack,
i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI.  At best,
inconsistent state means there is a lurking bug/flaw somewhere.  At worst,
the inconsistency could eventually be fatal to the host, e.g. if an ITS
command fails because KVM's view of things doesn't match reality/hardware.

Note, only the call from kvm_arch_irq_bypass_del_producer() by way of
kvm_vgic_v4_unset_forwarding() doesn't already WARN.  Common KVM's
kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails.
For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(),
the only possible causes are that the GIC doesn't have a v4 ITS (from
its_irq_set_vcpu_affinity()):

        /* Need a v4 ITS */
        if (!is_v4(its_dev->its))
                return -EINVAL;

        guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);

        /* Unmap request? */
        if (!info)
                return its_vlpi_unmap(d);

or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):

        if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
                return -EINVAL;

All of the above failure scenarios are warnable offences, as they should
never occur absent a kernel/KVM bug.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/vgic/vgic-its.c     | 2 +-
 arch/arm64/kvm/vgic/vgic-v4.c      | 4 ++--
 drivers/irqchip/irq-gic-v4.c       | 4 ++--
 include/linux/irqchip/arm-gic-v4.h | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 534049c7c94b..98630dae910d 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -758,7 +758,7 @@ static void its_free_ite(struct kvm *kvm, struct its_ite *ite)
 	if (irq) {
 		scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
 			if (irq->hw)
-				WARN_ON(its_unmap_vlpi(ite->irq->host_irq));
+				its_unmap_vlpi(ite->irq->host_irq);
 
 			irq->hw = false;
 		}
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index 193946108192..911170d4a9c8 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -545,10 +545,10 @@ int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
 	if (irq->hw) {
 		atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
 		irq->hw = false;
-		ret = its_unmap_vlpi(host_irq);
+		its_unmap_vlpi(host_irq);
 	}
 
 	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
 	vgic_put_irq(kvm, irq);
-	return ret;
+	return 0;
 }
diff --git a/drivers/irqchip/irq-gic-v4.c b/drivers/irqchip/irq-gic-v4.c
index 58c28895f8c4..8455b4a5fbb0 100644
--- a/drivers/irqchip/irq-gic-v4.c
+++ b/drivers/irqchip/irq-gic-v4.c
@@ -342,10 +342,10 @@ int its_get_vlpi(int irq, struct its_vlpi_map *map)
 	return irq_set_vcpu_affinity(irq, &info);
 }
 
-int its_unmap_vlpi(int irq)
+void its_unmap_vlpi(int irq)
 {
 	irq_clear_status_flags(irq, IRQ_DISABLE_UNLAZY);
-	return irq_set_vcpu_affinity(irq, NULL);
+	WARN_ON_ONCE(irq_set_vcpu_affinity(irq, NULL));
 }
 
 int its_prop_update_vlpi(int irq, u8 config, bool inv)
diff --git a/include/linux/irqchip/arm-gic-v4.h b/include/linux/irqchip/arm-gic-v4.h
index 7f1f11a5e4e4..0b0887099fd7 100644
--- a/include/linux/irqchip/arm-gic-v4.h
+++ b/include/linux/irqchip/arm-gic-v4.h
@@ -146,7 +146,7 @@ int its_commit_vpe(struct its_vpe *vpe);
 int its_invall_vpe(struct its_vpe *vpe);
 int its_map_vlpi(int irq, struct its_vlpi_map *map);
 int its_get_vlpi(int irq, struct its_vlpi_map *map);
-int its_unmap_vlpi(int irq);
+void its_unmap_vlpi(int irq);
 int its_prop_update_vlpi(int irq, u8 config, bool inv);
 int its_prop_update_vsgi(int irq, u8 priority, bool group);
 

base-commit: 4fc39a165c70a49991b7cc29be3a19eddcd9e5b9
--

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 17:22         ` Sean Christopherson
@ 2025-06-20 18:00           ` David Woodhouse
  2025-06-20 18:48           ` Oliver Upton
  1 sibling, 0 replies; 100+ messages in thread
From: David Woodhouse @ 2025-06-20 18:00 UTC (permalink / raw)
  To: Sean Christopherson, Oliver Upton
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, Lu Baolu,
	linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

[-- Attachment #1: Type: text/plain, Size: 2631 bytes --]

On Fri, 2025-06-20 at 10:22 -0700, Sean Christopherson wrote:
> On Fri, Jun 13, 2025, Oliver Upton wrote:
> > On Thu, Jun 12, 2025 at 07:34:35AM -0700, Sean Christopherson wrote:
> > > On Thu, Jun 12, 2025, Marc Zyngier wrote:
> > > > But not having an VLPI mapping for an interrupt at the point where we're
> > > > tearing down the forwarding is pretty benign. IRQs *still* go where they
> > > > should, and we don't lose anything.
> > 
> > The VM may not actually be getting torn down, though. The series of
> > fixes [*] we took for 6.16 addressed games that VMMs might be playing on
> > irqbypass for a live VM.
> > 
> > [*] https://lore.kernel.org/kvmarm/20250523194722.4066715-1-oliver.upton@linux.dev/
> > 
> > > All of those failure scenario seem like warnable offences when KVM thinks it has
> > > configured the IRQ to be forwarded to a vCPU.
> > 
> > I tend to agree here, especially considering how horribly fragile GICv4
> > has been in some systems. I know of a couple implementations where ITS
> > command failures and/or unmapped MSIs are fatal for the entire machine.
> > Debugging them has been a genuine pain in the ass.
> > 
> > WARN'ing when state tracking for vLPIs is out of whack would've made it
> > a little easier.
> 
> Marc, does this look and read better?
> 
> I'd really, really like to get this sorted out asap, as it's the only thing
> blocking the series, and I want to get the series into linux-next early next
> week, before I go OOO for ~10 days.
> 
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Thu, 12 Jun 2025 16:51:47 -0700
> Subject: [PATCH] KVM: arm64: WARN if unmapping a vLPI fails in any path
> 
> When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if
> failure occurs when freeing an ITE.  If undoing vCPU affinity fails, then
> odds are very good that vLPI state tracking has has gotten out of whack,
> i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI.  At best,
> inconsistent state means there is a lurking bug/flaw somewhere.  At worst,
> the inconsistency could eventually be fatal to the host, e.g. if an ITS
> command fails because KVM's view of things doesn't match reality/hardware.

Btw, we finally figured out the reason some machines were just going
dark on kexec, with the new kernel being corrupted. It turns out the
GIC is still scribbling on the vLPI Pending Table even after it isn't
the vLPI Pending Table any more, and is now part of the new kernel's
text.

In my queue I have a patch to call unmap_all_vpes() ∀ kvm on kexec,
which makes the problem go away.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 17:22         ` Sean Christopherson
  2025-06-20 18:00           ` David Woodhouse
@ 2025-06-20 18:48           ` Oliver Upton
  2025-06-20 19:04             ` Sean Christopherson
  1 sibling, 1 reply; 100+ messages in thread
From: Oliver Upton @ 2025-06-20 18:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 20, 2025 at 10:22:32AM -0700, Sean Christopherson wrote:
> On Fri, Jun 13, 2025, Oliver Upton wrote:
> > On Thu, Jun 12, 2025 at 07:34:35AM -0700, Sean Christopherson wrote:
> > > On Thu, Jun 12, 2025, Marc Zyngier wrote:
> > > > But not having an VLPI mapping for an interrupt at the point where we're
> > > > tearing down the forwarding is pretty benign. IRQs *still* go where they
> > > > should, and we don't lose anything.
> > 
> > The VM may not actually be getting torn down, though. The series of
> > fixes [*] we took for 6.16 addressed games that VMMs might be playing on
> > irqbypass for a live VM.
> > 
> > [*] https://lore.kernel.org/kvmarm/20250523194722.4066715-1-oliver.upton@linux.dev/
> > 
> > > All of those failure scenario seem like warnable offences when KVM thinks it has
> > > configured the IRQ to be forwarded to a vCPU.
> > 
> > I tend to agree here, especially considering how horribly fragile GICv4
> > has been in some systems. I know of a couple implementations where ITS
> > command failures and/or unmapped MSIs are fatal for the entire machine.
> > Debugging them has been a genuine pain in the ass.
> > 
> > WARN'ing when state tracking for vLPIs is out of whack would've made it
> > a little easier.
> 
> Marc, does this look and read better?
> 
> I'd really, really like to get this sorted out asap, as it's the only thing
> blocking the series, and I want to get the series into linux-next early next
> week, before I go OOO for ~10 days.

Can you just send it out as a standalone patch? It's only tangientally
related to the truckload of x86 stuff that I'd rather not pull in the
event of conflict resolution.

Thanks,
Oliver

> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Thu, 12 Jun 2025 16:51:47 -0700
> Subject: [PATCH] KVM: arm64: WARN if unmapping a vLPI fails in any path
> 
> When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if
> failure occurs when freeing an ITE.  If undoing vCPU affinity fails, then
> odds are very good that vLPI state tracking has has gotten out of whack,
> i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI.  At best,
> inconsistent state means there is a lurking bug/flaw somewhere.  At worst,
> the inconsistency could eventually be fatal to the host, e.g. if an ITS
> command fails because KVM's view of things doesn't match reality/hardware.
> 
> Note, only the call from kvm_arch_irq_bypass_del_producer() by way of
> kvm_vgic_v4_unset_forwarding() doesn't already WARN.  Common KVM's
> kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails.
> For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(),
> the only possible causes are that the GIC doesn't have a v4 ITS (from
> its_irq_set_vcpu_affinity()):
> 
>         /* Need a v4 ITS */
>         if (!is_v4(its_dev->its))
>                 return -EINVAL;
> 
>         guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);
> 
>         /* Unmap request? */
>         if (!info)
>                 return its_vlpi_unmap(d);
> 
> or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):
> 
>         if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
>                 return -EINVAL;
> 
> All of the above failure scenarios are warnable offences, as they should
> never occur absent a kernel/KVM bug.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/arm64/kvm/vgic/vgic-its.c     | 2 +-
>  arch/arm64/kvm/vgic/vgic-v4.c      | 4 ++--
>  drivers/irqchip/irq-gic-v4.c       | 4 ++--
>  include/linux/irqchip/arm-gic-v4.h | 2 +-
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
> index 534049c7c94b..98630dae910d 100644
> --- a/arch/arm64/kvm/vgic/vgic-its.c
> +++ b/arch/arm64/kvm/vgic/vgic-its.c
> @@ -758,7 +758,7 @@ static void its_free_ite(struct kvm *kvm, struct its_ite *ite)
>  	if (irq) {
>  		scoped_guard(raw_spinlock_irqsave, &irq->irq_lock) {
>  			if (irq->hw)
> -				WARN_ON(its_unmap_vlpi(ite->irq->host_irq));
> +				its_unmap_vlpi(ite->irq->host_irq);
>  
>  			irq->hw = false;
>  		}
> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index 193946108192..911170d4a9c8 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -545,10 +545,10 @@ int kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq)
>  	if (irq->hw) {
>  		atomic_dec(&irq->target_vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count);
>  		irq->hw = false;
> -		ret = its_unmap_vlpi(host_irq);
> +		its_unmap_vlpi(host_irq);
>  	}
>  
>  	raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
>  	vgic_put_irq(kvm, irq);
> -	return ret;
> +	return 0;
>  }
> diff --git a/drivers/irqchip/irq-gic-v4.c b/drivers/irqchip/irq-gic-v4.c
> index 58c28895f8c4..8455b4a5fbb0 100644
> --- a/drivers/irqchip/irq-gic-v4.c
> +++ b/drivers/irqchip/irq-gic-v4.c
> @@ -342,10 +342,10 @@ int its_get_vlpi(int irq, struct its_vlpi_map *map)
>  	return irq_set_vcpu_affinity(irq, &info);
>  }
>  
> -int its_unmap_vlpi(int irq)
> +void its_unmap_vlpi(int irq)
>  {
>  	irq_clear_status_flags(irq, IRQ_DISABLE_UNLAZY);
> -	return irq_set_vcpu_affinity(irq, NULL);
> +	WARN_ON_ONCE(irq_set_vcpu_affinity(irq, NULL));
>  }
>  
>  int its_prop_update_vlpi(int irq, u8 config, bool inv)
> diff --git a/include/linux/irqchip/arm-gic-v4.h b/include/linux/irqchip/arm-gic-v4.h
> index 7f1f11a5e4e4..0b0887099fd7 100644
> --- a/include/linux/irqchip/arm-gic-v4.h
> +++ b/include/linux/irqchip/arm-gic-v4.h
> @@ -146,7 +146,7 @@ int its_commit_vpe(struct its_vpe *vpe);
>  int its_invall_vpe(struct its_vpe *vpe);
>  int its_map_vlpi(int irq, struct its_vlpi_map *map);
>  int its_get_vlpi(int irq, struct its_vlpi_map *map);
> -int its_unmap_vlpi(int irq);
> +void its_unmap_vlpi(int irq);
>  int its_prop_update_vlpi(int irq, u8 config, bool inv);
>  int its_prop_update_vsgi(int irq, u8 priority, bool group);
>  
> 
> base-commit: 4fc39a165c70a49991b7cc29be3a19eddcd9e5b9
> --
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 18:48           ` Oliver Upton
@ 2025-06-20 19:04             ` Sean Christopherson
  2025-06-20 19:27               ` Oliver Upton
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-20 19:04 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 20, 2025, Oliver Upton wrote:
> On Fri, Jun 20, 2025 at 10:22:32AM -0700, Sean Christopherson wrote:
> > On Fri, Jun 13, 2025, Oliver Upton wrote:
> > > On Thu, Jun 12, 2025 at 07:34:35AM -0700, Sean Christopherson wrote:
> > > > On Thu, Jun 12, 2025, Marc Zyngier wrote:
> > > > > But not having an VLPI mapping for an interrupt at the point where we're
> > > > > tearing down the forwarding is pretty benign. IRQs *still* go where they
> > > > > should, and we don't lose anything.
> > > 
> > > The VM may not actually be getting torn down, though. The series of
> > > fixes [*] we took for 6.16 addressed games that VMMs might be playing on
> > > irqbypass for a live VM.
> > > 
> > > [*] https://lore.kernel.org/kvmarm/20250523194722.4066715-1-oliver.upton@linux.dev/
> > > 
> > > > All of those failure scenario seem like warnable offences when KVM thinks it has
> > > > configured the IRQ to be forwarded to a vCPU.
> > > 
> > > I tend to agree here, especially considering how horribly fragile GICv4
> > > has been in some systems. I know of a couple implementations where ITS
> > > command failures and/or unmapped MSIs are fatal for the entire machine.
> > > Debugging them has been a genuine pain in the ass.
> > > 
> > > WARN'ing when state tracking for vLPIs is out of whack would've made it
> > > a little easier.
> > 
> > Marc, does this look and read better?
> > 
> > I'd really, really like to get this sorted out asap, as it's the only thing
> > blocking the series, and I want to get the series into linux-next early next
> > week, before I go OOO for ~10 days.
> 
> Can you just send it out as a standalone patch? It's only tangientally
> related to the truckload of x86 stuff

The issue is that "KVM: Don't WARN if updating IRQ bypass route fails" directly
depends on both this patch and decent chunk of the x86 crud.  I could probably
trim some of the x86 crud by reshuffling patches around, but I can't get rid of
it entirely.

> that I'd rather not pull in the event of conflict resolution.

LOL, why not?  :-)

If I post it as a standalone patch, could you/Marc put it into a stable topic
branch based on kvm/master? (kvm/master now has patch 1, yay!)  Then I can create
a topic branch for this mountain of stuff based on the arm64 topic branch.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 19:04             ` Sean Christopherson
@ 2025-06-20 19:27               ` Oliver Upton
  2025-06-20 20:31                 ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Oliver Upton @ 2025-06-20 19:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 20, 2025 at 12:04:19PM -0700, Sean Christopherson wrote:
> On Fri, Jun 20, 2025, Oliver Upton wrote:
> > Can you just send it out as a standalone patch? It's only tangientally
> > related to the truckload of x86 stuff
> 
> The issue is that "KVM: Don't WARN if updating IRQ bypass route fails" directly
> depends on both this patch

Eh? At worst we break bisection for the warn, which I'm sure we manage
to commit similar high crimes on the regular.

> If I post it as a standalone patch, could you/Marc put it into a stable topic
> branch based on kvm/master? (kvm/master now has patch 1, yay!)  Then I can create
> a topic branch for this mountain of stuff based on the arm64 topic branch.

Ok, how about making the arm64 piece patch 1 in your series and you take
the whole pile. If we need it, I'll bug you for a ref that only has the
first change.

That ok?

Thanks,
Oliver

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 19:27               ` Oliver Upton
@ 2025-06-20 20:31                 ` Sean Christopherson
  2025-06-20 20:45                   ` Oliver Upton
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-20 20:31 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 20, 2025, Oliver Upton wrote:
> On Fri, Jun 20, 2025 at 12:04:19PM -0700, Sean Christopherson wrote:
> > If I post it as a standalone patch, could you/Marc put it into a stable topic
> > branch based on kvm/master? (kvm/master now has patch 1, yay!)  Then I can create
> > a topic branch for this mountain of stuff based on the arm64 topic branch.
> 
> Ok, how about making the arm64 piece patch 1 in your series and you take
> the whole pile. If we need it, I'll bug you for a ref that only has the
> first change.

Any preference as to whether I formally post the last version, or if I apply it
directly from this thread?

> That ok?

Ya, works for me.  What's the going bribe rate for an ack these days?  :-D

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails
  2025-06-20 20:31                 ` Sean Christopherson
@ 2025-06-20 20:45                   ` Oliver Upton
  0 siblings, 0 replies; 100+ messages in thread
From: Oliver Upton @ 2025-06-20 20:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Paolo Bonzini, Joerg Roedel, David Woodhouse,
	Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Fri, Jun 20, 2025 at 01:31:55PM -0700, Sean Christopherson wrote:
> On Fri, Jun 20, 2025, Oliver Upton wrote:
> > On Fri, Jun 20, 2025 at 12:04:19PM -0700, Sean Christopherson wrote:
> > > If I post it as a standalone patch, could you/Marc put it into a stable topic
> > > branch based on kvm/master? (kvm/master now has patch 1, yay!)  Then I can create
> > > a topic branch for this mountain of stuff based on the arm64 topic branch.
> > 
> > Ok, how about making the arm64 piece patch 1 in your series and you take
> > the whole pile. If we need it, I'll bug you for a ref that only has the
> > first change.
> 
> Any preference as to whether I formally post the last version, or if I apply it
> directly from this thread?

Since I have this email on my phone I'm always in favor of not getting
spammed :) This thread has all the discourse too, so likely a better
artifact.

> > That ok?
> 
> Ya, works for me.  What's the going bribe rate for an ack these days?  :-D

Usually about the cost of a pint ;-) For the arm64 bits:

Acked-by: Oliver Upton <oliver.upton@linux.dev>

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
  2025-06-20 14:39     ` Sean Christopherson
@ 2025-06-23 10:45       ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-23 10:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Fri, Jun 20, 2025 at 07:39:16AM -0700, Sean Christopherson wrote:
> On Thu, Jun 19, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:20PM -0700, Sean Christopherson wrote:
> > > From: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
> > > module param, by never setting IsRunning.  SVM doesn't provide a way to
> > > disable IPI virtualization in hardware, but by ensuring CPUs never see
> > > IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
> > > VM-Exit.
> > 
> > I think this is good to have regardless of the erratum. Not sure about VMX,
> > but does it make sense to intercept writes to the self-ipi MSR as well?
> 
> That doesn't work for AVIC, i.e. if the guest is MMIO to access the virtual APIC.

Right, I was thinking about the Self-IPI MSR, but the ICR will also need 
to be intercepted.

> 
> Regardless, I don't see any reason to manually intercept self-IPIs when IPI
> virtualization is disabled.  AFAIK, there's no need to do so for correctness,
> and Intel's self-IPI virtualization isn't tied to IPI virtualization either.
> Self-IPI virtualization is enabled by virtual interrupt delivery, which in turn
> is enabled by KVM when enable_apicv is true:
> 
>   Self-IPI virtualization occurs only if the “virtual-interrupt delivery”
>   VM-execution control is 1.

Excellent, for this patch:
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>

Thanks,
Naveen

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
  2025-06-11 22:45 ` [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
@ 2025-06-23 14:05   ` Naveen N Rao
  2025-06-23 15:30     ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Naveen N Rao @ 2025-06-23 14:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:21PM -0700, Sean Christopherson wrote:
> From: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
> hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
> write emulation, and might fail to VM-Exit on the sending vCPU, if
> IsRunning was recently cleared.
> 
> The absence of the VM-Exit leads to KVM not waking (or triggering nested
> VM-Exit) of the target vCPU(s) of the IPI, which can lead to hung vCPUs,
  ^^^^^^^^^^^
  VM-Exit of)

> unbounded delays in L2 execution, etc.
> 
> To workaround the erratum, simply disable IPI virtualization, which
> prevents KVM from setting IsRunning and thus eliminates the race where
> hardware sees a stale IsRunning=1.  As a result, all ICR writes (except
> when "Self" shorthand is used) will VM-Exit and therefore be correctly
> emulated by KVM.
> 
> Disabling IPI virtualization does carry a performance penalty, but
> benchmarkng shows that enabling AVIC without IPI virtualization is still
> much better than not using AVIC at all, because AVIC still accelerates
> posted interrupts and the receiving end of the IPIs.
> 
> Note, when virtualizaing Self-IPIs, the CPU skips reading the physical ID
	     ^^^^^^^^^^^^^
	     virtualizing

> table and updates the vIRR directly (because the vCPU is by definition
> actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
> is still accelerated by hardware.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> [sean: rebase, massage changelog, disallow user override]
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 48c737e1200a..bf8b59556373 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -1187,6 +1187,14 @@ bool avic_hardware_setup(void)
>  	if (x2avic_enabled)
>  		pr_info("x2AVIC enabled\n");
>  
> +	/*
> +	 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
> +	 * due to erratum 1235, which results in missed GA log events and thus
							^^^^^^^^^^^^^
Not sure I understand the reference to GA log events here -- those are 
only for device interrupts and not IPIs.

> +	 * missed wake events for blocking vCPUs due to the CPU failing to see
> +	 * a software update to clear IsRunning.
> +	 */
> +	enable_ipiv = enable_ipiv && boot_cpu_data.x86 != 0x17;
> +

Apart from the above, this LGTM.
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>


- Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
  2025-06-23 14:05   ` Naveen N Rao
@ 2025-06-23 15:30     ` Sean Christopherson
  0 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-23 15:30 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Mon, Jun 23, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:21PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index 48c737e1200a..bf8b59556373 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -1187,6 +1187,14 @@ bool avic_hardware_setup(void)
> >  	if (x2avic_enabled)
> >  		pr_info("x2AVIC enabled\n");
> >  
> > +	/*
> > +	 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
> > +	 * due to erratum 1235, which results in missed GA log events and thus
> 							^^^^^^^^^^^^^
> Not sure I understand the reference to GA log events here -- those are 
> only for device interrupts and not IPIs.

Doh, you're absolutely right.  I'll fix to this when applying:

	/*
	 * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
	 * due to erratum 1235, which results in missed VM-Exits on the sender
	 * and thus missed wake events for blocking vCPUs due to the CPU
	 * failing to see a software update to clear IsRunning.
	 */

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
  2025-06-11 22:45 ` [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
@ 2025-06-23 15:54   ` Naveen N Rao
  2025-06-23 16:18     ` Sean Christopherson
  0 siblings, 1 reply; 100+ messages in thread
From: Naveen N Rao @ 2025-06-23 15:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Wed, Jun 11, 2025 at 03:45:23PM -0700, Sean Christopherson wrote:
> Add a comment to explain why KVM clears IsRunning when putting a vCPU,
> even though leaving IsRunning=1 would be ok from a functional perspective.
> Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
> fast as to induce a 50%+ loss in performance.
> 
> Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/avic.c | 31 ++++++++++++++++++-------------
>  1 file changed, 18 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index bf8b59556373..3cf929ac117f 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -1121,19 +1121,24 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
>  	if (!kvm_vcpu_apicv_active(vcpu))
>  		return;
>  
> -       /*
> -        * Unload the AVIC when the vCPU is about to block, _before_
> -        * the vCPU actually blocks.
> -        *
> -        * Any IRQs that arrive before IsRunning=0 will not cause an
> -        * incomplete IPI vmexit on the source, therefore vIRR will also
> -        * be checked by kvm_vcpu_check_block() before blocking.  The
> -        * memory barrier implicit in set_current_state orders writing
> -        * IsRunning=0 before reading the vIRR.  The processor needs a
> -        * matching memory barrier on interrupt delivery between writing
> -        * IRR and reading IsRunning; the lack of this barrier might be
> -        * the cause of errata #1235).
> -        */
> +	/*
> +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> +	 * actually blocks.
> +	 *
> +	 * Note, any IRQs that arrive before IsRunning=0 will not cause an
> +	 * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
> +	 * this by checking vIRR one last time before blocking.  The memory
> +	 * barrier implicit in set_current_state orders writing IsRunning=0
> +	 * before reading the vIRR.  The processor needs a matching memory
> +	 * barrier on interrupt delivery between writing IRR and reading
> +	 * IsRunning; the lack of this barrier might be the cause of errata #1235).
> +	 *
> +	 * Clear IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
> +	 * doesn't need to detect events for scheduling purposes.  The doorbell

Nit: just IsRunning (you can drop the =0 part).

Trying to understand the significance of IRQs being disabled here. Is 
that a path KVM tries to optimize? Theoretically, it looks like we want 
to clear IsRunning regardless of whether the vCPU is blocked so as to 
prevent the guest from spamming the host with AVIC doorbells -- compared 
to always keeping IsRunning set so as to speed up VM entry/exit.

> +	 * used to signal running vCPUs cannot be blocked, i.e. will perturb the
> +	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
> +	 * to the vCPU while it's scheduled out.
> +	 */
>  	avic_vcpu_put(vcpu);
>  }

Otherwise, this LGTM.
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>


Thanks,
Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
  2025-06-23 15:54   ` Naveen N Rao
@ 2025-06-23 16:18     ` Sean Christopherson
  2025-06-25 15:28       ` Naveen N Rao
  0 siblings, 1 reply; 100+ messages in thread
From: Sean Christopherson @ 2025-06-23 16:18 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Mon, Jun 23, 2025, Naveen N Rao wrote:
> On Wed, Jun 11, 2025 at 03:45:23PM -0700, Sean Christopherson wrote:
> > Add a comment to explain why KVM clears IsRunning when putting a vCPU,
> > even though leaving IsRunning=1 would be ok from a functional perspective.
> > Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
> > fast as to induce a 50%+ loss in performance.
> > 
> > Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
> > Cc: Maxim Levitsky <mlevitsk@redhat.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/svm/avic.c | 31 ++++++++++++++++++-------------
> >  1 file changed, 18 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index bf8b59556373..3cf929ac117f 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -1121,19 +1121,24 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
> >  	if (!kvm_vcpu_apicv_active(vcpu))
> >  		return;
> >  
> > -       /*
> > -        * Unload the AVIC when the vCPU is about to block, _before_
> > -        * the vCPU actually blocks.
> > -        *
> > -        * Any IRQs that arrive before IsRunning=0 will not cause an
> > -        * incomplete IPI vmexit on the source, therefore vIRR will also
> > -        * be checked by kvm_vcpu_check_block() before blocking.  The
> > -        * memory barrier implicit in set_current_state orders writing
> > -        * IsRunning=0 before reading the vIRR.  The processor needs a
> > -        * matching memory barrier on interrupt delivery between writing
> > -        * IRR and reading IsRunning; the lack of this barrier might be
> > -        * the cause of errata #1235).
> > -        */
> > +	/*
> > +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> > +	 * actually blocks.
> > +	 *
> > +	 * Note, any IRQs that arrive before IsRunning=0 will not cause an
> > +	 * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
> > +	 * this by checking vIRR one last time before blocking.  The memory
> > +	 * barrier implicit in set_current_state orders writing IsRunning=0
> > +	 * before reading the vIRR.  The processor needs a matching memory
> > +	 * barrier on interrupt delivery between writing IRR and reading
> > +	 * IsRunning; the lack of this barrier might be the cause of errata #1235).
> > +	 *
> > +	 * Clear IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
> > +	 * doesn't need to detect events for scheduling purposes.  The doorbell
> 
> Nit: just IsRunning (you can drop the =0 part).

Hmm, not really.  It could be:

	/* Note, any IRQs that arrive while IsRunning is set will not cause an

or

	/* Note, any IRQs that arrive while IsRunning=1 will not cause an

but that's just regurgitating the spec.  The slightly more interesting scenario
that's being described here is what will happen if an IRQ arrives _just_ before
the below code toggle IsRunning from 1 => 0.

> Trying to understand the significance of IRQs being disabled here. Is 
> that a path KVM tries to optimize?

Yep.  KVM doesn't need a notification for the undelivered (virtual) IRQ, because
it won't be handled by the vCPU until the vCPU enables IRQs, and thus it's not a
valid wake event for the vCPU.

So, *if* spurious doorbells didn't affect performance or functionality, then
ideally KVM would leave IsRunning=1, e.g. so that the IOMMU doesn't need to
generate GA log events, and so that other vCPUs aren't forced to VM-Exit when
sending an IPI.  Unfortunately, spurious doorbells are quite intrusive, and so
KVM "needs" to clear IsRunning.

> Theoretically, it looks like we want to clear IsRunning regardless of whether
> the vCPU is blocked so as to prevent the guest from spamming the host with
> AVIC doorbells -- compared to always keeping IsRunning set so as to speed up
> VM entry/exit.

Yep, exactly.

> > +	 * used to signal running vCPUs cannot be blocked, i.e. will perturb the
> > +	 * CPU and cause noisy neighbor problems if the VM is sending interrupts
> > +	 * to the vCPU while it's scheduled out.
> > +	 */
> >  	avic_vcpu_put(vcpu);
> >  }
> 
> Otherwise, this LGTM.
> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
> 
> 
> Thanks,
> Naveen
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support
  2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
                   ` (61 preceding siblings ...)
  2025-06-11 22:46 ` [PATCH v3 62/62] KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq() Sean Christopherson
@ 2025-06-24 19:38 ` Sean Christopherson
  62 siblings, 0 replies; 100+ messages in thread
From: Sean Christopherson @ 2025-06-24 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Marc Zyngier, Oliver Upton, Paolo Bonzini,
	Joerg Roedel, David Woodhouse, Lu Baolu
  Cc: linux-arm-kernel, kvmarm, kvm, iommu, linux-kernel,
	Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky, Joao Martins,
	Francesco Lavra, David Matlack

On Wed, Jun 11, 2025, Sean Christopherson wrote:
> TL;DR: Overhaul device posted interrupts in KVM and IOMMU, and AVIC in
>        general.
> 
> This applies on the series to add CONFIG_KVM_IOAPIC (and to kill irq_comm.c):
> 
>   https://lore.kernel.org/all/20250611213557.294358-1-seanjc@google.com
> 
> [...]

Applied to kvm-x86 irqs (except for patch 1, which is already in Linus' tree),
with minor massaging (comments and changelogs) to address feedback.

Note, in the branch, the arm64 vLPI is split quite far from the rest of the
series.  This is by design so that I can provide a stable branch/tag for arm64
if a conflict or dependency crops up, without the arm64 patch inheriting all of
the dependencies and prep work for the series at large.

[01/62] KVM: arm64: Explicitly treat routing entry type changes as changes
        (already in Linus' tree)
[02/62] KVM: arm64: WARN if unmapping a vLPI fails in any path
        https://github.com/kvm-x86/linux/commit/cd4178d19420
[03/62] KVM: Pass new routing entries and irqfd when updating IRTEs
        https://github.com/kvm-x86/linux/commit/cb210737675e
[04/62] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
        https://github.com/kvm-x86/linux/commit/05c5e23657e1
[05/62] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
        https://github.com/kvm-x86/linux/commit/0a917e9d4b70
[06/62] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
        https://github.com/kvm-x86/linux/commit/1da19c5ce053
[07/62] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
        https://github.com/kvm-x86/linux/commit/a0ca34bb1aad
[08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR
        https://github.com/kvm-x86/linux/commit/430579577892
[09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks
        https://github.com/kvm-x86/linux/commit/2e002ddc8966
[10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page
        https://github.com/kvm-x86/linux/commit/3338c639da15
[11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field
        https://github.com/kvm-x86/linux/commit/d8527f133c0a
[12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation
        https://github.com/kvm-x86/linux/commit/1aa6e256e46f
[13/62] KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation
        https://github.com/kvm-x86/linux/commit/c24ed209c474
[14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"
        https://github.com/kvm-x86/linux/commit/26baab4eea4c
[15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer
        https://github.com/kvm-x86/linux/commit/d29433336a7b
[16/62] KVM: VMX: Move enable_ipiv knob to common x86
        https://github.com/kvm-x86/linux/commit/bafddc70001d
[17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled
        https://github.com/kvm-x86/linux/commit/d921665e01ba
[18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235
        https://github.com/kvm-x86/linux/commit/8de4a1c8164e
[19/62] KVM: VMX: Suppress PI notifications whenever the vCPU is put
        https://github.com/kvm-x86/linux/commit/6737557442e5
[20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
        https://github.com/kvm-x86/linux/commit/52d826c9e54c
[21/62] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr
        https://github.com/kvm-x86/linux/commit/c4cdbaf9d81c
[22/62] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"
        https://github.com/kvm-x86/linux/commit/95d50ebe6df8
[23/62] KVM: SVM: Stop walking list of routing table entries when updating IRTE
        https://github.com/kvm-x86/linux/commit/1e663ed23992
[24/62] KVM: VMX: Stop walking list of routing table entries when updating IRTE
        https://github.com/kvm-x86/linux/commit/23ca102e6fb2
[25/62] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()
        https://github.com/kvm-x86/linux/commit/0a64c447f6f8
[26/62] KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c
        https://github.com/kvm-x86/linux/commit/f5369619f7f8
[27/62] KVM: x86: Nullify irqfd->producer after updating IRTEs
        https://github.com/kvm-x86/linux/commit/9517aedecd0e
[28/62] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU
        https://github.com/kvm-x86/linux/commit/cf04ec393ed0
[29/62] KVM: x86: Move posted interrupt tracepoint to common code
        https://github.com/kvm-x86/linux/commit/c5af31698d71
[30/62] KVM: SVM: Clean up return handling in avic_pi_update_irte()
        https://github.com/kvm-x86/linux/commit/803928483669
[31/62] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs
        https://github.com/kvm-x86/linux/commit/53527ea1b702
[32/62] KVM: Don't WARN if updating IRQ bypass route fails
        https://github.com/kvm-x86/linux/commit/b33252b9d172
[33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()
        https://github.com/kvm-x86/linux/commit/77bb184ab880
[34/62] KVM: x86: Track irq_bypass_vcpu in common x86 code
        https://github.com/kvm-x86/linux/commit/511754bc548b
[35/62] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted
        https://github.com/kvm-x86/linux/commit/dc6adb13046a
[36/62] KVM: x86: Don't update IRTE entries when old and new routes were !MSI
        https://github.com/kvm-x86/linux/commit/cc8b13105eac
[37/62] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata
        https://github.com/kvm-x86/linux/commit/71d6b3b8e69d
[38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
        https://github.com/kvm-x86/linux/commit/c3d591c91f9c
[39/62] iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify
        https://github.com/kvm-x86/linux/commit/3be405e89f3d
[40/62] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
        https://github.com/kvm-x86/linux/commit/08d9ccdd1a5c
[41/62] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info
        https://github.com/kvm-x86/linux/commit/0b2b541fa3cd
[42/62] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
        https://github.com/kvm-x86/linux/commit/f965255dc503
[43/62] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
        https://github.com/kvm-x86/linux/commit/6df262f915ab
[44/62] KVM: SVM: Don't check for assigned device(s) when updating affinity
        https://github.com/kvm-x86/linux/commit/f5998661ff73
[45/62] KVM: SVM: Don't check for assigned device(s) when activating AVIC
        https://github.com/kvm-x86/linux/commit/fe0213923dd9
[46/62] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
        https://github.com/kvm-x86/linux/commit/16562766f171
[47/62] KVM: SVM: Process all IRTEs on affinity change even if one update fails
        https://github.com/kvm-x86/linux/commit/48f79c6c86b3
[48/62] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
        https://github.com/kvm-x86/linux/commit/cd86240fea26
[49/62] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()
        https://github.com/kvm-x86/linux/commit/04c4ca0ae479
[50/62] KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
        https://github.com/kvm-x86/linux/commit/d1bccaa1793d
[51/62] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC
        https://github.com/kvm-x86/linux/commit/25ef059e8bc5
[52/62] KVM: SVM: WARN if ir_list is non-empty at vCPU free
        https://github.com/kvm-x86/linux/commit/99836eb9c5dc
[53/62] KVM: x86: Decouple device assignment from IRQ bypass
        https://github.com/kvm-x86/linux/commit/77e1b8332d1d
[54/62] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass
        https://github.com/kvm-x86/linux/commit/ce9d54f41be0
[55/62] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
        https://github.com/kvm-x86/linux/commit/11a60455d4c9
[56/62] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support
        https://github.com/kvm-x86/linux/commit/a23480fe21de
[57/62] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
        https://github.com/kvm-x86/linux/commit/f2bc961d383b
[58/62] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
        https://github.com/kvm-x86/linux/commit/6eab2340f339
[59/62] KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
        https://github.com/kvm-x86/linux/commit/5f3d06b1648e
[60/62] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
        https://github.com/kvm-x86/linux/commit/b9e53f9ff4a8
[61/62] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
        https://github.com/kvm-x86/linux/commit/b03500f03ea0
[62/62] KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()
        https://github.com/kvm-x86/linux/commit/6f343724837b

--
https://github.com/kvm-x86/kvm-unit-tests/tree/next

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking
  2025-06-23 16:18     ` Sean Christopherson
@ 2025-06-25 15:28       ` Naveen N Rao
  0 siblings, 0 replies; 100+ messages in thread
From: Naveen N Rao @ 2025-06-25 15:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, Oliver Upton, Paolo Bonzini, Joerg Roedel,
	David Woodhouse, Lu Baolu, linux-arm-kernel, kvmarm, kvm, iommu,
	linux-kernel, Sairaj Kodilkar, Vasant Hegde, Maxim Levitsky,
	Joao Martins, Francesco Lavra, David Matlack

On Mon, Jun 23, 2025 at 09:18:45AM -0700, Sean Christopherson wrote:
> On Mon, Jun 23, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:23PM -0700, Sean Christopherson wrote:
> > > Add a comment to explain why KVM clears IsRunning when putting a vCPU,
> > > even though leaving IsRunning=1 would be ok from a functional perspective.
> > > Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
> > > fast as to induce a 50%+ loss in performance.
> > > 
> > > Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
> > > Cc: Maxim Levitsky <mlevitsk@redhat.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/svm/avic.c | 31 ++++++++++++++++++-------------
> > >  1 file changed, 18 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > index bf8b59556373..3cf929ac117f 100644
> > > --- a/arch/x86/kvm/svm/avic.c
> > > +++ b/arch/x86/kvm/svm/avic.c
> > > @@ -1121,19 +1121,24 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
> > >  	if (!kvm_vcpu_apicv_active(vcpu))
> > >  		return;
> > >  
> > > -       /*
> > > -        * Unload the AVIC when the vCPU is about to block, _before_
> > > -        * the vCPU actually blocks.
> > > -        *
> > > -        * Any IRQs that arrive before IsRunning=0 will not cause an
> > > -        * incomplete IPI vmexit on the source, therefore vIRR will also
> > > -        * be checked by kvm_vcpu_check_block() before blocking.  The
> > > -        * memory barrier implicit in set_current_state orders writing
> > > -        * IsRunning=0 before reading the vIRR.  The processor needs a
> > > -        * matching memory barrier on interrupt delivery between writing
> > > -        * IRR and reading IsRunning; the lack of this barrier might be
> > > -        * the cause of errata #1235).
> > > -        */
> > > +	/*
> > > +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> > > +	 * actually blocks.
> > > +	 *
> > > +	 * Note, any IRQs that arrive before IsRunning=0 will not cause an
> > > +	 * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
> > > +	 * this by checking vIRR one last time before blocking.  The memory
> > > +	 * barrier implicit in set_current_state orders writing IsRunning=0
> > > +	 * before reading the vIRR.  The processor needs a matching memory
> > > +	 * barrier on interrupt delivery between writing IRR and reading
> > > +	 * IsRunning; the lack of this barrier might be the cause of errata #1235).
> > > +	 *
> > > +	 * Clear IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
> > > +	 * doesn't need to detect events for scheduling purposes.  The doorbell
> > 
> > Nit: just IsRunning (you can drop the =0 part).
> 
> Hmm, not really.  It could be:
> 
> 	/* Note, any IRQs that arrive while IsRunning is set will not cause an
> 
> or
> 
> 	/* Note, any IRQs that arrive while IsRunning=1 will not cause an
> 
> but that's just regurgitating the spec.  The slightly more interesting scenario
> that's being described here is what will happen if an IRQ arrives _just_ before
> the below code toggle IsRunning from 1 => 0.

Oh, you're right. I was referring to the last paragraph starting with 
"Clear IsRunning=0 even if", which to me feels like a double negation. 
It reads better if it is "Clear IsRunning even if ..."

> 
> > Trying to understand the significance of IRQs being disabled here. Is 
> > that a path KVM tries to optimize?
> 
> Yep.  KVM doesn't need a notification for the undelivered (virtual) IRQ, because
> it won't be handled by the vCPU until the vCPU enables IRQs, and thus it's not a
> valid wake event for the vCPU.
> 
> So, *if* spurious doorbells didn't affect performance or functionality, then
> ideally KVM would leave IsRunning=1, e.g. so that the IOMMU doesn't need to
> generate GA log events, and so that other vCPUs aren't forced to VM-Exit when
> sending an IPI.  Unfortunately, spurious doorbells are quite intrusive, and so
> KVM "needs" to clear IsRunning.

Ok, got that. kvm_arch_vcpu_blocking() looks like it exists precisely 
for this reason.

Thanks,
Naveen


^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2025-06-25 15:38 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 22:45 [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 01/62] KVM: arm64: Explicitly treat routing entry type changes as changes Sean Christopherson
2025-06-13 19:43   ` Oliver Upton
2025-06-19 12:36   ` (subset) " Marc Zyngier
2025-06-11 22:45 ` [PATCH v3 02/62] KVM: arm64: WARN if unmapping vLPI fails Sean Christopherson
2025-06-12 11:59   ` Marc Zyngier
2025-06-12 14:34     ` Sean Christopherson
2025-06-13 20:47       ` Oliver Upton
2025-06-20 17:22         ` Sean Christopherson
2025-06-20 18:00           ` David Woodhouse
2025-06-20 18:48           ` Oliver Upton
2025-06-20 19:04             ` Sean Christopherson
2025-06-20 19:27               ` Oliver Upton
2025-06-20 20:31                 ` Sean Christopherson
2025-06-20 20:45                   ` Oliver Upton
2025-06-11 22:45 ` [PATCH v3 03/62] KVM: Pass new routing entries and irqfd when updating IRTEs Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 04/62] KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 05/62] KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 06/62] iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 07/62] KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 08/62] KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR Sean Christopherson
2025-06-13 14:15   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 09/62] KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks Sean Christopherson
2025-06-13 14:37   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 10/62] KVM: SVM: Add helper to deduplicate code for getting AVIC backing page Sean Christopherson
2025-06-13 14:38   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 11/62] KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field Sean Christopherson
2025-06-13 14:44   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation Sean Christopherson
2025-06-17 14:25   ` Naveen N Rao
2025-06-17 16:10     ` Sean Christopherson
2025-06-18 14:33       ` Naveen N Rao
2025-06-18 20:59         ` Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 13/62] KVM: SVM: Drop redundant check in AVIC code on ID during " Sean Christopherson
2025-06-17 14:49   ` Naveen N Rao
2025-06-17 16:33     ` Sean Christopherson
2025-06-18 14:39       ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 14/62] KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages" Sean Christopherson
2025-06-17 15:01   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 15/62] KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer Sean Christopherson
2025-06-19 11:09   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 16/62] KVM: VMX: Move enable_ipiv knob to common x86 Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 17/62] KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Sean Christopherson
2025-06-19 11:31   ` Naveen N Rao
2025-06-19 12:01     ` Naveen N Rao
2025-06-20 14:39     ` Sean Christopherson
2025-06-23 10:45       ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 18/62] KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235 Sean Christopherson
2025-06-23 14:05   ` Naveen N Rao
2025-06-23 15:30     ` Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 19/62] KVM: VMX: Suppress PI notifications whenever the vCPU is put Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 20/62] KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking Sean Christopherson
2025-06-23 15:54   ` Naveen N Rao
2025-06-23 16:18     ` Sean Christopherson
2025-06-25 15:28       ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 21/62] iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 22/62] iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode" Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 23/62] KVM: SVM: Stop walking list of routing table entries when updating IRTE Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 24/62] KVM: VMX: " Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 25/62] KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info() Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 26/62] KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 27/62] KVM: x86: Nullify irqfd->producer after updating IRTEs Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 28/62] KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 29/62] KVM: x86: Move posted interrupt tracepoint to common code Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 30/62] KVM: SVM: Clean up return handling in avic_pi_update_irte() Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 31/62] iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 32/62] KVM: Don't WARN if updating IRQ bypass route fails Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 33/62] KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing() Sean Christopherson
2025-06-13 20:50   ` Oliver Upton
2025-06-11 22:45 ` [PATCH v3 34/62] KVM: x86: Track irq_bypass_vcpu in common x86 code Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 35/62] KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 36/62] KVM: x86: Don't update IRTE entries when old and new routes were !MSI Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 37/62] KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Sean Christopherson
2025-06-17 15:42   ` Naveen N Rao
2025-06-11 22:45 ` [PATCH v3 39/62] iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 40/62] iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 41/62] iommu/amd: Factor out helper for manipulating IRTE GA/CPU info Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 42/62] iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 43/62] iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 44/62] KVM: SVM: Don't check for assigned device(s) when updating affinity Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 45/62] KVM: SVM: Don't check for assigned device(s) when activating AVIC Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 46/62] KVM: SVM: WARN if (de)activating guest mode in IOMMU fails Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 47/62] KVM: SVM: Process all IRTEs on affinity change even if one update fails Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 48/62] KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 49/62] KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 50/62] KVM: x86: WARN if IRQ bypass isn't supported " Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 51/62] KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 52/62] KVM: SVM: WARN if ir_list is non-empty at vCPU free Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 53/62] KVM: x86: Decouple device assignment from IRQ bypass Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 54/62] KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting " Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 55/62] KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Sean Christopherson
2025-06-11 22:45 ` [PATCH v3 56/62] iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 57/62] KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 58/62] KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 59/62] KVM: SVM: Consolidate IRTE update " Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 60/62] iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 61/62] KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Sean Christopherson
2025-06-11 22:46 ` [PATCH v3 62/62] KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq() Sean Christopherson
2025-06-24 19:38 ` [PATCH v3 00/62] KVM: iommu: Overhaul device posted IRQs support Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).