Kernel KVM virtualization development
 help / color / mirror / Atom feed
* [PATCH v2 0/2] KVM: x86: Fix max_irr reporting when PIR is empty despite PID.ON
@ 2026-04-29  7:28 Chenyi Qiang
  2026-04-29  7:28 ` [PATCH v2 1/2] KVM: x86: Always report highest IRR from __kvm_apic_update_irr() Chenyi Qiang
  2026-04-29  7:28 ` [PATCH v2 2/2] KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Chenyi Qiang
  0 siblings, 2 replies; 3+ messages in thread
From: Chenyi Qiang @ 2026-04-29  7:28 UTC (permalink / raw)
  To: kvm; +Cc: Chenyi Qiang, Sean Christopherson, Jim Mattson, Paolo Bonzini,
	Gao Chao

Fix a race in the PIR-to-IRR synchronization path where PID.ON is set
but PIR is empty, causing __kvm_apic_update_irr() to report max_irr as
-1 even though there are pending interrupts in the IRR.  This triggers a
spurious WARNING in vmx_check_nested_events() during nested VM stress
tests.

v2:
  - Move the fix from vmx_sync_pir_to_irr() to __kvm_apic_update_irr()
    as suggested by Paolo along with some commit message changes. [Paolo]
  - Add Cc: stable, Suggested-by, Reviewed-by tags.
  - Carry Sean's cleanup patch for variable renames and expanded
    comments. [Sean]

v1: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/

Chenyi Qiang (1):
  KVM: x86: Always report highest IRR from __kvm_apic_update_irr()

Sean Christopherson (1):
  KVM: x86: Fix misleading variable names and add more comments for
    PIR=>IRR flow

 arch/x86/kvm/lapic.c   | 24 +++++++++++++-----------
 arch/x86/kvm/vmx/vmx.c | 40 ++++++++++++++++++++++++++++++++--------
 2 files changed, 45 insertions(+), 19 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v2 1/2] KVM: x86: Always report highest IRR from __kvm_apic_update_irr()
  2026-04-29  7:28 [PATCH v2 0/2] KVM: x86: Fix max_irr reporting when PIR is empty despite PID.ON Chenyi Qiang
@ 2026-04-29  7:28 ` Chenyi Qiang
  2026-04-29  7:28 ` [PATCH v2 2/2] KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Chenyi Qiang
  1 sibling, 0 replies; 3+ messages in thread
From: Chenyi Qiang @ 2026-04-29  7:28 UTC (permalink / raw)
  To: kvm
  Cc: Chenyi Qiang, Sean Christopherson, Jim Mattson, Paolo Bonzini,
	Gao Chao, stable, Farrah Chen

Compute *max_irr from the existing IRR in __kvm_apic_update_irr() even
when pi_harvest_pir() returns false (PIR is empty), instead of leaving
*max_irr uninitialized at -1.

In a nested VM stress test, the following WARNING fires in
vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
vmx_sync_pir_to_irr() again) returns -1:

  WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
  Call Trace:
   kvm_check_and_inject_events
   vcpu_enter_guest.constprop.0
   vcpu_run
   kvm_arch_vcpu_ioctl_run
   kvm_vcpu_ioctl
   __x64_sys_ioctl
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
performs two individually-atomic operations that are not a single
transaction:

  1. pi_test_and_set_pir(vector)  -- sets the PIR bit
  2. pi_test_and_set_on()         -- sets PID.ON

The following interleaving triggers the bug:

  Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
  B1: set PIR[vector]
                                  A1: pi_clear_on()
                                  A2: pi_harvest_pir() -> sees B1 bit
                                  A3: xchg() -> consumes bit, PIR=0
                                      (1st sync returns correct max_irr)
  B2: set PID.ON = 1

                                  Target vCPU (2nd sync_pir_to_irr):
                                  C1: pi_test_on() -> TRUE (from B2)
                                  C2: pi_clear_on() -> ON=0
                                  C3: pi_harvest_pir() -> PIR empty
                                  C4: *max_irr = -1, early return
                                      IRR NOT SCANNED

The interrupt is not lost (it resides in the IRR from the first sync and
is recovered on the next vcpu_enter_guest() iteration), but the incorrect
max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.

Fix this by scanning the IRR via apic_find_highest_vector() in
__kvm_apic_update_irr() when PIR is empty, so that *max_irr always
reflects the true highest pending interrupt regardless of PIR state.

Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
Cc: stable@vger.kernel.org
Reported-by: Farrah Chen <farrah.chen@intel.com>
Assisted-by: GitHub Copilot:Claude Opus 4.6
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 arch/x86/kvm/lapic.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9381c58d4c85..e9f1e5451160 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 	u32 irr_val, prev_irr_val;
 	int max_updated_irr;
 
+	if (!pi_harvest_pir(pir, pir_vals)) {
+		*max_irr = apic_find_highest_vector(regs + APIC_IRR);
+		return false;
+	}
+
 	max_updated_irr = -1;
 	*max_irr = -1;
 
-	if (!pi_harvest_pir(pir, pir_vals))
-		return false;
-
 	for (i = vec = 0; i <= 7; i++, vec += 32) {
 		u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH v2 2/2] KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow
  2026-04-29  7:28 [PATCH v2 0/2] KVM: x86: Fix max_irr reporting when PIR is empty despite PID.ON Chenyi Qiang
  2026-04-29  7:28 ` [PATCH v2 1/2] KVM: x86: Always report highest IRR from __kvm_apic_update_irr() Chenyi Qiang
@ 2026-04-29  7:28 ` Chenyi Qiang
  1 sibling, 0 replies; 3+ messages in thread
From: Chenyi Qiang @ 2026-04-29  7:28 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Chenyi Qiang, Jim Mattson, Paolo Bonzini,
	Gao Chao

From: Sean Christopherson <seanjc@google.com>

Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR.  I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.

Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c   | 16 ++++++++--------
 arch/x86/kvm/vmx/vmx.c | 40 ++++++++++++++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e9f1e5451160..3c9512d1d79a 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -667,14 +667,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 	u32 *__pir = (void *)pir_vals;
 	u32 i, vec;
 	u32 irr_val, prev_irr_val;
-	int max_updated_irr;
+	int max_new_irr;
 
 	if (!pi_harvest_pir(pir, pir_vals)) {
 		*max_irr = apic_find_highest_vector(regs + APIC_IRR);
 		return false;
 	}
 
-	max_updated_irr = -1;
+	max_new_irr = -1;
 	*max_irr = -1;
 
 	for (i = vec = 0; i <= 7; i++, vec += 32) {
@@ -690,25 +690,25 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 				 !try_cmpxchg(p_irr, &prev_irr_val, irr_val));
 
 			if (prev_irr_val != irr_val)
-				max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
+				max_new_irr = __fls(irr_val ^ prev_irr_val) + vec;
 		}
 		if (irr_val)
 			*max_irr = __fls(irr_val) + vec;
 	}
 
-	return ((max_updated_irr != -1) &&
-		(max_updated_irr == *max_irr));
+	return (max_new_irr != -1) && (max_new_irr == *max_irr);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(__kvm_apic_update_irr);
 
 bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned long *pir, int *max_irr)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	bool irr_updated = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	bool max_irr_is_from_pir;
 
-	if (unlikely(!apic->apicv_active && irr_updated))
+	max_irr_is_from_pir = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	if (unlikely(!apic->apicv_active && max_irr_is_from_pir))
 		apic->irr_pending = true;
-	return irr_updated;
+	return max_irr_is_from_pir;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_irr);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8b24e682535b..297bb6c3ecc2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7138,8 +7138,8 @@ static void vmx_set_rvi(int vector)
 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vt *vt = to_vt(vcpu);
+	bool max_irr_is_from_pir;
 	int max_irr;
-	bool got_posted_interrupt;
 
 	if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
 		return -EIO;
@@ -7151,17 +7151,22 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
-		got_posted_interrupt =
-			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
+		max_irr_is_from_pir = kvm_apic_update_irr(vcpu, vt->pi_desc.pir,
+							  &max_irr);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
-		got_posted_interrupt = false;
+		max_irr_is_from_pir = false;
 	}
 
 	/*
-	 * Newly recognized interrupts are injected via either virtual interrupt
-	 * delivery (RVI) or KVM_REQ_EVENT.  Virtual interrupt delivery is
-	 * disabled in two cases:
+	 * If APICv is enabled and L2 is not active, then update the Requesting
+	 * Virtual Interrupt (RVI) portion of vmcs01.GUEST_INTR_STATUS with the
+	 * highest priority IRR to deliver the IRQ via Virtual Interrupt
+	 * Delivery.  Note, this is required even if the highest priority IRQ
+	 * was already pending in the IRR, as RVI isn't updated in lockstep with
+	 * the IRR (unlike apic->irr_pending).
+	 *
+	 * For the cases where Virtual Interrupt Delivery can't be used:
 	 *
 	 * 1) If L2 is running and the vCPU has a new pending interrupt.  If L1
 	 * wants to exit on interrupts, KVM_REQ_EVENT is needed to synthesize a
@@ -7172,10 +7177,29 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	 * 2) If APICv is disabled for this vCPU, assigned devices may still
 	 * attempt to post interrupts.  The posted interrupt vector will cause
 	 * a VM-Exit and the subsequent entry will call sync_pir_to_irr.
+	 *
+	 * In both cases, set KVM_REQ_EVENT if and only if the highest priority
+	 * pending IRQ came from the PIR, as setting KVM_REQ_EVENT if any IRQ
+	 * is pending may put the vCPU into an infinite loop, e.g. if the IRQ
+	 * is blocked, then it will stay pending until an IRQ window is opened.
+	 *
+	 * Note!  It's possible that one or more IRQs were moved from the PIR
+	 * to the IRR _without_ max_irr_is_from_pir being true!  I.e. if there
+	 * was a higher priority IRQ already pending in the IRR.  Not setting
+	 * KVM_REQ_EVENT in this case is intentional and safe.  If APICv is
+	 * inactive, or L2 is running with exit-on-interrupt off (in vmcs12),
+	 * i.e. without nested virtual interrupt delivery, then there's no need
+	 * to request an IRQ window as the lower priority IRQ only needs to be
+	 * delivered when the higher priority IRQ is dismissed from the ISR,
+	 * i.e. on the next EOI, and EOIs are always intercepted if APICv is
+	 * disabled or if L2 is running without nested VID.  If L2 is running
+	 * exit-on-interrupt on (in vmcs12), then the higher priority IRQ will
+	 * trigger a nested VM-Exit, at which point KVM will re-evaluate L1's
+	 * pending IRQs.
 	 */
 	if (!is_guest_mode(vcpu) && kvm_vcpu_apicv_active(vcpu))
 		vmx_set_rvi(max_irr);
-	else if (got_posted_interrupt)
+	else if (max_irr_is_from_pir)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	return max_irr;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-04-29  7:29 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-29  7:28 [PATCH v2 0/2] KVM: x86: Fix max_irr reporting when PIR is empty despite PID.ON Chenyi Qiang
2026-04-29  7:28 ` [PATCH v2 1/2] KVM: x86: Always report highest IRR from __kvm_apic_update_irr() Chenyi Qiang
2026-04-29  7:28 ` [PATCH v2 2/2] KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow Chenyi Qiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox