public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Chenyi Qiang <chenyi.qiang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org,  Jim Mattson <jmattson@google.com>,
	Gao Chao <chao.gao@intel.com>,
	 Farrah Chen <farrah.chen@intel.com>
Subject: Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
Date: Tue, 28 Apr 2026 08:50:43 -0700	[thread overview]
Message-ID: <afDXUwq6pR7SgxIC@google.com> (raw)
In-Reply-To: <04c954e6-a23a-4cc8-8bd3-5882a951a8cc@intel.com>

On Tue, Apr 28, 2026, Chenyi Qiang wrote:
> On 4/28/2026 3:45 PM, Paolo Bonzini wrote:
> > On 4/28/26 09:03, Chenyi Qiang wrote:
> >> The interrupt is not lost (it resides in the IRR from the first sync and
> >> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
> >> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
> >>
> >> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")

Cc: stable@vger.kernel.org

> >> Reported-by: Farrah Chen <farrah.chen@intel.com>
> >> Assisted-by: GitHub Copilot:Claude Opus 4.6
> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >>
> >> ---
> >> There is a WARNING call trace during a nested VM stress test.

Any chance the stress test is something that can be shared?  I've seen this WARN
2-3 times over the last year, but it was never reproducible, and so intermittent
that I couldn't even correlate what I was doing at the time with the WARN.

> > The analysis of the race is correct and changing the logic is the
> > right thing to do; but I would change directly __kvm_apic_update_irr,
> > either like this:
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e3ec4d8607c1..5ee14d6bc288 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
> >      u32 irr_val, prev_irr_val;
> >      int max_updated_irr;
> >  
> > +    if (!pi_harvest_pir(pir, pir_vals)) {
> > +        *max_irr = apic_find_highest_vector(regs + APIC_IRR);
> > +        return false;
> > +    }

Holy moly, this code is all kinds of subtle.

For Paolo's suggested fix,

  Reviewed-by: Sean Christopherson <seanjc@google.com>

But this code needs some naming tweaks and more comments, because I've been staring
at this for over an hour (more than two now, *sigh*), and have been _this_ close
to making two suggestions that were subtly very, very wrong.

First off, both irr_updated in kvm_apic_update_irr() and got_posted_interrupt in
vmx_sync_pir_to_irr() are misnomers.  __kvm_apic_update_irr() and thus
kvm_apic_update_irr() don't return true if a new posted IRQ was found, they return
true if at least one new posted IRQ was found *and* the highest priority "new" IRQ
is also now the highest priority IRQ in the IRR.

Near-bug-suggestion #1 (in my chronological order of analysis) was to drop what
I _thought_ was a micro-optimization of detecting *new* IRQs in the IRR.  I.e. I
thought skipping KVM_REQ_EVENT was purely an optimization to avoid re-evaluating
IRQs.  But it's not an optimization, it's functionally necessary, because setting
KVM_REQ_EVENT if there was a pre-existing IRQ in the IRR would put the vCPU into
an infinite loop if the IRQ is blocked: KVM won't move the IRQ out of the IRQ
until it can be injected, and so KVM would always see the pending IRQ, always
set KVM_REQ_EVENT, and thus never actually attempt VM-Enter.

Near-bug-suggestion #2, after realizing the functional necessity of the code, was
the thought that there's a pre-existing bug due to this behavior.  When the vCPU
is running L2 or APICv is inactive, KVM needs to set KVM_REQ_EVENT if a new IRQ
is found, in order to trigger the various checks that ensure KVM delivers/injects
the IRQ in a timely fashion.  So of course setting KVM_REQ_EVENT only if a the
new IRQ is the highest IRQ must be wrong, right?

Nope.  Because if there was a pre-existing, higher priority IRQ in the IRR, then
either that IRQ is already primed for injection via the VMCS, or KVM has requested
an IRQ window because the higher priority IRQ is blocked.  In the latter case,
there's more obviously nothing to be done since KVM will do process pending IRQs
when an IRQ window is opened.  The former case is more subtle though: KVM doesn't
need to re-evaluate interrupts, because either EOI from the guest is guaranteed
to exit (APICv is inactive and/or L2 is running and L1 is running L2 with
exit-on-interrupt off, i.e. without nested virtual interrupt delivery), or the
highest IRQ is guaranteed to trigger a nested VM-Exit, (L2 is running and L1 is
running L2 with exit-on-interrupt on).

On top of Paolo's suggested fix:

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 28 Apr 2026 07:33:20 -0700
Subject: [PATCH] KVM: x86: Fix misleading variable names and add more comments
 for PIR=>IRR flow

Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR.  I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.

Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c   | 16 ++++++++--------
 arch/x86/kvm/vmx/vmx.c | 40 ++++++++++++++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4d4b003cf7b2..0cac6decdef9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -667,9 +667,9 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 	u32 *__pir = (void *)pir_vals;
 	u32 i, vec;
 	u32 irr_val, prev_irr_val;
-	int max_updated_irr;
+	int max_new_irr;
 
-	max_updated_irr = -1;
+	max_new_irr = -1;
 	*max_irr = -1;
 
 	if (!pi_harvest_pir(pir, pir_vals)) {
@@ -690,25 +690,25 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 				 !try_cmpxchg(p_irr, &prev_irr_val, irr_val));
 
 			if (prev_irr_val != irr_val)
-				max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
+				max_new_irr = __fls(irr_val ^ prev_irr_val) + vec;
 		}
 		if (irr_val)
 			*max_irr = __fls(irr_val) + vec;
 	}
 
-	return ((max_updated_irr != -1) &&
-		(max_updated_irr == *max_irr));
+	return (max_new_irr != -1) && (max_new_irr == *max_irr);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(__kvm_apic_update_irr);
 
 bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned long *pir, int *max_irr)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	bool irr_updated = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	bool max_irr_is_from_pir;
 
-	if (unlikely(!apic->apicv_active && irr_updated))
+	max_irr_is_from_pir = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	if (unlikely(!apic->apicv_active && max_irr_is_from_pir))
 		apic->irr_pending = true;
-	return irr_updated;
+	return max_irr_is_from_pir;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_irr);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..4e1aadd89c63 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7029,8 +7029,8 @@ static void vmx_set_rvi(int vector)
 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vt *vt = to_vt(vcpu);
+	bool max_irr_is_from_pir;
 	int max_irr;
-	bool got_posted_interrupt;
 
 	if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
 		return -EIO;
@@ -7042,17 +7042,22 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
-		got_posted_interrupt =
-			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
+		max_irr_is_from_pir = kvm_apic_update_irr(vcpu, vt->pi_desc.pir,
+							  &max_irr);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
-		got_posted_interrupt = false;
+		max_irr_is_from_pir = false;
 	}
 
 	/*
-	 * Newly recognized interrupts are injected via either virtual interrupt
-	 * delivery (RVI) or KVM_REQ_EVENT.  Virtual interrupt delivery is
-	 * disabled in two cases:
+	 * If APICv is enabled and L2 is not active, then update the Requesting
+	 * Virtual Interrupt (RVI) portion of vmcs01.GUEST_INTR_STATUS with the
+	 * highest priority IRR to deliver the IRQ via Virtual Interrupt
+	 * Delivery.  Note, this is required even if the highest priority IRQ
+	 * was already pending in the IRR, as RVI isn't update in lockstep with
+	 * the IRR (unlike apic->irr_pending).
+	 *
+	 * For the cases where Virtual Interrupt Delivery can't be used:
 	 *
 	 * 1) If L2 is running and the vCPU has a new pending interrupt.  If L1
 	 * wants to exit on interrupts, KVM_REQ_EVENT is needed to synthesize a
@@ -7063,10 +7068,29 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	 * 2) If APICv is disabled for this vCPU, assigned devices may still
 	 * attempt to post interrupts.  The posted interrupt vector will cause
 	 * a VM-Exit and the subsequent entry will call sync_pir_to_irr.
+	 *
+	 * In both cases, set KVM_REQ_EVENT if and only if the highest priority
+	 * pending IRQ came from the PIR, as setting KVM_REQ_EVENT if any IRQ
+	 * is pending may put the vCPU into an infinite loop, e.g. if the IRQ
+	 * is blocked, then it will stay pending until an IRQ window is opened.
+	 *
+	 * Note!  It's possible that one or more IRQs were moved from the PIR
+	 * to the IRR _without_ max_irr_is_from_pir being true!  I.e. if there
+	 * was a higher priority IRQ already pending in the IRR.  Not setting
+	 * KVM_REQ_EVENT in this case is intentional and safe.  If APICv is
+	 * inactive, or L2 is running with exit-on-interrupt off (in vmcs12),
+	 * i.e. without nested virtual interrupt delivery, then there's no need
+	 * to request an IRQ window as the lower priority IRQ only needs to be
+	 * delivered when the higher priority IRQ is dismissed from the ISR,
+	 * i.e. on the next EOI, and EOIs are always intercepted if APICv is
+	 * disabled or if L2 is running without nested VID.  If L2 is running
+	 * exit-on-interrupt on (in vmcs12), then the higher priority IRQ will
+	 * trigger a nested VM-Exit, at which point KVM will re-evaluate L1's
+	 * pending IRQs.
 	 */
 	if (!is_guest_mode(vcpu) && kvm_vcpu_apicv_active(vcpu))
 		vmx_set_rvi(max_irr);
-	else if (got_posted_interrupt)
+	else if (max_irr_is_from_pir)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	return max_irr;

base-commit: a98d8321896bf010f877b5edb88d3f14627c943a
--

  reply	other threads:[~2026-04-28 15:50 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-28  7:03 [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set Chenyi Qiang
2026-04-28  7:45 ` Paolo Bonzini
2026-04-28  8:27   ` Chenyi Qiang
2026-04-28 15:50     ` Sean Christopherson [this message]
2026-04-29  1:08       ` Chenyi Qiang
2026-04-29 12:58         ` Sean Christopherson
2026-04-28 11:10 ` Chao Gao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afDXUwq6pR7SgxIC@google.com \
    --to=seanjc@google.com \
    --cc=chao.gao@intel.com \
    --cc=chenyi.qiang@intel.com \
    --cc=farrah.chen@intel.com \
    --cc=jmattson@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox