public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
@ 2026-04-28  7:03 Chenyi Qiang
  2026-04-28  7:45 ` Paolo Bonzini
  2026-04-28 11:10 ` Chao Gao
  0 siblings, 2 replies; 7+ messages in thread
From: Chenyi Qiang @ 2026-04-28  7:03 UTC (permalink / raw)
  To: kvm
  Cc: Chenyi Qiang, Sean Christopherson, Jim Mattson, Paolo Bonzini,
	Gao Chao, Farrah Chen

Fall back to kvm_lapic_find_highest_irr() in vmx_sync_pir_to_irr() when
PID.ON is set but PIR turns out to be empty, to correctly report the
highest pending interrupt from the existing IRR.

In a nested VM stress test, the following WARNING fires in
vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
vmx_sync_pir_to_irr() again) returns -1:

  WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
  Call Trace:
   kvm_check_and_inject_events
   vcpu_enter_guest.constprop.0
   vcpu_run
   kvm_arch_vcpu_ioctl_run
   kvm_vcpu_ioctl
   __x64_sys_ioctl
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
performs two individually-atomic operations that are not a single
transaction:

  1. pi_test_and_set_pir(vector)  -- sets the PIR bit
  2. pi_test_and_set_on()         -- sets PID.ON

The following interleaving triggers the bug:

  Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
  B1: set PIR[vector]
                                  A1: pi_clear_on()
                                  A2: pi_harvest_pir() -> sees B1 bit
                                  A3: xchg() -> consumes bit, PIR=0
                                      (1st sync returns correct max_irr)
  B2: set PID.ON = 1

                                  Target vCPU (2nd sync_pir_to_irr):
                                  C1: pi_test_on() -> TRUE (from B2)
                                  C2: pi_clear_on() -> ON=0
                                  C3: pi_harvest_pir() -> PIR empty
                                  C4: *max_irr = -1, early return
                                      IRR NOT SCANNED

The interrupt is not lost (it resides in the IRR from the first sync and
is recovered on the next vcpu_enter_guest() iteration), but the incorrect
max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.

Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
Reported-by: Farrah Chen <farrah.chen@intel.com>
Assisted-by: GitHub Copilot:Claude Opus 4.6
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

---
There is a WARNING call trace during a nested VM stress test. AI
provided an analysis of a race condition and the related fix, which
looks reasonable to me. With the patch applied, the WARNING cannot
be reproduced in overnight stress testing.
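
To make the interleaving in the changelog concrete, here is a stand-alone
Python model of the race (illustrative only -- the names loosely mirror
the KVM helpers, but none of this is kernel code; Python sets stand in
for the 256-bit PIR/IRR bitmaps and the steps are replayed sequentially):

```python
# Toy model of the posted-interrupt descriptor: a set of pending
# vectors (PIR) plus the ON bit.
class PIDesc:
    def __init__(self):
        self.pir = set()
        self.on = False

def sync_pir_to_irr(pid, irr, fallback_to_irr):
    """Consume PIR into IRR and return the highest pending vector."""
    if not pid.on:                        # pi_test_on()
        return max(irr) if irr else -1
    pid.on = False                        # pi_clear_on()
    harvested, pid.pir = pid.pir, set()   # pi_harvest_pir() + xchg()
    irr |= harvested                      # kvm_apic_update_irr()
    if harvested:
        return max(irr)
    # PIR empty despite ON having been set: the buggy path returns -1
    # without scanning the IRR; the fixed path falls back to the IRR.
    return (max(irr) if irr else -1) if fallback_to_irr else -1

pid, irr = PIDesc(), set()
pid.pir.add(0x30)       # B1: sender sets PIR[vector]
pid.on = True           # (ON was set by an earlier, completed delivery)
first = sync_pir_to_irr(pid, irr, fallback_to_irr=False)  # A1-A3
pid.on = True           # B2: sender sets PID.ON after the xchg
buggy = sync_pir_to_irr(pid, irr, fallback_to_irr=False)  # C1-C4
pid.on = True           # replay C1-C4 with the fallback applied
fixed = sync_pir_to_irr(pid, irr, fallback_to_irr=True)
print(first, buggy, fixed)   # -> 48 -1 48
```

The first sync consumes the PIR bit into the IRR and reports it; the
second sync sees ON set but an empty PIR and, without the fallback,
reports -1 even though vector 0x30 is still pending in the IRR.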
---
 arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8b24e682535b..e2da29371e00 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7153,6 +7153,16 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		smp_mb__after_atomic();
 		got_posted_interrupt =
 			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
+		/*
+		 * If PID.ON was set but PIR is empty, another CPU may have
+		 * set PID.ON via __vmx_deliver_posted_interrupt() after a
+		 * previous sync already consumed the PIR bits.  In this
+		 * case, kvm_apic_update_irr() will not have scanned the
+		 * existing IRR, so fall back to scanning the IRR directly
+		 * to correctly report the highest pending interrupt.
+		 */
+		if (max_irr == -1)
+			max_irr = kvm_lapic_find_highest_irr(vcpu);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
 		got_posted_interrupt = false;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-28  7:03 [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set Chenyi Qiang
@ 2026-04-28  7:45 ` Paolo Bonzini
  2026-04-28  8:27   ` Chenyi Qiang
  2026-04-28 11:10 ` Chao Gao
  1 sibling, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2026-04-28  7:45 UTC (permalink / raw)
  To: Chenyi Qiang, kvm; +Cc: Sean Christopherson, Jim Mattson, Gao Chao, Farrah Chen

On 4/28/26 09:03, Chenyi Qiang wrote:
> Fall back to kvm_lapic_find_highest_irr() in vmx_sync_pir_to_irr() when
> PID.ON is set but PIR turns out to be empty, to correctly report the
> highest pending interrupt from the existing IRR.
> 
> In a nested VM stress test, the following WARNING fires in
> vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
> interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
> vmx_sync_pir_to_irr() again) returns -1:
> 
>    WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
>    Call Trace:
>     kvm_check_and_inject_events
>     vcpu_enter_guest.constprop.0
>     vcpu_run
>     kvm_arch_vcpu_ioctl_run
>     kvm_vcpu_ioctl
>     __x64_sys_ioctl
>     do_syscall_64
>     entry_SYSCALL_64_after_hwframe
> 
> The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
> and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
> performs two individually-atomic operations that are not a single
> transaction:
> 
>    1. pi_test_and_set_pir(vector)  -- sets the PIR bit
>    2. pi_test_and_set_on()         -- sets PID.ON
> 
> The following interleaving triggers the bug:
> 
>    Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
>    B1: set PIR[vector]
>                                    A1: pi_clear_on()
>                                    A2: pi_harvest_pir() -> sees B1 bit
>                                    A3: xchg() -> consumes bit, PIR=0
>                                        (1st sync returns correct max_irr)
>    B2: set PID.ON = 1
> 
>                                    Target vCPU (2nd sync_pir_to_irr):
>                                    C1: pi_test_on() -> TRUE (from B2)
>                                    C2: pi_clear_on() -> ON=0
>                                    C3: pi_harvest_pir() -> PIR empty
>                                    C4: *max_irr = -1, early return
>                                        IRR NOT SCANNED
> 
> The interrupt is not lost (it resides in the IRR from the first sync and
> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
> 
> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Assisted-by: GitHub Copilot:Claude Opus 4.6
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> ---
> There is a WARNING call trace during a nested VM stress test. AI
> provided an analysis of a race condition and the related fix, which
> looks reasonable to me. With the patch applied, the WARNING cannot
> be reproduced in overnight stress testing.

The analysis of the race is correct and changing the logic is the
right thing to do; but I would change directly __kvm_apic_update_irr,
either like this:

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e3ec4d8607c1..5ee14d6bc288 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
  	u32 irr_val, prev_irr_val;
  	int max_updated_irr;
  
+	if (!pi_harvest_pir(pir, pir_vals)) {
+		*max_irr = apic_find_highest_vector(regs + APIC_IRR);
+		return false;
+	}
+
  	max_updated_irr = -1;
  	*max_irr = -1;
  
-	if (!pi_harvest_pir(pir, pir_vals))
-		return false;
-
  	for (i = vec = 0; i <= 7; i++, vec += 32) {
  		u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);
  

Or even ignoring altogether the return value of pi_harvest_pir(), always
going in the loop below for simplicity.
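
For what it's worth, the behavioral difference can be sketched in a few
lines of Python (illustrative only; sets of vectors stand in for the real
PIR/IRR bitmaps, and the helper name is made up for this sketch):

```python
def apic_update_irr(pir, irr):
    """Return (max_irr, max_irr_is_from_pir) after moving PIR into IRR."""
    harvested = set(pir)          # pi_harvest_pir() + xchg()
    pir.clear()
    if not harvested:
        # Suggested fix: report the highest vector already in the IRR
        # instead of leaving max_irr at -1.
        return (max(irr) if irr else -1), False
    irr |= harvested
    max_irr = max(irr)
    # True only if the highest pending vector is one we just harvested.
    return max_irr, max_irr in harvested

irr = {0x30}
assert apic_update_irr(set(), irr) == (0x30, False)   # empty PIR: fall back
assert apic_update_irr({0x40}, irr) == (0x40, True)   # new highest from PIR
assert apic_update_irr({0x20}, irr) == (0x40, False)  # lower than existing
```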

Paolo

> ---
>   arch/x86/kvm/vmx/vmx.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8b24e682535b..e2da29371e00 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7153,6 +7153,16 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   		smp_mb__after_atomic();
>   		got_posted_interrupt =
>   			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
> +		/*
> +		 * If PID.ON was set but PIR is empty, another CPU may have
> +		 * set PID.ON via __vmx_deliver_posted_interrupt() after a
> +		 * previous sync already consumed the PIR bits.  In this
> +		 * case, kvm_apic_update_irr() will not have scanned the
> +		 * existing IRR, so fall back to scanning the IRR directly
> +		 * to correctly report the highest pending interrupt.
> +		 */
> +		if (max_irr == -1)
> +			max_irr = kvm_lapic_find_highest_irr(vcpu);
>   	} else {
>   		max_irr = kvm_lapic_find_highest_irr(vcpu);
>   		got_posted_interrupt = false;


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-28  7:45 ` Paolo Bonzini
@ 2026-04-28  8:27   ` Chenyi Qiang
  2026-04-28 15:50     ` Sean Christopherson
  0 siblings, 1 reply; 7+ messages in thread
From: Chenyi Qiang @ 2026-04-28  8:27 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Sean Christopherson, Jim Mattson, Gao Chao, Farrah Chen



On 4/28/2026 3:45 PM, Paolo Bonzini wrote:
> On 4/28/26 09:03, Chenyi Qiang wrote:
>> Fall back to kvm_lapic_find_highest_irr() in vmx_sync_pir_to_irr() when
>> PID.ON is set but PIR turns out to be empty, to correctly report the
>> highest pending interrupt from the existing IRR.
>>
>> In a nested VM stress test, the following WARNING fires in
>> vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
>> interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
>> vmx_sync_pir_to_irr() again) returns -1:
>>
>>    WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
>>    Call Trace:
>>     kvm_check_and_inject_events
>>     vcpu_enter_guest.constprop.0
>>     vcpu_run
>>     kvm_arch_vcpu_ioctl_run
>>     kvm_vcpu_ioctl
>>     __x64_sys_ioctl
>>     do_syscall_64
>>     entry_SYSCALL_64_after_hwframe
>>
>> The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
>> and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
>> performs two individually-atomic operations that are not a single
>> transaction:
>>
>>    1. pi_test_and_set_pir(vector)  -- sets the PIR bit
>>    2. pi_test_and_set_on()         -- sets PID.ON
>>
>> The following interleaving triggers the bug:
>>
>>    Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
>>    B1: set PIR[vector]
>>                                    A1: pi_clear_on()
>>                                    A2: pi_harvest_pir() -> sees B1 bit
>>                                    A3: xchg() -> consumes bit, PIR=0
>>                                        (1st sync returns correct max_irr)
>>    B2: set PID.ON = 1
>>
>>                                    Target vCPU (2nd sync_pir_to_irr):
>>                                    C1: pi_test_on() -> TRUE (from B2)
>>                                    C2: pi_clear_on() -> ON=0
>>                                    C3: pi_harvest_pir() -> PIR empty
>>                                    C4: *max_irr = -1, early return
>>                                        IRR NOT SCANNED
>>
>> The interrupt is not lost (it resides in the IRR from the first sync and
>> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
>> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
>>
>> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
>> Reported-by: Farrah Chen <farrah.chen@intel.com>
>> Assisted-by: GitHub Copilot:Claude Opus 4.6
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>
>> ---
>> There is a WARNING call trace during a nested VM stress test. AI
>> provided an analysis of a race condition and the related fix, which
>> looks reasonable to me. With the patch applied, the WARNING cannot
>> be reproduced in overnight stress testing.
> 
> The analysis of the race is correct and changing the logic is the
> right thing to do; but I would change directly __kvm_apic_update_irr,
> either like this:
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e3ec4d8607c1..5ee14d6bc288 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
>      u32 irr_val, prev_irr_val;
>      int max_updated_irr;
>  
> +    if (!pi_harvest_pir(pir, pir_vals)) {
> +        *max_irr = apic_find_highest_vector(regs + APIC_IRR);
> +        return false;
> +    }
> +
>      max_updated_irr = -1;
>      *max_irr = -1;
>  
> -    if (!pi_harvest_pir(pir, pir_vals))
> -        return false;
> -
>      for (i = vec = 0; i <= 7; i++, vec += 32) {
>          u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);

Makes sense. This resolves the problem at the source.
I will verify the change in this way and respin it.

>  
> 
> Or even ignoring altogether the return value of pi_harvest_pir(), always
> going in the loop below for simplicity.
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-28  7:03 [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set Chenyi Qiang
  2026-04-28  7:45 ` Paolo Bonzini
@ 2026-04-28 11:10 ` Chao Gao
  1 sibling, 0 replies; 7+ messages in thread
From: Chao Gao @ 2026-04-28 11:10 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: kvm, Sean Christopherson, Jim Mattson, Paolo Bonzini, Farrah Chen

On Tue, Apr 28, 2026 at 03:03:26PM +0800, Chenyi Qiang wrote:
>Fall back to kvm_lapic_find_highest_irr() in vmx_sync_pir_to_irr() when
>PID.ON is set but PIR turns out to be empty, to correctly report the
>highest pending interrupt from the existing IRR.
>
>In a nested VM stress test, the following WARNING fires in
>vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
>interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
>vmx_sync_pir_to_irr() again) returns -1:
>
>  WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
>  Call Trace:
>   kvm_check_and_inject_events
>   vcpu_enter_guest.constprop.0
>   vcpu_run
>   kvm_arch_vcpu_ioctl_run
>   kvm_vcpu_ioctl
>   __x64_sys_ioctl
>   do_syscall_64
>   entry_SYSCALL_64_after_hwframe
>
>The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
>and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
>performs two individually-atomic operations that are not a single
>transaction:
>
>  1. pi_test_and_set_pir(vector)  -- sets the PIR bit
>  2. pi_test_and_set_on()         -- sets PID.ON
>
>The following interleaving triggers the bug:
>
>  Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
>  B1: set PIR[vector]
>                                  A1: pi_clear_on()
>                                  A2: pi_harvest_pir() -> sees B1 bit
>                                  A3: xchg() -> consumes bit, PIR=0
>                                      (1st sync returns correct max_irr)
>  B2: set PID.ON = 1
>
>                                  Target vCPU (2nd sync_pir_to_irr):
>                                  C1: pi_test_on() -> TRUE (from B2)
>                                  C2: pi_clear_on() -> ON=0
>                                  C3: pi_harvest_pir() -> PIR empty
>                                  C4: *max_irr = -1, early return
>                                      IRR NOT SCANNED
>
>The interrupt is not lost (it resides in the IRR from the first sync and
>is recovered on the next vcpu_enter_guest() iteration), but the incorrect
>max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
>
>Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")

Just FYI, I asked Copilot to review commit b41f8638b9d3, and it indeed
identified this subtle functional change:

"
Found 1 regression. In arch/x86/kvm/lapic.c::__kvm_apic_update_irr(), the new if
(!pending) return false; drops the old behavior of recomputing *max_irr from
APIC_IRR on an empty-PIR path. vmx_sync_pir_to_irr() still calls this helper
whenever PID.ON is set and then unconditionally passes max_irr to vmx_set_rvi(),
so when hardware has already drained PIR into vIRR/APIC_IRR, max_irr stays -1 and
KVM clears RVI despite an interrupt still being pending
"

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-28  8:27   ` Chenyi Qiang
@ 2026-04-28 15:50     ` Sean Christopherson
  2026-04-29  1:08       ` Chenyi Qiang
  0 siblings, 1 reply; 7+ messages in thread
From: Sean Christopherson @ 2026-04-28 15:50 UTC (permalink / raw)
  To: Chenyi Qiang; +Cc: Paolo Bonzini, kvm, Jim Mattson, Gao Chao, Farrah Chen

On Tue, Apr 28, 2026, Chenyi Qiang wrote:
> On 4/28/2026 3:45 PM, Paolo Bonzini wrote:
> > On 4/28/26 09:03, Chenyi Qiang wrote:
> >> The interrupt is not lost (it resides in the IRR from the first sync and
> >> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
> >> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
> >>
> >> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")

Cc: stable@vger.kernel.org

> >> Reported-by: Farrah Chen <farrah.chen@intel.com>
> >> Assisted-by: GitHub Copilot:Claude Opus 4.6
> >> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >>
> >> ---
> >> There is a WARNING call trace during a nested VM stress test.

Any chance the stress test is something that can be shared?  I've seen this WARN
2-3 times over the last year, but it was never reproducible, and so intermittent
that I couldn't even correlate what I was doing at the time with the WARN.

> > The analysis of the race is correct and changing the logic is the
> > right thing to do; but I would change directly __kvm_apic_update_irr,
> > either like this:
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e3ec4d8607c1..5ee14d6bc288 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
> >      u32 irr_val, prev_irr_val;
> >      int max_updated_irr;
> >  
> > +    if (!pi_harvest_pir(pir, pir_vals)) {
> > +        *max_irr = apic_find_highest_vector(regs + APIC_IRR);
> > +        return false;
> > +    }

Holy moly, this code is all kinds of subtle.

For Paolo's suggested fix,

  Reviewed-by: Sean Christopherson <seanjc@google.com>

But this code needs some naming tweaks and more comments, because I've been staring
at this for over an hour (more than two now, *sigh*), and have been _this_ close
to making two suggestions that were subtly very, very wrong.

First off, both irr_updated in kvm_apic_update_irr() and got_posted_interrupt in
vmx_sync_pir_to_irr() are misnomers.  __kvm_apic_update_irr() and thus
kvm_apic_update_irr() don't return true if a new posted IRQ was found, they return
true if at least one new posted IRQ was found *and* the highest priority "new" IRQ
is also now the highest priority IRQ in the IRR.

Near-bug-suggestion #1 (in my chronological order of analysis) was to drop what
I _thought_ was a micro-optimization of detecting *new* IRQs in the IRR.  I.e. I
thought skipping KVM_REQ_EVENT was purely an optimization to avoid re-evaluating
IRQs.  But it's not an optimization, it's functionally necessary, because setting
KVM_REQ_EVENT if there was a pre-existing IRQ in the IRR would put the vCPU into
an infinite loop if the IRQ is blocked: KVM won't move the IRQ out of the IRR
until it can be injected, and so KVM would always see the pending IRQ, always
set KVM_REQ_EVENT, and thus never actually attempt VM-Enter.
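
A toy loop, just to illustrate the hazard (assuming, for the sake of the
sketch, that no new posted IRQs arrive and the pending IRQ stays blocked):

```python
def vcpu_run_loop(irr, irq_blocked, req_event_on_any_pending, iters=5):
    """Count how many VM-Enters happen in `iters` trips around the loop."""
    vm_enters = 0
    for _ in range(iters):
        # sync_pir_to_irr() finds no *new* posted IRQ; a pre-existing
        # vector may still sit in the IRR.
        req_event = bool(irr) if req_event_on_any_pending else False
        if req_event and irq_blocked:
            continue          # re-check events; VM-Enter never happens
        vm_enters += 1        # VM-Enter; the blocked IRQ stays in the IRR
    return vm_enters

assert vcpu_run_loop({0x30}, True, req_event_on_any_pending=True) == 0
assert vcpu_run_loop({0x30}, True, req_event_on_any_pending=False) == 5
```

Setting KVM_REQ_EVENT for any pending IRQ spins forever; setting it only
when the new IRQ is the highest pending one lets VM-Enter proceed.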

Near-bug-suggestion #2, after realizing the functional necessity of the code, was
the thought that there's a pre-existing bug due to this behavior.  When the vCPU
is running L2 or APICv is inactive, KVM needs to set KVM_REQ_EVENT if a new IRQ
is found, in order to trigger the various checks that ensure KVM delivers/injects
the IRQ in a timely fashion.  So of course setting KVM_REQ_EVENT only if the
new IRQ is the highest IRQ must be wrong, right?

Nope.  Because if there was a pre-existing, higher priority IRQ in the IRR, then
either that IRQ is already primed for injection via the VMCS, or KVM has requested
an IRQ window because the higher priority IRQ is blocked.  In the latter case,
there's more obviously nothing to be done since KVM will do process pending IRQs
when an IRQ window is opened.  The former case is more subtle though: KVM doesn't
need to re-evaluate interrupts, because either EOI from the guest is guaranteed
to exit (APICv is inactive and/or L2 is running and L1 is running L2 with
exit-on-interrupt off, i.e. without nested virtual interrupt delivery), or the
highest IRQ is guaranteed to trigger a nested VM-Exit (L2 is running and L1 is
running L2 with exit-on-interrupt on).

On top of Paolo's suggested fix:

---
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 28 Apr 2026 07:33:20 -0700
Subject: [PATCH] KVM: x86: Fix misleading variable names and add more comments
 for PIR=>IRR flow

Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR.  I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.

Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c   | 16 ++++++++--------
 arch/x86/kvm/vmx/vmx.c | 40 ++++++++++++++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4d4b003cf7b2..0cac6decdef9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -667,9 +667,9 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 	u32 *__pir = (void *)pir_vals;
 	u32 i, vec;
 	u32 irr_val, prev_irr_val;
-	int max_updated_irr;
+	int max_new_irr;
 
-	max_updated_irr = -1;
+	max_new_irr = -1;
 	*max_irr = -1;
 
 	if (!pi_harvest_pir(pir, pir_vals)) {
@@ -690,25 +690,25 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 				 !try_cmpxchg(p_irr, &prev_irr_val, irr_val));
 
 			if (prev_irr_val != irr_val)
-				max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
+				max_new_irr = __fls(irr_val ^ prev_irr_val) + vec;
 		}
 		if (irr_val)
 			*max_irr = __fls(irr_val) + vec;
 	}
 
-	return ((max_updated_irr != -1) &&
-		(max_updated_irr == *max_irr));
+	return (max_new_irr != -1) && (max_new_irr == *max_irr);
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(__kvm_apic_update_irr);
 
 bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned long *pir, int *max_irr)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	bool irr_updated = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	bool max_irr_is_from_pir;
 
-	if (unlikely(!apic->apicv_active && irr_updated))
+	max_irr_is_from_pir = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+	if (unlikely(!apic->apicv_active && max_irr_is_from_pir))
 		apic->irr_pending = true;
-	return irr_updated;
+	return max_irr_is_from_pir;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_irr);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef14..4e1aadd89c63 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7029,8 +7029,8 @@ static void vmx_set_rvi(int vector)
 int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vt *vt = to_vt(vcpu);
+	bool max_irr_is_from_pir;
 	int max_irr;
-	bool got_posted_interrupt;
 
 	if (KVM_BUG_ON(!enable_apicv, vcpu->kvm))
 		return -EIO;
@@ -7042,17 +7042,22 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 		 * But on x86 this is just a compiler barrier anyway.
 		 */
 		smp_mb__after_atomic();
-		got_posted_interrupt =
-			kvm_apic_update_irr(vcpu, vt->pi_desc.pir, &max_irr);
+		max_irr_is_from_pir = kvm_apic_update_irr(vcpu, vt->pi_desc.pir,
+							  &max_irr);
 	} else {
 		max_irr = kvm_lapic_find_highest_irr(vcpu);
-		got_posted_interrupt = false;
+		max_irr_is_from_pir = false;
 	}
 
 	/*
-	 * Newly recognized interrupts are injected via either virtual interrupt
-	 * delivery (RVI) or KVM_REQ_EVENT.  Virtual interrupt delivery is
-	 * disabled in two cases:
+	 * If APICv is enabled and L2 is not active, then update the Requesting
+	 * Virtual Interrupt (RVI) portion of vmcs01.GUEST_INTR_STATUS with the
+	 * highest priority IRR to deliver the IRQ via Virtual Interrupt
+	 * Delivery.  Note, this is required even if the highest priority IRQ
+	 * was already pending in the IRR, as RVI isn't updated in lockstep with
+	 * the IRR (unlike apic->irr_pending).
+	 *
+	 * For the cases where Virtual Interrupt Delivery can't be used:
 	 *
 	 * 1) If L2 is running and the vCPU has a new pending interrupt.  If L1
 	 * wants to exit on interrupts, KVM_REQ_EVENT is needed to synthesize a
@@ -7063,10 +7068,29 @@ int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	 * 2) If APICv is disabled for this vCPU, assigned devices may still
 	 * attempt to post interrupts.  The posted interrupt vector will cause
 	 * a VM-Exit and the subsequent entry will call sync_pir_to_irr.
+	 *
+	 * In both cases, set KVM_REQ_EVENT if and only if the highest priority
+	 * pending IRQ came from the PIR, as setting KVM_REQ_EVENT if any IRQ
+	 * is pending may put the vCPU into an infinite loop, e.g. if the IRQ
+	 * is blocked, then it will stay pending until an IRQ window is opened.
+	 *
+	 * Note!  It's possible that one or more IRQs were moved from the PIR
+	 * to the IRR _without_ max_irr_is_from_pir being true!  I.e. if there
+	 * was a higher priority IRQ already pending in the IRR.  Not setting
+	 * KVM_REQ_EVENT in this case is intentional and safe.  If APICv is
+	 * inactive, or L2 is running with exit-on-interrupt off (in vmcs12),
+	 * i.e. without nested virtual interrupt delivery, then there's no need
+	 * to request an IRQ window as the lower priority IRQ only needs to be
+	 * delivered when the higher priority IRQ is dismissed from the ISR,
+	 * i.e. on the next EOI, and EOIs are always intercepted if APICv is
+	 * disabled or if L2 is running without nested VID.  If L2 is running
+	 * disabled or if L2 is running without nested VID.  If L2 is running with
+	 * trigger a nested VM-Exit, at which point KVM will re-evaluate L1's
+	 * pending IRQs.
 	 */
 	if (!is_guest_mode(vcpu) && kvm_vcpu_apicv_active(vcpu))
 		vmx_set_rvi(max_irr);
-	else if (got_posted_interrupt)
+	else if (max_irr_is_from_pir)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	return max_irr;

base-commit: a98d8321896bf010f877b5edb88d3f14627c943a
--

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-28 15:50     ` Sean Christopherson
@ 2026-04-29  1:08       ` Chenyi Qiang
  2026-04-29 12:58         ` Sean Christopherson
  0 siblings, 1 reply; 7+ messages in thread
From: Chenyi Qiang @ 2026-04-29  1:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, Jim Mattson, Gao Chao, Farrah Chen



On 4/28/2026 11:50 PM, Sean Christopherson wrote:
> On Tue, Apr 28, 2026, Chenyi Qiang wrote:
>> On 4/28/2026 3:45 PM, Paolo Bonzini wrote:
>>> On 4/28/26 09:03, Chenyi Qiang wrote:
>>>> The interrupt is not lost (it resides in the IRR from the first sync and
>>>> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
>>>> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
>>>>
>>>> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
> 
> Cc: stable@vger.kernel.org
> 
>>>> Reported-by: Farrah Chen <farrah.chen@intel.com>
>>>> Assisted-by: GitHub Copilot:Claude Opus 4.6
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>
>>>> ---
>>>> There is a WARNING call trace during a nested VM stress test.
> 
> Any chance the stress test is something that can be shared?  I've seen this WARN
> 2-3 times over the last year, but it was never reproducible, and so intermittent
> that I couldn't even correlate what I was doing at the time with the WARN.

It's not something special. The stress test is only to create and tear down 20 nested VM instances in a cycle.
It often takes more than 20 cycles to reproduce this issue.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set
  2026-04-29  1:08       ` Chenyi Qiang
@ 2026-04-29 12:58         ` Sean Christopherson
  0 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2026-04-29 12:58 UTC (permalink / raw)
  To: Chenyi Qiang; +Cc: Paolo Bonzini, kvm, Jim Mattson, Gao Chao, Farrah Chen

On Wed, Apr 29, 2026, Chenyi Qiang wrote:
> On 4/28/2026 11:50 PM, Sean Christopherson wrote:
> > On Tue, Apr 28, 2026, Chenyi Qiang wrote:
> >> On 4/28/2026 3:45 PM, Paolo Bonzini wrote:
> >>> On 4/28/26 09:03, Chenyi Qiang wrote:
> >>>> The interrupt is not lost (it resides in the IRR from the first sync and
> >>>> is recovered on the next vcpu_enter_guest() iteration), but the incorrect
> >>>> max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.
> >>>>
> >>>> Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
> > 
> > Cc: stable@vger.kernel.org
> > 
> >>>> Reported-by: Farrah Chen <farrah.chen@intel.com>
> >>>> Assisted-by: GitHub Copilot:Claude Opus 4.6
> >>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> >>>>
> >>>> ---
> >>>> There is a WARNING call trace during a nested VM stress test.
> > 
> > Any chance the stress test is something that can be shared?  I've seen this WARN
> > 2-3 times over the last year, but it was never reproducible, and so intermittent
> > that I couldn't even correlate what I was doing at the time with the WARN.
> 
> It's not something special. The stress test is only to create and tear down
> 20 nested VM instances in a cycle.  It often takes more than 20 cycles to
> reproduce this issue.

Got it, thanks much!

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-04-29 12:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-28  7:03 [PATCH] KVM: VMX: Fall back to IRR scan when PIR is empty despite PID.ON being set Chenyi Qiang
2026-04-28  7:45 ` Paolo Bonzini
2026-04-28  8:27   ` Chenyi Qiang
2026-04-28 15:50     ` Sean Christopherson
2026-04-29  1:08       ` Chenyi Qiang
2026-04-29 12:58         ` Sean Christopherson
2026-04-28 11:10 ` Chao Gao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox