[PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs
@ 2026-06-24 22:05 Sean Christopherson
  2026-06-25  7:30 ` Huang, Kai
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2026-06-24 22:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Ignore KVM's internal "service pending PV EOI" request if the vCPU has
disabled PV EOIs since the request was made.  Asserting that PV EOIs are
enabled can fail if reading guest memory in pv_eoi_get_user() fails, i.e.
if pv_eoi_test_and_clr_pending() bails early, *and* the vCPU also disables
PV EOIs.

  kernel BUG at arch/x86/kvm/lapic.c:3338!
  Oops: invalid opcode: 0000 [#1] SMP
  CPU: 4 UID: 1000 PID: 890 Comm: pv_eoi_test Not tainted 7.0.0-d585aa5894d8-vm #337 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_lapic_sync_from_vapic+0x12b/0x140 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_vcpu_ioctl_run+0x1075/0x1c30 [kvm]
   kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0xb5/0xb40
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Modules linked in: kvm_intel kvm irqbypass
  ---[ end trace 0000000000000000 ]---

Fixes: ae7a2a3fb6f8 ("KVM: host side for eoi optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 6f30bbdddb5a..38bba9a1114c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3371,6 +3371,12 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
 					struct kvm_lapic *apic)
 {
 	int vector;
+
+	if (unlikely(!pv_eoi_enabled(vcpu))) {
+		__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+		return;
+	}
+
 	/*
 	 * PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
 	 * and KVM_PV_EOI_ENABLED in guest memory as follows:
@@ -3382,8 +3388,6 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
 	 * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
 	 * 	-> host enabled PV EOI, guest executed EOI.
 	 */
-	BUG_ON(!pv_eoi_enabled(vcpu));
-
 	if (pv_eoi_test_and_clr_pending(vcpu))
 		return;
 	vector = apic_set_eoi(apic);

base-commit: 50406d35f5635e1cc523e61409d57e851b5f5df8
-- 
2.55.0.rc0.799.gd6f94ed593-goog


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs
  2026-06-24 22:05 [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs Sean Christopherson
@ 2026-06-25  7:30 ` Huang, Kai
  2026-06-25 15:33   ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Huang, Kai @ 2026-06-25  7:30 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, 2026-06-24 at 15:05 -0700, Sean Christopherson wrote:
> Ignore KVM's internal "service pending PV EOI" request if the vCPU has
> disabled PV EOIs since the request was made.  Asserting that PV EOIs are
> enabled can fail if reading guest memory in pv_eoi_get_user() fails, i.e.
> if pv_eoi_test_and_clr_pending() bails early, *and* the vCPU also disables
> PV EOIs.
> 
>   kernel BUG at arch/x86/kvm/lapic.c:3338!
>   Oops: invalid opcode: 0000 [#1] SMP
>   CPU: 4 UID: 1000 PID: 890 Comm: pv_eoi_test Not tainted 7.0.0-d585aa5894d8-vm #337 PREEMPT
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_lapic_sync_from_vapic+0x12b/0x140 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_arch_vcpu_ioctl_run+0x1075/0x1c30 [kvm]
>    kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
>    __x64_sys_ioctl+0x8a/0xd0
>    do_syscall_64+0xb5/0xb40
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   Modules linked in: kvm_intel kvm irqbypass
>   ---[ end trace 0000000000000000 ]---
> 
> Fixes: ae7a2a3fb6f8 ("KVM: host side for eoi optimization")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/lapic.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 6f30bbdddb5a..38bba9a1114c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3371,6 +3371,12 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
>  					struct kvm_lapic *apic)
>  {
>  	int vector;
> +
> +	if (unlikely(!pv_eoi_enabled(vcpu))) {
> +		__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
> +		return;
> +	}
> +

Do you think whether it's better to do this in kvm_lapic_set_pv_eoi(), i.e.,
when guest disables PV_EOI for this vCPU?  If we do so, it seems we can save the
unnecessary pv_eoi_set_pending() (which writes guest memory) called from
apic_sync_pv_eoi_to_guest(), or it's not possible to happen?

>  	/*
>  	 * PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
>  	 * and KVM_PV_EOI_ENABLED in guest memory as follows:
> @@ -3382,8 +3388,6 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
>  	 * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
>  	 * 	-> host enabled PV EOI, guest executed EOI.
>  	 */
> -	BUG_ON(!pv_eoi_enabled(vcpu));
> -
>  	if (pv_eoi_test_and_clr_pending(vcpu))
>  		return;
>  	vector = apic_set_eoi(apic);
> 
> base-commit: 50406d35f5635e1cc523e61409d57e851b5f5df8

Anyway, the code LGTM:

Reviewed-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs
  2026-06-25  7:30 ` Huang, Kai
@ 2026-06-25 15:33   ` Sean Christopherson
  2026-06-25 23:44     ` Huang, Kai
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2026-06-25 15:33 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Thu, Jun 25, 2026, Kai Huang wrote:
> On Wed, 2026-06-24 at 15:05 -0700, Sean Christopherson wrote:
> > Ignore KVM's internal "service pending PV EOI" request if the vCPU has
> > disabled PV EOIs since the request was made.  Asserting that PV EOIs are
> > enabled can fail if reading guest memory in pv_eoi_get_user() fails, i.e.
> > if pv_eoi_test_and_clr_pending() bails early, *and* the vCPU also disables
> > PV EOIs.
> > 
> >   kernel BUG at arch/x86/kvm/lapic.c:3338!
> >   Oops: invalid opcode: 0000 [#1] SMP
> >   CPU: 4 UID: 1000 PID: 890 Comm: pv_eoi_test Not tainted 7.0.0-d585aa5894d8-vm #337 PREEMPT
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:kvm_lapic_sync_from_vapic+0x12b/0x140 [kvm]
> >   Call Trace:
> >    <TASK>
> >    kvm_arch_vcpu_ioctl_run+0x1075/0x1c30 [kvm]
> >    kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
> >    __x64_sys_ioctl+0x8a/0xd0
> >    do_syscall_64+0xb5/0xb40
> >    entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >    </TASK>
> >   Modules linked in: kvm_intel kvm irqbypass
> >   ---[ end trace 0000000000000000 ]---
> > 
> > Fixes: ae7a2a3fb6f8 ("KVM: host side for eoi optimization")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/lapic.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 6f30bbdddb5a..38bba9a1114c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -3371,6 +3371,12 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
> >  					struct kvm_lapic *apic)
> >  {
> >  	int vector;
> > +
> > +	if (unlikely(!pv_eoi_enabled(vcpu))) {
> > +		__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
> > +		return;
> > +	}
> > +
> 
> Do you think whether it's better to do this in kvm_lapic_set_pv_eoi(), i.e.,
> when guest disables PV_EOI for this vCPU?

Yes, if we were doing the initial implementation.  I opted not to go that route
because it would require writing to guest memory to put the shared flag back to
KVM_PV_EOI_DISABLED, at which point this becomes a guest-visible change.  It's
probably fine?  But very, very, theoretically, a guest could toggle the feature
on/off and expect memory to remain unchanged.

And I want to get rid of the BUG_ON() no matter what, so I opted to leave the
WRMSR path alone and go with the minimal (albeit ugly) fix.

> If we do so, it seems we can save the unnecessary pv_eoi_set_pending() (which
> writes guest memory) called from apic_sync_pv_eoi_to_guest(), or it's not
> possible to happen?

No, the whole point of apic_sync_pv_eoi_to_guest() is to do pv_eoi_set_pending().
The naming of the flags and the way the code is written obfuscates the logic to
some extent, but what it's doing in pseudocode is:

    IF vCPU has in-service IRQ and no pending IRQs; THEN
        vCPU.pv_eoi = ACTIVE
    FI

where vCPU.pv_eoi is the shared memory.  The guest side, in its APIC EOI path:

  static notrace __maybe_unused void kvm_guest_apic_eoi_write(void)
  {
	/**
	 * This relies on __test_and_clear_bit to modify the memory
	 * in a way that is atomic with respect to the local CPU.
	 * The hypervisor only accesses this memory from the local CPU so
	 * there's no need for lock or memory barriers.
	 * An optimization barrier is implied in apic write.
	 */
	if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
		return;
	apic_native_eoi();
  }

then does:

    IF vCPU.pv_eoi != ACTIVE; THEN
        do APIC EOI
    FI

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs
  2026-06-25 15:33   ` Sean Christopherson
@ 2026-06-25 23:44     ` Huang, Kai
  2026-06-26 17:44       ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Huang, Kai @ 2026-06-25 23:44 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Thu, 2026-06-25 at 08:33 -0700, Sean Christopherson wrote:
> On Thu, Jun 25, 2026, Kai Huang wrote:
> > On Wed, 2026-06-24 at 15:05 -0700, Sean Christopherson wrote:
> > > Ignore KVM's internal "service pending PV EOI" request if the vCPU has
> > > disabled PV EOIs since the request was made.  Asserting that PV EOIs are
> > > enabled can fail if reading guest memory in pv_eoi_get_user() fails, i.e.
> > > if pv_eoi_test_and_clr_pending() bails early, *and* the vCPU also disables
> > > PV EOIs.
> > > 
> > >   kernel BUG at arch/x86/kvm/lapic.c:3338!
> > >   Oops: invalid opcode: 0000 [#1] SMP
> > >   CPU: 4 UID: 1000 PID: 890 Comm: pv_eoi_test Not tainted 7.0.0-d585aa5894d8-vm #337 PREEMPT
> > >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > >   RIP: 0010:kvm_lapic_sync_from_vapic+0x12b/0x140 [kvm]
> > >   Call Trace:
> > >    <TASK>
> > >    kvm_arch_vcpu_ioctl_run+0x1075/0x1c30 [kvm]
> > >    kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
> > >    __x64_sys_ioctl+0x8a/0xd0
> > >    do_syscall_64+0xb5/0xb40
> > >    entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > >    </TASK>
> > >   Modules linked in: kvm_intel kvm irqbypass
> > >   ---[ end trace 0000000000000000 ]---
> > > 
> > > Fixes: ae7a2a3fb6f8 ("KVM: host side for eoi optimization")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/lapic.c | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index 6f30bbdddb5a..38bba9a1114c 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -3371,6 +3371,12 @@ static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
> > >  					struct kvm_lapic *apic)
> > >  {
> > >  	int vector;
> > > +
> > > +	if (unlikely(!pv_eoi_enabled(vcpu))) {
> > > +		__clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
> > > +		return;
> > > +	}
> > > +
> > 
> > Do you think whether it's better to do this in kvm_lapic_set_pv_eoi(), i.e.,
> > when guest disables PV_EOI for this vCPU?
> 
> Yes, if we were doing the initial implementation.  I opted not to go that route
> because it would require writing to guest memory to put the shared flag back to
> KVM_PV_EOI_DISABLED, at which point this becomes a guest-visible change.  It's
> probably fine?  But very, very, theoretically, a guest could toggle the feature
> on/off and expect memory to remain unchanged.

OK.  I thought it's OK to not write guest memory to put flag back to
KVM_PV_EOI_DISABLED.  Guest has explicitly asked to disable this feature, so it
shouldn't use that flag anyway.

> 
> And I want to get rid of the BUG_ON() no matter what, so I opted to leave the
> WRMSR path alone and go with the minimal (albeit ugly) fix.
> 
> > If we do so, it seems we can save the unnecessary pv_eoi_set_pending() (which
> > writes guest memory) called from apic_sync_pv_eoi_to_guest(), or it's not
> > possible to happen?
> 
> No, the whole point of apic_sync_pv_eoi_to_guest() is to do pv_eoi_set_pending().
> The naming of the flags and the way the code is written obfuscates the logic to
> some extent, but what it's doing in pseudocode is:
> 
>     IF vCPU has in-service IRQ and no pending IRQs; THEN
>         vCPU.pv_eoi = ACTIVE
>     FI

This pseudocode is much clearer than the actual code :-)

I was kinda thinking whether it's possible that there are two IRQs when
vCPU.pv_eoi is active (e.g., one in IRR and one in ISR, with different vector),
but from the code right it's not possible:

        if (!pv_eoi_enabled(vcpu) ||                                           
            /* IRR set or many bits in ISR: could be nested. */
            apic->irr_pending ||
	    ...)) {
		return;
	}
	pv_eoi_set_pending(apic->vcpu);

The reason behind this still eludes me :-(

The CPU cannot execute another vector in ISR until the highest one is EOI-ed,
right?

> 
> where vCPU.pv_eoi is the shared memory.  The guest side, in its APIC EOI path:
> 
>   static notrace __maybe_unused void kvm_guest_apic_eoi_write(void)
>   {
> 	/**
> 	 * This relies on __test_and_clear_bit to modify the memory
> 	 * in a way that is atomic with respect to the local CPU.
> 	 * The hypervisor only accesses this memory from the local CPU so
> 	 * there's no need for lock or memory barriers.
> 	 * An optimization barrier is implied in apic write.
> 	 */
> 	if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
> 		return;
> 	apic_native_eoi();
>   }
> 
> then does:
> 
>     IF vCPU.pv_eoi != ACTIVE; THEN
>         do APIC EOI
>     FI

Yeah.  Thanks for the education.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs
  2026-06-25 23:44     ` Huang, Kai
@ 2026-06-26 17:44       ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-26 17:44 UTC (permalink / raw)
  To: Kai Huang
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Thu, Jun 25, 2026, Kai Huang wrote:
> On Thu, 2026-06-25 at 08:33 -0700, Sean Christopherson wrote:
> > On Thu, Jun 25, 2026, Kai Huang wrote:
> > > On Wed, 2026-06-24 at 15:05 -0700, Sean Christopherson wrote:
> I was kinda thinking whether it's possible that there are two IRQs when
> vCPU.pv_eoi is active (e.g., one in IRR and one in ISR, with different vector),
> but from the code right it's not possible:
> 
>         if (!pv_eoi_enabled(vcpu) ||                                           
>             /* IRR set or many bits in ISR: could be nested. */
>             apic->irr_pending ||
> 	    ...)) {
> 		return;
> 	}
> 	pv_eoi_set_pending(apic->vcpu);
> 
> The reason behind this still eludes me :-(

The PV EOI stuff is all about eliding the EOIs in the guest in order to avoid
relatively useless VM-Exits (this pre-dates hardware virtualization of EOIs).

Instead of having the guest explicitly do EOI, KVM sets two flags: one to note
to itself that there is/was a pending PV EOI, and another that's shared with the
vCPU to track whether or not the guest ack'd the pending PV EOI.  On the next
VM-Exit, KVM checks its internal pending PV EOI flag, and then does an actual
EOI on behalf of the guest if the shared bit was cleared, i.e. if the PV EOI was
ack'd by the guest.

The irr_pending check above effectively disables PV EOI, because PV EOI only has
a single bit, i.e. can only track a single IRQ, and because KVM needs to know
precisely when the pending IRQ is unblocked, i.e. can't lazily wait until the
next VM-Exit.

> The CPU cannot execute another vector in ISR until the highest one is EOI-ed,
> right?

Nope.  The rules for delivery, i.e. for moving an IRQ from IRR => ISR, are that
the IRQ's priority must be higher than PPR, the exact IRQ isn't in-progress (i.e.
its ISR bit isn't set), and that IRQs aren't generally blocked by the core/CPU.

From "12.8.4 Interrupt Acceptance for Fixed Interrupts"

  If the local APIC receives an interrupt with an interrupt-priority class higher
  than that of the interrupt currently in service, and interrupts are enabled in
  the processor core, the local APIC dispatches the higher priority interrupt to
  the processor immediately (without waiting for a write to the EOI register).

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-26 17:44 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 22:05 [PATCH] KVM: x86: Ignore pending PV EOI if the vCPU has since disabled PV EOIs Sean Christopherson
2026-06-25  7:30 ` Huang, Kai
2026-06-25 15:33   ` Sean Christopherson
2026-06-25 23:44     ` Huang, Kai
2026-06-26 17:44       ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox