* [RFC PATCH 1/3] KVM: SVM: Fix clearing IRQ window inhibit with nested guests
2025-07-18 6:43 [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling Naveen N Rao (AMD)
@ 2025-07-18 6:43 ` Naveen N Rao (AMD)
2025-07-18 6:43 ` [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs Naveen N Rao (AMD)
` (2 subsequent siblings)
3 siblings, 0 replies; 11+ messages in thread
From: Naveen N Rao (AMD) @ 2025-07-18 6:43 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
Clearing the IRQ window inhibit today relies on interrupt window
interception, but that is not always reachable when nested guests are
involved.
If L1 is intercepting IRQs, then interrupt_window_interception() will
never be reached while L2 is active, because the only reason KVM
would set the V_IRQ intercept in vmcb02 would be on behalf of L1, i.e.
because of vmcb12. svm_clear_vintr() always operates on (at least)
vmcb01, and VMRUN unconditionally sets GIF=1, which means that
enter_svm_guest_mode() will always do svm_clear_vintr() via
svm_set_gif(svm, true). I.e. KVM will keep the VM-wide inhibit set until
control transfers back to L1 *and* an interrupt window is triggered.
If L1 is not intercepting IRQs and IRQs are enabled in L2, KVM may
immediately inject L1's ExtINT into L2 without ever taking an interrupt
window interception.
Address this by clearing the IRQ window inhibit when KVM actually
injects an interrupt and there are no further injectable interrupts.
That way, if L1 isn't intercepting IRQs, KVM will drop the inhibit as
soon as an interrupt is injected into L2. And if L1 is intercepting
IRQs, KVM will keep the inhibit until the IRQ is injected into L2. So,
AVIC won't be left inhibited.
---
arch/x86/kvm/svm/svm.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
I think patch tags for this should be:
From: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
- Naveen
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d9931c6c4bc6..bbe439c0e36a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3108,20 +3108,6 @@ static int interrupt_window_interception(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
svm_clear_vintr(to_svm(vcpu));
- /*
- * If not running nested, for AVIC, the only reason to end up here is ExtINTs.
- * In this case AVIC was temporarily disabled for
- * requesting the IRQ window and we have to re-enable it.
- *
- * If running nested, still remove the VM wide AVIC inhibit to
- * support case in which the interrupt window was requested when the
- * vCPU was not running nested.
-
- * All vCPUs which run still run nested, will remain to have their
- * AVIC still inhibited due to per-cpu AVIC inhibition.
- */
- kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
-
++vcpu->stat.irq_window_exits;
return 1;
}
@@ -3684,6 +3670,20 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
struct vcpu_svm *svm = to_svm(vcpu);
u32 type;
+ /*
+ * If AVIC was inhibited in order to detect an IRQ window, and there's
+ * no other injectable interrupts pending or L2 is active (see below),
+ * then drop the inhibit as the window has served its purpose.
+ *
+ * If L2 is active, this path is reachable if L1 is not intercepting
+ * IRQs, i.e. if KVM is injecting L1 IRQs into L2. AVIC is locally
+ * inhibited while L2 is active; drop the VM-wide inhibit to optimize
+ * the case in which the interrupt window was requested while L1 was
+ * active (the vCPU was not running nested).
+ */
+ if (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))
+ kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+
if (vcpu->arch.interrupt.soft) {
if (svm_update_soft_interrupt_rip(vcpu))
return;
--
2.50.1
^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs
2025-07-18 6:43 [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling Naveen N Rao (AMD)
2025-07-18 6:43 ` [RFC PATCH 1/3] KVM: SVM: Fix clearing IRQ window inhibit with nested guests Naveen N Rao (AMD)
@ 2025-07-18 6:43 ` Naveen N Rao (AMD)
2026-01-14 19:55 ` Sean Christopherson
2025-07-18 6:43 ` [RFC PATCH 3/3] KVM: SVM: Optimize IRQ window inhibit handling Naveen N Rao (AMD)
2025-09-09 14:30 ` [RFC PATCH 0/3] KVM: SVM: Fix " Naveen N Rao
3 siblings, 1 reply; 11+ messages in thread
From: Naveen N Rao (AMD) @ 2025-07-18 6:43 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
IRQ window inhibits can be requested by multiple vCPUs at the same time
for injecting interrupts meant for different vCPUs. However, the AVIC
inhibition is VM-wide, and hence it is possible for the inhibit to be
cleared prematurely by the first vCPU that obtains an IRQ window even
though a second vCPU is still waiting for its IRQ window. This is likely
not a functional issue, since the other vCPU will again see that
interrupts are pending injection (due to KVM_REQ_EVENT) and will again
request an IRQ window inhibit. However, it can result in AVIC being
rapidly toggled, leading to high contention on apicv_update_lock and
degraded guest performance.
Address this by maintaining a VM-wide count of the number of vCPUs that
have requested an IRQ window. Set/clear the inhibit reason when the
count transitions between 0 and 1. This ensures that the inhibit reason
is not cleared as long as some vCPUs are still waiting for an IRQ
window.
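The 0<->1 transition rule can be sketched in a few lines of user-space C. This is a hypothetical stand-alone illustration, not KVM code: names are made up, and a single writer is assumed (in KVM the count is only written while holding apicv_update_lock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kvm->arch fields. */
static atomic_int nr_irq_window_req;
static bool irqwin_inhibited;

/*
 * Toggle the VM-wide inhibit only when the request count transitions
 * 0 -> 1 (set the inhibit) or 1 -> 0 (clear the inhibit).
 */
static void irq_window_req(bool inc)
{
	int add = inc ? 1 : -1;

	/*
	 * atomic_fetch_add() returns the old value, so old + add is the
	 * new count, mirroring the kernel's atomic_add_return().
	 */
	if (atomic_fetch_add(&nr_irq_window_req, add) + add == (inc ? 1 : 0))
		irqwin_inhibited = inc;
}
```

With two requesters, the inhibit stays set until the second one releases it, which is exactly the premature-clear scenario described above.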
---
arch/x86/include/asm/kvm_host.h | 16 +++++++++++++++
arch/x86/kvm/svm/svm.h | 1 +
arch/x86/kvm/svm/svm.c | 36 +++++++++++++++++++++++----------
arch/x86/kvm/x86.c | 19 +++++++++++++++++
4 files changed, 61 insertions(+), 11 deletions(-)
I think patch tags for this should be:
From: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
- Naveen
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f19a76d3ca0e..b781b4f1d304 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1395,6 +1395,10 @@ struct kvm_arch {
struct kvm_pit *vpit;
#endif
atomic_t vapics_in_nmi_mode;
+
+ /* Keep this in a cacheline separate from apicv_update_lock */
+ atomic_t apicv_nr_irq_window_req;
+
struct mutex apic_map_lock;
struct kvm_apic_map __rcu *apic_map;
atomic_t apic_map_dirty;
@@ -2263,6 +2267,18 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
}
+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc);
+
+static inline void kvm_inc_apicv_irq_window_req(struct kvm *kvm)
+{
+ kvm_inc_or_dec_irq_window_inhibit(kvm, true);
+}
+
+static inline void kvm_dec_apicv_irq_window_req(struct kvm *kvm)
+{
+ kvm_inc_or_dec_irq_window_inhibit(kvm, false);
+}
+
int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len);
void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 58b9d168e0c8..9ef6b5494e77 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -331,6 +331,7 @@ struct vcpu_svm {
bool guest_state_loaded;
+ bool avic_irq_window;
bool x2avic_msrs_intercepted;
/* Guest GIF value, used when vGIF is not enabled */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bbe439c0e36a..0211b713174c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3681,8 +3681,11 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
* the case in which the interrupt window was requested while L1 was
* active (the vCPU was not running nested).
*/
- if (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))
- kvm_clear_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+ if (svm->avic_irq_window &&
+ (!kvm_cpu_has_injectable_intr(vcpu) || is_guest_mode(vcpu))) {
+ svm->avic_irq_window = false;
+ kvm_dec_apicv_irq_window_req(svm->vcpu.kvm);
+ }
if (vcpu->arch.interrupt.soft) {
if (svm_update_soft_interrupt_rip(vcpu))
@@ -3895,17 +3898,28 @@ static void svm_enable_irq_window(struct kvm_vcpu *vcpu)
*/
if (vgif || gif_set(svm)) {
/*
- * IRQ window is not needed when AVIC is enabled,
- * unless we have pending ExtINT since it cannot be injected
- * via AVIC. In such case, KVM needs to temporarily disable AVIC,
- * and fallback to injecting IRQ via V_IRQ.
+ * KVM only enables IRQ windows when AVIC is enabled if there's
+ * pending ExtINT since it cannot be injected via AVIC (ExtINT
+ * bypasses the local APIC). V_IRQ is ignored by hardware when
+ * AVIC is enabled, and so KVM needs to temporarily disable
+ * AVIC in order to detect when it's ok to inject the ExtINT.
*
- * If running nested, AVIC is already locally inhibited
- * on this vCPU, therefore there is no need to request
- * the VM wide AVIC inhibition.
+ * If running nested, AVIC is already locally inhibited on this
+ * vCPU (L2 vCPUs use a different MMU that never maps the AVIC
+ * backing page), therefore there is no need to increment the
+ * VM-wide AVIC inhibit. KVM will re-evaluate events when the
+ * vCPU exits to L1 and enable an IRQ window if the ExtINT is
+ * still pending.
+ *
+ * Note, the IRQ window inhibit needs to be updated even if
+ * AVIC is inhibited for a different reason, as KVM needs to
+ * keep AVIC inhibited if the other reason is cleared and there
+ * is still an injectable interrupt pending.
*/
- if (!is_guest_mode(vcpu))
- kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_IRQWIN);
+ if (enable_apicv && !svm->avic_irq_window && !is_guest_mode(vcpu)) {
+ svm->avic_irq_window = true;
+ kvm_inc_apicv_irq_window_req(vcpu->kvm);
+ }
svm_set_vintr(svm);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c4..216d1801a4f2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10575,6 +10575,25 @@ void kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
}
EXPORT_SYMBOL_GPL(kvm_set_or_clear_apicv_inhibit);
+void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
+{
+ int add = inc ? 1 : -1;
+
+ if (!enable_apicv)
+ return;
+
+ /*
+ * Strictly speaking, the lock is only needed if going 0->1 or 1->0,
+ * a la atomic_dec_and_mutex_lock. However, ExtINTs are rare and
+ * only target a single CPU, so that is the common case; do not
+ * bother eliding the down_write()/up_write() pair.
+ */
+ guard(rwsem_write)(&kvm->arch.apicv_update_lock);
+ if (atomic_add_return(add, &kvm->arch.apicv_nr_irq_window_req) == inc)
+ __kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_IRQWIN, inc);
+}
+EXPORT_SYMBOL_GPL(kvm_inc_or_dec_irq_window_inhibit);
+
static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
{
if (!kvm_apic_present(vcpu))
--
2.50.1
* Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs
2025-07-18 6:43 ` [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs Naveen N Rao (AMD)
@ 2026-01-14 19:55 ` Sean Christopherson
2026-01-14 20:19 ` Sean Christopherson
2026-01-22 14:49 ` Naveen N Rao
0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-01-14 19:55 UTC (permalink / raw)
To: Naveen N Rao (AMD)
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
Finally mustered up the brainpower to land this series :-)
On Fri, Jul 18, 2025, Naveen N Rao (AMD) wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f19a76d3ca0e..b781b4f1d304 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1395,6 +1395,10 @@ struct kvm_arch {
> struct kvm_pit *vpit;
> #endif
> atomic_t vapics_in_nmi_mode;
> +
> + /* Keep this in a cacheline separate from apicv_update_lock */
A comment won't suffice.  To isolate what we want to isolate, tag things with
__aligned().  Ideally we would use __cacheline_aligned_in_smp, but AFAIK that
can't be used within a struct as it uses .section tags :-(
And revisiting your analysis from
https://lore.kernel.org/all/evszbck4u7afiu7lkafwcu3rs6a7io2zkv53rygrgz544op4ur@m2bugote2wdl:
: Also, note that introducing apicv_irq_window after apicv_inhibit_reasons
: is degrading performance in the AVIC disabled case too. So, it is likely
: that some other cacheline below apicv_inhibit_reasons in kvm_arch may
: also be contributing to this.
I strongly suspect past-you were correct: the problem isn't that apicv_nr_irq_window_req
is in the same cacheline as apicv_update_lock, it's that apicv_nr_irq_window_req
landed in the same cacheline as _other_ stuff.
Looking at the struct layout from kvm-x86-next-2025.01.14, putting apicv_irq_window
after apicv_inhibit_reasons _did_ put it on a separate cacheline from
apicv_update_lock:
/* --- cacheline 517 boundary (33088 bytes) was 24 bytes ago --- */
struct kvm_apic_map * apic_map; /* 33112 8 */
atomic_t apic_map_dirty; /* 33120 4 */
bool apic_access_memslot_enabled; /* 33124 1 */
bool apic_access_memslot_inhibited; /* 33125 1 */
/* XXX 2 bytes hole, try to pack */
struct rw_semaphore apicv_update_lock; /* 33128 152 */
/* XXX last struct has 1 hole */
/* --- cacheline 520 boundary (33280 bytes) --- */
unsigned long apicv_inhibit_reasons; /* 33280 8 */
atomic_t apicv_irq_window; /* 33288 4 */
/* XXX 4 bytes hole, try to pack */
gpa_t wall_clock; /* 33296 8 */
bool mwait_in_guest; /* 33304 1 */
bool hlt_in_guest; /* 33305 1 */
bool pause_in_guest; /* 33306 1 */
bool cstate_in_guest; /* 33307 1 */
/* XXX 4 bytes hole, try to pack */
unsigned long irq_sources_bitmap; /* 33312 8 */
s64 kvmclock_offset; /* 33320 8 */
raw_spinlock_t tsc_write_lock; /* 33328 64 */
/* --- cacheline 521 boundary (33344 bytes) was 48 bytes ago --- */
Which fits with my reaction that the irq_window counter being in the same cacheline
as apicv_update_lock shouldn't be problematic, because the counter is only ever
written while holding the lock. I.e. the counter is written only when the lock
cacheline is likely already pulled in in an exclusive state.
What appears to be problematic is that the counter is in the same cacheline as
several relatively hot read-mostly fields:
apicv_inhibit_reasons - read by every vCPU on every VM-Enter
xxx_in_guest (now disabled_exits) - read on page faults, if a vCPU takes a
PAUSE exit, if a vCPU is scheduled out, etc.
kvmclock_offset - read every time a vCPU needs to refresh kvmclock
So I actually think we want apicv_update_lock and apicv_nr_irq_window_req to
_share_ a cacheline, and then isolate that cacheline from everything else. Because
those two fields are effectively write-mostly, whereas most things in kvm-arch are
read-mostly. I.e. end up with this:
/*
* Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an
* asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details).
*
* Force apicv_update_lock and apicv_nr_irq_window_req to reside in a
* dedicated cacheline. They are write-mostly, whereas most everything
* else in kvm_arch is read-mostly.
*/
struct rw_semaphore apicv_update_lock __aligned(L1_CACHE_BYTES);
atomic_t apicv_nr_irq_window_req;
/*
* As above, isolate apicv_update_lock and apicv_nr_irq_window_req on
* their own cacheline. Note that apicv_inhibit_reasons is read-mostly
* even though it's protected by apicv_update_lock (toggling VM-wide
* inhibits is rare; _checking_ for inhibits is common).
*/
unsigned long apicv_inhibit_reasons __aligned(L1_CACHE_BYTES);
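As a stand-alone check of that layout idea (hypothetical struct and field names, 64 bytes assumed for L1_CACHE_BYTES), the aligned attribute plus offsetof() can confirm that the lock and the counter share one cacheline while the read-mostly field starts a fresh one:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BYTES 64	/* stand-in for L1_CACHE_BYTES */

struct demo_arch {
	unsigned long read_mostly_a;		/* some earlier field */

	/*
	 * 32-byte stand-in for the rwsem; the attribute starts it on a
	 * fresh cacheline, leaving room for the counter on the same line.
	 */
	struct { long long word[4]; } update_lock
		__attribute__((aligned(CACHE_BYTES)));
	int nr_irq_window_req;			/* shares the lock's line */

	/* Read-mostly; pushed onto its own cacheline. */
	unsigned long inhibit_reasons
		__attribute__((aligned(CACHE_BYTES)));
};
```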
I also want to land the optimization separately, so that it can be properly
documented, justified, and analyzed by others.
I pushed a rebased version (compile-tested only at this time) with the above change to:
https://github.com/sean-jc/linux.git svm/avic_irq_window
Can you run your perf tests to see if that approach also eliminates the degradation
relative to avic=0 that you observed?
Thanks!
* Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs
2026-01-14 19:55 ` Sean Christopherson
@ 2026-01-14 20:19 ` Sean Christopherson
2026-01-22 14:49 ` Naveen N Rao
1 sibling, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-01-14 20:19 UTC (permalink / raw)
To: Naveen N Rao (AMD)
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Wed, Jan 14, 2026, Sean Christopherson wrote:
> Finally mustered up the brainpower to land this series :-)
>
> On Fri, Jul 18, 2025, Naveen N Rao (AMD) wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f19a76d3ca0e..b781b4f1d304 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1395,6 +1395,10 @@ struct kvm_arch {
> > struct kvm_pit *vpit;
> > #endif
> > atomic_t vapics_in_nmi_mode;
> > +
> > + /* Keep this in a cacheline separate from apicv_update_lock */
>
> A comment won't suffice. To isolate what we want to isolate, tag things with
> __aligned(). Ideally we would use __cacheline_aligned_in_smp, but AFAIK that
> can't be used within a struct as it uses .section tags, :-(
Gah, finally found what I was looking for: ____cacheline_aligned
* Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs
2026-01-14 19:55 ` Sean Christopherson
2026-01-14 20:19 ` Sean Christopherson
@ 2026-01-22 14:49 ` Naveen N Rao
2026-01-23 19:55 ` Sean Christopherson
1 sibling, 1 reply; 11+ messages in thread
From: Naveen N Rao @ 2026-01-22 14:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Wed, Jan 14, 2026 at 11:55:57AM -0800, Sean Christopherson wrote:
> Finally mustered up the brainpower to land this series :-)
Yay! :)
>
> On Fri, Jul 18, 2025, Naveen N Rao (AMD) wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f19a76d3ca0e..b781b4f1d304 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1395,6 +1395,10 @@ struct kvm_arch {
> > struct kvm_pit *vpit;
> > #endif
> > atomic_t vapics_in_nmi_mode;
> > +
> > + /* Keep this in a cacheline separate from apicv_update_lock */
>
> A comment won't suffice. To isolate what we want to isolate, tag things with
> __aligned(). Ideally we would use __cacheline_aligned_in_smp, but AFAIK that
> can't be used within a struct as it uses .section tags, :-(
>
> And revisiting your analysis from
> https://lore.kernel.org/all/evszbck4u7afiu7lkafwcu3rs6a7io2zkv53rygrgz544op4ur@m2bugote2wdl:
>
> : Also, note that introducing apicv_irq_window after apicv_inhibit_reasons
> : is degrading performance in the AVIC disabled case too. So, it is likely
> : that some other cacheline below apicv_inhibit_reasons in kvm_arch may
> : also be contributing to this.
>
> I strongly suspect past-you were correct: the problem isn't that apicv_nr_irq_window_req
> is in the same cacheline with apicv_update_lock, it's that apicv_nr_irq_window_req
> landed in the same cachline as _other_ stuff.
>
> Looking at the struct layout from kvm-x86-next-2025.01.14, putting apicv_irq_window
> after apicv_inhibit_reasons _did_ put it on a separate cacheline from
> apicv_update_lock:
I suppose you meant kvm-x86-next-2026.01.14 (2026 and not 2025). I'm
fairly certain that when I tested this, all three of apicv_update_lock,
apicv_inhibit_reasons and the irq_window count were ending up in the
same cacheline. I specifically tested moving each of those out to a
separate cacheline (including apicv_inhibit_reasons), but as far as I
remember, the only time I noticed a difference was when moving the
irq_window count elsewhere.
>
> /* --- cacheline 517 boundary (33088 bytes) was 24 bytes ago --- */
> struct kvm_apic_map * apic_map; /* 33112 8 */
> atomic_t apic_map_dirty; /* 33120 4 */
> bool apic_access_memslot_enabled; /* 33124 1 */
> bool apic_access_memslot_inhibited; /* 33125 1 */
>
> /* XXX 2 bytes hole, try to pack */
>
> struct rw_semaphore apicv_update_lock; /* 33128 152 */
>
> /* XXX last struct has 1 hole */
>
> /* --- cacheline 520 boundary (33280 bytes) --- */
> unsigned long apicv_inhibit_reasons; /* 33280 8 */
> atomic_t apicv_irq_window; /* 33288 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> gpa_t wall_clock; /* 33296 8 */
> bool mwait_in_guest; /* 33304 1 */
> bool hlt_in_guest; /* 33305 1 */
> bool pause_in_guest; /* 33306 1 */
> bool cstate_in_guest; /* 33307 1 */
>
> /* XXX 4 bytes hole, try to pack */
>
> unsigned long irq_sources_bitmap; /* 33312 8 */
> s64 kvmclock_offset; /* 33320 8 */
> raw_spinlock_t tsc_write_lock; /* 33328 64 */
> /* --- cacheline 521 boundary (33344 bytes) was 48 bytes ago --- */
>
>
> Which fits with my reaction that the irq_window counter being in the same cacheline
> as apicv_update_lock shouldn't be problematic, because the counter is only ever
> written while holding the lock. I.e. the counter is written only when the lock
> cacheline is likely already pulled in in an exclusive state.
Indeed.
>
> What appears to be problematic is that the counter is in the same cacheline as
> several relatively hot read-mostly fields:
>
> apicv_inhibit_reasons - read by every vCPU on every VM-Enter
> xxx_in_guest (now disabled_exits) - read on page faults, if a vCPU
> takes a PAUSE exit, if a vCPU is scheduled out, etc.
> kvmclock_offset - read every time a vCPU needs to refresh kvmclock
>
> So I actually think we want apicv_update_lock and apicv_nr_irq_window_req to
> _share_ a cacheline, and then isolate that cacheline from everything else. Because
> those two fields are effectively write-mostly, whereas most things in kvm-arch are
> read-mostly. I.e. end up with this:
>
> /*
> * Protects apicv_inhibit_reasons and apicv_nr_irq_window_req (with an
> * asterisk, see kvm_inc_or_dec_irq_window_inhibit() for details).
> *
> * Force apicv_update_lock and apicv_nr_irq_window_req to reside in a
> * dedicated cacheline. They are write-mostly, whereas most everything
> * else in kvm_arch is read-mostly.
> */
> struct rw_semaphore apicv_update_lock __aligned(L1_CACHE_BYTES);
> atomic_t apicv_nr_irq_window_req;
>
> /*
> * As above, isolate apicv_update_lock and apicv_nr_irq_window_req on
> * their own cacheline. Note that apicv_inhibit_reasons is read-mostly
> * even though it's protected by apicv_update_lock (toggling VM-wide
> * inhibits is rare; _checking_ for inhibits is common).
> */
> unsigned long apicv_inhibit_reasons __aligned(L1_CACHE_BYTES);
Nice, isolating those in a separate cacheline looks to be helping.
>
> I also want to land the optimization separately, so that it can be properly
> documented, justified, and analyzed by others.
>
> I pushed a rebased version (compile-tested only at this time) with the above change to:
>
> https://github.com/sean-jc/linux.git svm/avic_irq_window
>
> Can you run your perf tests to see if that approach also eliminates the degradation
> relative to avic=0 that you observed?
Yes, this definitely seems to be helping get rid of that odd performance
drop I was seeing earlier. I'll run a couple more tests and report back
by next week if I see anything off. Otherwise, this is looking good to
me and if you want to apply this to -next, I'm fine with that:
Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Thanks for all your help with this (and Paolo)!
- Naveen
* Re: [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs
2026-01-22 14:49 ` Naveen N Rao
@ 2026-01-23 19:55 ` Sean Christopherson
0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2026-01-23 19:55 UTC (permalink / raw)
To: Naveen N Rao
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Thu, Jan 22, 2026, Naveen N Rao wrote:
> On Wed, Jan 14, 2026 at 11:55:57AM -0800, Sean Christopherson wrote:
> > I also want to land the optimization separately, so that it can be properly
> > documented, justified, and analyzed by others.
> >
> > I pushed a rebased version (compile-tested only at this time) with the above change to:
> >
> > https://github.com/sean-jc/linux.git svm/avic_irq_window
> >
> > Can you run you perf tests to see if that aproach also eliminates the degredation
> > relative to avic=0 that you observed?
>
> Yes, this definitely seems to be helping get rid of that odd performance
> drop I was seeing earlier. I'll run a couple more tests and report back
> by next week if I see anything off. Otherwise, this is looking good to
> me and if you want to apply this to -next, I'm fine with that:
> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org>
Nice, thanks for testing! I'll post a proper series later today (or early next
week), and will wait until after the merge window to apply (a little too close
for comfort at this point).
* [RFC PATCH 3/3] KVM: SVM: Optimize IRQ window inhibit handling
2025-07-18 6:43 [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling Naveen N Rao (AMD)
2025-07-18 6:43 ` [RFC PATCH 1/3] KVM: SVM: Fix clearing IRQ window inhibit with nested guests Naveen N Rao (AMD)
2025-07-18 6:43 ` [RFC PATCH 2/3] KVM: SVM: Fix IRQ window inhibit handling across multiple vCPUs Naveen N Rao (AMD)
@ 2025-07-18 6:43 ` Naveen N Rao (AMD)
2025-09-09 14:30 ` [RFC PATCH 0/3] KVM: SVM: Fix " Naveen N Rao
3 siblings, 0 replies; 11+ messages in thread
From: Naveen N Rao (AMD) @ 2025-07-18 6:43 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
IRQ windows represent times during which an IRQ can be injected into a
vCPU, and thus represent times when a vCPU is running with RFLAGS.IF=1
and GIF enabled (TPR/PPR don't matter since KVM controls interrupt
injection and it only injects one interrupt at a time). On SVM, when
emulating the local APIC (i.e., AVIC disabled), KVM detects IRQ windows
by injecting a dummy virtual interrupt through VMCB.V_IRQ and
intercepting virtual interrupts (INTERCEPT_VINTR). This intercept
triggers as soon as the guest enables interrupts and is about to take
the dummy interrupt, at which point the actual interrupt can be injected
through VMCB.EVENTINJ.
When AVIC is enabled, VMCB.V_IRQ is ignored by the hardware and so
detecting IRQ windows requires AVIC to be inhibited. However, this is
only necessary for ExtINTs since all other interrupts can be injected
either by directly setting IRR in the APIC backing page and letting the
AVIC hardware inject the interrupt into the guest, or via VMCB.V_NMI for
NMIs.
If AVIC is enabled but inhibited for some other reason, KVM has to
request the IRQ window inhibit every time it has to inject an interrupt
into the guest. This is because APICv inhibits are dynamic in nature, so
KVM has to be sure that AVIC stays inhibited for the purpose of detecting
an IRQ window even if the other inhibit is cleared in the meantime.
This is particularly problematic with APICV_INHIBIT_REASON_PIT_REINJ
which stays set throughout the life of the guest and results in KVM
rapidly toggling IRQ window inhibit resulting in contention on
apicv_update_lock.
Address this by setting and clearing APICV_INHIBIT_REASON_IRQWIN
lazily: if some other inhibit reason is already set, just adjust the
IRQ window request count and do not update apicv_inhibit_reasons
immediately. If any other inhibit reason is set/cleared in the meantime,
re-evaluate APICV_INHIBIT_REASON_IRQWIN by checking the IRQ window
request count and update apicv_inhibit_reasons appropriately. Otherwise,
the IRQ window request count is simply incremented/decremented each time
an IRQ window is requested or released. This eliminates most of the
contention on the apicv_update_lock semaphore and with it much of the
performance degradation.
---
arch/x86/kvm/x86.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
I think patch tags for this should be:
From: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
- Naveen
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 216d1801a4f2..845afcf6e85f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10534,7 +10534,11 @@ void __kvm_set_or_clear_apicv_inhibit(struct kvm *kvm,
old = new = kvm->arch.apicv_inhibit_reasons;
- set_or_clear_apicv_inhibit(&new, reason, set);
+ if (reason != APICV_INHIBIT_REASON_IRQWIN)
+ set_or_clear_apicv_inhibit(&new, reason, set);
+
+ set_or_clear_apicv_inhibit(&new, APICV_INHIBIT_REASON_IRQWIN,
+ atomic_read(&kvm->arch.apicv_nr_irq_window_req));
if (!!old != !!new) {
/*
@@ -10582,6 +10586,26 @@ void kvm_inc_or_dec_irq_window_inhibit(struct kvm *kvm, bool inc)
if (!enable_apicv)
return;
+ /*
+ * IRQ windows are requested either because of ExtINT injections, or
+ * because APICv is already disabled/inhibited for another reason.
+ * While ExtINT injections are rare and should not happen while the
+ * vCPU is running its actual workload, it's worth avoiding thrashing
+ * if the IRQ window is being requested because APICv is already
+ * inhibited. So, toggle the actual inhibit (which requires taking
+ * the lock for write) if and only if there's no other inhibit.
+ * kvm_set_or_clear_apicv_inhibit() always evaluates the IRQ window
+ * count; thus the IRQ window inhibit call _will_ be lazily updated on
+ * the next call, if it ever happens.
+ */
+ if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
+ guard(rwsem_read)(&kvm->arch.apicv_update_lock);
+ if (READ_ONCE(kvm->arch.apicv_inhibit_reasons) & ~BIT(APICV_INHIBIT_REASON_IRQWIN)) {
+ atomic_add(add, &kvm->arch.apicv_nr_irq_window_req);
+ return;
+ }
+ }
+
/*
* Strictly speaking, the lock is only needed if going 0->1 or 1->0,
* a la atomic_dec_and_mutex_lock. However, ExtINTs are rare and
--
2.50.1
* Re: [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling
2025-07-18 6:43 [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling Naveen N Rao (AMD)
` (2 preceding siblings ...)
2025-07-18 6:43 ` [RFC PATCH 3/3] KVM: SVM: Optimize IRQ window inhibit handling Naveen N Rao (AMD)
@ 2025-09-09 14:30 ` Naveen N Rao
2025-09-09 15:34 ` Sean Christopherson
3 siblings, 1 reply; 11+ messages in thread
From: Naveen N Rao @ 2025-09-09 14:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Fri, Jul 18, 2025 at 12:13:33PM +0530, Naveen N Rao (AMD) wrote:
> Sean, Paolo,
> I have attempted to take the changes discussed in the below thread and
> to convert them into a patch series:
> http://lkml.kernel.org/r/Z6JoInXNntIoHLQ8@google.com
>
> I have tried to describe the changes, but the nested aspects would
> definitely need a review to ensure correctness and that all aspects are
> covered there.
>
> None of these patches include patch tags since none were provided in the
> discussion. I have proposed patch trailers on the individual patches.
> Please take a look and let me know if that's fine.
>
> I tested this lightly with nested guests as well and it is working fine
> for me.
Sean, Paolo,
Any feedback on this - can you please take a look?
Thanks,
Naveen
* Re: [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling
2025-09-09 14:30 ` [RFC PATCH 0/3] KVM: SVM: Fix " Naveen N Rao
@ 2025-09-09 15:34 ` Sean Christopherson
2025-09-10 12:30 ` Naveen N Rao
0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2025-09-09 15:34 UTC (permalink / raw)
To: Naveen N Rao
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Tue, Sep 09, 2025, Naveen N Rao wrote:
> On Fri, Jul 18, 2025 at 12:13:33PM +0530, Naveen N Rao (AMD) wrote:
> > Sean, Paolo,
> > I have attempted to take the changes discussed in the below thread and
> > to convert them into a patch series:
> > http://lkml.kernel.org/r/Z6JoInXNntIoHLQ8@google.com
> >
> > I have tried to describe the changes, but the nested aspects would
> > definitely need a review to ensure correctness and that all aspects are
> > covered there.
> >
> > None of these patches include patch tags since none were provided in the
> > discussion. I have proposed patch trailers on the individual patches.
> > Please take a look and let me know if that's fine.
> >
> > I tested this lightly with nested guests as well and it is working fine
> > for me.
>
> Sean, Paolo,
> Any feedback on this - can you please take a look?
I'm getting there, slowly. Lots of time-sensitive things in flight right now,
both internally and upstream. Sorry.
* Re: [RFC PATCH 0/3] KVM: SVM: Fix IRQ window inhibit handling
2025-09-09 15:34 ` Sean Christopherson
@ 2025-09-10 12:30 ` Naveen N Rao
0 siblings, 0 replies; 11+ messages in thread
From: Naveen N Rao @ 2025-09-10 12:30 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Maxim Levitsky, Vasant Hegde,
Suravee Suthikulpanit
On Tue, Sep 09, 2025 at 08:34:22AM -0700, Sean Christopherson wrote:
> On Tue, Sep 09, 2025, Naveen N Rao wrote:
> > On Fri, Jul 18, 2025 at 12:13:33PM +0530, Naveen N Rao (AMD) wrote:
> > > Sean, Paolo,
> > > I have attempted to take the changes discussed in the below thread and
> > > to convert them into a patch series:
> > > http://lkml.kernel.org/r/Z6JoInXNntIoHLQ8@google.com
> > >
> > > I have tried to describe the changes, but the nested aspects would
> > > definitely need a review to ensure correctness and that all aspects are
> > > covered there.
> > >
> > > None of these patches include patch tags since none were provided in the
> > > discussion. I have proposed patch trailers on the individual patches.
> > > Please take a look and let me know if that's fine.
> > >
> > > I tested this lightly with nested guests as well and it is working fine
> > > for me.
> >
> > Sean, Paolo,
> > Any feedback on this - can you please take a look?
>
> I'm getting there, slowly. Lot's of time sensitive things in flight right now
> both internally and upstream. Sorry.
Sure, no worries. I primarily asked since this is a fix, and we have had
a few reports of performance degradation due to the APICv inhibit lock.
So, it would be good if this can be addressed sooner (compared to some
of the other AVIC features).
Thanks,
Naveen