* [PATCH 1/7] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-08 18:10 ` [PATCH 2/7] KVM: x86: Use gfn_to_pfn_cache for record_steal_time David Woodhouse
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
__rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
which unconditionally disables and re-enables interrupts. When
read_unlock() is called from hardirq context (e.g. after a successful
read_trylock() in a timer callback), the raw_spin_unlock_irq()
incorrectly re-enables interrupts within the hardirq handler.
This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
context) and can lead to IRQ state corruption.
On PREEMPT_RT, using read_trylock() from hardirq context is safe because
the reader does not record a lock owner. The subsequent read_unlock()
takes only the rtmutex wait_lock, which is a raw spinlock and therefore
hardirq safe. This change additionally allows rwlock_t to be used during
early boot, before interrupts are first enabled.
Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
preserve the caller's IRQ state.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/locking/rwbase_rt.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 82e078c0665a..25744862d627 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -153,8 +153,9 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
struct rt_mutex_base *rtm = &rwb->rtmutex;
struct task_struct *owner;
DEFINE_RT_WAKE_Q(wqh);
+ unsigned long flags;
- raw_spin_lock_irq(&rtm->wait_lock);
+ raw_spin_lock_irqsave(&rtm->wait_lock, flags);
/*
* Wake the writer, i.e. the rtmutex owner. It might release the
* rtmutex concurrently in the fast path (due to a signal), but to
@@ -167,7 +168,7 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
/* Pairs with the preempt_enable in rt_mutex_wake_up_q() */
preempt_disable();
- raw_spin_unlock_irq(&rtm->wait_lock);
+ raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
rt_mutex_wake_up_q(&wqh);
}
--
2.51.0
* [PATCH 2/7] KVM: x86: Use gfn_to_pfn_cache for record_steal_time
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
2026-05-08 18:10 ` [PATCH 1/7] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-08 18:10 ` [PATCH 3/7] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths David Woodhouse
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: Carsten Stollmaier <stollmc@amazon.com>
This largely reverts commit 7e2175ebd695 ("KVM: x86: Fix recording of
guest steal time / preempted status"), which dropped the use of the
gfn_to_pfn_cache because it was not integrated with the MMU notifiers
at the time. That shortcoming has long since been addressed, making
the GPC work correctly for this use case.
Aside from cleaning up the last open-coded assembler access to user
addresses and associated explicit asm exception fixups, moving back
to the now-functional GPC also resolves an issue with contention on
the mmap_lock with userfaultfd. The contention issue is as follows:
On vcpu_run, before entering the guest, updating the steal time
information causes a page fault if the page is not present. In our
scenario, the fault is handled by do_user_addr_fault() and then by
handle_userfault(), because the region is registered with userfaultfd.
Since handle_userfault() waits in TASK_INTERRUPTIBLE, it can be
interrupted by signals; do_user_addr_fault() then busy-retries the
fault as long as the pending signal is non-fatal, which leads to heavy
contention on the mmap_lock.
By restoring the use of GPC for accessing the guest steal time, the
contention is avoided and refreshing the GPC happens when the vCPU is
next scheduled.
Since the gfn_to_pfn_cache gives a kernel mapping rather than a
userspace HVA, accesses are now plain C instead of unsafe_put_user()
et al. Use READ_ONCE()/WRITE_ONCE() to prevent the compiler from
reordering or tearing the accesses, and add an smp_wmb() before the
final version increment to ensure the data writes are ordered before
the seqcount update — the old unsafe_put_user() inline assembly acted
as an implicit compiler barrier.
In kvm_steal_time_set_preempted(), use read_trylock() instead of
read_lock_irqsave() since this is called from the scheduler path
where rwlock_t is not safe on PREEMPT_RT (it becomes sleepable).
Since we only trylock and bail on failure, there is no risk of
deadlock with an interrupt handler, so no need to disable interrupts
at all. Setting the preempted flag is best-effort anyway.
Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
Co-developed-by: David Woodhouse <dwmw2@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.c | 126 ++++++++++++++++----------------
2 files changed, 66 insertions(+), 62 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c470e40a00aa..6f26c68db4b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -959,7 +959,7 @@ struct kvm_vcpu_arch {
u8 preempted;
u64 msr_val;
u64 last_steal;
- struct gfn_to_hva_cache cache;
+ struct gfn_to_pfn_cache cache;
} st;
u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a1b63c63d1a..ae71f28cc1c5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3747,10 +3747,8 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_service_local_tlb_flush_requests);
static void record_steal_time(struct kvm_vcpu *vcpu)
{
- struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
- struct kvm_steal_time __user *st;
- struct kvm_memslots *slots;
- gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+ struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+ struct kvm_steal_time *st;
u64 steal;
u32 version;
@@ -3765,42 +3763,26 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
if (WARN_ON_ONCE(current->mm != vcpu->kvm->mm))
return;
- slots = kvm_memslots(vcpu->kvm);
-
- if (unlikely(slots->generation != ghc->generation ||
- gpa != ghc->gpa ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
+ read_lock(&gpc->lock);
+ while (!kvm_gpc_check(gpc, sizeof(*st))) {
/* We rely on the fact that it fits in a single page. */
BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
- if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gpa, sizeof(*st)) ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot)
+ read_unlock(&gpc->lock);
+
+ if (kvm_gpc_refresh(gpc, sizeof(*st)))
return;
+
+ read_lock(&gpc->lock);
}
- st = (struct kvm_steal_time __user *)ghc->hva;
+ st = (struct kvm_steal_time *)gpc->khva;
/*
* Doing a TLB flush here, on the guest's behalf, can avoid
* expensive IPIs.
*/
if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
- u8 st_preempted = 0;
- int err = -EFAULT;
-
- if (!user_access_begin(st, sizeof(*st)))
- return;
-
- asm volatile("1: xchgb %0, %2\n"
- "xor %1, %1\n"
- "2:\n"
- _ASM_EXTABLE_UA(1b, 2b)
- : "+q" (st_preempted),
- "+&r" (err),
- "+m" (st->preempted));
- if (err)
- goto out;
-
- user_access_end();
+ u8 st_preempted = xchg(&st->preempted, 0);
vcpu->arch.st.preempted = 0;
@@ -3808,39 +3790,34 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
st_preempted & KVM_VCPU_FLUSH_TLB);
if (st_preempted & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
-
- if (!user_access_begin(st, sizeof(*st)))
- goto dirty;
} else {
- if (!user_access_begin(st, sizeof(*st)))
- return;
-
- unsafe_put_user(0, &st->preempted, out);
+ WRITE_ONCE(st->preempted, 0);
vcpu->arch.st.preempted = 0;
}
- unsafe_get_user(version, &st->version, out);
+ version = READ_ONCE(st->version);
if (version & 1)
version += 1; /* first time write, random junk */
version += 1;
- unsafe_put_user(version, &st->version, out);
+ WRITE_ONCE(st->version, version);
smp_wmb();
- unsafe_get_user(steal, &st->steal, out);
+ steal = READ_ONCE(st->steal);
steal += current->sched_info.run_delay -
vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
- unsafe_put_user(steal, &st->steal, out);
+ WRITE_ONCE(st->steal, steal);
+
+ smp_wmb();
version += 1;
- unsafe_put_user(version, &st->version, out);
+ WRITE_ONCE(st->version, version);
+
+ kvm_gpc_mark_dirty_in_slot(gpc);
- out:
- user_access_end();
- dirty:
- mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+ read_unlock(&gpc->lock);
}
/*
@@ -4175,8 +4152,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.st.msr_val = data;
- if (!(data & KVM_MSR_ENABLED))
- break;
+ if (data & KVM_MSR_ENABLED)
+ kvm_gpc_activate(&vcpu->arch.st.cache, data & ~KVM_MSR_ENABLED,
+ sizeof(struct kvm_steal_time));
+ else
+ kvm_gpc_deactivate(&vcpu->arch.st.cache);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
@@ -5239,11 +5219,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
- struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
- struct kvm_steal_time __user *st;
- struct kvm_memslots *slots;
+ struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+ struct kvm_steal_time *st;
static const u8 preempted = KVM_VCPU_PREEMPTED;
- gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
/*
* The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -5268,20 +5246,41 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
if (unlikely(current->mm != vcpu->kvm->mm))
return;
- slots = kvm_memslots(vcpu->kvm);
-
- if (unlikely(slots->generation != ghc->generation ||
- gpa != ghc->gpa ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot))
+ /*
+ * Use a trylock as this is called from the scheduler path (via
+ * kvm_sched_out), where rwlock_t is not safe on PREEMPT_RT (it
+ * becomes sleepable). Setting preempted is best-effort anyway;
+ * the old HVA-based code used copy_to_user_nofault() which could
+ * also silently fail.
+ *
+ * Since we only trylock and bail on failure, there is no risk of
+ * deadlock with an interrupt handler, so no need to disable
+ * interrupts.
+ */
+ if (!read_trylock(&gpc->lock))
return;
- st = (struct kvm_steal_time __user *)ghc->hva;
+ if (!kvm_gpc_check(gpc, sizeof(*st)))
+ goto out_unlock_gpc;
+
+ st = (struct kvm_steal_time *)gpc->khva;
BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
- if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
- vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+ WRITE_ONCE(st->preempted, preempted);
+ vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+
+ kvm_gpc_mark_dirty_in_slot(gpc);
+
+out_unlock_gpc:
+ read_unlock(&gpc->lock);
+}
- mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+static void kvm_steal_time_reset(struct kvm_vcpu *vcpu)
+{
+ kvm_gpc_deactivate(&vcpu->arch.st.cache);
+ vcpu->arch.st.preempted = 0;
+ vcpu->arch.st.msr_val = 0;
+ vcpu->arch.st.last_steal = 0;
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -12841,6 +12840,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);
+ kvm_gpc_init(&vcpu->arch.st.cache, vcpu->kvm);
+
if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
else
@@ -12948,6 +12949,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_clear_async_pf_completion_queue(vcpu);
kvm_mmu_unload(vcpu);
+ kvm_steal_time_reset(vcpu);
+
kvmclock_reset(vcpu);
for_each_possible_cpu(cpu)
@@ -13068,7 +13071,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_en_val = 0;
vcpu->arch.apf.msr_int_val = 0;
- vcpu->arch.st.msr_val = 0;
+
+ kvm_steal_time_reset(vcpu);
kvmclock_reset(vcpu);
--
2.51.0
* [PATCH 3/7] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
2026-05-08 18:10 ` [PATCH 1/7] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() David Woodhouse
2026-05-08 18:10 ` [PATCH 2/7] KVM: x86: Use gfn_to_pfn_cache for record_steal_time David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-08 18:10 ` [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c David Woodhouse
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
kvm_xen_set_evtchn_fast() is called from hardirq context (timer
callback, kvm_arch_set_irq_inatomic()). On PREEMPT_RT, rwlock_t is a
sleeping lock, so read_lock_irqsave() cannot be used in this context.
Switch to read_trylock() and return -EWOULDBLOCK on contention, which is
the designed fallback — there is always a slow path for the case where
the GPC is invalid and needs to be refreshed.
Reported-by: syzbot+208f7f3e5f59c11aeb90@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=208f7f3e5f59c11aeb90
Fixes: 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kvm/xen.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 91fd3673c09a..9bdb8e3cad58 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -697,6 +697,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
{
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
+ bool atomic = in_atomic() || !task_is_running(current);
unsigned long flags;
u8 rc = 0;
@@ -713,7 +714,15 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
BUILD_BUG_ON(sizeof(rc) !=
sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
- read_lock_irqsave(&gpc->lock, flags);
+ if (atomic) {
+ local_irq_save(flags);
+ if (!read_trylock(&gpc->lock)) {
+ local_irq_restore(flags);
+ return 1;
+ }
+ } else {
+ read_lock_irqsave(&gpc->lock, flags);
+ }
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
read_unlock_irqrestore(&gpc->lock, flags);
@@ -725,7 +734,7 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
* and we'll end up getting called again from a context where we *can*
* fault in the page and wait for it.
*/
- if (in_atomic() || !task_is_running(current))
+ if (atomic)
return 1;
if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info))) {
@@ -1794,7 +1803,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
struct kvm_vcpu *vcpu;
unsigned long *pending_bits, *mask_bits;
- unsigned long flags;
int port_word_bit;
bool kick_vcpu = false;
int vcpu_idx, idx, rc;
@@ -1816,9 +1824,10 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
idx = srcu_read_lock(&kvm->srcu);
- read_lock_irqsave(&gpc->lock, flags);
- if (!kvm_gpc_check(gpc, PAGE_SIZE))
+ if (!read_trylock(&gpc->lock))
goto out_rcu;
+ if (!kvm_gpc_check(gpc, PAGE_SIZE))
+ goto out_unlock;
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
struct shared_info *shinfo = gpc->khva;
@@ -1847,11 +1856,10 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
} else {
rc = 1; /* Delivered to the bitmap in shared_info. */
/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
gpc = &vcpu->arch.xen.vcpu_info_cache;
- read_lock_irqsave(&gpc->lock, flags);
- if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+ if (!read_trylock(&gpc->lock)) {
/*
* Could not access the vcpu_info. Set the bit in-kernel
* and prod the vCPU to deliver it for itself.
@@ -1860,6 +1868,11 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
kick_vcpu = true;
goto out_rcu;
}
+ if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+ if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+ kick_vcpu = true;
+ goto out_unlock;
+ }
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
struct vcpu_info *vcpu_info = gpc->khva;
@@ -1883,8 +1896,9 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
}
}
+ out_unlock:
+ read_unlock(&gpc->lock);
out_rcu:
- read_unlock_irqrestore(&gpc->lock, flags);
srcu_read_unlock(&kvm->srcu, idx);
if (kick_vcpu) {
--
2.51.0
* [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
2026-05-08 18:10 ` [PATCH 3/7] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-11 16:51 ` Sean Christopherson
2026-05-08 18:10 ` [PATCH 5/7] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock() David Woodhouse
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
users in xen.c are only called from process context (vcpu_run, ioctls).
There is no need to disable interrupts to prevent concurrent access from
a hardirq user, since the hardirq path no longer takes the lock.
Convert read_lock_irqsave()/read_unlock_irqrestore() to plain
read_lock()/read_unlock() in:
- kvm_xen_update_runstate_guest()
- kvm_xen_shared_info_init()
- xen_get_guest_pvclock()
- kvm_xen_inject_pending_events()
- __kvm_xen_has_interrupt()
- wait_pending_event()
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kvm/xen.c | 60 +++++++++++++++++++---------------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 9bdb8e3cad58..b1fae42bf295 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -45,15 +45,15 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
int ret = 0;
int idx = srcu_read_lock(&kvm->srcu);
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
if (ret)
goto out;
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
}
/*
@@ -96,7 +96,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
smp_wmb();
wc->version = wc_version + 1;
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
@@ -155,22 +155,21 @@ static int xen_get_guest_pvclock(struct kvm_vcpu *vcpu,
struct gfn_to_pfn_cache *gpc,
unsigned int offset)
{
- unsigned long flags;
int r;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, offset + sizeof(*hv_clock))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
r = kvm_gpc_refresh(gpc, offset + sizeof(*hv_clock));
if (r)
return r;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
memcpy(hv_clock, gpc->khva + offset, sizeof(*hv_clock));
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/*
* Sanity check TSC shift+multiplier to verify the guest's view of time
@@ -325,7 +324,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
size_t user_len, user_len1, user_len2;
struct vcpu_runstate_info rs;
- unsigned long flags;
size_t times_ofs;
uint8_t *update_bit = NULL;
uint64_t entry_time;
@@ -421,16 +419,14 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
* gfn_to_pfn caches that cover the region.
*/
if (atomic) {
- local_irq_save(flags);
if (!read_trylock(&gpc1->lock)) {
- local_irq_restore(flags);
return;
}
} else {
- read_lock_irqsave(&gpc1->lock, flags);
+ read_lock(&gpc1->lock);
}
while (!kvm_gpc_check(gpc1, user_len1)) {
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
/* When invoked from kvm_sched_out() we cannot sleep */
if (atomic)
@@ -439,7 +435,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
if (kvm_gpc_refresh(gpc1, user_len1))
return;
- read_lock_irqsave(&gpc1->lock, flags);
+ read_lock(&gpc1->lock);
}
if (likely(!user_len2)) {
@@ -467,7 +463,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
if (atomic) {
if (!read_trylock(&gpc2->lock)) {
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
return;
}
} else {
@@ -476,7 +472,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
if (!kvm_gpc_check(gpc2, user_len2)) {
read_unlock(&gpc2->lock);
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
/* When invoked from kvm_sched_out() we cannot sleep */
if (atomic)
@@ -581,7 +577,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
}
kvm_gpc_mark_dirty_in_slot(gpc1);
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
}
void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
@@ -640,7 +636,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
{
unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
- unsigned long flags;
if (!evtchn_pending_sel)
return;
@@ -650,14 +645,14 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
* does anyway. Page it in and retry the instruction. We're just a
* little more honest about it.
*/
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
return;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
/* Now gpc->khva is a valid kernel address for the vcpu_info */
@@ -687,7 +682,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
}
kvm_gpc_mark_dirty_in_slot(gpc);
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/* For the per-vCPU lapic vector, deliver it as MSI. */
if (v->arch.xen.upcall_vector)
@@ -698,7 +693,6 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
{
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
bool atomic = in_atomic() || !task_is_running(current);
- unsigned long flags;
u8 rc = 0;
/*
@@ -715,16 +709,13 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
if (atomic) {
- local_irq_save(flags);
- if (!read_trylock(&gpc->lock)) {
- local_irq_restore(flags);
+ if (!read_trylock(&gpc->lock))
return 1;
- }
} else {
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/*
* This function gets called from kvm_vcpu_block() after setting the
@@ -744,11 +735,11 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
*/
return 0;
}
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
return rc;
}
@@ -1445,12 +1436,11 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
struct kvm *kvm = vcpu->kvm;
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
unsigned long *pending_bits;
- unsigned long flags;
bool ret = true;
int idx, i;
idx = srcu_read_lock(&kvm->srcu);
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
if (!kvm_gpc_check(gpc, PAGE_SIZE))
goto out_rcu;
@@ -1471,7 +1461,7 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
}
out_rcu:
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
srcu_read_unlock(&kvm->srcu, idx);
return ret;
--
2.51.0
* Re: [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-08 18:10 ` [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c David Woodhouse
@ 2026-05-11 16:51 ` Sean Christopherson
2026-05-11 16:54 ` David Woodhouse
From: Sean Christopherson @ 2026-05-11 16:51 UTC (permalink / raw)
To: David Woodhouse
Cc: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
On Fri, May 08, 2026, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
> read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
> users in xen.c are only called from process context (vcpu_run, ioctls).
> There is no need to disable interrupts to prevent concurrent access from
> a hardirq user, since the hardirq path no longer takes the lock.
No longer _waits_ on the lock, correct? I.e. the hardirq path can still take the
lock, but only ever does so using trylock.
* Re: [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-11 16:51 ` Sean Christopherson
@ 2026-05-11 16:54 ` David Woodhouse
2026-05-11 17:48 ` Sean Christopherson
From: David Woodhouse @ 2026-05-11 16:54 UTC (permalink / raw)
To: Sean Christopherson
Cc: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
On Mon, 2026-05-11 at 09:51 -0700, Sean Christopherson wrote:
> On Fri, May 08, 2026, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
> > read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
> > users in xen.c are only called from process context (vcpu_run, ioctls).
> > There is no need to disable interrupts to prevent concurrent access from
> > a hardirq user, since the hardirq path no longer takes the lock.
>
> No longer _waits_ on the lock, correct? I.e. the hardirq path can still take the
> lock, but only ever does so using trylock.
Indeed so. Want me to fix that?
* Re: [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-11 16:54 ` David Woodhouse
@ 2026-05-11 17:48 ` Sean Christopherson
2026-05-11 18:12 ` David Woodhouse
From: Sean Christopherson @ 2026-05-11 17:48 UTC (permalink / raw)
To: David Woodhouse
Cc: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
On Mon, May 11, 2026, David Woodhouse wrote:
> On Mon, 2026-05-11 at 09:51 -0700, Sean Christopherson wrote:
> > On Fri, May 08, 2026, David Woodhouse wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
> > > read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
> > > users in xen.c are only called from process context (vcpu_run, ioctls).
> > > There is no need to disable interrupts to prevent concurrent access from
> > > a hardirq user, since the hardirq path no longer takes the lock.
> >
> > No longer _waits_ on the lock, correct? I.e. the hardirq path can still take the
> > lock, but only ever does so using trylock.
>
> Indeed so. Want me to fix that?
Nah, at least not yet. I'm in the process of wrapping my head around all of the
in-flight gpc series (namely yours and Fred's), just wanted to make sure I wasn't
missing something.
Thanks!
* Re: [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-11 17:48 ` Sean Christopherson
@ 2026-05-11 18:12 ` David Woodhouse
From: David Woodhouse @ 2026-05-11 18:12 UTC (permalink / raw)
To: Sean Christopherson, Griffoul, Fred
Cc: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
On Mon, 2026-05-11 at 10:48 -0700, Sean Christopherson wrote:
> On Mon, May 11, 2026, David Woodhouse wrote:
> > On Mon, 2026-05-11 at 09:51 -0700, Sean Christopherson wrote:
> > > On Fri, May 08, 2026, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
> > > > read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
> > > > users in xen.c are only called from process context (vcpu_run, ioctls).
> > > > There is no need to disable interrupts to prevent concurrent access from
> > > > a hardirq user, since the hardirq path no longer takes the lock.
> > >
> > > No longer _waits_ on the lock, correct? I.e. the hardirq path can still take the
> > > lock, but only ever does so using trylock.
> >
> > Indeed so. Want me to fix that?
>
> Nah, at least not yet. I'm in the process of wrapping my head around all of the
> in-flight gpc series (namely yours and Fred's), just wanted to make sure I wasn't
> missing something.
I think Fred has a few performance fixes and is due to send another
version once he can find the cycles to do so.
My assumption (without actually *talking* to Fred about it) was that he
would be able to rebase fairly trivially on top of this series.
* [PATCH 5/7] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock()
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
` (3 preceding siblings ...)
2026-05-08 18:10 ` [PATCH 4/7] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-08 18:10 ` [PATCH 6/7] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c David Woodhouse
2026-05-08 18:10 ` [PATCH 7/7] KVM: x86/xen: Handle pending Xen timer events in vcpu_enter_guest() David Woodhouse
6 siblings, 0 replies; 13+ messages in thread
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
kvm_setup_guest_pvclock() is only called from kvm_guest_time_update(),
which runs in process context (vcpu_enter_guest or ioctl). There is no
hardirq path that takes the GPC read lock for pvclock, so irqsave is
unnecessary.
Convert to plain read_lock()/read_unlock().
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kvm/x86.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ae71f28cc1c5..e62f4a9ad334 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3270,18 +3270,17 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
{
struct pvclock_vcpu_time_info *guest_hv_clock;
struct pvclock_vcpu_time_info hv_clock;
- unsigned long flags;
memcpy(&hv_clock, ref_hv_clock, sizeof(hv_clock));
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, offset + sizeof(*guest_hv_clock))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
if (kvm_gpc_refresh(gpc, offset + sizeof(*guest_hv_clock)))
return;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
guest_hv_clock = (void *)(gpc->khva + offset);
@@ -3306,7 +3305,7 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
guest_hv_clock->version = ++hv_clock.version;
kvm_gpc_mark_dirty_in_slot(gpc);
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
trace_kvm_pvclock_update(vcpu->vcpu_id, &hv_clock);
}
--
2.51.0
* [PATCH 6/7] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
` (4 preceding siblings ...)
2026-05-08 18:10 ` [PATCH 5/7] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock() David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-08 18:10 ` [PATCH 7/7] KVM: x86/xen: Handle pending Xen timer events in vcpu_enter_guest() David Woodhouse
6 siblings, 0 replies; 13+ messages in thread
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
Now that all hardirq/atomic GPC users (xen_timer_callback,
kvm_xen_set_evtchn_fast) use read_trylock() instead of read_lock(), no
hardirq path ever waits on the GPC rwlock (trylock may still take it, but
never blocks). There is therefore no risk of deadlock between the write
side and a hardirq reader, and no need to disable interrupts when taking
the lock.
Convert all read_lock_irq()/write_lock_irq() and their unlock
counterparts to plain read_lock()/write_lock() in pfncache.c.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
virt/kvm/pfncache.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 728d2c1b488a..70b102095173 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -29,12 +29,12 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
spin_lock(&kvm->gpc_lock);
list_for_each_entry(gpc, &kvm->gpc_list, list) {
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
/* Only a single page so no need to care about length */
if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
gpc->uhva >= start && gpc->uhva < end) {
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
/*
* There is a small window here where the cache could
@@ -44,15 +44,15 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
* acquired.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
gpc->uhva >= start && gpc->uhva < end)
gpc->valid = false;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
continue;
}
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
}
spin_unlock(&kvm->gpc_lock);
}
@@ -184,7 +184,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
mmu_seq = gpc->kvm->mmu_invalidate_seq;
smp_rmb();
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
/*
* If the previous iteration "failed" due to an mmu_notifier
@@ -225,7 +225,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
goto out_error;
}
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
/*
* Other tasks must wait for _this_ refresh to complete before
@@ -248,7 +248,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
return 0;
out_error:
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
return -EFAULT;
}
@@ -269,7 +269,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
lockdep_assert_held(&gpc->refresh_lock);
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
if (!gpc->active) {
ret = -EINVAL;
@@ -355,7 +355,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
unmap_old = (old_pfn != gpc->pfn);
out_unlock:
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
if (unmap_old)
gpc_unmap(old_pfn, old_khva);
@@ -417,9 +417,9 @@ static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned
* refresh must not establish a mapping until the cache is
* reachable by mmu_notifier events.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
gpc->active = true;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
}
return __kvm_gpc_refresh(gpc, gpa, uhva);
}
@@ -458,7 +458,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
* must stall mmu_notifier events until all users go away, i.e.
* until gpc->lock is dropped and refresh is guaranteed to fail.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
gpc->active = false;
gpc->valid = false;
@@ -473,7 +473,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
old_pfn = gpc->pfn;
gpc->pfn = KVM_PFN_ERR_FAULT;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
spin_lock(&kvm->gpc_lock);
list_del(&gpc->list);
--
2.51.0
* [PATCH 7/7] KVM: x86/xen: Handle pending Xen timer events in vcpu_enter_guest()
2026-05-08 18:10 [PATCH 0/7] KVM: x86/xen: Fix Xen / GPC / PREEMPT_RT issues with rwlock_t David Woodhouse
` (5 preceding siblings ...)
2026-05-08 18:10 ` [PATCH 6/7] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c David Woodhouse
@ 2026-05-08 18:10 ` David Woodhouse
2026-05-09 7:27 ` David Woodhouse
6 siblings, 1 reply; 13+ messages in thread
From: David Woodhouse @ 2026-05-08 18:10 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
From: David Woodhouse <dwmw@amazon.co.uk>
If xen_timer_callback() can't deliver an event directly to the guest
(e.g. due to memslot changes causing the GPC to need refreshing), it
sets the timer_pending flag and kicks the vCPU.
However, the pending timer was only injected from the outer vcpu_run()
loop via kvm_inject_pending_timer_irqs(), not from the inner loop in
vcpu_enter_guest(). This means that the timer could be delayed until
something else causes vcpu_enter_guest() to return to the outer loop.
Thus, timer delivery could be delayed by a whole scheduler tick, or
hypothetically forever in a NOHZ_FULL environment.
Subsume Xen timer handling into kvm_xen_has_pending_events() and
kvm_xen_inject_pending_events(), and use those directly from the inner
vcpu_enter_guest() loop. This ensures deferred timer delivery happens
on the next VM-entry rather than waiting for the scheduler.
Remove the Xen timer handling from kvm_inject_pending_timer_irqs() and
from kvm_cpu_has_pending_timer(), since kvm_vcpu_has_events() already
covers the wakeup case via kvm_xen_has_pending_events().
Pull the actual event injection into kvm_xen_inject_pending_events()
and remove kvm_xen_inject_timer_irqs(), avoiding a double check of
arch.xen.timer_pending in caller and callee. The remaining caller
(kvm_xen_vcpu_get_attr()) can just call kvm_xen_inject_pending_events()
instead, to ensure pending timers are flushed when the timer state is
read from userspace.
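The consolidated shape can be sketched with C11 atomics: one predicate covers both the event-channel bits and the deferred timer, and one injection path flushes both. The struct and the timer_events_delivered counter below are illustrative stand-ins for the vCPU's Xen state and for the effect of kvm_xen_set_evtchn(), not the kernel definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch of the consolidated pending-event handling. Field names mirror
 * the patch, but the struct is an illustrative stand-in for the Xen state
 * embedded in struct kvm_vcpu.
 */
struct xen_vcpu_state {
	unsigned long evtchn_pending_sel;
	atomic_int timer_pending;
	int timer_events_delivered;	/* stand-in for kvm_xen_set_evtchn() */
};

static bool has_pending_events(struct xen_vcpu_state *x)
{
	/* One predicate for both sources, as in kvm_xen_has_pending_events() */
	return x->evtchn_pending_sel || atomic_load(&x->timer_pending);
}

static void inject_pending_events(struct xen_vcpu_state *x)
{
	/* Deliver the deferred timer VIRQ first, then clear the flag. */
	if (atomic_load(&x->timer_pending)) {
		x->timer_events_delivered++;
		atomic_store(&x->timer_pending, 0);
	}
	/* ... then handle the evtchn_pending_sel bits as before ... */
	x->evtchn_pending_sel = 0;
}
```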
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
arch/x86/kvm/irq.c | 4 ----
arch/x86/kvm/x86.c | 3 +++
arch/x86/kvm/xen.c | 35 +++++++++++++++++------------------
arch/x86/kvm/xen.h | 21 ++-------------------
4 files changed, 22 insertions(+), 41 deletions(-)
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 9519fec09ee6..7527c9bfe244 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -30,8 +30,6 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
if (lapic_in_kernel(vcpu))
r = apic_has_pending_timer(vcpu);
- if (kvm_xen_timer_enabled(vcpu))
- r += kvm_xen_has_pending_timer(vcpu);
return r;
}
@@ -170,8 +168,6 @@ void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
{
if (lapic_in_kernel(vcpu))
kvm_inject_apic_timer_irqs(vcpu);
- if (kvm_xen_timer_enabled(vcpu))
- kvm_xen_inject_timer_irqs(vcpu);
}
void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e62f4a9ad334..c8e58a18a3e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11254,6 +11254,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
}
if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
record_steal_time(vcpu);
+ if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu) &&
+ kvm_xen_has_pending_events(vcpu))
+ kvm_xen_inject_pending_events(vcpu);
if (kvm_check_request(KVM_REQ_PMU, vcpu))
kvm_pmu_handle_event(vcpu);
if (kvm_check_request(KVM_REQ_PMI, vcpu))
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b1fae42bf295..16b8c154243c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -105,22 +105,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
return ret;
}
-void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
-{
- if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
- struct kvm_xen_evtchn e;
-
- e.vcpu_id = vcpu->vcpu_id;
- e.vcpu_idx = vcpu->vcpu_idx;
- e.port = vcpu->arch.xen.timer_virq;
- e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
-
- kvm_xen_set_evtchn(&e, vcpu->kvm);
-
- vcpu->arch.xen.timer_expires = 0;
- atomic_set(&vcpu->arch.xen.timer_pending, 0);
- }
-}
static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
{
@@ -634,9 +618,24 @@ void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
*/
void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
{
- unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
+ unsigned long evtchn_pending_sel;
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
+ if (kvm_xen_timer_enabled(v) && atomic_read(&v->arch.xen.timer_pending)) {
+ struct kvm_xen_evtchn e;
+
+ e.vcpu_id = v->vcpu_id;
+ e.vcpu_idx = v->vcpu_idx;
+ e.port = v->arch.xen.timer_virq;
+ e.priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL;
+
+ kvm_xen_set_evtchn(&e, v->kvm);
+
+ v->arch.xen.timer_expires = 0;
+ atomic_set(&v->arch.xen.timer_pending, 0);
+ }
+
+ evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
if (!evtchn_pending_sel)
return;
@@ -1238,7 +1237,7 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
*/
if (vcpu->arch.xen.timer_expires) {
hrtimer_cancel(&vcpu->arch.xen.timer);
- kvm_xen_inject_timer_irqs(vcpu);
+ kvm_xen_inject_pending_events(vcpu);
}
data->u.timer.port = vcpu->arch.xen.timer_virq;
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 59e6128a7bd3..029026853af5 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -92,7 +92,8 @@ static inline int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
static inline bool kvm_xen_has_pending_events(struct kvm_vcpu *vcpu)
{
return static_branch_unlikely(&kvm_xen_enabled.key) &&
- vcpu->arch.xen.evtchn_pending_sel;
+ (vcpu->arch.xen.evtchn_pending_sel ||
+ atomic_read(&vcpu->arch.xen.timer_pending));
}
static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu)
@@ -100,15 +101,6 @@ static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu)
return !!vcpu->arch.xen.timer_virq;
}
-static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu)
-{
- if (kvm_xen_hypercall_enabled(vcpu->kvm) && kvm_xen_timer_enabled(vcpu))
- return atomic_read(&vcpu->arch.xen.timer_pending);
-
- return 0;
-}
-
-void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu);
#else
static inline int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
{
@@ -164,15 +156,6 @@ static inline bool kvm_xen_has_pending_events(struct kvm_vcpu *vcpu)
return false;
}
-static inline int kvm_xen_has_pending_timer(struct kvm_vcpu *vcpu)
-{
- return 0;
-}
-
-static inline void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
-{
-}
-
static inline bool kvm_xen_timer_enabled(struct kvm_vcpu *vcpu)
{
return false;
--
2.51.0
* Re: [PATCH 7/7] KVM: x86/xen: Handle pending Xen timer events in vcpu_enter_guest()
2026-05-08 18:10 ` [PATCH 7/7] KVM: x86/xen: Handle pending Xen timer events in vcpu_enter_guest() David Woodhouse
@ 2026-05-09 7:27 ` David Woodhouse
0 siblings, 0 replies; 13+ messages in thread
From: David Woodhouse @ 2026-05-09 7:27 UTC (permalink / raw)
To: Dave Hansen, x86, H. Peter Anvin, Paul Durrant, Peter Zijlstra,
Will Deacon, Boqun Feng, Waiman Long, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, kvm, linux-kernel, linux-rt-devel,
Mauricio Faria de Oliveira, kernel-dev,
syzbot+208f7f3e5f59c11aeb90
On Fri, 2026-05-08 at 19:10 +0100, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> If xen_timer_callback() can't deliver an event directly to the guest
> (e.g. due to memslot changes causing the GPC to need refreshing), it
> sets the timer_pending flag and kicks the vCPU.
>
> However, the pending timer was only injected from the outer vcpu_run()
> loop via kvm_inject_pending_timer_irqs(), not from the inner loop in
> vcpu_enter_guest(). This means that the timer could be delayed until
> something else causes vcpu_enter_guest() to return to the outer loop.
This one is nonsense; I just got utterly confused when chasing the fact
that hrtimers weren't being delivered at all due to commit 15dd3a94885.
It's actually fine: the deferred timer will set KVM_REQ_UNBLOCK and
kvm_cpu_exit_request() will return true, causing vcpu_enter_guest() to
flip back to OUTSIDE_GUEST_MODE and bail early, as $DEITY intended.
After a two-day debugging session on the first thing, I came back to
the problem I *thought* I'd seen, and contrived a test which
demonstrated it... by introducing a race that didn't actually exist (my
test didn't set KVM_REQ_UNBLOCK when actually setting the timer flag in
vcpu_enter_guest() to show the race).
So we can drop this patch [7/7], but I might still do the cleanup and
consolidation of the events/timer part in a separate patch later.