* [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t
@ 2026-05-29 16:50 Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() Sean Christopherson
` (19 more replies)
0 siblings, 20 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
This series fixes sleeping-in-hardirq bugs in KVM's Xen emulation on
PREEMPT_RT, cleans up the now-unnecessary IRQ disabling in GPC lock usage
throughout KVM, and then adds CLASS()-based APIs for utilizing GPCs mappings
to dedup code and (hopefully) make it easier to use GPCs in other places.
The core issue is that kvm_xen_set_evtchn_fast() and the Xen timer
callback are called from hardirq/atomic context, but on PREEMPT_RT the
GPC rwlock_t is a sleeping lock.
Assuming I can get an Ack on patch 1, I'm planning on grabbing at least these
KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h
KVM: x86/xen: Don't dirty track "vCPU info" page
KVM: x86/xen: Explicitly tag "shared info" page as never being dirty tracked
KVM: x86/xen: Extract delivery of event to vCPU into a separate helper
KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical sections
KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c
KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock()
KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
for 7.2. If people like the CLASS() stuff, I'll also probably grab these:
KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases
KVM: x86/xen: Convert event injection to gpc's CLASS() APIs
KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast()
KVM: x86/xen: Convert xen_get_guest_pvclock() to gpc's CLASS() APIs
KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs
KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll
KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs
KVM: Add CLASS() constructs to automagically handle lock+check of gpc
I do NOT plan on grabbing the record_steal_time change for 7.2 no matter
what, even though I do like the end result, as I still have concerns over the
lack of range-based invalidation for GPCs. I 100% agree that such problems
are really only due to flawed VMMs and/or setups, but unfortunately history
has shown that there are a suprising number of deployments running what I
would consider flawed setups, e.g. run with NUMA autobalancing and KSM.
I realize I'm being somewhat paranoid, as KVM already uses a GPC for PV
clocks. But for modern setups, KVM_REQ_CLOCK_UPDATE is a rare event, whereas
KVM will update steal time (when enabled) on every vCPU load. So I want a
high level of confidence that KVM won't regress "imperfect" setups before
switching to a GPC for steal time (though again, I definitely like the end
result and want to do so).
[*] https://lore.kernel.org/all/20240821202814.711673-2-dwmw2@infradead.org
v2:
- Add the CLASS() APIs.
- Move the steal time change to the very end.
- "Fix" a dirty logging inconsistency with the Xen vCPU info page.
v1: https://lore.kernel.org/all/20260508181717.3230988-1-dwmw2@infradead.org
Carsten Stollmaier (1):
KVM: x86: Use gfn_to_pfn_cache for record_steal_time
David Woodhouse (5):
locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock()
KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c
Sean Christopherson (14):
KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical
sections
KVM: x86/xen: Extract delivery of event to vCPU into a separate helper
KVM: x86/xen: Explicitly tag "shared info" page as never being dirty
tracked
KVM: x86/xen: Don't dirty track "vCPU info" page
KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h
KVM: Add CLASS() constructs to automagically handle lock+check of gpc
KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs
KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll
KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs
KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
KVM: x86/xen: Convert xen_get_guest_pvclock() to gpc's CLASS() APIs
KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast()
KVM: x86/xen: Convert event injection to gpc's CLASS() APIs
KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.c | 140 +++++++---------
arch/x86/kvm/xen.c | 288 +++++++++++++-------------------
include/linux/kvm_host.h | 84 +++++++---
include/linux/kvm_types.h | 17 ++
kernel/locking/rwbase_rt.c | 5 +-
virt/kvm/pfncache.c | 68 ++++++--
7 files changed, 304 insertions(+), 300 deletions(-)
base-commit: d1568b1332b6b3b36b222c2868fc102727c12a34
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
@ 2026-05-29 16:50 ` Sean Christopherson
2026-05-29 19:32 ` Peter Zijlstra
2026-05-29 16:50 ` [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths Sean Christopherson
` (18 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: David Woodhouse <dwmw@amazon.co.uk>
__rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
which unconditionally disables and re-enables interrupts. When
read_unlock() is called from hardirq context (e.g. after a successful
read_trylock() in a timer callback), the raw_spin_unlock_irq()
incorrectly re-enables interrupts within the hardirq handler.
This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
context) and can lead to IRQ state corruption.
Using read_trylock() in hardirq context on PREEMPT_RT is safe because
it does not record the lock owner. The read_unlock() acquires the
wait_lock which is hardirq safe. This change additionally allows
rwlock_t during early boot.
Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
preserve the caller's IRQ state.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
kernel/locking/rwbase_rt.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 82e078c0665a..25744862d627 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -153,8 +153,9 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
struct rt_mutex_base *rtm = &rwb->rtmutex;
struct task_struct *owner;
DEFINE_RT_WAKE_Q(wqh);
+ unsigned long flags;
- raw_spin_lock_irq(&rtm->wait_lock);
+ raw_spin_lock_irqsave(&rtm->wait_lock, flags);
/*
* Wake the writer, i.e. the rtmutex owner. It might release the
* rtmutex concurrently in the fast path (due to a signal), but to
@@ -167,7 +168,7 @@ static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
/* Pairs with the preempt_enable in rt_mutex_wake_up_q() */
preempt_disable();
- raw_spin_unlock_irq(&rtm->wait_lock);
+ raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
rt_mutex_wake_up_q(&wqh);
}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() Sean Christopherson
@ 2026-05-29 16:50 ` Sean Christopherson
2026-05-29 17:20 ` sashiko-bot
2026-05-29 23:28 ` Hillf Danton
2026-05-29 16:50 ` [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c Sean Christopherson
` (17 subsequent siblings)
19 siblings, 2 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: David Woodhouse <dwmw@amazon.co.uk>
kvm_xen_set_evtchn_fast() is called from hardirq context (timer
callback, kvm_arch_set_irq_inatomic()). On PREEMPT_RT, rwlock_t is a
sleeping lock, so read_lock_irqsave() cannot be used in this context.
Switch to read_trylock() and return -EWOULDBLOCK on contention, which is
the designed fallback — there is always a slow path for the case where
the GPC is invalid and needs to be refreshed.
Reported-by: syzbot+208f7f3e5f59c11aeb90@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=208f7f3e5f59c11aeb90
Fixes: 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 91fd3673c09a..9bdb8e3cad58 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -697,6 +697,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
{
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
+ bool atomic = in_atomic() || !task_is_running(current);
unsigned long flags;
u8 rc = 0;
@@ -713,7 +714,15 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
BUILD_BUG_ON(sizeof(rc) !=
sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
- read_lock_irqsave(&gpc->lock, flags);
+ if (atomic) {
+ local_irq_save(flags);
+ if (!read_trylock(&gpc->lock)) {
+ local_irq_restore(flags);
+ return 1;
+ }
+ } else {
+ read_lock_irqsave(&gpc->lock, flags);
+ }
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
read_unlock_irqrestore(&gpc->lock, flags);
@@ -725,7 +734,7 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
* and we'll end up getting called again from a context where we *can*
* fault in the page and wait for it.
*/
- if (in_atomic() || !task_is_running(current))
+ if (atomic)
return 1;
if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info))) {
@@ -1794,7 +1803,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
struct kvm_vcpu *vcpu;
unsigned long *pending_bits, *mask_bits;
- unsigned long flags;
int port_word_bit;
bool kick_vcpu = false;
int vcpu_idx, idx, rc;
@@ -1816,9 +1824,10 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
idx = srcu_read_lock(&kvm->srcu);
- read_lock_irqsave(&gpc->lock, flags);
- if (!kvm_gpc_check(gpc, PAGE_SIZE))
+ if (!read_trylock(&gpc->lock))
goto out_rcu;
+ if (!kvm_gpc_check(gpc, PAGE_SIZE))
+ goto out_unlock;
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
struct shared_info *shinfo = gpc->khva;
@@ -1847,11 +1856,10 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
} else {
rc = 1; /* Delivered to the bitmap in shared_info. */
/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
gpc = &vcpu->arch.xen.vcpu_info_cache;
- read_lock_irqsave(&gpc->lock, flags);
- if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+ if (!read_trylock(&gpc->lock)) {
/*
* Could not access the vcpu_info. Set the bit in-kernel
* and prod the vCPU to deliver it for itself.
@@ -1860,6 +1868,11 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
kick_vcpu = true;
goto out_rcu;
}
+ if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+ if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+ kick_vcpu = true;
+ goto out_unlock;
+ }
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
struct vcpu_info *vcpu_info = gpc->khva;
@@ -1883,8 +1896,9 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
}
}
+ out_unlock:
+ read_unlock(&gpc->lock);
out_rcu:
- read_unlock_irqrestore(&gpc->lock, flags);
srcu_read_unlock(&kvm->srcu, idx);
if (kick_vcpu) {
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths Sean Christopherson
@ 2026-05-29 16:50 ` Sean Christopherson
2026-05-29 17:36 ` sashiko-bot
2026-05-29 16:50 ` [PATCH v2 04/20] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock() Sean Christopherson
` (16 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: David Woodhouse <dwmw@amazon.co.uk>
Now that the hardirq path (xen_timer_callback and set_evtchn_fast) uses
read_trylock() instead of read_lock_irqsave(), the remaining GPC lock
users in xen.c are only called from process context (vcpu_run, ioctls).
There is no need to disable interrupts to prevent concurrent access from
a hardirq user, since the hardirq path no longer takes the lock.
Convert read_lock_irqsave()/read_unlock_irqrestore() to plain
read_lock()/read_unlock() in:
- kvm_xen_update_runstate_guest()
- kvm_xen_shared_info_init()
- xen_get_guest_pvclock()
- kvm_xen_inject_pending_events()
- __kvm_xen_has_interrupt()
- wait_pending_event()
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 60 +++++++++++++++++++---------------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 9bdb8e3cad58..b1fae42bf295 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -45,15 +45,15 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
int ret = 0;
int idx = srcu_read_lock(&kvm->srcu);
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
if (ret)
goto out;
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
}
/*
@@ -96,7 +96,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
smp_wmb();
wc->version = wc_version + 1;
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
@@ -155,22 +155,21 @@ static int xen_get_guest_pvclock(struct kvm_vcpu *vcpu,
struct gfn_to_pfn_cache *gpc,
unsigned int offset)
{
- unsigned long flags;
int r;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, offset + sizeof(*hv_clock))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
r = kvm_gpc_refresh(gpc, offset + sizeof(*hv_clock));
if (r)
return r;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
memcpy(hv_clock, gpc->khva + offset, sizeof(*hv_clock));
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/*
* Sanity check TSC shift+multiplier to verify the guest's view of time
@@ -325,7 +324,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
size_t user_len, user_len1, user_len2;
struct vcpu_runstate_info rs;
- unsigned long flags;
size_t times_ofs;
uint8_t *update_bit = NULL;
uint64_t entry_time;
@@ -421,16 +419,14 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
* gfn_to_pfn caches that cover the region.
*/
if (atomic) {
- local_irq_save(flags);
if (!read_trylock(&gpc1->lock)) {
- local_irq_restore(flags);
return;
}
} else {
- read_lock_irqsave(&gpc1->lock, flags);
+ read_lock(&gpc1->lock);
}
while (!kvm_gpc_check(gpc1, user_len1)) {
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
/* When invoked from kvm_sched_out() we cannot sleep */
if (atomic)
@@ -439,7 +435,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
if (kvm_gpc_refresh(gpc1, user_len1))
return;
- read_lock_irqsave(&gpc1->lock, flags);
+ read_lock(&gpc1->lock);
}
if (likely(!user_len2)) {
@@ -467,7 +463,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
if (atomic) {
if (!read_trylock(&gpc2->lock)) {
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
return;
}
} else {
@@ -476,7 +472,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
if (!kvm_gpc_check(gpc2, user_len2)) {
read_unlock(&gpc2->lock);
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
/* When invoked from kvm_sched_out() we cannot sleep */
if (atomic)
@@ -581,7 +577,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
}
kvm_gpc_mark_dirty_in_slot(gpc1);
- read_unlock_irqrestore(&gpc1->lock, flags);
+ read_unlock(&gpc1->lock);
}
void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
@@ -640,7 +636,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
{
unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
- unsigned long flags;
if (!evtchn_pending_sel)
return;
@@ -650,14 +645,14 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
* does anyway. Page it in and retry the instruction. We're just a
* little more honest about it.
*/
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
return;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
/* Now gpc->khva is a valid kernel address for the vcpu_info */
@@ -687,7 +682,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
}
kvm_gpc_mark_dirty_in_slot(gpc);
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/* For the per-vCPU lapic vector, deliver it as MSI. */
if (v->arch.xen.upcall_vector)
@@ -698,7 +693,6 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
{
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
bool atomic = in_atomic() || !task_is_running(current);
- unsigned long flags;
u8 rc = 0;
/*
@@ -715,16 +709,13 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
if (atomic) {
- local_irq_save(flags);
- if (!read_trylock(&gpc->lock)) {
- local_irq_restore(flags);
+ if (!read_trylock(&gpc->lock))
return 1;
- }
} else {
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
/*
* This function gets called from kvm_vcpu_block() after setting the
@@ -744,11 +735,11 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
*/
return 0;
}
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
return rc;
}
@@ -1445,12 +1436,11 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
struct kvm *kvm = vcpu->kvm;
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
unsigned long *pending_bits;
- unsigned long flags;
bool ret = true;
int idx, i;
idx = srcu_read_lock(&kvm->srcu);
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
if (!kvm_gpc_check(gpc, PAGE_SIZE))
goto out_rcu;
@@ -1471,7 +1461,7 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
}
out_rcu:
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
srcu_read_unlock(&kvm->srcu, idx);
return ret;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 04/20] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock()
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (2 preceding siblings ...)
2026-05-29 16:50 ` [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c Sean Christopherson
@ 2026-05-29 16:50 ` Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 05/20] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c Sean Christopherson
` (15 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: David Woodhouse <dwmw@amazon.co.uk>
kvm_setup_guest_pvclock() is only called from kvm_guest_time_update()
which runs in process context (vcpu_enter_guest or ioctl). There is no
hardirq path that takes the GPC read lock for pvclock, so irqsave is
unnecessary.
Convert to plain read_lock()/read_unlock().
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e6f1dd84f22d..87e99756de0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3265,18 +3265,17 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
{
struct pvclock_vcpu_time_info *guest_hv_clock;
struct pvclock_vcpu_time_info hv_clock;
- unsigned long flags;
memcpy(&hv_clock, ref_hv_clock, sizeof(hv_clock));
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, offset + sizeof(*guest_hv_clock))) {
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
if (kvm_gpc_refresh(gpc, offset + sizeof(*guest_hv_clock)))
return;
- read_lock_irqsave(&gpc->lock, flags);
+ read_lock(&gpc->lock);
}
guest_hv_clock = (void *)(gpc->khva + offset);
@@ -3301,7 +3300,7 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
guest_hv_clock->version = ++hv_clock.version;
kvm_gpc_mark_dirty_in_slot(gpc);
- read_unlock_irqrestore(&gpc->lock, flags);
+ read_unlock(&gpc->lock);
trace_kvm_pvclock_update(vcpu->vcpu_id, &hv_clock);
}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 05/20] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (3 preceding siblings ...)
2026-05-29 16:50 ` [PATCH v2 04/20] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock() Sean Christopherson
@ 2026-05-29 16:50 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 06/20] KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical sections Sean Christopherson
` (14 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:50 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: David Woodhouse <dwmw@amazon.co.uk>
Now that all hardirq/atomic GPC users (xen_timer_callback,
kvm_xen_set_evtchn_fast) use read_trylock() instead of read_lock(), no
hardirq path ever holds the GPC rwlock. There is therefore no risk of
deadlock between the write side and a hardirq reader, and no need to
disable interrupts when taking the lock.
Convert all read_lock_irq()/write_lock_irq() and their unlock
counterparts to plain read_lock()/write_lock() in pfncache.c.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/pfncache.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 728d2c1b488a..70b102095173 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -29,12 +29,12 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
spin_lock(&kvm->gpc_lock);
list_for_each_entry(gpc, &kvm->gpc_list, list) {
- read_lock_irq(&gpc->lock);
+ read_lock(&gpc->lock);
/* Only a single page so no need to care about length */
if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
gpc->uhva >= start && gpc->uhva < end) {
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
/*
* There is a small window here where the cache could
@@ -44,15 +44,15 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
* acquired.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
gpc->uhva >= start && gpc->uhva < end)
gpc->valid = false;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
continue;
}
- read_unlock_irq(&gpc->lock);
+ read_unlock(&gpc->lock);
}
spin_unlock(&kvm->gpc_lock);
}
@@ -184,7 +184,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
mmu_seq = gpc->kvm->mmu_invalidate_seq;
smp_rmb();
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
/*
* If the previous iteration "failed" due to an mmu_notifier
@@ -225,7 +225,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
goto out_error;
}
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
/*
* Other tasks must wait for _this_ refresh to complete before
@@ -248,7 +248,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
return 0;
out_error:
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
return -EFAULT;
}
@@ -269,7 +269,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
lockdep_assert_held(&gpc->refresh_lock);
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
if (!gpc->active) {
ret = -EINVAL;
@@ -355,7 +355,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
unmap_old = (old_pfn != gpc->pfn);
out_unlock:
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
if (unmap_old)
gpc_unmap(old_pfn, old_khva);
@@ -417,9 +417,9 @@ static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned
* refresh must not establish a mapping until the cache is
* reachable by mmu_notifier events.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
gpc->active = true;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
}
return __kvm_gpc_refresh(gpc, gpa, uhva);
}
@@ -458,7 +458,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
* must stall mmu_notifier events until all users go away, i.e.
* until gpc->lock is dropped and refresh is guaranteed to fail.
*/
- write_lock_irq(&gpc->lock);
+ write_lock(&gpc->lock);
gpc->active = false;
gpc->valid = false;
@@ -473,7 +473,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
old_pfn = gpc->pfn;
gpc->pfn = KVM_PFN_ERR_FAULT;
- write_unlock_irq(&gpc->lock);
+ write_unlock(&gpc->lock);
spin_lock(&kvm->gpc_lock);
list_del(&gpc->list);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 06/20] KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical sections
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (4 preceding siblings ...)
2026-05-29 16:50 ` [PATCH v2 05/20] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper Sean Christopherson
` (13 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Use guard() to acquire and release kvm->srcu protection around gpc
critical sections, so that said critical sections can also use the fancy
__cleanup() functionality.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 32 ++++++++++++++------------------
1 file changed, 14 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b1fae42bf295..0c6b74b97408 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -42,8 +42,9 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
u32 *wc_sec_hi;
u32 wc_version;
u64 wall_nsec;
- int ret = 0;
- int idx = srcu_read_lock(&kvm->srcu);
+ int ret;
+
+ guard(srcu)(&kvm->srcu);
read_lock(&gpc->lock);
while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
@@ -51,7 +52,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
if (ret)
- goto out;
+ return ret;
read_lock(&gpc->lock);
}
@@ -99,10 +100,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
read_unlock(&gpc->lock);
kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
-out:
- srcu_read_unlock(&kvm->srcu, idx);
- return ret;
+ return 0;
}
void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
@@ -1437,9 +1435,10 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
unsigned long *pending_bits;
bool ret = true;
- int idx, i;
+ int i;
+
+ guard(srcu)(&kvm->srcu);
- idx = srcu_read_lock(&kvm->srcu);
read_lock(&gpc->lock);
if (!kvm_gpc_check(gpc, PAGE_SIZE))
goto out_rcu;
@@ -1462,8 +1461,6 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
out_rcu:
read_unlock(&gpc->lock);
- srcu_read_unlock(&kvm->srcu, idx);
-
return ret;
}
@@ -1795,7 +1792,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
unsigned long *pending_bits, *mask_bits;
int port_word_bit;
bool kick_vcpu = false;
- int vcpu_idx, idx, rc;
+ int vcpu_idx, rc;
vcpu_idx = READ_ONCE(xe->vcpu_idx);
if (vcpu_idx >= 0)
@@ -1812,10 +1809,11 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
rc = -EWOULDBLOCK;
- idx = srcu_read_lock(&kvm->srcu);
+ guard(srcu)(&kvm->srcu);
if (!read_trylock(&gpc->lock))
- goto out_rcu;
+ return rc;
+
if (!kvm_gpc_check(gpc, PAGE_SIZE))
goto out_unlock;
@@ -1856,7 +1854,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
*/
if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
kick_vcpu = true;
- goto out_rcu;
+ goto out_kick;
}
if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
@@ -1888,9 +1886,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
out_unlock:
read_unlock(&gpc->lock);
- out_rcu:
- srcu_read_unlock(&kvm->srcu, idx);
-
+ out_kick:
if (kick_vcpu) {
kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
kvm_vcpu_kick(vcpu);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (5 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 06/20] KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical sections Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 17:47 ` sashiko-bot
2026-05-29 16:51 ` [PATCH v2 08/20] KVM: x86/xen: Explicitly tag "shared info" page as never being dirty tracked Sean Christopherson
` (12 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Hoist the actual fastpath delivery of an event to a vCPU into a separate
helper so that CLASS()-based gpc locking+checking can be used without
needing to implement scoped versions.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 100 ++++++++++++++++++++++++---------------------
1 file changed, 54 insertions(+), 46 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 0c6b74b97408..020ef0ddab01 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1775,6 +1775,57 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
}
}
+static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
+{
+ struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
+ bool kick_vcpu = false;
+
+ /* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
+ if (!read_trylock(&gpc->lock)) {
+ /*
+ * Could not access the vcpu_info. Set the bit in-kernel and
+ * prod the vCPU to deliver it for itself.
+ */
+ if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+ kick_vcpu = true;
+ goto out_kick;
+ }
+ if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
+ if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+ kick_vcpu = true;
+ goto out_unlock;
+ }
+
+ if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode) {
+ struct vcpu_info *vcpu_info = gpc->khva;
+ if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
+ WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+ kick_vcpu = true;
+ }
+ } else {
+ struct compat_vcpu_info *vcpu_info = gpc->khva;
+ if (!test_and_set_bit(port_word_bit,
+ (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
+ WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+ kick_vcpu = true;
+ }
+ }
+
+ /* For the per-vCPU lapic vector, deliver it as MSI. */
+ if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
+ kvm_xen_inject_vcpu_vector(vcpu);
+ kick_vcpu = false;
+ }
+
+out_unlock:
+ read_unlock(&gpc->lock);
+out_kick:
+ if (kick_vcpu) {
+ kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+ kvm_vcpu_kick(vcpu);
+ }
+}
+
/*
* The return value from this function is propagated to kvm_set_irq() API,
* so it returns:
@@ -1791,7 +1842,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
struct kvm_vcpu *vcpu;
unsigned long *pending_bits, *mask_bits;
int port_word_bit;
- bool kick_vcpu = false;
int vcpu_idx, rc;
vcpu_idx = READ_ONCE(xe->vcpu_idx);
@@ -1843,55 +1893,13 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
kvm_xen_check_poller(vcpu, xe->port);
} else {
rc = 1; /* Delivered to the bitmap in shared_info. */
- /* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
- read_unlock(&gpc->lock);
- gpc = &vcpu->arch.xen.vcpu_info_cache;
-
- if (!read_trylock(&gpc->lock)) {
- /*
- * Could not access the vcpu_info. Set the bit in-kernel
- * and prod the vCPU to deliver it for itself.
- */
- if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
- kick_vcpu = true;
- goto out_kick;
- }
- if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
- kick_vcpu = true;
- goto out_unlock;
- }
-
- if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
- struct vcpu_info *vcpu_info = gpc->khva;
- if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
- WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
- kick_vcpu = true;
- }
- } else {
- struct compat_vcpu_info *vcpu_info = gpc->khva;
- if (!test_and_set_bit(port_word_bit,
- (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
- WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
- kick_vcpu = true;
- }
- }
-
- /* For the per-vCPU lapic vector, deliver it as MSI. */
- if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
- kvm_xen_inject_vcpu_vector(vcpu);
- kick_vcpu = false;
- }
}
- out_unlock:
+out_unlock:
read_unlock(&gpc->lock);
- out_kick:
- if (kick_vcpu) {
- kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
- kvm_vcpu_kick(vcpu);
- }
+ if (rc == 1)
+ __kvm_xen_set_evtchn_fast(vcpu, port_word_bit);
return rc;
}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 08/20] KVM: x86/xen: Explicitly tag "shared info" page as never being dirty tracked
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (6 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 09/20] KVM: x86/xen: Don't dirty track "vCPU info" page Sean Christopherson
` (11 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Explicitly mark the Xen shared info page as never being dirty tracked so
that higher-level gpc APIs can be added to automatically take care of
things like dirty tracking, without reintroducing the bug fixed by commit
55749769fe60 ("KVM: x86: Fix wall clock writes in Xen shared_info not to
mark page dirty"). And because the code _looks_ buggy.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 2 +-
include/linux/kvm_host.h | 13 ++++++++++---
include/linux/kvm_types.h | 1 +
virt/kvm/pfncache.c | 4 +++-
4 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 020ef0ddab01..ab8e95647406 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2334,7 +2334,7 @@ void kvm_xen_init_vm(struct kvm *kvm)
{
mutex_init(&kvm->arch.xen.xen_lock);
idr_init(&kvm->arch.xen.evtchn_ports);
- kvm_gpc_init(&kvm->arch.xen.shinfo_cache, kvm);
+ __kvm_gpc_init(&kvm->arch.xen.shinfo_cache, kvm, true);
}
void kvm_xen_destroy_vm(struct kvm *kvm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..0dc4eb78b6d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1427,16 +1427,23 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
unsigned long len);
/**
- * kvm_gpc_init - initialize gfn_to_pfn_cache.
+ * __kvm_gpc_init - initialize gfn_to_pfn_cache.
*
* @gpc: struct gfn_to_pfn_cache object.
* @kvm: pointer to kvm instance.
+ * @never_dirty: %true if the associated gfn should never be marked dirty
*
* This sets up a gfn_to_pfn_cache by initializing locks and assigning the
* immutable attributes. Note, the cache must be zero-allocated (or zeroed by
* the caller before init).
*/
-void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm);
+void __kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm,
+ bool never_dirty);
+
+static inline void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
+{
+ __kvm_gpc_init(gpc, kvm, false);
+}
/**
* kvm_gpc_activate - prepare a cached kernel mapping and HPA for a given guest
@@ -1942,7 +1949,7 @@ static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc)
{
lockdep_assert_held(&gpc->lock);
- if (!gpc->memslot)
+ if (!gpc->memslot || gpc->never_dirty)
return;
mark_page_dirty_in_slot(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa));
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index a568d8e6f4e8..e850adc3f47e 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -94,6 +94,7 @@ struct gfn_to_pfn_cache {
kvm_pfn_t pfn;
bool active;
bool valid;
+ bool never_dirty;
};
#ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 70b102095173..9209f06c46b4 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -382,7 +382,8 @@ int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
return __kvm_gpc_refresh(gpc, gpc->gpa, uhva);
}
-void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
+void __kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm,
+ bool never_dirty)
{
rwlock_init(&gpc->lock);
mutex_init(&gpc->refresh_lock);
@@ -392,6 +393,7 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
gpc->gpa = INVALID_GPA;
gpc->uhva = KVM_HVA_ERR_BAD;
gpc->active = gpc->valid = false;
+ gpc->never_dirty = never_dirty;
}
static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 09/20] KVM: x86/xen: Don't dirty track "vCPU info" page
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (7 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 08/20] KVM: x86/xen: Explicitly tag "shared info" page as never being dirty tracked Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 10/20] KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h Sean Christopherson
` (10 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Give the Xen per-vCPU info page the same treatment as the per-VM shared
info page, and never mark it dirty, as KVM clearly relies on userspace to
assume the page is always dirty. While the page is marked dirty on writes
via kvm_xen_inject_pending_events(), it's not marked dirty when written by
__kvm_xen_set_evtchn_fast().
Furthermore, as was the case with the shared info page, writes in the event
channel fastpath may be done without an active vCPU, e.g. when called via
timer callback or irqfd injection. I.e. attempting to fix the faspath
would run afoul of same issue that was fixed by commit ("KVM: x86: Fix wall
clock writes in Xen shared_info not to mark page dirty").
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index ab8e95647406..7b527a983cfc 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -679,7 +679,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
WRITE_ONCE(vi->evtchn_upcall_pending, 1);
}
- kvm_gpc_mark_dirty_in_slot(gpc);
read_unlock(&gpc->lock);
/* For the per-vCPU lapic vector, deliver it as MSI. */
@@ -2313,7 +2312,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
kvm_gpc_init(&vcpu->arch.xen.runstate_cache, vcpu->kvm);
kvm_gpc_init(&vcpu->arch.xen.runstate2_cache, vcpu->kvm);
- kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache, vcpu->kvm);
+ __kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache, vcpu->kvm, true);
kvm_gpc_init(&vcpu->arch.xen.vcpu_time_info_cache, vcpu->kvm);
}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 10/20] KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (8 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 09/20] KVM: x86/xen: Don't dirty track "vCPU info" page Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc Sean Christopherson
` (9 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Move the helpers to convert between pfns and physical addresses (for host
and guest) into kvm_types.h so that they can be used throughout kvm_host.h
(and any other KVM-related header) without having to worry about ordering.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
include/linux/kvm_host.h | 15 ---------------
include/linux/kvm_types.h | 16 ++++++++++++++++
2 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0dc4eb78b6d9..ffbae1e6e84e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1923,21 +1923,6 @@ hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
return slot->base_gfn + gfn_offset;
}
-static inline gpa_t gfn_to_gpa(gfn_t gfn)
-{
- return (gpa_t)gfn << PAGE_SHIFT;
-}
-
-static inline gfn_t gpa_to_gfn(gpa_t gpa)
-{
- return (gfn_t)(gpa >> PAGE_SHIFT);
-}
-
-static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
-{
- return (hpa_t)pfn << PAGE_SHIFT;
-}
-
static inline bool kvm_is_gpa_in_memslot(struct kvm *kvm, gpa_t gpa)
{
unsigned long hva = gfn_to_hva(kvm, gpa_to_gfn(gpa));
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index e850adc3f47e..961572e102f0 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -6,6 +6,7 @@
#include <linux/bits.h>
#include <linux/export.h>
#include <linux/types.h>
+#include <vdso/page.h>
#include <asm/kvm_types.h>
#ifdef KVM_SUB_MODULES
@@ -73,6 +74,21 @@ typedef u64 hfn_t;
typedef hfn_t kvm_pfn_t;
+static inline gpa_t gfn_to_gpa(gfn_t gfn)
+{
+ return (gpa_t)gfn << PAGE_SHIFT;
+}
+
+static inline gfn_t gpa_to_gfn(gpa_t gpa)
+{
+ return (gfn_t)(gpa >> PAGE_SHIFT);
+}
+
+static inline hpa_t pfn_to_hpa(kvm_pfn_t pfn)
+{
+ return (hpa_t)pfn << PAGE_SHIFT;
+}
+
struct gfn_to_hva_cache {
u64 generation;
gpa_t gpa;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (9 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 10/20] KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 17:59 ` sashiko-bot
2026-05-29 16:51 ` [PATCH v2 12/20] KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs Sean Christopherson
` (8 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Add CLASS() definitions for locally mapping a PFN given a gpc (gfn_to_pfn
cache), to eventually deduplicate a number of users that do:
lock();
while (!check()) {
unlock();
if (refresh())
return err;
lock()
}
...
mark_dirty();
unlock();
Implement read-only (for cases where KVM is only reading) and "try" (for
use in atomic code where rwlock might sleep due to PREEMPT_RT) variations.
Use "map local" as the primary terminology as the basic concept is more or
less the same as kmap_local(): ensure the current CPU has a kernel mapping
to the underlying memory.
Convert the pvclock code as the first user, as it is straightforward and
thus easier to audit for correctness.
For all intents and purposes, no functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 17 ++++----------
include/linux/kvm_host.h | 48 +++++++++++++++++++++++++++++++---------
virt/kvm/pfncache.c | 36 ++++++++++++++++++++++++++++++
3 files changed, 78 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87e99756de0a..ea10ed4ab06f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3268,17 +3268,11 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
memcpy(&hv_clock, ref_hv_clock, sizeof(hv_clock));
- read_lock(&gpc->lock);
- while (!kvm_gpc_check(gpc, offset + sizeof(*guest_hv_clock))) {
- read_unlock(&gpc->lock);
+ CLASS(gpc_map_local, clock_map)(gpc, offset + sizeof(*guest_hv_clock));
+ if (IS_ERR(clock_map))
+ return;
- if (kvm_gpc_refresh(gpc, offset + sizeof(*guest_hv_clock)))
- return;
-
- read_lock(&gpc->lock);
- }
-
- guest_hv_clock = (void *)(gpc->khva + offset);
+ guest_hv_clock = *clock_map + offset;
/*
* This VCPU is paused, but it's legal for a guest to read another
@@ -3299,9 +3293,6 @@ static void kvm_setup_guest_pvclock(struct pvclock_vcpu_time_info *ref_hv_clock,
guest_hv_clock->version = ++hv_clock.version;
- kvm_gpc_mark_dirty_in_slot(gpc);
- read_unlock(&gpc->lock);
-
trace_kvm_pvclock_update(vcpu->vcpu_id, &hv_clock);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ffbae1e6e84e..d70fa91cda0c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1535,6 +1535,44 @@ static inline bool kvm_gpc_is_hva_active(struct gfn_to_pfn_cache *gpc)
return gpc->active && kvm_is_error_gpa(gpc->gpa);
}
+static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc)
+{
+ lockdep_assert_held(&gpc->lock);
+
+ if (!gpc->memslot || gpc->never_dirty)
+ return;
+
+ mark_page_dirty_in_slot(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa));
+}
+
+void **gpc_map_local_lock(struct gfn_to_pfn_cache *gpc, unsigned long len);
+void **gpc_try_map_local_lock(struct gfn_to_pfn_cache *gpc, unsigned long len);
+
+static inline void gpc_map_local_unlock(void **khva)
+{
+ struct gfn_to_pfn_cache *gpc = container_of(khva, struct gfn_to_pfn_cache, khva);
+
+ kvm_gpc_mark_dirty_in_slot(gpc);
+
+ read_unlock(&gpc->lock);
+}
+
+static inline void gpc_map_local_unlock_ro(void **khva)
+{
+ read_unlock(&container_of(khva, struct gfn_to_pfn_cache, khva)->lock);
+}
+
+#define DEFINE_GPC_CLASS(try, ro) \
+DEFINE_CLASS(gpc##try##_map_local##ro, void **, \
+ if (!IS_ERR(_T)) gpc_map_local_unlock##ro(_T), \
+ gpc##try##_map_local_lock(gpc, len), \
+ struct gfn_to_pfn_cache *gpc, unsigned long len) \
+
+DEFINE_GPC_CLASS(,);
+DEFINE_GPC_CLASS(_try,);
+DEFINE_GPC_CLASS(, _ro);
+DEFINE_GPC_CLASS(_try, _ro);
+
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
@@ -1930,16 +1968,6 @@ static inline bool kvm_is_gpa_in_memslot(struct kvm *kvm, gpa_t gpa)
return !kvm_is_error_hva(hva);
}
-static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc)
-{
- lockdep_assert_held(&gpc->lock);
-
- if (!gpc->memslot || gpc->never_dirty)
- return;
-
- mark_page_dirty_in_slot(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa));
-}
-
enum kvm_stat_kind {
KVM_STAT_VM,
KVM_STAT_VCPU,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 9209f06c46b4..d3e02a2bac38 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -484,3 +484,39 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
gpc_unmap(old_pfn, old_khva);
}
}
+
+void **gpc_try_map_local_lock(struct gfn_to_pfn_cache *gpc, unsigned long len)
+{
+ if (!read_trylock(&gpc->lock))
+ return ERR_PTR(-EWOULDBLOCK);
+
+ if (!kvm_gpc_check(gpc, len)) {
+ read_unlock(&gpc->lock);
+ return ERR_PTR(-EWOULDBLOCK);
+ }
+
+ return &gpc->khva;
+}
+
+void **gpc_map_local_lock(struct gfn_to_pfn_cache *gpc, unsigned long len)
+{
+ /*
+ * Yes, this is an open-coded loop. But that's just what put_user()
+ * does anyway. Page it in and retry the instruction. We're just a
+ * little more honest about it.
+ */
+ for (;;) {
+ int r;
+
+ read_lock(&gpc->lock);
+
+ if (kvm_gpc_check(gpc, len))
+ return &gpc->khva;
+
+ read_unlock(&gpc->lock);
+
+ r = kvm_gpc_refresh(gpc, len);
+ if (r)
+ return ERR_PTR(r);
+ }
+}
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 12/20] KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (10 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 13/20] KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll Sean Christopherson
` (7 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Use the newfangled "map local" CLASS() APIs to access the Xen shared info
page via its gpc.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 7b527a983cfc..065b4c92f7ed 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -35,27 +35,19 @@ static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, u64 param, u64 *r);
DEFINE_STATIC_KEY_DEFERRED_FALSE(kvm_xen_enabled, HZ);
-static int kvm_xen_shared_info_init(struct kvm *kvm)
+static int __kvm_xen_shared_info_init(struct kvm *kvm)
{
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
struct pvclock_wall_clock *wc;
u32 *wc_sec_hi;
u32 wc_version;
u64 wall_nsec;
- int ret;
guard(srcu)(&kvm->srcu);
- read_lock(&gpc->lock);
- while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
- read_unlock(&gpc->lock);
-
- ret = kvm_gpc_refresh(gpc, PAGE_SIZE);
- if (ret)
- return ret;
-
- read_lock(&gpc->lock);
- }
+ CLASS(gpc_map_local, shinfo_map)(gpc, PAGE_SIZE);
+ if (IS_ERR(shinfo_map))
+ return PTR_ERR(shinfo_map);
/*
* This code mirrors kvm_write_wall_clock() except that it writes
@@ -74,14 +66,14 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
BUILD_BUG_ON(offsetof(struct shared_info, wc_sec_hi) != 0xc0c);
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
- struct shared_info *shinfo = gpc->khva;
+ struct shared_info *shinfo = *shinfo_map;
wc_sec_hi = &shinfo->wc_sec_hi;
wc = &shinfo->wc;
} else
#endif
{
- struct compat_shared_info *shinfo = gpc->khva;
+ struct compat_shared_info *shinfo = *shinfo_map;
wc_sec_hi = &shinfo->arch.wc_sec_hi;
wc = &shinfo->wc;
@@ -97,12 +89,20 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
smp_wmb();
wc->version = wc_version + 1;
- read_unlock(&gpc->lock);
-
- kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
return 0;
}
+static int kvm_xen_shared_info_init(struct kvm *kvm)
+{
+ int r;
+
+ r = __kvm_xen_shared_info_init(kvm);
+ if (!r)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
+
+ return r;
+}
+
void kvm_xen_inject_timer_irqs(struct kvm_vcpu *vcpu)
{
if (atomic_read(&vcpu->arch.xen.timer_pending) > 0) {
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 13/20] KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (11 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 12/20] KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 14/20] KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs Sean Christopherson
` (6 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Use trylock instead of waiting on the gpc->lock when locking and checking
the shared info page for SCHEDOP_poll, as odds are very good that if the
lock is contended, then the check will fail. This will allow using the new
CLASS() APIs for local gpc mappings, without having to add a version that
waits on the lock, but doesn't refresh on failure.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 065b4c92f7ed..d9b09809e243 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1438,7 +1438,9 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
guard(srcu)(&kvm->srcu);
- read_lock(&gpc->lock);
+ if (!read_trylock(&gpc->lock))
+ return ret;
+
if (!kvm_gpc_check(gpc, PAGE_SIZE))
goto out_rcu;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 14/20] KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (12 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 13/20] KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() " Sean Christopherson
` (5 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Use gpc's CLASS() interface to lock and check the shared info page when
processing a SCHEDOP_poll hypercall.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index d9b09809e243..8f822acb11a4 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1433,36 +1433,28 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
struct kvm *kvm = vcpu->kvm;
struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
unsigned long *pending_bits;
- bool ret = true;
int i;
guard(srcu)(&kvm->srcu);
- if (!read_trylock(&gpc->lock))
- return ret;
+ CLASS(gpc_try_map_local_ro, shinfo_map)(gpc, PAGE_SIZE);
+ if (IS_ERR(shinfo_map))
+ return true;
- if (!kvm_gpc_check(gpc, PAGE_SIZE))
- goto out_rcu;
-
- ret = false;
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
- struct shared_info *shinfo = gpc->khva;
+ struct shared_info *shinfo = *shinfo_map;
pending_bits = (unsigned long *)&shinfo->evtchn_pending;
} else {
- struct compat_shared_info *shinfo = gpc->khva;
+ struct compat_shared_info *shinfo = *shinfo_map;
pending_bits = (unsigned long *)&shinfo->evtchn_pending;
}
for (i = 0; i < nr_ports; i++) {
- if (test_bit(ports[i], pending_bits)) {
- ret = true;
- break;
- }
+ if (test_bit(ports[i], pending_bits))
+ return true;
}
- out_rcu:
- read_unlock(&gpc->lock);
- return ret;
+ return false;
}
static bool kvm_xen_schedop_poll(struct kvm_vcpu *vcpu, bool longmode,
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (13 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 14/20] KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 19:01 ` sashiko-bot
2026-05-29 16:51 ` [PATCH v2 16/20] KVM: x86/xen: Convert xen_get_guest_pvclock() " Sean Christopherson
` (4 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Convert the event channel fastpath to the "map local" CLASS() APIs, using
the "try" variants as the faspath can't block.
Note! The vcpu_info mapping is read/write, even though there is no
existing call to mark the page dirty. Like Xen's shared info page, the
vCPU info page is assumed to be dirty at all times, and so isn't marked
dirty after every write.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 42 ++++++++++++++++--------------------------
1 file changed, 16 insertions(+), 26 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8f822acb11a4..47750316f132 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1774,29 +1774,28 @@ static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
bool kick_vcpu = false;
/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
- if (!read_trylock(&gpc->lock)) {
- /*
- * Could not access the vcpu_info. Set the bit in-kernel and
- * prod the vCPU to deliver it for itself.
- */
+ CLASS(gpc_try_map_local, vcpu_info_map)(gpc, sizeof(struct vcpu_info));
+
+ /*
+ * If the vcpu_info is inaccessible, set the bit in-kernel and prod the
+ * vCPU to deliver it for itself.
+ */
+ if (IS_ERR(vcpu_info_map)) {
if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
kick_vcpu = true;
goto out_kick;
}
- if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
- kick_vcpu = true;
- goto out_unlock;
- }
if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode) {
- struct vcpu_info *vcpu_info = gpc->khva;
+ struct vcpu_info *vcpu_info = *vcpu_info_map;
+
if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
kick_vcpu = true;
}
} else {
- struct compat_vcpu_info *vcpu_info = gpc->khva;
+ struct compat_vcpu_info *vcpu_info = *vcpu_info_map;
+
if (!test_and_set_bit(port_word_bit,
(unsigned long *)&vcpu_info->evtchn_pending_sel)) {
WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
@@ -1810,8 +1809,6 @@ static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
kick_vcpu = false;
}
-out_unlock:
- read_unlock(&gpc->lock);
out_kick:
if (kick_vcpu) {
kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
@@ -1850,23 +1847,19 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
if (xe->port >= max_evtchn_port(kvm))
return -EINVAL;
- rc = -EWOULDBLOCK;
-
guard(srcu)(&kvm->srcu);
- if (!read_trylock(&gpc->lock))
- return rc;
-
- if (!kvm_gpc_check(gpc, PAGE_SIZE))
- goto out_unlock;
+ CLASS(gpc_try_map_local, shinfo_map)(gpc, PAGE_SIZE);
+ if (IS_ERR(shinfo_map))
+ return -EWOULDBLOCK;
if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
- struct shared_info *shinfo = gpc->khva;
+ struct shared_info *shinfo = *shinfo_map;
pending_bits = (unsigned long *)&shinfo->evtchn_pending;
mask_bits = (unsigned long *)&shinfo->evtchn_mask;
port_word_bit = xe->port / 64;
} else {
- struct compat_shared_info *shinfo = gpc->khva;
+ struct compat_shared_info *shinfo = *shinfo_map;
pending_bits = (unsigned long *)&shinfo->evtchn_pending;
mask_bits = (unsigned long *)&shinfo->evtchn_mask;
port_word_bit = xe->port / 32;
@@ -1888,9 +1881,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
rc = 1; /* Delivered to the bitmap in shared_info. */
}
-out_unlock:
- read_unlock(&gpc->lock);
-
if (rc == 1)
__kvm_xen_set_evtchn_fast(vcpu, port_word_bit);
return rc;
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 16/20] KVM: x86/xen: Convert xen_get_guest_pvclock() to gpc's CLASS() APIs
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (14 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() " Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 17/20] KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast() Sean Christopherson
` (3 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Convert the Xen pvclock reads to the "map local" CLASS() APIs, using a
read-only variants as KVM simply copying data from the pvclock.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 47750316f132..89daad3fe712 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -153,21 +153,11 @@ static int xen_get_guest_pvclock(struct kvm_vcpu *vcpu,
struct gfn_to_pfn_cache *gpc,
unsigned int offset)
{
- int r;
+ CLASS(gpc_map_local_ro, pvclock_map)(gpc, offset + sizeof(*hv_clock));
+ if (IS_ERR(pvclock_map))
+ return PTR_ERR(pvclock_map);
- read_lock(&gpc->lock);
- while (!kvm_gpc_check(gpc, offset + sizeof(*hv_clock))) {
- read_unlock(&gpc->lock);
-
- r = kvm_gpc_refresh(gpc, offset + sizeof(*hv_clock));
- if (r)
- return r;
-
- read_lock(&gpc->lock);
- }
-
- memcpy(hv_clock, gpc->khva + offset, sizeof(*hv_clock));
- read_unlock(&gpc->lock);
+ memcpy(hv_clock, *pvclock_map + offset, sizeof(*hv_clock));
/*
* Sanity check TSC shift+multiplier to verify the guest's view of time
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 17/20] KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast()
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (15 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 16/20] KVM: x86/xen: Convert xen_get_guest_pvclock() " Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 18/20] KVM: x86/xen: Convert event injection to gpc's CLASS() APIs Sean Christopherson
` (2 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Now that the CLASS()-based gpc mappings allow for early returns, drop the
local "kick_vcpu" from the event channel fastpath, and simply return early
if the vCPU doesn't need to be kicked.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 89daad3fe712..2c776e475a4f 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1761,7 +1761,6 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
{
struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
- bool kick_vcpu = false;
/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
CLASS(gpc_try_map_local, vcpu_info_map)(gpc, sizeof(struct vcpu_info));
@@ -1771,39 +1770,37 @@ static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
* vCPU to deliver it for itself.
*/
if (IS_ERR(vcpu_info_map)) {
- if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
- kick_vcpu = true;
+ if (test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
+ return;
goto out_kick;
}
if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode) {
struct vcpu_info *vcpu_info = *vcpu_info_map;
- if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
- WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
- kick_vcpu = true;
- }
+ if (test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel))
+ return;
+
+ WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
} else {
struct compat_vcpu_info *vcpu_info = *vcpu_info_map;
- if (!test_and_set_bit(port_word_bit,
- (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
- WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
- kick_vcpu = true;
- }
+ if (test_and_set_bit(port_word_bit,
+ (unsigned long *)&vcpu_info->evtchn_pending_sel))
+ return;
+
+ WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
}
/* For the per-vCPU lapic vector, deliver it as MSI. */
- if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
+ if (vcpu->arch.xen.upcall_vector) {
kvm_xen_inject_vcpu_vector(vcpu);
- kick_vcpu = false;
+ return;
}
out_kick:
- if (kick_vcpu) {
- kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
- kvm_vcpu_kick(vcpu);
- }
+ kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+ kvm_vcpu_kick(vcpu);
}
/*
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 18/20] KVM: x86/xen: Convert event injection to gpc's CLASS() APIs
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (16 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 17/20] KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast() Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 19/20] KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time Sean Christopherson
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Convert Xen event injection to the "map local" CLASS() APIs.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 24 +++++-------------------
1 file changed, 5 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 2c776e475a4f..3ebde7ba5558 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -628,24 +628,12 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
if (!evtchn_pending_sel)
return;
- /*
- * Yes, this is an open-coded loop. But that's just what put_user()
- * does anyway. Page it in and retry the instruction. We're just a
- * little more honest about it.
- */
- read_lock(&gpc->lock);
- while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock(&gpc->lock);
+ CLASS(gpc_map_local, vcpu_info_map)(gpc, sizeof(struct vcpu_info));
+ if (IS_ERR(vcpu_info_map))
+ return;
- if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
- return;
-
- read_lock(&gpc->lock);
- }
-
- /* Now gpc->khva is a valid kernel address for the vcpu_info */
if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
- struct vcpu_info *vi = gpc->khva;
+ struct vcpu_info *vi = *vcpu_info_map;
asm volatile(LOCK_PREFIX "orq %0, %1\n"
"notq %0\n"
@@ -657,7 +645,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
WRITE_ONCE(vi->evtchn_upcall_pending, 1);
} else {
u32 evtchn_pending_sel32 = evtchn_pending_sel;
- struct compat_vcpu_info *vi = gpc->khva;
+ struct compat_vcpu_info *vi = *vcpu_info_map;
asm volatile(LOCK_PREFIX "orl %0, %1\n"
"notl %0\n"
@@ -669,8 +657,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
WRITE_ONCE(vi->evtchn_upcall_pending, 1);
}
- read_unlock(&gpc->lock);
-
/* For the per-vCPU lapic vector, deliver it as MSI. */
if (v->arch.xen.upcall_vector)
kvm_xen_inject_vcpu_vector(v);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 19/20] KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (17 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 18/20] KVM: x86/xen: Convert event injection to gpc's CLASS() APIs Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time Sean Christopherson
19 siblings, 0 replies; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
Extend the CLASS() APIs for gpcs to allow choosing between the "normal" and
the "try" versions at runtime, depending on whether or not the caller is
running in atomic context. Convert the "has interrupt" helper as the first
user, as it is called from IRQ context, but also needs to wait when called
from non-atomic context, i.e. can't tolerate false negatives in that case.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/xen.c | 52 +++++++++++++---------------------------
include/linux/kvm_host.h | 10 ++++++++
2 files changed, 26 insertions(+), 36 deletions(-)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 3ebde7ba5558..a2e88a76e8d9 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -666,7 +666,6 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
{
struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
bool atomic = in_atomic() || !task_is_running(current);
- u8 rc = 0;
/*
* If the global upcall vector (HVMIRQ_callback_vector) is set and
@@ -676,44 +675,25 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
/* No need for compat handling here */
BUILD_BUG_ON(offsetof(struct vcpu_info, evtchn_upcall_pending) !=
offsetof(struct compat_vcpu_info, evtchn_upcall_pending));
- BUILD_BUG_ON(sizeof(rc) !=
- sizeof_field(struct vcpu_info, evtchn_upcall_pending));
- BUILD_BUG_ON(sizeof(rc) !=
+ BUILD_BUG_ON(sizeof_field(struct vcpu_info, evtchn_upcall_pending) !=
sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
- if (atomic) {
- if (!read_trylock(&gpc->lock))
- return 1;
- } else {
- read_lock(&gpc->lock);
- }
- while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
- read_unlock(&gpc->lock);
+ /*
+ * This function gets called from kvm_vcpu_block() after setting the
+ * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately
+ * from a HLT. So we really mustn't sleep. If the page ended up absent
+ * at that point, just return 1 in order to trigger an immediate wake,
+ * and we'll end up getting called again from a context where we *can*
+ * fault in the page and wait for it.
+ *
+ * For normal, non-atomic usage, nothing can be done if userspace has
+ * screwed up the vcpu_info mapping. No interrupts for you.
+ */
+ CLASS(gpc_map_local_ro_ex, info_map)(gpc, sizeof(struct vcpu_info), atomic);
+ if (IS_ERR(info_map))
+ return atomic ? 1 : 0;
- /*
- * This function gets called from kvm_vcpu_block() after setting the
- * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately
- * from a HLT. So we really mustn't sleep. If the page ended up absent
- * at that point, just return 1 in order to trigger an immediate wake,
- * and we'll end up getting called again from a context where we *can*
- * fault in the page and wait for it.
- */
- if (atomic)
- return 1;
-
- if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info))) {
- /*
- * If this failed, userspace has screwed up the
- * vcpu_info mapping. No interrupts for you.
- */
- return 0;
- }
- read_lock(&gpc->lock);
- }
-
- rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
- read_unlock(&gpc->lock);
- return rc;
+ return ((struct vcpu_info *)*info_map)->evtchn_upcall_pending;
}
int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d70fa91cda0c..0602d0ca731c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1573,6 +1573,16 @@ DEFINE_GPC_CLASS(_try,);
DEFINE_GPC_CLASS(, _ro);
DEFINE_GPC_CLASS(_try, _ro);
+#define DEFINE_GPC_CLASS_EX(ro) \
+DEFINE_CLASS(gpc_map_local##ro##_ex, void **, \
+ if (!IS_ERR(_T)) gpc_map_local_unlock##ro(_T), \
+ atomic ? gpc_try_map_local_lock(gpc, len) : \
+ gpc_map_local_lock(gpc, len), \
+ struct gfn_to_pfn_cache *gpc, unsigned long len, bool atomic)
+
+DEFINE_GPC_CLASS_EX();
+DEFINE_GPC_CLASS_EX(_ro);
+
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
` (18 preceding siblings ...)
2026-05-29 16:51 ` [PATCH v2 19/20] KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases Sean Christopherson
@ 2026-05-29 16:51 ` Sean Christopherson
2026-05-30 6:19 ` sashiko-bot
19 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 16:51 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, David Woodhouse, Paul Durrant,
Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng
Cc: Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
From: Carsten Stollmaier <stollmc@amazon.com>
This largely reverts commit 7e2175ebd695 ("KVM: x86: Fix recording of
guest steal time / preempted status"), which dropped the use of the
gfn_to_pfn_cache because it was not integrated with the MMU notifiers
at the time. That shortcoming has long since been addressed, making
the GPC work correctly for this use case.
Aside from cleaning up the last open-coded assembler access to user
addresses and associated explicit asm exception fixups, moving back
to the now-functional GPC also resolves an issue with contention on
the mmap_lock with userfaultfd. The contention issue is as follows:
On vcpu_run, before entering the guest, the update of the steal time
information causes a page-fault if the page is not present. In our
scenario, this gets handled by do_user_addr_fault() and successively
handle_userfault() because the region is registered to that.
Since handle_userfault() uses TASK_INTERRUPTIBLE, it is interruptible
by signals. But do_user_addr_fault() then busy-retries if the pending
signal is non-fatal, which leads to heavy contention of the mmap_lock.
By restoring the use of GPC for accessing the guest steal time, the
contention is avoided and refreshing the GPC happens when the vCPU is
next scheduled.
Since the gfn_to_pfn_cache gives a kernel mapping rather than a
userspace HVA, accesses are now plain C instead of unsafe_put_user()
et al. Use READ_ONCE()/WRITE_ONCE() to prevent the compiler from
reordering or tearing the accesses, and add an smp_wmb() before the
final version increment to ensure the data writes are ordered before
the seqcount update — the old unsafe_put_user() inline assembly acted
as an implicit compiler barrier.
In kvm_steal_time_set_preempted(), use read_trylock() instead of
read_lock_irqsave() since this is called from the scheduler path
where rwlock_t is not safe on PREEMPT_RT (it becomes sleepable).
Since we only trylock and bail on failure, there is no risk of
deadlock with an interrupt handler, so no need to disable interrupts
at all. Setting the preempted flag is best-effort anyway.
Signed-off-by: Carsten Stollmaier <stollmc@amazon.com>
Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.c | 122 ++++++++++++++------------------
2 files changed, 54 insertions(+), 70 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ae7d539af90..9f652dcdda93 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -983,7 +983,7 @@ struct kvm_vcpu_arch {
u8 preempted;
u64 msr_val;
u64 last_steal;
- struct gfn_to_hva_cache cache;
+ struct gfn_to_pfn_cache cache;
} st;
u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea10ed4ab06f..1b27dd9ba0aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3728,10 +3728,8 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_service_local_tlb_flush_requests);
static void record_steal_time(struct kvm_vcpu *vcpu)
{
- struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
- struct kvm_steal_time __user *st;
- struct kvm_memslots *slots;
- gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+ struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+ struct kvm_steal_time *st;
u64 steal;
u32 version;
@@ -3746,42 +3744,20 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
if (WARN_ON_ONCE(current->mm != vcpu->kvm->mm))
return;
- slots = kvm_memslots(vcpu->kvm);
+ /* We rely on the fact that it fits in a single page. */
+ BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
- if (unlikely(slots->generation != ghc->generation ||
- gpa != ghc->gpa ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
- /* We rely on the fact that it fits in a single page. */
- BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
+ CLASS(gpc_map_local, st_map)(gpc, sizeof(*st));
+ if (IS_ERR(st_map))
+ return;
- if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gpa, sizeof(*st)) ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot)
- return;
- }
-
- st = (struct kvm_steal_time __user *)ghc->hva;
+ st = *st_map;
/*
* Doing a TLB flush here, on the guest's behalf, can avoid
* expensive IPIs.
*/
if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
- u8 st_preempted = 0;
- int err = -EFAULT;
-
- if (!user_access_begin(st, sizeof(*st)))
- return;
-
- asm volatile("1: xchgb %0, %2\n"
- "xor %1, %1\n"
- "2:\n"
- _ASM_EXTABLE_UA(1b, 2b)
- : "+q" (st_preempted),
- "+&r" (err),
- "+m" (st->preempted));
- if (err)
- goto out;
-
- user_access_end();
+ u8 st_preempted = xchg(&st->preempted, 0);
vcpu->arch.st.preempted = 0;
@@ -3789,39 +3765,30 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
st_preempted & KVM_VCPU_FLUSH_TLB);
if (st_preempted & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
-
- if (!user_access_begin(st, sizeof(*st)))
- goto dirty;
} else {
- if (!user_access_begin(st, sizeof(*st)))
- return;
-
- unsafe_put_user(0, &st->preempted, out);
+ WRITE_ONCE(st->preempted, 0);
vcpu->arch.st.preempted = 0;
}
- unsafe_get_user(version, &st->version, out);
+ version = READ_ONCE(st->version);
if (version & 1)
version += 1; /* first time write, random junk */
version += 1;
- unsafe_put_user(version, &st->version, out);
+ WRITE_ONCE(st->version, version);
smp_wmb();
- unsafe_get_user(steal, &st->steal, out);
+ steal = READ_ONCE(st->steal);
steal += current->sched_info.run_delay -
vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
- unsafe_put_user(steal, &st->steal, out);
+ WRITE_ONCE(st->steal, steal);
+
+ smp_wmb();
version += 1;
- unsafe_put_user(version, &st->version, out);
-
- out:
- user_access_end();
- dirty:
- mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+ WRITE_ONCE(st->version, version);
}
/*
@@ -4162,8 +4129,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vcpu->arch.st.msr_val = data;
- if (!(data & KVM_MSR_ENABLED))
- break;
+ if (data & KVM_MSR_ENABLED)
+ kvm_gpc_activate(&vcpu->arch.st.cache, data & ~KVM_MSR_ENABLED,
+ sizeof(struct kvm_steal_time));
+ else
+ kvm_gpc_deactivate(&vcpu->arch.st.cache);
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
@@ -5231,11 +5201,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
- struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
- struct kvm_steal_time __user *st;
- struct kvm_memslots *slots;
- static const u8 preempted = KVM_VCPU_PREEMPTED;
- gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+ struct gfn_to_pfn_cache *gpc = &vcpu->arch.st.cache;
+ struct kvm_steal_time *st;
/*
* The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -5260,20 +5227,32 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
if (unlikely(current->mm != vcpu->kvm->mm))
return;
- slots = kvm_memslots(vcpu->kvm);
-
- if (unlikely(slots->generation != ghc->generation ||
- gpa != ghc->gpa ||
- kvm_is_error_hva(ghc->hva) || !ghc->memslot))
+ /*
+ * Use a trylock as this is called from the scheduler path (via
+ * kvm_sched_out), where rwlock_t is not safe on PREEMPT_RT (it
+ * becomes sleepable). Setting preempted is best-effort anyway;
+ * the old HVA-based code used copy_to_user_nofault() which could
+ * also silently fail.
+ *
+ * Since we only trylock and bail on failure, there is no risk of
+ * deadlock with an interrupt handler, so no need to disable
+ * interrupts.
+ */
+ CLASS(gpc_try_map_local, st_map)(gpc, sizeof(st->preempted));
+ if (IS_ERR(st_map))
return;
- st = (struct kvm_steal_time __user *)ghc->hva;
- BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
+ st = *st_map;
+ WRITE_ONCE(st->preempted, KVM_VCPU_PREEMPTED);
+ vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+}
- if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
- vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
-
- mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
+static void kvm_steal_time_reset(struct kvm_vcpu *vcpu)
+{
+ kvm_gpc_deactivate(&vcpu->arch.st.cache);
+ vcpu->arch.st.preempted = 0;
+ vcpu->arch.st.msr_val = 0;
+ vcpu->arch.st.last_steal = 0;
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -12819,6 +12798,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);
+ kvm_gpc_init(&vcpu->arch.st.cache, vcpu->kvm);
+
if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
else
@@ -12926,6 +12907,8 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_clear_async_pf_completion_queue(vcpu);
kvm_mmu_unload(vcpu);
+ kvm_steal_time_reset(vcpu);
+
kvmclock_reset(vcpu);
for_each_possible_cpu(cpu)
@@ -13046,7 +13029,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_en_val = 0;
vcpu->arch.apf.msr_int_val = 0;
- vcpu->arch.st.msr_val = 0;
+
+ kvm_steal_time_reset(vcpu);
kvmclock_reset(vcpu);
--
2.54.0.823.g6e5bcc1fc9-goog
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
2026-05-29 16:50 ` [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths Sean Christopherson
@ 2026-05-29 17:20 ` sashiko-bot
2026-05-29 23:28 ` Hillf Danton
1 sibling, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-05-29 17:20 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unbalanced interrupt disable causes permanent hard IRQ disable on PREEMPT_RT
--
commit 13e357726c8cbadaf777f9052a529db65b824338
Author: David Woodhouse <dwmw@amazon.co.uk>
KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
This patch modifies the GPC lock acquisition paths to use read_trylock()
instead of read_lock_irqsave() in atomic contexts. It introduces an explicit
local_irq_save() to prevent issues on PREEMPT_RT kernels where rwlock_t
cannot be used with read_lock_irqsave() from hardirq context.
Reported-by: syzbot+208f7f3e5f59c11aeb90@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=208f7f3e5f59c11aeb90
Fixes: 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 91fd3673c09a2..9bdb8e3cad58b 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -713,7 +714,15 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
> BUILD_BUG_ON(sizeof(rc) !=
> sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
>
> - read_lock_irqsave(&gpc->lock, flags);
> + if (atomic) {
> + local_irq_save(flags);
> + if (!read_trylock(&gpc->lock)) {
> + local_irq_restore(flags);
> + return 1;
> + }
> + } else {
> + read_lock_irqsave(&gpc->lock, flags);
> + }
> while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
> read_unlock_irqrestore(&gpc->lock, flags);
[Severity: High]
Does this result in unbalanced interrupt disables on PREEMPT_RT?
When atomic is true, local_irq_save() explicitly disables hardware interrupts.
On standard kernels, read_unlock_irqrestore() restores the interrupt state
correctly.
However, on PREEMPT_RT, rwlock_t is a sleepable lock and the implementation
of read_unlock_irqrestore() ignores the flags parameter, merely dropping the
lock without re-enabling interrupts. The interrupt state saved by
local_irq_save() is never restored, which would lead to hardware interrupts
being permanently disabled on the CPU.
I noticed this issue is resolved later in the series by commit 625a2fc07daa6
("KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c").
Should this fix be folded into this patch to ensure the kernel remains fully
functional and bisectable at this step?
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=2
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
2026-05-29 16:50 ` [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c Sean Christopherson
@ 2026-05-29 17:36 ` sashiko-bot
0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-05-29 17:36 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Calling read_unlock() on an rwlock_t inside kvm_sched_out() can cause deadlocks on PREEMPT_RT because waking a waiting writer requires acquiring runqueue locks while the current runqueue lock is already held.
--
commit 625a2fc07daa67a8fcafaa64daaf9791f5012ecd
Author: David Woodhouse <dwmw@amazon.co.uk>
Subject: KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c
This commit converts read_lock_irqsave() and read_unlock_irqrestore()
calls to plain read_lock() and read_unlock() for gpc->lock in several
xen.c functions, as the hardirq path now safely uses read_trylock().
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
[ ... ]
> @@ -421,16 +419,14 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
> * gfn_to_pfn caches that cover the region.
> */
> if (atomic) {
> - local_irq_save(flags);
> if (!read_trylock(&gpc1->lock)) {
> - local_irq_restore(flags);
> return;
> }
> } else {
> - read_lock_irqsave(&gpc1->lock, flags);
> + read_lock(&gpc1->lock);
> }
> while (!kvm_gpc_check(gpc1, user_len1)) {
> - read_unlock_irqrestore(&gpc1->lock, flags);
> + read_unlock(&gpc1->lock);
[Severity: High]
This is a pre-existing issue, but can calling read_unlock() here lead to a
deadlock on PREEMPT_RT?
kvm_xen_update_runstate_guest() is called with atomic=true from
kvm_sched_out(), which is invoked via preempt notifiers while holding the
runqueue lock for the current CPU.
On CONFIG_PREEMPT_RT, rwlock_t is mapped to an rt_rwlock. If a writer is
waiting (for example, from a concurrent kvm_gpc_deactivate()),
read_unlock() calls __rwbase_read_unlock() to wake the blocked writer.
__rwbase_read_unlock() observes the waiting writer and calls
rt_mutex_wake_up_q(), which synchronously calls try_to_wake_up().
try_to_wake_up() will then attempt to acquire the pi_lock and the target
task's rq_lock. Since the current CPU's runqueue lock is already held by
the scheduler, acquiring another rq_lock without double_rq_lock() violates
lock ordering and can cause an ABBA deadlock.
Is there a plan to handle this rt_rwlock wake-up behavior from the preempt
notifier context?
>
> /* When invoked from kvm_sched_out() we cannot sleep */
> if (atomic)
> @@ -439,7 +435,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=3
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper
2026-05-29 16:51 ` [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper Sean Christopherson
@ 2026-05-29 17:47 ` sashiko-bot
0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-05-29 17:47 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [Critical] Lockless reads of `kvm->arch.xen.long_mode` across multiple checks allow TOCTOU races, leading to out-of-bounds memory corruption in both host kernel state and guest state.
--
commit f4fd761232e8e0f394f8395cfeec34caad9f061d
Author: Sean Christopherson <seanjc@google.com>
KVM: x86/xen: Extract delivery of event to vCPU into a separate helper
This commit extracts the fastpath delivery of an event to a vCPU into a
separate helper function to allow CLASS()-based gpc locking and checking.
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 0c6b74b974080..020ef0ddab013 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
[ ... ]
> @@ -1775,6 +1775,57 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
> }
> }
>
> +static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
> +{
> + struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
> + bool kick_vcpu = false;
> +
> + /* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
> + if (!read_trylock(&gpc->lock)) {
> + /*
> + * Could not access the vcpu_info. Set the bit in-kernel and
> + * prod the vCPU to deliver it for itself.
> + */
> + if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
[Severity: Critical]
This is a pre-existing issue, but can the lockless evaluation of long_mode
lead to out-of-bounds memory corruption here?
If userspace races KVM_XEN_HVM_SET_ATTR (setting long_mode = 0) with event
channel delivery (e.g., via KVM_XEN_HVM_EVTCHN_SEND), could the initial
max_evtchn_port() check observe long_mode == 1, permitting xe->port up to
4095?
If the subsequent check inside kvm_xen_set_evtchn_fast() observes
long_mode == 0, it falls into the 32-bit compat block, computing
port_word_bit = xe->port / 32, which could yield a bit index up to 127.
When this out-of-bounds port_word_bit is passed to test_and_set_bit(), it
would write past the 8-byte evtchn_pending_sel field, potentially corrupting
the adjacent vcpu_id and timer_virq fields in struct kvm_vcpu_xen.
> + kick_vcpu = true;
> + goto out_kick;
> + }
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=7
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc
2026-05-29 16:51 ` [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc Sean Christopherson
@ 2026-05-29 17:59 ` sashiko-bot
0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-05-29 17:59 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The cleanup wrapper defined by `DEFINE_GPC_CLASS` uses `if (!IS_ERR(_T))` instead of `if (!IS_ERR_OR_NULL(_T))`, failing to guard against `NULL` pointers. This leads to a guaranteed kernel crash if ownership of the class variable is ever transferred or if it is initialized to `NULL`.
--
commit 88a61c5d744692bfbffaf3ccd332184029a7ffee
Author: Sean Christopherson <seanjc@google.com>
KVM: Add CLASS() constructs to automagically handle lock+check of gpc
This commit adds CLASS() definitions for mapping a PFN given a gfn_to_pfn
cache (gpc), deduplicating lock, check, and unlock loops. The pvclock code
is converted as the first user of this new construct.
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
[ ... ]
> +static inline void gpc_map_local_unlock(void **khva)
> +{
> + struct gfn_to_pfn_cache *gpc = container_of(khva, struct gfn_to_pfn_cache, khva);
> +
> + kvm_gpc_mark_dirty_in_slot(gpc);
> +
> + read_unlock(&gpc->lock);
> +}
> +
> +static inline void gpc_map_local_unlock_ro(void **khva)
> +{
> + read_unlock(&container_of(khva, struct gfn_to_pfn_cache, khva)->lock);
> +}
> +
> +#define DEFINE_GPC_CLASS(try, ro) \
> +DEFINE_CLASS(gpc##try##_map_local##ro, void **, \
> + if (!IS_ERR(_T)) gpc_map_local_unlock##ro(_T), \
[Severity: High]
Does this guard expression properly handle NULL values, or could it lead to a
regression?
Since IS_ERR(NULL) evaluates to false, !IS_ERR(NULL) evaluates to true. If a
user of this class manually sets the variable to NULL or uses a primitive
like no_free_ptr() to inhibit cleanup, the destructor will run and
erroneously execute gpc_map_local_unlock(NULL).
Inside gpc_map_local_unlock(), container_of(NULL, ...) will produce a large
negative pointer. The subsequent calls to kvm_gpc_mark_dirty_in_slot() and
read_unlock() will dereference this invalid pointer and cause a regression.
Would it be safer to use if (!IS_ERR_OR_NULL(_T)) instead?
> + gpc##try##_map_local_lock(gpc, len), \
> + struct gfn_to_pfn_cache *gpc, unsigned long len) \
> +
> +DEFINE_GPC_CLASS(,);
> +DEFINE_GPC_CLASS(_try,);
> +DEFINE_GPC_CLASS(, _ro);
> +DEFINE_GPC_CLASS(_try, _ro);
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=11
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
2026-05-29 16:51 ` [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() " Sean Christopherson
@ 2026-05-29 19:01 ` sashiko-bot
2026-05-29 19:11 ` Sean Christopherson
0 siblings, 1 reply; 55+ messages in thread
From: sashiko-bot @ 2026-05-29 19:01 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Scope-based `CLASS()` conversion unintentionally extends the duration of the `gpc->lock` read locks across expensive cross-vCPU IPIs (`kvm_vcpu_kick`) and nested lock acquisitions.
--
commit f52c566cf9e4ecf5b818d43ac788f5009d85b7bc
Author: Sean Christopherson <seanjc@google.com>
KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
Convert the event channel fastpath to the "map local" CLASS() APIs, using
the "try" variants as the faspath can't block.
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 8f822acb11a46..47750316f1322 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
[ ... ]
> @@ -1810,8 +1809,6 @@ static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
> kick_vcpu = false;
> }
>
> -out_unlock:
> - read_unlock(&gpc->lock);
> out_kick:
> if (kick_vcpu) {
> kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
[Severity: High]
Does this unintentionally hold the gpc lock across the IPI kick?
By replacing the explicit unlock with a function-scoped CLASS() variable, the
vcpu_info_map destructor won't run until the end of the function. This means
the lock is now held while calling kvm_vcpu_kick(), whereas the original
code explicitly dropped it first.
[ ... ]
> @@ -1888,9 +1881,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
> rc = 1; /* Delivered to the bitmap in shared_info. */
> }
>
> -out_unlock:
> - read_unlock(&gpc->lock);
> -
> if (rc == 1)
> __kvm_xen_set_evtchn_fast(vcpu, port_word_bit);
> return rc;
[Severity: High]
Does this now hold the VM-wide shinfo_cache lock while calling
__kvm_xen_set_evtchn_fast() and kicking the vCPU?
Since shinfo_map is a function-scoped CLASS() variable, its destructor
won't release the lock until after __kvm_xen_set_evtchn_fast() returns.
This creates a nested locking dependency and holds locks over expensive
cross-vCPU operations, potentially serializing event channel deliveries
across the entire VM on the fast path.
Would it make sense to constrain the mapping to an explicit scope block
to match the original release points and maintain the "No functional change
intended" guarantee?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=15
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
2026-05-29 19:01 ` sashiko-bot
@ 2026-05-29 19:11 ` Sean Christopherson
2026-06-02 12:37 ` David Woodhouse
0 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 19:11 UTC (permalink / raw)
To: sashiko-reviews; +Cc: kvm
On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index 8f822acb11a46..47750316f1322 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> [ ... ]
> > @@ -1810,8 +1809,6 @@ static void __kvm_xen_set_evtchn_fast(struct kvm_vcpu *vcpu, int port_word_bit)
> > kick_vcpu = false;
> > }
> >
> > -out_unlock:
> > - read_unlock(&gpc->lock);
> > out_kick:
> > if (kick_vcpu) {
> > kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
>
> [Severity: High]
> Does this unintentionally hold the gpc lock across the IPI kick?
No, it intentionally holds the gpc lock across _sending_ the IPI kick.
> By replacing the explicit unlock with a function-scoped CLASS() variable, the
> vcpu_info_map destructor won't run until the end of the function. This means
> the lock is now held while calling kvm_vcpu_kick(), whereas the original
> code explicitly dropped it first.
>
> [ ... ]
> > @@ -1888,9 +1881,6 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
> > rc = 1; /* Delivered to the bitmap in shared_info. */
> > }
> >
> > -out_unlock:
> > - read_unlock(&gpc->lock);
> > -
> > if (rc == 1)
> > __kvm_xen_set_evtchn_fast(vcpu, port_word_bit);
> > return rc;
>
> [Severity: High]
> Does this now hold the VM-wide shinfo_cache lock while calling
> __kvm_xen_set_evtchn_fast() and kicking the vCPU?
>
> Since shinfo_map is a function-scoped CLASS() variable, its destructor
> won't release the lock until after __kvm_xen_set_evtchn_fast() returns.
> This creates a nested locking dependency and holds locks over expensive
> cross-vCPU operations, potentially serializing event channel deliveries
> across the entire VM on the fast path.
__kvm_vcpu_kick() is neither expensive nor cross-vCPU. In the wait=false case,
which is the behavior of kvm_vcpu_kick(), it sends IPIs via smp_send_reschedule(),
i.e. it's more or less just __apic_send_IPI(cpu, RESCHEDULE_VECTOR), which is a
single WRMSR on modern harware.
> Would it make sense to constrain the mapping to an explicit scope block
> to match the original release points and maintain the "No functional change
> intended" guarantee?
I'll just drop that claim.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 16:50 ` [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() Sean Christopherson
@ 2026-05-29 19:32 ` Peter Zijlstra
2026-05-29 19:34 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-05-29 19:32 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026 at 09:50:55AM -0700, Sean Christopherson wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> __rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
> which unconditionally disables and re-enables interrupts. When
> read_unlock() is called from hardirq context (e.g. after a successful
> read_trylock() in a timer callback), the raw_spin_unlock_irq()
> incorrectly re-enables interrupts within the hardirq handler.
>
> This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
> context) and can lead to IRQ state corruption.
>
> Using read_trylock() in hardirq context on PREEMPT_RT is safe because
> it does not record the lock owner. The read_unlock() acquires the
> wait_lock which is hardirq safe. This change additionally allows
> rwlock_t during early boot.
>
> Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
> preserve the caller's IRQ state.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
We have very specifically not supported the: trylock+unlock from hardirq
(although typically this comes up for mutex). Specifically with PI this
can lead to trying to boost the idle thread.
Consider doing this from an interrupt that hits idle, then idle becomes
the 'owner' of a successful acquisition. This is absolutely broken.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 19:32 ` Peter Zijlstra
@ 2026-05-29 19:34 ` Peter Zijlstra
2026-05-29 20:05 ` Sean Christopherson
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-05-29 19:34 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026 at 09:32:14PM +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2026 at 09:50:55AM -0700, Sean Christopherson wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > __rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
> > which unconditionally disables and re-enables interrupts. When
> > read_unlock() is called from hardirq context (e.g. after a successful
> > read_trylock() in a timer callback), the raw_spin_unlock_irq()
> > incorrectly re-enables interrupts within the hardirq handler.
> >
> > This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
> > context) and can lead to IRQ state corruption.
> >
> > Using read_trylock() in hardirq context on PREEMPT_RT is safe because
> > it does not record the lock owner. The read_unlock() acquires the
> > wait_lock which is hardirq safe. This change additionally allows
> > rwlock_t during early boot.
Forgot to reply to this; it is safe with this implementation. If we were
to ever do reader owner tracking this goes sideways real fast.
I really think this is a very bad approach.
> > Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
> > preserve the caller's IRQ state.
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> We have very specifically not supported the: trylock+unlock from hardirq
> (although typically this comes up for mutex). Specifically with PI this
> can lead to trying to boost the idle thread.
>
> Consider doing this from an interrupt that hits idle, then idle becomes
> the 'owner' of a successful acquisition. This is absolutely broken.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 19:34 ` Peter Zijlstra
@ 2026-05-29 20:05 ` Sean Christopherson
2026-05-29 20:13 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-29 20:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026, Peter Zijlstra wrote:
> On Fri, May 29, 2026 at 09:32:14PM +0200, Peter Zijlstra wrote:
> > On Fri, May 29, 2026 at 09:50:55AM -0700, Sean Christopherson wrote:
> > > From: David Woodhouse <dwmw@amazon.co.uk>
> > >
> > > __rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
> > > which unconditionally disables and re-enables interrupts. When
> > > read_unlock() is called from hardirq context (e.g. after a successful
> > > read_trylock() in a timer callback), the raw_spin_unlock_irq()
> > > incorrectly re-enables interrupts within the hardirq handler.
> > >
> > > This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
> > > context) and can lead to IRQ state corruption.
> > >
> > > Using read_trylock() in hardirq context on PREEMPT_RT is safe because
> > > it does not record the lock owner. The read_unlock() acquires the
> > > wait_lock which is hardirq safe. This change additionally allows
> > > rwlock_t during early boot.
>
> Forgot to reply to this; it is safe with this implementation. If we were
> to ever do reader owner tracking this goes sideways real fast.
>
> I really think this is a very bad approach.
>
> > > Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
> > > preserve the caller's IRQ state.
> > >
> > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> >
> > We have very specifically not supported the: trylock+unlock from hardirq
> > (although typically this comes up for mutex). Specifically with PI this
> > can lead to trying to boost the idle thread.
> >
> > Consider doing this from an interrupt that hits idle, then idle becomes
> > the 'owner' of a successful acquisition. This is absolutely broken.
I assume the only alternative is to implement raw versions of rwlock? Or do I
understand all of this even less than I thought? :-)
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 20:05 ` Sean Christopherson
@ 2026-05-29 20:13 ` Peter Zijlstra
2026-05-29 20:38 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-05-29 20:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026 at 01:05:16PM -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, Peter Zijlstra wrote:
> > On Fri, May 29, 2026 at 09:32:14PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 29, 2026 at 09:50:55AM -0700, Sean Christopherson wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > __rwbase_read_unlock() uses raw_spin_lock_irq()/raw_spin_unlock_irq()
> > > > which unconditionally disables and re-enables interrupts. When
> > > > read_unlock() is called from hardirq context (e.g. after a successful
> > > > read_trylock() in a timer callback), the raw_spin_unlock_irq()
> > > > incorrectly re-enables interrupts within the hardirq handler.
> > > >
> > > > This causes lockdep warnings ('hardirqs_on_prepare' from hardirq
> > > > context) and can lead to IRQ state corruption.
> > > >
> > > > Using read_trylock() in hardirq context on PREEMPT_RT is safe because
> > > > it does not record the lock owner. The read_unlock() acquires the
> > > > wait_lock which is hardirq safe. This change additionally allows
> > > > rwlock_t during early boot.
> >
> > Forgot to reply to this; it is safe with this implementation. If we were
> > to ever do reader owner tracking this goes sideways real fast.
> >
> > I really think this is a very bad approach.
> >
> > > > Switch to raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() to
> > > > preserve the caller's IRQ state.
> > > >
> > > > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > > > Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > >
> > > We have very specifically not supported the: trylock+unlock from hardirq
> > > (although typically this comes up for mutex). Specifically with PI this
> > > can lead to trying to boost the idle thread.
> > >
> > > Consider doing this from an interrupt that hits idle, then idle becomes
> > > the 'owner' of a successful acquisition. This is absolutely broken.
>
> I assume the only alternative is to implement raw versions of rwlock? Or do I
> understand all of this even less than I thought? :-)
Well, the code needs to do something when trylock fails, can't you
pretend it fails unconditionally?
It is somewhat possible to do an RT aware read-write spinlock thing, but
it is definitely non-trivial and this would be the only user.
Also, why does it have to be a rwlock? Can't this be RCU-ified in some
way?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 20:13 ` Peter Zijlstra
@ 2026-05-29 20:38 ` Peter Zijlstra
2026-05-30 0:54 ` Sean Christopherson
2026-06-04 23:58 ` David Woodhouse
0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-05-29 20:38 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026 at 10:13:35PM +0200, Peter Zijlstra wrote:
> It is somewhat possible to do an RT aware read-write spinlock thing, but
> it is definitely non-trivial and this would be the only user.
Furthermore, it is fundamentally one of the worst possible lock types.
It really isn't something you *want* to have -- arguably even for !RT.
They scale like ass; per them being a spinlock type, the critical
sections must be short, but this means there is nothing to amortize the
cost of bouncing the shared lock around -- which is the 'saving' grace
of the rwsem.
For short sections the cost of the shared access will dominate. I'm sure
Paul has a bunch of graphs to illustrate this point :-)
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths
2026-05-29 16:50 ` [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths Sean Christopherson
2026-05-29 17:20 ` sashiko-bot
@ 2026-05-29 23:28 ` Hillf Danton
1 sibling, 0 replies; 55+ messages in thread
From: Hillf Danton @ 2026-05-29 23:28 UTC (permalink / raw)
To: Sean Christopherson
Cc: Waiman Long, Boqun Feng, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On Fri, 29 May 2026 09:50:56 -0700 Sean Christopherson wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> kvm_xen_set_evtchn_fast() is called from hardirq context (timer
> callback, kvm_arch_set_irq_inatomic()). On PREEMPT_RT, rwlock_t is a
> sleeping lock, so read_lock_irqsave() cannot be used in this context.
>
> Switch to read_trylock() and return -EWOULDBLOCK on contention, which is
> the designed fallback — there is always a slow path for the case where
> the GPC is invalid and needs to be refreshed.
>
> Reported-by: syzbot+208f7f3e5f59c11aeb90@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=208f7f3e5f59c11aeb90
> Fixes: 14243b387137 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/xen.c | 32 +++++++++++++++++++++++---------
> 1 file changed, 23 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 91fd3673c09a..9bdb8e3cad58 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -697,6 +697,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
> int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
> {
> struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
> + bool atomic = in_atomic() || !task_is_running(current);
> unsigned long flags;
> u8 rc = 0;
>
> @@ -713,7 +714,15 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
> BUILD_BUG_ON(sizeof(rc) !=
> sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
>
> - read_lock_irqsave(&gpc->lock, flags);
> + if (atomic) {
> + local_irq_save(flags);
> + if (!read_trylock(&gpc->lock)) {
> + local_irq_restore(flags);
> + return 1;
> + }
> + } else {
> + read_lock_irqsave(&gpc->lock, flags);
> + }
> while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
> read_unlock_irqrestore(&gpc->lock, flags);
>
I suspect this works given
static __always_inline void read_unlock_irqrestore(rwlock_t *rwlock,
unsigned long flags)
__releases_shared(rwlock)
{
rt_read_unlock(rwlock);
}
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 20:38 ` Peter Zijlstra
@ 2026-05-30 0:54 ` Sean Christopherson
2026-05-30 10:26 ` Paolo Bonzini
2026-06-04 23:58 ` David Woodhouse
1 sibling, 1 reply; 55+ messages in thread
From: Sean Christopherson @ 2026-05-30 0:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paolo Bonzini, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Fri, May 29, 2026, Peter Zijlstra wrote:
> On Fri, May 29, 2026 at 10:13:35PM +0200, Peter Zijlstra wrote:
>
> > It is somewhat possible to do an RT aware read-write spinlock thing, but
> > it is definitely non-trivial and this would be the only user.
>
> Furthermore, it is fundamentally one of the worst possible lock types.
>
> It really isn't something you *want* to have -- arguably even for !RT.
>
> They scale like ass; per them being a spinlock type, the critical
> sections must be short, but this means there is nothing to amortize the
> cost of bouncing the shared lock around -- which is the 'saving' grace
> of the rwsem.
>
> For short sections the cost of the shared access will dominate. I'm sure
> Paul has a bunch of graphs to illustrate this point :-)
Hrm. We might be able to do an SRCU-based implementation. Full RCU would be
too unpredictable on the update side.
The big hiccup is when KVM needs to reclaim memory in response to mmu_notifier
events from the OOM killer, which don't allow sleeping. I'll stare at this a
bit next week. I'm not exactly thrilled about the idea of ripping apart this
code, but there's a known performance bottleneck with mmu_notifier events,
maybe we can kill two birds with one stone.
Thanks for the input!
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time
2026-05-29 16:51 ` [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time Sean Christopherson
@ 2026-05-30 6:19 ` sashiko-bot
2026-06-02 12:29 ` David Woodhouse
0 siblings, 1 reply; 55+ messages in thread
From: sashiko-bot @ 2026-05-30 6:19 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Calling sleepable operations while holding an rwlock (gpc->lock) in record_steal_time() causes 'scheduling while atomic' bugs.
- [High] Calling read_unlock() on an RT rwlock from sched_out() triggers wake_up_process() while rq->lock is held, causing an ABBA deadlock.
--
commit cb799f9970f13f02eaea4c9d6c87fa7537434bcf
Author: Carsten Stollmaier <stollmc@amazon.com>
KVM: x86: Use gfn_to_pfn_cache for record_steal_time
This commit reverts to using the gfn_to_pfn_cache for guest steal time and
preempted status updates to avoid mmap_lock contention with userfaultfd. It
replaces unsafe_put_user accesses with plain C reads and writes. It also
uses a read_trylock in the scheduler path to avoid sleeping under PREEMPT_RT.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3746,42 +3744,20 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
[ ... ]
> + CLASS(gpc_map_local, st_map)(gpc, sizeof(*st));
> + if (IS_ERR(st_map))
> + return;
>
> - st = (struct kvm_steal_time __user *)ghc->hva;
> /*
> * Doing a TLB flush here, on the guest's behalf, can avoid
> * expensive IPIs.
> */
> if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
[ ... ]
> + u8 st_preempted = xchg(&st->preempted, 0);
>
> vcpu->arch.st.preempted = 0;
>
> @@ -3789,39 +3765,30 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> st_preempted & KVM_VCPU_FLUSH_TLB);
> if (st_preempted & KVM_VCPU_FLUSH_TLB)
> kvm_vcpu_flush_tlb_guest(vcpu);
[Severity: High]
Does this introduce a scheduling while atomic bug on non-PREEMPT_RT kernels?
The CLASS(gpc_map_local, st_map) macro acquires a read_lock on gpc->lock,
which disables preemption. While this lock is held, if the guest supports
PV TLB flush, the code calls kvm_vcpu_flush_tlb_guest().
If TDP is disabled (shadow paging), kvm_vcpu_flush_tlb_guest() calls
kvm_mmu_sync_roots() and eventually mmu_sync_children(). This path can yield
via cond_resched_rwlock_write(). Yielding while preemption is disabled by
the gpc read lock will trigger a BUG.
The old code safely dropped the user access guard before executing the
TLB flush, avoiding this issue.
[ ... ]
> @@ -5260,20 +5227,32 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
[ ... ]
> - st = (struct kvm_steal_time __user *)ghc->hva;
> - BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
> -
> - if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
> - vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
> + /*
> + * Use a trylock as this is called from the scheduler path (via
> + * kvm_sched_out), where rwlock_t is not safe on PREEMPT_RT (it
> + * becomes sleepable). Setting preempted is best-effort anyway;
> + * the old HVA-based code used copy_to_user_nofault() which could
> + * also silently fail.
> + *
> + * Since we only trylock and bail on failure, there is no risk of
> + * deadlock with an interrupt handler, so no need to disable
> + * interrupts.
> + */
> + CLASS(gpc_try_map_local, st_map)(gpc, sizeof(st->preempted));
> + if (IS_ERR(st_map))
> return;
>
> - mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
> + st = *st_map;
> + WRITE_ONCE(st->preempted, KVM_VCPU_PREEMPTED);
> + vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
> +}
[Severity: High]
Can this cause an ABBA deadlock in the scheduler on PREEMPT_RT?
kvm_steal_time_set_preempted() is called during a vCPU context switch
via kvm_sched_out(), which is a preempt notifier hook. This runs with
the local CPUs runqueue lock (rq->lock) held.
While the code correctly uses a trylock to avoid sleeping when acquiring
the lock, the CLASS macro destructor will implicitly call read_unlock()
at the end of the function. On PREEMPT_RT kernels, unlocking an rt_rwlock
can wake up a waiting writer, such as a concurrent kvm_gpc_refresh().
If a writer is woken, rt_mutex_wake_up_q() calls wake_up_process(), which
attempts to acquire the target tasks rq->lock via try_to_wake_up().
Attempting to acquire a runqueue lock while the local CPUs rq->lock is
already held violates scheduler lock ordering and leads to a deadlock.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260529165114.748639-1-seanjc@google.com?part=20
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 0:54 ` Sean Christopherson
@ 2026-05-30 10:26 ` Paolo Bonzini
2026-05-30 12:47 ` David Woodhouse
` (2 more replies)
0 siblings, 3 replies; 55+ messages in thread
From: Paolo Bonzini @ 2026-05-30 10:26 UTC (permalink / raw)
To: Sean Christopherson, Peter Zijlstra
Cc: David Woodhouse, Paul Durrant, Ingo Molnar, Will Deacon,
Boqun Feng, Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 4047 bytes --]
On 5/30/26 02:54, Sean Christopherson wrote:
> On Fri, May 29, 2026, Peter Zijlstra wrote:
>> On Fri, May 29, 2026 at 10:13:35PM +0200, Peter Zijlstra wrote:
>>
>>> It is somewhat possible to do an RT aware read-write spinlock thing, but
>>> it is definitely non-trivial and this would be the only user.
>>
>> Furthermore, it is fundamentally one of the worst possible lock types.
>>
>> It really isn't something you *want* to have -- arguably even for !RT.
>>
>> They scale like ass; per them being a spinlock type, the critical
>> sections must be short, but this means there is nothing to amortize the
>> cost of bouncing the shared lock around -- which is the 'saving' grace
>> of the rwsem.
>>
>> For short sections the cost of the shared access will dominate. I'm sure
>> Paul has a bunch of graphs to illustrate this point :-)
>
> Hrm. We might be able to do an SRCU-based implementation. Full RCU would be
> too unpredictable on the update side.
Yeah, I think so.
The write side needs kvm->srcu so it would have to be yet another SRCU.
I initially thought that sucks for the code that calls kvm_gpc_check(),
but maybe not because it simply replaces read_lock/read_unlock.
By using a seqcount for the data, SRCU only needs to be synchronized in
gpc_unmap(). So, something like this:
struct kvm_cached_gva_map {
struct kvm_memory_slot *memslot;
void *khva;
unsigned long uhva;
unsigned long idx;
};
/* snapshot is the new check */
bool kvm_gpc_snapshot(struct gfn_to_pfn_cache *gpc, unsigned long len,
struct kvm_cached_gva_map *ret)
{
struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
unsigned seq;
u64 generation;
gpa_t gpa;
/* The data is protected by the seqcount, the khva by SRCU. */
ret->idx = srcu_read_lock(&gpc->kvm->gpc_srcu);
/* A seqcount_spinlock_t, but see below - there's a bug! */
if (!raw_seqcount_try_begin(&gpc->sc, seq))
goto out_fail;
if (!gpc->active || !gpc->valid)
goto out_fail;
ret->uhva = gpc->uhva;
ret->khva = gpc->khva;
ret->memslot = gpc->memslot;
gpa = gpc->gpa;
generation = gpc->generation;
if (read_seqcount_retry(&gpc->sc, seq))
goto out_fail;
/*
* Now do all the validity checks on a stable copy.
* If the page was cached from a memslot, make sure the
* memslots have not been re-configured.
*/
if (!kvm_is_error_gpa(gpa) && generation != slots->generation)
goto out_fail;
if (kvm_is_error_hva(ret->uhva))
goto out_fail;
if (!kvm_gpc_is_valid_len(gpa, ret->uhva, len))
goto out_fail;
return true;
out_fail:
srcu_read_unlock(&gpc->kvm->gpc_srcu, ret->idx);
return false;
}
static inline
void kvm_gpc_snapshot_release(struct gfn_to_pfn_cache *gpc,
struct kvm_cached_gva_map *ret)
{
srcu_read_unlock(&gpc->kvm->gpc_srcu, ret->idx);
}
Introducing snapshot as a replacement for check (while still leaving it
to the caller to take the rwlock) can even be a separate patch.
There is a bug in raw_seqcount_try_begin() though. It must not do the
lock/unlock dance or it won't be usable in atomic context. See attached
patch.
> The big hiccup is when KVM needs to reclaim memory in response to mmu_notifier
> events from the OOM killer, which don't allow sleeping. I'll stare at this a
> bit next week. I'm not exactly thrilled about the idea of ripping apart this
> code, but there's a known performance bottleneck with mmu_notifier events,
> maybe we can kill two birds with one stone.
There was a pending patch to allow the MMU notifiers to disable
themselves while still in sleepable context.
https://lore.kernel.org/kvm/20260429222548.25475-1-shaikhkamal2012@gmail.com/
It really needs to be split in multiple parts, and the allocation needs
to be removed. Looking at it.
Otherwise, calling synchronize_srcu() from gpc_unmap(), which is within
the kvm->srcu read side, is also a bit unpredictable. But the read
sides are super short so you can make it expedited.
Or can you just skip the synchronize_srcu() from OOM via
check_stable_address_space()? Maybe pretend I didn't suggest that.
Paolo
[-- Attachment #2: 0001-seqcount-allow-raw_seqcount_try_begin-in-atomic-cont.patch --]
[-- Type: text/x-patch, Size: 1672 bytes --]
From f2c604a94cdea9bb96bb21c8a29288eae6cbac67 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Sat, 30 May 2026 10:54:36 +0200
Subject: [PATCH] seqcount: allow raw_seqcount_try_begin in atomic context
When raw_seqcount_try_begin() is used, the typical action when
the seqcount is busy is to take the same lock that the writer uses.
Another possibility, found in kernel/events/uprobes.c, is to delay
the work to a safe point using RCU. Either way there is no need
to guarantee progress of the writer under CONFIG_PREEMPT_RT,
because the caller is not going to call raw_seqcount_try_begin() in a loop.
However, raw_seqcount_try_begin() currently uses seqprop_sequence()
via raw_read_seqcount(), and therefore does a lock/unlock of the
write-side lock on PREEMPT_RT kernel. This prevents it from being
used in atomic context. Use __seqprop_sequence() instead of
raw_read_seqcount() to allow it.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
include/linux/seqlock.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 5a40252b8334..9adbf5d9667c 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -338,7 +338,10 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
*/
#define raw_seqcount_try_begin(s, start) \
({ \
- start = raw_read_seqcount(s); \
+ start = __seqprop_sequence(s); \
+ \
+ if (!(start & 1)) \
+ kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \
!(start & 1); \
})
--
2.54.0
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 10:26 ` Paolo Bonzini
@ 2026-05-30 12:47 ` David Woodhouse
2026-05-30 14:40 ` Paolo Bonzini
2026-06-01 9:40 ` Peter Zijlstra
2026-05-30 13:02 ` Paolo Bonzini
2026-06-01 8:40 ` Peter Zijlstra
2 siblings, 2 replies; 55+ messages in thread
From: David Woodhouse @ 2026-05-30 12:47 UTC (permalink / raw)
To: Paolo Bonzini, Sean Christopherson, Peter Zijlstra
Cc: Paul Durrant, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
kvm, linux-kernel, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 619 bytes --]
On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
>
> Yeah, I think so.
>
> The write side needs kvm->srcu so it would have to be yet another SRCU.
> I initially thought that sucks for the code that calls kvm_gpc_check(),
> but maybe not because it simply replaces read_lock/read_unlock.
>
> By using a seqcount for the data, SRCU only needs to be synchronized in
> gpc_unmap(). So, something like this:
It isn't just gpc_unmap() which does the invalidation. We also
invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
which would also have to synchronize, wouldn't it?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 10:26 ` Paolo Bonzini
2026-05-30 12:47 ` David Woodhouse
@ 2026-05-30 13:02 ` Paolo Bonzini
2026-06-01 8:40 ` Peter Zijlstra
2 siblings, 0 replies; 55+ messages in thread
From: Paolo Bonzini @ 2026-05-30 13:02 UTC (permalink / raw)
To: Sean Christopherson, Peter Zijlstra
Cc: David Woodhouse, Paul Durrant, Ingo Molnar, Will Deacon,
Boqun Feng, Waiman Long, kvm, linux-kernel, David Woodhouse,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On Sat, May 30, 2026 at 12:26 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> There was a pending patch to allow the MMU notifiers to disable
> themselves while still in sleepable context.
>
> https://lore.kernel.org/kvm/20260429222548.25475-1-shaikhkamal2012@gmail.com/
>
> It really needs to be split in multiple parts, and the allocation needs
> to be removed. Looking at it.
I did clean it up but nope, it's broken... You can't really stop the
MMU notifiers while the OOM reaper runs because the victim's own
threads are still running and can do whatever they want. If one is in
kvm_swap_active_memslots(), the callback would reset
mn_active_invalidate_count but leave the thread sleeping forever.
Unfortunately this doesn't give me much confidence about how to adopt
SRCU for the gfn-to-pfn caches, unless you come up with a way to avoid
vmentry after OOM?
Paolo
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 12:47 ` David Woodhouse
@ 2026-05-30 14:40 ` Paolo Bonzini
2026-06-01 10:52 ` David Woodhouse
2026-06-01 9:40 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2026-05-30 14:40 UTC (permalink / raw)
To: David Woodhouse
Cc: Sean Christopherson, Peter Zijlstra, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> >
> > Yeah, I think so.
> >
> > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > but maybe not because it simply replaces read_lock/read_unlock.
> >
> > By using a seqcount for the data, SRCU only needs to be synchronized in
> > gpc_unmap(). So, something like this:
>
> It isn't just gpc_unmap() which does the invalidation. We also
> invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> which would also have to synchronize, wouldn't it?
You're right, the write_lock_irq() there drains the readers and that
is needed because khva is not pinned, only kmap()-ed.
That is already broken for the OOM case under PREEMPT_RT, where
rwlock_t becomes sleepable. But using SRCU would break it on
!PREEMPT_RT as well.
Paolo
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 10:26 ` Paolo Bonzini
2026-05-30 12:47 ` David Woodhouse
2026-05-30 13:02 ` Paolo Bonzini
@ 2026-06-01 8:40 ` Peter Zijlstra
2026-06-01 11:11 ` Sebastian Andrzej Siewior
2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-06-01 8:40 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, David Woodhouse, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
David Woodhouse, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On Sat, May 30, 2026 at 12:26:20PM +0200, Paolo Bonzini wrote:
> From f2c604a94cdea9bb96bb21c8a29288eae6cbac67 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini <pbonzini@redhat.com>
> Date: Sat, 30 May 2026 10:54:36 +0200
> Subject: [PATCH] seqcount: allow raw_seqcount_try_begin in atomic context
>
> When raw_seqcount_try_begin() is used, the typical action when
> the seqcount is busy is to take the same lock that the writer uses.
> Another possibility, found in kernel/events/uprobes.c, is to delay
> the work to a safe point using RCU. Either way there is no need
> to guarantee progress of the writer under CONFIG_PREEMPT_RT,
> because the caller is not going to call raw_seqcount_try_begin() in a loop.
>
> However, raw_seqcount_try_begin() currently uses seqprop_sequence()
> via raw_read_seqcount(), and therefore does a lock/unlock of the
> write-side lock on PREEMPT_RT kernel. This prevents it from being
> used in atomic context. Use __seqprop_sequence() instead of
> raw_read_seqcount() to allow it.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> include/linux/seqlock.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 5a40252b8334..9adbf5d9667c 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -338,7 +338,10 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
> */
> #define raw_seqcount_try_begin(s, start) \
> ({ \
> - start = raw_read_seqcount(s); \
> + start = __seqprop_sequence(s); \
> + \
> + if (!(start & 1)) \
> + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \
> !(start & 1); \
> })
Yeah, I think this makes sense. Thanks!
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 12:47 ` David Woodhouse
2026-05-30 14:40 ` Paolo Bonzini
@ 2026-06-01 9:40 ` Peter Zijlstra
2026-06-01 10:04 ` David Woodhouse
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-06-01 9:40 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Sean Christopherson, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On Sat, May 30, 2026 at 01:47:06PM +0100, David Woodhouse wrote:
> On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> >
> > Yeah, I think so.
> >
> > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > but maybe not because it simply replaces read_lock/read_unlock.
> >
> > By using a seqcount for the data, SRCU only needs to be synchronized in
> > gpc_unmap(). So, something like this:
>
> It isn't just gpc_unmap() which does the invalidation. We also
> invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> which would also have to synchronize, wouldn't it?
Ok, so I had a look at what this code actually does, and it appears to
be a guest frame number to page frame number cache, managed by
mmu_notifiers.
IOW, its some software TLB thing (pre HVM Xen support?)
Now, mmu_notifier_invalidate_range_start() has a rather explicit
might_sleep() in, and while there is an
mmu_notifier_invalidate_range_start_noblock(), that has an error return,
and it is clearly specified that if that thing returns non-zero, PTEs
must not be changed.
With all that, I don't see why we can't block for srcu_synchronize() in
gfn_to_pfn_cache_invalidate_start().
Now, I've never much looked at mmu_notifiers, but for native, TLBI might
require sending IPIs to all CPUs, and as such cannot happen in atomic
sections. I would expect this same to extend to mmu_notifiers. It must
be possible to sleep in them.
What am I missing?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 9:40 ` Peter Zijlstra
@ 2026-06-01 10:04 ` David Woodhouse
0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2026-06-01 10:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paolo Bonzini, Sean Christopherson, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 4050 bytes --]
On Mon, 2026-06-01 at 11:40 +0200, Peter Zijlstra wrote:
> On Sat, May 30, 2026 at 01:47:06PM +0100, David Woodhouse wrote:
> > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > >
> > > Yeah, I think so.
> > >
> > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > but maybe not because it simply replaces read_lock/read_unlock.
> > >
> > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > gpc_unmap(). So, something like this:
> >
> > It isn't just gpc_unmap() which does the invalidation. We also
> > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > which would also have to synchronize, wouldn't it?
>
> Ok, so I had a look at what this code actually does, and it appears to
> be a guest frame number to page frame number cache, managed by
> mmu_notifiers.
>
> IOW, its some software TLB thing (pre HVM Xen support?)
No, it's for any fast path access to KVM guest memory.
KVM guest physical addresses are translated through two stages.
First through the memslots which convert the GPA to a userspace virtual
address (in the VMM process).
Then through the normal process mm into a host physical address.
The GPC is indeed a software TLB thing used to cache the result of that
two-stage translation, hence being invalidated when *either* the
memslots change (easy), *or* the page is invlidated it the process mm
(more fun).
KVM used to *pin* the latter, which was sad for memory hotunplug and
other reasons, and then *still* managed to screw up the caching. (I
think it would just keep using the old page even if userspace mmap'd
something else over it).
I ripped all that out and replaced it with the gfn-to-pfn-cache we have
now.
It's used to allow a fast path where the *common* case is that the GPC
is already valid. Like interrupt delivery to Xen (HVM) guests which
needs to go into the shared-info page. There's always a slow path
fallback which will revalidate the cache (which might involve bringing
the page back in from swap, etc.). For things like steal time /
runstate information, an invalid cache might lead to it not being
updated when the process is scheduled out, but it'll be refreshed when
scheduled back in again before re-entering the guest.
The GPC also has (well, *had*, but we're looking at reintroducing it¹)
a mode where the underlying physical address is actually present in the
VMCS and used directly by the CPU, for guest mode. In that mode,
invalidation requires kicking the vCPU *out* of guest mode. That one
requires sleeping and needs handling even for !PREEMPT_RT:
¹ https://lore.kernel.org/all/20251121111113.456628-3-griffoul@gmail.com/
> Now, mmu_notifier_invalidate_range_start() has a rather explicit
> might_sleep() in, and while there is an
> mmu_notifier_invalidate_range_start_noblock(), that has an error return,
> and it is clearly specified that if that thing returns non-zero, PTEs
> must not be changed.
>
> With all that, I don't see why we can't block for srcu_synchronize() in
> gfn_to_pfn_cache_invalidate_start().
>
> Now, I've never much looked at mmu_notifiers, but for native, TLBI might
> require sending IPIs to all CPUs, and as such cannot happen in atomic
> sections. I would expect this same to extend to mmu_notifiers. It must
> be possible to sleep in them.
>
> What am I missing?
I think it's the OOM path which invokes it in a non-sleepable context?
Hence the may_block argument, and the 'all vCPUs should have been
stopped already, so perform the request without KVM_REQUEST_WAIT and be
sad' comment in the patch I linked above.
The code in virt/kvm/kvm_main.c already tracks 'may_block' for each
call path that gets here; Fred's patch is just passing it in to the GPC
invalidation.
These caches are *mostly* per-vCPU so we don't suffer too much from the
cache contention on the readlocks.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-30 14:40 ` Paolo Bonzini
@ 2026-06-01 10:52 ` David Woodhouse
2026-06-01 13:01 ` David Woodhouse
0 siblings, 1 reply; 55+ messages in thread
From: David Woodhouse @ 2026-06-01 10:52 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Peter Zijlstra, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 2290 bytes --]
On Sat, 2026-05-30 at 16:40 +0200, Paolo Bonzini wrote:
> On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@infradead.org> wrote:
> >
> > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > >
> > > Yeah, I think so.
> > >
> > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > but maybe not because it simply replaces read_lock/read_unlock.
> > >
> > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > gpc_unmap(). So, something like this:
> >
> > It isn't just gpc_unmap() which does the invalidation. We also
> > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > which would also have to synchronize, wouldn't it?
>
> You're right, the write_lock_irq() there drains the readers and that
> is needed because khva is not pinned, only kmap()-ed.
>
> That is already broken for the OOM case under PREEMPT_RT, where
> rwlock_t becomes sleepable. But using SRCU would break it on
> !PREEMPT_RT as well.
I don't think 'sleepable' is the problem per se, is it? *Why* does the
OOM killer use mmu_notifier_invalidate_range_start_nonblock()?
Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") did say:
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
But that was in 2018, when mmap_lock was an rw_semaphore.
Is "sleepable" still a problem even when PREEMPT_RT where almost
*everything* is now strictly sleepable? Wouldn't that mean drivers
aren't even allowed to take their own spinlocks?
I think the *real* constraint in the OOM path is that it mustn't block
on anything which might be waiting for memory allocation. So waiting on
an actual mutex is bad... but waiting for an rwlock which PREEMPT_RT
just happens to have made sleepable... should be OK?
And waiting for the in-guest CPU to respond to the IPI in Fred's patch
should actually be OK too, but then so would returning -EAGAIN if any
vCPUs really did need kicking.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 8:40 ` Peter Zijlstra
@ 2026-06-01 11:11 ` Sebastian Andrzej Siewior
2026-06-01 11:40 ` Peter Zijlstra
2026-06-01 19:13 ` Paolo Bonzini
0 siblings, 2 replies; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-01 11:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paolo Bonzini, Sean Christopherson, David Woodhouse, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, David Woodhouse, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On 2026-06-01 10:40:45 [+0200], Peter Zijlstra wrote:
> > When raw_seqcount_try_begin() is used, the typical action when
> > the seqcount is busy is to take the same lock that the writer uses.
> > Another possibility, found in kernel/events/uprobes.c, is to delay
> > the work to a safe point using RCU. Either way there is no need
> > to guarantee progress of the writer under CONFIG_PREEMPT_RT,
> > because the caller is not going to call raw_seqcount_try_begin() in a loop.
This does not happen. It only happens if the associated lock is
preemptible. This is not the case for uprobes.c so there is no lock/
unlock.
> > However, raw_seqcount_try_begin() currently uses seqprop_sequence()
> > via raw_read_seqcount(), and therefore does a lock/unlock of the
> > write-side lock on PREEMPT_RT kernel. This prevents it from being
> > used in atomic context. Use __seqprop_sequence() instead of
> > raw_read_seqcount() to allow it.
So it would work if there is no lock associated or it is something like
raw_spinlock_t.
> > --- a/include/linux/seqlock.h
> > +++ b/include/linux/seqlock.h
> > @@ -338,7 +338,10 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
> > */
> > #define raw_seqcount_try_begin(s, start) \
> > ({ \
> > - start = raw_read_seqcount(s); \
> > + start = __seqprop_sequence(s); \
> > + \
> > + if (!(start & 1)) \
> > + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \
> > !(start & 1); \
> > })
>
> Yeah, I think this makes sense. Thanks!
Do I make sense or do I miss something obvious? On the other it would
make sense to have this even for spinlock_t assuming the try_begin
version does not try spin and expect to make progress (which it does not
for the current users).
Sebastian
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 11:11 ` Sebastian Andrzej Siewior
@ 2026-06-01 11:40 ` Peter Zijlstra
2026-06-01 19:13 ` Paolo Bonzini
1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-06-01 11:40 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Paolo Bonzini, Sean Christopherson, David Woodhouse, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, David Woodhouse, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
On Mon, Jun 01, 2026 at 01:11:09PM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-06-01 10:40:45 [+0200], Peter Zijlstra wrote:
> > > When raw_seqcount_try_begin() is used, the typical action when
> > > the seqcount is busy is to take the same lock that the writer uses.
> > > Another possibility, found in kernel/events/uprobes.c, is to delay
> > > the work to a safe point using RCU. Either way there is no need
> > > to guarantee progress of the writer under CONFIG_PREEMPT_RT,
> > > because the caller is not going to call raw_seqcount_try_begin() in a loop.
>
> This does not happen. It only happens if the associated lock is
> preemptible. This is not the case for uprobes.c so there is no lock/
> unlock.
>
> > > However, raw_seqcount_try_begin() currently uses seqprop_sequence()
> > > via raw_read_seqcount(), and therefore does a lock/unlock of the
> > > write-side lock on PREEMPT_RT kernel. This prevents it from being
> > > used in atomic context. Use __seqprop_sequence() instead of
> > > raw_read_seqcount() to allow it.
>
> So it would work if there is no lock associated or it is something like
> raw_spinlock_t.
>
> > > --- a/include/linux/seqlock.h
> > > +++ b/include/linux/seqlock.h
> > > @@ -338,7 +338,10 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
> > > */
> > > #define raw_seqcount_try_begin(s, start) \
> > > ({ \
> > > - start = raw_read_seqcount(s); \
> > > + start = __seqprop_sequence(s); \
> > > + \
> > > + if (!(start & 1)) \
> > > + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); \
> > > !(start & 1); \
> > > })
> >
> > Yeah, I think this makes sense. Thanks!
>
> Do I make sense or do I miss something obvious? On the other it would
> make sense to have this even for spinlock_t assuming the try_begin
> version does not try spin and expect to make progress (which it does not
> for the current users).
Ah, so I was thinking that since it is try_begin(), it is already set up
for failure and we could elide that initial lock+unlock thing. But yes,
I suppose the user could simply use &s->seqcount to get the seqcount_t
variant.
Let me forget I ever seen this patch ;-)
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 10:52 ` David Woodhouse
@ 2026-06-01 13:01 ` David Woodhouse
2026-06-01 13:40 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 55+ messages in thread
From: David Woodhouse @ 2026-06-01 13:01 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Sean Christopherson, Peter Zijlstra, Paul Durrant, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, kvm, linux-kernel,
Sebastian Andrzej Siewior, syzbot+208f7f3e5f59c11aeb90,
Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 5674 bytes --]
On Mon, 2026-06-01 at 11:52 +0100, David Woodhouse wrote:
> On Sat, 2026-05-30 at 16:40 +0200, Paolo Bonzini wrote:
> > On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > >
> > > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > > >
> > > > Yeah, I think so.
> > > >
> > > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > > but maybe not because it simply replaces read_lock/read_unlock.
> > > >
> > > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > > gpc_unmap(). So, something like this:
> > >
> > > It isn't just gpc_unmap() which does the invalidation. We also
> > > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > > which would also have to synchronize, wouldn't it?
> >
> > You're right, the write_lock_irq() there drains the readers and that
> > is needed because khva is not pinned, only kmap()-ed.
> >
> > That is already broken for the OOM case under PREEMPT_RT, where
> > rwlock_t becomes sleepable. But using SRCU would break it on
> > !PREEMPT_RT as well.
>
> I don't think 'sleepable' is the problem per se, is it? *Why* does the
> OOM killer use mmu_notifier_invalidate_range_start_nonblock()?
>
> Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> notifiers") did say:
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
>
> But that was in 2018, when mmap_lock was an rw_semaphore.
>
> Is "sleepable" still a problem even when PREEMPT_RT where almost
> *everything* is now strictly sleepable? Wouldn't that mean drivers
> aren't even allowed to take their own spinlocks?
Yeah, this is *already* hosed by PREEMPT_RT's "haha let's make things
sleepable that nobody ever expected to be" approach.
It's hard to trigger as not only do you have to get the KVM process to
OOM, it also has to be *slow* to die. I ended up doing this:
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1278,6 +1278,10 @@ void exit_mmap(struct mm_struct *mm)
VMA_ITERATOR(vmi, mm, 0);
struct unmap_desc unmap;
+ if (!strcmp(current->comm, "kvm_oom_test")) {
+ pr_info("exit_mmap: delaying before mmu_notifier_release for kvm_oom_test\n");
+ schedule_timeout_uninterruptible(3*HZ);
+ }
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
And then we see it even when taking kvm->mn_invalidate_lock:
kvm_mmu_notifier_invalidate_range_start+0xac
0xffffffff8132732c is in kvm_mmu_notifier_invalidate_range_start (arch/x86/kvm/../../../virt/kvm/kvm_main.c:745).
740 * adjustments will be imbalanced.
741 *
742 * Pairs with the decrement in range_end().
743 */
744 spin_lock(&kvm->mn_invalidate_lock);
745 kvm->mn_active_invalidate_count++;
746 if (!mmu_notifier_range_blockable(range))
747 pr_info("KVM: non-blockable invalidate_range_start, non_block_count=%d\n", current->non_block_count);
748 spin_unlock(&kvm->mn_invalidate_lock);
749
[ 427.919969] mmap: exit_mmap: delaying before mmu_notifier_release for kvm_oom_test
[ 429.926972] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 429.926978] in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 280, name: oom_reaper
[ 429.926982] preempt_count: 0, expected: 0
[ 429.926984] RCU nest depth: 0, expected: 0
[ 429.926986] 4 locks held by oom_reaper/280:
[ 429.926989] #0: ffff8a61da779cb0 (&mm->mmap_lock){....}-{3:3}, at: oom_reaper+0x150/0x520
[ 429.927006] #1: ffffffffa0934f20 (mmu_notifier_invalidate_range_start){....}-{0:0}, at: zap_vma_for_reaping+0xb7/0x1d0
[ 429.927019] #2: ffffffffa0934f78 (srcu){....}-{0:0}, at: __mmu_notifier_invalidate_range_start+0xae/0x340
[ 429.927029] #3: ffff8a6240295360 (&kvm->mn_invalidate_lock){....}-{2:2}, at: kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927044] CPU: 26 UID: 0 PID: 280 Comm: oom_reaper Tainted: G S I 7.1.0-rc2+ #2460 PREEMPT_{RT,(lazy)}
[ 429.927051] Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
[ 429.927053] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[ 429.927055] Call Trace:
[ 429.927058] <TASK>
[ 429.927062] dump_stack_lvl+0x6e/0xa0
[ 429.927074] __might_resched.cold+0xeb/0x100
[ 429.927084] rt_spin_lock+0x6c/0x1a0
[ 429.927092] ? kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927102] kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
[ 429.927110] ? sched_update_numa+0xa0/0x270
[ 429.927129] __mmu_notifier_invalidate_range_start+0x129/0x340
[ 429.927138] ? __pfx_oom_reaper+0x10/0x10
[ 429.927144] zap_vma_for_reaping+0x186/0x1d0
[ 429.927150] ? zap_vma_for_reaping+0xb7/0x1d0
[ 429.927155] ? zap_vma_for_reaping+0xb7/0x1d0
[ 429.927176] __oom_reap_task_mm+0xbf/0x100
[ 429.927191] oom_reaper+0xeb/0x520
[ 429.927199] ? __pfx_autoremove_wake_function+0x10/0x10
[ 429.927212] kthread+0xf5/0x130
[ 429.927217] ? __pfx_kthread+0x10/0x10
[ 429.927224] ret_from_fork+0x286/0x310
[ 429.927232] ? __pfx_kthread+0x10/0x10
[ 429.927236] ret_from_fork_asm+0x1a/0x30
[ 429.927257] </TASK>
[ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 13:01 ` David Woodhouse
@ 2026-06-01 13:40 ` Sebastian Andrzej Siewior
2026-06-01 13:53 ` David Woodhouse
0 siblings, 1 reply; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-01 13:40 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Sean Christopherson, Peter Zijlstra, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On 2026-06-01 14:01:07 [+0100], David Woodhouse wrote:
> On Mon, 2026-06-01 at 11:52 +0100, David Woodhouse wrote:
> > On Sat, 2026-05-30 at 16:40 +0200, Paolo Bonzini wrote:
> > > On Sat, May 30, 2026 at 3:04 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > > >
> > > > On Sat, 2026-05-30 at 12:26 +0200, Paolo Bonzini wrote:
> > > > >
> > > > > Yeah, I think so.
> > > > >
> > > > > The write side needs kvm->srcu so it would have to be yet another SRCU.
> > > > > I initially thought that sucks for the code that calls kvm_gpc_check(),
> > > > > but maybe not because it simply replaces read_lock/read_unlock.
> > > > >
> > > > > By using a seqcount for the data, SRCU only needs to be synchronized in
> > > > > gpc_unmap(). So, something like this:
> > > >
> > > > It isn't just gpc_unmap() which does the invalidation. We also
> > > > invalidate from the MMU notifier in gfn_to_pfn_cache_invalidate_start()
> > > > which would also have to synchronize, wouldn't it?
> > >
> > > You're right, the write_lock_irq() there drains the readers and that
> > > is needed because khva is not pinned, only kmap()-ed.
> > >
> > > That is already broken for the OOM case under PREEMPT_RT, where
> > > rwlock_t becomes sleepable. But using SRCU would break it on
> > > !PREEMPT_RT as well.
> >
> > I don't think 'sleepable' is the problem per se, is it? *Why* does the
> > OOM killer use mmu_notifier_invalidate_range_start_nonblock()?
> >
> > Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> > notifiers") did say:
> >
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> >
> > But that was in 2018, when mmap_lock was an rw_semaphore.
> >
> > Is "sleepable" still a problem even when PREEMPT_RT where almost
> > *everything* is now strictly sleepable? Wouldn't that mean drivers
> > aren't even allowed to take their own spinlocks?
…
> And then we see it even when taking kvm->mn_invalidate_lock:
>
> kvm_mmu_notifier_invalidate_range_start+0xac
> 0xffffffff8132732c is in kvm_mmu_notifier_invalidate_range_start (arch/x86/kvm/../../../virt/kvm/kvm_main.c:745).
> 740 * adjustments will be imbalanced.
> 741 *
> 742 * Pairs with the decrement in range_end().
> 743 */
> 744 spin_lock(&kvm->mn_invalidate_lock);
> 745 kvm->mn_active_invalidate_count++;
> 746 if (!mmu_notifier_range_blockable(range))
> 747 pr_info("KVM: non-blockable invalidate_range_start, non_block_count=%d\n", current->non_block_count);
> 748 spin_unlock(&kvm->mn_invalidate_lock);
> 749
>
>
> [ 427.919969] mmap: exit_mmap: delaying before mmu_notifier_release for kvm_oom_test
> [ 429.926972] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
> [ 429.926978] in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 280, name: oom_reaper
> [ 429.926982] preempt_count: 0, expected: 0
> [ 429.926984] RCU nest depth: 0, expected: 0
> [ 429.926986] 4 locks held by oom_reaper/280:
> [ 429.926989] #0: ffff8a61da779cb0 (&mm->mmap_lock){....}-{3:3}, at: oom_reaper+0x150/0x520
> [ 429.927006] #1: ffffffffa0934f20 (mmu_notifier_invalidate_range_start){....}-{0:0}, at: zap_vma_for_reaping+0xb7/0x1d0
> [ 429.927019] #2: ffffffffa0934f78 (srcu){....}-{0:0}, at: __mmu_notifier_invalidate_range_start+0xae/0x340
> [ 429.927029] #3: ffff8a6240295360 (&kvm->mn_invalidate_lock){....}-{2:2}, at: kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927044] CPU: 26 UID: 0 PID: 280 Comm: oom_reaper Tainted: G S I 7.1.0-rc2+ #2460 PREEMPT_{RT,(lazy)}
> [ 429.927051] Tainted: [S]=CPU_OUT_OF_SPEC, [I]=FIRMWARE_WORKAROUND
> [ 429.927053] Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
> [ 429.927055] Call Trace:
> [ 429.927058] <TASK>
> [ 429.927062] dump_stack_lvl+0x6e/0xa0
> [ 429.927074] __might_resched.cold+0xeb/0x100
> [ 429.927084] rt_spin_lock+0x6c/0x1a0
> [ 429.927092] ? kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927102] kvm_mmu_notifier_invalidate_range_start+0xac/0x4b0
> [ 429.927110] ? sched_update_numa+0xa0/0x270
> [ 429.927129] __mmu_notifier_invalidate_range_start+0x129/0x340
Okay. This complains about non_block:
…
> [ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
and commit 312364f3534cc ("kernel.h: Add non_block_start/end()") says
| Peter also asked whether we want to catch spinlocks on top, but Michal
| said those are less of a problem because spinlocks can't have an indirect
| dependency upon the page allocator and hence close the loop with the oom
| reaper.
so a lock which becomes sleep-able on RT vs !RT shouldn't be a problem,
right? We also don't complain about about scheduling within a
rcu_read_lock() section if it is part of spin_lock().
Sebastian
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 13:40 ` Sebastian Andrzej Siewior
@ 2026-06-01 13:53 ` David Woodhouse
2026-06-01 14:47 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 55+ messages in thread
From: David Woodhouse @ 2026-06-01 13:53 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Paolo Bonzini, Sean Christopherson, Peter Zijlstra, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 1219 bytes --]
On Mon, 2026-06-01 at 15:40 +0200, Sebastian Andrzej Siewior wrote:
>
> Okay. This complains about non_block:
>
> …
> > [ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
>
> and commit 312364f3534cc ("kernel.h: Add non_block_start/end()") says
>
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an indirect
> > dependency upon the page allocator and hence close the loop with the oom
> > reaper.
>
> so a lock which becomes sleep-able on RT vs !RT shouldn't be a problem,
> right? We also don't complain about about scheduling within a
> rcu_read_lock() section if it is part of spin_lock().
Right. This is just a false positive in the debugging check, AFAICT.
It's actually *fine* to take a spinlock or rwlock in the OOM reaper
path, *even* if RT makes them in sleepable locks. And I think even the
KVM_REQUEST_WAIT IPI is fine.
But to fix the false positive warning, *either*:
• non_block_start() shouldn't complain about "sleepable only in RT" locks,
OR
• The OOM reaper path shouldn't use non_block_start() under RT.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 13:53 ` David Woodhouse
@ 2026-06-01 14:47 ` Sebastian Andrzej Siewior
2026-06-01 15:11 ` David Woodhouse
0 siblings, 1 reply; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-01 14:47 UTC (permalink / raw)
To: David Woodhouse
Cc: Paolo Bonzini, Sean Christopherson, Peter Zijlstra, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On 2026-06-01 14:53:14 [+0100], David Woodhouse wrote:
> On Mon, 2026-06-01 at 15:40 +0200, Sebastian Andrzej Siewior wrote:
> >
> > Okay. This complains about non_block:
> >
> > …
> > > [ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
> >
> > and commit 312364f3534cc ("kernel.h: Add non_block_start/end()") says
> >
> > > Peter also asked whether we want to catch spinlocks on top, but Michal
> > > said those are less of a problem because spinlocks can't have an indirect
> > > dependency upon the page allocator and hence close the loop with the oom
> > > reaper.
> >
> > so a lock which becomes sleep-able on RT vs !RT shouldn't be a problem,
> > right? We also don't complain about about scheduling within a
> > rcu_read_lock() section if it is part of spin_lock().
>
> Right. This is just a false positive in the debugging check, AFAICT.
>
> It's actually *fine* to take a spinlock or rwlock in the OOM reaper
> path, *even* if RT makes them in sleepable locks. And I think even the
> KVM_REQUEST_WAIT IPI is fine.
>
> But to fix the false positive warning, *either*:
>
> • non_block_start() shouldn't complain about "sleepable only in RT" locks,
>
> OR
>
> • The OOM reaper path shouldn't use non_block_start() under RT.
Looking at the list of users it was probably introduced for the mm
folks. Let me add this to my list of things to look at…
Sebastian
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 14:47 ` Sebastian Andrzej Siewior
@ 2026-06-01 15:11 ` David Woodhouse
0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2026-06-01 15:11 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Paolo Bonzini, Sean Christopherson, Peter Zijlstra, Paul Durrant,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, kvm,
linux-kernel, syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]
On Mon, 2026-06-01 at 16:47 +0200, Sebastian Andrzej Siewior wrote:
> On 2026-06-01 14:53:14 [+0100], David Woodhouse wrote:
> > On Mon, 2026-06-01 at 15:40 +0200, Sebastian Andrzej Siewior wrote:
> > >
> > > Okay. This complains about non_block:
> > >
> > > …
> > > > [ 429.927260] KVM: non-blockable invalidate_range_start, non_block_count=1
> > >
> > > and commit 312364f3534cc ("kernel.h: Add non_block_start/end()") says
> > >
> > > > Peter also asked whether we want to catch spinlocks on top, but Michal
> > > > said those are less of a problem because spinlocks can't have an indirect
> > > > dependency upon the page allocator and hence close the loop with the oom
> > > > reaper.
> > >
> > > so a lock which becomes sleep-able on RT vs !RT shouldn't be a problem,
> > > right? We also don't complain about about scheduling within a
> > > rcu_read_lock() section if it is part of spin_lock().
> >
> > Right. This is just a false positive in the debugging check, AFAICT.
> >
> > It's actually *fine* to take a spinlock or rwlock in the OOM reaper
> > path, *even* if RT makes them in sleepable locks. And I think even the
> > KVM_REQUEST_WAIT IPI is fine.
> >
> > But to fix the false positive warning, *either*:
> >
> > • non_block_start() shouldn't complain about "sleepable only in RT" locks,
> >
> > OR
> >
> > • The OOM reaper path shouldn't use non_block_start() under RT.
>
> Looking at the list of users it was probably introduced for the mm
> folks. Let me add this to my list of things to look at…
Thanks. For the purpose of KVM we shall ignore the issue and assume
it's going away. It is safe to take both spinlocks *and* rwlocks in the
"!may_block" invalidation path.
And probably OK to IPI a vCPU to GTFO too, but we'll discuss that under
separate cover:
https://lore.kernel.org/all/7428996952122d5943715c8682c6ad1c353566f9.camel@infradead.org
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 11:11 ` Sebastian Andrzej Siewior
2026-06-01 11:40 ` Peter Zijlstra
@ 2026-06-01 19:13 ` Paolo Bonzini
2026-06-02 7:34 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 55+ messages in thread
From: Paolo Bonzini @ 2026-06-01 19:13 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Peter Zijlstra, Sean Christopherson, David Woodhouse,
Paul Durrant, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
kvm, Kernel Mailing List, Linux, David Woodhouse,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
Il lun 1 giu 2026, 13:11 Sebastian Andrzej Siewior
<bigeasy@linutronix.de> ha scritto:
> On 2026-06-01 10:40:45 [+0200], Peter Zijlstra wrote:
> > > When raw_seqcount_try_begin() is used, the typical action when
> > > the seqcount is busy is to take the same lock that the writer uses.
> > > Another possibility, found in kernel/events/uprobes.c, is to delay
> > > the work to a safe point using RCU. Either way there is no need
> > > to guarantee progress of the writer under CONFIG_PREEMPT_RT,
> > > because the caller is not going to call raw_seqcount_try_begin() in a loop.
>
> This does not happen. It only happens if the associated lock is
> preemptible. This is not the case for uprobes.c so there is no lock/
> unlock.
The uprobes reference is just an example of a different try_begin
pattern, that does not involve taking a lock but also doesn't need to
guarantee progress of the writer. uprobes uses it without a lock but
that doesn't have to be the case. I thought the commit message was
reasonably clear but perhaps I was wrong?
> > > However, raw_seqcount_try_begin() currently uses seqprop_sequence()
> > > via raw_read_seqcount(), and therefore does a lock/unlock of the
> > > write-side lock on PREEMPT_RT kernel. This prevents it from being
> > > used in atomic context. Use __seqprop_sequence() instead of
> > > raw_read_seqcount() to allow it.
>
> So it would work if there is no lock associated or it is something like
> raw_spinlock_t.
Yes, in which case the patch is a no-op. But in the KVM case that we
are thinking about in this thread, raw spinlocks are not needed
because the atomic-context read side would just punt if try_begin()
fails; and unlike the uprobes case, there would be a lock involved.
Since it'd be possible to use a regular spinlock, this patch would
help.
> > Yeah, I think this makes sense. Thanks!
>
> Do I make sense or do I miss something obvious? On the other it would
> make sense to have this even for spinlock_t assuming the try_begin
> version does not try spin and expect to make progress (which it does not
> for the current users).
That was my point. The lock/unlock is not causing any current bug in
the kernel, other than maybe some minor inefficiencies, but it makes
try_begin() subtly different for PREEMPT_RT kernels and the "obviously
correct" way to do it isn't the right one. Having to use &s->seqcount
is yucky.
The docs for try_begin() don't say it super explicitly, but still they
heavily imply you're not meant to spin on it. At that point it's not a
"try" anymore.
Thanks,
Paolo
>
> Sebastian
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-06-01 19:13 ` Paolo Bonzini
@ 2026-06-02 7:34 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-02 7:34 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Peter Zijlstra, Sean Christopherson, David Woodhouse,
Paul Durrant, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
kvm, Kernel Mailing List, Linux, David Woodhouse,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
On 2026-06-01 21:13:56 [+0200], Paolo Bonzini wrote:
> Il lun 1 giu 2026, 13:11 Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> ha scritto:
> > On 2026-06-01 10:40:45 [+0200], Peter Zijlstra wrote:
> > > > When raw_seqcount_try_begin() is used, the typical action when
> > > > the seqcount is busy is to take the same lock that the writer uses.
> > > > Another possibility, found in kernel/events/uprobes.c, is to delay
> > > > the work to a safe point using RCU. Either way there is no need
> > > > to guarantee progress of the writer under CONFIG_PREEMPT_RT,
> > > > because the caller is not going to call raw_seqcount_try_begin() in a loop.
> >
> > This does not happen. It only happens if the associated lock is
> > preemptible. This is not the case for uprobes.c so there is no lock/
> > unlock.
>
> The uprobes reference is just an example of a different try_begin
> pattern, that does not involve taking a lock but also doesn't need to
> guarantee progress of the writer. uprobes uses it without a lock but
> that doesn't have to be the case. I thought the commit message was
> reasonably clear but perhaps I was wrong?
I wasn't sure if you followed the code wrong or not. Based on the
example it does not happen. But I get the idea.
> > > > However, raw_seqcount_try_begin() currently uses seqprop_sequence()
> > > > via raw_read_seqcount(), and therefore does a lock/unlock of the
> > > > write-side lock on PREEMPT_RT kernel. This prevents it from being
> > > > used in atomic context. Use __seqprop_sequence() instead of
> > > > raw_read_seqcount() to allow it.
> >
> > So it would work if there is no lock associated or it is something like
> > raw_spinlock_t.
>
> Yes, in which case the patch is a no-op. But in the KVM case that we
> are thinking about in this thread, raw spinlocks are not needed
> because the atomic-context read side would just punt if try_begin()
> fails; and unlike the uprobes case, there would be a lock involved.
> Since it'd be possible to use a regular spinlock, this patch would
> help.
Yes, I understand.
> > > Yeah, I think this makes sense. Thanks!
> >
> > Do I make sense or do I miss something obvious? On the other it would
> > make sense to have this even for spinlock_t assuming the try_begin
> > version does not try spin and expect to make progress (which it does not
> > for the current users).
>
> That was my point. The lock/unlock is not causing any current bug in
> the kernel, other than maybe some minor inefficiencies, but it makes
> try_begin() subtly different for PREEMPT_RT kernels and the "obviously
> correct" way to do it isn't the right one. Having to use &s->seqcount
> is yucky.
>
> The docs for try_begin() don't say it super explicitly, but still they
> heavily imply you're not meant to spin on it. At that point it's not a
> "try" anymore.
I don't mind the patch if the commit message would be clearer. This is
what confused me since it did not make sense based on the example.
The kernel doc says "w/o counter stabilization" but it is w/ on
PREEMPT_RT. Based on the API and the current users is not required.
Correct.
It could make sense to update the doc clarifying that a retry loop in
with raw_seqcount_try_begin() is not a valid pattern/ usage.
> Thanks,
>
> Paolo
Sebastian
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time
2026-05-30 6:19 ` sashiko-bot
@ 2026-06-02 12:29 ` David Woodhouse
0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2026-06-02 12:29 UTC (permalink / raw)
To: sashiko-bot
Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
linux-kernel, sashiko-reviews, stollmc, dwmw
[-- Attachment #1: Type: text/plain, Size: 2108 bytes --]
On Sat, 30 May 2026 06:19:32 +0000, sashiko-bot@kernel.org wrote:
> [Severity: High]
> Does this introduce a scheduling while atomic bug on non-PREEMPT_RT kernels?
>
> The CLASS(gpc_map_local, st_map) macro acquires a read_lock on gpc->lock,
> which disables preemption. While this lock is held, if the guest supports
> PV TLB flush, the code calls kvm_vcpu_flush_tlb_guest().
>
> If TDP is disabled (shadow paging), kvm_vcpu_flush_tlb_guest() calls
> kvm_mmu_sync_roots() and eventually mmu_sync_children(). This path can yield
> via cond_resched_rwlock_write(). Yielding while preemption is disabled by
> the gpc read lock will trigger a BUG.
Ah, that issue exists in the previous versions too, but it's simple
enough to fix. There's no particular timing constraint for flushing the
TLB; it just have to be done before this vCPU ever runs again. It can
just be moved to the end of the function after the lock is dropped.
That does mean record_steal_time() should use the explicit
gpc_map_local_lock()/gpc_map_local_unlock() instead of the CLASS()
macro, but that's easy enough.
> [Severity: High]
> Can this cause an ABBA deadlock in the scheduler on PREEMPT_RT?
>
> kvm_steal_time_set_preempted() is called during a vCPU context switch
> via kvm_sched_out(), which is a preempt notifier hook. This runs with
> the local CPUs runqueue lock (rq->lock) held.
>
> While the code correctly uses a trylock to avoid sleeping when acquiring
> the lock, the CLASS macro destructor will implicitly call read_unlock()
> at the end of the function. On PREEMPT_RT kernels, unlocking an rt_rwlock
> can wake up a waiting writer, such as a concurrent kvm_gpc_refresh().
This one is more fun, and the problem already exists on RT for the Xen
runstate update so it's not even really new with this patch.
I think the gpc lock needs to be a raw_rwlock_t, so that it remains a
true spinning lock even on PREEMPT_RT. The writer side
(kvm_gpc_refresh) only holds the lock briefly to swap in the new
pfn/khva after doing the heavyweight work, so it should be fine as a
raw lock.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
2026-05-29 19:11 ` Sean Christopherson
@ 2026-06-02 12:37 ` David Woodhouse
0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2026-06-02 12:37 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
linux-kernel, sashiko-reviews, dwmw
[-- Attachment #1: Type: text/plain, Size: 1464 bytes --]
On Fri, 29 May 2026 12:11:37 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > [Severity: High]
> > Does this unintentionally hold the gpc lock across the IPI kick?
>
> No, it intentionally holds the gpc lock across _sending_ the IPI kick.
>
> > [Severity: High]
> > Does this now hold the VM-wide shinfo_cache lock while calling
> > __kvm_xen_set_evtchn_fast() and kicking the vCPU?
> >
> > Since shinfo_map is a function-scoped CLASS() variable, its destructor
> > won't release the lock until after __kvm_xen_set_evtchn_fast() returns.
> > This creates a nested locking dependency and holds locks over expensive
> > cross-vCPU operations, potentially serializing event channel deliveries
> > across the entire VM on the fast path.
>
> __kvm_vcpu_kick() is neither expensive nor cross-vCPU. In the wait=false case,
> which is the behavior of kvm_vcpu_kick(), it sends IPIs via smp_send_reschedule(),
> i.e. it's more or less just __apic_send_IPI(cpu, RESCHEDULE_VECTOR), which is a
> single WRMSR on modern harware.
Hm, if we really are going to use a raw rwlock then I suppose we should
take care not to spend any extra cycles with the lock held unless we
actually *need* to?
In which case I wonder if we really need the scoped CLASS() thing (that
encourages people to run to the end of the function under the lock), or
if the helpers you add for that are sufficient on their own?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock()
2026-05-29 20:38 ` Peter Zijlstra
2026-05-30 0:54 ` Sean Christopherson
@ 2026-06-04 23:58 ` David Woodhouse
1 sibling, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2026-06-04 23:58 UTC (permalink / raw)
To: Peter Zijlstra, Sean Christopherson
Cc: Paolo Bonzini, Paul Durrant, Ingo Molnar, Will Deacon, Boqun Feng,
Waiman Long, kvm, linux-kernel, Sebastian Andrzej Siewior,
syzbot+208f7f3e5f59c11aeb90, Carsten Stollmaier
[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]
On Fri, 2026-05-29 at 22:38 +0200, Peter Zijlstra wrote:
> On Fri, May 29, 2026 at 10:13:35PM +0200, Peter Zijlstra wrote:
>
> > It is somewhat possible to do an RT aware read-write spinlock thing, but
> > it is definitely non-trivial and this would be the only user.
>
> Furthermore, it is fundamentally one of the worst possible lock types.
>
> It really isn't something you *want* to have -- arguably even for !RT.
>
> They scale like ass; per them being a spinlock type, the critical
> sections must be short, but this means there is nothing to amortize the
> cost of bouncing the shared lock around -- which is the 'saving' grace
> of the rwsem.
>
> For short sections the cost of the shared access will dominate. I'm sure
> Paul has a bunch of graphs to illustrate this point :-)
Most of these are per-vCPU anyway. I think the only one that isn't is
the Xen shared_info page, and we *could* have a GPC for that per vCPU
if we really wanted.
In *theory* we could be delivering hardware interrupts as event
channels from multiple physical CPUs simultaneously. But that's
probably fairly rare, at least for them all to be coming to the same
vCPU.
So in the general case, they don't bounce around much unless someone
else takes a write lock to invalidate them.
I suspect we could tolerate them being a raw_spin_lock on RT and only a
rwlock on non-RT, if there's no appetite for raw_rwlock?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 55+ messages in thread
end of thread, other threads:[~2026-06-04 23:58 UTC | newest]
Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-29 16:50 [PATCH v2 00/20] KVM: x86/xen: Fix Xen/GP/PREEMPT_RT issues with rwlock_t Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 01/20] locking/rt: Use raw_spin_lock_irqsave() in __rwbase_read_unlock() Sean Christopherson
2026-05-29 19:32 ` Peter Zijlstra
2026-05-29 19:34 ` Peter Zijlstra
2026-05-29 20:05 ` Sean Christopherson
2026-05-29 20:13 ` Peter Zijlstra
2026-05-29 20:38 ` Peter Zijlstra
2026-05-30 0:54 ` Sean Christopherson
2026-05-30 10:26 ` Paolo Bonzini
2026-05-30 12:47 ` David Woodhouse
2026-05-30 14:40 ` Paolo Bonzini
2026-06-01 10:52 ` David Woodhouse
2026-06-01 13:01 ` David Woodhouse
2026-06-01 13:40 ` Sebastian Andrzej Siewior
2026-06-01 13:53 ` David Woodhouse
2026-06-01 14:47 ` Sebastian Andrzej Siewior
2026-06-01 15:11 ` David Woodhouse
2026-06-01 9:40 ` Peter Zijlstra
2026-06-01 10:04 ` David Woodhouse
2026-05-30 13:02 ` Paolo Bonzini
2026-06-01 8:40 ` Peter Zijlstra
2026-06-01 11:11 ` Sebastian Andrzej Siewior
2026-06-01 11:40 ` Peter Zijlstra
2026-06-01 19:13 ` Paolo Bonzini
2026-06-02 7:34 ` Sebastian Andrzej Siewior
2026-06-04 23:58 ` David Woodhouse
2026-05-29 16:50 ` [PATCH v2 02/20] KVM: x86/xen: Use read_trylock() for GPC locks in hardirq/atomic paths Sean Christopherson
2026-05-29 17:20 ` sashiko-bot
2026-05-29 23:28 ` Hillf Danton
2026-05-29 16:50 ` [PATCH v2 03/20] KVM: x86/xen: Remove unnecessary irqsave from GPC lock usage in xen.c Sean Christopherson
2026-05-29 17:36 ` sashiko-bot
2026-05-29 16:50 ` [PATCH v2 04/20] KVM: x86: Remove unnecessary irqsave from kvm_setup_guest_pvclock() Sean Christopherson
2026-05-29 16:50 ` [PATCH v2 05/20] KVM: Remove unnecessary IRQ disabling from GPC lock in pfncache.c Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 06/20] KVM: x86/xen: Use guard() to grab kvm->srcu around gpc critical sections Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 07/20] KVM: x86/xen: Extract delivery of event to vCPU into a separate helper Sean Christopherson
2026-05-29 17:47 ` sashiko-bot
2026-05-29 16:51 ` [PATCH v2 08/20] KVM: x86/xen: Explicitly tag "shared info" page as never being dirty tracked Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 09/20] KVM: x86/xen: Don't dirty track "vCPU info" page Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 10/20] KVM: Move {g,p}fn <=> {g,h}pa conversion helpers to kvm_types.h Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 11/20] KVM: Add CLASS() constructs to automagically handle lock+check of gpc Sean Christopherson
2026-05-29 17:59 ` sashiko-bot
2026-05-29 16:51 ` [PATCH v2 12/20] KVM: x86/xen: Convert kvm_xen_shared_info_init() to gpc's CLASS() APIs Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 13/20] KVM: x86/xen: Don't bother waiting on gpc->lock in SCHEDOP_poll Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 14/20] KVM: x86/xen: Convert wait_pending_event() to gpc's CLASS() APIs Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() " Sean Christopherson
2026-05-29 19:01 ` sashiko-bot
2026-05-29 19:11 ` Sean Christopherson
2026-06-02 12:37 ` David Woodhouse
2026-05-29 16:51 ` [PATCH v2 16/20] KVM: x86/xen: Convert xen_get_guest_pvclock() " Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 17/20] KVM: x86/xen: Drop local "kick_vcpu" from __kvm_xen_set_evtchn_fast() Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 18/20] KVM: x86/xen: Convert event injection to gpc's CLASS() APIs Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 19/20] KVM: Add "extended" gpc CLASS() APIs for sometimes-atomic cases Sean Christopherson
2026-05-29 16:51 ` [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time Sean Christopherson
2026-05-30 6:19 ` sashiko-bot
2026-06-02 12:29 ` David Woodhouse
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.