* [PATCH v2 0/1] KVM: powerpc: Use generic xfer to guest work function
@ 2026-07-01 18:30 Vishal Chourasia
2026-07-01 18:30 ` [PATCH v2 1/1] " Vishal Chourasia
From: Vishal Chourasia @ 2026-07-01 18:30 UTC
To: maddy
Cc: npiggin, mpe, chleroy, sshegde, amachhiw, vaibhav, harshpb,
gautam, linuxppc-dev, kvm, linux-kernel, Vishal Chourasia
This series fixes a KVM scheduling bug on Book3S HV where a guest VM
under a cpu.max bandwidth limit can run arbitrarily past its quota and
then appear frozen for minutes afterwards.
== Problem ==
Since commit 2cd571245b43 ("sched/fair: Add related data structure for
task based throttle"), merged in v6.18, CFS bandwidth throttling no
longer dequeues a task directly. Instead it queues a task_work item via
task_work_add(..., TWA_RESUME), sets TIF_NOTIFY_RESUME, and relies on
that work running on the return path to actually dequeue the task.
The powerpc KVM run loops only test TIF_SIGPENDING and TIF_NEED_RESCHED
before re-entering the guest; TIF_NOTIFY_RESUME is never checked. For a
CPU-bound guest that generates few KVM exits back to userspace, the vCPU
thread never returns to user mode, so the deferred throttle task_work
never runs. The guest keeps running unchecked while its
runtime_remaining grows increasingly negative. Once the vCPU thread
finally exits to userspace, it is legitimately throttled for minutes
while the accrued debt is repaid at the bandwidth-timer replenishment rate.
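As a rough illustration of the arithmetic behind "frozen for minutes", the sketch below models the debt in a deliberately simplified way. It is not scheduler code: the function names are hypothetical, and it assumes the vCPU consumes a full period of CPU per period during the overrun and that the debt is later repaid at exactly one quota per period.

```c
#include <assert.h>

/*
 * Simplified model (an assumption, not the CFS implementation): while the
 * throttle is never enforced, the task uses period_ms of CPU per period
 * but is only granted quota_ms, so the debt grows by the difference.
 */
static long long throttle_debt_ms(long long quota_ms, long long period_ms,
				  long long overrun_ms)
{
	assert(quota_ms > 0 && period_ms >= quota_ms && overrun_ms >= 0);

	long long periods = overrun_ms / period_ms;

	return periods * (period_ms - quota_ms);
}

/*
 * Wall-clock time the task stays throttled while the debt is repaid.
 * The quota is refilled once per period, so round up to whole periods.
 */
static long long frozen_time_ms(long long quota_ms, long long period_ms,
				long long debt_ms)
{
	assert(quota_ms > 0 && period_ms >= quota_ms && debt_ms >= 0);

	long long periods = (debt_ms + quota_ms - 1) / quota_ms;

	return periods * period_ms;
}
```

With hypothetical numbers, a limit of 50 ms per 100 ms period and 60 s of unchecked running, throttle_debt_ms(50, 100, 60000) gives 30000 ms of debt and frozen_time_ms(50, 100, 30000) gives 60000 ms, i.e. the guest would appear frozen for about a minute under this model.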
The generic xfer-to-guest-mode infrastructure (commit 935ace2fb5cc,
"entry: Provide infrastructure for work before transitioning to guest
mode") exists precisely to handle this kind of work before each guest
entry. A full trace-backed root-cause analysis was posted with v1 [2].
== Fix ==
Opt powerpc KVM into VIRT_XFER_TO_GUEST_WORK and use the generic
xfer_to_guest_mode helpers to check for and handle pending guest-mode
work (reschedule, signals, and TIF_NOTIFY_RESUME task_work such as the
deferred CFS throttle) on every guest re-entry:
- Book3S HV: both run loops — kvmhv_run_single_vcpu() for POWER9+ and
kvmppc_run_vcpu() for pre-POWER9.
- Book3S PR and BookE: the common kvmppc_prepare_to_enter(), which
likewise only checked need_resched()/signal_pending().
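The difference between the old open-coded checks and the generic ones can be sketched with a small userspace model. Everything here is hypothetical (the WORK_* bits, struct mock_vcpu and the mock_* functions are stand-ins, not kernel definitions); it only assumes that the handler returns -EINTR on a pending signal and otherwise drains the work it was asked to look at.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for thread-info work bits; not the kernel's values. */
enum {
	WORK_NEED_RESCHED  = 1 << 0,
	WORK_SIGPENDING    = 1 << 1,
	WORK_NOTIFY_RESUME = 1 << 2,
};

/* What the old run loops looked at, and what the generic check covers. */
#define OLD_MASK (WORK_NEED_RESCHED | WORK_SIGPENDING)
#define NEW_MASK (OLD_MASK | WORK_NOTIFY_RESUME)

struct mock_vcpu {
	unsigned int work;	/* pending work bits */
	int resched_runs;	/* times we "scheduled" */
	int task_work_runs;	/* times deferred task_work (e.g. throttle) ran */
	int guest_entries;	/* times we "entered the guest" */
};

/* Handle only the work selected by mask; a signal aborts with -EINTR. */
static int mock_handle_work(struct mock_vcpu *v, unsigned int mask)
{
	unsigned int todo = v->work & mask;

	if (todo & WORK_SIGPENDING)
		return -EINTR;	/* left pending for userspace to handle */
	if (todo & WORK_NEED_RESCHED) {
		v->resched_runs++;
		v->work &= ~WORK_NEED_RESCHED;
	}
	if (todo & WORK_NOTIFY_RESUME) {
		v->task_work_runs++;
		v->work &= ~WORK_NOTIFY_RESUME;
	}
	return 0;
}

/* Drain the work covered by mask, then "enter the guest". */
static int mock_enter_guest(struct mock_vcpu *v, unsigned int mask)
{
	while (v->work & mask) {
		int r = mock_handle_work(v, mask);

		if (r)
			return r;
	}
	v->guest_entries++;
	return 0;
}
```

In this model, a vCPU with only WORK_NOTIFY_RESUME set enters the guest under OLD_MASK with the task_work still pending, which mirrors the bug described above; under NEW_MASK the task_work runs first.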
== Changes from v1 ==
- Extend the fix beyond Book3S HV to the shared powerpc KVM entry path:
also convert the common kvmppc_prepare_to_enter() used by Book3S PR
and BookE. (Shrikanth Hegde)
- Move "select VIRT_XFER_TO_GUEST_WORK" from KVM_BOOK3S_64_HV up to the
common "config KVM" so every powerpc KVM variant gets the
infrastructure.
- Drop the redundant signal_pending() recheck and its sigpend label in
kvmhv_run_single_vcpu(); xfer_to_guest_mode_work_pending() is a
superset of it.
- Preserve the E500 CONFIG_KVM_EXIT_TIMING histogram on the signal path
via an explicit kvmppc_set_exit_type(SIGNAL_EXITS).
[1] https://lore.kernel.org/all/20250421102837.78515-2-sshegde@linux.ibm.com/
[2] https://lore.kernel.org/all/20260626105449.2897924-2-vishalc@linux.ibm.com/
Vishal Chourasia (1):
KVM: powerpc: Use generic xfer to guest work function
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_hv.c | 64 ++++++++++++++++++++++++++++--------
arch/powerpc/kvm/powerpc.c | 34 ++++++++++++++-----
3 files changed, 77 insertions(+), 22 deletions(-)
--
2.54.0
* [PATCH v2 1/1] KVM: powerpc: Use generic xfer to guest work function
From: Vishal Chourasia @ 2026-07-01 18:30 UTC
To: maddy
Cc: npiggin, mpe, chleroy, sshegde, amachhiw, vaibhav, harshpb,
gautam, linuxppc-dev, kvm, linux-kernel, Vishal Chourasia
Since commit 2cd571245b43 ("sched/fair: Add related data structure for
task based throttle") in v6.18, CFS bandwidth throttling no longer
dequeues a task directly; it queues task_work via TWA_RESUME and sets
TIF_NOTIFY_RESUME, relying on that work running before the task returns
to guest/user mode. The powerpc KVM run loops only checked for reschedule
and signals, never TIF_NOTIFY_RESUME, so the deferred throttle never ran
while a vCPU stayed in the run loop: a CPU-bound guest that rarely exits
to userspace ran far past its cpu.max quota and then appeared frozen for
minutes while the accrued throttle debt was repaid.
Use the generic infrastructure to check for and handle pending work
before transitioning into guest mode, replacing the open-coded
need_resched() and cond_resched() checks in the Book3S HV run loops and
in the common kvmppc_prepare_to_enter() used by the Book3S PR and BookE
run loops. The redundant signal_pending() recheck (and its sigpend label)
in kvmhv_run_single_vcpu() is also dropped, as
xfer_to_guest_mode_work_pending() is a superset of it.
This adds handling for TIF_NOTIFY_RESUME, which was previously ignored,
so pending task_work now runs before every guest re-entry.
In kvmppc_prepare_to_enter() the generic helper accounts the signal exit
(vcpu->stat.signal_exits and KVM_EXIT_INTR) but does not set the exit
type, so kvmppc_set_exit_type(SIGNAL_EXITS) is retained on the signal
path to preserve the E500 CONFIG_KVM_EXIT_TIMING histogram; it is a no-op
otherwise.
Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
---
arch/powerpc/kvm/Kconfig | 1 +
arch/powerpc/kvm/book3s_hv.c | 64 ++++++++++++++++++++++++++++--------
arch/powerpc/kvm/powerpc.c | 34 ++++++++++++++-----
3 files changed, 77 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 9a0d1c1aca6c..b6bc2fc86dca 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -22,6 +22,7 @@ config KVM
select KVM_COMMON
select KVM_VFIO
select HAVE_KVM_IRQ_BYPASS
+ select VIRT_XFER_TO_GUEST_WORK
config KVM_BOOK3S_HANDLER
bool
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..a1b2077561bb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3850,10 +3850,20 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
* and return without going into the guest(s).
* If the mmu_ready flag has been cleared, don't go into the
* guest because that means a HPT resize operation is in progress.
+ *
+ * xfer_to_guest_mode_work_pending() is the IRQs-disabled recheck for
+ * pending guest-mode work (reschedule, signals, and TIF_NOTIFY_RESUME
+ * task_work such as the deferred CFS throttle). It is the pre-POWER9
+ * analog of the final gate in kvmhv_run_single_vcpu(), and a superset
+ * of the old need_resched() check: it catches work that raced in after
+ * the drain in kvmppc_run_vcpu(), so a CPU-bound vCPU is throttled here
+ * instead of running one more guest dispatch past its quota. IRQs are
+ * hard-disabled just below, before this check, so the non-__ variant
+ * (which asserts that) is the correct one.
*/
local_irq_disable();
hard_irq_disable();
- if (lazy_irq_pending() || need_resched() ||
+ if (lazy_irq_pending() || xfer_to_guest_mode_work_pending() ||
recheck_signals_and_mmu(&core_info)) {
local_irq_enable();
vc->vcore_state = VCORE_INACTIVE;
@@ -4824,10 +4834,24 @@ static int kvmppc_run_vcpu(struct kvm_vcpu *vcpu)
vc->runner = vcpu;
if (n_ceded == vc->n_runnable) {
kvmppc_vcore_blocked(vc);
- } else if (need_resched()) {
+ } else if (__xfer_to_guest_mode_work_pending()) {
kvmppc_vcore_preempt(vc);
- /* Let something else run */
- cond_resched_lock(&vc->lock);
+ /*
+ * Let something else run, and run pending guest-mode
+ * work (reschedule, and TIF_NOTIFY_RESUME task_work such
+ * as the deferred CFS throttle) before we would re-enter
+ * the guest, so a CPU-bound vCPU is actually throttled
+ * here instead of running past its quota. This is a
+ * superset of the old need_resched() check. Use the raw
+ * helper, not the kvm_ wrapper: signals (KVM_EXIT_INTR
+ * and the signal_exits stat) are accounted by this path's
+ * existing handling below, so going through the wrapper
+ * here would double-count them. The helper may schedule(),
+ * so the vcore lock is dropped around it.
+ */
+ spin_unlock(&vc->lock);
+ xfer_to_guest_mode_handle_work();
+ spin_lock(&vc->lock);
if (vc->vcore_state == VCORE_PREEMPT)
kvmppc_vcore_end_preempt(vc);
} else {
@@ -4899,8 +4923,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
}
}
- if (need_resched())
- cond_resched();
+ /*
+ * Run pending work before (re-)entering the guest, most importantly
+ * task_work queued via TWA_RESUME (e.g. the deferred CFS bandwidth
+ * throttle, which only sets TIF_NOTIFY_RESUME). Without this a CPU-bound
+ * vCPU that keeps returning RESUME_GUEST never reaches an exit-to-user
+ * point, so the throttle is never enforced and the task runs far beyond
+ * its quota. The helper also handles reschedule and signals, replacing
+ * the cond_resched() that was here. It may schedule(), so it runs before
+ * preemption and IRQs are disabled, with no vcore/KVM locks held. This
+ * is the per-reentry site shared by the bare-metal and pseries (nested)
+ * paths, so both are covered.
+ */
+ r = kvm_xfer_to_guest_mode_handle_work(vcpu);
+ if (r) /* -EINTR: signal pending, exit to userspace (KVM_EXIT_INTR) */
+ return r;
kvmppc_update_vpas(vcpu);
@@ -4914,9 +4951,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
- if (signal_pending(current))
- goto sigpend;
- if (need_resched() || !kvm->arch.mmu_ready)
+ /*
+ * Final IRQs-disabled check for pending guest-mode work or an MMU that
+ * is not ready. IRQs are disabled here, so bail to the outer loop,
+ * which re-enters and handles the pending work via
+ * kvm_xfer_to_guest_mode_handle_work() above (exiting with -EINTR on a
+ * signal).
+ */
+ if (xfer_to_guest_mode_work_pending() || !kvm->arch.mmu_ready)
goto out;
vcpu->cpu = pcpu;
@@ -5068,10 +5110,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
return vcpu->arch.ret;
- sigpend:
- vcpu->stat.signal_exits++;
- run->exit_reason = KVM_EXIT_INTR;
- vcpu->arch.ret = -EINTR;
out:
vcpu->cpu = -1;
vcpu->arch.thread_cpu = -1;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 00302399fc37..ff1a9a8de5e0 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -84,20 +84,36 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
hard_irq_disable();
while (true) {
- if (need_resched()) {
+ if (__xfer_to_guest_mode_work_pending()) {
+ /*
+ * Handle pending guest-mode work before entering the
+ * guest: reschedule, signals, and TIF_NOTIFY_RESUME
+ * task_work such as the deferred CFS bandwidth throttle.
+ * The helper must run with interrupts enabled and may
+ * schedule(). This is a superset of the open-coded
+ * need_resched()/signal_pending() checks it replaces. On
+ * a pending signal it returns -EINTR after
+ * kvm_xfer_to_guest_mode_handle_work() has set
+ * run->exit_reason (KVM_EXIT_INTR) and bumped
+ * vcpu->stat.signal_exits, so just return to userspace.
+ */
local_irq_enable();
- cond_resched();
+ r = kvm_xfer_to_guest_mode_handle_work(vcpu);
hard_irq_disable();
+ if (r) {
+ /*
+ * -EINTR: the generic helper does not set the
+ * exit type, so record it here for the E500
+ * CONFIG_KVM_EXIT_TIMING histogram (a no-op
+ * otherwise). Only the exit type is set;
+ * signal_exits was already accounted above.
+ */
+ kvmppc_set_exit_type(vcpu, SIGNAL_EXITS);
+ break;
+ }
continue;
}
- if (signal_pending(current)) {
- kvmppc_account_exit(vcpu, SIGNAL_EXITS);
- vcpu->run->exit_reason = KVM_EXIT_INTR;
- r = -EINTR;
- break;
- }
-
vcpu->mode = IN_GUEST_MODE;
/*
--
2.54.0
* Re: [PATCH v2 1/1] KVM: powerpc: Use generic xfer to guest work function
From: sashiko-bot @ 2026-07-01 18:42 UTC
To: Vishal Chourasia; +Cc: kvm
Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [Medium] Early return on pending signals bypasses the vCPU exit tracepoint
--
commit 902f7514807a6e7be76bed1f71c7055f873978b8
Author: Vishal Chourasia <vishalc@linux.ibm.com>
KVM: powerpc: Use generic xfer to guest work function
This commit replaces open-coded need_resched() and cond_resched() checks
with the generic xfer_to_guest_mode_handle_work() infrastructure in powerpc
KVM run loops to properly handle deferred task work like CFS bandwidth
throttling.
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 61dbeea317f3b..a1b2077561bb5 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
[ ... ]
> @@ -4899,8 +4923,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
> }
> }
>
> - if (need_resched())
> - cond_resched();
> + /*
> + * Run pending work before (re-)entering the guest, most importantly
> + * task_work queued via TWA_RESUME (e.g. the deferred CFS bandwidth
> + * throttle, which only sets TIF_NOTIFY_RESUME). Without this a CPU-bound
> + * vCPU that keeps returning RESUME_GUEST never reaches an exit-to-user
> + * point, so the throttle is never enforced and the task runs far beyond
> + * its quota. The helper also handles reschedule and signals, replacing
> + * the cond_resched() that was here. It may schedule(), so it runs before
> + * preemption and IRQs are disabled, with no vcore/KVM locks held. This
> + * is the per-reentry site shared by the bare-metal and pseries (nested)
> + * paths, so both are covered.
> + */
> + r = kvm_xfer_to_guest_mode_handle_work(vcpu);
> + if (r) /* -EINTR: signal pending, exit to userspace (KVM_EXIT_INTR) */
> + return r;
[Severity: Medium]
When kvm_xfer_to_guest_mode_handle_work() returns -EINTR for a pending
signal, this direct return bypasses the done: label at the end of the function.
Since trace_kvmppc_run_vcpu_enter() is executed unconditionally at the start
of kvmhv_run_single_vcpu(), does this early return cause unbalanced
tracepoints on signal-driven userspace exits? This might break performance
monitoring tools that rely on symmetric enter/exit events.
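The concern can be pictured with a small userspace model. This is only a sketch of the general pattern, not the patch's code: struct mock_trace and both functions are hypothetical, and the counters merely stand in for the enter and exit tracepoints.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counters standing in for the enter/exit tracepoints. */
struct mock_trace {
	int enter;
	int exit;
};

/* Early return on a signal: the exit event is never emitted. */
static int run_early_return(struct mock_trace *t, int signal_pending)
{
	t->enter++;
	if (signal_pending)
		return -EINTR;	/* skips the exit accounting below */
	/* ... the guest would run here ... */
	t->exit++;
	return 0;
}

/* Single exit point: every enter event is paired with an exit event. */
static int run_single_exit(struct mock_trace *t, int signal_pending)
{
	int r = 0;

	t->enter++;
	if (signal_pending) {
		r = -EINTR;
		goto done;
	}
	/* ... the guest would run here ... */
done:
	t->exit++;
	return r;
}
```

In the first variant a signal leaves enter at 1 and exit at 0; in the second both counters stay equal. Whether the real function should route the -EINTR case through its existing exit label is a question for the patch author.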
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701183030.3610451-2-vishalc@linux.ibm.com?part=1