From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Vishal Chourasia <vishalc@linux.ibm.com>,
maddy@linux.ibm.com,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: npiggin@gmail.com, mpe@ellerman.id.au, chleroy@kernel.org,
amachhiw@linux.ibm.com, vaibhav@linux.ibm.com,
harshpb@linux.ibm.com, gautam@linux.ibm.com,
linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/1] KVM: powerpc: Use generic xfer to guest work function
Date: Thu, 2 Jul 2026 13:17:30 +0530 [thread overview]
Message-ID: <092d5f46-9809-41bb-b332-aa51ad1ed04d@linux.ibm.com> (raw)
In-Reply-To: <20260701183030.3610451-2-vishalc@linux.ibm.com>
On 7/2/26 12:00 AM, Vishal Chourasia wrote:
> Since commit 2cd571245b43 ("sched/fair: Add related data structure for
> task based throttle") in v6.18, CFS bandwidth throttling no longer
> dequeues a task directly; it queues task_work via TWA_RESUME and sets
> TIF_NOTIFY_RESUME, relying on that work running before the task returns
> to guest/user mode. The powerpc KVM run loops only checked for reschedule
> and signals, never TIF_NOTIFY_RESUME, so the deferred throttle never ran
> while a vCPU stayed in the run loop: a CPU-bound guest that rarely exits
> to userspace ran far past its cpu.max quota and then appeared frozen for
> minutes while the accrued throttle debt was repaid.
>
> Use the generic infrastructure to check for and handle pending work
> before transitioning into guest mode, replacing the open-coded
> need_resched() and cond_resched() checks in the Book3S HV run loops and
> in the common kvmppc_prepare_to_enter() used by the Book3S PR and BookE
> run loops. The redundant signal_pending() recheck (and its sigpend label)
> in kvmhv_run_single_vcpu() is also dropped, as
> xfer_to_guest_mode_work_pending() is a superset of it.
>
> This picks up handling for TIF_NOTIFY_RESUME, which was previously
> ignored, meaning task work will now be correctly handled on every
> guest re-entry.
>
> In kvmppc_prepare_to_enter() the generic helper accounts the signal exit
> (vcpu->stat.signal_exits and KVM_EXIT_INTR) but does not set the exit
> type, so kvmppc_set_exit_type(SIGNAL_EXITS) is retained on the signal
> path to preserve the E500 CONFIG_KVM_EXIT_TIMING histogram; it is a no-op
> otherwise.
>
With VIRT_XFER_TO_GUEST_WORK support,
you can take the HAVE_POSIX_CPU_TIMERS_TASK_WORK patch too?
https://lore.kernel.org/all/20250421102837.78515-3-sshegde@linux.ibm.com/
That would help in adding some of those remaining patches for PREEMPT_RT
on powernv system. the remaining ones are straighforward or already
merged in some form.
+Sebastian.
> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com>
> ---
> arch/powerpc/kvm/Kconfig | 1 +
> arch/powerpc/kvm/book3s_hv.c | 64 ++++++++++++++++++++++++++++--------
> arch/powerpc/kvm/powerpc.c | 34 ++++++++++++++-----
> 3 files changed, 77 insertions(+), 22 deletions(-)
>
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 9a0d1c1aca6c..b6bc2fc86dca 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -22,6 +22,7 @@ config KVM
> select KVM_COMMON
> select KVM_VFIO
> select HAVE_KVM_IRQ_BYPASS
> + select VIRT_XFER_TO_GUEST_WORK
>
> config KVM_BOOK3S_HANDLER
> bool
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 61dbeea317f3..a1b2077561bb 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3850,10 +3850,20 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
> * and return without going into the guest(s).
> * If the mmu_ready flag has been cleared, don't go into the
> * guest because that means a HPT resize operation is in progress.
> + *
> + * xfer_to_guest_mode_work_pending() is the IRQs-disabled recheck for
> + * pending guest-mode work (reschedule, signals, and TIF_NOTIFY_RESUME
> + * task_work such as the deferred CFS throttle). It is the pre-POWER9
> + * analog of the final gate in kvmhv_run_single_vcpu(), and a superset
> + * of the old need_resched() check: it catches work that raced in after
> + * the drain in kvmppc_run_vcpu(), so a CPU-bound vCPU is throttled here
> + * instead of running one more guest dispatch past its quota. IRQs are
> + * hard-disabled just above, so the non-__ variant (which asserts that)
> + * is the correct one.
IMO, These are good to go in a changelog. But not for comments.
Could you trim the comments not to describe
what xfer_to_guest_mode_work_pending does?
Similar comment for other big fat comments.
> */
> local_irq_disable();
> hard_irq_disable();
> - if (lazy_irq_pending() || need_resched() ||
> + if (lazy_irq_pending() || xfer_to_guest_mode_work_pending() ||
> recheck_signals_and_mmu(&core_info)) {
> local_irq_enable();
> vc->vcore_state = VCORE_INACTIVE;
> @@ -4824,10 +4834,24 @@ static int kvmppc_run_vcpu(struct kvm_vcpu *vcpu)
> vc->runner = vcpu;
> if (n_ceded == vc->n_runnable) {
> kvmppc_vcore_blocked(vc);
> - } else if (need_resched()) {
> + } else if (__xfer_to_guest_mode_work_pending()) {
> kvmppc_vcore_preempt(vc);
> - /* Let something else run */
> - cond_resched_lock(&vc->lock);
> + /*
> + * Let something else run, and run pending guest-mode
> + * work (reschedule, and TIF_NOTIFY_RESUME task_work such
> + * as the deferred CFS throttle) before we would re-enter
> + * the guest, so a CPU-bound vCPU is actually throttled
> + * here instead of running past its quota. This is a
> + * superset of the old need_resched() check. Use the raw
> + * helper, not the kvm_ wrapper: signals (KVM_EXIT_INTR
> + * and the signal_exits stat) are accounted by this path's
> + * existing handling below, so going through the wrapper
> + * here would double-count them. The helper may schedule(),
> + * so the vcore lock is dropped around it.
> + */
reduce this blob.
> + spin_unlock(&vc->lock);
> + xfer_to_guest_mode_handle_work();
> + spin_lock(&vc->lock);
> if (vc->vcore_state == VCORE_PREEMPT)
> kvmppc_vcore_end_preempt(vc);
> } else {
> @@ -4899,8 +4923,21 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
> }
> }
>
> - if (need_resched())
> - cond_resched();
> + /*
> + * Run pending work before (re-)entering the guest, most importantly
> + * task_work queued via TWA_RESUME (e.g. the deferred CFS bandwidth
> + * throttle, which only sets TIF_NOTIFY_RESUME). Without this a CPU-bound
> + * vCPU that keeps returning RESUME_GUEST never reaches an exit-to-user
> + * point, so the throttle is never enforced and the task runs far beyond
> + * its quota. The helper also handles reschedule and signals, replacing
> + * the cond_resched() that was here. It may schedule(), so it runs before
> + * preemption and IRQs are disabled, with no vcore/KVM locks held. This
> + * is the per-reentry site shared by the bare-metal and pseries (nested)
> + * paths, so both are covered.
> + */
reduce this blob.
> + r = kvm_xfer_to_guest_mode_handle_work(vcpu);
> + if (r) /* -EINTR: signal pending, exit to userspace (KVM_EXIT_INTR) */
> + return r;
>
> kvmppc_update_vpas(vcpu);
>
> @@ -4914,9 +4951,14 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>
> vcpu->arch.state = KVMPPC_VCPU_RUNNABLE;
>
> - if (signal_pending(current))
> - goto sigpend;
> - if (need_resched() || !kvm->arch.mmu_ready)
> + /*
> + * Final IRQs-disabled check for pending guest-mode work or an MMU that
> + * is not ready. IRQs are disabled here, so bail to the outer loop,
> + * which re-enters and handles the pending work via
> + * kvm_xfer_to_guest_mode_handle_work() above (exiting with -EINTR on a
> + * signal).
> + */
> + if (xfer_to_guest_mode_work_pending() || !kvm->arch.mmu_ready)
> goto out;
>
> vcpu->cpu = pcpu;
> @@ -5068,10 +5110,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>
> return vcpu->arch.ret;
>
> - sigpend:
> - vcpu->stat.signal_exits++;
> - run->exit_reason = KVM_EXIT_INTR;
> - vcpu->arch.ret = -EINTR;
> out:
> vcpu->cpu = -1;
> vcpu->arch.thread_cpu = -1;
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 00302399fc37..ff1a9a8de5e0 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -84,20 +84,36 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
> hard_irq_disable();
>
> while (true) {
> - if (need_resched()) {
> + if (__xfer_to_guest_mode_work_pending()) {
> + /*
> + * Handle pending guest-mode work before entering the
> + * guest: reschedule, signals, and TIF_NOTIFY_RESUME
> + * task_work such as the deferred CFS bandwidth throttle.
> + * The helper must run with interrupts enabled and may
> + * schedule(). This is a superset of the open-coded
> + * need_resched()/signal_pending() checks it replaces. On
> + * a pending signal it returns -EINTR after
> + * kvm_xfer_to_guest_mode_handle_work() has set
> + * run->exit_reason (KVM_EXIT_INTR) and bumped
> + * vcpu->stat.signal_exits, so just return to userspace.
> + */
Make this comment precise instead of describing __xfer_to_guest_mode_work_pending
> local_irq_enable();
> - cond_resched();
> + r = kvm_xfer_to_guest_mode_handle_work(vcpu);
> hard_irq_disable();
> + if (r) {
> + /*
> + * -EINTR: the generic helper does not set the
> + * exit type, so record it here for the E500
> + * CONFIG_KVM_EXIT_TIMING histogram (a no-op
> + * otherwise). Only the exit type is set;
> + * signal_exits was already accounted above.
> + */
> + kvmppc_set_exit_type(vcpu, SIGNAL_EXITS);
> + break;
> + }
> continue;
> }
>
> - if (signal_pending(current)) {
> - kvmppc_account_exit(vcpu, SIGNAL_EXITS);
> - vcpu->run->exit_reason = KVM_EXIT_INTR;
> - r = -EINTR;
> - break;
> - }
> -
> vcpu->mode = IN_GUEST_MODE;
>
> /*
prev parent reply other threads:[~2026-07-02 7:48 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-01 18:30 [PATCH v2 0/1] KVM: powerpc: Use generic xfer to guest work function Vishal Chourasia
2026-07-01 18:30 ` [PATCH v2 1/1] " Vishal Chourasia
2026-07-01 18:42 ` sashiko-bot
2026-07-02 7:47 ` Shrikanth Hegde [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=092d5f46-9809-41bb-b332-aa51ad1ed04d@linux.ibm.com \
--to=sshegde@linux.ibm.com \
--cc=amachhiw@linux.ibm.com \
--cc=bigeasy@linutronix.de \
--cc=chleroy@kernel.org \
--cc=gautam@linux.ibm.com \
--cc=harshpb@linux.ibm.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=maddy@linux.ibm.com \
--cc=mpe@ellerman.id.au \
--cc=npiggin@gmail.com \
--cc=vaibhav@linux.ibm.com \
--cc=vishalc@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox