From: Wanpeng Li <kernellwp@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	Wanpeng Li <wanpengli@tencent.com>,
	Richie Buturla <richie@linux.ibm.com>
Subject: [PATCH v3 05/10] sched/fair: Force a local resched on yield_to() so the buddy is picked
Date: Fri, 12 Jun 2026 09:33:50 +0800
Message-ID: <20260612013355.59231-6-kernellwp@gmail.com>
In-Reply-To: <20260612013355.59231-1-kernellwp@gmail.com>

From: Wanpeng Li <wanpengli@tencent.com>

Lag credit makes the target eligible for PICK_BUDDY, but yield_to() does
not by itself force the caller off the CPU. An active RUN_TO_PARITY
protect_slice() on the local yielder can therefore keep pick_eevdf()
returning the yielder instead of the credited buddy.

Add yield_to_local_force_resched() for the lag-credit path. It applies
the existing leaf forfeit (yield_task_fair()), cancels slice protection
along the yielder's sched_entity hierarchy, and calls resched_curr() on
the local rq. cancel_protect_slice() is already used by
PREEMPT_WAKEUP_SHORT and does not modify vruntime.

Rate-limit only the forced preemption (cancel_protect_slice() plus
resched_curr()) to once per 6ms per rq. The lag credit itself remains
unthrottled so each directed yield refreshes the scheduling hint, while
compute-bound guests avoid excessive forced preemption on PLE-heavy spin
loops.

Dbench (filesystem metadata operations) throughput improvement on
16-vCPU guests under host CPU overcommit, from the scheduler-side
changes alone:

  2 VMs: +6.65%
  3 VMs: +4.80%
  4 VMs: +7.59%

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
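Reviewer note, not part of the changelog: the rate limit described in
the changelog can be modelled in plain userspace C as below. This is a
minimal sketch, not the kernel code. struct rl_rq, its clock_ns field
and the RL_* macros are hypothetical stand-ins for struct rq, rq_clock()
and NSEC_PER_MSEC, and the clock is assumed to be monotonic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; none of these names exist in the kernel. */
#define RL_NSEC_PER_MSEC	1000000ULL
#define RL_WINDOW_NS		(6 * RL_NSEC_PER_MSEC)

struct rl_rq {
	uint64_t clock_ns;		/* models rq_clock(rq) */
	uint64_t last_forced_ns;	/* 0 means "never forced" */
};

/* Returns true when the forced reschedule should be skipped. */
static bool rl_force_resched_rate_limit(struct rl_rq *rq)
{
	uint64_t now = rq->clock_ns;
	uint64_t last = rq->last_forced_ns;

	/* Assumption of this model: the clock never runs backwards. */
	assert(now >= last);

	/* Still inside the window opened by the last forced reschedule. */
	if (last && (now - last) <= RL_WINDOW_NS)
		return true;

	/* Open a new window starting now. */
	rq->last_forced_ns = now;
	return false;
}
```

In this model a throttled call leaves the timestamp alone, so the 6ms
window is measured from the last reschedule that was actually forced,
and a stored timestamp of 0 is treated as "never forced".
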
 kernel/sched/fair.c  | 113 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  10 ++++
 2 files changed, 108 insertions(+), 15 deletions(-)
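The order of checks in yield_to_local_force_resched() in the diff below
can be sketched in userspace as follows. Assumptions: struct fr_se,
struct fr_rq and fr_local_force_resched() are hypothetical names, a
boolean stands in for active slice protection, the rate limiter's
verdict is passed in as a parameter, and the leaf forfeit done by
yield_task_fair() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel's sched_entity or rq. */
struct fr_se {
	struct fr_se *parent;	/* next level up, NULL at the root */
	bool slice_protected;	/* models active slice protection */
};

struct fr_rq {
	unsigned int nr_running;
	struct fr_se *yielder;	/* leaf entity of the yielding task */
	bool need_resched;	/* set where the patch calls resched_curr() */
};

/* Returns true if a local reschedule was forced. */
static bool fr_local_force_resched(struct fr_rq *rq, bool throttled)
{
	assert(rq && rq->yielder);

	/* Sole runnable task: nothing local to switch to. */
	if (rq->nr_running <= 1)
		return false;

	/* A recent forced reschedule already happened; keep protection. */
	if (throttled)
		return false;

	/* Drop slice protection at every level of the yielder's chain. */
	for (struct fr_se *se = rq->yielder; se; se = se->parent)
		se->slice_protected = false;

	rq->need_resched = true;
	return true;
}
```

In the patch the rate limiter is only consulted after the nr_running
check, so an rq with a single runnable task does not consume the 6ms
window. The model mirrors that by ignoring `throttled` in that case.
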
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48f65a4f1923..e9c5265cf0fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9485,6 +9485,83 @@ static void yield_task_fair(struct rq *rq)
 	}
 }
 
+/*
+ * Rate-limit the forced local reschedule on the yield_to() lag-credit path
+ * to at most once per 6ms per rq.
+ *
+ * Lag credit is intentionally not rate-limited: a contended lock holder
+ * should be credited on every directed yield to keep the scheduling hint
+ * effective. Only the forced preemption needs bounding, as cancelling
+ * RUN_TO_PARITY protection and calling resched_curr() on every PLE-driven
+ * yield_to() can cause excessive preemption on compute-bound guests.
+ *
+ * Returns true if the caller should skip forcing a reschedule because a
+ * recent one already happened on this rq; the credit just applied still
+ * persists, so the buddy can be selected at the next scheduling point.
+ *
+ * Called under rq->lock with rq_clock up to date. yield_to_task_fair()
+ * updates the clock before walking the hierarchy because yield_to() takes
+ * the rq locks without updating their clocks.
+ */
+static bool yield_to_force_resched_rate_limit(struct rq *rq)
+{
+	u64 now = rq_clock(rq);
+	u64 last = rq->yield_to_force_resched_last_ns;
+
+	if (last && (now - last) <= 6 * NSEC_PER_MSEC)
+		return true;
+
+	rq->yield_to_force_resched_last_ns = now;
+	return false;
+}
+
+/*
+ * Forfeit the local yielder, cancel its RUN_TO_PARITY slice protection
+ * along the whole sched_entity chain, and force a reschedule.
+ *
+ * yield_to() does not reschedule the caller, and an active protect_slice()
+ * at any level can keep pick_eevdf() returning the yielder instead of the
+ * credited buddy. cancel_protect_slice() is EEVDF-native (also used by
+ * PREEMPT_WAKEUP_SHORT) and does not touch vruntime. Caller holds the
+ * local rq lock via yield_to()'s double_rq_lock().
+ *
+ * Only the forced preemption here is rate-limited (to once per 6ms per rq);
+ * the lag credit applied by the caller runs on every yield_to(). When
+ * throttled, the credited buddy can still be selected at the next natural
+ * scheduling point without tearing down slice protection and forcing an
+ * immediate switch.
+ */
+static void yield_to_local_force_resched(struct rq *rq)
+{
+	struct sched_entity *yse = &rq->donor->se;
+
+	yield_task_fair(rq);
+
+	/*
+	 * If the yielder is the only runnable task on this rq, there is
+	 * nothing for resched_curr() to switch to: any credited buddy is on a
+	 * remote rq, and yield_to() issues resched_curr() on the target's rq in
+	 * that cross-rq case. Skip the forced reschedule: the yielder would
+	 * simply be picked again after a needless pass through the scheduler.
+	 * yield_task_fair() also returns early here without updating rq_clock.
+	 */
+	if (rq->nr_running <= 1)
+		return;
+
+	/*
+	 * Rate-limit the forced preemption (cancel_protect_slice + resched_curr)
+	 * to once per 6ms per rq. rq's clock was refreshed by the caller before
+	 * the credit walk, so rq_clock(rq) read here is current.
+	 */
+	if (yield_to_force_resched_rate_limit(rq))
+		return;
+
+	for_each_sched_entity(yse)
+		cancel_protect_slice(yse);
+
+	resched_curr(rq);
+}
+
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -9504,21 +9581,22 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 	}
 
 	/*
-	 * Walk the ancestor chain set_next_buddy() just nominated and credit
-	 * bounded lag to each not-yet-eligible level so pick_eevdf() returns
-	 * it. yield_to() holds both rq locks via double_rq_lock(), so touching
-	 * p's cfs_rqs (possibly on another CPU) is safe; the primitive is
-	 * idempotent, so no rate limiting is needed.
+	 * Walk the ancestor chain nominated by set_next_buddy() and credit
+	 * bounded lag to each not-yet-eligible level, so pick_eevdf() can
+	 * honor the buddy hint. Lag credit runs on every directed yield; only
+	 * the forced preemption in yield_to_local_force_resched() is
+	 * rate-limited. yield_to() holds both rq locks via double_rq_lock(),
+	 * so touching p's cfs_rqs (possibly on another CPU) is safe.
 	 *
-	 * Only refresh p_rq's clock when it differs from the local rq. A
-	 * remote p_rq must be refreshed so the per-level update_curr() is
-	 * accurate. In the same-rq case we skip it: the credit is a
-	 * best-effort hint and the rq clock is recent enough, while the
-	 * trailing yield_task_fair() would otherwise make this a second
-	 * update_rq_clock() on the same rq and trip
-	 * SCHED_WARN_ON(WARN_DOUBLE_CLOCK).
-	 */
-	if (rq != p_rq)
+	 * Refresh the local rq clock first: yield_to() took the locks without
+	 * updating any clock and the per-level update_curr() below reads
+	 * rq_clock; assert_clock_updated() (default-on, no sched_feat gate)
+	 * fires otherwise. For a remote p_rq refresh it too; in the same-rq
+	 * case the refresh above already covers it (a redundant update is only
+	 * warned about under the default-off WARN_DOUBLE_CLOCK).
+	 */
+	update_rq_clock(rq);
+	if (p_rq != rq)
 		update_rq_clock(p_rq);
 
 	for_each_sched_entity(se) {
@@ -9534,7 +9612,12 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 		eevdf_credit_entity_vlag(cfs_rq, se);
 	}
 
-	yield_task_fair(rq);
+	/*
+	 * Force the local CPU to reschedule so the credited buddy can be
+	 * selected instead of the protected yielder;
+	 * yield_to_local_force_resched() also does the leaf forfeit.
+	 */
+	yield_to_local_force_resched(rq);
 
 	return true;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..690a2ab99beb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1316,6 +1316,16 @@ struct rq {
 	unsigned int		ttwu_local;
 #endif
 
+	/*
+	 * Last rq_clock at which the yield_to() lag-credit path forced a local
+	 * reschedule on this rq. Used to rate-limit only the forced preemption
+	 * (cancel_protect_slice + resched_curr) to at most once per 6ms per rq,
+	 * preventing excessive forced preemption on PLE-heavy guests. The lag
+	 * credit itself is not rate-limited. Functional state, not a statistic,
+	 * so kept outside CONFIG_SCHEDSTATS.
+	 */
+	u64			yield_to_force_resched_last_ns;
+
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a RCU lock section */
 	struct cpuidle_state	*idle_state;
--
2.43.0