* [PATCH v2 0/7] sched: Various reweight_entity() fixes
@ 2026-02-19 7:58 Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
` (6 more replies)
0 siblings, 7 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
Hi,
So what started out as a few reweight fixes turned into a few more patches,
but it looks to be solid now.
Thanks for all the testing!
The plan is to stick the first 4 patches (those with a Fixes tag) into
tip/sched/urgent and the rest into tip/sched/core right after -rc1.
These patches have been in queue/sched/core for about a week now, but
I'll wipe that tree to start staging stuff for the post -rc1 tip trees.
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
` (2 more replies)
2026-02-19 7:58 ` [PATCH v2 2/7] sched/fair: Only set slice protection at pick time Peter Zijlstra
` (5 subsequent siblings)
6 siblings, 3 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
It turns out that zero_vruntime tracking is broken when there is but a single
task running. The current update paths are through __{en,de}queue_entity(), but
when there is only that one task, pick_next_task() will always return it, and
put_prev_set_next_task() will end up calling neither function.
This can cause entity_key() to grow indefinitely large and overflow, leading
to much pain and suffering.
Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
are called from {set_next,put_prev}_entity() has problems because:
- set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
This means the avg_vruntime() will see the removal but not current, missing
the entity for accounting.
- put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
NULL. This means the avg_vruntime() will see the addition *and* current,
leading to double accounting.
Both cases are incorrect/inconsistent.
Noting that avg_vruntime is already called on each {en,de}queue, remove the
explicit avg_vruntime() calls (which removes an extra 64bit division for each
{en,de}queue) and have avg_vruntime() update zero_vruntime itself.
Additionally, have the tick call avg_vruntime() -- discarding the result, but
for the side-effect of updating zero_vruntime.
While there, optimize avg_vruntime() by noting that the average of one value is
rather trivial to compute.
Test case:
# taskset -c -p 1 $$
# taskset -c 2 bash -c 'while :; do :; done&'
# cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
PRE:
.zero_vruntime : 31316.407903
>R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 382548.253179
>R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /
POST:
.zero_vruntime : 17259.709467
>R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 /
.zero_vruntime : 18702.723356
>R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /
Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
kernel/sched/fair.c | 84 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 57 insertions(+), 27 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -589,6 +589,21 @@ static inline bool entity_before(const s
return vruntime_cmp(a->deadline, "<", b->deadline);
}
+/*
+ * Per avg_vruntime() below, cfs_rq::zero_vruntime is only slightly stale
+ * and this value should be no more than two lag bounds. Which puts it in the
+ * general order of:
+ *
+ * (slice + TICK_NSEC) << NICE_0_LOAD_SHIFT
+ *
+ * which is around 44 bits in size (on 64bit); that is 20 for
+ * NICE_0_LOAD_SHIFT, another 20 for NSEC_PER_MSEC and then a handful for
+ * however many msec the actual slice+tick ends up being.
+ *
+ * (disregarding the actual divide-by-weight part makes for the worst case
+ * weight of 2, which nicely cancels vs the fuzz in zero_vruntime not actually
+ * being the zero-lag point).
+ */
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime);
@@ -676,39 +691,61 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq
}
static inline
-void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
{
/*
- * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight
+ * v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d*sum_weight
*/
cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
+ cfs_rq->zero_vruntime += delta;
}
/*
- * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
+ * Specifically: avg_vruntime() + 0 must result in entity_eligible() := true
* For this to be so, the result of this function must have a left bias.
+ *
+ * Called in:
+ * - place_entity() -- before enqueue
+ * - update_entity_lag() -- before dequeue
+ * - entity_tick()
+ *
+ * This means it is one entry 'behind' but that puts it close enough to where
+ * the bound on entity_key() is at most two lag bounds.
*/
u64 avg_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
- s64 avg = cfs_rq->sum_w_vruntime;
- long load = cfs_rq->sum_weight;
+ long weight = cfs_rq->sum_weight;
+ s64 delta = 0;
- if (curr && curr->on_rq) {
- unsigned long weight = scale_load_down(curr->load.weight);
+ if (curr && !curr->on_rq)
+ curr = NULL;
- avg += entity_key(cfs_rq, curr) * weight;
- load += weight;
- }
+ if (weight) {
+ s64 runtime = cfs_rq->sum_w_vruntime;
+
+ if (curr) {
+ unsigned long w = scale_load_down(curr->load.weight);
+
+ runtime += entity_key(cfs_rq, curr) * w;
+ weight += w;
+ }
- if (load) {
/* sign flips effective floor / ceiling */
- if (avg < 0)
- avg -= (load - 1);
- avg = div_s64(avg, load);
+ if (runtime < 0)
+ runtime -= (weight - 1);
+
+ delta = div_s64(runtime, weight);
+ } else if (curr) {
+ /*
+ * When there is but one element, it is the average.
+ */
+ delta = curr->vruntime - cfs_rq->zero_vruntime;
}
- return cfs_rq->zero_vruntime + avg;
+ update_zero_vruntime(cfs_rq, delta);
+
+ return cfs_rq->zero_vruntime;
}
/*
@@ -777,16 +814,6 @@ int entity_eligible(struct cfs_rq *cfs_r
return vruntime_eligible(cfs_rq, se->vruntime);
}
-static void update_zero_vruntime(struct cfs_rq *cfs_rq)
-{
- u64 vruntime = avg_vruntime(cfs_rq);
- s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
-
- sum_w_vruntime_update(cfs_rq, delta);
-
- cfs_rq->zero_vruntime = vruntime;
-}
-
static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
{
struct sched_entity *root = __pick_root_entity(cfs_rq);
@@ -856,7 +883,6 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
sum_w_vruntime_add(cfs_rq, se);
- update_zero_vruntime(cfs_rq);
se->min_vruntime = se->vruntime;
se->min_slice = se->slice;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -868,7 +894,6 @@ static void __dequeue_entity(struct cfs_
rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
&min_vruntime_cb);
sum_w_vruntime_sub(cfs_rq, se);
- update_zero_vruntime(cfs_rq);
}
struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
@@ -5524,6 +5549,11 @@ entity_tick(struct cfs_rq *cfs_rq, struc
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
+ /*
+ * Pulls along cfs_rq::zero_vruntime.
+ */
+ avg_vruntime(cfs_rq);
+
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* [PATCH v2 2/7] sched/fair: Only set slice protection at pick time
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 3/7] sched/eevdf: Update se->vprot in reweight_entity() Peter Zijlstra
` (4 subsequent siblings)
6 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
We should not (re)set slice protection in the sched_change pattern
which calls put_prev_task() / set_next_task().
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5444,7 +5444,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
}
static void
-set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
+set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
{
clear_buddies(cfs_rq, se);
@@ -5459,7 +5459,8 @@ set_next_entity(struct cfs_rq *cfs_rq, s
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
- set_protect_slice(cfs_rq, se);
+ if (first)
+ set_protect_slice(cfs_rq, se);
}
update_stats_curr_start(cfs_rq, se);
@@ -8977,13 +8978,13 @@ pick_next_task_fair(struct rq *rq, struc
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
- set_next_entity(cfs_rq_of(se), se);
+ set_next_entity(cfs_rq_of(se), se, true);
se = parent_entity(se);
}
}
put_prev_entity(cfs_rq, pse);
- set_next_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se, true);
__set_next_task_fair(rq, p, true);
}
@@ -13597,7 +13598,7 @@ static void set_next_task_fair(struct rq
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- set_next_entity(cfs_rq, se);
+ set_next_entity(cfs_rq, se, first);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
* [PATCH v2 3/7] sched/eevdf: Update se->vprot in reweight_entity()
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 2/7] sched/fair: Only set slice protection at pick time Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
` (3 subsequent siblings)
6 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Zhang Qiao
From: Wang Tao <wangtao554@huawei.com>
In the EEVDF framework with Run-to-Parity protection, `se->vprot` is an
independent variable defining the virtual protection timestamp.
When `reweight_entity()` is called (e.g., via nice/renice), it performs
the following actions to preserve Lag consistency:
1. Scales `se->vlag` based on the new weight.
2. Calls `place_entity()`, which recalculates `se->vruntime` based on
the new weight and scaled lag.
However, the current implementation fails to update `se->vprot`, leading
to mismatches between the task's actual runtime and its expected duration.
Fixes: 63304558ba5d ("sched/eevdf: Curb wakeup-preemption")
Suggested-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Wang Tao <wangtao554@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20260120123113.3518950-1-wangtao554@huawei.com
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3814,6 +3814,8 @@ static void reweight_entity(struct cfs_r
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
+ bool rel_vprot = false;
+ u64 vprot;
if (se->on_rq) {
/* commit outstanding execution time */
@@ -3821,6 +3823,11 @@ static void reweight_entity(struct cfs_r
update_entity_lag(cfs_rq, se);
se->deadline -= se->vruntime;
se->rel_deadline = 1;
+ if (curr && protect_slice(se)) {
+ vprot = se->vprot - se->vruntime;
+ rel_vprot = true;
+ }
+
cfs_rq->nr_queued--;
if (!curr)
__dequeue_entity(cfs_rq, se);
@@ -3836,6 +3843,9 @@ static void reweight_entity(struct cfs_r
if (se->rel_deadline)
se->deadline = div_s64(se->deadline * se->load.weight, weight);
+ if (rel_vprot)
+ vprot = div_s64(vprot * se->load.weight, weight);
+
update_load_set(&se->load, weight);
do {
@@ -3847,6 +3857,8 @@ static void reweight_entity(struct cfs_r
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
place_entity(cfs_rq, se, 0);
+ if (rel_vprot)
+ se->vprot = se->vruntime + vprot;
update_load_add(&cfs_rq->load, se->load.weight);
if (!curr)
__enqueue_entity(cfs_rq, se);
* [PATCH v2 4/7] sched/fair: Fix lag clamp
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
` (2 preceding siblings ...)
2026-02-19 7:58 ` [PATCH v2 3/7] sched/eevdf: Update se->vprot in reweight_entity() Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-23 10:23 ` Dietmar Eggemann
2026-02-23 10:57 ` Vincent Guittot
2026-02-19 7:58 ` [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime Peter Zijlstra
` (2 subsequent siblings)
6 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
Vincent reported that he was seeing undue lag clamping in a mixed
slice workload. Implement the max_slice tracking as per the todo
comment.
Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
2 files changed, 36 insertions(+), 4 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -574,6 +574,7 @@ struct sched_entity {
u64 deadline;
u64 min_vruntime;
u64 min_slice;
+ u64 max_slice;
struct list_head group_node;
unsigned char on_rq;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -748,6 +748,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
return cfs_rq->zero_vruntime;
}
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
+
/*
* lag_i = S - s_i = w_i * (V - v_i)
*
@@ -761,17 +763,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
* EEVDF gives the following limit for a steady state system:
*
* -r_max < lag < max(r_max, q)
- *
- * XXX could add max_slice to the augmented data to track this.
*/
static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
s64 vlag, limit;
WARN_ON_ONCE(!se->on_rq);
vlag = avg_vruntime(cfs_rq) - se->vruntime;
- limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+ limit = calc_delta_fair(max_slice, se);
se->vlag = clamp(vlag, -limit, limit);
}
@@ -829,6 +830,21 @@ static inline u64 cfs_rq_min_slice(struc
return min_slice;
}
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *root = __pick_root_entity(cfs_rq);
+ struct sched_entity *curr = cfs_rq->curr;
+ u64 max_slice = 0ULL;
+
+ if (curr && curr->on_rq)
+ max_slice = curr->slice;
+
+ if (root)
+ max_slice = max(max_slice, root->max_slice);
+
+ return max_slice;
+}
+
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
{
return entity_before(__node_2_se(a), __node_2_se(b));
@@ -853,6 +869,15 @@ static inline void __min_slice_update(st
}
}
+static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+ if (node) {
+ struct sched_entity *rse = __node_2_se(node);
+ if (rse->max_slice > se->max_slice)
+ se->max_slice = rse->max_slice;
+ }
+}
+
/*
* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
@@ -860,6 +885,7 @@ static inline bool min_vruntime_update(s
{
u64 old_min_vruntime = se->min_vruntime;
u64 old_min_slice = se->min_slice;
+ u64 old_max_slice = se->max_slice;
struct rb_node *node = &se->run_node;
se->min_vruntime = se->vruntime;
@@ -870,8 +896,13 @@ static inline bool min_vruntime_update(s
__min_slice_update(se, node->rb_right);
__min_slice_update(se, node->rb_left);
+ se->max_slice = se->slice;
+ __max_slice_update(se, node->rb_right);
+ __max_slice_update(se, node->rb_left);
+
return se->min_vruntime == old_min_vruntime &&
- se->min_slice == old_min_slice;
+ se->min_slice == old_min_slice &&
+ se->max_slice == old_max_slice;
}
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
* [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
` (3 preceding siblings ...)
2026-02-19 7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
2026-02-19 7:58 ` [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta() Peter Zijlstra
6 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
Due to the zero_vruntime patch, the deltas are now a lot smaller;
measurements with kernel-build and hackbench runs show about 45 bits
used.
This ensures avg_vruntime() tracks the full weight range, reducing
numerical artifacts in reweight and the like.
Also, let's keep the paranoid debug code around for now.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
kernel/sched/debug.c | 14 ++++++-
kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++----------
kernel/sched/features.h | 2 +
kernel/sched/sched.h | 3 +
4 files changed, 90 insertions(+), 20 deletions(-)
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,6 +8,7 @@
*/
#include <linux/debugfs.h>
#include <linux/nmi.h>
+#include <linux/log2.h>
#include "sched.h"
/*
@@ -901,10 +902,13 @@ static void print_rq(struct seq_file *m,
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
- s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+ s64 left_vruntime = -1, right_vruntime = -1, left_deadline = -1, spread;
+ s64 zero_vruntime = -1, sum_w_vruntime = -1;
struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
+ unsigned int sum_shift;
unsigned long flags;
+ u64 sum_weight;
#ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, "\n");
@@ -925,6 +929,9 @@ void print_cfs_rq(struct seq_file *m, in
if (last)
right_vruntime = last->vruntime;
zero_vruntime = cfs_rq->zero_vruntime;
+ sum_w_vruntime = cfs_rq->sum_w_vruntime;
+ sum_weight = cfs_rq->sum_weight;
+ sum_shift = cfs_rq->sum_shift;
raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -933,6 +940,11 @@ void print_cfs_rq(struct seq_file *m, in
SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
SPLIT_NS(zero_vruntime));
+ SEQ_printf(m, " .%-30s: %Ld (%d bits)\n", "sum_w_vruntime",
+ sum_w_vruntime, ilog2(abs(sum_w_vruntime)));
+ SEQ_printf(m, " .%-30s: %Lu\n", "sum_weight",
+ sum_weight);
+ SEQ_printf(m, " .%-30s: %u\n", "sum_shift", sum_shift);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
SPLIT_NS(avg_vruntime(cfs_rq)));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -665,15 +665,20 @@ static inline s64 entity_key(struct cfs_
* Since zero_vruntime closely tracks the per-task service, these
* deltas: (v_i - v0), will be in the order of the maximal (virtual) lag
* induced in the system due to quantisation.
- *
- * Also, we use scale_load_down() to reduce the size.
- *
- * As measured, the max (key * weight) value was ~44 bits for a kernel build.
*/
-static void
-sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static inline unsigned long avg_vruntime_weight(struct cfs_rq *cfs_rq, unsigned long w)
+{
+#ifdef CONFIG_64BIT
+ if (cfs_rq->sum_shift)
+ w = max(2UL, w >> cfs_rq->sum_shift);
+#endif
+ return w;
+}
+
+static inline void
+__sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- unsigned long weight = scale_load_down(se->load.weight);
+ unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
s64 key = entity_key(cfs_rq, se);
cfs_rq->sum_w_vruntime += key * weight;
@@ -681,9 +686,59 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
}
static void
+sum_w_vruntime_add_paranoid(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ unsigned long weight;
+ s64 key, tmp;
+
+again:
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ key = entity_key(cfs_rq, se);
+
+ if (check_mul_overflow(key, weight, &key))
+ goto overflow;
+
+ if (check_add_overflow(cfs_rq->sum_w_vruntime, key, &tmp))
+ goto overflow;
+
+ cfs_rq->sum_w_vruntime = tmp;
+ cfs_rq->sum_weight += weight;
+ return;
+
+overflow:
+ /*
+ * There's gotta be a limit -- if we're still failing at this point
+ * there's really nothing much to be done about things.
+ */
+ BUG_ON(cfs_rq->sum_shift >= 10);
+ cfs_rq->sum_shift++;
+
+ /*
+ * Note: \Sum (k_i * (w_i >> 1)) != (\Sum (k_i * w_i)) >> 1
+ */
+ cfs_rq->sum_w_vruntime = 0;
+ cfs_rq->sum_weight = 0;
+
+ for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+ node; node = rb_next(node))
+ __sum_w_vruntime_add(cfs_rq, __node_2_se(node));
+
+ goto again;
+}
+
+static void
+sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ if (sched_feat(PARANOID_AVG))
+ return sum_w_vruntime_add_paranoid(cfs_rq, se);
+
+ __sum_w_vruntime_add(cfs_rq, se);
+}
+
+static void
sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- unsigned long weight = scale_load_down(se->load.weight);
+ unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
s64 key = entity_key(cfs_rq, se);
cfs_rq->sum_w_vruntime -= key * weight;
@@ -725,7 +780,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
s64 runtime = cfs_rq->sum_w_vruntime;
if (curr) {
- unsigned long w = scale_load_down(curr->load.weight);
+ unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
runtime += entity_key(cfs_rq, curr) * w;
weight += w;
@@ -735,7 +790,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
if (runtime < 0)
runtime -= (weight - 1);
- delta = div_s64(runtime, weight);
+ delta = div64_long(runtime, weight);
} else if (curr) {
/*
* When there is but one element, it is the average.
@@ -801,7 +856,7 @@ static int vruntime_eligible(struct cfs_
long load = cfs_rq->sum_weight;
if (curr && curr->on_rq) {
- unsigned long weight = scale_load_down(curr->load.weight);
+ unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
avg += entity_key(cfs_rq, curr) * weight;
load += weight;
@@ -3871,12 +3926,12 @@ static void reweight_entity(struct cfs_r
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * se->load.weight, weight);
+ se->vlag = div64_long(se->vlag * se->load.weight, weight);
if (se->rel_deadline)
- se->deadline = div_s64(se->deadline * se->load.weight, weight);
+ se->deadline = div64_long(se->deadline * se->load.weight, weight);
if (rel_vprot)
- vprot = div_s64(vprot * se->load.weight, weight);
+ vprot = div64_long(vprot * se->load.weight, weight);
update_load_set(&se->load, weight);
@@ -5180,7 +5235,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- unsigned long load;
+ long load;
lag = se->vlag;
@@ -5238,12 +5293,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
*/
load = cfs_rq->sum_weight;
if (curr && curr->on_rq)
- load += scale_load_down(curr->load.weight);
+ load += avg_vruntime_weight(cfs_rq, curr->load.weight);
- lag *= load + scale_load_down(se->load.weight);
+ lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
if (WARN_ON_ONCE(!load))
load = 1;
- lag = div_s64(lag, load);
+ lag = div64_long(lag, load);
}
se->vruntime = vruntime - lag;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -58,6 +58,8 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
SCHED_FEAT(DELAY_DEQUEUE, true)
SCHED_FEAT(DELAY_ZERO, true)
+SCHED_FEAT(PARANOID_AVG, false)
+
/*
* Allow wakeup-time preemption of the current task:
*/
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -684,8 +684,9 @@ struct cfs_rq {
s64 sum_w_vruntime;
u64 sum_weight;
-
u64 zero_vruntime;
+ unsigned int sum_shift;
+
#ifdef CONFIG_SCHED_CORE
unsigned int forceidle_seq;
u64 zero_vruntime_fi;
* [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
` (4 preceding siblings ...)
2026-02-19 7:58 ` [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-23 10:57 ` Vincent Guittot
2026-02-19 7:58 ` [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta() Peter Zijlstra
6 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
Zicheng Qu reported that place_entity() doesn't work right, because
avg_vruntime() always includes cfs_rq->curr when it is ->on_rq.
Specifically, the lag scaling in place_entity() relies on
avg_vruntime() being the state *before* placement of the new entity.
However in this case avg_vruntime() will actually already include the
entity, which breaks things.
Also, Zicheng Qu argues that avg_vruntime should be invariant under
reweight. IOW commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") was wrong!
The issue reported in 6d71a9c61604 could possibly be explained by
rounding artifacts -- notably the extreme weight '2' is outside of the
range of avg_vruntime/sum_w_vruntime, since that uses
scale_load_down(). By scaling vruntime by the real weight, but
accounting it in avg_vruntime with a factor 1024 more, the average moves
significantly. However, that is now cured.
Tested by reverting 66951e4860d3 ("sched/fair: Fix update_cfs_group()
vs DELAY_DEQUEUE") and tracing vruntime and vlag figures again.
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
kernel/sched/fair.c | 148 +++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 124 insertions(+), 24 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -819,17 +819,22 @@ static inline u64 cfs_rq_max_slice(struc
*
* -r_max < lag < max(r_max, q)
*/
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avruntime)
{
u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
s64 vlag, limit;
- WARN_ON_ONCE(!se->on_rq);
-
- vlag = avg_vruntime(cfs_rq) - se->vruntime;
+ vlag = avruntime - se->vruntime;
limit = calc_delta_fair(max_slice, se);
- se->vlag = clamp(vlag, -limit, limit);
+ return clamp(vlag, -limit, limit);
+}
+
+static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ WARN_ON_ONCE(!se->on_rq);
+
+ se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
}
/*
@@ -3895,23 +3900,125 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
se_weight(se) * -se->avg.load_sum);
}
-static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
+static void
+rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
+{
+ unsigned long old_weight = se->load.weight;
+
+ /*
+ * VRUNTIME
+ * --------
+ *
+ * COROLLARY #1: The virtual runtime of the entity needs to be
+ * adjusted if re-weight at !0-lag point.
+ *
+ * Proof: For contradiction assume this is not true, so we can
+ * re-weight without changing vruntime at !0-lag point.
+ *
+ * Weight VRuntime Avg-VRuntime
+ * before w v V
+ * after w' v' V'
+ *
+ * Since lag needs to be preserved through re-weight:
+ *
+ * lag = (V - v)*w = (V'- v')*w', where v = v'
+ * ==> V' = (V - v)*w/w' + v (1)
+ *
+ * Let W be the total weight of the entities before reweight,
+ * since V' is the new weighted average of entities:
+ *
+ * V' = (WV + w'v - wv) / (W + w' - w) (2)
+ *
+ * by using (1) & (2) we obtain:
+ *
+ * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+ * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+ * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+ * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+ *
+ * Since we are doing at !0-lag point which means V != v, we
+ * can simplify (3):
+ *
+ * ==> W / (W + w' - w) = w / w'
+ * ==> Ww' = Ww + ww' - ww
+ * ==> W * (w' - w) = w * (w' - w)
+ * ==> W = w (re-weight indicates w' != w)
+ *
+ * So the cfs_rq contains only one entity, hence vruntime of
+ * the entity @v should always equal to the cfs_rq's weighted
+ * average vruntime @V, which means we will always re-weight
+ * at 0-lag point, thus breach assumption. Proof completed.
+ *
+ *
+ * COROLLARY #2: Re-weight does NOT affect weighted average
+ * vruntime of all the entities.
+ *
+ * Proof: According to corollary #1, Eq. (1) should be:
+ *
+ * (V - v)*w = (V' - v')*w'
+ * ==> v' = V' - (V - v)*w/w' (4)
+ *
+ * According to the weighted average formula, we have:
+ *
+ * V' = (WV - wv + w'v') / (W - w + w')
+ * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+ * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+ * = (WV + w'V' - Vw) / (W - w + w')
+ *
+ * ==> V'*(W - w + w') = WV + w'V' - Vw
+ * ==> V' * (W - w) = (W - w) * V (5)
+ *
+ * If the entity is the only one in the cfs_rq, then reweight
+ * always occurs at 0-lag point, so V won't change. Or else
+ * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ se->vlag = div64_long(se->vlag * old_weight, weight);
+
+ /*
+ * DEADLINE
+ * --------
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ if (se->rel_deadline)
+ se->deadline = div64_long(se->deadline * old_weight, weight);
+
+ if (rel_vprot)
+ se->vprot = div64_long(se->vprot * old_weight, weight);
+}
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
bool curr = cfs_rq->curr == se;
bool rel_vprot = false;
- u64 vprot;
+ u64 avruntime = 0;
if (se->on_rq) {
/* commit outstanding execution time */
update_curr(cfs_rq);
- update_entity_lag(cfs_rq, se);
- se->deadline -= se->vruntime;
+ avruntime = avg_vruntime(cfs_rq);
+ se->vlag = entity_lag(cfs_rq, se, avruntime);
+ se->deadline -= avruntime;
se->rel_deadline = 1;
if (curr && protect_slice(se)) {
- vprot = se->vprot - se->vruntime;
+ se->vprot -= avruntime;
rel_vprot = true;
}
@@ -3922,30 +4029,23 @@ static void reweight_entity(struct cfs_r
}
dequeue_load_avg(cfs_rq, se);
- /*
- * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
- * we need to scale se->vlag when w_i changes.
- */
- se->vlag = div64_long(se->vlag * se->load.weight, weight);
- if (se->rel_deadline)
- se->deadline = div64_long(se->deadline * se->load.weight, weight);
-
- if (rel_vprot)
- vprot = div64_long(vprot * se->load.weight, weight);
+ rescale_entity(se, weight, rel_vprot);
update_load_set(&se->load, weight);
do {
u32 divider = get_pelt_divider(&se->avg);
-
se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
} while (0);
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
- place_entity(cfs_rq, se, 0);
if (rel_vprot)
- se->vprot = se->vruntime + vprot;
+ se->vprot += avruntime;
+ se->deadline += avruntime;
+ se->rel_deadline = 0;
+ se->vruntime = avruntime - se->vlag;
+
update_load_add(&cfs_rq->load, se->load.weight);
if (!curr)
__enqueue_entity(cfs_rq, se);
@@ -5303,7 +5403,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
se->vruntime = vruntime - lag;
- if (se->rel_deadline) {
+ if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
return;
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta()
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
` (5 preceding siblings ...)
2026-02-19 7:58 ` [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Peter Zijlstra
@ 2026-02-19 7:58 ` Peter Zijlstra
2026-02-23 10:57 ` Vincent Guittot
6 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-19 7:58 UTC (permalink / raw)
To: mingo
Cc: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang
Since we now use the full weight for avg_vruntime(), also make
__calc_delta() use the full value.
Since weight is effectively NICE_0_LOAD, this is 20 bits on 64bit.
This leaves 44 bits for delta_exec, which is ~16k seconds, way longer
than any one tick would ever be, so no worry about overflow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -225,6 +225,7 @@ void __init sched_init_granularity(void)
update_sysctl();
}
+#ifndef CONFIG_64BIT
#define WMULT_CONST (~0U)
#define WMULT_SHIFT 32
@@ -283,6 +284,12 @@ static u64 __calc_delta(u64 delta_exec,
return mul_u64_u32_shr(delta_exec, fact, shift);
}
+#else
+static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
+{
+ return (delta_exec * weight) / lw->weight;
+}
+#endif
/*
* delta /= w
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 4/7] sched/fair: Fix lag clamp
2026-02-19 7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
@ 2026-02-23 10:23 ` Dietmar Eggemann
2026-02-23 10:57 ` Vincent Guittot
1 sibling, 0 replies; 55+ messages in thread
From: Dietmar Eggemann @ 2026-02-23 10:23 UTC (permalink / raw)
To: Peter Zijlstra, mingo
Cc: juri.lelli, vincent.guittot, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wangtao554, quzicheng, kprateek.nayak, dsmythies,
shubhang
On 19.02.26 08:58, Peter Zijlstra wrote:
[...]
> @@ -761,17 +763,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> * EEVDF gives the following limit for a steady state system:
> *
> * -r_max < lag < max(r_max, q)
> - *
> - * XXX could add max_slice to the augmented data to track this.
> */
> static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
> s64 vlag, limit;
>
> WARN_ON_ONCE(!se->on_rq);
>
> vlag = avg_vruntime(cfs_rq) - se->vruntime;
> - limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> + limit = calc_delta_fair(max_slice, se);
>
> se->vlag = clamp(vlag, -limit, limit);
> }
nitpick:
The "Limit this to either double the slice length with a minimum of
TICK_NSEC ..." in the function comment header doesn't match anymore.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-19 7:58 ` [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime Peter Zijlstra
@ 2026-02-23 10:56 ` Vincent Guittot
2026-02-23 11:51 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 10:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Due to the zero_vruntime patch, the deltas are now a lot smaller and
> measurement with kernel-build and hackbench runs show about 45 bits
> used.
>
> This ensures avg_vruntime() tracks the full weight range, reducing
> numerical artifacts in reweight and the like.
Instead of the PARANOID_AVG code, would it be better to add a WARN_ONCE?
I'm afraid we will not notice any potential overflow without a
long study of the regression with SCHED_FEAT(PARANOID_AVG, false)
Couldn't we add a cheaper WARN_ONCE (key > 2^50) in __sum_w_vruntime_add ?
We should always have
key < 110ms (max slice+max tick) * nice_0 (2^20) / weight (2)
key < 2^46
We can use 50 bits to get margin
Weight is always less than 27 bits, and key*weight gives us 110ms (max
slice+max tick) * nice_0 (2^20), so we should never add more than 2^47
to ->sum_w_vruntime
so a WARN_ONCE(cfs_rq->sum_w_vruntime > 2^63) should be enough
>
> Also, let's keep the paranoid debug code around for now.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> ---
> kernel/sched/debug.c | 14 ++++++-
> kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++----------
> kernel/sched/features.h | 2 +
> kernel/sched/sched.h | 3 +
> 4 files changed, 90 insertions(+), 20 deletions(-)
>
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -8,6 +8,7 @@
> */
> #include <linux/debugfs.h>
> #include <linux/nmi.h>
> +#include <linux/log2.h>
> #include "sched.h"
>
> /*
> @@ -901,10 +902,13 @@ static void print_rq(struct seq_file *m,
>
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> {
> - s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> + s64 left_vruntime = -1, right_vruntime = -1, left_deadline = -1, spread;
> + s64 zero_vruntime = -1, sum_w_vruntime = -1;
> struct sched_entity *last, *first, *root;
> struct rq *rq = cpu_rq(cpu);
> + unsigned int sum_shift;
> unsigned long flags;
> + u64 sum_weight;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> SEQ_printf(m, "\n");
> @@ -925,6 +929,9 @@ void print_cfs_rq(struct seq_file *m, in
> if (last)
> right_vruntime = last->vruntime;
> zero_vruntime = cfs_rq->zero_vruntime;
> + sum_w_vruntime = cfs_rq->sum_w_vruntime;
> + sum_weight = cfs_rq->sum_weight;
> + sum_shift = cfs_rq->sum_shift;
> raw_spin_rq_unlock_irqrestore(rq, flags);
>
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
> @@ -933,6 +940,11 @@ void print_cfs_rq(struct seq_file *m, in
> SPLIT_NS(left_vruntime));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
> SPLIT_NS(zero_vruntime));
> + SEQ_printf(m, " .%-30s: %Ld (%d bits)\n", "sum_w_vruntime",
> + sum_w_vruntime, ilog2(abs(sum_w_vruntime)));
> + SEQ_printf(m, " .%-30s: %Lu\n", "sum_weight",
> + sum_weight);
> + SEQ_printf(m, " .%-30s: %u\n", "sum_shift", sum_shift);
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
> SPLIT_NS(avg_vruntime(cfs_rq)));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -665,15 +665,20 @@ static inline s64 entity_key(struct cfs_
> * Since zero_vruntime closely tracks the per-task service, these
> * deltas: (v_i - v0), will be in the order of the maximal (virtual) lag
> * induced in the system due to quantisation.
> - *
> - * Also, we use scale_load_down() to reduce the size.
> - *
> - * As measured, the max (key * weight) value was ~44 bits for a kernel build.
> */
> -static void
> -sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static inline unsigned long avg_vruntime_weight(struct cfs_rq *cfs_rq, unsigned long w)
> +{
> +#ifdef CONFIG_64BIT
> + if (cfs_rq->sum_shift)
> + w = max(2UL, w >> cfs_rq->sum_shift);
> +#endif
> + return w;
> +}
> +
> +static inline void
> +__sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - unsigned long weight = scale_load_down(se->load.weight);
> + unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> s64 key = entity_key(cfs_rq, se);
>
> cfs_rq->sum_w_vruntime += key * weight;
> @@ -681,9 +686,59 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq
> }
>
> static void
> +sum_w_vruntime_add_paranoid(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> + unsigned long weight;
> + s64 key, tmp;
> +
> +again:
> + weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> + key = entity_key(cfs_rq, se);
> +
> + if (check_mul_overflow(key, weight, &key))
> + goto overflow;
> +
> + if (check_add_overflow(cfs_rq->sum_w_vruntime, key, &tmp))
> + goto overflow;
> +
> + cfs_rq->sum_w_vruntime = tmp;
> + cfs_rq->sum_weight += weight;
> + return;
> +
> +overflow:
> + /*
> + * There's gotta be a limit -- if we're still failing at this point
> + * there's really nothing much to be done about things.
> + */
> + BUG_ON(cfs_rq->sum_shift >= 10);
> + cfs_rq->sum_shift++;
> +
> + /*
> + * Note: \Sum (k_i * (w_i >> 1)) != (\Sum (k_i * w_i)) >> 1
> + */
> + cfs_rq->sum_w_vruntime = 0;
> + cfs_rq->sum_weight = 0;
> +
> + for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
> + node; node = rb_next(node))
> + __sum_w_vruntime_add(cfs_rq, __node_2_se(node));
> +
> + goto again;
> +}
> +
> +static void
> +sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> + if (sched_feat(PARANOID_AVG))
> + return sum_w_vruntime_add_paranoid(cfs_rq, se);
> +
> + __sum_w_vruntime_add(cfs_rq, se);
> +}
> +
> +static void
> sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - unsigned long weight = scale_load_down(se->load.weight);
> + unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> s64 key = entity_key(cfs_rq, se);
>
> cfs_rq->sum_w_vruntime -= key * weight;
> @@ -725,7 +780,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> s64 runtime = cfs_rq->sum_w_vruntime;
>
> if (curr) {
> - unsigned long w = scale_load_down(curr->load.weight);
> + unsigned long w = avg_vruntime_weight(cfs_rq, curr->load.weight);
>
> runtime += entity_key(cfs_rq, curr) * w;
> weight += w;
> @@ -735,7 +790,7 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> if (runtime < 0)
> runtime -= (weight - 1);
>
> - delta = div_s64(runtime, weight);
> + delta = div64_long(runtime, weight);
> } else if (curr) {
> /*
> * When there is but one element, it is the average.
> @@ -801,7 +856,7 @@ static int vruntime_eligible(struct cfs_
> long load = cfs_rq->sum_weight;
>
> if (curr && curr->on_rq) {
> - unsigned long weight = scale_load_down(curr->load.weight);
> + unsigned long weight = avg_vruntime_weight(cfs_rq, curr->load.weight);
>
> avg += entity_key(cfs_rq, curr) * weight;
> load += weight;
> @@ -3871,12 +3926,12 @@ static void reweight_entity(struct cfs_r
> * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
> * we need to scale se->vlag when w_i changes.
> */
> - se->vlag = div_s64(se->vlag * se->load.weight, weight);
> + se->vlag = div64_long(se->vlag * se->load.weight, weight);
> if (se->rel_deadline)
> - se->deadline = div_s64(se->deadline * se->load.weight, weight);
> + se->deadline = div64_long(se->deadline * se->load.weight, weight);
>
> if (rel_vprot)
> - vprot = div_s64(vprot * se->load.weight, weight);
> + vprot = div64_long(vprot * se->load.weight, weight);
>
> update_load_set(&se->load, weight);
>
> @@ -5180,7 +5235,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
> */
> if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
> struct sched_entity *curr = cfs_rq->curr;
> - unsigned long load;
> + long load;
>
> lag = se->vlag;
>
> @@ -5238,12 +5293,12 @@ place_entity(struct cfs_rq *cfs_rq, stru
> */
> load = cfs_rq->sum_weight;
> if (curr && curr->on_rq)
> - load += scale_load_down(curr->load.weight);
> + load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>
> - lag *= load + scale_load_down(se->load.weight);
> + lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
> if (WARN_ON_ONCE(!load))
> load = 1;
> - lag = div_s64(lag, load);
> + lag = div64_long(lag, load);
> }
>
> se->vruntime = vruntime - lag;
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -58,6 +58,8 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
> SCHED_FEAT(DELAY_DEQUEUE, true)
> SCHED_FEAT(DELAY_ZERO, true)
>
> +SCHED_FEAT(PARANOID_AVG, false)
> +
> /*
> * Allow wakeup-time preemption of the current task:
> */
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -684,8 +684,9 @@ struct cfs_rq {
>
> s64 sum_w_vruntime;
> u64 sum_weight;
> -
> u64 zero_vruntime;
> + unsigned int sum_shift;
> +
> #ifdef CONFIG_SCHED_CORE
> unsigned int forceidle_seq;
> u64 zero_vruntime_fi;
>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
@ 2026-02-23 10:56 ` Vincent Guittot
2026-02-23 13:09 ` Dietmar Eggemann
2026-03-28 5:44 ` John Stultz
2 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 10:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> It turns out that zero_vruntime tracking is broken when there is but a single
> task running. Current update paths are through __{en,de}queue_entity(), and
> when there is but a single task, pick_next_task() will always return that one
> task, and put_prev_set_next_task() will end up in neither function.
>
> This can cause entity_key() to grow indefinitely large and cause overflows,
> leading to much pain and suffering.
>
> Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> are called from {set_next,put_prev}_entity() has problems because:
>
> - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> This means the avg_vruntime() will see the removal but not current, missing
> the entity for accounting.
>
> - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> NULL. This means the avg_vruntime() will see the addition *and* current,
> leading to double accounting.
>
> Both cases are incorrect/inconsistent.
>
> Noting that avg_vruntime is already called on each {en,de}queue, remove the
> explicit avg_vruntime() calls (which removes an extra 64bit division for each
> {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
>
> Additionally, have the tick call avg_vruntime() -- discarding the result, but
> for the side-effect of updating zero_vruntime.
>
> While there, optimize avg_vruntime() by noting that the average of one value is
> rather trivial to compute.
>
> Test case:
> # taskset -c -p 1 $$
> # taskset -c 2 bash -c 'while :; do :; done&'
> # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
>
> PRE:
> .zero_vruntime : 31316.407903
> >R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 /
> .zero_vruntime : 382548.253179
> >R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /
>
> POST:
> .zero_vruntime : 17259.709467
> >R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 /
> .zero_vruntime : 18702.723356
> >R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /
>
> Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 84 +++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 57 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -589,6 +589,21 @@ static inline bool entity_before(const s
> return vruntime_cmp(a->deadline, "<", b->deadline);
> }
>
> +/*
> + * Per avg_vruntime() below, cfs_rq::zero_vruntime is only slightly stale
> + * and this value should be no more than two lag bounds. Which puts it in the
> + * general order of:
> + *
> + * (slice + TICK_NSEC) << NICE_0_LOAD_SHIFT
> + *
> + * which is around 44 bits in size (on 64bit); that is 20 for
> + * NICE_0_LOAD_SHIFT, another 20 for NSEC_PER_MSEC and then a handful for
> + * however many msec the actual slice+tick ends up being.
> + *
> + * (disregarding the actual divide-by-weight part makes for the worst case
> + * weight of 2, which nicely cancels vs the fuzz in zero_vruntime not actually
> + * being the zero-lag point).
> + */
> static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime);
> @@ -676,39 +691,61 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq
> }
>
> static inline
> -void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
> +void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
> {
> /*
> - * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight
> + * v' = v + d ==> sum_w_vruntime' = sum_w_vruntime - d*sum_weight
> */
> cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
> + cfs_rq->zero_vruntime += delta;
> }
>
> /*
> - * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
> + * Specifically: avg_vruntime() + 0 must result in entity_eligible() := true
> * For this to be so, the result of this function must have a left bias.
> + *
> + * Called in:
> + * - place_entity() -- before enqueue
> + * - update_entity_lag() -- before dequeue
> + * - entity_tick()
> + *
> + * This means it is one entry 'behind' but that puts it close enough to where
> + * the bound on entity_key() is at most two lag bounds.
> */
> u64 avg_vruntime(struct cfs_rq *cfs_rq)
> {
> struct sched_entity *curr = cfs_rq->curr;
> - s64 avg = cfs_rq->sum_w_vruntime;
> - long load = cfs_rq->sum_weight;
> + long weight = cfs_rq->sum_weight;
> + s64 delta = 0;
>
> - if (curr && curr->on_rq) {
> - unsigned long weight = scale_load_down(curr->load.weight);
> + if (curr && !curr->on_rq)
> + curr = NULL;
>
> - avg += entity_key(cfs_rq, curr) * weight;
> - load += weight;
> - }
> + if (weight) {
> + s64 runtime = cfs_rq->sum_w_vruntime;
> +
> + if (curr) {
> + unsigned long w = scale_load_down(curr->load.weight);
> +
> + runtime += entity_key(cfs_rq, curr) * w;
> + weight += w;
> + }
>
> - if (load) {
> /* sign flips effective floor / ceiling */
> - if (avg < 0)
> - avg -= (load - 1);
> - avg = div_s64(avg, load);
> + if (runtime < 0)
> + runtime -= (weight - 1);
> +
> + delta = div_s64(runtime, weight);
> + } else if (curr) {
> + /*
> + * When there is but one element, it is the average.
> + */
> + delta = curr->vruntime - cfs_rq->zero_vruntime;
> }
>
> - return cfs_rq->zero_vruntime + avg;
> + update_zero_vruntime(cfs_rq, delta);
> +
> + return cfs_rq->zero_vruntime;
> }
>
> /*
> @@ -777,16 +814,6 @@ int entity_eligible(struct cfs_rq *cfs_r
> return vruntime_eligible(cfs_rq, se->vruntime);
> }
>
> -static void update_zero_vruntime(struct cfs_rq *cfs_rq)
> -{
> - u64 vruntime = avg_vruntime(cfs_rq);
> - s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
> -
> - sum_w_vruntime_update(cfs_rq, delta);
> -
> - cfs_rq->zero_vruntime = vruntime;
> -}
> -
> static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> {
> struct sched_entity *root = __pick_root_entity(cfs_rq);
> @@ -856,7 +883,6 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> sum_w_vruntime_add(cfs_rq, se);
> - update_zero_vruntime(cfs_rq);
> se->min_vruntime = se->vruntime;
> se->min_slice = se->slice;
> rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> @@ -868,7 +894,6 @@ static void __dequeue_entity(struct cfs_
> rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> &min_vruntime_cb);
> sum_w_vruntime_sub(cfs_rq, se);
> - update_zero_vruntime(cfs_rq);
> }
>
> struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
> @@ -5524,6 +5549,11 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> update_load_avg(cfs_rq, curr, UPDATE_TG);
> update_cfs_group(curr);
>
> + /*
> + * Pulls along cfs_rq::zero_vruntime.
> + */
> + avg_vruntime(cfs_rq);
> +
> #ifdef CONFIG_SCHED_HRTICK
> /*
> * queued ticks are scheduled to match the slice, so don't bother
>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
2026-02-19 7:58 ` [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Peter Zijlstra
@ 2026-02-23 10:57 ` Vincent Guittot
2026-03-24 10:01 ` William Montaz
0 siblings, 1 reply; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 10:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Zicheng Qu reported that, because avg_vruntime() always includes
> cfs_rq->curr, when ->on_rq, place_entity() doesn't work right.
>
> Specifically, the lag scaling in place_entity() relies on
> avg_vruntime() being the state *before* placement of the new entity.
> However in this case avg_vruntime() will actually already include the
> entity, which breaks things.
>
> Also, Zicheng Qu argues that avg_vruntime should be invariant under
> reweight. IOW commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity
> placement bug causing scheduling lag") was wrong!
>
> The issue reported in 6d71a9c61604 could possibly be explained by
> rounding artifacts -- notably the extreme weight '2' is outside of the
> range of avg_vruntime/sum_w_vruntime, since that uses
> scale_load_down(). By scaling vruntime by the real weight, but
> accounting it in vruntime with a factor 1024 more, the average moves
> significantly. However, that is now cured.
>
> Tested by reverting 66951e4860d3 ("sched/fair: Fix update_cfs_group()
> vs DELAY_DEQUEUE") and tracing vruntime and vlag figures again.
>
> Reported-by: Zicheng Qu <quzicheng@huawei.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 148 +++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 124 insertions(+), 24 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -819,17 +819,22 @@ static inline u64 cfs_rq_max_slice(struc
> *
> * -r_max < lag < max(r_max, q)
> */
> -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avruntime)
> {
> u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
> s64 vlag, limit;
>
> - WARN_ON_ONCE(!se->on_rq);
> -
> - vlag = avg_vruntime(cfs_rq) - se->vruntime;
> + vlag = avruntime - se->vruntime;
> limit = calc_delta_fair(max_slice, se);
>
> - se->vlag = clamp(vlag, -limit, limit);
> + return clamp(vlag, -limit, limit);
> +}
> +
> +static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> + WARN_ON_ONCE(!se->on_rq);
> +
> + se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> }
>
> /*
> @@ -3895,23 +3900,125 @@ dequeue_load_avg(struct cfs_rq *cfs_rq,
> se_weight(se) * -se->avg.load_sum);
> }
>
> -static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
> +static void
> +rescale_entity(struct sched_entity *se, unsigned long weight, bool rel_vprot)
> +{
> + unsigned long old_weight = se->load.weight;
> +
> + /*
> + * VRUNTIME
> + * --------
> + *
> + * COROLLARY #1: The virtual runtime of the entity needs to be
> + * adjusted if re-weight at !0-lag point.
> + *
> + * Proof: For contradiction assume this is not true, so we can
> + * re-weight without changing vruntime at !0-lag point.
> + *
> + * Weight VRuntime Avg-VRuntime
> + * before w v V
> + * after w' v' V'
> + *
> + * Since lag needs to be preserved through re-weight:
> + *
> + * lag = (V - v)*w = (V'- v')*w', where v = v'
> + * ==> V' = (V - v)*w/w' + v (1)
> + *
> + * Let W be the total weight of the entities before reweight,
> + * since V' is the new weighted average of entities:
> + *
> + * V' = (WV + w'v - wv) / (W + w' - w) (2)
> + *
> + * by using (1) & (2) we obtain:
> + *
> + * (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
> + * ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
> + * ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
> + * ==> (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
> + *
> + * Since we are doing at !0-lag point which means V != v, we
> + * can simplify (3):
> + *
> + * ==> W / (W + w' - w) = w / w'
> + * ==> Ww' = Ww + ww' - ww
> + * ==> W * (w' - w) = w * (w' - w)
> + * ==> W = w (re-weight indicates w' != w)
> + *
> + * So the cfs_rq contains only one entity, hence vruntime of
> + * the entity @v should always equal to the cfs_rq's weighted
> + * average vruntime @V, which means we will always re-weight
> + * at 0-lag point, thus breach assumption. Proof completed.
> + *
> + *
> + * COROLLARY #2: Re-weight does NOT affect weighted average
> + * vruntime of all the entities.
> + *
> + * Proof: According to corollary #1, Eq. (1) should be:
> + *
> + * (V - v)*w = (V' - v')*w'
> + * ==> v' = V' - (V - v)*w/w' (4)
> + *
> + * According to the weighted average formula, we have:
> + *
> + * V' = (WV - wv + w'v') / (W - w + w')
> + * = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
> + * = (WV - wv + w'V' - Vw + wv) / (W - w + w')
> + * = (WV + w'V' - Vw) / (W - w + w')
> + *
> + * ==> V'*(W - w + w') = WV + w'V' - Vw
> + * ==> V' * (W - w) = (W - w) * V (5)
> + *
> + * If the entity is the only one in the cfs_rq, then reweight
> + * always occurs at 0-lag point, so V won't change. Or else
> + * there are other entities, hence W != w, then Eq. (5) turns
> + * into V' = V. So V won't change in either case, proof done.
> + *
> + *
> + * So according to corollary #1 & #2, the effect of re-weight
> + * on vruntime should be:
> + *
> + * v' = V' - (V - v) * w / w' (4)
> + * = V - (V - v) * w / w'
> + * = V - vl * w / w'
> + * = V - vl'
> + */
> + se->vlag = div64_long(se->vlag * old_weight, weight);
> +
> + /*
> + * DEADLINE
> + * --------
> + *
> + * When the weight changes, the virtual time slope changes and
> + * we should adjust the relative virtual deadline accordingly.
> + *
> + * d' = v' + (d - v)*w/w'
> + * = V' - (V - v)*w/w' + (d - v)*w/w'
> + * = V - (V - v)*w/w' + (d - v)*w/w'
> + * = V + (d - V)*w/w'
> + */
> + if (se->rel_deadline)
> + se->deadline = div64_long(se->deadline * old_weight, weight);
> +
> + if (rel_vprot)
> + se->vprot = div64_long(se->vprot * old_weight, weight);
> +}
>
> static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> unsigned long weight)
> {
> bool curr = cfs_rq->curr == se;
> bool rel_vprot = false;
> - u64 vprot;
> + u64 avruntime = 0;
>
> if (se->on_rq) {
> /* commit outstanding execution time */
> update_curr(cfs_rq);
> - update_entity_lag(cfs_rq, se);
> - se->deadline -= se->vruntime;
> + avruntime = avg_vruntime(cfs_rq);
> + se->vlag = entity_lag(cfs_rq, se, avruntime);
> + se->deadline -= avruntime;
> se->rel_deadline = 1;
> if (curr && protect_slice(se)) {
> - vprot = se->vprot - se->vruntime;
> + se->vprot -= avruntime;
> rel_vprot = true;
> }
>
> @@ -3922,30 +4029,23 @@ static void reweight_entity(struct cfs_r
> }
> dequeue_load_avg(cfs_rq, se);
>
> - /*
> - * Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
> - * we need to scale se->vlag when w_i changes.
> - */
> - se->vlag = div64_long(se->vlag * se->load.weight, weight);
> - if (se->rel_deadline)
> - se->deadline = div64_long(se->deadline * se->load.weight, weight);
> -
> - if (rel_vprot)
> - vprot = div64_long(vprot * se->load.weight, weight);
> + rescale_entity(se, weight, rel_vprot);
>
> update_load_set(&se->load, weight);
>
> do {
> u32 divider = get_pelt_divider(&se->avg);
> -
> se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider);
> } while (0);
>
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> - place_entity(cfs_rq, se, 0);
> if (rel_vprot)
> - se->vprot = se->vruntime + vprot;
> + se->vprot += avruntime;
> + se->deadline += avruntime;
> + se->rel_deadline = 0;
> + se->vruntime = avruntime - se->vlag;
> +
> update_load_add(&cfs_rq->load, se->load.weight);
> if (!curr)
> __enqueue_entity(cfs_rq, se);
> @@ -5303,7 +5403,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
>
> se->vruntime = vruntime - lag;
>
> - if (se->rel_deadline) {
> + if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
> se->deadline += se->vruntime;
> se->rel_deadline = 0;
> return;
>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 4/7] sched/fair: Fix lag clamp
2026-02-19 7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
2026-02-23 10:23 ` Dietmar Eggemann
@ 2026-02-23 10:57 ` Vincent Guittot
1 sibling, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 10:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Vincent reported that he was seeing undue lag clamping in a mixed
> slice workload. Implement the max_slice tracking as per the todo
> comment.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
> Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
> 2 files changed, 36 insertions(+), 4 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -574,6 +574,7 @@ struct sched_entity {
> u64 deadline;
> u64 min_vruntime;
> u64 min_slice;
> + u64 max_slice;
>
> struct list_head group_node;
> unsigned char on_rq;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -748,6 +748,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> return cfs_rq->zero_vruntime;
> }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
> +
> /*
> * lag_i = S - s_i = w_i * (V - v_i)
> *
> @@ -761,17 +763,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> * EEVDF gives the following limit for a steady state system:
> *
> * -r_max < lag < max(r_max, q)
> - *
> - * XXX could add max_slice to the augmented data to track this.
> */
> static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> + u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
> s64 vlag, limit;
>
> WARN_ON_ONCE(!se->on_rq);
>
> vlag = avg_vruntime(cfs_rq) - se->vruntime;
> - limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> + limit = calc_delta_fair(max_slice, se);
>
> se->vlag = clamp(vlag, -limit, limit);
> }
> @@ -829,6 +830,21 @@ static inline u64 cfs_rq_min_slice(struc
> return min_slice;
> }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
> +{
> + struct sched_entity *root = __pick_root_entity(cfs_rq);
> + struct sched_entity *curr = cfs_rq->curr;
> + u64 max_slice = 0ULL;
> +
> + if (curr && curr->on_rq)
> + max_slice = curr->slice;
> +
> + if (root)
> + max_slice = max(max_slice, root->max_slice);
> +
> + return max_slice;
> +}
> +
> static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
> {
> return entity_before(__node_2_se(a), __node_2_se(b));
> @@ -853,6 +869,15 @@ static inline void __min_slice_update(st
> }
> }
>
> +static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
> +{
> + if (node) {
> + struct sched_entity *rse = __node_2_se(node);
> + if (rse->max_slice > se->max_slice)
> + se->max_slice = rse->max_slice;
> + }
> +}
> +
> /*
> * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
> */
> @@ -860,6 +885,7 @@ static inline bool min_vruntime_update(s
> {
> u64 old_min_vruntime = se->min_vruntime;
> u64 old_min_slice = se->min_slice;
> + u64 old_max_slice = se->max_slice;
> struct rb_node *node = &se->run_node;
>
> se->min_vruntime = se->vruntime;
> @@ -870,8 +896,13 @@ static inline bool min_vruntime_update(s
> __min_slice_update(se, node->rb_right);
> __min_slice_update(se, node->rb_left);
>
> + se->max_slice = se->slice;
> + __max_slice_update(se, node->rb_right);
> + __max_slice_update(se, node->rb_left);
> +
> return se->min_vruntime == old_min_vruntime &&
> - se->min_slice == old_min_slice;
> + se->min_slice == old_min_slice &&
> + se->max_slice == old_max_slice;
> }
>
> RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta()
2026-02-19 7:58 ` [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta() Peter Zijlstra
@ 2026-02-23 10:57 ` Vincent Guittot
0 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 10:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Since we now use the full weight for avg_vruntime(), also make
> __calc_delta() use the full value.
>
> Since weight is effectively NICE_0_LOAD, this is 20 bits on 64bit.
> This leaves 44 bits for delta_exec, which is ~16k seconds, way longer
> than any one tick would ever be, so no worry about overflow.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -225,6 +225,7 @@ void __init sched_init_granularity(void)
> update_sysctl();
> }
>
> +#ifndef CONFIG_64BIT
> #define WMULT_CONST (~0U)
> #define WMULT_SHIFT 32
>
> @@ -283,6 +284,12 @@ static u64 __calc_delta(u64 delta_exec,
>
> return mul_u64_u32_shr(delta_exec, fact, shift);
> }
> +#else
> +static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> +{
> + return (delta_exec * weight) / lw->weight;
> +}
> +#endif
>
> /*
> * delta /= w
>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-23 10:56 ` Vincent Guittot
@ 2026-02-23 11:51 ` Peter Zijlstra
2026-02-23 12:36 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-23 11:51 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Mon, Feb 23, 2026 at 11:56:33AM +0100, Vincent Guittot wrote:
> On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Due to the zero_vruntime patch, the deltas are now a lot smaller and
> > measurement with kernel-build and hackbench runs show about 45 bits
> > used.
> >
> > This ensures avg_vruntime() tracks the full weight range, reducing
> > numerical artifacts in reweight and the like.
>
> Instead of paranoid, would it be better to add WARN_ONCE ?
>
> I'm afraid that we will not notice any potential overflow without a
> long study of the regression with SCHED_FEAT(PARANOID_AVG, false)
>
> Couldn't we add a cheaper WARN_ONCE (key > 2^50) in __sum_w_vruntime_add ?
>
> We should always have
> key < 110ms (max slice+max tick) * nice_0 (2^20) / weight (2)
> key < 2^46
>
> We can use 50 bits to get margin
>
> Weight is always less than 27bits and key*weight gives us 110ms (max
> slice+max tick) * nice_0 (2^20) so we should never add more than 2^47
> to ->sum_weight
>
> so a WARN_ONCE (cfs_rq->sum_weight > 2^63) should be enough
Ha, I was >< close to pushing out these patches when I saw this.
The thing is signed, so bit 63 is the sign bit, but I suppose we can
test bit 62 like so:
Let me go build and boot that.
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -679,9 +679,13 @@ static inline void
__sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
- s64 key = entity_key(cfs_rq, se);
+ s64 w_vruntime, key = entity_key(cfs_rq, se);
- cfs_rq->sum_w_vruntime += key * weight;
+ w_vruntime = key * weight;
+
+ WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
+
+ cfs_rq->sum_w_vruntime += w_vruntime;
cfs_rq->sum_weight += weight;
}
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-23 11:51 ` Peter Zijlstra
@ 2026-02-23 12:36 ` Peter Zijlstra
2026-02-23 13:06 ` Vincent Guittot
2026-03-30 7:55 ` K Prateek Nayak
2 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-23 12:36 UTC (permalink / raw)
To: Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Mon, Feb 23, 2026 at 12:51:00PM +0100, Peter Zijlstra wrote:
> Let me go build and boot that.
Seems to not explode; had it run a few things and such. Must be good.
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -679,9 +679,13 @@ static inline void
> __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> - s64 key = entity_key(cfs_rq, se);
> + s64 w_vruntime, key = entity_key(cfs_rq, se);
>
> - cfs_rq->sum_w_vruntime += key * weight;
> + w_vruntime = key * weight;
> +
> + WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
> +
> + cfs_rq->sum_w_vruntime += w_vruntime;
> cfs_rq->sum_weight += weight;
> }
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-23 11:51 ` Peter Zijlstra
2026-02-23 12:36 ` Peter Zijlstra
@ 2026-02-23 13:06 ` Vincent Guittot
2026-03-30 7:55 ` K Prateek Nayak
2 siblings, 0 replies; 55+ messages in thread
From: Vincent Guittot @ 2026-02-23 13:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Mon, 23 Feb 2026 at 12:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Feb 23, 2026 at 11:56:33AM +0100, Vincent Guittot wrote:
> > On Thu, 19 Feb 2026 at 09:10, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Due to the zero_vruntime patch, the deltas are now a lot smaller and
> > > measurement with kernel-build and hackbench runs show about 45 bits
> > > used.
> > >
> > > This ensures avg_vruntime() tracks the full weight range, reducing
> > > numerical artifacts in reweight and the like.
> >
> > Instead of paranoid, would it be better to add WARN_ONCE ?
> >
> > I'm afraid that we will not notice any potential overflow without a
> > long study of the regression with SCHED_FEAT(PARANOID_AVG, false)
> >
> > Couldn't we add a cheaper WARN_ONCE (key > 2^50) in __sum_w_vruntime_add ?
> >
> > We should always have
> > key < 110ms (max slice+max tick) * nice_0 (2^20) / weight (2)
> > key < 2^46
> >
> > We can use 50 bits to get margin
> >
> > Weight is always less than 27bits and key*weight gives us 110ms (max
> > slice+max tick) * nice_0 (2^20) so we should never add more than 2^47
> > to ->sum_weight
> >
> > so a WARN_ONCE (cfs_rq->sum_weight > 2^63) should be enough
>
> Ha, I was >< close to pushing out these patches when I saw this.
>
> The thing is signed, so bit 63 is the sign bit, but I suppose we can
> test bit 62 like so:
Ah yes, I forgot that it's a signed value
>
> Let me go build and boot that.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -679,9 +679,13 @@ static inline void
> __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> - s64 key = entity_key(cfs_rq, se);
> + s64 w_vruntime, key = entity_key(cfs_rq, se);
>
> - cfs_rq->sum_w_vruntime += key * weight;
> + w_vruntime = key * weight;
> +
> + WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
yes looks good
> +
> + cfs_rq->sum_w_vruntime += w_vruntime;
> cfs_rq->sum_weight += weight;
> }
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
@ 2026-02-23 13:09 ` Dietmar Eggemann
2026-02-23 14:15 ` Peter Zijlstra
2026-03-28 5:44 ` John Stultz
2 siblings, 1 reply; 55+ messages in thread
From: Dietmar Eggemann @ 2026-02-23 13:09 UTC (permalink / raw)
To: Peter Zijlstra, mingo
Cc: juri.lelli, vincent.guittot, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wangtao554, quzicheng, kprateek.nayak, dsmythies,
shubhang
On 19.02.26 08:58, Peter Zijlstra wrote:
> It turns out that zero_vruntime tracking is broken when there is but a single
> task running. Current update paths are through __{en,de}queue_entity(), and
> when there is but a single task, pick_next_task() will always return that one
> task, and put_prev_set_next_task() will end up in neither function.
Tried hard but I don't get the last clause.
[...]
> While there, optimize avg_vruntime() by noting that the average of one value is
> rather trivial to compute.
>
nitpick:
> Test case:
> # taskset -c -p 1 $$
> # taskset -c 2 bash -c 'while :; do :; done&'
> # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
^
|
Works
only w/o this comma for me.
[...]
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-23 13:09 ` Dietmar Eggemann
@ 2026-02-23 14:15 ` Peter Zijlstra
2026-02-24 8:53 ` Dietmar Eggemann
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-23 14:15 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Mon, Feb 23, 2026 at 02:09:52PM +0100, Dietmar Eggemann wrote:
> On 19.02.26 08:58, Peter Zijlstra wrote:
> > It turns out that zero_vruntime tracking is broken when there is but a single
> > task running. Current update paths are through __{en,de}queue_entity(), and
> > when there is but a single task, pick_next_task() will always return that one
> > task, and put_prev_set_next_task() will end up in neither function.
>
> Tried hard but I don't get the last clause.
When prev==next, then put_prev_set_next_task() bails out and we'll never
hit __enqueue_entity()/__dequeue_entity().
> [...]
>
> > While there, optimize avg_vruntime() by noting that the average of one value is
> > rather trivial to compute.
> >
>
> nitpick:
>
> > Test case:
> > # taskset -c -p 1 $$
> > # taskset -c 2 bash -c 'while :; do :; done&'
> > # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
> ^
> |
> Works
> only w/o this comma for me.
Hmm, weird, for me:
cat /sys/kernel/debug/sched/debug | grep ^cpu#2
cpu#2, 2500.000 MHz
cpu#20, 2500.000 MHz
cpu#21, 2500.000 MHz
cpu#22, 2500.000 MHz
cpu#23, 2500.000 MHz
cpu#24, 2500.000 MHz
cpu#25, 2500.000 MHz
cpu#26, 2500.000 MHz
cpu#27, 2500.000 MHz
cpu#28, 2500.000 MHz
cpu#29, 2500.000 MHz
And that ',' was added because otherwise it would match the full 20
range of CPUs too, which was not intended ;-)
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-23 14:15 ` Peter Zijlstra
@ 2026-02-24 8:53 ` Dietmar Eggemann
2026-02-24 9:02 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Dietmar Eggemann @ 2026-02-24 8:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On 23.02.26 15:15, Peter Zijlstra wrote:
> On Mon, Feb 23, 2026 at 02:09:52PM +0100, Dietmar Eggemann wrote:
>> On 19.02.26 08:58, Peter Zijlstra wrote:
>>> It turns out that zero_vruntime tracking is broken when there is but a single
>>> task running. Current update paths are through __{en,de}queue_entity(), and
>>> when there is but a single task, pick_next_task() will always return that one
>>> task, and put_prev_set_next_task() will end up in neither function.
>>
>> Tried hard but I don't get the last clause.
>
> When prev==next, then put_prev_set_next_task() bails out and we'll never
> hit __enqueue_entity()/__dequeue_entity().
Ah, I see. But IMHO put_prev_set_next_task() is never called for the
testcase below (the CPU hog is both prev and next in put_prev_set_next_task()).
So it's the (prev != p) branch in pick_next_task_fair() that is avoided?
pick_next_task_fair()
if (prev != p) {
while (!is_same_group())
put_prev_entity()
set_next_entity()
put_prev_entity()
set_next_entity()
}
[...]
>>> Test case:
>>> # taskset -c -p 1 $$
>>> # taskset -c 2 bash -c 'while :; do :; done&'
>>> # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
>> ^
>> |
>> Works
>> only w/o this comma for me.
>
> Hmm, weird, for me:
>
> cat /sys/kernel/debug/sched/debug | grep ^cpu#2
> cpu#2, 2500.000 MHz
> cpu#20, 2500.000 MHz
> cpu#21, 2500.000 MHz
> cpu#22, 2500.000 MHz
> cpu#23, 2500.000 MHz
> cpu#24, 2500.000 MHz
> cpu#25, 2500.000 MHz
> cpu#26, 2500.000 MHz
> cpu#27, 2500.000 MHz
> cpu#28, 2500.000 MHz
> cpu#29, 2500.000 MHz
>
> And that ',' was added because otherwise it would match the full 20
> range of CPUs too, which was not intended ;-)
Ah, the ', X MHz' thing is x86 specific.
awk '/^cpu#/ {P=0} /^cpu#2(,|$)/ {P=1} {if (P) print $0}'
works on Arm64 as well.
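As a quick check (the sample lines below are synthetic, mimicking both header styles), the amended pattern keeps the cpu#2 block on either format while still skipping cpu#20:

```shell
# 'cpu#2' (Arm64 style, no frequency) starts the block; 'cpu#20, ...'
# (x86 style header for a different CPU) must end it again.
printf 'cpu#2\n .zero_vruntime : 123\ncpu#20, 2500.000 MHz\n .zero_vruntime : 456\n' |
awk '/^cpu#/ {P=0} /^cpu#2(,|$)/ {P=1} {if (P) print $0}'
# prints only:
#   cpu#2
#    .zero_vruntime : 123
```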
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-24 8:53 ` Dietmar Eggemann
@ 2026-02-24 9:02 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-02-24 9:02 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: mingo, juri.lelli, vincent.guittot, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, kprateek.nayak,
dsmythies, shubhang
On Tue, Feb 24, 2026 at 09:53:06AM +0100, Dietmar Eggemann wrote:
> On 23.02.26 15:15, Peter Zijlstra wrote:
> > On Mon, Feb 23, 2026 at 02:09:52PM +0100, Dietmar Eggemann wrote:
> >> On 19.02.26 08:58, Peter Zijlstra wrote:
> >>> It turns out that zero_vruntime tracking is broken when there is but a single
> >>> task running. Current update paths are through __{en,de}queue_entity(), and
> >>> when there is but a single task, pick_next_task() will always return that one
> >>> task, and put_prev_set_next_task() will end up in neither function.
> >>
> >> Tried hard but I don't get the last clause.
> >
> > When prev==next, then put_prev_set_next_task() bails out and we'll never
> > hit __enqueue_entity()/__dequeue_entity().
>
> Ah, I see. But IMHO put_prev_set_next_task() is never called for the
> testcase below (CPU hog is prev and next in put_prev_set_next_task())
>
> But (prev != p) in pick_next_task_fair() is avoided ?
>
> pick_next_task_fair()
>
> if (prev != p) {
>
> while (!is_same_group())
> put_prev_entity()
> set_next_entity()
>
> put_prev_entity()
> set_next_entity()
> }
>
Ah yes, pick_next_task_fair() open codes that. Also, I might have a
patch to 'fix' all that, but I've not gotten around to posting that.
There's still a few wobblies in that part of the pile :/
Look at the top 3 patches in queue/sched/flat if you're up for it :-)
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
2026-02-23 10:57 ` Vincent Guittot
@ 2026-03-24 10:01 ` William Montaz
2026-04-07 13:45 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: William Montaz @ 2026-03-24 10:01 UTC (permalink / raw)
To: vincent.guittot
Cc: bsegall, dietmar.eggemann, dsmythies, juri.lelli, kprateek.nayak,
linux-kernel, mgorman, mingo, peterz, quzicheng, rostedt,
shubhang, vschneid, wangtao554
Hi,
> Zicheng Qu reported that, because avg_vruntime() always includes
> cfs_rq->curr, when ->on_rq, place_entity() doesn't work right.
> Specifically, the lag scaling in place_entity() relies on
> avg_vruntime() being the state *before* placement of the new entity.
> However in this case avg_vruntime() will actually already include the
> entity, which breaks things.
This has proven to be harmful on our production cluster using kernel version 6.18.19
We witness a parent cgroup entity (/kubepods.slice in our case) changing its load_avg figures very frequently,
which leads to calling entity_pick->update_cfs_group->reweight_entity very often (pretty much at every entity_tick call).
If a CPU-hogging task is a member of this cgroup and bound to a CPU,
we observe starvation of processes bound to that same CPU that are not members of this cgroup
(kworkers for Ceph RBD in our production case).
Looking at /sys/kernel/debug/sched/debug, we can indeed see that cfs_rq[0]:/ .avg_vruntime and .zero_vruntime
continuously move back in time while .left_deadline and .left_vruntime are stuck.
This is likely due to an incorrect lag calculation for the cgroup entity within the root cgroup.
We can reproduce this in a sandboxed manner by doing the following:
* create a cgroup 'CG'
* run a cpu intensive task 'offender', bound to a CPU
* move the task to cgroup 'CG'
* run a cpu intensive task 'victim' bound to the same CPU
* To reproduce the frequent calls to reweight_entity, we rapidly cycle CG/cpu.weight through 99, 100, 101 in a loop
* 'victim' will stop running
I use the following script to reproduce:
---
#!/bin/bash
TARGET_CPU=0
CG_PATH="/sys/fs/cgroup/test_reweight"
cat << 'EOF' > heartbeat.c
#include <stdio.h>
#include <time.h>
#include <stdint.h>
int main() {
struct timespec last, now;
uint64_t count = 0;
clock_gettime(CLOCK_MONOTONIC, &last);
while (1) {
count++;
clock_gettime(CLOCK_MONOTONIC, &now);
long delta_ms = (now.tv_sec - last.tv_sec) * 1000 + (now.tv_nsec - last.tv_nsec) / 1000000;
if (delta_ms >= 500) {
printf("Tick: %lu iterations (delta %ld ms)\n", count, delta_ms);
fflush(stdout);
count = 0;
last = now;
}
}
return 0;
}
EOF
gcc -O2 heartbeat.c -o heartbeat
mkdir -p "$CG_PATH"
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control
taskset -c $TARGET_CPU yes > /dev/null &
PID_YES=$!
echo $PID_YES > "$CG_PATH/cgroup.procs"
taskset -c $TARGET_CPU ./heartbeat &
PID_HEARTBEAT=$!
echo "5 seconds observation..."
sleep 5
echo "Jittering on $CG_PATH/cpu.weight..."
trap "kill $PID_YES $PID_HEARTBEAT; rmdir $CG_PATH; rm heartbeat.c; rm heartbeat; exit" SIGINT SIGTERM
while true; do
echo 99 > "$CG_PATH/cpu.weight"
echo 100 > "$CG_PATH/cpu.weight"
echo 101 > "$CG_PATH/cpu.weight"
done
---
I tested the following versions:
* LTS 5.10.252, 5.15.202, 6.1.166, 6.6.129, 6.12.77 --> no issue
* LTS 6.18.19 has the issue
* Stable 6.19.9 has the issue
* Mainline 7.0-rc5 has the issue
* Tip 7.0.0-rc5+ no issue
Finally, I applied the patch to 6.18.19 LTS, which solves the issue. However, we do not benefit from earlier patches
such as [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime,
so I would prefer to let you decide how you want to address the backport for 6.18.
If you want, I can share my patch file; let me know.
Best regards
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
2026-02-23 13:09 ` Dietmar Eggemann
@ 2026-03-28 5:44 ` John Stultz
2026-03-28 17:04 ` Steven Rostedt
` (2 more replies)
2 siblings, 3 replies; 55+ messages in thread
From: John Stultz @ 2026-03-28 5:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> It turns out that zero_vruntime tracking is broken when there is but a single
> task running. Current update paths are through __{en,de}queue_entity(), and
> when there is but a single task, pick_next_task() will always return that one
> task, and put_prev_set_next_task() will end up in neither function.
>
> This can cause entity_key() to grow indefinitely large and cause overflows,
> leading to much pain and suffering.
>
> Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> are called from {set_next,put_prev}_entity() has problems because:
>
> - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> This means the avg_vruntime() will see the removal but not current, missing
> the entity for accounting.
>
> - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> NULL. This means the avg_vruntime() will see the addition *and* current,
> leading to double accounting.
>
> Both cases are incorrect/inconsistent.
>
> Noting that avg_vruntime is already called on each {en,de}queue, remove the
> explicit avg_vruntime() calls (which removes an extra 64bit division for each
> {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
>
> Additionally, have the tick call avg_vruntime() -- discarding the result, but
> for the side-effect of updating zero_vruntime.
Hey all,
So in stress testing with my full proxy-exec series, I was
occasionally tripping over a situation where __pick_eevdf() returns
NULL, which quickly crashes.
Initially I was thinking the bug was in my out of tree patches, but I
later found I could trip it with upstream as well, and I believe I
have bisected it down to this patch. Reproduction often takes
3-4 hours, though, and I usually quit testing after 5 hours, so it's
possible I have some false negatives and the problem could have arisen
earlier.
From a little bit of debugging (done with the full proxy exec series;
I need to re-debug with vanilla), the usual symptom is that we run into
a situation where !entity_eligible(cfs_rq, curr), so curr gets set to
NULL (though in one case, I saw cfs_rq->curr start as NULL), and then
we never set best, so the `if (!best || ...) best = curr;`
assignment doesn't save us and we return NULL, and crash.
I still need to dig more into the eligibility values and also to dump
the rq to see why nothing is being found. I am running with
CONFIG_SCHED_PROXY_EXEC enabled, so there may yet be some collision
here between this change and the already upstream portions of Proxy
Exec (I'll have to do more testing to see if it reproduces without
that option enabled).
The backtrace is usually due to stress-ng stress-ng-yield test:
[ 3775.898617] BUG: kernel NULL pointer dereference, address: 0000000000000059
[ 3775.903089] #PF: supervisor read access in kernel mode
[ 3775.906068] #PF: error_code(0x0000) - not-present page
[ 3775.909102] PGD 0 P4D 0
[ 3775.910656] Oops: Oops: 0000 [#1] SMP NOPTI
[ 3775.913371] CPU: 36 UID: 0 PID: 269131 Comm: stress-ng-yield Tainted: G W 7.0.0-rc5-00001-g42a93b71138f #5 PREEMPT(full)
[ 3775.920304] Tainted: [W]=WARN
[ 3775.922100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 3775.927852] RIP: 0010:pick_task_fair+0x6f/0xb0
[ 3775.930466] Code: 85 ff 74 52 48 8b 47 48 48 85 c0 74 d6 80 78 58
00 74 d0 48 89 3c 24 e8 8f 9b ff ff 48 8b 3c 24 be 01 00 00 00 e8 51
74 ff ff <80> 78 59 00 74
c3 ba 21 00 00 00 48 89 c6 48 89 df e8 5b f1 ff ff
[ 3775.941027] RSP: 0018:ffffc9003827fde0 EFLAGS: 00010086
[ 3775.943949] RAX: 0000000000000000 RBX: ffff8881b972bc40 RCX: 0000000000000803
[ 3775.948179] RDX: 00000041acc1002a RSI: 000000b0cef5382a RDI: 000040138cc6cd49
[ 3775.952149] RBP: ffffc9003827fef8 R08: 0000000000000400 R09: 0000000000000002
[ 3775.956548] R10: 0000000000000024 R11: ffff8881b04a4d40 R12: ffff8881b04a4280
[ 3775.960480] R13: ffff8881b04a4280 R14: ffffffff82ce70a8 R15: ffff8881b972bc40
[ 3775.964713] FS: 00007f6ecb7a6b00(0000) GS:ffff888235beb000(0000)
knlGS:0000000000000000
[ 3775.969468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3775.972960] CR2: 0000000000000059 CR3: 000000019c32a003 CR4: 0000000000370ef0
[ 3775.977008] Call Trace:
[ 3775.978581] <TASK>
[ 3775.979841] pick_next_task_fair+0x3c/0x8e0
[ 3775.982408] ? lock_is_held_type+0xcd/0x130
[ 3775.984833] __schedule+0x20f/0x14d0
[ 3775.987287] ? do_sched_yield+0xa2/0xe0
[ 3775.989365] schedule+0x3d/0x130
[ 3775.991376] __do_sys_sched_yield+0xe/0x20
[ 3775.993889] do_syscall_64+0xf3/0x680
[ 3775.996229] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 3776.000459] RIP: 0033:0x7f6ecc0e18c7
[ 3776.002757] Code: 73 01 c3 48 8b 0d 49 d5 0e 00 f7 d8 64 89 01 48
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 18 00 00
00 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d 19 d5 0e 00 f7 d8 64 89 01 48
I'll continue digging next week on this, but wanted to share in case
anyone else sees something obvious first.
thanks
-john
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-28 5:44 ` John Stultz
@ 2026-03-28 17:04 ` Steven Rostedt
2026-03-30 17:58 ` John Stultz
2026-03-30 9:43 ` Peter Zijlstra
2026-03-30 10:10 ` Peter Zijlstra
2 siblings, 1 reply; 55+ messages in thread
From: Steven Rostedt @ 2026-03-28 17:04 UTC (permalink / raw)
To: John Stultz
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
wangtao554, quzicheng, kprateek.nayak, dsmythies, shubhang,
Suleiman Souhlal
On Fri, 27 Mar 2026 22:44:28 -0700
John Stultz <jstultz@google.com> wrote:
> I'll continue digging next week on this, but wanted to share in case
> anyone else sees something obvious first.
FYI, you can add a trace_printk() around all the areas that assign
cfs_rq->curr, and if you use the persistent ring buffer you can retrieve
the output of all the events that lead up to the crash.
Add to the kernel command line:
reserve_mem=100M:12M:trace trace_instance=boot_mapped^traceprintk^traceoff@trace
And before starting your testing:
echo 1 > /sys/kernel/tracing/instances/boot_mapped/tracing_on
If all goes well, after the crash you should see the output in:
cat /sys/kernel/tracing/instances/boot_mapped/trace
-- Steve
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-02-23 11:51 ` Peter Zijlstra
2026-02-23 12:36 ` Peter Zijlstra
2026-02-23 13:06 ` Vincent Guittot
@ 2026-03-30 7:55 ` K Prateek Nayak
2026-03-30 9:27 ` Peter Zijlstra
2026-04-02 5:28 ` K Prateek Nayak
2 siblings, 2 replies; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-30 7:55 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, dsmythies,
shubhang
On 2/23/2026 5:21 PM, Peter Zijlstra wrote:
>> We should always have
>> key < 110ms (max slice+max tick) * nice_0 (2^20) / weight (2)
>> key < 2^46
>>
>> We can use 50 bits to get margin
>>
>> Weight is always less than 27bits and key*weight gives us 110ms (max
>> slice+max tick) * nice_0 (2^20) so we should never add more than 2^47
>> to ->sum_weight
>>
>> so a WARN_ONCE (cfs_rq->sum_weight > 2^63) should be enough
>
> Ha, I was >< close to pushing out these patches when I saw this.
>
> The thing is signed, so bit 63 is the sign bit, but I suppose we can
> test bit 62 like so:
>
> Let me go build and boot that.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -679,9 +679,13 @@ static inline void
> __sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> unsigned long weight = avg_vruntime_weight(cfs_rq, se->load.weight);
> - s64 key = entity_key(cfs_rq, se);
> + s64 w_vruntime, key = entity_key(cfs_rq, se);
>
> - cfs_rq->sum_w_vruntime += key * weight;
> + w_vruntime = key * weight;
> +
> + WARN_ON_ONCE((w_vruntime >> 63) != (w_vruntime >> 62));
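Out of kernel context, the shift comparison in the hunk above can be sanity-checked on its own; a minimal userspace sketch (the helper name is mine, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the WARN_ON_ONCE((v >> 63) != (v >> 62)) guard.
 * For a signed 64-bit value, arithmetic right shift replicates the sign
 * bit, so the two shifts agree exactly when bit 62 still equals the sign
 * bit, i.e. the value fits in 63 bits and there is at least one bit of
 * headroom left before a real s64 overflow.
 *
 * Note: right-shifting a negative signed integer is implementation-
 * defined in C, but is an arithmetic shift on the compilers the kernel
 * supports.
 */
static int fits_s63(int64_t v)
{
	return (v >> 63) == (v >> 62);
}
```

This also lets one check Vincent's bit budget numerically: the largest single increment, 110ms (in ns) * nice_0 (2^20), is below 2^47, far from the bit-62 line the warning draws.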
I was trying to reproduce the crash that John mentioned on Patch 1, and
although I couldn't reproduce that crash (yet), I tripped this when
running the stress-ng yield test (32 copies x 256 children + sched messaging
16 groups) on my dual-socket system (2 x 64C/128T):
------------[ cut here ]------------
(w_vruntime >> 63) != (w_vruntime >> 62)
WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
Modules linked in: ...
CPU: 5 UID: 1000 PID: 5062 Comm: stress-ng Not tainted 7.0.0-rc5-topo-test+ #40 PREEMPT(full)
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:__enqueue_entity+0x382/0x3a0
Code: 4c 89 4b 48 4c 89 4b 50 e9 61 fe ff ff 83 f9 3f 0f 87 b8 27 e5 ff 49 d3 ec b8 02 00 00 00 49 39 c4 4c 0f 42 e0 e9 16 ff ff ff <0f> 0b e9 d8 fc ff ff 0f 0b e9 e1 fe ff ff 0f 0b 66 66 2e 0f 1f 84
RSP: 0018:ffffcf6b8ea88c18 EFLAGS: 00010002
RAX: bf38ba3b09dc2400 RBX: ffff8d546f832680 RCX: ffffffffffffffff
RDX: fffffffffffffffe RSI: ffff8d1587818080 RDI: ffff8d546f832680
RBP: ffff8d1587818080 R08: 000000000000002d R09: 00000000ffffffff
R10: 0000000000000001 R11: ffffcf6b8ea88ff8 R12: 00000000056ae400
R13: ffff8d1587818080 R14: 0000000000000001 R15: ffff8d546f832680
FS: 00007f8c742c9740(0000) GS:ffff8d54b0c82000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2e09381358 CR3: 00000040db039002 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<IRQ>
enqueue_task_fair+0x1a3/0xe50
? srso_alias_return_thunk+0x5/0xfbef5
? place_entity+0x21/0x160
enqueue_task+0x88/0x1b0
ttwu_do_activate+0x74/0x1c0
try_to_wake_up+0x277/0x840
...
asm_sysvec_call_function_single+0x1a/0x20
RIP: 0010:do_sched_yield+0x73/0xa0
Code: 89 df 48 8b 80 e8 02 00 00 48 8b 40 18 e8 75 a9 fd 00 65 ff 05 9e 94 fc 02 66 90 48 8d 7b 48 e8 d3 96 fd 00 fb 0f 1f 44 00 00 <65> ff 0d 86 94 fc 02 5b e9 10 12 fd 00 83 83 70 0d 00 00 01 eb bb
RSP: 0018:ffffcf6b9b7bbd78 EFLAGS: 00000282
RAX: ffffffffbbbb4560 RBX: ffff8d546f832580 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8d546f8325c8
RBP: ffffcf6b9b7bbf38 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d15d1070000
R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000018
? __pfx_yield_task_fair+0x10/0x10
? do_sched_yield+0x6d/0xa0
__do_sys_sched_yield+0xe/0x20
...
Since this wasn't supposed to trip, I'm assuming we are somehow in
wrap-around territory again :-(
I don't see anything particularly interesting in the sched/debug
entry after the fact:
cfs_rq[5]:/user.slice
.left_deadline : 26249498461.397509
.left_vruntime : 26249498270.250843
.zero_vruntime : 26249456859.395628
.sum_w_vruntime : 2547538312135680 (51 bits)
.sum_weight : 61440
.sum_shift : 0
.avg_vruntime : 26249498338.158417
.right_vruntime : 26249498381.633124
.spread : 111.382281
.nr_queued : 5
.h_nr_runnable : 5
.h_nr_queued : 5
.h_nr_idle : 0
...
I still haven't figured out how this happens; I'll start running with
some debug prints next.
On a tangential note, now that we only yield on eligibility, would
something like below make sense?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..55ab1f58d703 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9265,9 +9265,10 @@ static void yield_task_fair(struct rq *rq)
struct sched_entity *se = &curr->se;
/*
- * Are we the only task in the tree?
+ * Single task is always eligible on the cfs_rq.
+ * Don't pull the vruntime needlessly.
*/
- if (unlikely(rq->nr_running == 1))
+ if (unlikely(cfs_rq->nr_queued == 1))
return;
clear_buddies(cfs_rq, se);
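The "single task is always eligible" claim in the new comment can be modelled numerically: eligibility compares key*weight against the weighted vruntime sum, and with one queued entity those are identical. A standalone sketch (names are mine, only loosely mirroring vruntime_eligible()):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy eligibility test: an entity with key = vruntime - zero_vruntime
 * and weight w is eligible when key * W <= sum_w_vruntime, where W and
 * sum_w_vruntime sum over all queued entities. With exactly one queued
 * entity, W == w and sum_w_vruntime == key * w, so the comparison
 * degenerates to key*w <= key*w: always true, whatever the key.
 */
static int toy_eligible(int64_t key, int64_t W, int64_t sum_w_vruntime)
{
	return key * W <= sum_w_vruntime;
}
```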
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-03-30 7:55 ` K Prateek Nayak
@ 2026-03-30 9:27 ` Peter Zijlstra
2026-04-02 5:28 ` K Prateek Nayak
1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 9:27 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang
On Mon, Mar 30, 2026 at 01:25:59PM +0530, K Prateek Nayak wrote:
> On a tangential note, now that we only yield on eligibility, would
> something like below make sense?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..55ab1f58d703 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9265,9 +9265,10 @@ static void yield_task_fair(struct rq *rq)
> struct sched_entity *se = &curr->se;
>
> /*
> - * Are we the only task in the tree?
> + * Single task is always eligible on the cfs_rq.
> + * Don't pull the vruntime needlessly.
> */
> - if (unlikely(rq->nr_running == 1))
> + if (unlikely(cfs_rq->nr_queued == 1))
> return;
>
> clear_buddies(cfs_rq, se);
Right, with the addition of sched_ext this could actually make a
difference.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-28 5:44 ` John Stultz
2026-03-28 17:04 ` Steven Rostedt
@ 2026-03-30 9:43 ` Peter Zijlstra
2026-03-30 17:49 ` John Stultz
2026-03-30 10:10 ` Peter Zijlstra
2 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 9:43 UTC (permalink / raw)
To: John Stultz
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
> The backtrace is usually due to stress-ng stress-ng-yield test:
What actual stress-ng parameters are that, and what kind of VM are you
running?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-28 5:44 ` John Stultz
2026-03-28 17:04 ` Steven Rostedt
2026-03-30 9:43 ` Peter Zijlstra
@ 2026-03-30 10:10 ` Peter Zijlstra
2026-03-30 14:37 ` K Prateek Nayak
2026-03-30 19:40 ` John Stultz
2 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 10:10 UTC (permalink / raw)
To: John Stultz
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
> On Wed, Feb 18, 2026 at 11:58 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > It turns out that zero_vruntime tracking is broken when there is but a single
> > task running. Current update paths are through __{en,de}queue_entity(), and
> > when there is but a single task, pick_next_task() will always return that one
> > task, and put_prev_set_next_task() will end up in neither function.
> >
> > This can cause entity_key() to grow indefinitely large and cause overflows,
> > leading to much pain and suffering.
> >
> > Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> > are called from {set_next,put_prev}_entity() has problems because:
> >
> > - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> > This means the avg_vruntime() will see the removal but not current, missing
> > the entity for accounting.
> >
> > - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> > NULL. This means the avg_vruntime() will see the addition *and* current,
> > leading to double accounting.
> >
> > Both cases are incorrect/inconsistent.
> >
> > Noting that avg_vruntime is already called on each {en,de}queue, remove the
> > explicit avg_vruntime() calls (which removes an extra 64bit division for each
> > {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
> >
> > Additionally, have the tick call avg_vruntime() -- discarding the result, but
> > for the side-effect of updating zero_vruntime.
>
> Hey all,
>
> So in stress testing with my full proxy-exec series, I was
> occasionally tripping over the situation where __pick_eevdf() returns
> null which quickly crashes.
> The backtrace is usually due to stress-ng stress-ng-yield test:
Suppose we have 2 runnable tasks, both doing yield. Then one will be
eligible and one will not be, because the average position must be in
between these two entities.
Therefore, the running task will be eligible, and will be advanced a full
slice (all the tasks do is yield, after all). This causes it to jump over
the other task: now the other task is eligible and it no longer is.
So we schedule.
Since both tasks stay runnable, there is no dequeue or enqueue. All we have is
the __enqueue_entity() and __dequeue_entity() from put_prev_task() /
set_next_task(). But per the fingered commit, those two no longer move
zero_vruntime along.
All that moves zero_vruntime is the tick and a full dequeue or enqueue.
This means that if the two tasks playing leapfrog can reach the
overflow point inside one tick's worth of time, we're up a creek.
If this is indeed the case, then the below should cure things.
This also means that running a HZ=100 config will increase the chances
of hitting this vs HZ=1000.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..c7daaf941b26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
if (entity_eligible(cfs_rq, se)) {
se->vruntime = se->deadline;
se->deadline += calc_delta_fair(se->slice, se);
+ avg_vruntime(cfs_rq);
}
}
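The HZ observation above can be put in numbers: the key only grows between zero_vruntime updates, and the growth per tick window scales with the tick length. A toy model (the constants below are illustrative, not kernel values):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the leapfrog scenario: every yield advances the yielding
 * task's vruntime by one virtual slice, while zero_vruntime only catches
 * up at the tick. The entity key can therefore grow by roughly
 * (yields per tick) * vslice between updates, so a 10ms tick (HZ=100)
 * lets 10x the headroom be consumed per window compared to a 1ms tick
 * (HZ=1000).
 */
static uint64_t key_growth_per_tick(uint64_t tick_ns, uint64_t ns_per_yield,
				    uint64_t vslice_ns)
{
	return (tick_ns / ns_per_yield) * vslice_ns;
}
```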
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 10:10 ` Peter Zijlstra
@ 2026-03-30 14:37 ` K Prateek Nayak
2026-03-30 14:40 ` Peter Zijlstra
2026-03-30 19:40 ` John Stultz
1 sibling, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-30 14:37 UTC (permalink / raw)
To: Peter Zijlstra, John Stultz
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang, Suleiman Souhlal
Hello Peter,
On 3/30/2026 3:40 PM, Peter Zijlstra wrote:
> This means, that if the two tasks playing leapfrog can reach the
> critical speed to reach the overflow point inside one tick's worth of
> time, we're up a creek.
>
> If this is indeed the case, then the below should cure things.
I have been running with this for four hours now and haven't seen
any splats or crashes on my setup. I could reliably trigger the
warning from __sum_w_vruntime_add() within an hour previously so
it is safe to say I was hitting exactly this.
Feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 14:37 ` K Prateek Nayak
@ 2026-03-30 14:40 ` Peter Zijlstra
2026-03-30 15:50 ` K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 14:40 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 08:07:06PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 3/30/2026 3:40 PM, Peter Zijlstra wrote:
> > This means, that if the two tasks playing leapfrog can reach the
> > critical speed to reach the overflow point inside one tick's worth of
> > time, we're up a creek.
> >
> > If this is indeed the case, then the below should cure things.
>
> I have been running with this for four hours now and haven't seen
> any splats or crashes on my setup. I could reliably trigger the
> warning from __sum_w_vruntime_add() within an hour previously so
> it is safe to say I was hitting exactly this.
>
> Feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Ha!, excellent. Thanks!
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 14:40 ` Peter Zijlstra
@ 2026-03-30 15:50 ` K Prateek Nayak
2026-03-30 19:11 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-30 15:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
Hello Peter,
On 3/30/2026 8:10 PM, Peter Zijlstra wrote:
> On Mon, Mar 30, 2026 at 08:07:06PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 3/30/2026 3:40 PM, Peter Zijlstra wrote:
>>> This means, that if the two tasks playing leapfrog can reach the
>>> critical speed to reach the overflow point inside one tick's worth of
>>> time, we're up a creek.
>>>
>>> If this is indeed the case, then the below should cure things.
>>
>> I have been running with this for four hours now and haven't seen
>> any splats or crashes on my setup. I could reliably trigger the
>> warning from __sum_w_vruntime_add() within an hour previously so
>> it is safe to say I was hitting exactly this.
>>
>> Feel free to include:
>>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Ha!, excellent. Thanks!
Turns out I spoke too soon and it did eventually run into that
problem again and then eventually crashed in pick_task_fair()
later so there is definitely something amiss still :-(
I'll throw in some debug traces and get back tomorrow.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 9:43 ` Peter Zijlstra
@ 2026-03-30 17:49 ` John Stultz
0 siblings, 0 replies; 55+ messages in thread
From: John Stultz @ 2026-03-30 17:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 2:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 27, 2026 at 10:44:28PM -0700, John Stultz wrote:
>
> > The backtrace is usually due to stress-ng stress-ng-yield test:
>
> What actual stress-ng parameters are that, and what kind of VM are you
> running?
Just running in a loop:
stress-ng --class scheduler --all 1 --timeout 300
I usually also run with locktorture using the following boot args:
torture.random_shuffle=1 locktorture.writer_fifo=1
locktorture.torture_type=mutex_lock locktorture.nested_locks=8
locktorture.rt_boost=1 locktorture.rt_boost_factor=50
locktorture.stutter=0
As well as cyclictest -t -p99 and my prio-inversion-demo in a loop in
the background, but I suspect they aren't necessary here.
This all on a nested VM w/ 64 vCPUs (host has 96 vCPUs).
Over the weekend I did run with CONFIG_SCHED_PROXY_EXEC disabled, and
still tripped the problem. I've added some trace_printks back in and
am working to get more details.
thanks
-john
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-28 17:04 ` Steven Rostedt
@ 2026-03-30 17:58 ` John Stultz
2026-03-30 18:27 ` Steven Rostedt
0 siblings, 1 reply; 55+ messages in thread
From: John Stultz @ 2026-03-30 17:58 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
wangtao554, quzicheng, kprateek.nayak, dsmythies, shubhang,
Suleiman Souhlal
On Sat, Mar 28, 2026 at 10:03 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 27 Mar 2026 22:44:28 -0700
> John Stultz <jstultz@google.com> wrote:
>
> > I'll continue digging next week on this, but wanted to share in case
> > anyone else sees something obvious first.
>
> FYI, you can add a trace_printk() around all the areas that assign
> cfs_rq->curr, and if you use the persistent ring buffer you can retrieve
> the output of all the events that lead up to the crash.
>
> Add to the kernel command line:
>
> reserve_mem=100M:12M:trace trace_instance=boot_mapped^traceprintk^traceoff@trace
>
> And before starting your testing:
>
> echo 1 > /sys/kernel/tracing/instances/boot_mapped/tracing_on
>
> If all goes well, after the crash you should see the output in:
>
> cat /sys/kernel/tracing/instances/boot_mapped/trace
Nice. I've actually been using ftrace_dump_on_oops, as I log the VM
serial console to a file and don't have to remember to go fetch it out
the next time. But this is helpful when it's on real hardware (dumping
the trace over the serial console can take a while, and I've had cases where
the hardware watchdog reboots the device midway when the trace is too
large).
thanks
-john
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 17:58 ` John Stultz
@ 2026-03-30 18:27 ` Steven Rostedt
0 siblings, 0 replies; 55+ messages in thread
From: Steven Rostedt @ 2026-03-30 18:27 UTC (permalink / raw)
To: John Stultz
Cc: Peter Zijlstra, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, bsegall, mgorman, vschneid, linux-kernel,
wangtao554, quzicheng, kprateek.nayak, dsmythies, shubhang,
Suleiman Souhlal
On Mon, 30 Mar 2026 10:58:54 -0700
John Stultz <jstultz@google.com> wrote:
> Nice. I've actually been using ftrace_dump_on_oops as I log the VM
> serial console to a file and don't have to remember to go fetch it out
> the next time. But this is helpful when its on real hardware (dumping
> the trace over serial console can take awhile and I've had cases where
> the hardware watchdogs reboot the device midway when the trace is too
> large).
>
I use it on VMs all the time. It works nicely with qemu.
In the near future we will have a backup buffer implementation too, where
you don't need to remember to start it!
reserve_mem=100M:12M:trace trace_instance=boot_mapped^traceprintk^@trace trace_instance=backup=boot_mapped
Where on early boot up, a "backup" instance is created and copies the
persistent ring buffer to a read-only temporary buffer that doesn't get
overwritten.
Then you can keep the persistent ring buffer always on, and if there's a
crash, just look at the "backup" instance to see what happened.
-- Steve
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 15:50 ` K Prateek Nayak
@ 2026-03-30 19:11 ` Peter Zijlstra
2026-03-31 0:38 ` K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 19:11 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 09:20:01PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 3/30/2026 8:10 PM, Peter Zijlstra wrote:
> > On Mon, Mar 30, 2026 at 08:07:06PM +0530, K Prateek Nayak wrote:
> >> Hello Peter,
> >>
> >> On 3/30/2026 3:40 PM, Peter Zijlstra wrote:
> >>> This means, that if the two tasks playing leapfrog can reach the
> >>> critical speed to reach the overflow point inside one tick's worth of
> >>> time, we're up a creek.
> >>>
> >>> If this is indeed the case, then the below should cure things.
> >>
> >> I have been running with this for four hours now and haven't seen
> >> any splats or crashes on my setup. I could reliably trigger the
> >> warning from __sum_w_vruntime_add() within an hour previously so
> >> it is safe to say I was hitting exactly this.
> >>
> >> Feel free to include:
> >>
> >> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >
> > Ha!, excellent. Thanks!
>
> Turns out I spoke too soon and it did eventually run into that
> problem again and then eventually crashed in pick_task_fair()
> later so there is definitely something amiss still :-(
>
> I'll throw in some debug traces and get back tomorrow.
Are there cgroups involved?
I'm thinking that if you have two groups, and the tick always hits the
one group, the other group can go a while without ever getting updated.
But if there's no cgroups, this can't be it.
Anyway, something like the below would rule this out I suppose.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..19b75af31a5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1304,6 +1304,8 @@ static void update_curr(struct cfs_rq *cfs_rq)
curr->vruntime += calc_delta_fair(delta_exec, curr);
resched = update_deadline(cfs_rq, curr);
+ if (resched)
+ avg_vruntime(cfs_rq);
if (entity_is_task(curr)) {
/*
@@ -5593,11 +5595,6 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
- /*
- * Pulls along cfs_rq::zero_vruntime.
- */
- avg_vruntime(cfs_rq);
-
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 10:10 ` Peter Zijlstra
2026-03-30 14:37 ` K Prateek Nayak
@ 2026-03-30 19:40 ` John Stultz
2026-03-30 19:43 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: John Stultz @ 2026-03-30 19:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 3:10 AM Peter Zijlstra <peterz@infradead.org> wrote:
> This means, that if the two tasks playing leapfrog can reach the
> critical speed to reach the overflow point inside one tick's worth of
> time, we're up a creek.
>
> If this is indeed the case, then the below should cure things.
>
> This also means that running a HZ=100 config will increase the chances
> of hitting this vs HZ=1000.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9298f49f842c..c7daaf941b26 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
> if (entity_eligible(cfs_rq, se)) {
> se->vruntime = se->deadline;
> se->deadline += calc_delta_fair(se->slice, se);
> + avg_vruntime(cfs_rq);
> }
> }
I just tested with this and similar to Prateek, I also still tripped the issue.
I'll give your new patch a spin here in a second.
thanks
-john
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 19:40 ` John Stultz
@ 2026-03-30 19:43 ` Peter Zijlstra
2026-03-30 21:45 ` John Stultz
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-30 19:43 UTC (permalink / raw)
To: John Stultz
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 12:40:45PM -0700, John Stultz wrote:
> On Mon, Mar 30, 2026 at 3:10 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > This means, that if the two tasks playing leapfrog can reach the
> > critical speed to reach the overflow point inside one tick's worth of
> > time, we're up a creek.
> >
> > If this is indeed the case, then the below should cure things.
> >
> > This also means that running a HZ=100 config will increase the chances
> > of hitting this vs HZ=1000.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9298f49f842c..c7daaf941b26 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
> > if (entity_eligible(cfs_rq, se)) {
> > se->vruntime = se->deadline;
> > se->deadline += calc_delta_fair(se->slice, se);
> > + avg_vruntime(cfs_rq);
> > }
> > }
>
> I just tested with this and similar to Prateek, I also still tripped the issue.
>
> I'll give your new patch a spin here in a second.
Stick both on please :-) AFAICT they're both real, just not convinced
they're what you're hitting.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 19:43 ` Peter Zijlstra
@ 2026-03-30 21:45 ` John Stultz
0 siblings, 0 replies; 55+ messages in thread
From: John Stultz @ 2026-03-30 21:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
kprateek.nayak, dsmythies, shubhang, Suleiman Souhlal
On Mon, Mar 30, 2026 at 12:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 30, 2026 at 12:40:45PM -0700, John Stultz wrote:
> > On Mon, Mar 30, 2026 at 3:10 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > This means, that if the two tasks playing leapfrog can reach the
> > > critical speed to reach the overflow point inside one tick's worth of
> > > time, we're up a creek.
> > >
> > > If this is indeed the case, then the below should cure things.
> > >
> > > This also means that running a HZ=100 config will increase the chances
> > > of hitting this vs HZ=1000.
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 9298f49f842c..c7daaf941b26 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9307,6 +9307,7 @@ static void yield_task_fair(struct rq *rq)
> > > if (entity_eligible(cfs_rq, se)) {
> > > se->vruntime = se->deadline;
> > > se->deadline += calc_delta_fair(se->slice, se);
> > > + avg_vruntime(cfs_rq);
> > > }
> > > }
> >
> > I just tested with this and similar to Prateek, I also still tripped the issue.
> >
> > I'll give your new patch a spin here in a second.
>
> Stick both on please :-) AFAICT they're both real, just not convinced
> they're what you're hitting.
Sadly I'm still hitting it with both. This time the stack trace was
different, and it came up through do_nanosleep() from stress-ng-exit
instead of yield.
I'll re-add my debug trace_printks (I dropped them while testing your
patches in case they changed the timing of things) and work to
understand more here.
thanks
-john
[ 6777.071789] BUG: kernel NULL pointer dereference, address: 0000000000000051
[ 6777.076712] #PF: supervisor read access in kernel mode
[ 6777.079767] #PF: error_code(0x0000) - not-present page
[ 6777.082787] PGD 0 P4D 0
[ 6777.084361] Oops: Oops: 0000 [#1] SMP NOPTI
[ 6777.086812] CPU: 37 UID: 0 PID: 531349 Comm: stress-ng-exit- Tainted: G W 7.0.0-rc1-00001-gb3d99f43c72b-dirty #18 PREEMPT(full)
[ 6777.094026] Tainted: [W]=WARN
[ 6777.095771] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
[ 6777.100689] RIP: 0010:pick_task_fair+0x6f/0xb0
[ 6777.103239] Code: 85 ff 74 52 48 8b 47 48 48 85 c0 74 d6 80 78 50 00 74 d0 48 89 3c 24 e8 8f e0 ff ff 48 8b 3c 24 be 01 00 00 00 e8 31 77 ff ff <80> 78 51 00 74 c3 ba 21 00 00 00 48 89 c6 48 89 df e8 db f1 ff ff
[ 6777.113447] RSP: 0018:ffffc9000f7dbcf0 EFLAGS: 00010082
[ 6777.116283] RAX: 0000000000000000 RBX: ffff8881b976bbc0 RCX: 0000000000000800
[ 6777.119791] RDX: 000000000a071800 RSI: 000000000b719000 RDI: 00004fc5ab7864c9
[ 6777.123608] RBP: ffffc9000f7dbdf0 R08: 0000000000000400 R09: 0000000000000002
[ 6777.127785] R10: 0000000000000025 R11: 0000000000000000 R12: ffff88810adc4200
[ 6777.131937] R13: ffff88810adc4200 R14: ffffffff82ce5b28 R15: ffff8881b976bbc0
[ 6777.135994] FS:  00007fc1c37866c0(0000) GS:ffff888235c2b000(0000) knlGS:0000000000000000
[ 6777.140449] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6777.143756] CR2: 0000000000000051 CR3: 000000014160f005 CR4: 0000000000370ef0
[ 6777.147819] Call Trace:
[ 6777.149379] <TASK>
[ 6777.150653] pick_next_task_fair+0x3c/0x8c0
[ 6777.153115] __schedule+0x1e8/0x1200
[ 6777.155241] ? do_nanosleep+0x1a/0x170
[ 6777.157336] schedule+0x3d/0x130
[ 6777.159150] do_nanosleep+0x88/0x170
[ 6777.161161] ? find_held_lock+0x2b/0x80
[ 6777.163201] hrtimer_nanosleep+0xba/0x1f0
[ 6777.165481] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 6777.167990] common_nsleep+0x34/0x60
[ 6777.169957] __x64_sys_clock_nanosleep+0xde/0x150
[ 6777.172443] do_syscall_64+0xf3/0x680
[ 6777.174409] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 6777.177176] RIP: 0033:0x7fc1cc92d9ee
[ 6777.179009] Code: 08 0f 85 f5 4b ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 80 00 00 00 00 48 83 ec 08
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-30 19:11 ` Peter Zijlstra
@ 2026-03-31 0:38 ` K Prateek Nayak
2026-03-31 4:58 ` K Prateek Nayak
2026-03-31 7:08 ` Peter Zijlstra
0 siblings, 2 replies; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-31 0:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
Hello Peter,
On 3/31/2026 12:41 AM, Peter Zijlstra wrote:
>> Turns out I spoke too soon and it did eventually run into that
>> problem again and then eventually crashed in pick_task_fair()
>> later so there is definitely something amiss still :-(
>>
>> I'll throw in some debug traces and get back tomorrow.
>
> Are there cgroups involved?
Indeed there are.
>
> I'm thinking that if you have two groups, and the tick always hits the
> one group, the other group can go a while without ever getting updated.
Ack! That could be, but I only have one cgroup on top of the root cgroup as
far as CPU controllers are concerned, so the sched_yield() catching up
avg_vruntime() should have worked. Either way, I have more data:
When I hit the overflow warning, I have:
se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
Post avg_vruntime():
se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
so running avg_vruntime() doesn't make a difference and it seems to be a
genuine case of place_entity() putting the newly woken entity pretty
far back in the timeline. (I forgot to print weights!)
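The printed overflow value checks out arithmetically: key * weight is about 7.55e18, which still fits in an s64 (< 2^63, about 9.22e18) but is well past the bit-62 line the warning draws. A standalone reproduction of the numbers (helper name mine):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reproduce the warning's numbers from the debug print above:
 * key = -83106064385, weight = 90891264. The product still fits in an
 * s64, so this is not (yet) a real overflow; it merely crosses bit 62,
 * which is exactly what the (v >> 63) != (v >> 62) check detects.
 */
static int64_t w_vruntime_of(int64_t key, int64_t weight)
{
	return key * weight;
}
```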
Now, the funny part is, if I leave the system undisturbed, I get a few
of the above warning and nothing interesting but as soon as I do a:
grep bits /sys/kernel/debug/sched/debug
Boom! Pick fails very consistently (because of copy-pasta, this too
doesn't contain weights):
NULL Pick!
cfs_rq: zero_vruntime(89029406877992895) sum_w_vruntime(-135049248768) sum_weight(1048576)
cfs_rq->curr: entity_key(149162) vruntime(89029406878142057) deadline(89029406976268435)
queued se: entity_key(-123294) vruntime(89029406877869601) deadline(89029406880669601)
after avg_vruntime()!
cfs_rq: zero_vruntime(89029406877868114) sum_w_vruntime(-4206886912) sum_weight(1048576)
cfs_rq->curr: entity_key(273943) vruntime(89029406878142057) deadline(89029406976268435)
queued se: entity_key(1487) vruntime(89029406877869601) deadline(89029406880669601)
NULL Pick!
The above doesn't recover after an avg_vruntime(). Btw I'm running:
nice -n 19 stress-ng --yield 32 -t 1000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
Nice 19 is to get a large deadline and keep catching up to that deadline
at every yield to see if that makes any difference.
>
> But if there's no cgroups, this can't be it.
>
> Anyway, something like the below would rule this out I suppose.
I'll add that in and see if it makes a difference. I'll add in
weights and look at place_entity() to see if we have anything
interesting going on there.
Thank you for taking a look.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 0:38 ` K Prateek Nayak
@ 2026-03-31 4:58 ` K Prateek Nayak
2026-03-31 7:08 ` Peter Zijlstra
1 sibling, 0 replies; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-31 4:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On 3/31/2026 6:08 AM, K Prateek Nayak wrote:
>> I'm thinking that if you have two groups, and the tick always hits the
>> one group, the other group can go a while without ever getting updated.
>
> Ack! That could be, but I only have one cgroup on top of the root cgroup as
> far as cpu controllers are concerned, so the sched_yield() catching up
> the avg_vruntime() should have worked. Either way, I have more data:
>
> When I hit the overflow warning, I have:
>
> se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
> cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
> cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
> Post avg_vruntime():
> se: entity_key(-83106064385) weight(90891264) overflow(-7553615238018032640)
> cfs_rq: zero_vruntime(138430453113448575) sum_w_vruntime(0) sum_weight(0)
> cfs_rq->curr: entity_key(0) vruntime(138430453113448575) deadline(138430500540426854)
>
> so running avg_vruntime() doesn't make a difference and it seems to be a
> genuine case of place_entity() putting the newly woken entity pretty
> far back in the timeline. (I forgot to print weights!)
>
> Now, the funny part is, if I leave the system undisturbed, I get a few
> of the above warnings and nothing interesting, but as soon as I do a:
>
> grep bits /sys/kernel/debug/sched/debug
>
> Boom! Pick fails very consistently (Because of copy-pasta this too
> doesn't contain weights):
>
> NULL Pick!
> cfs_rq: zero_vruntime(89029406877992895) sum_w_vruntime(-135049248768) sum_weight(1048576)
> cfs_rq->curr: entity_key(149162) vruntime(89029406878142057) deadline(89029406976268435)
> queued se: entity_key(-123294) vruntime(89029406877869601) deadline(89029406880669601)
>
> after avg_vruntime()!
> cfs_rq: zero_vruntime(89029406877868114) sum_w_vruntime(-4206886912) sum_weight(1048576)
> cfs_rq->curr: entity_key(273943) vruntime(89029406878142057) deadline(89029406976268435)
> queued se: entity_key(1487) vruntime(89029406877869601) deadline(89029406880669601)
>
> NULL Pick!
>
> The above doesn't recover after an avg_vruntime(). Btw I'm running:
>
> nice -n 19 stress-ng --yield 32 -t 1000000s&
> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> Nice 19 is to get a large deadline and keep catching up to that deadline
> at every yield to see if that makes any difference.
>
>>
>> But if there's no cgroups, this can't be it.
>>
>> Anyway, something like the below would rule this out I suppose.
>
> I'll add that in and see if it makes a difference. I'll add in
> weights and look at place_entity() to see if we have anything
> interesting going on there.
Still trips the issue :-( This time I have logs with weights.
For the warning:
se: entity_key(-72358759771) weight(90891264) warning_mul(-6576779137058540544) vlag(39009) delayed?(0)
cfs_rq: zero_vruntime(18695504496613622) sum_w_vruntime(0) sum_weight(0)
cfs_rq->curr: entity_key(0) vruntime(18695504496613622) deadline(18695540588878716) weight(49)
Post avg_vruntime():
se: entity_key(-72358759771) weight(90891264) overflow?(-6576779137058540544)
cfs_rq: zero_vruntime(18695504496613622) sum_w_vruntime(0) sum_weight(0)
cfs_rq->curr: entity_key(0) vruntime(18695504496613622) deadline(18695540588878716) weight(49)
And the NULL pick while reading debugfs (probably something in the initial
task wakeup path that trips it?):
NULL Pick!
cfs_rq: zero_vruntime(21126236598445952) sum_w_vruntime(-1074569456640) sum_weight(15360)
cfs_rq->curr: entity_key(69958950) vruntime(21126236668404902) deadline(21126236859551568) weight(15360)
queued se: entity_key(32498584) vruntime(21126236630944536) deadline(21126236822091202) weight(15360)
After avg_vruntime():
cfs_rq: zero_vruntime(21126236598445952) sum_w_vruntime(-1074569456640) sum_weight(15360)
cfs_rq->curr: entity_key(69958950) vruntime(21126236668404902) deadline(21126236859551568) weight(15360)
queued se: entity_key(32498584) vruntime(21126236630944536) deadline(21126236822091202) weight(15360)
NULL Pick!
The updated zero_vruntime is behind the vruntime of both queued entities.
Now that I have a reliable trigger for the crash, I'll just start
tracing everything before I run grep (although I suspect something may
have gone bad long before that, but we can be hopeful).
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 0:38 ` K Prateek Nayak
2026-03-31 4:58 ` K Prateek Nayak
@ 2026-03-31 7:08 ` Peter Zijlstra
2026-03-31 7:14 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-31 7:08 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> The above doesn't recover after an avg_vruntime(). Btw I'm running:
>
> nice -n 19 stress-ng --yield 32 -t 1000000s&
> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
And you're running that on a 16 cpu machine / vm ?
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 7:08 ` Peter Zijlstra
@ 2026-03-31 7:14 ` Peter Zijlstra
2026-03-31 8:49 ` K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-31 7:14 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
>
> > The above doesn't recover after an avg_vruntime(). Btw I'm running:
> >
> > nice -n 19 stress-ng --yield 32 -t 1000000s&
> > while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> And you're running that on a 16 cpu machine / vm ?
W00t, it went b00m. Ok, let me go add some tracing.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 7:14 ` Peter Zijlstra
@ 2026-03-31 8:49 ` K Prateek Nayak
2026-03-31 9:29 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-31 8:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
>> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
>>
>>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
>>>
>>> nice -n 19 stress-ng --yield 32 -t 1000000s&
>>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>>
>> And you're running that on a 16 cpu machine / vm ?
>
> W00t, it went b00m. Ok, let me go add some tracing.
I could only repro it on baremetal after a few hours, but good to know it
exploded effortlessly on your end! Was this a 16vCPU VM with the same
recipe?
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 8:49 ` K Prateek Nayak
@ 2026-03-31 9:29 ` Peter Zijlstra
2026-03-31 12:20 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-31 9:29 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Tue, Mar 31, 2026 at 02:19:54PM +0530, K Prateek Nayak wrote:
> On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> > On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> >> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> >>
> >>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
> >>>
> >>> nice -n 19 stress-ng --yield 32 -t 1000000s&
> >>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> >>
> >> And you're running that on a 16 cpu machine / vm ?
> >
> > W00t, it went b00m. Ok, let me go add some tracing.
>
> I could only repro it on baremetal after a few hours, but good to know it
> exploded effortlessly on your end! Was this a 16vCPU VM with the same
> recipe?
Yep. It almost insta triggers. Trying to make sense of the traces now.
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 9:29 ` Peter Zijlstra
@ 2026-03-31 12:20 ` Peter Zijlstra
2026-03-31 16:14 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-31 12:20 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Tue, Mar 31, 2026 at 11:29:09AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 31, 2026 at 02:19:54PM +0530, K Prateek Nayak wrote:
> > On 3/31/2026 12:44 PM, Peter Zijlstra wrote:
> > > On Tue, Mar 31, 2026 at 09:08:23AM +0200, Peter Zijlstra wrote:
> > >> On Tue, Mar 31, 2026 at 06:08:27AM +0530, K Prateek Nayak wrote:
> > >>
> > >>> The above doesn't recover after an avg_vruntime(). Btw I'm running:
> > >>>
> > >>> nice -n 19 stress-ng --yield 32 -t 1000000s&
> > >>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
> > >>
> > >> And you're running that on a 16 cpu machine / vm ?
> > >
> > > W00t, it went b00m. Ok, let me go add some tracing.
> >
> > I could only repro it on baremetal after a few hours, but good to know it
> > exploded effortlessly on your end! Was this a 16vCPU VM with the same
> > recipe?
>
> Yep. It almost insta triggers. Trying to make sense of the traces now.
So the thing I'm seeing is that avg_vruntime() is behind where it
should be; not by much, but every time it goes *boom* it is just far
enough behind that no entity is eligible.
sched-messaging-2192 [039] d..2. 77.136100: pick_task_fair: cfs_rq(39:ff4a5bc7bebeb680): sum_w_vruntime(194325882) sum_weight(5120) zero_vruntime(105210161141318) avg_vruntime(105210161179272)
sched-messaging-2192 [039] d..2. 77.136100: pick_task_fair: T se(ff4a5bc79040c940): vruntime(105210161556539) deadline(105210164099443) weight(1048576) -- sched-messaging:2340
sched-messaging-2192 [039] d..2. 77.136101: pick_task_fair: T se(ff4a5bc794ce98c0): vruntime(105210161435669) deadline(105210164235669) weight(1048576) -- sched-messaging:2212
sched-messaging-2192 [039] d..2. 77.136101: pick_task_fair: T se(ff4a5bc7952d3100): vruntime(105210161580240) deadline(105210164380240) weight(1048576) -- sched-messaging:2381
sched-messaging-2192 [039] d..2. 77.136102: pick_task_fair: T se(ff4a5bc794c318c0): vruntime(105210161818264) deadline(105210164518004) weight(1048576) -- sched-messaging:2306
sched-messaging-2192 [039] d..2. 77.136103: pick_task_fair: T se(ff4a5bc796b4b100): vruntime(105210161831546) deadline(105210164631546) weight(1048576) -- sched-messaging:2551
sched-messaging-2192 [039] d..2. 77.136104: pick_task_fair: min_lag(-652274) max_lag(0) limit(38000000)
sched-messaging-2192 [039] d..2. 77.136104: pick_task_fair: picked NULL!!
If we compute the avg_vruntime() manually, then we get a
sum_w_vruntime contribution for each task:
(105210161556539-105210161141318)*1024
425186304
(105210161435669-105210161141318)*1024
301415424
(105210161580240-105210161141318)*1024
449456128
(105210161818264-105210161141318)*1024
693192704
(105210161831546-105210161141318)*1024
706793472
Which combined is:
425186304+301415424+449456128+693192704+706793472
2576044032
NOTE: this is larger than the tracked sum_w_vruntime(194325882).
So, divided by the weight and added to zero_vruntime, that gives:
2576044032/5120
503133.60000000000000000000
105210161141318+503133.60000000000000000000
105210161644451.60000000000000000000
Which is where avg_vruntime() *should* be, except it ends up being at:
avg_vruntime(105210161179272), which then results in no eligible entities.
Note that with the computed avg, the first 3 entities would be eligible.
This suggests I go build a parallel infrastructure to double-check when
and where this goes sideways.
... various attempts later ....
sched-messaging-1021 [009] d..2. 34.483159: update_curr: T<=> se(ff37d0bcd52718c0): vruntime(56921690782736, E) deadline(56921693563331) weight(1048576) -- sched-messaging:1021
sched-messaging-1021 [009] d..2. 34.483160: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(-48327) sum_w_vruntime(811471242) zero_vruntime(56921691575188)
sched-messaging-1021 [009] d..2. 34.483160: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(811471242) sum_weight(6159) zero_vruntime(56921691575188) avg_vruntime(56921691706941)
sched-messaging-1021 [009] d..2. 34.483160: pick_task_fair: T< se(ff37d0bcd5c6c940): vruntime(56921691276707, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1021 [009] d..2. 34.483161: pick_task_fair: T se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1021 [009] d..2. 34.483162: pick_task_fair: T se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1021 [009] d..2. 34.483163: pick_task_fair: T se(ff37d0bcd56dc940): vruntime(56921691637185, E) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1021 [009] d..2. 34.483164: pick_task_fair: T se(ff37d0bcd43eb100): vruntime(56921691629067, E) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1021 [009] d..2. 34.483164: pick_task_fair: T se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1021 [009] d..2. 34.483165: pick_task_fair: T se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1021 [009] d..2. 34.483165: pick_task_fair: min_lag(-42989869) max_lag(430234) limit(38000000)
sched-messaging-1021 [009] d..2. 34.483166: pick_task_fair: swv(811471242)
sched-messaging-1021 [009] d..2. 34.483167: __dequeue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(1117115786) zero_vruntime(56921691575188)
set_next_task(1276):
swv -= key * weight
811471242 - (56921691276707-56921691575188)*1024
1117115786
OK
sched-messaging-1276 [009] d.h2. 34.483168: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691285759, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d.h2. 34.483169: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(22156) sum_w_vruntime(319064896) zero_vruntime(56921691597344)
swv -= sw * delta
1117115786 - 5135 * 22156
1003344726
WTF!?!
zv += delta
56921691575188 + 22156
56921691597344
OK
sched-messaging-1276 [009] d.h2. 34.483169: place_entity: T< se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d.h2. 34.483170: __enqueue_entity: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(-627321024) zero_vruntime(56921691597344)
swv += key * weight
Should be:
1003344726 + (56921690673139 - 56921691597344) * 1024
56958806 [*]
But is:
319064896 + (56921690673139 - 56921691597344) * 1024
-627321024
Consistent, but wrong
sched-messaging-1276 [009] d..2. 34.483173: update_curr: T<=> se(ff37d0bcd5c6c940): vruntime(56921691289762, E) deadline(56921694076707) weight(1048576) -- sched-messaging:1276
sched-messaging-1276 [009] d..2. 34.483173: __avg_vruntime: cfs_rq(9:ff37d0bcfe46b680): delta(571) sum_w_vruntime(180635073) zero_vruntime(56921691466161)
This would be dequeue(1276) update_entity_lag(), but the numbers make no sense...
swv -= sw * delta
-627321024 - 6159 * 571
-630837813 != 180635073
zv += delta
56921691597344 + 571
56921691597915 != 56921691466161
Also, the actual delta would be (zero_vruntime - prev zero_vruntime):
56921691466161-56921691597344
-131183
At which point we can construct the swv value from where we left off [*]
56958806 - -131183 * 6159
864914903
But the actual state makes no frigging sense....
sched-messaging-1276 [009] d..2. 34.483174: pick_task_fair: cfs_rq(9:ff37d0bcfe46b680): sum_w_vruntime(180635073) sum_weight(6159) zero_vruntime(56921691466161) avg_vruntime(56921691495489)
sched-messaging-1276 [009] d..2. 34.483174: pick_task_fair: T< se(ff37d0bcd52718c0): vruntime(56921690673139, E) deadline(56921693473139) weight(1048576) -- sched-messaging:1021
sched-messaging-1276 [009] d..2. 34.483175: pick_task_fair: T se(ff37d0bcd56f98c0): vruntime(56921691917863) deadline(56921694079320) weight(1048576) -- sched-messaging:1201
sched-messaging-1276 [009] d..2. 34.483175: pick_task_fair: T se(ff37d0bcd5344940): vruntime(56921691340323, E) deadline(56921694140323) weight(1048576) -- sched-messaging:1036
sched-messaging-1276 [009] d..2. 34.483176: pick_task_fair: T se(ff37d0bcd56dc940): vruntime(56921691637185) deadline(56921694403038) weight(1048576) -- sched-messaging:1179
sched-messaging-1276 [009] d..2. 34.483177: pick_task_fair: T se(ff37d0bcd43eb100): vruntime(56921691629067) deadline(56921694429067) weight(1048576) -- sched-messaging:786
sched-messaging-1276 [009] d..2. 34.483177: pick_task_fair: T se(ff37d0bcd5d80080): vruntime(56921691810771) deadline(56921694610771) weight(1048576) -- sched-messaging:1291
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: T se(ff37d0bcd027b100): vruntime(56921734696810) deadline(56921917287562) weight(15360) -- stress-ng-yield:693
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: min_lag(-43201321) max_lag(822350) limit(38000000)
sched-messaging-1276 [009] d..2. 34.483178: pick_task_fair: swv(864914903)
sched-messaging-1276 [009] d..2. 34.483179: pick_task_fair: FAIL
Generated with the below patch on top of -rc6.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed..5462aeac1c45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,6 +678,11 @@ sum_w_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->sum_w_vruntime += key * weight;
cfs_rq->sum_weight += weight;
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}
static void
@@ -688,6 +693,11 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->sum_w_vruntime -= key * weight;
cfs_rq->sum_weight -= weight;
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}
static inline
@@ -698,6 +708,12 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
*/
cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
cfs_rq->zero_vruntime += delta;
+
+ trace_printk("cfs_rq(%d:%px): delta(%Ld) sum_w_vruntime(%Ld) zero_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ delta,
+ cfs_rq->sum_w_vruntime,
+ cfs_rq->zero_vruntime);
}
/*
@@ -712,7 +728,7 @@ void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
* This means it is one entry 'behind' but that puts it close enough to where
* the bound on entity_key() is at most two lag bounds.
*/
-u64 avg_vruntime(struct cfs_rq *cfs_rq)
+static u64 __avg_vruntime(struct cfs_rq *cfs_rq, bool update)
{
struct sched_entity *curr = cfs_rq->curr;
long weight = cfs_rq->sum_weight;
@@ -743,9 +759,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
delta = curr->vruntime - cfs_rq->zero_vruntime;
}
- update_zero_vruntime(cfs_rq, delta);
+ if (update) {
+ update_zero_vruntime(cfs_rq, delta);
+ return cfs_rq->zero_vruntime;
+ }
- return cfs_rq->zero_vruntime;
+ return cfs_rq->zero_vruntime + delta;
+}
+
+u64 avg_vruntime(struct cfs_rq *cfs_rq)
+{
+ return __avg_vruntime(cfs_rq, true);
}
static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
@@ -1078,11 +1102,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
return best;
}
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
-{
- return __pick_eevdf(cfs_rq, true);
-}
-
struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
{
struct rb_node *last = rb_last(&cfs_rq->tasks_timeline.rb_root);
@@ -1279,6 +1298,8 @@ s64 update_curr_common(struct rq *rq)
return update_se(rq, &rq->donor->se);
}
+static void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick);
+
/*
* Update the current task's runtime statistics.
*/
@@ -1304,6 +1325,10 @@ static void update_curr(struct cfs_rq *cfs_rq)
curr->vruntime += calc_delta_fair(delta_exec, curr);
resched = update_deadline(cfs_rq, curr);
+ if (resched)
+ avg_vruntime(cfs_rq);
+
+ print_se(cfs_rq, curr, true);
if (entity_is_task(curr)) {
/*
@@ -3849,6 +3874,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
bool rel_vprot = false;
u64 vprot;
+ print_se(cfs_rq, se, true);
+
if (se->on_rq) {
/* commit outstanding execution time */
update_curr(cfs_rq);
@@ -3896,6 +3923,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
__enqueue_entity(cfs_rq, se);
cfs_rq->nr_queued++;
}
+
+ print_se(cfs_rq, se, true);
}
static void reweight_task_fair(struct rq *rq, struct task_struct *p,
@@ -5251,6 +5280,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
+ print_se(cfs_rq, se, true);
return;
}
@@ -5266,6 +5296,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* EEVDF: vd_i = ve_i + r_i/w_i
*/
se->deadline = se->vruntime + vslice;
+ print_se(cfs_rq, se, true);
}
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -5529,31 +5560,6 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool first)
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
-static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
-
-/*
- * Pick the next process, keeping these things in mind, in this order:
- * 1) keep things fair between processes/task groups
- * 2) pick the "next" process, since someone really wants that to run
- * 3) pick the "last" process, for cache locality
- * 4) do not run the "skip" process, if something else is available
- */
-static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
-{
- struct sched_entity *se;
-
- se = pick_eevdf(cfs_rq);
- if (se->sched_delayed) {
- dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
- /*
- * Must not reference @se again, see __block_task().
- */
- return NULL;
- }
- return se;
-}
-
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
@@ -8942,6 +8948,123 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
resched_curr_lazy(rq);
}
+static __always_inline
+void print_se(struct cfs_rq *cfs_rq, struct sched_entity *se, bool pick)
+{
+ bool curr = (se == cfs_rq->curr);
+ bool el = entity_eligible(cfs_rq, se);
+ bool prot = protect_slice(se);
+ bool task = false;
+ char *comm = NULL;
+ int pid = -1;
+
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
+ task = true;
+ comm = p->comm;
+ pid = p->pid;
+ }
+
+ trace_printk("%c%c%c%c se(%px): vruntime(%Ld%s) deadline(%Ld) weight(%ld) -- %s:%d\n",
+ task ? 'T' : '@',
+ pick ? '<' : ' ',
+ curr && prot ? '=' : ' ',
+ curr ? '>' : ' ',
+ se, se->vruntime, el ? ", E" : "",
+ se->deadline, se->load.weight,
+ comm, pid);
+}
+
+static struct sched_entity *pick_debug(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *pick = __pick_eevdf(cfs_rq, true);
+ struct sched_entity *curr = cfs_rq->curr;
+ s64 min_lag = 0, max_lag = 0;
+ u64 runtime, weight, z_vruntime, avg;
+ u64 swv = 0;
+
+ s64 limit = 10*(sysctl_sched_base_slice + TICK_NSEC);
+
+ if (curr && !curr->on_rq)
+ curr = NULL;
+
+ runtime = cfs_rq->sum_w_vruntime;
+ weight = cfs_rq->sum_weight;
+ z_vruntime = cfs_rq->zero_vruntime;
+ barrier();
+ avg = __avg_vruntime(cfs_rq, false);
+
+ trace_printk("cfs_rq(%d:%px): sum_w_vruntime(%Ld) sum_weight(%Ld) zero_vruntime(%Ld) avg_vruntime(%Ld)\n",
+ rq_of(cfs_rq)->cpu, cfs_rq,
+ runtime, weight,
+ z_vruntime, avg);
+
+ for (struct rb_node *node = cfs_rq->tasks_timeline.rb_leftmost;
+ node; node = rb_next(node)) {
+ struct sched_entity *se = __node_2_se(node);
+ if (se == curr)
+ curr = NULL;
+ print_se(cfs_rq, se, pick == se);
+
+ swv += (se->vruntime - z_vruntime) * scale_load_down(se->load.weight);
+
+ s64 vlag = avg - se->vruntime;
+ min_lag = min(min_lag, vlag);
+ max_lag = max(max_lag, vlag);
+ }
+
+ if (curr) {
+ print_se(cfs_rq, curr, pick == curr);
+
+ s64 vlag = avg - curr->vruntime;
+ min_lag = min(min_lag, vlag);
+ max_lag = max(max_lag, vlag);
+ }
+
+ trace_printk(" min_lag(%Ld) max_lag(%Ld) limit(%Ld)\n", min_lag, max_lag, limit);
+ trace_printk(" swv(%Ld)\n", swv);
+
+ if (swv != runtime) {
+ trace_printk("FAIL\n");
+ tracing_off();
+ printk("FAIL FAIL FAIL!!!\n");
+ }
+
+// WARN_ON_ONCE(min_lag < -limit || max_lag > limit);
+
+ if (!pick) {
+ trace_printk("picked NULL!!\n");
+ tracing_off();
+ printk("FAIL FAIL FAIL!!!\n");
+ return __pick_first_entity(cfs_rq);
+ }
+
+ return pick;
+}
+
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the "next" process, since someone really wants that to run
+ * 3) pick the "last" process, for cache locality
+ * 4) do not run the "skip" process, if something else is available
+ */
+static struct sched_entity *
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+
+ se = pick_debug(cfs_rq);
+ if (se->sched_delayed) {
+ dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+ /*
+ * Must not reference @se again, see __block_task().
+ */
+ return NULL;
+ }
+ return se;
+}
+
static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
{
struct sched_entity *se;
@@ -9129,6 +9252,7 @@ static void yield_task_fair(struct rq *rq)
if (entity_eligible(cfs_rq, se)) {
se->vruntime = se->deadline;
se->deadline += calc_delta_fair(se->slice, se);
+ avg_vruntime(cfs_rq);
}
}
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 12:20 ` Peter Zijlstra
@ 2026-03-31 16:14 ` Peter Zijlstra
2026-03-31 17:02 ` K Prateek Nayak
2026-03-31 22:40 ` John Stultz
0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-03-31 16:14 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On Tue, Mar 31, 2026 at 02:20:35PM +0200, Peter Zijlstra wrote:
> WTF!?!
I'm thinking this might help... I'll try once I'm back home again.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index b24f40f05019..15bf45b6f912 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
+ u64 avruntime;
struct sched_entity *last, *first, *root;
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
@@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
if (last)
right_vruntime = last->vruntime;
zero_vruntime = cfs_rq->zero_vruntime;
+ avruntime = avg_vruntime(cfs_rq);
raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
@@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
SPLIT_NS(zero_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
- SPLIT_NS(avg_vruntime(cfs_rq)));
+ SPLIT_NS(avruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
SPLIT_NS(right_vruntime));
spread = right_vruntime - left_vruntime;
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 16:14 ` Peter Zijlstra
@ 2026-03-31 17:02 ` K Prateek Nayak
2026-03-31 22:40 ` John Stultz
1 sibling, 0 replies; 55+ messages in thread
From: K Prateek Nayak @ 2026-03-31 17:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, mingo, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, linux-kernel, wangtao554,
quzicheng, dsmythies, shubhang, Suleiman Souhlal
On 3/31/2026 9:44 PM, Peter Zijlstra wrote:
> @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> if (last)
> right_vruntime = last->vruntime;
> zero_vruntime = cfs_rq->zero_vruntime;
> + avruntime = avg_vruntime(cfs_rq);
> raw_spin_rq_unlock_irqrestore(rq, flags);
Ah! I didn't notice we dropped the lock as soon as we were done reading
the necessary values. That explains why reading the debugfs file under
heavy load crashed.
Ran 100 loops of reading the debug file while running stress-ng
+ sched-messaging, and I haven't seen any crashes yet, so feel free to
include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking
2026-03-31 16:14 ` Peter Zijlstra
2026-03-31 17:02 ` K Prateek Nayak
@ 2026-03-31 22:40 ` John Stultz
1 sibling, 0 replies; 55+ messages in thread
From: John Stultz @ 2026-03-31 22:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, wangtao554, quzicheng, dsmythies, shubhang,
Suleiman Souhlal
On Tue, Mar 31, 2026 at 9:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 31, 2026 at 02:20:35PM +0200, Peter Zijlstra wrote:
>
> I'm thinking this might help... I'll try once I'm back home again.
>
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index b24f40f05019..15bf45b6f912 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -902,6 +902,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
> void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> {
> s64 left_vruntime = -1, zero_vruntime, right_vruntime = -1, left_deadline = -1, spread;
> + u64 avruntime;
> struct sched_entity *last, *first, *root;
> struct rq *rq = cpu_rq(cpu);
> unsigned long flags;
> @@ -925,6 +926,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> if (last)
> right_vruntime = last->vruntime;
> zero_vruntime = cfs_rq->zero_vruntime;
> + avruntime = avg_vruntime(cfs_rq);
> raw_spin_rq_unlock_irqrestore(rq, flags);
>
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_deadline",
> @@ -934,7 +936,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "zero_vruntime",
> SPLIT_NS(zero_vruntime));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
> - SPLIT_NS(avg_vruntime(cfs_rq)));
> + SPLIT_NS(avruntime));
> SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
> SPLIT_NS(right_vruntime));
> spread = right_vruntime - left_vruntime;
>
This on top of your two previous changes has run for 5 hours for me
now, which is usually where I'd call things "good" when bisecting.
I'm going to leave it overnight, but tentatively:
Tested-by: John Stultz <jstultz@google.com>
thanks
-john
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-03-30 7:55 ` K Prateek Nayak
2026-03-30 9:27 ` Peter Zijlstra
@ 2026-04-02 5:28 ` K Prateek Nayak
2026-04-02 10:22 ` Peter Zijlstra
1 sibling, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-04-02 5:28 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel, wangtao554, quzicheng, dsmythies,
shubhang
On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> ------------[ cut here ]------------
> (w_vruntime >> 63) != (w_vruntime >> 62)
> WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
Back to this: I still see this with latest set of changes on
queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
are closing in on the s64 limitations of "sum_w_vruntime" which isn't
very comforting.
Here is one scenario where it was triggered when running:
stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
on a 256CPUs machine after about an hour into the run:
__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:
vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754
Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -12,837,944,014,404,397,056
Now, the fact that it doesn't crash suggests to me that the later
avg_vruntime() calculation restores normality and sum_w_vruntime
becomes -57498 (vlag_initial) * 90891264 (weight) =
-5,226,065,897,472 (assuming curr's vruntime is still the same), which
only requires 43 bits.
I also added the following at the bottom of dequeue_entity():
WARN_ON_ONCE(!cfs_rq->nr_queued && cfs_rq->sum_w_vruntime)
which never triggered when the cfs_rq went idle, so it isn't as if we
failed to account sum_w_vruntime properly. There was just a momentary
overflow, so we are fine for now, but will it always be that way?
One way to avoid the warning entirely would be to pull the zero_vruntime
close to avg_vruntime if we are enqueuing a very heavy entity.
The correct way to do this would be to compute the actual avg_vruntime()
and move the zero_vruntime to that point (but that requires at least one
multiply + divide + update_zero_vruntime()).
One seemingly cheap way by which I've been able to avoid the warning is
with:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..bc708bb8b5d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;
if (!se->custom_slice)
@@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);
lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ /*
+ * If the entity_key() * sum_weight of all the enqueued entities
+ * is more than the sum_w_vruntime, move the zero_vruntime
+ * point to the vruntime of the entity which prevents using
+ * more bits than necessary for sum_w_vruntime until the
+ * next avg_vruntime().
+ *
+ * XXX: Cheap enough check?
+ */
+ if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
+ update_zero = true;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
@@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->vruntime = vruntime - lag;
+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
---
But I'm sure it'll make people nervous since we basically move the
zero_vruntime to se->vruntime. It isn't too bad if:
abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
but we already know that the latter overflows so is there any other
cheaper indicator that we can use to detect the necessity to adjust the
avg_vruntime beforehand at place_entity()?
--
Thanks and Regards,
Prateek
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-04-02 5:28 ` K Prateek Nayak
@ 2026-04-02 10:22 ` Peter Zijlstra
2026-04-02 10:56 ` K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-04-02 10:22 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang
On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
> > ------------[ cut here ]------------
> > (w_vruntime >> 63) != (w_vruntime >> 62)
> > WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
>
> Back to this: I still see this with latest set of changes on
> queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
> are closing in on the s64 limitations of "sum_w_vruntime" which isn't
> very comforting.
Yeah, we are pushing 64bit pretty hard :/ And if all we cared about
was x86_64 I'd have long since used the fact that imul has a 128bit
result and idiv actually divides 128bit. But even among 64bit
architectures that is somewhat rare :/
> Here is one scenario where it was triggered when running:
>
> stress-ng --yield=32 -t 10000000s&
> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>
> on a 256CPUs machine after about an hour into the run:
>
> __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
> cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
> cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
>
> The above comes from __enqueue_entity() after a place_entity(). Breaking
> this down:
>
> vlag_initial = 57498
> vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
>
> vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
> entity_key(se, cfs_rq) = -141,245,081,754
>
> Now, multiplying the entity_key by its own weight results in
> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
> in Python, without overflow, this would be: -12,837,944,014,404,397,056
Oh gawd, this is a 'fun' case.
> One way to avoid the warning entirely would be to pull the zero_vruntime
> close to avg_vruntime if we are enqueuing a very heavy entity.
>
> The correct way to do this would be to compute the actual avg_vruntime()
> and move the zero_vruntime to that point (but that requires at least one
> multiply + divide + update_zero_vruntime()).
>
> One seemingly cheap way by which I've been able to avoid the warning is
> with:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..bc708bb8b5d0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5329,6 +5329,7 @@ static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> u64 vslice, vruntime = avg_vruntime(cfs_rq);
> + bool update_zero = false;
> s64 lag = 0;
>
> if (!se->custom_slice)
> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>
> lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
> + /*
> + * If the entity_key() * sum_weight of all the enqueued entities
> + * is more than the sum_w_vruntime, move the zero_vruntime
> + * point to the vruntime of the entity which prevents using
> + * more bits than necessary for sum_w_vruntime until the
> + * next avg_vruntime().
> + *
> + * XXX: Cheap enough check?
> + */
> + if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
> + update_zero = true;
> if (WARN_ON_ONCE(!load))
> load = 1;
> lag = div64_long(lag, load);
> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> se->vruntime = vruntime - lag;
>
> + if (update_zero)
> + update_zero_vruntime(cfs_rq, -lag);
> +
> if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
> se->deadline += se->vruntime;
> se->rel_deadline = 0;
> ---
>
> But I'm sure it'll make people nervous since we basically move the
> zero_vruntime to se->vruntime. It isn't too bad if:
>
> abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
>
> but we already know that the latter overflows so is there any other
> cheaper indicator that we can use to detect the necessity to adjust the
> avg_vruntime beforehand at place_entity()?
So in general I think it would be fine to move zero_vruntime to the
heaviest entity in the tree. And if there are multiple equal heaviest
weights, any one of them should be fine.
Per necessity heavy entities are more tightly clustered -- the lag is
inversely proportional to weight, and the spread is proportional to the
lag bound.
I suspect something simple like comparing the entity weight against the
sum_weight might be enough. If the pre-existing tree is, in aggregate,
heavier than the new element, the avg will not move very drastically.
However, if the new element is (significantly) heavier than the tree,
the avg will move significantly (as demonstrated here).
That is, something like the below... But with a comment ofc :-)
Does that make sense?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9298f49f842c..7fbd9538fe30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5329,6 +5329,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;
if (!se->custom_slice)
@@ -5345,7 +5346,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- long load;
+ long load, weight;
lag = se->vlag;
@@ -5405,14 +5406,21 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (curr && curr->on_rq)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);
- lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ lag *= load + weight;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
+
+ if (weight > load)
+ update_zero = true;
}
se->vruntime = vruntime - lag;
+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-04-02 10:22 ` Peter Zijlstra
@ 2026-04-02 10:56 ` K Prateek Nayak
2026-04-03 4:02 ` K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-04-02 10:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang
Hello Peter,
On 4/2/2026 3:52 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2026 at 10:58:18AM +0530, K Prateek Nayak wrote:
>> On 3/30/2026 1:25 PM, K Prateek Nayak wrote:
>>> ------------[ cut here ]------------
>>> (w_vruntime >> 63) != (w_vruntime >> 62)
>>> WARNING: kernel/sched/fair.c:692 at __enqueue_entity+0x382/0x3a0, CPU#5: stress-ng/5062
>>
>> Back to this: I still see this with latest set of changes on
>> queue:sched/urgent but it doesn't go kaboom. Nonetheless, it suggests we
>> are closing in on the s64 limitations of "sum_w_vruntime" which isn't
>> very comforting.
>
> Yeah, we are pushing 64bit pretty hard :/ And if all we cared about
> was x86_64 I'd have long since used the fact that imul has a 128bit
> result and idiv actually divides 128bit. But even among 64bit
> architectures that is somewhat rare :/
Guess we have to make do with what is more abundant. We haven't crashed
and burnt yet so it should be a fun debug for future us when we get
there :-)
>
>> Here is one scenario where it was triggered when running:
>>
>> stress-ng --yield=32 -t 10000000s&
>> while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
>>
>> on a 256CPUs machine after about an hour into the run:
>>
>> __enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
>> cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
>> cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
>>
>> The above comes from __enqueue_entity() after a place_entity(). Breaking
>> this down:
>>
>> vlag_initial = 57498
>> vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
>>
>> vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
>> entity_key(se, cfs_rq) = -141,245,081,754
>>
>> Now, multiplying the entity_key by its own weight results in
>> 5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
>> in Python, without overflow, this would be: -12,837,944,014,404,397,056
>
> Oh gawd, this is a 'fun' case.
>
>> One way to avoid the warning entirely would be to pull the zero_vruntime
>> close to avg_vruntime if we are enqueuing a very heavy entity.
>>
>> The correct way to do this would be to compute the actual avg_vruntime()
>> and move the zero_vruntime to that point (but that requires at least one
>> multiply + divide + update_zero_vruntime()).
>>
>> One seemingly cheap way by which I've been able to avoid the warning is
>> with:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 226509231e67..bc708bb8b5d0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5329,6 +5329,7 @@ static void
>> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> {
>> u64 vslice, vruntime = avg_vruntime(cfs_rq);
>> + bool update_zero = false;
>> s64 lag = 0;
>>
>> if (!se->custom_slice)
>> @@ -5406,6 +5407,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> load += avg_vruntime_weight(cfs_rq, curr->load.weight);
>>
>> lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
>> + /*
>> + * If the entity_key() * sum_weight of all the enqueued entities
>> + * is more than the sum_w_vruntime, move the zero_vruntime
>> + * point to the vruntime of the entity which prevents using
>> + * more bits than necessary for sum_w_vruntime until the
>> + * next avg_vruntime().
>> + *
>> + * XXX: Cheap enough check?
>> + */
>> + if (abs(lag) > abs(cfs_rq->sum_w_vruntime))
>> + update_zero = true;
>> if (WARN_ON_ONCE(!load))
>> load = 1;
>> lag = div64_long(lag, load);
>> @@ -5413,6 +5425,9 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>
>> se->vruntime = vruntime - lag;
>>
>> + if (update_zero)
>> + update_zero_vruntime(cfs_rq, -lag);
>> +
>> if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
>> se->deadline += se->vruntime;
>> se->rel_deadline = 0;
>> ---
>>
>> But I'm sure it'll make people nervous since we basically move the
>> zero_vruntime to se->vruntime. It isn't too bad if:
>>
>> abs(sum_w_vruntime - (lag * load)) < abs(lag * se->load.weight)
>>
>> but we already know that the latter overflows so is there any other
>> cheaper indicator that we can use to detect the necessity to adjust the
>> avg_vruntime beforehand at place_entity()?
>
> So in general I think it would be fine to move zero_vruntime to the
> heaviest entity in the tree. And if there are multiple equal heaviest
> weights, any one of them should be fine.
>
> Per necessity heavy entities are more tightly clustered -- the lag is
> inversely proportional to weight, and the spread is proportional to the
> lag bound.
>
> I suspect something simple like comparing the entity weight against the
> sum_weight might be enough. If the pre-existing tree is, in aggregate,
> heavier than the new element, the avg will not move very drastically.
> However, if the new element is (significantly) heavier than the tree,
> the avg will move significantly (as demonstrated here).
>
> That is, something like the below... But with a comment ofc :-)
>
> Does that make sense?
Let me go queue an overnight test to see if I trip that warning or
not. I initially did think this might work but then convinced myself
that testing the spread with "sum_w_vruntime" might prove to be
better but we'll know for sure tomorrow ;-)
--
Thanks and Regards,
Prateek
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-04-02 10:56 ` K Prateek Nayak
@ 2026-04-03 4:02 ` K Prateek Nayak
2026-04-07 12:00 ` Peter Zijlstra
0 siblings, 1 reply; 55+ messages in thread
From: K Prateek Nayak @ 2026-04-03 4:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang
On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
>> That is, something like the below... But with a comment ofc :-)
>>
>> Does that make sense?
>
> Let me go queue an overnight test to see if I trip that warning or
> not.
Didn't trip any warning and the machine is still up and running
after 15 Hours so feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Perhaps the comment can read something like:
/*
* A heavy entity can pull the avg_vruntime close to its
* vruntime post enqueue but the zero_vruntime point is
* only updated at the next update_deadline() / enqueue
* / dequeue.
*
* Until then, the sum_w_vruntime grows quadratically,
* proportional to the entity's weight (w_i) as:
*
* sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
*
* If w_i > W, it is beneficial to pull the
* zero_vruntime towards the entity's vruntime (V_i)
* since the sum_w_vruntime would only grow by
* (lag_i * W) which consumes fewer bits than leaving
* the zero_vruntime at the pre-enqueue avg_vruntime.
*/
if (weight > load)
update_zero = true;
Feel free to reword as you see fit :-)
--
Thanks and Regards,
Prateek
* Re: [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime
2026-04-03 4:02 ` K Prateek Nayak
@ 2026-04-07 12:00 ` Peter Zijlstra
2026-04-07 13:42 ` [tip: sched/core] sched/fair: Avoid overflow in enqueue_entity() tip-bot2 for K Prateek Nayak
0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2026-04-07 12:00 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, mingo, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, wangtao554, quzicheng,
dsmythies, shubhang
On Fri, Apr 03, 2026 at 09:32:22AM +0530, K Prateek Nayak wrote:
> On 4/2/2026 4:26 PM, K Prateek Nayak wrote:
> >> That is, something like the below... But with a comment ofc :-)
> >>
> >> Does that make sense?
> >
> > Let me go queue an overnight test to see if I trip that warning or
> > not.
>
> Didn't trip any warning and the machine is still up and running
> after 15 Hours so feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Perhaps the comment can read something like:
>
> /*
> * A heavy entity can pull the avg_vruntime close to its
> * vruntime post enqueue but the zero_vruntime point is
> * only updated at the next update_deadline() / enqueue
> * / dequeue.
> *
> * Until then, the sum_w_vruntime grows quadratically,
> * proportional to the entity's weight (w_i) as:
> *
> * sum_w_vruntime -= (lag_i * (W + w_i) / W) * w_i
> *
> * If w_i > W, it is beneficial to pull the
> * zero_vruntime towards the entity's vruntime (V_i)
> * since the sum_w_vruntime would only grow by
> * (lag_i * W) which consumes fewer bits than leaving
> * the zero_vruntime at the pre-enqueue avg_vruntime.
> */
> if (weight > load)
> update_zero = true;
>
> Feel free to reword as you see fit :-)
I've made it like so. You did all the hard work after all. Thanks!
---
Subject: sched/fair: Avoid overflow in enqueue_entity()
From: K Prateek Nayak <kprateek.nayak@amd.com>
Date: Tue Apr 7 13:36:17 CEST 2026
Here is one scenario which was triggered when running:
stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
on a 256CPUs machine after about an hour into the run:
__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:
vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754
Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -12,837,944,014,404,397,056
Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.
Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;
if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, stru
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- long load;
+ long load, weight;
lag = se->vlag;
@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (curr && curr->on_rq)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);
- lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ lag *= load + weight;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
+
+ /*
+ * A heavy entity (relative to the tree) will pull the
+ * avg_vruntime close to its vruntime position on enqueue. But
+ * the zero_vruntime point is only updated at the next
+ * update_deadline()/place_entity()/update_entity_lag().
+ *
+ * Specifically (see the comment near avg_vruntime_weight()):
+ *
+ * sum_w_vruntime = \Sum (v_i - v0) * w_i
+ *
+ * Note that if v0 is near a light entity, both terms will be
+ * small for the light entity, while in that case both terms
+ * are large for the heavy entity, leading to risk of
+ * overflow.
+ *
+ * OTOH if v0 is near the heavy entity, then the difference is
+ * larger for the light entity, but the factor is small, while
+ * for the heavy entity the difference is small but the factor
+ * is large. Avoiding the multiplication overflow.
+ */
+ if (weight > load)
+ update_zero = true;
}
se->vruntime = vruntime - lag;
+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
* [tip: sched/core] sched/fair: Avoid overflow in enqueue_entity()
2026-04-07 12:00 ` Peter Zijlstra
@ 2026-04-07 13:42 ` tip-bot2 for K Prateek Nayak
0 siblings, 0 replies; 55+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-04-07 13:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: K Prateek Nayak, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 556146ce5e9476db234134c46ddf0e154ca17028
Gitweb: https://git.kernel.org/tip/556146ce5e9476db234134c46ddf0e154ca17028
Author: K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate: Tue, 07 Apr 2026 13:36:17 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 07 Apr 2026 14:02:00 +02:00
sched/fair: Avoid overflow in enqueue_entity()
Here is one scenario which was triggered when running:
stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
on a 256CPUs machine after about an hour into the run:
__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:
vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754
Now, multiplying the entity_key by its own weight results in
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -12,837,944,014,404,397,056
Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.
Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407120052.GG3738010@noisy.programming.kicks-ass.net
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 597ce5b..12890ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5352,6 +5352,7 @@ static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice, vruntime = avg_vruntime(cfs_rq);
+ bool update_zero = false;
s64 lag = 0;
if (!se->custom_slice)
@@ -5368,7 +5369,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*/
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
- long load;
+ long load, weight;
lag = se->vlag;
@@ -5428,14 +5429,41 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (curr && curr->on_rq)
load += avg_vruntime_weight(cfs_rq, curr->load.weight);
- lag *= load + avg_vruntime_weight(cfs_rq, se->load.weight);
+ weight = avg_vruntime_weight(cfs_rq, se->load.weight);
+ lag *= load + weight;
if (WARN_ON_ONCE(!load))
load = 1;
lag = div64_long(lag, load);
+
+ /*
+ * A heavy entity (relative to the tree) will pull the
+ * avg_vruntime close to its vruntime position on enqueue. But
+ * the zero_vruntime point is only updated at the next
+ * update_deadline()/place_entity()/update_entity_lag().
+ *
+ * Specifically (see the comment near avg_vruntime_weight()):
+ *
+ * sum_w_vruntime = \Sum (v_i - v0) * w_i
+ *
+ * Note that if v0 is near a light entity, both terms will be
+ * small for the light entity, while in that case both terms
+ * are large for the heavy entity, leading to risk of
+ * overflow.
+ *
+ * OTOH if v0 is near the heavy entity, then the difference is
+ * larger for the light entity, but the factor is small, while
+ * for the heavy entity the difference is small but the factor
+ * is large. Avoiding the multiplication overflow.
+ */
+ if (weight > load)
+ update_zero = true;
}
se->vruntime = vruntime - lag;
+ if (update_zero)
+ update_zero_vruntime(cfs_rq, -lag);
+
if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) {
se->deadline += se->vruntime;
se->rel_deadline = 0;
* Re: [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
2026-03-24 10:01 ` William Montaz
@ 2026-04-07 13:45 ` Peter Zijlstra
0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2026-04-07 13:45 UTC (permalink / raw)
To: William Montaz
Cc: vincent.guittot, bsegall, dietmar.eggemann, dsmythies, juri.lelli,
kprateek.nayak, linux-kernel, mgorman, mingo, quzicheng, rostedt,
shubhang, vschneid, wangtao554, Greg Kroah-Hartman
On Tue, Mar 24, 2026 at 10:01:26AM +0000, William Montaz wrote:
> Hi,
>
> > Zicheng Qu reported that, because avg_vruntime() always includes
> > cfs_rq->curr, when ->on_rq, place_entity() doesn't work right.
>
> > Specifically, the lag scaling in place_entity() relies on
> > avg_vruntime() being the state *before* placement of the new entity.
> > However in this case avg_vruntime() will actually already include the
> > entity, which breaks things.
>
> This has proven to be harmful on our production cluster using kernel version 6.18.19
> I tested the following versions:
> * LTS 5.10.252, 5.15.202, 6.1.166, 6.6.129, 6.12.77 --> no issue
> * LTS 6.18.19 has the issue
> * Stable 6.19.9 has the issue
> * Mainline 7.0-rc5 has the issue
> * Tip 7.0.0-rc5+ no issue
>
> Finally, I applied the patch to 6.18.19 LTS which solves the issue. However, we do not benefit from previous patches
> such as [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime.
>
Thus I would prefer to let you decide how you want to address the backport for 6.18
>
> If you want I can share my patch file, let me know.
I've (finally!) had a look at stable-6.18.y and yes, I think this can be
backported without too much issue.
Feel free to submit a backport to stable for this.
end of thread, other threads: [~2026-04-07 13:45 UTC | newest]
Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-19 7:58 [PATCH v2 0/7] sched: Various reweight_entity() fixes Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 1/7] sched/fair: Fix zero_vruntime tracking Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
2026-02-23 13:09 ` Dietmar Eggemann
2026-02-23 14:15 ` Peter Zijlstra
2026-02-24 8:53 ` Dietmar Eggemann
2026-02-24 9:02 ` Peter Zijlstra
2026-03-28 5:44 ` John Stultz
2026-03-28 17:04 ` Steven Rostedt
2026-03-30 17:58 ` John Stultz
2026-03-30 18:27 ` Steven Rostedt
2026-03-30 9:43 ` Peter Zijlstra
2026-03-30 17:49 ` John Stultz
2026-03-30 10:10 ` Peter Zijlstra
2026-03-30 14:37 ` K Prateek Nayak
2026-03-30 14:40 ` Peter Zijlstra
2026-03-30 15:50 ` K Prateek Nayak
2026-03-30 19:11 ` Peter Zijlstra
2026-03-31 0:38 ` K Prateek Nayak
2026-03-31 4:58 ` K Prateek Nayak
2026-03-31 7:08 ` Peter Zijlstra
2026-03-31 7:14 ` Peter Zijlstra
2026-03-31 8:49 ` K Prateek Nayak
2026-03-31 9:29 ` Peter Zijlstra
2026-03-31 12:20 ` Peter Zijlstra
2026-03-31 16:14 ` Peter Zijlstra
2026-03-31 17:02 ` K Prateek Nayak
2026-03-31 22:40 ` John Stultz
2026-03-30 19:40 ` John Stultz
2026-03-30 19:43 ` Peter Zijlstra
2026-03-30 21:45 ` John Stultz
2026-02-19 7:58 ` [PATCH v2 2/7] sched/fair: Only set slice protection at pick time Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 3/7] sched/eevdf: Update se->vprot in reweight_entity() Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 4/7] sched/fair: Fix lag clamp Peter Zijlstra
2026-02-23 10:23 ` Dietmar Eggemann
2026-02-23 10:57 ` Vincent Guittot
2026-02-19 7:58 ` [PATCH v2 5/7] sched/fair: Increase weight bits for avg_vruntime Peter Zijlstra
2026-02-23 10:56 ` Vincent Guittot
2026-02-23 11:51 ` Peter Zijlstra
2026-02-23 12:36 ` Peter Zijlstra
2026-02-23 13:06 ` Vincent Guittot
2026-03-30 7:55 ` K Prateek Nayak
2026-03-30 9:27 ` Peter Zijlstra
2026-04-02 5:28 ` K Prateek Nayak
2026-04-02 10:22 ` Peter Zijlstra
2026-04-02 10:56 ` K Prateek Nayak
2026-04-03 4:02 ` K Prateek Nayak
2026-04-07 12:00 ` Peter Zijlstra
2026-04-07 13:42 ` [tip: sched/core] sched/fair: Avoid overflow in enqueue_entity() tip-bot2 for K Prateek Nayak
2026-02-19 7:58 ` [PATCH v2 6/7] sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Peter Zijlstra
2026-02-23 10:57 ` Vincent Guittot
2026-03-24 10:01 ` William Montaz
2026-04-07 13:45 ` Peter Zijlstra
2026-02-19 7:58 ` [PATCH v2 7/7] sched/fair: Use full weight to __calc_delta() Peter Zijlstra
2026-02-23 10:57 ` Vincent Guittot