* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
       [not found] <20250714134429.19624-1-mgorman@techsingularity.net>
@ 2025-07-14 13:44 ` Mel Gorman
  2025-07-14 13:44 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
  1 sibling, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2025-07-14 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but vruntime
spread is mentioned so a safe assumption is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but the
assumption is that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation path to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable. It's known that
it has at least some users due to bug reports but it is less clear which
workloads benefit from the current implementation.
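
For readers unfamiliar with how the one-line change below takes effect,
here is a minimal userspace sketch of the X-macro pattern behind
SCHED_FEAT(). It is not the kernel's code: the MODEL_* names and the
two-entry feature list are assumptions made purely for illustration.

```c
#include <stdbool.h>

/* Stand-in for kernel/sched/features.h; only two entries are modelled. */
#define MODEL_FEATURES(F)	\
	F(PREEMPT_SHORT, true)	\
	F(NEXT_BUDDY, true)

/* First expansion: one enum constant (bit index) per feature. */
#define MODEL_ENUM(name, enabled) MODEL_FEAT_##name,
enum { MODEL_FEATURES(MODEL_ENUM) MODEL_FEAT_NR };

/* Second expansion: default bitmask built from the 'enabled' column. */
#define MODEL_MASK(name, enabled) \
	((unsigned long)(enabled) << MODEL_FEAT_##name) |
const unsigned long model_default_features = MODEL_FEATURES(MODEL_MASK) 0UL;

/* Hypothetical analogue of sched_feat(): test one bit of the mask. */
bool model_sched_feat(unsigned long features, int feat)
{
	return (features >> feat) & 1UL;
}
```

In this model, changing the second argument of an entry from false to true
only changes the default value of one bit, which matches the size of the
diff in this patch.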

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)

 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
-- 
2.43.0

* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
       [not found] <20250714134429.19624-1-mgorman@techsingularity.net>
  2025-07-14 13:44 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
@ 2025-07-14 13:44 ` Mel Gorman
  1 sibling, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2025-07-14 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Mel Gorman

Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a mild suggestion instead of a hard rule. If a waker does go to sleep
almost immediately then the delay in wakeup is negligible. In all other
cases, preemption is throttled based on the accumulated runtime of the
waker so there is a chance that some batched wakeups have been issued
before preemption.

For all other wakeups, preemption happens if the wakee has a sooner deadline
than the waker and is eligible to run.
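
The rules above can be sketched as a small userspace model. This is a
simplification and not the code in the diff below: the MODEL_* names, the
flag values and the sync_threshold_ns parameter are assumptions, and the
handling of PREEMPT_SHORT, of an existing buddy and of WF_RQ_SELECTED is
deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this model; the kernel's values differ. */
enum { MODEL_WF_SYNC = 1 << 0, MODEL_WF_FORK = 1 << 1 };

enum model_action {
	MODEL_NONE,		/* leave the waker running */
	MODEL_NEXT,		/* buddy set, let the normal pick decide */
	MODEL_RESCHED,		/* plain reschedule */
	MODEL_IMMEDIATE		/* drop slice protection and reschedule */
};

struct model_entity {
	int64_t deadline;	/* virtual deadline, smaller is sooner */
	bool eligible;		/* true when the entity is owed service */
};

enum model_action
model_preempt_buddy(const struct model_entity *waker,
		    const struct model_entity *wakee,
		    int wake_flags, int64_t waker_runtime_ns,
		    int64_t sync_threshold_ns)
{
	bool wakee_sooner = wakee->deadline < waker->deadline;

	/* Newly forked tasks are not treated as buddies. */
	if (wake_flags & MODEL_WF_FORK)
		return MODEL_NONE;

	/* A waker that is no longer eligible should simply reschedule. */
	if (!waker->eligible)
		return MODEL_RESCHED;

	/* Sync wakeup: a hint only, throttled by the waker's runtime. */
	if (wake_flags & MODEL_WF_SYNC) {
		if (wakee_sooner && waker_runtime_ns >= sync_threshold_ns)
			return MODEL_IMMEDIATE;
		return MODEL_NONE;
	}

	/* Every other wakeup: sooner deadline and eligible to run. */
	if (wakee_sooner && wakee->eligible)
		return MODEL_IMMEDIATE;

	return MODEL_NEXT;
}
```

A sync waker that has only just started running keeps the CPU in this
model, which gives it a chance to issue further wakeups before it is
preempted.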

While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data. Throughput is a terrible metric for
dbench but it is presented because it's the default one reported. The test
machine is a 2-socket CascadeLake machine and the backing filesystem is XFS
as a lot of the IO work is dispatched to kernel threads. It's important to
note that these results are not representative across all machines,
especially Zen machines, as different bottlenecks are exposed on different
machines and filesystems.

dbench4 Throughput (misleading but traditional)
                          6.16.0-rc5             6.16.0-rc5
                             vanilla sched-preemptnext-v1r8
Hmean     1       1286.83 (   0.00%)     1281.73 (  -0.40%)
Hmean     4       4017.50 (   0.00%)     3934.85 *  -2.06%*
Hmean     7       5536.45 (   0.00%)     5453.55 *  -1.50%*
Hmean     12      7251.59 (   0.00%)     7217.25 (  -0.47%)
Hmean     21      8957.92 (   0.00%)     9188.07 (   2.57%)
Hmean     30      9403.41 (   0.00%)    10523.72 *  11.91%*
Hmean     48      9320.12 (   0.00%)    11496.27 *  23.35%*
Hmean     79      8962.30 (   0.00%)    11555.71 *  28.94%*
Hmean     110     8066.52 (   0.00%)    11307.26 *  40.18%*
Hmean     141     7605.20 (   0.00%)    10622.52 *  39.67%*
Hmean     160     7422.56 (   0.00%)    10250.78 *  38.10%*
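
The percentage columns in the tables of this changelog appear consistent
with a relative change against the vanilla kernel, signed so that a
positive number is always an improvement. The sketch below reproduces two
entries under that assumption. The function names are hypothetical and
this is not the reporting tool's code.

```c
/* Throughput-style metrics: a larger value is better. */
double model_gain_higher_is_better(double base, double test)
{
	return (test - base) / base * 100.0;
}

/* Time-style metrics: a smaller value is better. */
double model_gain_lower_is_better(double base, double test)
{
	return (base - test) / base * 100.0;
}
```

For example, the 48-client throughput row (9320.12 against 11496.27) works
out at roughly 23.35% and the 30-client loadfile execution time row (60.73
against 54.21) at roughly 10.74%.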

As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                           6.16.0-rc5             6.16.0-rc5
                              vanilla sched-preemptnext-v1r8
Amean      1         14.35 (   0.00%)       14.27 (   0.57%)
Amean      4         18.58 (   0.00%)       19.01 (  -2.35%)
Amean      7         23.83 (   0.00%)       24.18 (  -1.48%)
Amean      12        31.59 (   0.00%)       31.77 (  -0.55%)
Amean      21        44.65 (   0.00%)       43.44 (   2.71%)
Amean      30        60.73 (   0.00%)       54.21 (  10.74%)
Amean      48        98.25 (   0.00%)       79.41 (  19.17%)
Amean      79       168.34 (   0.00%)      130.06 (  22.74%)
Amean      110      261.03 (   0.00%)      185.04 (  29.11%)
Amean      141      353.98 (   0.00%)      251.55 (  28.94%)
Amean      160      410.66 (   0.00%)      296.87 (  27.71%)
Stddev     1          0.51 (   0.00%)        0.48 (   6.67%)
Stddev     4          1.14 (   0.00%)        1.21 (  -6.78%)
Stddev     7          1.63 (   0.00%)        1.58 (   3.12%)
Stddev     12         2.62 (   0.00%)        2.38 (   9.05%)
Stddev     21         5.21 (   0.00%)        3.87 (  25.70%)
Stddev     30        10.03 (   0.00%)        6.65 (  33.65%)
Stddev     48        22.31 (   0.00%)       12.26 (  45.05%)
Stddev     79        41.14 (   0.00%)       29.11 (  29.25%)
Stddev     110       70.55 (   0.00%)       47.71 (  32.38%)
Stddev     141       98.12 (   0.00%)       66.83 (  31.89%)
Stddev     160      139.37 (   0.00%)       67.73 (  51.40%)

That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                           6.16.0-rc5             6.16.0-rc5
                              vanilla sched-preemptnext-v1r8
Amean      1         14.93 (   0.00%)       14.91 (   0.11%)
Amean      4        348.88 (   0.00%)      277.06 (  20.59%)
Amean      7        722.94 (   0.00%)      991.70 ( -37.18%)
Amean      12      2055.72 (   0.00%)     2684.48 ( -30.59%)
Amean      21      4393.85 (   0.00%)     2625.79 (  40.24%)
Amean      30      6119.84 (   0.00%)     2491.15 (  59.29%)
Amean      48     20600.85 (   0.00%)     6717.61 (  67.39%)
Amean      79     22677.38 (   0.00%)    21866.80 (   3.57%)
Amean      110    35937.71 (   0.00%)    22517.63 (  37.34%)
Amean      141    25104.66 (   0.00%)    29897.08 ( -19.09%)
Amean      160    23843.74 (   0.00%)    23106.66 (   3.09%)
Stddev     1          0.50 (   0.00%)        0.46 (   6.67%)
Stddev     4        201.33 (   0.00%)      130.13 (  35.36%)
Stddev     7        471.94 (   0.00%)      641.69 ( -35.97%)
Stddev     12      1401.94 (   0.00%)     1750.14 ( -24.84%)
Stddev     21      2519.12 (   0.00%)     1416.77 (  43.76%)
Stddev     30      3469.05 (   0.00%)     1293.37 (  62.72%)
Stddev     48     11521.49 (   0.00%)     3846.34 (  66.62%)
Stddev     79     12849.21 (   0.00%)    12275.89 (   4.46%)
Stddev     110    20362.88 (   0.00%)    12989.46 (  36.21%)
Stddev     141    13768.42 (   0.00%)    17108.34 ( -24.26%)
Stddev     160    13196.34 (   0.00%)    13029.75 (   1.26%)

This is more of a mixed bag but it at least shows that fairness
is not crippled.

The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.

hackbench-process-pipes
                          6.16.0-rc5             6.16.0-rc5
                             vanilla sched-preemptnext-v1r8
Amean     1        0.2183 (   0.00%)      0.2223 (  -1.83%)
Amean     4        0.5780 (   0.00%)      0.5413 (   6.34%)
Amean     7        0.7727 (   0.00%)      0.7093 (   8.20%)
Amean     12       1.1220 (   0.00%)      1.1170 (   0.45%)
Amean     21       1.7470 (   0.00%)      1.7713 (  -1.39%)
Amean     30       2.2940 (   0.00%)      2.6957 * -17.51%*
Amean     48       3.7337 (   0.00%)      4.1003 *  -9.82%*
Amean     79       4.9310 (   0.00%)      5.1417 *  -4.27%*
Amean     110      6.1800 (   0.00%)      6.5370 *  -5.78%*
Amean     141      7.5737 (   0.00%)      8.0060 *  -5.71%*
Amean     172      9.0820 (   0.00%)      9.4767 *  -4.35%*
Amean     203     10.6053 (   0.00%)     10.8870 (  -2.66%)
Amean     234     12.3380 (   0.00%)     13.1290 *  -6.41%*
Amean     265     14.5900 (   0.00%)     15.3547 *  -5.24%*
Amean     296     16.1937 (   0.00%)     17.1533 *  -5.93%*

Processes using pipes are impacted and it's outside the noise as the
coefficient of variation is roughly 3%. These results are not always
reproducible. If executed across multiple reboots, it may show neutral or
small gains so the worst measured results are presented.

Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.

hackbench-process-sockets
                          6.16.0-rc5             6.16.0-rc5
                             vanilla sched-preemptnext-v1r8
Amean     1        0.3217 (   0.00%)      0.3053 (   5.08%)
Amean     4        0.8967 (   0.00%)      0.9007 (  -0.45%)
Amean     7        1.4780 (   0.00%)      1.5067 (  -1.94%)
Amean     12       2.1977 (   0.00%)      2.2693 (  -3.26%)
Amean     21       3.4983 (   0.00%)      3.6667 *  -4.81%*
Amean     30       4.9270 (   0.00%)      5.1207 *  -3.93%*
Amean     48       7.6250 (   0.00%)      7.9667 *  -4.48%*
Amean     79      15.7477 (   0.00%)     15.4177 (   2.10%)
Amean     110     21.8070 (   0.00%)     21.9563 (  -0.68%)
Amean     141     29.4813 (   0.00%)     29.2327 (   0.84%)
Amean     172     36.7433 (   0.00%)     35.9043 (   2.28%)
Amean     203     40.8823 (   0.00%)     40.3467 (   1.31%)
Amean     234     43.1627 (   0.00%)     43.0343 (   0.30%)
Amean     265     49.6417 (   0.00%)     49.9030 (  -0.53%)
Amean     296     51.3137 (   0.00%)     51.9310 (  -1.20%)

At the time of writing, other tests are still running but most are either
neutral or show relatively small gains. In general, the other workloads are
less wakeup-intensive than dbench or hackbench.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..62fa036b0c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -936,6 +936,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1205,6 +1215,83 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
 	return false;
 }
 
+enum preempt_buddy_action {
+	PREEMPT_BUDDY_NONE,		/* No action on the buddy */
+	PREEMPT_BUDDY_NEXT,		/* Check next is most eligible
+					 * before rescheduling.
+					 */
+	PREEMPT_BUDDY_RESCHED,		/* Plain reschedule */
+	PREEMPT_BUDDY_IMMEDIATE		/* Remove slice protection
+					 * and reschedule
+					 */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_buddy_action
+do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+		 struct sched_entity *pse, struct sched_entity *se,
+		 s64 delta, bool did_short)
+{
+	bool pse_before, pse_eligible;
+
+	if (!sched_feat(NEXT_BUDDY) ||
+	    (wake_flags & WF_FORK) ||
+	    (pse->sched_delayed)) {
+		BUILD_BUG_ON(PREEMPT_BUDDY_NONE + 1 != PREEMPT_BUDDY_NEXT);
+		return PREEMPT_BUDDY_NONE + did_short;
+	}
+
+	/* Reschedule if waker is no longer eligible */
+	if (!entity_eligible(cfs_rq, se))
+		return PREEMPT_BUDDY_RESCHED;
+
+	/* Keep existing buddy if the deadline is sooner than pse */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return PREEMPT_BUDDY_NONE;
+
+	set_next_buddy(pse);
+	pse_before = entity_before(pse, se);
+	pse_eligible = entity_eligible(cfs_rq, pse);
+
+	/*
+	 * WF_SYNC implies that waker will sleep soon but it is not enforced
+	 * because the hint is often abused or misunderstood.
+	 */
+	if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+		/*
+		 * WF_RQ_SELECTED implies the tasks are stacking on a
+		 * CPU. Only consider reschedule if pse deadline expires
+		 * before se.
+		 */
+		if ((wake_flags & WF_RQ_SELECTED) &&
+		    delta < sysctl_sched_migration_cost) {
+
+			if (!pse_before)
+				return PREEMPT_BUDDY_NONE;
+
+			/* Fall through to pse deadline.  */
+		}
+
+		/*
+		 * Reschedule if pse's deadline is sooner and there is a chance
+		 * that some wakeup batching has completed.
+		 */
+		if (pse_before &&
+		    delta >= (sysctl_sched_migration_cost >> 6)) {
+			return PREEMPT_BUDDY_IMMEDIATE;
+		}
+
+		return PREEMPT_BUDDY_NONE;
+	}
+
+	/* Check eligibility of buddy to start now. */
+	if (pse_before && pse_eligible)
+		return PREEMPT_BUDDY_IMMEDIATE;
+
+	return PREEMPT_BUDDY_NEXT;
+}
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5589,16 +5676,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7056,8 +7133,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8767,6 +8842,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
+	bool did_short;
+	s64 delta;
 
 	if (unlikely(se == pse))
 		return;
@@ -8780,10 +8857,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8829,6 +8902,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		return;
 
 	cfs_rq = cfs_rq_of(se);
+	delta = rq_clock_task(rq) - se->exec_start;
 	update_curr(cfs_rq);
 	/*
 	 * If @p has a shorter slice than current and @p is eligible, override
@@ -8837,9 +8911,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * Note that even if @p does not turn out to be the most eligible
 	 * task at this moment, current's slice protection will be lost.
 	 */
-	if (do_preempt_short(cfs_rq, pse, se))
+	did_short = do_preempt_short(cfs_rq, pse, se);
+	if (did_short)
 		cancel_protect_slice(se);
 
+	switch (do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se, delta, did_short)) {
+	case PREEMPT_BUDDY_NONE:
+		return;
+		break;
+	case PREEMPT_BUDDY_IMMEDIATE:
+		cancel_protect_slice(se);
+		fallthrough;
+	case PREEMPT_BUDDY_RESCHED:
+		goto preempt;
+		break;
+	case PREEMPT_BUDDY_NEXT:
+		break;
+	}
+
 	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-- 
2.43.0


* [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2
@ 2025-10-21 14:28 Mel Gorman
  2025-10-21 14:28 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason, Mel Gorman

I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as

1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads

Taking topology into account means that there is a lot of
machine-specific behaviour which may explain why some recent discussions
have reproduction problems. Nevertheless, the removal of LAST_BUDDY and
the disabling of NEXT_BUDDY have an impact on wakeup latencies.

This RFC is to determine if it is a valid approach to prefer selecting
a wakee if it's eligible to run even though other unrelated tasks are
more eligible.

 kernel/sched/fair.c     | 131 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   2 +-
 2 files changed, 112 insertions(+), 21 deletions(-)

-- 
2.51.0

* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
  2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
@ 2025-10-21 14:28 ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but, as vruntime
spread is mentioned, the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation path to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)

 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
-- 
2.51.0

[parent not found: <20251027133915.4103633-1-mgorman@techsingularity.net>]

* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
       [not found] <20251027133915.4103633-1-mgorman@techsingularity.net>
@ 2025-10-27 13:39 ` Mel Gorman
  2025-10-28 14:37   ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but, as vruntime
spread is mentioned, the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation path to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)

 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
-- 
2.51.0

* Re: [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
  2025-10-27 13:39 ` Mel Gorman
@ 2025-10-28 14:37   ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2025-10-28 14:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason

On Mon, Oct 27, 2025 at 01:39:14PM +0000, Mel Gorman wrote:

> NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
> latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
> ("sched/fair: Commit to EEVDF"). The reasoning is not clear but, as vruntime
> spread is mentioned, the expectation is that NEXT_BUDDY had an impact
> on overall fairness. It was not noted why LAST_BUDDY was removed but it
> is assumed that it's very difficult to reason about what LAST_BUDDY's
> correct and effective behaviour should be while still respecting EEVDF's
> goals.

I think I was just struggling to make sense of things and figured
less is more and axed it.

I have vague memories trying to work through the dynamics of a
wakeup-stack and the EEVDF latency requirements and getting a head-ache.

* [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-03 11:04 Mel Gorman
  2025-11-03 11:04 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
	Chris Mason, linux-kernel, Mel Gorman

Changes since v3
o Place new code near first consumer				(peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY			(peterz)
o Naming and code flow clarity					(peterz)
o Restore slice protection					(peterz)

Changes since v2
o Review feedback applied from Prateek

I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as

1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads

Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some recent discussions have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.

This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
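
As a rough illustration of that trade-off, the userspace sketch below picks
the eligible entity with the earliest deadline unless an eligible buddy is
set, in which case the buddy wins. It is a model under stated assumptions
and not the kernel's pick_eevdf(): all entities have equal weight, so
"eligible" reduces to having a vruntime no greater than the queue average,
and the model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_se {
	int64_t vruntime;	/* virtual runtime received so far */
	int64_t deadline;	/* virtual deadline, smaller is sooner */
};

/* Equal weights assumed: eligible when vruntime <= average vruntime. */
bool model_eligible(const struct model_se *q, size_t n,
		    const struct model_se *se)
{
	int64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += q[i].vruntime;

	/* Compare against the sum to avoid a rounding division. */
	return se->vruntime * (int64_t)n <= sum;
}

/* Prefer an eligible buddy, else the earliest eligible deadline. */
const struct model_se *model_pick(const struct model_se *q, size_t n,
				  const struct model_se *next)
{
	const struct model_se *best = NULL;

	if (next && model_eligible(q, n, next))
		return next;

	for (size_t i = 0; i < n; i++) {
		if (!model_eligible(q, n, &q[i]))
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}
```

In this model an ineligible buddy is ignored, so the buddy shortcut can
change which eligible task runs first but never runs a task that is not
owed service.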

 kernel/sched/fair.c     | 145 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   2 +-
 2 files changed, 124 insertions(+), 23 deletions(-)

-- 
2.51.0

* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
  2025-11-03 11:04 [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
@ 2025-11-03 11:04 ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
	Chris Mason, linux-kernel, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning was not documented but
vruntime spread is mentioned and NEXT_BUDDY cannot, by definition, strictly
obey EEVDF principles. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals. NEXT_BUDDY will still pick an earlier deadline but LAST_BUDDY can
pick ineligible tasks. Peter Zijlstra made this comment about NEXT_BUDDY
being disabled during review:

	I think I was just struggling to make sense of things and figured
	less is more and axed it.

	I have vague memories trying to work through the dynamics of
	a wakeup-stack and the EEVDF latency requirements and getting
	a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation path to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)

 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
-- 
2.51.0

* [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-12 12:25 Mel Gorman
  2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
	Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman

Changes since v4
o Splitout decisions into separate functions			(peterz)
o Flow clarity							(peterz)

Changes since v3
o Place new code near first consumer				(peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY			(peterz)
o Naming and code flow clarity					(peterz)
o Restore slice protection					(peterz)

Changes since v2
o Review feedback applied from Prateek

I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as

1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads

Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some recent discussions have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.

This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.

Mel Gorman (2):
  sched/fair: Enable scheduler feature NEXT_BUDDY
  sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

 kernel/sched/fair.c     | 152 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   2 +-
 2 files changed, 131 insertions(+), 23 deletions(-)

-- 
2.51.0

* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
  2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
@ 2025-11-12 12:25 ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
	Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but, as vruntime
spread is mentioned, the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals. Peter Zijlstra noted during review:

	I think I was just struggling to make sense of things and figured
	less is more and axed it.

	I have vague memories trying to work through the dynamics of
	a wakeup-stack and the EEVDF latency requirements and getting
	a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation path to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)

 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
-- 
2.51.0
