* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
[not found] <20250714134429.19624-1-mgorman@techsingularity.net>
@ 2025-07-14 13:44 ` Mel Gorman
0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2025-07-14 13:44 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a mild suggestion instead of a hard rule. If a waker does go to sleep
almost immediately then the delay in wakeup is negligible. In all other
cases, it's throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before
preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; even though it is a terrible metric
for dbench, it is the default one reported. The test machine is a 2-socket
CascadeLake machine and the backing filesystem is XFS as a lot of the IO
work is dispatched to kernel threads. It's important to note that these
results are not representative across all machines, especially Zen machines,
as different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Hmean 1 1286.83 ( 0.00%) 1281.73 ( -0.40%)
Hmean 4 4017.50 ( 0.00%) 3934.85 * -2.06%*
Hmean 7 5536.45 ( 0.00%) 5453.55 * -1.50%*
Hmean 12 7251.59 ( 0.00%) 7217.25 ( -0.47%)
Hmean 21 8957.92 ( 0.00%) 9188.07 ( 2.57%)
Hmean 30 9403.41 ( 0.00%) 10523.72 * 11.91%*
Hmean 48 9320.12 ( 0.00%) 11496.27 * 23.35%*
Hmean 79 8962.30 ( 0.00%) 11555.71 * 28.94%*
Hmean 110 8066.52 ( 0.00%) 11307.26 * 40.18%*
Hmean 141 7605.20 ( 0.00%) 10622.52 * 39.67%*
Hmean 160 7422.56 ( 0.00%) 10250.78 * 38.10%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 14.35 ( 0.00%) 14.27 ( 0.57%)
Amean 4 18.58 ( 0.00%) 19.01 ( -2.35%)
Amean 7 23.83 ( 0.00%) 24.18 ( -1.48%)
Amean 12 31.59 ( 0.00%) 31.77 ( -0.55%)
Amean 21 44.65 ( 0.00%) 43.44 ( 2.71%)
Amean 30 60.73 ( 0.00%) 54.21 ( 10.74%)
Amean 48 98.25 ( 0.00%) 79.41 ( 19.17%)
Amean 79 168.34 ( 0.00%) 130.06 ( 22.74%)
Amean 110 261.03 ( 0.00%) 185.04 ( 29.11%)
Amean 141 353.98 ( 0.00%) 251.55 ( 28.94%)
Amean 160 410.66 ( 0.00%) 296.87 ( 27.71%)
Stddev 1 0.51 ( 0.00%) 0.48 ( 6.67%)
Stddev 4 1.14 ( 0.00%) 1.21 ( -6.78%)
Stddev 7 1.63 ( 0.00%) 1.58 ( 3.12%)
Stddev 12 2.62 ( 0.00%) 2.38 ( 9.05%)
Stddev 21 5.21 ( 0.00%) 3.87 ( 25.70%)
Stddev 30 10.03 ( 0.00%) 6.65 ( 33.65%)
Stddev 48 22.31 ( 0.00%) 12.26 ( 45.05%)
Stddev 79 41.14 ( 0.00%) 29.11 ( 29.25%)
Stddev 110 70.55 ( 0.00%) 47.71 ( 32.38%)
Stddev 141 98.12 ( 0.00%) 66.83 ( 31.89%)
Stddev 160 139.37 ( 0.00%) 67.73 ( 51.40%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients, so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%)
Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%)
Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%)
Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%)
Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%)
Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%)
Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%)
Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%)
Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%)
Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%)
Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%)
Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%)
Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%)
Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%)
Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%)
Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%)
Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%)
Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%)
Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%)
Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%)
Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%)
Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.
hackbench-process-pipes
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 0.2183 ( 0.00%) 0.2223 ( -1.83%)
Amean 4 0.5780 ( 0.00%) 0.5413 ( 6.34%)
Amean 7 0.7727 ( 0.00%) 0.7093 ( 8.20%)
Amean 12 1.1220 ( 0.00%) 1.1170 ( 0.45%)
Amean 21 1.7470 ( 0.00%) 1.7713 ( -1.39%)
Amean 30 2.2940 ( 0.00%) 2.6957 * -17.51%*
Amean 48 3.7337 ( 0.00%) 4.1003 * -9.82%*
Amean 79 4.9310 ( 0.00%) 5.1417 * -4.27%*
Amean 110 6.1800 ( 0.00%) 6.5370 * -5.78%*
Amean 141 7.5737 ( 0.00%) 8.0060 * -5.71%*
Amean 172 9.0820 ( 0.00%) 9.4767 * -4.35%*
Amean 203 10.6053 ( 0.00%) 10.8870 ( -2.66%)
Amean 234 12.3380 ( 0.00%) 13.1290 * -6.41%*
Amean 265 14.5900 ( 0.00%) 15.3547 * -5.24%*
Amean 296 16.1937 ( 0.00%) 17.1533 * -5.93%*
Processes using pipes are impacted and it's outside the noise as the
coefficient of variation is roughly 3%. These results are not always
reproducible. If executed across multiple reboots, the test may show
neutral results or small gains, so the worst measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.16.0-rc5 6.16.0-rc5
vanilla sched-preemptnext-v1r8
Amean 1 0.3217 ( 0.00%) 0.3053 ( 5.08%)
Amean 4 0.8967 ( 0.00%) 0.9007 ( -0.45%)
Amean 7 1.4780 ( 0.00%) 1.5067 ( -1.94%)
Amean 12 2.1977 ( 0.00%) 2.2693 ( -3.26%)
Amean 21 3.4983 ( 0.00%) 3.6667 * -4.81%*
Amean 30 4.9270 ( 0.00%) 5.1207 * -3.93%*
Amean 48 7.6250 ( 0.00%) 7.9667 * -4.48%*
Amean 79 15.7477 ( 0.00%) 15.4177 ( 2.10%)
Amean 110 21.8070 ( 0.00%) 21.9563 ( -0.68%)
Amean 141 29.4813 ( 0.00%) 29.2327 ( 0.84%)
Amean 172 36.7433 ( 0.00%) 35.9043 ( 2.28%)
Amean 203 40.8823 ( 0.00%) 40.3467 ( 1.31%)
Amean 234 43.1627 ( 0.00%) 43.0343 ( 0.30%)
Amean 265 49.6417 ( 0.00%) 49.9030 ( -0.53%)
Amean 296 51.3137 ( 0.00%) 51.9310 ( -1.20%)
At the time of writing, other tests are still running but most are either
neutral or show relatively small gains. In general, the other workloads are
less wakeup-intensive than dbench or hackbench.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 106 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..62fa036b0c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -936,6 +936,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1205,6 +1215,83 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
return false;
}
+enum preempt_buddy_action {
+ PREEMPT_BUDDY_NONE, /* No action on the buddy */
+ PREEMPT_BUDDY_NEXT, /* Check next is most eligible
+ * before rescheduling.
+ */
+ PREEMPT_BUDDY_RESCHED, /* Plain reschedule */
+ PREEMPT_BUDDY_IMMEDIATE /* Remove slice protection
+ * and reschedule
+ */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_buddy_action
+do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se,
+ s64 delta, bool did_short)
+{
+ bool pse_before, pse_eligible;
+
+ if (!sched_feat(NEXT_BUDDY) ||
+ (wake_flags & WF_FORK) ||
+ (pse->sched_delayed)) {
+ BUILD_BUG_ON(PREEMPT_BUDDY_NONE + 1 != PREEMPT_BUDDY_NEXT);
+ return PREEMPT_BUDDY_NONE + did_short;
+ }
+
+ /* Reschedule if waker is no longer eligible */
+ if (!entity_eligible(cfs_rq, se))
+ return PREEMPT_BUDDY_RESCHED;
+
+ /* Keep existing buddy if the deadline is sooner than pse */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return PREEMPT_BUDDY_NONE;
+
+ set_next_buddy(pse);
+ pse_before = entity_before(pse, se);
+ pse_eligible = entity_eligible(cfs_rq, pse);
+
+ /*
+ * WF_SYNC implies that waker will sleep soon but it is not enforced
+ * because the hint is often abused or misunderstood.
+ */
+ if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU. Only consider reschedule if pse deadline expires
+ * before se.
+ */
+ if ((wake_flags & WF_RQ_SELECTED) &&
+ delta < sysctl_sched_migration_cost) {
+
+ if (!pse_before)
+ return PREEMPT_BUDDY_NONE;
+
+ /* Fall through to pse deadline. */
+ }
+
+ /*
+ * Reschedule if pse's deadline is sooner and there is a chance
+ * that some wakeup batching has completed.
+ */
+ if (pse_before &&
+ delta >= (sysctl_sched_migration_cost >> 6)) {
+ return PREEMPT_BUDDY_IMMEDIATE;
+ }
+
+ return PREEMPT_BUDDY_NONE;
+ }
+
+ /* Check eligibility of buddy to start now. */
+ if (pse_before && pse_eligible)
+ return PREEMPT_BUDDY_IMMEDIATE;
+
+ return PREEMPT_BUDDY_NEXT;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -5589,16 +5676,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7056,8 +7133,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8767,6 +8842,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ bool did_short;
+ s64 delta;
if (unlikely(se == pse))
return;
@@ -8780,10 +8857,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8829,6 +8902,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;
cfs_rq = cfs_rq_of(se);
+ delta = rq_clock_task(rq) - se->exec_start;
update_curr(cfs_rq);
/*
* If @p has a shorter slice than current and @p is eligible, override
@@ -8837,9 +8911,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* Note that even if @p does not turn out to be the most eligible
* task at this moment, current's slice protection will be lost.
*/
- if (do_preempt_short(cfs_rq, pse, se))
+ did_short = do_preempt_short(cfs_rq, pse, se);
+ if (did_short)
cancel_protect_slice(se);
+ switch (do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se, delta, did_short)) {
+ case PREEMPT_BUDDY_NONE:
+ return;
+ case PREEMPT_BUDDY_IMMEDIATE:
+ cancel_protect_slice(se);
+ fallthrough;
+ case PREEMPT_BUDDY_RESCHED:
+ goto preempt;
+ case PREEMPT_BUDDY_NEXT:
+ break;
+ }
+
/*
* If @p has become the most eligible task, force preemption.
*/
--
2.43.0
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
@ 2025-10-21 14:28 ` Mel Gorman
2025-10-23 6:29 ` K Prateek Nayak
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
it's throttled based on the accumulated runtime of the waker so there is
a chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; even though it is a poor metric,
it is the default one reported. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Hmean 1 1268.80 ( 0.00%) 1285.93 * 1.35%*
Hmean 4 3971.74 ( 0.00%) 3851.10 * -3.04%*
Hmean 7 5548.23 ( 0.00%) 5507.07 ( -0.74%)
Hmean 12 7310.86 ( 0.00%) 7205.37 ( -1.44%)
Hmean 21 8874.53 ( 0.00%) 9193.66 * 3.60%*
Hmean 30 9361.93 ( 0.00%) 10552.03 * 12.71%*
Hmean 48 9540.14 ( 0.00%) 11936.22 * 25.12%*
Hmean 79 9208.74 ( 0.00%) 12367.06 * 34.30%*
Hmean 110 8573.12 ( 0.00%) 12114.01 * 41.30%*
Hmean 141 7791.33 ( 0.00%) 11391.60 * 46.21%*
Hmean 160 7666.60 ( 0.00%) 10789.23 * 40.73%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 14.62 ( 0.00%) 14.21 ( 2.80%)
Amean 4 18.76 ( 0.00%) 19.45 ( -3.67%)
Amean 7 23.71 ( 0.00%) 23.82 ( -0.48%)
Amean 12 31.25 ( 0.00%) 31.78 ( -1.69%)
Amean 21 45.12 ( 0.00%) 43.53 ( 3.52%)
Amean 30 61.07 ( 0.00%) 54.13 ( 11.37%)
Amean 48 95.91 ( 0.00%) 76.51 ( 20.23%)
Amean 79 163.38 ( 0.00%) 121.46 ( 25.66%)
Amean 110 243.91 ( 0.00%) 172.56 ( 29.25%)
Amean 141 343.47 ( 0.00%) 236.07 ( 31.27%)
Amean 160 401.15 ( 0.00%) 283.64 ( 29.29%)
Stddev 1 0.52 ( 0.00%) 0.44 ( 15.50%)
Stddev 4 1.36 ( 0.00%) 1.42 ( -4.91%)
Stddev 7 1.88 ( 0.00%) 1.84 ( 2.27%)
Stddev 12 3.06 ( 0.00%) 2.55 ( 16.57%)
Stddev 21 5.78 ( 0.00%) 3.70 ( 35.90%)
Stddev 30 9.85 ( 0.00%) 5.10 ( 48.26%)
Stddev 48 22.31 ( 0.00%) 8.30 ( 62.79%)
Stddev 79 35.96 ( 0.00%) 16.79 ( 53.31%)
Stddev 110 59.04 ( 0.00%) 30.08 ( 49.04%)
Stddev 141 85.38 ( 0.00%) 47.05 ( 44.89%)
Stddev 160 96.38 ( 0.00%) 47.27 ( 50.95%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients, so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%)
Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%)
Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%)
Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%)
Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%)
Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%)
Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%)
Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%)
Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%)
Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%)
Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%)
Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%)
Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%)
Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%)
Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%)
Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%)
Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%)
Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%)
Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%)
Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%)
Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%)
Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%)
Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%)
Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%)
Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%)
Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%)
Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%*
Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%*
Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%*
Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%*
Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%*
Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%*
Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%*
Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%)
Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%*
Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%*
Processes using pipes are impacted but the variance (not presented)
is bad enough that the results are close to noise. These results are not
always reproducible. If executed across multiple reboots, the test may show
neutral results or small gains, so the worst measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%)
Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%)
Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%)
Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%)
Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%)
Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%)
Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%)
Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%)
Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%)
Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%)
Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%)
Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%)
Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%)
Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%)
Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%)
As schbench has been mentioned in numerous bug reports recently, the
results are interesting. A test case representing the default schbench
behaviour follows.
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v2r1
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%*
Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%*
Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%)
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%*
Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v2r1
Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%*
Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%)
Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 131 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 111 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..26413062009b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1193,6 +1203,82 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
return delta_exec;
}
+enum preempt_wakeup_action {
+ PREEMPT_WAKEUP_NONE, /* No action on the buddy */
+ PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
+ * before rescheduling.
+ */
+ PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_wakeup_action
+__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se,
+ s64 delta)
+{
+ bool pse_before;
+
+ /*
+ * Ignore wakee preemption on WF_FORK as it is less likely that
+ * there is shared data, as exec often follows fork. Do not
+ * preempt for tasks that are sched_delayed as it would violate
+ * EEVDF to forcibly queue an ineligible task.
+ */
+ if (!sched_feat(NEXT_BUDDY) ||
+ (wake_flags & WF_FORK) ||
+ (pse->sched_delayed)) {
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ /* Reschedule if waker is no longer eligible. */
+ if (!entity_eligible(cfs_rq, se)) {
+ resched_curr_lazy(rq);
+ return PREEMPT_WAKEUP_RESCHED;
+ }
+
+ /*
+ * Keep existing buddy if the deadline is sooner than pse.
+ * The downside is that the older buddy may be cache cold
+ * but that is unpredictable whereas an earlier deadline
+ * is absolute.
+ */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return PREEMPT_WAKEUP_NONE;
+
+ set_next_buddy(pse);
+
+ /*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+ pse_before = entity_before(pse, se);
+ if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU when they could run on other CPUs. Only consider
+ * reschedule if pse deadline expires before se.
+ */
+ if ((wake_flags & WF_RQ_SELECTED) &&
+ delta < sysctl_sched_migration_cost && pse_before) {
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for
+ * batch wakeups to be issued.
+ */
+ if (pse_before && delta >= sysctl_sched_migration_cost)
+ return PREEMPT_WAKEUP_RESCHED;
+
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ return PREEMPT_WAKEUP_NEXT;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- bool do_preempt_short = false;
+ enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
+ s64 delta;
if (unlikely(se == pse))
return;
@@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (task_is_throttled(p))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = true;
+ do_preempt_short = PREEMPT_WAKEUP_NEXT;
goto preempt;
}
@@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;
cfs_rq = cfs_rq_of(se);
+ delta = rq_clock_task(rq) - se->exec_start;
update_curr(cfs_rq);
/*
* If @p has a shorter slice than current and @p is eligible, override
* current's slice protection in order to allow preemption.
*/
- do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+ do_preempt_short = PREEMPT_WAKEUP_NEXT;
+ } else {
+ /*
+ * If @p potentially is completing work required by current then
+ * consider preemption.
+ */
+ do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
+ pse, se, delta);
+ }
+
+ switch (do_preempt_short) {
+ case PREEMPT_WAKEUP_NONE:
+ goto update_slice;
+ case PREEMPT_WAKEUP_RESCHED:
+ goto preempt;
+ case PREEMPT_WAKEUP_NEXT:
+ break;
+ }
/*
* If @p has become the most eligible task, force preemption.
@@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+update_slice:
+ if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
update_protect_slice(cfs_rq, se);
return;
--
2.51.0
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
@ 2025-10-23 6:29 ` K Prateek Nayak
0 siblings, 0 replies; 44+ messages in thread
From: K Prateek Nayak @ 2025-10-23 6:29 UTC (permalink / raw)
To: Mel Gorman, linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
Hello Mel,
On 10/21/2025 7:58 PM, Mel Gorman wrote:
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se,
> + s64 delta)
"delta" is only used within __do_preempt_buddy(). Might as well compute
it here.
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_WORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if (!sched_feat(NEXT_BUDDY) ||
> + (wake_flags & WF_FORK) ||
> + (pse->sched_delayed)) {
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (!entity_eligible(cfs_rq, se)) {
> + resched_curr_lazy(rq);
This is unnecessary since PREEMPT_WAKEUP_RESCHED case already does a
"goto preempt" which does a resched_curr_lazy().
> + return PREEMPT_WAKEUP_RESCHED;
Should we update the next buddy before returning here if
entity_before(pse, cfs_rq->next)?
> + }
> +
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The downside is that the older buddy may be cache cold
> + * but that is unpredictable where as an earlier deadline
> + * is absolute.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NONE;
> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
Do we have !TTWU cases that set WF_SYNC?
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Only consider
> + * reschedule if pse deadline expires before se.
> + */
Code doesn't seem to match that comment. We are foregoing preemption
if the current task has run for less than "sysctl_sched_migration_cost"
even on pse_before.
Also, looking at the comment in check_preempt_wakeup_fair():
If @p potentially is completing work required by current then
consider preemption.
Shouldn't this check if "se" is indeed the waker task? Despite WF_SYNC,
wake_affine() can still return the CPU where @p was previously running
which now might be running another unrelated task.
> + if ((wake_flags & WF_RQ_SELECTED) &&
> + delta < sysctl_sched_migration_cost && pse_before) {
> + return PREEMPT_WAKEUP_NONE;
> + }
The above check seems unnecessary since we return "PREEMPT_WAKEUP_NONE"
if we don't enter the below condition. The two cannot have any overlap
based on my reading.
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= sysctl_sched_migration_cost)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}
> +
> /*
> * Used by other classes to account runtime.
> */
> @@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> {
> struct sched_entity *se;
>
> - /*
> - * Picking the ->next buddy will affect latency but not fairness.
> - */
> - if (sched_feat(PICK_BUDDY) &&
> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> - /* ->next will never be delayed */
> - WARN_ON_ONCE(cfs_rq->next->sched_delayed);
> - return cfs_rq->next;
> - }
> -
> se = pick_eevdf(cfs_rq);
> if (se->sched_delayed) {
> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> @@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> -static void set_next_buddy(struct sched_entity *se);
> -
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> struct sched_entity *se = &donor->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
> - bool do_preempt_short = false;
> + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
> + s64 delta;
>
> if (unlikely(se == pse))
> return;
> @@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (task_is_throttled(p))
> return;
>
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> - set_next_buddy(pse);
> - }
> -
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task
> * wake up path.
> @@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> goto preempt;
> }
>
> @@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> return;
>
> cfs_rq = cfs_rq_of(se);
> + delta = rq_clock_task(rq) - se->exec_start;
> update_curr(cfs_rq);
> /*
> * If @p has a shorter slice than current and @p is eligible, override
> * current's slice protection in order to allow preemption.
> */
> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> + } else {
> + /*
> + * If @p potentially is completing work required by current then
> + * consider preemption.
> + */
> + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> + pse, se, delta);
> + }
> +
> + switch (do_preempt_short) {
> + case PREEMPT_WAKEUP_NONE:
> + goto update_slice;
You can just return from here since update_protect_slice() is only
done on "do_preempt_short != PREEMPT_WAKEUP_NONE". With that,
the "update_slice" label becomes unnecessary ...
> + case PREEMPT_WAKEUP_RESCHED:
> + goto preempt;
> + case PREEMPT_WAKEUP_NEXT:
> + break;
> + }
>
> /*
> * If @p has become the most eligible task, force preemption.
> @@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> goto preempt;
>
> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> +update_slice:
> + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
... and since do_preempt_short can only be "PREEMPT_WAKEUP_NEXT"
which is non-zero, changes in this hunk can be dropped.
> update_protect_slice(cfs_rq, se);
>
> return;
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
[not found] <20251027133915.4103633-1-mgorman@techsingularity.net>
@ 2025-10-27 13:39 ` Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 44+ messages in thread
From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw)
To: linux-kernel
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
preemption is throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; it is a poor metric but it is the
one reported by default. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext
Hmean 1 1268.80 ( 0.00%) 1285.93 * 1.35%*
Hmean 4 3971.74 ( 0.00%) 3851.10 * -3.04%*
Hmean 7 5548.23 ( 0.00%) 5507.07 ( -0.74%)
Hmean 12 7310.86 ( 0.00%) 7205.37 ( -1.44%)
Hmean 21 8874.53 ( 0.00%) 9193.66 * 3.60%*
Hmean 30 9361.93 ( 0.00%) 10552.03 * 12.71%*
Hmean 48 9540.14 ( 0.00%) 11936.22 * 25.12%*
Hmean 79 9208.74 ( 0.00%) 12367.06 * 34.30%*
Hmean 110 8573.12 ( 0.00%) 12114.01 * 41.30%*
Hmean 141 7791.33 ( 0.00%) 11391.60 * 46.21%*
Hmean 160 7666.60 ( 0.00%) 10789.23 * 40.73%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext
Amean 1 14.62 ( 0.00%) 14.21 ( 2.80%)
Amean 4 18.76 ( 0.00%) 19.45 ( -3.67%)
Amean 7 23.71 ( 0.00%) 23.82 ( -0.48%)
Amean 12 31.25 ( 0.00%) 31.78 ( -1.69%)
Amean 21 45.12 ( 0.00%) 43.53 ( 3.52%)
Amean 30 61.07 ( 0.00%) 54.13 ( 11.37%)
Amean 48 95.91 ( 0.00%) 76.51 ( 20.23%)
Amean 79 163.38 ( 0.00%) 121.46 ( 25.66%)
Amean 110 243.91 ( 0.00%) 172.56 ( 29.25%)
Amean 141 343.47 ( 0.00%) 236.07 ( 31.27%)
Amean 160 401.15 ( 0.00%) 283.64 ( 29.29%)
Stddev 1 0.52 ( 0.00%) 0.44 ( 15.50%)
Stddev 4 1.36 ( 0.00%) 1.42 ( -4.91%)
Stddev 7 1.88 ( 0.00%) 1.84 ( 2.27%)
Stddev 12 3.06 ( 0.00%) 2.55 ( 16.57%)
Stddev 21 5.78 ( 0.00%) 3.70 ( 35.90%)
Stddev 30 9.85 ( 0.00%) 5.10 ( 48.26%)
Stddev 48 22.31 ( 0.00%) 8.30 ( 62.79%)
Stddev 79 35.96 ( 0.00%) 16.79 ( 53.31%)
Stddev 110 59.04 ( 0.00%) 30.08 ( 49.04%)
Stddev 141 85.38 ( 0.00%) 47.05 ( 44.89%)
Stddev 160 96.38 ( 0.00%) 47.27 ( 50.95%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext
Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%)
Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%)
Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%)
Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%)
Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%)
Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%)
Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%)
Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%)
Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%)
Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%)
Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%)
Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%)
Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%)
Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%)
Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%)
Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%)
Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%)
Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%)
Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%)
Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%)
Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%)
Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext
Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%)
Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%)
Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%)
Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%)
Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%)
Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%*
Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%*
Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%*
Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%*
Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%*
Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%*
Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%*
Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%)
Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%*
Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%*
Processes using pipes are impacted but the variance (not presented)
is bad enough that it's close to noise. These results are not always
reproducible; if executed across multiple reboots, they may be neutral or
show small gains, so the worst measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext
Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%)
Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%)
Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%)
Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%)
Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%)
Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%)
Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%)
Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%)
Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%)
Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%)
Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%)
Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%)
Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%)
Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%)
Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%)
As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnextr1
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%*
Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%*
Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%)
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%*
Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnextr1
Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%*
Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%)
Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 117 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..158e0430449b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
return delta_exec;
}
+enum preempt_wakeup_action {
+ PREEMPT_WAKEUP_NONE, /* No action on the buddy */
+ PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
+ * before rescheduling.
+ */
+ PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_wakeup_action
+__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ bool pse_before;
+
+ /*
+	 * Ignore wakee preemption on WF_FORK as it is less likely that
+	 * there is shared data as exec often follows fork. Do not
+ * preempt for tasks that are sched_delayed as it would violate
+ * EEVDF to forcibly queue an ineligible task.
+ */
+ if (!sched_feat(NEXT_BUDDY) ||
+ (wake_flags & WF_FORK) ||
+ (pse->sched_delayed)) {
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ /* Reschedule if waker is no longer eligible. */
+ if (!entity_eligible(cfs_rq, se))
+ return PREEMPT_WAKEUP_RESCHED;
+
+ /*
+ * Keep existing buddy if the deadline is sooner than pse.
+ * The downside is that the older buddy may be cache cold
+	 * but that is unpredictable whereas an earlier deadline
+ * is absolute.
+ */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return PREEMPT_WAKEUP_NONE;
+
+ set_next_buddy(pse);
+
+ /*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+ pse_before = entity_before(pse, se);
+ if (wake_flags & WF_SYNC) {
+ u64 delta = rq_clock_task(rq) - se->exec_start;
+ u64 threshold = sysctl_sched_migration_cost;
+
+ /*
+ * WF_SYNC without WF_TTWU is not expected so warn if it
+ * happens even though it is likely harmless.
+ */
+ WARN_ON_ONCE(!(wake_flags | WF_TTWU));
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU when they could run on other CPUs. Reduce the
+ * threshold before preemption is allowed to an
+ * arbitrary lower value as it is more likely (but not
+ * guaranteed) the waker requires the wakee to finish.
+ */
+ if (wake_flags & WF_RQ_SELECTED)
+ threshold >>= 2;
+
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for
+ * batch wakeups to be issued.
+ */
+ if (pse_before && delta >= threshold)
+ return PREEMPT_WAKEUP_RESCHED;
+
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ return PREEMPT_WAKEUP_NEXT;
+}
+
/*
* Used by other classes to account runtime.
*/
@@ -5512,16 +5607,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- bool do_preempt_short = false;
+ enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
if (unlikely(se == pse))
return;
@@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (task_is_throttled(p))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = true;
+ do_preempt_short = PREEMPT_WAKEUP_NEXT;
goto preempt;
}
@@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* If @p has a shorter slice than current and @p is eligible, override
* current's slice protection in order to allow preemption.
*/
- do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+ do_preempt_short = PREEMPT_WAKEUP_NEXT;
+ } else {
+ /*
+ * If @p potentially is completing work required by current then
+ * consider preemption.
+ */
+ do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
+ pse, se);
+ }
+
+ switch (do_preempt_short) {
+ case PREEMPT_WAKEUP_NONE:
+ return;
+ case PREEMPT_WAKEUP_RESCHED:
+ goto preempt;
+ case PREEMPT_WAKEUP_NEXT:
+ break;
+ }
/*
* If @p has become the most eligible task, force preemption.
@@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+ if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
update_protect_slice(cfs_rq, se);
return;
--
2.51.0
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-27 13:39 ` Mel Gorman
@ 2025-10-28 15:05 ` Peter Zijlstra
2025-10-31 9:46 ` Mel Gorman
2025-10-28 15:09 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-10-28 15:05 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
Still going through this; just a few early comments.
> kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 117 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d..158e0430449b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> return delta_exec;
> }
>
> +enum preempt_wakeup_action {
> + PREEMPT_WAKEUP_NONE, /* No action on the buddy */
> + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
> + * before rescheduling.
> + */
> + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
> +};
> +
> +static void set_next_buddy(struct sched_entity *se);
> +
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_WORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if (!sched_feat(NEXT_BUDDY) ||
> + (wake_flags & WF_FORK) ||
> + (pse->sched_delayed)) {
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (!entity_eligible(cfs_rq, se))
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The downside is that the older buddy may be cache cold
> + * but that is unpredictable where as an earlier deadline
> + * is absolute.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NONE;
> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if (wake_flags & WF_SYNC) {
> + u64 delta = rq_clock_task(rq) - se->exec_start;
> + u64 threshold = sysctl_sched_migration_cost;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it
> + * happens even though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags | WF_TTWU));
> +
> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Reduce the
> + * threshold before preemption is allowed to an
> + * arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}
All this seems weirdly placed inside the file. Is there a reason this is
placed so far away from its only caller?
> /*
> * Used by other classes to account runtime.
> */
> @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> -static void set_next_buddy(struct sched_entity *se);
> -
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> struct sched_entity *se = &donor->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
> - bool do_preempt_short = false;
> + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
naming seems off; I'm not sure what this still has to do with short.
Perhaps just preempt_action or whatever?
>
> if (unlikely(se == pse))
> return;
> @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (task_is_throttled(p))
> return;
>
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> - set_next_buddy(pse);
> - }
> -
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task
> * wake up path.
> @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> goto preempt;
> }
>
> @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * If @p has a shorter slice than current and @p is eligible, override
> * current's slice protection in order to allow preemption.
> */
> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> + } else {
> + /*
> + * If @p potentially is completing work required by current then
> + * consider preemption.
> + */
> + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> + pse, se);
> + }
> +
> + switch (do_preempt_short) {
> + case PREEMPT_WAKEUP_NONE:
> + return;
> + case PREEMPT_WAKEUP_RESCHED:
> + goto preempt;
> + case PREEMPT_WAKEUP_NEXT:
> + break;
> + }
>
> /*
> * If @p has become the most eligible task, force preemption.
> @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> goto preempt;
>
> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
> update_protect_slice(cfs_rq, se);
WAKEUP_NONE did a return above, I don't think you can get here with
WAKEUP_NONE, making the above condition always true.
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-27 13:39 ` Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
@ 2025-10-28 15:09 ` Peter Zijlstra
2025-10-31 9:48 ` Mel Gorman
2025-10-28 15:33 ` Peter Zijlstra
2025-10-30 9:10 ` Peter Zijlstra
3 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-10-28 15:09 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> +enum preempt_wakeup_action {
> + PREEMPT_WAKEUP_NONE, /* No action on the buddy */
> + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
> + * before rescheduling.
> + */
> + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
> +};
In pre-existing code that isn't modified by this patch, we have:
if (do_preempt_short)
Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make
that explicit in the enum above.
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-27 13:39 ` Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
2025-10-28 15:09 ` Peter Zijlstra
@ 2025-10-28 15:33 ` Peter Zijlstra
2025-10-28 15:47 ` Peter Zijlstra
2025-10-30 9:10 ` Peter Zijlstra
3 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-10-28 15:33 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> goto preempt;
> }
I'm confused, should this not be WAKEUP_RESCHED?
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-28 15:33 ` Peter Zijlstra
@ 2025-10-28 15:47 ` Peter Zijlstra
0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-10-28 15:47 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Tue, Oct 28, 2025 at 04:33:51PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > * When non-idle entity preempt an idle entity,
> > * don't give idle entity slice protection.
> > */
> > - do_preempt_short = true;
> > + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> > goto preempt;
> > }
>
> I'm confused, should this not be WAKEUP_RESCHED?
It doesn't matter, you cannot end up with !do_preempt_short at preempt:,
so that condition is always true and can thus be deleted, at which point
the value of do_preempt_short is irrelevant and doesn't need to be set.
A little something like so perhaps...
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8727,7 +8727,7 @@ static void set_next_buddy(struct sched_
}
enum preempt_wakeup_action {
- PREEMPT_WAKEUP_NONE, /* No action on the buddy */
+ PREEMPT_WAKEUP_NONE = 0, /* No action on the buddy */
PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
* before rescheduling.
*/
@@ -8814,7 +8814,7 @@ __do_preempt_buddy(struct rq *rq, struct
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
- enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
+ enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
struct task_struct *donor = rq->donor;
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
@@ -8863,7 +8863,6 @@ static void check_preempt_wakeup_fair(st
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = PREEMPT_WAKEUP_NEXT;
goto preempt;
}
@@ -8883,17 +8882,17 @@ static void check_preempt_wakeup_fair(st
* current's slice protection in order to allow preemption.
*/
if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
- do_preempt_short = PREEMPT_WAKEUP_NEXT;
+ preempt_action = PREEMPT_WAKEUP_NEXT;
} else {
/*
* If @p potentially is completing work required by current then
* consider preemption.
*/
- do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
+ preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
pse, se);
}
- switch (do_preempt_short) {
+ switch (preempt_action) {
case PREEMPT_WAKEUP_NONE:
return;
case PREEMPT_WAKEUP_RESCHED:
@@ -8905,18 +8904,16 @@ static void check_preempt_wakeup_fair(st
/*
* If @p has become the most eligible task, force preemption.
*/
- if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+ if (__pick_eevdf(cfs_rq, false) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
+ if (sched_feat(RUN_TO_PARITY))
update_protect_slice(cfs_rq, se);
return;
preempt:
- if (do_preempt_short)
- cancel_protect_slice(se);
-
+ cancel_protect_slice(se);
resched_curr_lazy(rq);
}
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-27 13:39 ` Mel Gorman
` (2 preceding siblings ...)
2025-10-28 15:33 ` Peter Zijlstra
@ 2025-10-30 9:10 ` Peter Zijlstra
2025-10-31 10:27 ` Mel Gorman
3 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-10-30 9:10 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_WORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if (!sched_feat(NEXT_BUDDY) ||
This seems wrong, that would mean wakeup preemption gets killed the
moment you disable NEXT_BUDDY, that can't be right.
> + (wake_flags & WF_FORK) ||
> + (pse->sched_delayed)) {
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (!entity_eligible(cfs_rq, se))
> + return PREEMPT_WAKEUP_RESCHED;
That comment isn't accurate, unless you add: && in_task(). That is, if
this is an interrupt doing the wakeup, it has nothing to do with
current.
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The downside is that the older buddy may be cache cold
> + * but that is unpredictable where as an earlier deadline
> + * is absolute.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NONE;
But if previously we set next and didn't preempt, we should try again,
maybe it has more success now. That is, should this not be _NEXT?
> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if (wake_flags & WF_SYNC) {
> + u64 delta = rq_clock_task(rq) - se->exec_start;
> + u64 threshold = sysctl_sched_migration_cost;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it
> + * happens even though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags | WF_TTWU));
s/|/&/ ?
> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Reduce the
> + * threshold before preemption is allowed to an
> + * arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}
Add to this that AFAICT your patch ends up doing:
__pick_eevdf(.protect = false) == pse
which unconditionally disables the slice protection feature.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-28 15:05 ` Peter Zijlstra
@ 2025-10-31 9:46 ` Mel Gorman
0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2025-10-31 9:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Tue, Oct 28, 2025 at 04:05:45PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> > kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 117 insertions(+), 20 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc0b7ce8a65d..158e0430449b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> > return delta_exec;
> > }
> >
> > +enum preempt_wakeup_action {
> > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */
> > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
> > + * before rescheduling.
> > + */
> > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
> > +};
> > +
> > <SNIP>
>
> All this seems weirdly placed inside the file. Is there a reason this is
> placed so far away from its only caller?
>
No, I don't recall why I placed it there; it may simply have been
required by an early prototype.
> > /*
> > * Used by other classes to account runtime.
> > */
>
> > @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > hrtick_update(rq);
> > }
> >
> > -static void set_next_buddy(struct sched_entity *se);
> > -
> > /*
> > * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> > * failing half-way through and resume the dequeue later.
> > @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > struct sched_entity *se = &donor->se, *pse = &p->se;
> > struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> > int cse_is_idle, pse_is_idle;
> > - bool do_preempt_short = false;
> > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE;
>
> naming seems off; I'm not sure what this still has to do with short.
> Perhaps just preempt_action or whatever?
>
As good a name as any; I may change it to something more meaningful once
I go through the rest of the review.
> >
> > if (unlikely(se == pse))
> > return;
> > @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > if (task_is_throttled(p))
> > return;
> >
> > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> > - set_next_buddy(pse);
> > - }
> > -
> > /*
> > * We can come here with TIF_NEED_RESCHED already set from new task
> > * wake up path.
> > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > * When non-idle entity preempt an idle entity,
> > * don't give idle entity slice protection.
> > */
> > - do_preempt_short = true;
> > + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> > goto preempt;
> > }
> >
> > @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > * If @p has a shorter slice than current and @p is eligible, override
> > * current's slice protection in order to allow preemption.
> > */
> > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> > + do_preempt_short = PREEMPT_WAKEUP_NEXT;
> > + } else {
> > + /*
> > + * If @p potentially is completing work required by current then
> > + * consider preemption.
> > + */
> > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> > + pse, se);
> > + }
> > +
> > + switch (do_preempt_short) {
> > + case PREEMPT_WAKEUP_NONE:
> > + return;
> > + case PREEMPT_WAKEUP_RESCHED:
> > + goto preempt;
> > + case PREEMPT_WAKEUP_NEXT:
> > + break;
> > + }
> >
> > /*
> > * If @p has become the most eligible task, force preemption.
> > @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> > goto preempt;
> >
> > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE)
> > update_protect_slice(cfs_rq, se);
>
> WAKEUP_NONE did a return above, I don't think you can get here with
> WAKEUP_NONE, making the above condition always true.
Correct. That was not always the case during development, but the rename
alone now highlights issues beyond this oddity.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-28 15:09 ` Peter Zijlstra
@ 2025-10-31 9:48 ` Mel Gorman
0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2025-10-31 9:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Tue, Oct 28, 2025 at 04:09:51PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> > +enum preempt_wakeup_action {
> > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */
> > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
> > + * before rescheduling.
> > + */
> > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
> > +};
>
> In pre-existing code that isn't modified by this patch, we have:
>
> if (do_preempt_short)
>
> Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make
> that explicit in the enum above.
Will do.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-10-30 9:10 ` Peter Zijlstra
@ 2025-10-31 10:27 ` Mel Gorman
0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2025-10-31 10:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason
On Thu, Oct 30, 2025 at 10:10:58AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> > +static inline enum preempt_wakeup_action
> > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> > + struct sched_entity *pse, struct sched_entity *se)
> > +{
> > + bool pse_before;
> > +
> > + /*
> > + * Ignore wakee preemption on WF_FORK as it is less likely that
> > + * there is shared data as exec often follows fork. Do not
> > + * preempt for tasks that are sched_delayed as it would violate
> > + * EEVDF to forcibly queue an ineligible task.
> > + */
> > + if (!sched_feat(NEXT_BUDDY) ||
>
> This seems wrong, that would mean wakeup preemption gets killed the
> moment you disable NEXT_BUDDY, that can't be right.
>
Correct, the check is bogus.
> > + (wake_flags & WF_FORK) ||
> > + (pse->sched_delayed)) {
> > + return PREEMPT_WAKEUP_NONE;
> > + }
> > +
> > + /* Reschedule if waker is no longer eligible. */
> > + if (!entity_eligible(cfs_rq, se))
> > + return PREEMPT_WAKEUP_RESCHED;
>
> That comment isn't accurate, unless you add: && in_task(). That is, if
> this is an interrupt doing the wakeup, it has nothing to do with
> current.
>
That was a complete oversight.
> > + /*
> > + * Keep existing buddy if the deadline is sooner than pse.
> > + * The downside is that the older buddy may be cache cold
> > + * but that is unpredictable whereas an earlier deadline
> > + * is absolute.
> > + */
> > + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > + return PREEMPT_WAKEUP_NONE;
>
> But if previously we set next and didn't preempt, we should try again,
> maybe it has more success now. That is, should this not be _NEXT?
>
The context for why the original buddy was set is now lost but you're
right, it is more straightforward to reconsider the old buddy. It's
more in line with EEVDF objectives, and predicting cache residency or
future hotness would require crystal-ball instructions.
> > +
> > + set_next_buddy(pse);
> > +
> > + /*
> > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> > + * strictly enforced because the hint is either misunderstood or
> > + * multiple tasks must be woken up.
> > + */
> > + pse_before = entity_before(pse, se);
> > + if (wake_flags & WF_SYNC) {
> > + u64 delta = rq_clock_task(rq) - se->exec_start;
> > + u64 threshold = sysctl_sched_migration_cost;
> > +
> > + /*
> > + * WF_SYNC without WF_TTWU is not expected so warn if it
> > + * happens even though it is likely harmless.
> > + */
> > + WARN_ON_ONCE(!(wake_flags | WF_TTWU));
>
> s/|/&/ ?
>
Bah, thanks.
> > + if ((s64)delta < 0)
> > + delta = 0;
> > +
> > + /*
> > + * WF_RQ_SELECTED implies the tasks are stacking on a
> > + * CPU when they could run on other CPUs. Reduce the
> > + * threshold before preemption is allowed to an
> > + * arbitrary lower value as it is more likely (but not
> > + * guaranteed) the waker requires the wakee to finish.
> > + */
> > + if (wake_flags & WF_RQ_SELECTED)
> > + threshold >>= 2;
> > +
> > + /*
> > + * As WF_SYNC is not strictly obeyed, allow some runtime for
> > + * batch wakeups to be issued.
> > + */
> > + if (pse_before && delta >= threshold)
> > + return PREEMPT_WAKEUP_RESCHED;
> > +
> > + return PREEMPT_WAKEUP_NONE;
> > + }
> > +
> > + return PREEMPT_WAKEUP_NEXT;
> > +}
>
> Add to this that AFAICT your patch ends up doing:
>
> __pick_eevdf(.protect = false) == pse
>
> which unconditionally disables the slice protection feature.
>
Yes, trying to converge PREEMPT_SHORT with NEXT_BUDDY during prototyping
was a poor decision because it led to mistakes like this.
--
Mel Gorman
SUSE Labs
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 11:04 [PATCH 0/2 v4] " Mel Gorman
@ 2025-11-03 11:04 ` Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
0 siblings, 2 replies; 44+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
it's throttled based on the accumulated runtime of the waker so there is
a chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; while it is a poor metric for
dbench, it is the default one reported. The test machine is a 2-socket
machine and the backing filesystem is XFS as a lot of the IO work is
dispatched to kernel threads. It's important to note that these results
are not representative across all machines, especially Zen machines, as
different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v4
Hmean 1 1268.80 ( 0.00%) 1268.92 ( 0.01%)
Hmean 4 3971.74 ( 0.00%) 3952.44 ( -0.49%)
Hmean 7 5548.23 ( 0.00%) 5375.49 * -3.11%*
Hmean 12 7310.86 ( 0.00%) 7080.80 ( -3.15%)
Hmean 21 8874.53 ( 0.00%) 9044.30 ( 1.91%)
Hmean 30 9361.93 ( 0.00%) 10430.80 * 11.42%*
Hmean 48 9540.14 ( 0.00%) 11786.28 * 23.54%*
Hmean 79 9208.74 ( 0.00%) 11910.07 * 29.33%*
Hmean 110 8573.12 ( 0.00%) 11571.03 * 34.97%*
Hmean 141 7791.33 ( 0.00%) 11151.69 * 43.13%*
Hmean 160 7666.60 ( 0.00%) 10661.53 * 39.06%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v4
Amean 1 14.62 ( 0.00%) 14.69 ( -0.50%)
Amean 4 18.76 ( 0.00%) 18.88 ( -0.63%)
Amean 7 23.71 ( 0.00%) 24.52 ( -3.43%)
Amean 12 31.25 ( 0.00%) 32.20 ( -3.03%)
Amean 21 45.12 ( 0.00%) 44.19 ( 2.05%)
Amean 30 61.07 ( 0.00%) 54.74 ( 10.35%)
Amean 48 95.91 ( 0.00%) 77.52 ( 19.18%)
Amean 79 163.38 ( 0.00%) 126.04 ( 22.85%)
Amean 110 243.91 ( 0.00%) 179.93 ( 26.23%)
Amean 141 343.47 ( 0.00%) 240.03 ( 30.12%)
Amean 160 401.15 ( 0.00%) 284.92 ( 28.98%)
That is still looking good and the variance is reduced quite a bit with
the caveat that different CPU families may yield different results.
Fairness is a concern so the next report tracks how many milliseconds it
takes for all clients to complete a workfile. This one is tricky because
dbench makes no effort to synchronise clients so the durations at
benchmark start time differ substantially from typical runtimes. Depending
on the configuration, the benchmark can be dominated by the timing of
fsync of different clients. This problem could be mitigated by warming up
the benchmark for a number of minutes but it's a matter of opinion whether
that counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v4
Amean 1 15.06 ( 0.00%) 15.10 ( -0.27%)
Amean 4 603.81 ( 0.00%) 1451.94 (-140.46%)
Amean 7 855.32 ( 0.00%) 1116.68 ( -30.56%)
Amean 12 1890.02 ( 0.00%) 2075.98 ( -9.84%)
Amean 21 3195.23 ( 0.00%) 2508.86 ( 21.48%)
Amean 30 13919.53 ( 0.00%) 3324.58 ( 76.12%)
Amean 48 25246.07 ( 0.00%) 4146.10 ( 83.58%)
Amean 79 29701.84 ( 0.00%) 21338.68 ( 28.16%)
Amean 110 22803.03 ( 0.00%) 29502.47 ( -29.38%)
Amean 141 36356.07 ( 0.00%) 25292.07 ( 30.43%)
Amean 160 17046.71 ( 0.00%) 27043.67 ( -58.64%)
Stddev 1 0.47 ( 0.00%) 0.46 ( 2.84%)
Stddev 4 395.24 ( 0.00%) 782.96 ( -98.10%)
Stddev 7 467.24 ( 0.00%) 738.26 ( -58.00%)
Stddev 12 1071.43 ( 0.00%) 1124.78 ( -4.98%)
Stddev 21 1694.50 ( 0.00%) 1463.27 ( 13.65%)
Stddev 30 7945.63 ( 0.00%) 1889.19 ( 76.22%)
Stddev 48 14339.51 ( 0.00%) 2226.59 ( 84.47%)
Stddev 79 16620.91 ( 0.00%) 12027.14 ( 27.64%)
Stddev 110 12912.15 ( 0.00%) 16601.77 ( -28.57%)
Stddev 141 20700.13 ( 0.00%) 13624.50 ( 34.18%)
Stddev 160 9079.16 ( 0.00%) 15729.63 ( -73.25%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v4
Amean 1 0.2657 ( 0.00%) 0.2120 ( 20.20%)
Amean 4 0.6107 ( 0.00%) 0.6120 ( -0.22%)
Amean 7 0.7923 ( 0.00%) 0.7307 ( 7.78%)
Amean 12 1.1500 ( 0.00%) 1.1480 ( 0.17%)
Amean 21 1.7950 ( 0.00%) 1.8207 ( -1.43%)
Amean 30 2.3207 ( 0.00%) 2.5340 * -9.19%*
Amean 48 3.5023 ( 0.00%) 4.0047 * -14.34%*
Amean 79 4.8093 ( 0.00%) 5.3137 * -10.49%*
Amean 110 6.1160 ( 0.00%) 6.5287 * -6.75%*
Amean 141 7.4763 ( 0.00%) 7.9553 * -6.41%*
Amean 172 8.9560 ( 0.00%) 9.3977 * -4.93%*
Amean 203 10.4783 ( 0.00%) 11.0160 * -5.13%*
Amean 234 12.4977 ( 0.00%) 13.5510 ( -8.43%)
Amean 265 14.7003 ( 0.00%) 15.6950 * -6.77%*
Amean 296 16.1007 ( 0.00%) 17.0860 * -6.12%*
Processes using pipes are impacted but the variance (not presented) is
bad enough that it's close to noise. These results are not always
reproducible. If executed across multiple reboots, they may show neutral
or small gains, so the worst measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v4
Amean 1 0.3073 ( 0.00%) 0.3187 ( -3.69%)
Amean 4 0.7863 ( 0.00%) 0.7850 ( 0.17%)
Amean 7 1.3670 ( 0.00%) 1.3747 ( -0.56%)
Amean 12 2.1337 ( 0.00%) 2.1450 ( -0.53%)
Amean 21 3.4683 ( 0.00%) 3.4293 ( 1.12%)
Amean 30 4.7247 ( 0.00%) 4.8000 ( -1.59%)
Amean 48 7.6097 ( 0.00%) 7.6943 ( -1.11%)
Amean 79 14.7957 ( 0.00%) 15.3353 ( -3.65%)
Amean 110 21.3413 ( 0.00%) 21.5753 ( -1.10%)
Amean 141 29.0503 ( 0.00%) 28.8740 ( 0.61%)
Amean 172 36.4660 ( 0.00%) 35.7207 ( 2.04%)
Amean 203 39.7177 ( 0.00%) 39.7343 ( -0.04%)
Amean 234 42.1120 ( 0.00%) 43.0177 ( -2.15%)
Amean 265 45.7830 ( 0.00%) 46.9863 ( -2.63%)
Amean 296 50.7043 ( 0.00%) 50.7017 ( 0.01%)
As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v4
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%*
Amean Wakeup-90th-80 46.56 ( 0.00%) 19.61 * 57.88%*
Amean Wakeup-99th-80 119.61 ( 0.00%) 87.28 * 27.03%*
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 367.00 * 88.51%*
Amean Wakeup-max-80 3874.28 ( 0.00%) 3951.39 ( -1.99%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v4r1
Hmean RPS-20th-80 8900.91 ( 0.00%) 9167.86 * 3.00%*
Hmean RPS-50th-80 8987.41 ( 0.00%) 9213.06 * 2.51%*
Hmean RPS-90th-80 9123.73 ( 0.00%) 9263.80 ( 1.54%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9282.18 ( 0.96%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 145 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 123 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..688bf010f444 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1193,6 +1203,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
return delta_exec;
}
+static void set_next_buddy(struct sched_entity *se);
+
/*
* Used by other classes to account runtime.
*/
@@ -5512,16 +5524,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7028,8 +7030,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se)
}
}
+enum preempt_wakeup_action {
+ PREEMPT_WAKEUP_NONE, /* No action on the buddy */
+ PREEMPT_WAKEUP_SHORT, /* Override current's slice
+ * protection to allow
+ * preemption.
+ */
+ PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
+ * before rescheduling.
+ */
+ PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
+};
+
+static inline enum preempt_wakeup_action
+__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ bool pse_before;
+
+ /*
+ * Ignore wakee preemption on WF_FORK as it is less likely that
+ * there is shared data as exec often follows fork. Do not
+ * preempt for tasks that are sched_delayed as it would violate
+ * EEVDF to forcibly queue an ineligible task.
+ */
+ if ((wake_flags & WF_FORK) || pse->sched_delayed)
+ return PREEMPT_WAKEUP_NONE;
+
+ /* Reschedule if waker is no longer eligible. */
+ if (in_task() && !entity_eligible(cfs_rq, se))
+ return PREEMPT_WAKEUP_RESCHED;
+
+ /*
+ * Keep existing buddy if the deadline is sooner than pse.
+ * The older buddy may be cache cold and completely unrelated
+ * to the current wakeup but that is unpredictable whereas
+ * obeying the deadline is more in line with EEVDF objectives.
+ */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return PREEMPT_WAKEUP_NEXT;
+
+ set_next_buddy(pse);
+
+ /*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+ pse_before = entity_before(pse, se);
+ if (wake_flags & WF_SYNC) {
+ u64 delta = rq_clock_task(rq) - se->exec_start;
+ u64 threshold = sysctl_sched_migration_cost;
+
+ /*
+ * WF_SYNC without WF_TTWU is not expected so warn if it
+ * happens even though it is likely harmless.
+ */
+ WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+ if ((s64)delta < 0)
+ delta = 0;
+
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU when they could run on other CPUs. Reduce the
+ * threshold before preemption is allowed to an
+ * arbitrary lower value as it is more likely (but not
+ * guaranteed) the waker requires the wakee to finish.
+ */
+ if (wake_flags & WF_RQ_SELECTED)
+ threshold >>= 2;
+
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for
+ * batch wakeups to be issued.
+ */
+ if (pse_before && delta >= threshold)
+ return PREEMPT_WAKEUP_RESCHED;
+
+ return PREEMPT_WAKEUP_NONE;
+ }
+
+ return PREEMPT_WAKEUP_NEXT;
+}
+
+
/*
* Preempt the current task with a newly woken task if needed:
*/
@@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- bool do_preempt_short = false;
+ enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
if (unlikely(se == pse))
return;
@@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (task_is_throttled(p))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = true;
+ preempt_action = PREEMPT_WAKEUP_SHORT;
goto preempt;
}
@@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* If @p has a shorter slice than current and @p is eligible, override
* current's slice protection in order to allow preemption.
*/
- do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+ preempt_action = PREEMPT_WAKEUP_SHORT;
+ } else {
+ /*
+ * If @p potentially is completing work required by current then
+ * consider preemption.
+ */
+ preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
+ pse, se);
+ }
+
+ switch (preempt_action) {
+ case PREEMPT_WAKEUP_NONE:
+ return;
+ case PREEMPT_WAKEUP_RESCHED:
+ goto preempt;
+ case PREEMPT_WAKEUP_SHORT:
+ fallthrough;
+ case PREEMPT_WAKEUP_NEXT:
+ break;
+ }
/*
* If @p has become the most eligible task, force preemption.
*/
- if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+ if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+ if (sched_feat(RUN_TO_PARITY))
update_protect_slice(cfs_rq, se);
return;
preempt:
- if (do_preempt_short)
+ if (preempt_action == PREEMPT_WAKEUP_SHORT)
cancel_protect_slice(se);
resched_curr_lazy(rq);
--
2.51.0
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
@ 2025-11-03 14:07 ` Peter Zijlstra
2025-11-03 14:14 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-11-03 14:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel
On Mon, Nov 03, 2025 at 11:04:45AM +0000, Mel Gorman wrote:
> @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se)
> }
> }
>
> +enum preempt_wakeup_action {
> + PREEMPT_WAKEUP_NONE, /* No action on the buddy */
> + PREEMPT_WAKEUP_SHORT, /* Override current's slice
> + * protection to allow
> + * preemption.
> + */
> + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
> + * before rescheduling.
> + */
> + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
> +};
Not really a fan of that comment style. While noodling, I've managed to
'accidentally' rename NEXT to PICK, since it's more about letting
__pick_eevdf() decide.
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_FORK as it is less likely that
> > + * there is shared data as exec often follows fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
> + return PREEMPT_WAKEUP_NONE;
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (in_task() && !entity_eligible(cfs_rq, se))
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The older buddy may be cache cold and completely unrelated
> > + * to the current wakeup but that is unpredictable whereas
> + * obeying the deadline is more in line with EEVDF objectives.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NEXT;
> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if (wake_flags & WF_SYNC) {
> + u64 delta = rq_clock_task(rq) - se->exec_start;
> + u64 threshold = sysctl_sched_migration_cost;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it
> + * happens even though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Reduce the
> + * threshold before preemption is allowed to an
> + * arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}
This seems to do 3 things:
- that reschedule waker on !eligible
- set the buddy (while losing NEXT_BUDDY)
- the WF_SYNC thing
Let's split that out.
> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> struct sched_entity *se = &donor->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
> - bool do_preempt_short = false;
> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
I'm thinking NONE is the wrong default
>
> if (unlikely(se == pse))
> return;
> @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (task_is_throttled(p))
> return;
>
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> - set_next_buddy(pse);
> - }
This was the only NEXT_BUDDY site and none were returned in trade.
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task
> * wake up path.
> @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> goto preempt;
> }
>
> @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * If @p has a shorter slice than current and @p is eligible, override
> * current's slice protection in order to allow preemption.
> */
> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> + } else {
> + /*
> + * If @p potentially is completing work required by current then
> + * consider preemption.
> + */
> + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> + pse, se);
> + }
> +
> + switch (preempt_action) {
> + case PREEMPT_WAKEUP_NONE:
> + return;
> + case PREEMPT_WAKEUP_RESCHED:
> + goto preempt;
> + case PREEMPT_WAKEUP_SHORT:
> + fallthrough;
(this is redundant)
> + case PREEMPT_WAKEUP_NEXT:
> + break;
> + }
>
> /*
> * If @p has become the most eligible task, force preemption.
> */
> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
> goto preempt;
>
> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> + if (sched_feat(RUN_TO_PARITY))
> update_protect_slice(cfs_rq, se);
>
> return;
>
> preempt:
> - if (do_preempt_short)
> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
> cancel_protect_slice(se);
>
> resched_curr_lazy(rq);
Right, much better. But since I was noodling, how about something like
so on top?
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8727,23 +8727,16 @@ static void set_next_buddy(struct sched_
}
enum preempt_wakeup_action {
- PREEMPT_WAKEUP_NONE, /* No action on the buddy */
- PREEMPT_WAKEUP_SHORT, /* Override current's slice
- * protection to allow
- * preemption.
- */
- PREEMPT_WAKEUP_NEXT, /* Check next is most eligible
- * before rescheduling.
- */
- PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */
+ PREEMPT_WAKEUP_NONE, /* No preemption. */
+ PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
+ PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
+ PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
};
-static inline enum preempt_wakeup_action
-__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
- struct sched_entity *pse, struct sched_entity *se)
+static inline void
+set_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
{
- bool pse_before;
-
/*
* Ignore wakee preemption on WF_FORK as it is less likely that
* there is shared data as exec often follow fork. Do not
@@ -8751,11 +8744,7 @@ __do_preempt_buddy(struct rq *rq, struct
* EEVDF to forcibly queue an ineligible task.
*/
if ((wake_flags & WF_FORK) || pse->sched_delayed)
- return PREEMPT_WAKEUP_NONE;
-
- /* Reschedule if waker is no longer eligible. */
- if (in_task() && !entity_eligible(cfs_rq, se))
- return PREEMPT_WAKEUP_RESCHED;
+ return;
/*
* Keep existing buddy if the deadline is sooner than pse.
@@ -8764,63 +8753,62 @@ __do_preempt_buddy(struct rq *rq, struct
* obeying the deadline is more in line with EEVDF objectives.
*/
if (cfs_rq->next && entity_before(cfs_rq->next, pse))
- return PREEMPT_WAKEUP_NEXT;
+ return;
set_next_buddy(pse);
+}
- /*
- * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
- * strictly enforced because the hint is either misunderstood or
- * multiple tasks must be woken up.
- */
- pse_before = entity_before(pse, se);
- if (wake_flags & WF_SYNC) {
- u64 delta = rq_clock_task(rq) - se->exec_start;
- u64 threshold = sysctl_sched_migration_cost;
-
- /*
- * WF_SYNC without WF_TTWU is not expected so warn if it
- * happens even though it is likely harmless.
- */
- WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ u64 delta = rq_clock_task(rq) - se->exec_start;
+ u64 threshold = sysctl_sched_migration_cost;
- if ((s64)delta < 0)
- delta = 0;
+ /*
+ * WF_SYNC without WF_TTWU is not expected so warn if it
+ * happens even though it is likely harmless.
+ */
+ WARN_ON_ONCE(!(wake_flags & WF_TTWU));
- /*
- * WF_RQ_SELECTED implies the tasks are stacking on a
- * CPU when they could run on other CPUs. Reduce the
- * threshold before preemption is allowed to an
- * arbitrary lower value as it is more likely (but not
- * guaranteed) the waker requires the wakee to finish.
- */
- if (wake_flags & WF_RQ_SELECTED)
- threshold >>= 2;
+ if ((s64)delta < 0)
+ delta = 0;
- /*
- * As WF_SYNC is not strictly obeyed, allow some runtime for
- * batch wakeups to be issued.
- */
- if (pse_before && delta >= threshold)
- return PREEMPT_WAKEUP_RESCHED;
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a
+ * CPU when they could run on other CPUs. Reduce the
+ * threshold before preemption is allowed to an
+ * arbitrary lower value as it is more likely (but not
+ * guaranteed) the waker requires the wakee to finish.
+ */
+ if (wake_flags & WF_RQ_SELECTED)
+ threshold >>= 2;
- return PREEMPT_WAKEUP_NONE;
- }
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for
+ * batch wakeups to be issued.
+ */
+ if (entity_before(pse, se) && delta >= threshold)
+ return PREEMPT_WAKEUP_RESCHED;
- return PREEMPT_WAKEUP_NEXT;
+ return PREEMPT_WAKEUP_NONE;
}
-
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
+ enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
if (unlikely(se == pse))
return;
@@ -8886,26 +8874,40 @@ static void check_preempt_wakeup_fair(st
*/
if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
preempt_action = PREEMPT_WAKEUP_SHORT;
- } else {
- /*
- * If @p potentially is completing work required by current then
- * consider preemption.
- */
- preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
- pse, se);
+ goto pick;
+
}
+ /*
+ * If @p potentially is completing work required by current
+ * then consider preemption.
+ */
+ if (in_task() && !entity_eligible(cfs_rq, se)) {
+ /* Reschedule if waker is no longer eligible. */
+ preempt_action = PREEMPT_WAKEUP_RESCHED;
+ goto preempt;
+
+ }
+
+ if (sched_feat(NEXT_BUDDY))
+ set_preempt_buddy(rq, cfs_rq, wake_flags, pse, se);
+
+ if (wake_flags & WF_SYNC)
+ preempt_action = preempt_sync(rq, wake_flags, pse, se);
+
switch (preempt_action) {
case PREEMPT_WAKEUP_NONE:
return;
+
case PREEMPT_WAKEUP_RESCHED:
goto preempt;
+
case PREEMPT_WAKEUP_SHORT:
- fallthrough;
- case PREEMPT_WAKEUP_NEXT:
- break;
+ case PREEMPT_WAKEUP_PICK:
+ goto pick;
}
+pick:
/*
* If @p has become the most eligible task, force preemption.
*/
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 14:07 ` Peter Zijlstra
@ 2025-11-03 14:14 ` Peter Zijlstra
0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-11-03 14:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel
On Mon, Nov 03, 2025 at 03:07:11PM +0100, Peter Zijlstra wrote:
> > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> > struct sched_entity *se = &donor->se, *pse = &p->se;
> > struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> > int cse_is_idle, pse_is_idle;
> > - bool do_preempt_short = false;
> > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
>
> I'm thinking NONE is the wrong default
That was unfinished, kid interrupted me at an inopportune moment and I
lost my train of thought ;-)
Anyway, the current code defaults to what will be 'pick'. And I suppose
we could make the default depend on WAKEUP_PREEMPT but meh.
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
@ 2025-11-05 21:48 ` Madadi Vineeth Reddy
2025-11-07 8:53 ` Mel Gorman
1 sibling, 1 reply; 44+ messages in thread
From: Madadi Vineeth Reddy @ 2025-11-05 21:48 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, linux-kernel,
Madadi Vineeth Reddy
Hi Mel,
On 03/11/25 16:34, Mel Gorman wrote:
> Reimplement NEXT_BUDDY preemption to take into account the deadline and
> eligibility of the wakee with respect to the waker. In the event
> multiple buddies could be considered, the one with the earliest deadline
> is selected.
>
> Sync wakeups are treated differently to every other type of wakeup. The
> WF_SYNC assumption is that the waker promises to sleep in the very near
> future. This is violated in enough cases that WF_SYNC should be treated
> as a suggestion instead of a contract. If a waker does go to sleep almost
> immediately then the delay in wakeup is negligible. In all other cases,
> it's throttled based on the accumulated runtime of the waker so there is
> a chance that some batched wakeups have been issued before preemption.
[..snip..]
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_FORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
> + return PREEMPT_WAKEUP_NONE;
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (in_task() && !entity_eligible(cfs_rq, se))
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The older buddy may be cache cold and completely unrelated
> + * to the current wakeup but that is unpredictable where as
> + * obeying the deadline is more in line with EEVDF objectives.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NEXT;
IIUC, the logic attempts to maintain deadline ordering among buddies, but
this doesn't address tasks already on the runqueue.
So with frequent wakeups, queued tasks (even with earlier deadlines) may be
unfairly delayed. I understand that this would fade away quickly as the
woken up task that got to run due to buddy preference would accumulate negative
lag and would not be eligible to run again, but the starvation could be higher if
wakeups are very high.
To test this, I ran schbench (many message and worker threads) together with
stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
around 64%.
This shows a significant regression for CPU-bound tasks under heavy wakeup loads.
Thoughts?
I also ran schbench and hackbench. All these were run on a Power11 System
containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.
schbench(new) 99.0th latency (lower is better)
========
load baseline[pct imp](std%) With patch[pct imp]( std%)
20mt, 10wt 1.00 [ 0.00]( 0.24) 0.97 [ +3.00]( 0.18)
20mt, 20wt 1.00 [ 0.00]( 0.33) 1.00 [ 0.00]( 0.12)
20mt, 40wt 1.00 [ 0.00]( 2.84) 0.76 [ +24.0]( 0.32)
20mt, 80wt 1.00 [ 0.00]( 3.66) 0.66 [ +34.0]( 0.72)
20mt, 160wt 1.00 [ 0.00](12.92) 0.88 [ +12.0]( 6.77)
mt=message threads ; wt=worker threads
schbench being a wakeup sensitive workload showed good improvement.
hackbench (lower is better)
========
case load baseline[pct imp](std%) With patch[pct imp]( std%)
process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50)
process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27)
process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78)
process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86)
process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03)
process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67)
process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64)
process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00)
process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21)
process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26)
process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63)
process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48)
process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36)
process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21)
Didn't notice significant improvement or regression in Hackbench. Mostly in the noise
range.
Thanks,
Madadi Vineeth Reddy
> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if (wake_flags & WF_SYNC) {
> + u64 delta = rq_clock_task(rq) - se->exec_start;
> + u64 threshold = sysctl_sched_migration_cost;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it
> + * happens even though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Reduce the
> + * threshold before preemption is allowed to an
> + * arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}
> +
> +
> /*
> * Preempt the current task with a newly woken task if needed:
> */
> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> struct sched_entity *se = &donor->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
> - bool do_preempt_short = false;
> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE;
>
> if (unlikely(se == pse))
> return;
> @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (task_is_throttled(p))
> return;
>
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> - set_next_buddy(pse);
> - }
> -
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task
> * wake up path.
> @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> goto preempt;
> }
>
> @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * If @p has a shorter slice than current and @p is eligible, override
> * current's slice protection in order to allow preemption.
> */
> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> + } else {
> + /*
> + * If @p potentially is completing work required by current then
> + * consider preemption.
> + */
> + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags,
> + pse, se);
> + }
> +
> + switch (preempt_action) {
> + case PREEMPT_WAKEUP_NONE:
> + return;
> + case PREEMPT_WAKEUP_RESCHED:
> + goto preempt;
> + case PREEMPT_WAKEUP_SHORT:
> + fallthrough;
> + case PREEMPT_WAKEUP_NEXT:
> + break;
> + }
>
> /*
> * If @p has become the most eligible task, force preemption.
> */
> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
> goto preempt;
>
> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> + if (sched_feat(RUN_TO_PARITY))
> update_protect_slice(cfs_rq, se);
>
> return;
>
> preempt:
> - if (do_preempt_short)
> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
> cancel_protect_slice(se);
>
> resched_curr_lazy(rq);
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-05 21:48 ` Madadi Vineeth Reddy
@ 2025-11-07 8:53 ` Mel Gorman
0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2025-11-07 8:53 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, linux-kernel
On Thu, Nov 06, 2025 at 03:18:30AM +0530, Madadi Vineeth Reddy wrote:
> Hi Mel,
>
> On 03/11/25 16:34, Mel Gorman wrote:
> > Reimplement NEXT_BUDDY preemption to take into account the deadline and
> > eligibility of the wakee with respect to the waker. In the event
> > multiple buddies could be considered, the one with the earliest deadline
> > is selected.
> >
> > Sync wakeups are treated differently to every other type of wakeup. The
> > WF_SYNC assumption is that the waker promises to sleep in the very near
> > future. This is violated in enough cases that WF_SYNC should be treated
> > as a suggestion instead of a contract. If a waker does go to sleep almost
> > immediately then the delay in wakeup is negligible. In all other cases,
> > it's throttled based on the accumulated runtime of the waker so there is
> > a chance that some batched wakeups have been issued before preemption.
>
> [..snip..]
>
> > +static inline enum preempt_wakeup_action
> > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> > + struct sched_entity *pse, struct sched_entity *se)
> > +{
> > + bool pse_before;
> > +
> > + /*
> > + * Ignore wakee preemption on WF_FORK as it is less likely that
> > + * there is shared data as exec often follow fork. Do not
> > + * preempt for tasks that are sched_delayed as it would violate
> > + * EEVDF to forcibly queue an ineligible task.
> > + */
> > + if ((wake_flags & WF_FORK) || pse->sched_delayed)
> > + return PREEMPT_WAKEUP_NONE;
> > +
> > + /* Reschedule if waker is no longer eligible. */
> > + if (in_task() && !entity_eligible(cfs_rq, se))
> > + return PREEMPT_WAKEUP_RESCHED;
> > +
> > + /*
> > + * Keep existing buddy if the deadline is sooner than pse.
> > + * The older buddy may be cache cold and completely unrelated
> > + * to the current wakeup but that is unpredictable where as
> > + * obeying the deadline is more in line with EEVDF objectives.
> > + */
> > + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > + return PREEMPT_WAKEUP_NEXT;
>
> IIUC, the logic attempts to maintain deadline ordering among buddies, but
> this doesn't address tasks already on the runqueue.
>
It's addressed in that a buddy is only selected if it is eligible to
run. Buddies in this context are receiving preferential treatment in
terms of ordering but the other tasks on the queue should not be
starved either.
> So with frequent wakeups, queued tasks (even with earlier deadlines) may be
> unfairly delayed. I understand that this would fade away quickly as the
> woken up task that got to run due to buddy preference would accumulate negative
> lag and would not be eligible to run again, but the starvation could be higher if
> wakeups are very high.
>
They shouldn't get starved as such, only delayed as the buddies become
eligible while other tasks on the runqueue have positive lag.
> To test this, I ran schbench (many message and worker threads) together with
> stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
> around 64%.
>
Stress-NG bogo-ops are, by definition, bogus ops. The amount of work
executed depends on the stressor and the timing of when they execute.
Hence, a drop of 64% may or may not matter to the general case because the
drop may be due to a different mix of "operations", some of which may take
1ms and others that take a minute but are both "1 operation".
> This shows a significant regression for CPU-bound tasks under heavy wakeup loads.
> Thoughts?
I 100% accept that NEXT_BUDDY can affect timings of workloads but stress-ng
is not the best workload to use for performance testing at all because
the bogo-ops metric is, by definition, bogus. There may be a good reason
to revisit whether PICK_BUDDY should have been moved to __pick_eevdf or if
PICK_BUDDY should be applied if the slice is not protected, but stress-ng
is a terrible workload to justify a decision either way.
> I also ran schbench and hackbench. All these were run on a Power11 System
> containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.
>
> schbench(new) 99.0th latency (lower is better)
> ========
> load baseline[pct imp](std%) With patch[pct imp]( std%)
> 20mt, 10wt 1.00 [ 0.00]( 0.24) 0.97 [ +3.00]( 0.18)
> 20mt, 20wt 1.00 [ 0.00]( 0.33) 1.00 [ 0.00]( 0.12)
> 20mt, 40wt 1.00 [ 0.00]( 2.84) 0.76 [ +24.0]( 0.32)
> 20mt, 80wt 1.00 [ 0.00]( 3.66) 0.66 [ +34.0]( 0.72)
> 20mt, 160wt 1.00 [ 0.00](12.92) 0.88 [ +12.0]( 6.77)
>
> mt=message threads ; wt=worker threads
>
> schbench being a wakeup sensitive workload showed good improvement.
>
Good news because NEXT_BUDDY is primarily about prioritising an eligible
wakee over another eligible task to preserve hotness.
>
> hackbench (lower is better)
> ========
> case load baseline[pct imp](std%) With patch[pct imp]( std%)
> process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50)
> process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27)
> process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78)
> process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86)
> process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03)
> process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67)
> process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64)
>
> process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00)
> process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21)
> process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26)
> process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63)
> process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48)
> process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36)
> process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21)
>
> Didn't notice significant improvement or regression in Hackbench. Mostly in the noise
> range.
>
Expected for hackbench because the degree of overload is so generally
high and cache hotness has limited benefit for it as so little data is
shared.
--
Mel Gorman
SUSE Labs
* [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-12 12:25 Mel Gorman
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman
Changes since v4
o Split out decisions into separate functions (peterz)
o Flow clarity (peterz)
Changes since v3
o Place new code near first consumer (peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour, which may explain why some recent discussions have had
reproduction problems. Nevertheless, the removal of LAST_BUDDY and the
disabling of NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
Mel Gorman (2):
sched/fair: Enable scheduler feature NEXT_BUDDY
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 131 insertions(+), 23 deletions(-)
--
2.51.0
* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
@ 2025-11-12 12:25 ` Mel Gorman
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2026-01-08 10:01 ` [REGRESSION] [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Madadi Vineeth Reddy
2 siblings, 2 replies; 44+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman
The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.
NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear, but vruntime
spread is mentioned, so the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's
goals. Peter Zijlstra noted during review:
I think I was just struggling to make sense of things and figured
less is more and axed it.
I have vague memories trying to work through the dynamics of
a wakeup-stack and the EEVDF latency requirements and getting
a head-ache.
NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.
Enable NEXT_BUDDY as a preparation patch to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various
--
2.51.0
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
@ 2025-11-12 14:48 ` Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
2 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-11-12 14:48 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel
On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
> + /* Prefer picking wakee soon if appropriate. */
> + if (sched_feat(NEXT_BUDDY) &&
> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> +
> + /*
> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
> + * buddies are ignored as they may not be relevant to the
> + * waker and less likely to be cache hot.
> + */
> + if (wake_flags & WF_SYNC)
> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
> + }
Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
depend on buddies.
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
@ 2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
1 sibling, 0 replies; 44+ messages in thread
From: Madadi Vineeth Reddy @ 2025-11-13 8:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
Valentin Schneider, Chris Mason, linux-kernel,
Madadi Vineeth Reddy
Hi Peter,
On 12/11/25 20:18, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
>
>> + /* Prefer picking wakee soon if appropriate. */
>> + if (sched_feat(NEXT_BUDDY) &&
>> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> + /*
>> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> + * buddies are ignored as they may not be relevant to the
>> + * waker and less likely to be cache hot.
>> + */
>> + if (wake_flags & WF_SYNC)
>> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> + }
>
> Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> depend on buddies.
IIUC, calling preempt_sync() without NEXT_BUDDY would force a reschedule after the threshold,
but no buddy would be set. This means pick_eevdf() would run with normal EEVDF deadline selection,
potentially picking a different task instead of the wakee. The forced reschedule would accomplish
nothing for the wakee.
That said, I see your point that the WF_SYNC threshold check could still reduce context switch
overhead even without guaranteeing wakee selection.
Thanks,
Madadi Vineeth Reddy
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
@ 2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:13 ` Peter Zijlstra
1 sibling, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2025-11-13 9:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel
On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
>
> > + /* Prefer picking wakee soon if appropriate. */
> > + if (sched_feat(NEXT_BUDDY) &&
> > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > +
> > + /*
> > + * Decide whether to obey WF_SYNC hint for a new buddy. Old
> > + * buddies are ignored as they may not be relevant to the
> > + * waker and less likely to be cache hot.
> > + */
> > + if (wake_flags & WF_SYNC)
> > + preempt_action = preempt_sync(rq, wake_flags, pse, se);
> > + }
>
> Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> depend on buddies.
There isn't a direct relation, but there is an indirect one. I know from
your previous review that you separated out the WF_SYNC but after a while,
I did not find a good reason to separate it completely from NEXT_BUDDY.
NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker
relationship between two tasks and potentially share data that may still
be cache resident after a context switch. WF_SYNC indicates there may be
a strict relationship between those two tasks that the waker may need the
wakee to do some work before it can make progress. If NEXT_BUDDY does not
set cfs_rq->next in the current waking context then the wakee may only be
picked next by coincidence under normal EEVDF rules.
WF_SYNC could still reschedule if the wakee is not selected as a buddy but
the benefit, if any, would be marginal -- if the waker does not go to sleep
then WF_SYNC contract is violated and if the data becomes cache cold after
a wakeup delay then the shared data may already be evicted from cache.
With NEXT_BUDDY, there is a chance that the cost of a reschedule and/or
a context switch will be offset by reduced overall latency (e.g. fewer
cache misses). Without NEXT_BUDDY, WF_SYNC may only incur costs due to
context switching.
I considered the possibility of WF_SYNC being applied if pse is already a
buddy due to yield or some other factor but there is no reason to assume
any shared data is still cache resident and it's not easy to reason about. I
considered applying WF_SYNC if pse was already set and using it as a two-pass
filter but again, there is no obvious benefit or reason why the second wakeup
is more important than the first wakeup. I considered WF_SYNC being applied
if any buddy is set but it's not clear why a SYNC wakeup between tasks A,B
should instead pick C to run ASAP outside of the normal EEVDF rules.
I think it's straight-forward if the logic is
o If NEXT_BUDDY sets the wakee becomes cfs_rq->next then
schedule the wakee soon
o If the wakee is to be selected soon and WF_SYNC is also set then
pick the wakee ASAP
but less straight-forward if
o If WF_SYNC is set, reschedule now and maybe the wakee will be
picked, maybe the waker will run again, maybe something else
will run and sometimes it'll be a gain overall.
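The preferred ordering above can be condensed into a toy decision function.
All names and values below are hypothetical stand-ins that only mirror the
bullets; this is not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value, standing in for the real WF_SYNC bit. */
#define WF_SYNC 0x1

enum preempt_action { PREEMPT_NONE, PREEMPT_SOON, PREEMPT_NOW };

enum preempt_action decide(bool next_buddy_set, int wake_flags)
{
	/* No buddy: the wakee is picked only by normal EEVDF rules. */
	if (!next_buddy_set)
		return PREEMPT_NONE;

	/* Buddy plus WF_SYNC: the waker promises to sleep, pick ASAP. */
	if (wake_flags & WF_SYNC)
		return PREEMPT_NOW;

	/* Buddy without WF_SYNC: schedule the wakee soon. */
	return PREEMPT_SOON;
}
```

The point is that WF_SYNC only upgrades an already-accepted buddy rather
than acting independently of NEXT_BUDDY.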
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-13 9:04 ` Mel Gorman
@ 2025-11-14 12:13 ` Peter Zijlstra
0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-11-14 12:13 UTC (permalink / raw)
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel
On Thu, Nov 13, 2025 at 09:04:38AM +0000, Mel Gorman wrote:
> On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
> >
> > > + /* Prefer picking wakee soon if appropriate. */
> > > + if (sched_feat(NEXT_BUDDY) &&
> > > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > > +
> > > + /*
> > > + * Decide whether to obey WF_SYNC hint for a new buddy. Old
> > > + * buddies are ignored as they may not be relevant to the
> > > + * waker and less likely to be cache hot.
> > > + */
> > > + if (wake_flags & WF_SYNC)
> > > + preempt_action = preempt_sync(rq, wake_flags, pse, se);
> > > + }
> >
> > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> > depend on buddies.
>
> There isn't a direct relation, but there is an indirect one. I know from
> your previous review that you separated out the WF_SYNC but after a while,
> I did not find a good reason to separate it completely from NEXT_BUDDY.
>
> NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker
> relationship between two tasks and that they potentially share data that may
> still be cache resident after a context switch. WF_SYNC indicates there may be
> a stricter relationship between those two tasks, where the waker may need the
> wakee to do some work before it can make progress. If NEXT_BUDDY does not
> set cfs_rq->next in the current waking context then the wakee may only be
> picked next by coincidence under normal EEVDF rules.
Aah, fair enough. Perhaps the comment could've been clearer but
whatever.
* [tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
@ 2025-11-14 12:19 ` tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
2 siblings, 0 replies; 44+ messages in thread
From: tip-bot2 for Mel Gorman @ 2025-11-14 12:19 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 523afd46059d80351189630355ab7e6eccd6451e
Gitweb: https://git.kernel.org/tip/523afd46059d80351189630355ab7e6eccd6451e
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Wed, 12 Nov 2025 12:25:21
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:07 +01:00
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
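The sync-wakeup throttle described above can be sketched as follows. The
constant and names are hypothetical stand-ins (the patch itself keys off
sysctl_sched_migration_cost, WF_SYNC and WF_RQ_SELECTED), not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long u64;

/* Hypothetical stand-in for sysctl_sched_migration_cost, in ns. */
#define MIGRATION_COST_NS 500000ULL

/*
 * Allow a sync wakeup to preempt only after the waker has accumulated
 * some runtime, giving batched wakeups a chance to be issued first.
 * A waker/wakee pair stacked on one CPU uses a quarter of the threshold
 * as it is more likely the waker needs the wakee to finish.
 */
bool sync_preempt_allowed(u64 waker_runtime_ns, bool stacked)
{
	u64 threshold = MIGRATION_COST_NS;

	if (stacked)
		threshold >>= 2;

	return waker_runtime_ns >= threshold;
}
```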
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; even though it is a poor metric, it
is the default one reported. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
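For reference, the Hmean rows above are harmonic means, the conventional
summary for throughput (rate) data because low outliers pull it down more
than an arithmetic mean would; a minimal sketch of the computation:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Harmonic mean of n positive samples, as used for the throughput rows. */
double hmean(const double *v, size_t n)
{
	double inv_sum = 0.0;

	for (size_t i = 0; i < n; i++)
		inv_sum += 1.0 / v[i];

	return (double)n / inv_sum;
}
```

The Amean rows in later tables are plain arithmetic means, used for
completion times rather than rates.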
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time duration in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
Processes using pipes are impacted but the variance (not presented) indicates
it's close to noise and the results are not always reproducible. If executed
across multiple reboots, it may show neutral or small gains so the worst
measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
---
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 130 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 320eedd..11d480e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
return delta_exec;
}
+static void set_next_buddy(struct sched_entity *se);
+
/*
* Used by other classes to account runtime.
*/
@@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
}
}
+enum preempt_wakeup_action {
+ PREEMPT_WAKEUP_NONE, /* No preemption. */
+ PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
+ PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
+ PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
+};
+
+static inline bool
+set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ /*
+ * Keep existing buddy if the deadline is sooner than pse.
+ * The older buddy may be cache cold and completely unrelated
+ * to the current wakeup but that is unpredictable whereas
+ * obeying the deadline is more in line with EEVDF objectives.
+ */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return false;
+
+ set_next_buddy(pse);
+ return true;
+}
+
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ u64 threshold, delta;
+
+ /*
+ * WF_SYNC without WF_TTWU is not expected so warn if it happens even
+ * though it is likely harmless.
+ */
+ WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+ threshold = sysctl_sched_migration_cost;
+ delta = rq_clock_task(rq) - se->exec_start;
+ if ((s64)delta < 0)
+ delta = 0;
+
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
+ * could run on other CPUs. Reduce the threshold before preemption is
+ * allowed to an arbitrary lower value as it is more likely (but not
+ * guaranteed) the waker requires the wakee to finish.
+ */
+ if (wake_flags & WF_RQ_SELECTED)
+ threshold >>= 2;
+
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for batch
+ * wakeups to be issued.
+ */
+ if (entity_before(pse, se) && delta >= threshold)
+ return PREEMPT_WAKEUP_RESCHED;
+
+ return PREEMPT_WAKEUP_NONE;
+}
+
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
+ enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- bool do_preempt_short = false;
if (unlikely(se == pse))
return;
@@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (task_is_throttled(p))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = true;
+ preempt_action = PREEMPT_WAKEUP_SHORT;
goto preempt;
}
@@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* If @p has a shorter slice than current and @p is eligible, override
* current's slice protection in order to allow preemption.
*/
- do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+ preempt_action = PREEMPT_WAKEUP_SHORT;
+ goto pick;
+ }
/*
+ * Ignore wakee preemption on WF_FORK as it is less likely that
+ * there is shared data as exec often follow fork. Do not
+ * preempt for tasks that are sched_delayed as it would violate
+ * EEVDF to forcibly queue an ineligible task.
+ */
+ if ((wake_flags & WF_FORK) || pse->sched_delayed)
+ return;
+
+ /*
+ * If @p potentially is completing work required by current then
+ * consider preemption.
+ *
+ * Reschedule if waker is no longer eligible. */
+ if (in_task() && !entity_eligible(cfs_rq, se)) {
+ preempt_action = PREEMPT_WAKEUP_RESCHED;
+ goto preempt;
+ }
+
+ /* Prefer picking wakee soon if appropriate. */
+ if (sched_feat(NEXT_BUDDY) &&
+ set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
+
+ /*
+ * Decide whether to obey WF_SYNC hint for a new buddy. Old
+ * buddies are ignored as they may not be relevant to the
+ * waker and less likely to be cache hot.
+ */
+ if (wake_flags & WF_SYNC)
+ preempt_action = preempt_sync(rq, wake_flags, pse, se);
+ }
+
+ switch (preempt_action) {
+ case PREEMPT_WAKEUP_NONE:
+ return;
+ case PREEMPT_WAKEUP_RESCHED:
+ goto preempt;
+ case PREEMPT_WAKEUP_SHORT:
+ fallthrough;
+ case PREEMPT_WAKEUP_PICK:
+ break;
+ }
+
+pick:
+ /*
* If @p has become the most eligible task, force preemption.
*/
- if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+ if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+ if (sched_feat(RUN_TO_PARITY))
update_protect_slice(cfs_rq, se);
return;
preempt:
- if (do_preempt_short)
+ if (preempt_action == PREEMPT_WAKEUP_SHORT)
cancel_protect_slice(se);
resched_curr_lazy(rq);
* [tip: sched/core] sched/fair: Enable scheduler feature NEXT_BUDDY
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
@ 2025-11-14 12:19 ` tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
1 sibling, 0 replies; 44+ messages in thread
From: tip-bot2 for Mel Gorman @ 2025-11-14 12:19 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8f839c9c55f2a034867b5d382950f6bc9acd1a3f
Gitweb: https://git.kernel.org/tip/8f839c9c55f2a034867b5d382950f6bc9acd1a3f
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Wed, 12 Nov 2025 12:25:20
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:06 +01:00
sched/fair: Enable scheduler feature NEXT_BUDDY
The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.
NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear but, as vruntime
spread is mentioned, the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed but it
is assumed that it's very difficult to reason about what LAST_BUDDY's correct
and effective behaviour should be while still respecting EEVDF's goals.
Peter Zijlstra noted during review;
I think I was just struggling to make sense of things and figured
less is more and axed it.
I have vague memories trying to work through the dynamics of
a wakeup-stack and the EEVDF latency requirements and getting
a head-ache.
NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker. Enable
NEXT_BUDDY as a preparation patch to document that the decision to ignore
the current implementation is deliberate. While not presented, the results
were at best neutral and often much more variable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f..0607def 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various
* [tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
@ 2025-11-17 16:23 ` tip-bot2 for Mel Gorman
2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts
2 siblings, 1 reply; 44+ messages in thread
From: tip-bot2 for Mel Gorman @ 2025-11-17 16:23 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291
Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Wed, 12 Nov 2025 12:25:21
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event
multiple buddies could be considered, the one with the earliest deadline
is selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.
For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.
First is the dbench throughput data; even though it is a poor metric, it
is the default one reported. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time duration in milliseconds.
dbench4 Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This
one is tricky because dbench makes no effort to synchronise clients so
the durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark
for a number of minutes but it's a matter of opinion whether that
counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
This is more of a mixed bag but it at least shows that fairness
is not crippled.
The hackbench results are more neutral but this is still important.
It's possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC
changes are not a separate patch.
hackbench-process-pipes
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v5
Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
Processes using pipes are impacted but the variance (not presented) indicates
it's close to noise and the results are not always reproducible. If executed
across multiple reboots, it may show neutral or small gains so the worst
measured results are presented.
Hackbench using sockets is more reliably neutral as the wakeup
mechanisms are different between sockets and pipes.
hackbench-process-sockets
6.18-rc1 6.18-rc1
vanilla sched-preemptnext-v2
Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
As schbench has been mentioned in numerous bugs recently, the results
are interesting. A test case that represents the default schbench
behaviour is
schbench Wakeup Latency (usec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
schbench Requests Per Second (ops/sec)
6.18.0-rc1 6.18.0-rc1
vanilla sched-preemptnext-v5
Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
---
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 130 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 071e07f..c6e5c64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
+ /*
+ * Picking the ->next buddy will affect latency but not fairness.
+ */
+ if (sched_feat(PICK_BUDDY) &&
+ cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ /* ->next will never be delayed */
+ WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+ return cfs_rq->next;
+ }
+
if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
curr = NULL;
@@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
return delta_exec;
}
+static void set_next_buddy(struct sched_entity *se);
+
/*
* Used by other classes to account runtime.
*/
@@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
- /*
- * Picking the ->next buddy will affect latency but not fairness.
- */
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
- /* ->next will never be delayed */
- WARN_ON_ONCE(cfs_rq->next->sched_delayed);
- return cfs_rq->next;
- }
-
se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
}
-static void set_next_buddy(struct sched_entity *se);
-
/*
* Basically dequeue_task_fair(), except it can deal with dequeue_entity()
* failing half-way through and resume the dequeue later.
@@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
}
}
+enum preempt_wakeup_action {
+ PREEMPT_WAKEUP_NONE, /* No preemption. */
+ PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
+ PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
+ PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
+};
+
+static inline bool
+set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ /*
+ * Keep existing buddy if the deadline is sooner than pse.
+ * The older buddy may be cache cold and completely unrelated
+ * to the current wakeup but that is unpredictable where as
+ * obeying the deadline is more in line with EEVDF objectives.
+ */
+ if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+ return false;
+
+ set_next_buddy(pse);
+ return true;
+}
+
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+ struct sched_entity *pse, struct sched_entity *se)
+{
+ u64 threshold, delta;
+
+ /*
+ * WF_SYNC without WF_TTWU is not expected so warn if it happens even
+ * though it is likely harmless.
+ */
+ WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+ threshold = sysctl_sched_migration_cost;
+ delta = rq_clock_task(rq) - se->exec_start;
+ if ((s64)delta < 0)
+ delta = 0;
+
+ /*
+ * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
+ * could run on other CPUs. Reduce the threshold before preemption is
+ * allowed to an arbitrary lower value as it is more likely (but not
+ * guaranteed) the waker requires the wakee to finish.
+ */
+ if (wake_flags & WF_RQ_SELECTED)
+ threshold >>= 2;
+
+ /*
+ * As WF_SYNC is not strictly obeyed, allow some runtime for batch
+ * wakeups to be issued.
+ */
+ if (entity_before(pse, se) && delta >= threshold)
+ return PREEMPT_WAKEUP_RESCHED;
+
+ return PREEMPT_WAKEUP_NONE;
+}
+
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
+ enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
- bool do_preempt_short = false;
if (unlikely(se == pse))
return;
@@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (task_is_throttled(p))
return;
- if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
- set_next_buddy(pse);
- }
-
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* When non-idle entity preempt an idle entity,
* don't give idle entity slice protection.
*/
- do_preempt_short = true;
+ preempt_action = PREEMPT_WAKEUP_SHORT;
goto preempt;
}
@@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* If @p has a shorter slice than current and @p is eligible, override
* current's slice protection in order to allow preemption.
*/
- do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
+ preempt_action = PREEMPT_WAKEUP_SHORT;
+ goto pick;
+ }
/*
+ * Ignore wakee preemption on WF_FORK as it is less likely that
+ * there is shared data as exec often follow fork. Do not
+ * preempt for tasks that are sched_delayed as it would violate
+ * EEVDF to forcibly queue an ineligible task.
+ */
+ if ((wake_flags & WF_FORK) || pse->sched_delayed)
+ return;
+
+ /*
+ * If @p potentially is completing work required by current then
+ * consider preemption.
+ *
+ * Reschedule if waker is no longer eligible. */
+ if (in_task() && !entity_eligible(cfs_rq, se)) {
+ preempt_action = PREEMPT_WAKEUP_RESCHED;
+ goto preempt;
+ }
+
+ /* Prefer picking wakee soon if appropriate. */
+ if (sched_feat(NEXT_BUDDY) &&
+ set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
+
+ /*
+ * Decide whether to obey WF_SYNC hint for a new buddy. Old
+ * buddies are ignored as they may not be relevant to the
+ * waker and less likely to be cache hot.
+ */
+ if (wake_flags & WF_SYNC)
+ preempt_action = preempt_sync(rq, wake_flags, pse, se);
+ }
+
+ switch (preempt_action) {
+ case PREEMPT_WAKEUP_NONE:
+ return;
+ case PREEMPT_WAKEUP_RESCHED:
+ goto preempt;
+ case PREEMPT_WAKEUP_SHORT:
+ fallthrough;
+ case PREEMPT_WAKEUP_PICK:
+ break;
+ }
+
+pick:
+ /*
* If @p has become the most eligible task, force preemption.
*/
- if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
+ if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
goto preempt;
- if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
+ if (sched_feat(RUN_TO_PARITY))
update_protect_slice(cfs_rq, se);
return;
preempt:
- if (do_preempt_short)
+ if (preempt_action == PREEMPT_WAKEUP_SHORT)
cancel_protect_slice(se);
resched_curr_lazy(rq);
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [tip: sched/core] sched/fair: Enable scheduler feature NEXT_BUDDY
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
@ 2025-11-17 16:23 ` tip-bot2 for Mel Gorman
1 sibling, 0 replies; 44+ messages in thread
From: tip-bot2 for Mel Gorman @ 2025-11-17 16:23 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: aceccac58ad76305d147165788ea6b939bef179b
Gitweb: https://git.kernel.org/tip/aceccac58ad76305d147165788ea6b939bef179b
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Wed, 12 Nov 2025 12:25:20
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
sched/fair: Enable scheduler feature NEXT_BUDDY
The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data but also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get
good results.
NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465
("sched/fair: Commit to EEVDF"). The reasoning is not clear, but as vruntime
spread is mentioned, the expectation is that NEXT_BUDDY had an impact
on overall fairness. It was not noted why LAST_BUDDY was removed, but it
is assumed that it's very difficult to reason about what LAST_BUDDY's
correct and effective behaviour should be while still respecting EEVDF's goals.
Peter Zijlstra noted during review:
I think I was just struggling to make sense of things and figured
less is more and axed it.
I have vague memories trying to work through the dynamics of
a wakeup-stack and the EEVDF latency requirements and getting
a head-ache.
NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker. Enable
NEXT_BUDDY as a preparation patch to document that the decision to ignore
the current implementation is deliberate. While not presented, the results
were at best neutral and often much more variable.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f..0607def 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various
* [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
@ 2025-12-22 10:57 ` Ryan Roberts
2026-01-02 12:38 ` Ryan Roberts
0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2025-12-22 10:57 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra (Intel); +Cc: x86, linux-kernel, Aishwarya TCV
Hi Mel, Peter,
We are building out a kernel performance regression monitoring lab at Arm, and
I've noticed some fairly large performance regressions in real-world workloads,
for which bisection has fingered this patch.
We are looking at performance changes between v6.18 and v6.19-rc1, and by
reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
to move the testing to linux-next over the next couple of quarters so hopefully
we will be able to deliver this sort of news prior to merging in future).
All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
statistically significant regression/improvement, where "statistically
significant" means the 95% confidence intervals do not overlap.
The below is a large scale mysql workload, running across 2 AWS instances (a
load generator and the mysql server). We have a partner for whom this is a very
important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1
(where the patch is added). By reverting the patch, the regression is not only
fixed by performance is now nearly 6% better than v6.18:
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% |
| | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
Next are a bunch of benchmarks all running on a single system. specjbb is the
SPEC Java Business Benchmark. The mysql one is the same as above but this time
both loadgen and server are on the same system. pgbench is the PostgreSQL
benchmark.
I'm showing hackbench for completeness, but I don't consider it a high priority
issue.
Interestingly, nginx improves significantly with the patch.
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
+=================================+====================================================+=================+==============+===================+
| specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% |
| | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% |
| | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% |
| | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% |
| | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% |
| | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% |
| | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% |
| | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% |
| | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% |
| | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% |
| | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% |
| | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% |
| | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% |
| | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% |
| | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% |
| | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% |
| | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% |
| | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% |
| | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% |
| | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% |
| | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% |
| | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% |
| | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% |
| | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% |
| | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% |
| | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% |
| | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% |
| | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% |
| | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% |
| | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% |
| | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% |
| | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% |
| | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% |
| | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% |
| | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% |
| | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% |
| | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% |
| | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% |
| | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% |
| | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% |
| | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% |
| | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% |
| | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% |
| | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% |
| | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% |
| | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% |
| | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% |
| | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% |
| | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% |
| | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% |
| | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% |
| | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% |
| | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% |
| | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% |
| | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% |
| | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% |
| | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% |
| | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% |
| | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% |
| | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% |
| | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% |
| | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% |
| | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% |
| | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% |
| | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% |
| | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% |
| | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% |
| | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% |
| | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% |
| | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% |
| | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% |
| | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% |
| | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% |
| | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% |
| | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% |
| | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% |
| | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% |
| | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% |
| | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% |
| | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% |
| | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
| pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% |
| | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% |
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
All of the benchmarks have been run multiple times and I have high confidence in
the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.
I'm not providing the data, but we also see similar regressions on AmpereOne
(another arm64 server system). And we have seen a few functional tests (kvm
selftests) that have started to timeout due to this patch slowing things down on
arm64.
I'm hoping you can advise on the best way to proceed? We have a bigger library
than what I'm showing, but the only improvement I see due to this patch is
nginx. So based on that, my preference would be to revert the patch upstream
until the issues can be worked out. I'm guessing the story is quite different
for x86 though?
Thanks,
Ryan
On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291
> Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
> Author: Mel Gorman <mgorman@techsingularity.net>
> AuthorDate: Wed, 12 Nov 2025 12:25:21
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
>
> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>
> Reimplement NEXT_BUDDY preemption to take into account the deadline and
> eligibility of the wakee with respect to the waker. In the event
> multiple buddies could be considered, the one with the earliest deadline
> is selected.
>
> Sync wakeups are treated differently to every other type of wakeup. The
> WF_SYNC assumption is that the waker promises to sleep in the very near
> future. This is violated in enough cases that WF_SYNC should be treated
> as a suggestion instead of a contract. If a waker does go to sleep almost
> immediately then the delay in wakeup is negligible. In other cases, it's
> throttled based on the accumulated runtime of the waker so there is a
> chance that some batched wakeups have been issued before preemption.
>
> For all other wakeups, preemption happens if the wakee has an earlier
> deadline than the waker and is eligible to run.
>
> While many workloads were tested, the two main targets were a modified
> dbench4 benchmark and hackbench because they are on opposite ends of the
> spectrum -- one prefers throughput by avoiding preemption and the other
> relies on preemption.
>
> First is the dbench throughput data; even though it is a poor metric, it
> is the default one reported. The test machine is a 2-socket machine and the
> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
> threads. It's important to note that these results are not representative
> across all machines, especially Zen machines, as different bottlenecks
> are exposed on different machines and filesystems.
>
> dbench4 Throughput (misleading but traditional)
> 6.18-rc1 6.18-rc1
> vanilla sched-preemptnext-v5
> Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
> Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
> Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
> Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
> Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
> Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
> Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
> Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
> Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
> Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
> Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
>
> As throughput is misleading, the benchmark is modified to use a short
> loadfile and report the completion time in milliseconds.
>
> dbench4 Loadfile Execution Time
> 6.18-rc1 6.18-rc1
> vanilla sched-preemptnext-v5
> Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
> Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
> Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
> Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
> Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
> Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
> Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
> Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
> Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
> Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
> Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
> Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
> Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
> Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
> Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
> Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
> Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
> Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
> Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
> Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
> Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
> Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
>
> That is still looking good and the variance is reduced quite a bit.
> Finally, fairness is a concern so the next report tracks how many
> milliseconds does it take for all clients to complete a workfile. This
> one is tricky because dbench makes to effort to synchronise clients so
> the durations at benchmark start time differ substantially from typical
> runtimes. This problem could be mitigated by warming up the benchmark
> for a number of minutes but it's a matter of opinion whether that
> counts as an evasion of inconvenient results.
>
> dbench4 All Clients Loadfile Execution Time
> 6.18-rc1 6.18-rc1
> vanilla sched-preemptnext-v5
> Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
> Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
> Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
> Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
> Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
> Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
> Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
> Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
> Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
> Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
> Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
> Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
> Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
> Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
> Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
> Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
> Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
> Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
> Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
> Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
> Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
> Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
>
> This is more of a mixed bag but it at least shows that fairness
> is not crippled.
>
> The hackbench results are more neutral but this is still important.
> It's possible to boost the dbench figures by a large amount but only by
> crippling the performance of a workload like hackbench. The WF_SYNC
> behaviour is important for these workloads and is why the WF_SYNC
> changes are not a separate patch.
>
> hackbench-process-pipes
> 6.18-rc1 6.18-rc1
> vanilla sched-preemptnext-v5
> Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
> Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
> Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
> Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
> Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
> Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
> Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
> Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
> Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
> Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
> Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
> Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
> Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
> Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
> Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
>
> Processes using pipes are impacted but the variance (not presented) indicates
> it's close to noise and the results are not always reproducible. If executed
> across multiple reboots, it may show neutral or small gains so the worst
> measured results are presented.
>
> Hackbench using sockets is more reliably neutral as the wakeup
> mechanisms are different between sockets and pipes.
>
> hackbench-process-sockets
> 6.18-rc1 6.18-rc1
> vanilla sched-preemptnext-v2
> Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
> Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
> Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
> Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
> Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
> Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
> Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
> Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
> Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
> Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
> Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
> Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
> Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
> Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
> Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
>
> As schbench has been mentioned in numerous bugs recently, the results
> are interesting. A test case that represents the default schbench
> behaviour is
>
> schbench Wakeup Latency (usec)
> 6.18.0-rc1 6.18.0-rc1
> vanilla sched-preemptnext-v5
> Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
> Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
> Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
> Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
>
> schbench Requests Per Second (ops/sec)
> 6.18.0-rc1 6.18.0-rc1
> vanilla sched-preemptnext-v5
> Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
> Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
> Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
> Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
> ---
> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 130 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 071e07f..c6e5c64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
> if (cfs_rq->nr_queued == 1)
> return curr && curr->on_rq ? curr : se;
>
> + /*
> + * Picking the ->next buddy will affect latency but not fairness.
> + */
> + if (sched_feat(PICK_BUDDY) &&
> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> + /* ->next will never be delayed */
> + WARN_ON_ONCE(cfs_rq->next->sched_delayed);
> + return cfs_rq->next;
> + }
> +
> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> curr = NULL;
>
> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> return delta_exec;
> }
>
> +static void set_next_buddy(struct sched_entity *se);
> +
> /*
> * Used by other classes to account runtime.
> */
> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> {
> struct sched_entity *se;
>
> - /*
> - * Picking the ->next buddy will affect latency but not fairness.
> - */
> - if (sched_feat(PICK_BUDDY) &&
> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
> - /* ->next will never be delayed */
> - WARN_ON_ONCE(cfs_rq->next->sched_delayed);
> - return cfs_rq->next;
> - }
> -
> se = pick_eevdf(cfs_rq);
> if (se->sched_delayed) {
> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> hrtick_update(rq);
> }
>
> -static void set_next_buddy(struct sched_entity *se);
> -
> /*
> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> * failing half-way through and resume the dequeue later.
> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
> }
> }
>
> +enum preempt_wakeup_action {
> + PREEMPT_WAKEUP_NONE, /* No preemption. */
> + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
> +};
> +
> +static inline bool
> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The older buddy may be cache cold and completely unrelated
> + * to the current wakeup but that is unpredictable where as
> + * obeying the deadline is more in line with EEVDF objectives.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return false;
> +
> + set_next_buddy(pse);
> + return true;
> +}
> +
> +/*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> +static inline enum preempt_wakeup_action
> +preempt_sync(struct rq *rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + u64 threshold, delta;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even
> + * though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
> +
> + threshold = sysctl_sched_migration_cost;
> + delta = rq_clock_task(rq) - se->exec_start;
> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
> + * could run on other CPUs. Reduce the threshold before preemption is
> + * allowed to an arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch
> + * wakeups to be issued.
> + */
> + if (entity_before(pse, se) && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> +}
> +
> /*
> * Preempt the current task with a newly woken task if needed:
> */
> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> struct task_struct *donor = rq->donor;
> struct sched_entity *se = &donor->se, *pse = &p->se;
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
> - bool do_preempt_short = false;
>
> if (unlikely(se == pse))
> return;
> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> if (task_is_throttled(p))
> return;
>
> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
> - set_next_buddy(pse);
> - }
> -
> /*
> * We can come here with TIF_NEED_RESCHED already set from new task
> * wake up path.
> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * When non-idle entity preempt an idle entity,
> * don't give idle entity slice protection.
> */
> - do_preempt_short = true;
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> goto preempt;
> }
>
> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> * If @p has a shorter slice than current and @p is eligible, override
> * current's slice protection in order to allow preemption.
> */
> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
> + preempt_action = PREEMPT_WAKEUP_SHORT;
> + goto pick;
> + }
>
> /*
> + * Ignore wakee preemption on WF_FORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
> + return;
> +
> + /*
> + * If @p potentially is completing work required by current then
> + * consider preemption.
> + *
> + * Reschedule if waker is no longer eligible. */
> + if (in_task() && !entity_eligible(cfs_rq, se)) {
> + preempt_action = PREEMPT_WAKEUP_RESCHED;
> + goto preempt;
> + }
> +
> + /* Prefer picking wakee soon if appropriate. */
> + if (sched_feat(NEXT_BUDDY) &&
> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> +
> + /*
> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
> + * buddies are ignored as they may not be relevant to the
> + * waker and less likely to be cache hot.
> + */
> + if (wake_flags & WF_SYNC)
> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
> + }
> +
> + switch (preempt_action) {
> + case PREEMPT_WAKEUP_NONE:
> + return;
> + case PREEMPT_WAKEUP_RESCHED:
> + goto preempt;
> + case PREEMPT_WAKEUP_SHORT:
> + fallthrough;
> + case PREEMPT_WAKEUP_PICK:
> + break;
> + }
> +
> +pick:
> + /*
> * If @p has become the most eligible task, force preemption.
> */
> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
> goto preempt;
>
> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> + if (sched_feat(RUN_TO_PARITY))
> update_protect_slice(cfs_rq, se);
>
> return;
>
> preempt:
> - if (do_preempt_short)
> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
> cancel_protect_slice(se);
>
> resched_curr_lazy(rq);
>
^ permalink raw reply [flat|nested] 44+ messages in thread

* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts
@ 2026-01-02 12:38 ` Ryan Roberts
2026-01-02 15:52 ` Dietmar Eggemann
0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2026-01-02 12:38 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra (Intel); +Cc: x86, linux-kernel, Aishwarya TCV
Hi, I appreciate I sent this report just before Xmas so most likely you haven't
had a chance to look, but wanted to bring it back to the top of your mailbox in
case it was missed.
Happy new year!
Thanks,
Ryan
On 22/12/2025 10:57, Ryan Roberts wrote:
> Hi Mel, Peter,
>
> We are building out a kernel performance regression monitoring lab at Arm, and
> I've noticed some fairly large performance regressions in real-world workloads,
> for which bisection has fingered this patch.
>
> We are looking at performance changes between v6.18 and v6.19-rc1, and by
> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
> to move the testing to linux-next over the next couple of quarters so hopefully
> we will be able to deliver this sort of news prior to merging in future).
>
> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> statistically significant regression/improvement, where "statistically
> significant" means the 95% confidence intervals do not overlap.
>
> The below is a large scale mysql workload, running across 2 AWS instances (a
> load generator and the mysql server). We have a partner for whom this is a very
> important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1
> (where the patch is added). By reverting the patch, the regression is not only
> fixed but performance is now nearly 6% better than v6.18:
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% |
> | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
>
> Next are a bunch of benchmarks all running on a single system. specjbb is the
> SPEC Java Business Benchmark. The mysql one is the same as above but this time
> both loadgen and server are on the same system. pgbench is the PostgreSQL
> benchmark.
>
> I'm showing hackbench for completeness, but I don't consider it a high priority
> issue.
>
> Interestingly, nginx improves significantly with the patch.
>
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy |
> +=================================+====================================================+=================+==============+===================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% |
> | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% |
> | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% |
> | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% |
> | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% |
> | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% |
> | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% |
> | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% |
> | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% |
> | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% |
> | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% |
> | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% |
> | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% |
> | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% |
> | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% |
> | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% |
> | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% |
> | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% |
> | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% |
> | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% |
> | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% |
> | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% |
> | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% |
> | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% |
> | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% |
> | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% |
> | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% |
> | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% |
> | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% |
> | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% |
> | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% |
> | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% |
> | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% |
> | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% |
> | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% |
> | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% |
> | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% |
> | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% |
> | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% |
> | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% |
> | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% |
> | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% |
> | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% |
> | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% |
> | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% |
> | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% |
> | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% |
> | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% |
> | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% |
> | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% |
> | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% |
> | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% |
> | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% |
> | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% |
> | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% |
> +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+
>
> All of the benchmarks have been run multiple times and I have high confidence in
> the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though.
>
> I'm not providing the data, but we also see similar regressions on AmpereOne
> (another arm64 server system). And we have seen a few functional tests (kvm
> selftests) that have started to timeout due to this patch slowing things down on
> arm64.
>
> I'm hoping you can advise on the best way to proceed? We have a bigger library
> than what I'm showing, but the only improvement I see due to this patch is
> nginx. So based on that, my preference would be to revert the patch upstream
> until the issues can be worked out. I'm guessing the story is quite different
> for x86 though?
>
> Thanks,
> Ryan
>
>
>
> On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291
>> Author: Mel Gorman <mgorman@techsingularity.net>
>> AuthorDate: Wed, 12 Nov 2025 12:25:21
>> Committer: Peter Zijlstra <peterz@infradead.org>
>> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00
>>
>> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>>
>> Reimplement NEXT_BUDDY preemption to take into account the deadline and
>> eligibility of the wakee with respect to the waker. In the event
>> multiple buddies could be considered, the one with the earliest deadline
>> is selected.
>>
>> Sync wakeups are treated differently to every other type of wakeup. The
>> WF_SYNC assumption is that the waker promises to sleep in the very near
>> future. This is violated in enough cases that WF_SYNC should be treated
>> as a suggestion instead of a contract. If a waker does go to sleep almost
>> immediately then the delay in wakeup is negligible. In other cases, it's
>> throttled based on the accumulated runtime of the waker so there is a
>> chance that some batched wakeups have been issued before preemption.
>>
>> For all other wakeups, preemption happens if the wakee has an earlier
>> deadline than the waker and is eligible to run.
>>
>> While many workloads were tested, the two main targets were a modified
>> dbench4 benchmark and hackbench because they are on opposite ends of the
>> spectrum -- one prefers throughput by avoiding preemption and the other
>> relies on preemption.
>>
>> First is the dbench throughput data; while it is a poor metric, it is
>> the default one reported. The test machine is a 2-socket machine and the
>> backing filesystem is XFS as a lot of the IO work is dispatched to kernel
>> threads. It's important to note that these results are not representative
>> across all machines, especially Zen machines, as different bottlenecks
>> are exposed on different machines and filesystems.
>>
>> dbench4 Throughput (misleading but traditional)
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%)
>> Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%)
>> Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%)
>> Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%)
>> Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%)
>> Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%)
>> Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%)
>> Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%)
>> Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%)
>> Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%)
>> Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%)
>>
>> As throughput is misleading, the benchmark is modified to use a short
>> loadfile and report the completion time in milliseconds.
>>
>> dbench4 Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%)
>> Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%)
>> Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%)
>> Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%)
>> Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%)
>> Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%)
>> Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%)
>> Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%)
>> Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%)
>> Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%)
>> Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%)
>> Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%)
>> Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%)
>> Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%)
>> Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%)
>> Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%)
>> Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%)
>> Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%)
>> Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%)
>> Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%)
>> Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%)
>> Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%)
>>
>> That is still looking good and the variance is reduced quite a bit.
>> Finally, fairness is a concern so the next report tracks how many
>> milliseconds does it take for all clients to complete a workfile. This
>> one is tricky because dbench makes no effort to synchronise clients so
>> the durations at benchmark start time differ substantially from typical
>> runtimes. This problem could be mitigated by warming up the benchmark
>> for a number of minutes but it's a matter of opinion whether that
>> counts as an evasion of inconvenient results.
>>
>> dbench4 All Clients Loadfile Execution Time
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%)
>> Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%)
>> Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%)
>> Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%)
>> Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%)
>> Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%)
>> Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%)
>> Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%)
>> Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%)
>> Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%)
>> Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%)
>> Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%)
>> Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%)
>> Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%)
>> Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%)
>> Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%)
>> Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%)
>> Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%)
>> Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%)
>> Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%)
>> Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%)
>> Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%)
>>
>> This is more of a mixed bag but it at least shows that fairness
>> is not crippled.
>>
>> The hackbench results are more neutral but this is still important.
>> It's possible to boost the dbench figures by a large amount but only by
>> crippling the performance of a workload like hackbench. The WF_SYNC
>> behaviour is important for these workloads and is why the WF_SYNC
>> changes are not a separate patch.
>>
>> hackbench-process-pipes
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v5
>> Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%)
>> Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%)
>> Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%)
>> Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%)
>> Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%)
>> Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%)
>> Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%)
>> Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%)
>> Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%)
>> Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%)
>> Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%)
>> Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%)
>> Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%)
>> Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%)
>> Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%)
>>
>> Processes using pipes are impacted but the variance (not presented) indicates
>> it's close to noise and the results are not always reproducible. If executed
>> across multiple reboots, it may show neutral or small gains so the worst
>> measured results are presented.
>>
>> Hackbench using sockets is more reliably neutral as the wakeup
>> mechanisms are different between sockets and pipes.
>>
>> hackbench-process-sockets
>> 6.18-rc1 6.18-rc1
>> vanilla sched-preemptnext-v2
>> Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%)
>> Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%)
>> Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%)
>> Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%)
>> Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%)
>> Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%)
>> Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%)
>> Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%)
>> Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%)
>> Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%)
>> Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%)
>> Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%)
>> Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%)
>> Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%)
>> Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%)
>>
>> As schbench has been mentioned in numerous bugs recently, the results
>> are interesting. A test case that represents the default schbench
>> behaviour is
>>
>> schbench Wakeup Latency (usec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%)
>> Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%)
>> Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%)
>> Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%)
>>
>> schbench Requests Per Second (ops/sec)
>> 6.18.0-rc1 6.18.0-rc1
>> vanilla sched-preemptnext-v5
>> Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%)
>> Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%)
>> Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%)
>> Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%)
>>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
>> ---
>> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 130 insertions(+), 22 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 071e07f..c6e5c64 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
>> if (cfs_rq->nr_queued == 1)
>> return curr && curr->on_rq ? curr : se;
>>
>> + /*
>> + * Picking the ->next buddy will affect latency but not fairness.
>> + */
>> + if (sched_feat(PICK_BUDDY) &&
>> + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> + /* ->next will never be delayed */
>> + WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> + return cfs_rq->next;
>> + }
>> +
>> if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> curr = NULL;
>>
>> @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> return delta_exec;
>> }
>>
>> +static void set_next_buddy(struct sched_entity *se);
>> +
>> /*
>> * Used by other classes to account runtime.
>> */
>> @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
>> {
>> struct sched_entity *se;
>>
>> - /*
>> - * Picking the ->next buddy will affect latency but not fairness.
>> - */
>> - if (sched_feat(PICK_BUDDY) &&
>> - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
>> - /* ->next will never be delayed */
>> - WARN_ON_ONCE(cfs_rq->next->sched_delayed);
>> - return cfs_rq->next;
>> - }
>> -
>> se = pick_eevdf(cfs_rq);
>> if (se->sched_delayed) {
>> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
>> @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> hrtick_update(rq);
>> }
>>
>> -static void set_next_buddy(struct sched_entity *se);
>> -
>> /*
>> * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
>> * failing half-way through and resume the dequeue later.
>> @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
>> }
>> }
>>
>> +enum preempt_wakeup_action {
>> + PREEMPT_WAKEUP_NONE, /* No preemption. */
>> + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */
>> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */
>> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */
>> +};
>> +
>> +static inline bool
>> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + /*
>> + * Keep existing buddy if the deadline is sooner than pse.
>> + * The older buddy may be cache cold and completely unrelated
>> + * to the current wakeup but that is unpredictable where as
>> + * obeying the deadline is more in line with EEVDF objectives.
>> + */
>> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
>> + return false;
>> +
>> + set_next_buddy(pse);
>> + return true;
>> +}
>> +
>> +/*
>> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
>> + * strictly enforced because the hint is either misunderstood or
>> + * multiple tasks must be woken up.
>> + */
>> +static inline enum preempt_wakeup_action
>> +preempt_sync(struct rq *rq, int wake_flags,
>> + struct sched_entity *pse, struct sched_entity *se)
>> +{
>> + u64 threshold, delta;
>> +
>> + /*
>> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even
>> + * though it is likely harmless.
>> + */
>> + WARN_ON_ONCE(!(wake_flags & WF_TTWU));
>> +
>> + threshold = sysctl_sched_migration_cost;
>> + delta = rq_clock_task(rq) - se->exec_start;
>> + if ((s64)delta < 0)
>> + delta = 0;
>> +
>> + /*
>> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
>> + * could run on other CPUs. Reduce the threshold before preemption is
>> + * allowed to an arbitrary lower value as it is more likely (but not
>> + * guaranteed) the waker requires the wakee to finish.
>> + */
>> + if (wake_flags & WF_RQ_SELECTED)
>> + threshold >>= 2;
>> +
>> + /*
>> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch
>> + * wakeups to be issued.
>> + */
>> + if (entity_before(pse, se) && delta >= threshold)
>> + return PREEMPT_WAKEUP_RESCHED;
>> +
>> + return PREEMPT_WAKEUP_NONE;
>> +}
>> +
>> /*
>> * Preempt the current task with a newly woken task if needed:
>> */
>> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>> {
>> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
>> struct task_struct *donor = rq->donor;
>> struct sched_entity *se = &donor->se, *pse = &p->se;
>> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
>> int cse_is_idle, pse_is_idle;
>> - bool do_preempt_short = false;
>>
>> if (unlikely(se == pse))
>> return;
>> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> if (task_is_throttled(p))
>> return;
>>
>> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
>> - set_next_buddy(pse);
>> - }
>> -
>> /*
>> * We can come here with TIF_NEED_RESCHED already set from new task
>> * wake up path.
>> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * When non-idle entity preempt an idle entity,
>> * don't give idle entity slice protection.
>> */
>> - do_preempt_short = true;
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> goto preempt;
>> }
>>
>> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>> * If @p has a shorter slice than current and @p is eligible, override
>> * current's slice protection in order to allow preemption.
>> */
>> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
>> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) {
>> + preempt_action = PREEMPT_WAKEUP_SHORT;
>> + goto pick;
>> + }
>>
>> /*
>> + * Ignore wakee preemption on WF_FORK as it is less likely that
>> + * there is shared data as exec often follow fork. Do not
>> + * preempt for tasks that are sched_delayed as it would violate
>> + * EEVDF to forcibly queue an ineligible task.
>> + */
>> + if ((wake_flags & WF_FORK) || pse->sched_delayed)
>> + return;
>> +
>> + /*
>> + * If @p potentially is completing work required by current then
>> + * consider preemption.
>> + *
>> + * Reschedule if waker is no longer eligible. */
>> + if (in_task() && !entity_eligible(cfs_rq, se)) {
>> + preempt_action = PREEMPT_WAKEUP_RESCHED;
>> + goto preempt;
>> + }
>> +
>> + /* Prefer picking wakee soon if appropriate. */
>> + if (sched_feat(NEXT_BUDDY) &&
>> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> + /*
>> + * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> + * buddies are ignored as they may not be relevant to the
>> + * waker and less likely to be cache hot.
>> + */
>> + if (wake_flags & WF_SYNC)
>> + preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> + }
>> +
>> + switch (preempt_action) {
>> + case PREEMPT_WAKEUP_NONE:
>> + return;
>> + case PREEMPT_WAKEUP_RESCHED:
>> + goto preempt;
>> + case PREEMPT_WAKEUP_SHORT:
>> + fallthrough;
>> + case PREEMPT_WAKEUP_PICK:
>> + break;
>> + }
>> +
>> +pick:
>> + /*
>> * If @p has become the most eligible task, force preemption.
>> */
>> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse)
>> goto preempt;
>>
>> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
>> + if (sched_feat(RUN_TO_PARITY))
>> update_protect_slice(cfs_rq, se);
>>
>> return;
>>
>> preempt:
>> - if (do_preempt_short)
>> + if (preempt_action == PREEMPT_WAKEUP_SHORT)
>> cancel_protect_slice(se);
>>
>> resched_curr_lazy(rq);
>>
>
^ permalink raw reply [flat|nested] 44+ messages in thread

* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-02 12:38 ` Ryan Roberts
@ 2026-01-02 15:52 ` Dietmar Eggemann
2026-01-05 11:45 ` Ryan Roberts
0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2026-01-02 15:52 UTC (permalink / raw)
To: Ryan Roberts, Mel Gorman, Peter Zijlstra (Intel)
Cc: x86, linux-kernel, Aishwarya TCV
On 02.01.26 13:38, Ryan Roberts wrote:
> Hi, I appreciate I sent this report just before Xmas so most likely you haven't
> had a chance to look, but wanted to bring it back to the top of your mailbox in
> case it was missed.
>
> Happy new year!
>
> Thanks,
> Ryan
>
> On 22/12/2025 10:57, Ryan Roberts wrote:
>> Hi Mel, Peter,
>>
>> We are building out a kernel performance regression monitoring lab at Arm, and
>> I've noticed some fairly large performance regressions in real-world workloads,
>> for which bisection has fingered this patch.
>>
>> We are looking at performance changes between v6.18 and v6.19-rc1, and by
>> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
>> to move the testing to linux-next over the next couple of quarters so hopefully
>> we will be able to deliver this sort of news prior to merging in future).
>>
>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>> statistically significant regression/improvement, where "statistically
>> significant" means the 95% confidence intervals do not overlap.
You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals'.
Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
---
Mel mentioned that he tested on a 2-socket machine. So I guess something
like my Intel Xeon Silver 4314:
cpu0 0 0
domain0 SMT 00000001,00000001
domain1 MC 55555555,55555555
domain2 NUMA ffffffff,ffffffff
node distances:
node 0 1
0: 10 20
1: 20 10
Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
domain? I guess topology has an influence on the benchmark numbers here as well.
---
There was also a lot of improvement on schbench (wakeup latency) on
higher percentiles (>= 99.0th) on the 2-socket machine with those 2
patches. I guess you haven't seen those on Grav3?
[...]
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-02 15:52 ` Dietmar Eggemann
@ 2026-01-05 11:45 ` Ryan Roberts
2026-01-05 14:38 ` Shrikanth Hegde
2026-01-07 15:30 ` Dietmar Eggemann
0 siblings, 2 replies; 44+ messages in thread
From: Ryan Roberts @ 2026-01-05 11:45 UTC (permalink / raw)
To: Dietmar Eggemann, Mel Gorman, Peter Zijlstra (Intel)
Cc: x86, linux-kernel, Aishwarya TCV
On 02/01/2026 15:52, Dietmar Eggemann wrote:
> On 02.01.26 13:38, Ryan Roberts wrote:
>> Hi, I appreciate I sent this report just before Xmas so most likely you haven't
>> had a chance to look, but wanted to bring it back to the top of your mailbox in
>> case it was missed.
>>
>> Happy new year!
>>
>> Thanks,
>> Ryan
>>
>> On 22/12/2025 10:57, Ryan Roberts wrote:
>>> Hi Mel, Peter,
>>>
>>> We are building out a kernel performance regression monitoring lab at Arm, and
>>> I've noticed some fairly large perofrmance regressions in real-world workloads,
>>> for which bisection has fingered this patch.
>>>
>>> We are looking at performance changes between v6.18 and v6.19-rc1, and by
>>> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan
>>> to move the testing to linux-next over the next couple of quarters so hopefully
>>> we will be able to deliver this sort of news prior to merging in future).
>>>
>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>> statistically significant regression/improvement, where "statistically
>>> significant" means the 95% confidence intervals do not overlap".
>
> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>
> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>
> ---
>
> Mel mentioned that he tested on a 2-socket machine. So I guess something
> like my Intel Xeon Silver 4314:
>
> cpu0 0 0
> domain0 SMT 00000001,00000001
> domain1 MC 55555555,55555555
> domain2 NUMA ffffffff,ffffffff
>
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
>
> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
> domain? I guess topology has influence in benchmark numbers here as well.
I can't easily enable scheduler debugging right now (which I think is needed to
get this info directly?). But that's what I'd expect, yes. lscpu confirms there
is a single NUMA node and topology for cpu0 gives this if it helps:
/sys/devices/system/cpu/cpu0/topology$ grep "" -r .
./cluster_cpus:ffffffff,ffffffff
./cluster_cpus_list:0-63
./physical_package_id:0
./core_cpus_list:0
./core_siblings:ffffffff,ffffffff
./cluster_id:0
./core_siblings_list:0-63
./package_cpus:ffffffff,ffffffff
./package_cpus_list:0-63
./thread_siblings_list:0
./core_id:0
./core_cpus:00000000,00000001
./thread_siblings:00000000,00000001
>
> ---
>
> There was also a lot of improvement on schbench (wakeup latency) on
> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
> patches. I guess you haven't seen those on Grav3?
>
I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
revert-next-buddy. The means have moved a bit but there are only a couple of
cases that we consider statistically significant (marked (R)egression /
(I)mprovement):
+----------------------------+------------------------------------------------------+-------------+-------------------+
| Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
+============================+======================================================+=============+===================+
| schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
| | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
| | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
| | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
| | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
| | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
| | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
| | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
| | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
| | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
| | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
| | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
| | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
| | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
| | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
| | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
| | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
| | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
| | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
| | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
| | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
| | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
| | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
| | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
| | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
| | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
| | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
| | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
| | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
| | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
| | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
| | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
| | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
| | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
| | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
| | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
+----------------------------+------------------------------------------------------+-------------+-------------------+
I could get the results for 6.18 if useful, but I think what I have probably
shows enough of the picture: This patch has not impacted schbench much on
this HW.
Thanks,
Ryan
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-05 11:45 ` Ryan Roberts
@ 2026-01-05 14:38 ` Shrikanth Hegde
2026-01-05 16:33 ` Ryan Roberts
2026-01-07 15:30 ` Dietmar Eggemann
1 sibling, 1 reply; 44+ messages in thread
From: Shrikanth Hegde @ 2026-01-05 14:38 UTC (permalink / raw)
To: Ryan Roberts
Cc: x86, linux-kernel, Aishwarya TCV, Dietmar Eggemann, Mel Gorman,
Peter Zijlstra (Intel)
Hi Ryan,
>> node distances:
>> node 0 1
>> 0: 10 20
>> 1: 20 10
>>
>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>> domain? I guess topology has influence in benchmark numbers here as well.
>
> I can't easily enable scheduler debugging right now (which I think is needed to
> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
> is a single NUMA node and topology for cpu0 gives this if it helps:
If you dump /proc/schedstat it should give you topology info as well.
(you will need to parse it depending on which CPU you are looking at this from)
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-05 14:38 ` Shrikanth Hegde
@ 2026-01-05 16:33 ` Ryan Roberts
0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2026-01-05 16:33 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: x86, linux-kernel, Aishwarya TCV, Dietmar Eggemann, Mel Gorman,
Peter Zijlstra (Intel)
On 05/01/2026 14:38, Shrikanth Hegde wrote:
>
> Hi Ryan,
>
>>> node distances:
>>> node 0 1
>>> 0: 10 20
>>> 1: 20 10
>>>
>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>> domain? I guess topology has influence in benchmark numbers here as well.
>>
>> I can't easily enable scheduler debugging right now (which I think is needed to
>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
>> is a single NUMA node and topology for cpu0 gives this if it helps:
>
> If you dump /proc/schedstat it should give you topology info as well.
>
> (you will need to parse it depending on which CPU you are looking this from)
Ahh yes, thanks!
Every cpu is reported as being in "domain0 MC ffffffff,ffffffff". So I guess
that means there is a single MC domain as Dietmar suggests.
Thanks,
Ryan
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-05 11:45 ` Ryan Roberts
2026-01-05 14:38 ` Shrikanth Hegde
@ 2026-01-07 15:30 ` Dietmar Eggemann
2026-01-08 8:50 ` Mel Gorman
1 sibling, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2026-01-07 15:30 UTC (permalink / raw)
To: Ryan Roberts, Mel Gorman, Peter Zijlstra (Intel)
Cc: x86, linux-kernel, Aishwarya TCV
On 05.01.26 12:45, Ryan Roberts wrote:
> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>> On 02.01.26 13:38, Ryan Roberts wrote:
[...]
>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>>> statistically significant regression/improvement, where "statistically
>>>> significant" means the 95% confidence intervals do not overlap".
>>
>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>
>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>
> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
Well, I assume this would be more valuable. Before this patch-set (e.g.
v6.18), NEXT_BUDDY was disabled and this is what people are running.
Now (>= v6.19-rc1) we have NEXT_BUDDY=true (1/2) and 'NEXT_BUDDY aligned
to EEVDF' (2/2). This is what people will run when they switch to v6.19
later.
But patch 2/2 changes more than the 'if (sched_feat(NEXT_BUDDY) ...'
condition. So testing 'w/o 2/2' vs. 'w/ 2/2' and 'NEXT_BUDDY=false'
could be helpful as well.
>> ---
>>
>> Mel mentioned that he tested on a 2-socket machine. So I guess something
>> like my Intel Xeon Silver 4314:
>>
>> cpu0 0 0
>> domain0 SMT 00000001,00000001
>> domain1 MC 55555555,55555555
>> domain2 NUMA ffffffff,ffffffff
>>
>> node distances:
>> node 0 1
>> 0: 10 20
>> 1: 20 10
>>
>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>> domain? I guess topology has influence in benchmark numbers here as well.
>
> I can't easily enable scheduler debugging right now (which I think is needed to
> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
> is a single NUMA node and topology for cpu0 gives this if it helps:
>
> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
> ./cluster_cpus:ffffffff,ffffffff
> ./cluster_cpus_list:0-63
> ./physical_package_id:0
> ./core_cpus_list:0
> ./core_siblings:ffffffff,ffffffff
> ./cluster_id:0
> ./core_siblings_list:0-63
> ./package_cpus:ffffffff,ffffffff
> ./package_cpus_list:0-63
[...]
OK, so single (flat) MC domain with 64 CPUs.
>> There was also a lot of improvement on schbench (wakeup latency) on
>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>> patches. I guess you haven't seen those on Grav3?
>>
>
> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
> revert-next-buddy. The means have moved a bit but there are only a couple of
> cases that we consisder statistically significant (marked (R)egression /
> (I)mprovement):
>
> +----------------------------+------------------------------------------------------+-------------+-------------------+
> | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
> +============================+======================================================+=============+===================+
> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
> | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
> | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
> | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
> | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
> | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
> | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
> | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
> | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
> | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
> | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
> | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
> | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
> | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
> | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
> | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
> | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
> | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
> | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
> | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
> | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
> | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
> | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
> | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
> | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
> | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
> | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
> | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
> | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
> | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
> | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
> | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
> | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
> | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
> +----------------------------+------------------------------------------------------+-------------+-------------------+
>
> I could get the results for 6.18 if useful, but I think what I have probably
> shows enough of the picture: This patch has not impacted schbench much on
> this HW.
I see. IMHO, task scheduler tests are all about putting the right amount
of stress onto the system, not too little and not too much.
I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system.
Not sure which parameter set Mel was using on his 2 socket machine. And
I still assume he tested w/o (base) against with these 2 patches.
The other test Mel was using is this modified dbench4 (prefers
throughput (less preemption)). Not sure if this is part of the MmTests
suite?
It would be nice to be able to run the same tests on different machines
(with a parameter set adapted to the number of CPUs), so we have only
the arch and the topology as variables. But there is definitely more
variety (e.g. used filesystem, etc) ... so this is not trivial.
[...]
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-07 15:30 ` Dietmar Eggemann
@ 2026-01-08 8:50 ` Mel Gorman
2026-01-08 13:15 ` Ryan Roberts
0 siblings, 1 reply; 44+ messages in thread
From: Mel Gorman @ 2026-01-08 8:50 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Ryan Roberts, Peter Zijlstra (Intel), x86, linux-kernel,
Aishwarya TCV
On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> On 05.01.26 12:45, Ryan Roberts wrote:
> > On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >> On 02.01.26 13:38, Ryan Roberts wrote:
>
> [...]
>
Sorry for slow responses. I'm not back from holidays yet and unfortunately
do not have access to test machines right now, so I cannot revalidate any of
the results against 6.19-rc*.
> >>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> >>>> statistically significant regression/improvement, where "statistically
> >>>> significant" means the 95% confidence intervals do not overlap".
> >>
> >> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>
> >> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >
> > Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>
> Well, I assume this would be more valuable.
Agreed, because we need to know whether it's NEXT_BUDDY that is conceptually
an issue with EEVDF in these cases or the specific implementation. The
comparison would be between:
6.18A (baseline)
6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
It was known that NEXT_BUDDY was always a tradeoff but one that is workload,
architecture and specific arch implementation dependent. If it cannot be
sanely reconciled then it may be best to completely remove NEXT_BUDDY from
EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
it existed in CFS can be sanely implemented against EEVDF so it'll never
be equivalent. Related to that I doubt anyone has good data on NEXT_BUDDY
vs !NEXT_BUDDY even on CFS as it was enabled for so long.
> >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
> >> domain? I guess topology has influence in benchmark numbers here as well.
> >
> > I can't easily enable scheduler debugging right now (which I think is needed to
> > get this info directly?). But that's what I'd expect, yes. lscpu confirms there
> > is a single NUMA node and topology for cpu0 gives this if it helps:
> >
> > /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
> > ./cluster_cpus:ffffffff,ffffffff
> > ./cluster_cpus_list:0-63
> > ./physical_package_id:0
> > ./core_cpus_list:0
> > ./core_siblings:ffffffff,ffffffff
> > ./cluster_id:0
> > ./core_siblings_list:0-63
> > ./package_cpus:ffffffff,ffffffff
> > ./package_cpus_list:0-63
>
> [...]
>
> OK, so single (flat) MC domain with 64 CPUs.
>
That is what the OS sees but does it reflect reality? e.g. does Graviton3
have multiple caches that are simply not advertised to the OS?
> >> There was also a lot of improvement on schbench (wakeup latency) on
> >> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
> >> patches. I guess you haven't seen those on Grav3?
> >>
> >
> > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
> > revert-next-buddy. The means have moved a bit but there are only a couple of
> > cases that we consisder statistically significant (marked (R)egression /
> > (I)mprovement):
> >
> > +----------------------------+------------------------------------------------------+-------------+-------------------+
> > | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
> > +============================+======================================================+=============+===================+
> > | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
> > | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
> > | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> > | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
> > | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
> > | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
> > | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
> > | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
> > | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
> > | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
> > | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
> > | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
> > | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
> > | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
> > | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
> > | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
> > | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
> > | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
> > | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
> > | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
> > | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
> > | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
> > | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
> > | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
> > | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
> > | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
> > | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
> > | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
> > | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
> > | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
> > | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
> > | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
> > | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
> > | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
> > | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
> > | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
> > +----------------------------+------------------------------------------------------+-------------+-------------------+
> >
> > I could get the results for 6.18 if useful, but I think what I have probably
> > shows enough of the picture: This patch has not impacted schbench much on
> > this HW.
>
> I see. IMHO, task scheduler tests are all about putting the right about
> of stress onto the system, not too little and not too much.
>
> I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system.
Agreed. It's not the full picture but it's a valuable part.
> Not sure which parameter set Mel was using on his 2 socket machine. And
> I still assume he tested w/o (base) against with these 2 patches.
>
He did. The most comparable test I used was NUM_CPUS, so 64 for Graviton
is ok.
> The other test Mel was using is this modified dbench4 (prefers
> throughput (less preemption)). Not sure if this is part of the MmTests
> suite?
>
It is. The modifications are not extensive. dbench by default reports overall
throughput over time which masks actual throughput at a point in time. The
new metric tracks time taken to process "loadfiles" over time which is
more sensible to analyse. Other metrics, such as loadfiles processed per
client, could easily be extracted but are not at the moment, as dbench itself
is not designed for measuring fairness of forward progress as such.
> It would be nice to be able to run the same tests on different machines
> (with a parameter set adapted to the number of CPUs), so we have only
> the arch and the topology as variables). But there is definitely more
> variety (e.g. used filesystem, etc) ... so this is not trivial.
>
From a topology perspective it is fairly trivial though. For example,
MMTESTS has a schbench configuration that runs one message thread per NUMA
node communicating with nr_cpus/nr_nodes to evaluate placement. A similar
configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed
properly or nr_llcs could also be used fairly trivially. You're right that
once filesystems are involved then it all gets more interesting. ext4 and
xfs use kernel threads differently (jbd vs kworkers), the underlying storage
is a factor, workset size vs RAM impacts dirty throttling and reclaim and
NUMA sizes all play a part. dbench is useful in this regard because while
it interacts with the filesystem and wakeups between userspace and kernel
threads get exercised, the amount of IO is relatively small.
Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
machines but I can't start that yet.
--
Mel Gorman
SUSE Labs
* Re: [REGRESSION] [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
@ 2026-01-08 10:01 ` Madadi Vineeth Reddy
2 siblings, 0 replies; 44+ messages in thread
From: Madadi Vineeth Reddy @ 2026-01-08 10:01 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Madadi Vineeth Reddy
On 12/11/25 17:55, Mel Gorman wrote:
> Changes since v4
> o Splitout decisions into separate functions (peterz)
> o Flow clarity (peterz)
>
> Changes since v3
> o Place new code near first consumer (peterz)
> o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
> o Naming and code flow clarity (peterz)
> o Restore slice protection (peterz)
>
> Changes since v2
> o Review feedback applied from Prateek
>
> I've been chasing down a number of schedule issues recently like many
> others and found they were broadly grouped as
>
> 1. Failure to boost CPU frequency with powersave/ondemand governors
> 2. Processors entering idle states that are too deep
> 3. Differences in wakeup latencies for wakeup-intensive workloads
>
> Taking topology into account means that there is a lot of machine-specific
> behaviour which may explain why some discussions recently have reproduction
> problems. Nevertheless, the removal of LAST_BUDDY and NEXT_BUDDY being
> disabled has an impact on wakeup latencies.
>
> This series enables NEXT_BUDDY and may select a wakee if it's eligible to
> run even though other unrelated tasks may have an earlier deadline.
>
> Mel Gorman (2):
> sched/fair: Enable scheduler feature NEXT_BUDDY
> sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>
> kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++------
> kernel/sched/features.h | 2 +-
> 2 files changed, 131 insertions(+), 23 deletions(-)
>
Hi Mel, Peter,
During internal testing, I noticed approximately 7% regression in a real-world workload
called DayTrader.
Git bisect pointed to this patch:
"sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals"
Before this patch was merged, I reported a regression in v4 with schbench and stress-ng.
From that discussion:
https://lore.kernel.org/all/ddfde793-ad6e-4517-96b8-662dcb78acc8@linux.ibm.com/#t
```
So with frequent wakeups, queued tasks (even with earlier deadlines) may be
unfairly delayed. I understand that this would fade away quickly as the
woken up task that got to run due to buddy preference would accumulate negative
lag and would not be eligible to run again, but the starvation could be higher if
wakeups are very high.
To test this, I ran schbench (many message and worker threads) together with
stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
around 64%.
This shows a significant regression for CPU-bound tasks under heavy wakeup loads.
```
I understand that stress-ng bogo-ops is not a reliable metric. However, the problem
appears to be real, as DayTrader also shows regression with this patch.
To check if WF_SYNC related change is the issue, I tried to decrease threshold by
`echo 50000 > /sys/kernel/debug/sched/migration_cost_ns` so that waker could preempt
quickly in WF_SYNC case. This helped but I understand that it changes a lot of code paths
that use migration_cost_ns. So, when I decreased only threshold in this patch, the
performance didn't improve.
So, I think the problem is that making tasks with an earlier deadline wait in
the presence of frequent wakeups hurts CPU-intensive workloads.
Any thoughts/ideas?
Meanwhile, I will also spend time working around this patch to see if the
performance could be improved.
Thanks,
Vineeth
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-08 8:50 ` Mel Gorman
@ 2026-01-08 13:15 ` Ryan Roberts
2026-01-09 10:15 ` Ryan Roberts
0 siblings, 1 reply; 44+ messages in thread
From: Ryan Roberts @ 2026-01-08 13:15 UTC (permalink / raw)
To: Mel Gorman, Dietmar Eggemann
Cc: Peter Zijlstra (Intel), x86, linux-kernel, Aishwarya TCV
On 08/01/2026 08:50, Mel Gorman wrote:
> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>> On 05.01.26 12:45, Ryan Roberts wrote:
>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>
>> [...]
>>
>
> Sorry for the slow responses. I'm not back from holidays yet and unfortunately
> do not have access to test machines right now, so I cannot revalidate any of
> the results against 6.19-rc*.
No problem, thanks for getting back to me!
>
>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>>>>> statistically significant regression/improvement, where "statistically
>>>>>> significant" means the 95% confidence intervals do not overlap.
>>>>
>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>>>
>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>
>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>
>> Well, I assume this would be more valuable.
>
> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
> an issue with EEVDF in these cases or the specific implementation. The
> comparison between
>
> 6.18A (baseline)
> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
tomorrow. Then we can take it from there.
I appreciate your time on this!
Thanks,
Ryan
>
> It was known that NEXT_BUDDY was always a tradeoff, but one that is workload,
> architecture and specific arch implementation dependent. If it cannot be
> sanely reconciled then it may be best to completely remove NEXT_BUDDY from
> EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as
> it existed in CFS can be sanely implemented against EEVDF so it'll never
> be equivalent. Related to that, I doubt anyone has good data on NEXT_BUDDY
> vs !NEXT_BUDDY even on CFS as it was disabled for so long.
>
>>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC
>>>> domain? I guess topology has influence in benchmark numbers here as well.
>>>
>>> I can't easily enable scheduler debugging right now (which I think is needed to
>>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there
>>> is a single NUMA node and topology for cpu0 gives this if it helps:
>>>
>>> /sys/devices/system/cpu/cpu0/topology$ grep "" -r .
>>> ./cluster_cpus:ffffffff,ffffffff
>>> ./cluster_cpus_list:0-63
>>> ./physical_package_id:0
>>> ./core_cpus_list:0
>>> ./core_siblings:ffffffff,ffffffff
>>> ./cluster_id:0
>>> ./core_siblings_list:0-63
>>> ./package_cpus:ffffffff,ffffffff
>>> ./package_cpus_list:0-63
>>
>> [...]
>>
>> OK, so single (flat) MC domain with 64 CPUs.
>>
>
> That is what the OS sees but does it reflect reality? e.g. does Graviton3
> have multiple caches that are simply not advertised to the OS?
>
>>>> There was also a lot of improvement on schbench (wakeup latency) on
>>>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2
>>>> patches. I guess you haven't seen those on Grav3?
>>>>
>>>
>>> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for
>>> revert-next-buddy. The means have moved a bit but there are only a couple of
>>> cases that we consider statistically significant (marked (R)egression /
>>> (I)mprovement):
>>>
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>> | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy |
>>> +============================+======================================================+=============+===================+
>>> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% |
>>> | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% |
>>> | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% |
>>> | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% |
>>> | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% |
>>> | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% |
>>> | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% |
>>> | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% |
>>> | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% |
>>> | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% |
>>> | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% |
>>> | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% |
>>> | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% |
>>> | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% |
>>> | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% |
>>> | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% |
>>> | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% |
>>> | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% |
>>> | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% |
>>> | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% |
>>> | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% |
>>> | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% |
>>> | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% |
>>> | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% |
>>> | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% |
>>> | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% |
>>> | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% |
>>> | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% |
>>> | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% |
>>> | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% |
>>> | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% |
>>> | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% |
>>> | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% |
>>> | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% |
>>> | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% |
>>> +----------------------------+------------------------------------------------------+-------------+-------------------+
>>>
>>> I could get the results for 6.18 if useful, but I think what I have probably
>>> shows enough of the picture: This patch has not impacted schbench much on
>>> this HW.
>>
>> I see. IMHO, task scheduler tests are all about putting the right amount
>> of stress onto the system, not too little and not too much.
>>
>> I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system.
>
> Agreed. It's not the full picture but it's a valuable part.
>
>> Not sure which parameter set Mel was using on his 2 socket machine. And
>> I still assume he tested w/o (base) against with these 2 patches.
>>
>
> He did. The most comparable test I used was NUM_CPUS so 64 for Graviton
> is ok.
>
>> The other test Mel was using is this modified dbench4 (prefers
>> throughput (less preemption)). Not sure if this is part of the MmTests
>> suite?
>>
>
> It is. The modifications are not extensive. dbench by default reports overall
> throughput over time, which masks the actual throughput at a point in time. The
> new metric tracks the time taken to process "loadfiles" over time, which is
> more sensible to analyse. Other metrics, such as loadfiles processed per
> client, could be easily extracted but aren't at the moment, as dbench itself
> is not designed for measuring fairness of forward progress as such.
>
>> It would be nice to be able to run the same tests on different machines
>> (with a parameter set adapted to the number of CPUs), so we have only
>> the arch and the topology as variables. But there is definitely more
>> variety (e.g. used filesystem, etc) ... so this is not trivial.
>>
>
> From a topology perspective it is fairly trivial though. For example,
> MMTESTS has a schbench configuration that runs one message thread per NUMA
> node communicating with nr_cpus/nr_nodes to evaluate placement. A similar
> configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed
> properly or nr_llcs could also be used fairly trivially. You're right that
> once filesystems are involved then it all gets more interesting. ext4 and
> xfs use kernel threads differently (jbd vs kworkers), the underlying storage
> is a factor, workset size vs RAM impacts dirty throttling and reclaim and
> NUMA sizes all play a part. dbench is useful in this regard because while
> it interacts with the filesystem and wakeups between userspace and kernel
> threads get exercised, the amount of IO is relatively small.
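The topology-derived parameterisation described above could be sketched roughly as follows (illustrative only; mmtests' actual configuration logic may differ):

```shell
# Derive schbench message/worker counts from the machine topology:
# one message thread per NUMA node, workers sized to the CPUs per node.
nr_cpus=$(getconf _NPROCESSORS_ONLN)
nr_nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
[ "$nr_nodes" -ge 1 ] || nr_nodes=1      # fall back when sysfs is absent
per_node=$(( nr_cpus / nr_nodes ))       # workers per message thread
packed=$(( per_node - 1 ))               # leave headroom for the waker
echo "schbench -m $nr_nodes -t $per_node   # nr_cpus/nr_nodes workers"
echo "schbench -m $nr_nodes -t $packed   # packed placement variant"
```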
>
> Let's start with getting figures for 6.18, new-NEXT, old-NEXT and
> no-NEXT side-by-side. Ideally I'd do the same across a range of x86-64
> machines but I can't start that yet.
>
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-08 13:15 ` Ryan Roberts
@ 2026-01-09 10:15 ` Ryan Roberts
2026-01-12 7:47 ` Peter Zijlstra
2026-01-15 10:16 ` Mel Gorman
0 siblings, 2 replies; 44+ messages in thread
From: Ryan Roberts @ 2026-01-09 10:15 UTC (permalink / raw)
To: Mel Gorman, Dietmar Eggemann
Cc: Peter Zijlstra (Intel), x86, linux-kernel, Aishwarya TCV
On 08/01/2026 13:15, Ryan Roberts wrote:
> On 08/01/2026 08:50, Mel Gorman wrote:
>> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
>>> On 05.01.26 12:45, Ryan Roberts wrote:
>>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
>>>>> On 02.01.26 13:38, Ryan Roberts wrote:
>>>
>>> [...]
>>>
>>
>> Sorry for the slow responses. I'm not back from holidays yet and unfortunately
>> do not have access to test machines right now, so I cannot revalidate any of
>> the results against 6.19-rc*.
>
> No problem, thanks for getting back to me!
>
>>
>>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
>>>>>>> statistically significant regression/improvement, where "statistically
>>>>>>> significant" means the 95% confidence intervals do not overlap".
>>>>>
>>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
>>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
>>>>>
>>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
>>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
>>>>
>>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
>>>
>>> Well, I assume this would be more valuable.
>>
>> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
>> an issue with EEVDF in these cases or the specific implementation. The
>> comparison between
>>
>> 6.18A (baseline)
>> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
>> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
>> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
>
> OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
> tomorrow. Then we can take it from there.
Hi Mel, Dietmar,
Here are the updated results, now including column for "revert #1 & #2".
6-18-0 (base) (baseline)
6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
revert #1 & #2 (NEXT_BUDDY disabled)
revert #2 (Old NEXT_BUDDY implementation enabled)
The regressions that are fixed by "revert #2" (as originally reported) are still
fixed in "revert #1 & #2". Interestingly, performance actually improves further
for the latter in the multi-node mysql benchmark (which is our VIP workload).
There are a couple of hackbench cases (sockets with high thread counts) that
showed an improvement with "revert #2", but that improvement is gone with "revert #1 & #2".
Let me know if I can usefully do anything else.
Multi-node SUT (workload running across 2 machines):
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
+=================================+====================================================+===============+=============+============+================+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% |
| | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
Single-node SUT (workload running on single machine):
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
+=================================+====================================================+===============+=============+============+================+
| specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% |
| | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% |
| | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% |
| | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% |
| | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% |
| | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% |
| | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% |
| | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% |
| | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% |
| | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% |
| | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% |
| | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% |
| | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% |
| | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% |
| | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% |
| | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% |
| | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% |
| | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% |
| | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% |
| | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% |
| | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% |
| | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% |
| | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% |
| | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% |
| | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% |
| | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% |
| | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% |
| | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% |
| | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | (R) -2.23% |
| | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | (R) -2.46% |
| | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | -1.62% |
| | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | -0.26% |
| | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | (R) -2.45% |
| | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | (R) -2.25% |
| | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | (R) -2.89% |
| | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | (R) -2.44% |
| | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | (R) -2.17% |
| | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | (R) -2.20% |
| | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | (R) -2.74% |
| | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | 0.03% |
| | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | (I) 19.09% |
| | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | (I) 11.83% |
| | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | (I) 11.21% |
| | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | (I) 10.30% |
| | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | (I) 7.22% |
| | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | (I) 2.85% |
| | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | 3.10% |
| | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | 0.22% |
| | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | 2.64% |
| | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | (I) 4.32% |
| | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | 0.32% |
| | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | 1.28% |
| | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | 0.53% |
| | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | -0.67% |
| | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | (I) 9.08% |
| | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | (I) 2.82% |
| | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | -0.53% |
| | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | (R) -2.00% |
| | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | -1.63% |
| | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | 0.81% |
| | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | -0.53% |
| | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | 0.59% |
| | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | 0.24% |
| | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | 0.45% |
| | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | 0.40% |
| | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | 0.65% |
| | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | 0.30% |
| | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | -0.43% |
| | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | (I) 19.79% |
| | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | (I) 12.95% |
| | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | (I) 13.90% |
| | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | (I) 13.89% |
| | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | (I) 9.51% |
| | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | (I) 3.74% |
| | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | (I) 2.76% |
| | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | 0.44% |
| | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | 1.51% |
| | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | 1.38% |
| | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | 0.57% |
| | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | 0.72% |
| | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | 0.81% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
| pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | -0.61% |
| | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | 0.57% |
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
Thanks,
Ryan
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-09 10:15 ` Ryan Roberts
@ 2026-01-12 7:47 ` Peter Zijlstra
2026-01-12 8:52 ` Ryan Roberts
2026-01-15 10:16 ` Mel Gorman
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2026-01-12 7:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV
On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> Here are the updated results, now including column for "revert #1 & #2".
>
> 6-18-0 (base) (baseline)
> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
> revert #1 & #2 (NEXT_BUDDY disabled)
> revert #2 (Old NEXT_BUDDY implementation enabled)
>
>
> The regressions that are fixed by "revert #2" (as originally reported) are still
> fixed in "revert #1 & #2". Interestingly, performance actually improves further
> for the latter in the multi-node mysql benchmark (which is our VIP workload).
> There are a couple of hackbench cases (sockets with high thread counts) that
> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
>
> Let me know if I can usefully do anything else.
If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
defining characteristic of BATCH is that it fully ignores wakeup
preemption.
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-12 7:47 ` Peter Zijlstra
@ 2026-01-12 8:52 ` Ryan Roberts
2026-01-12 9:57 ` Peter Zijlstra
2026-01-13 6:31 ` K Prateek Nayak
0 siblings, 2 replies; 44+ messages in thread
From: Ryan Roberts @ 2026-01-12 8:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV
On 12/01/2026 07:47, Peter Zijlstra wrote:
> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>
>> Here are the updated results, now including column for "revert #1 & #2".
>>
>> 6-18-0 (base) (baseline)
>> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
>> revert #1 & #2 (NEXT_BUDDY disabled)
>> revert #2 (Old NEXT_BUDDY implementation enabled)
>>
>>
>> The regressions that are fixed by "revert #2" (as originally reported) are still
>> fixed in "revert #1 & #2". Interestingly, performance actually improves further
>> for the latter in the multi-node mysql benchmark (which is our VIP workload).
>> There are a couple of hackbench cases (sockets with high thread counts) that
>> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
>>
>> Let me know if I can usefully do anything else.
>
> > If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
> defining characteristic of BATCH is that it fully ignores wakeup
> preemption.
Is there a way I can force all future tasks to use SCHED_BATCH at the system
level (a Kconfig, cmdline arg or sysfs toggle)? If so, that would be simple for
me to do. But if I need to invoke the top-level command with chrt -b and hope
that nothing in the workload explicitly changes the scheduling policy, that would
be both trickier for me to do and (I guess) carry a higher risk that it ends up not
doing what I expected. Happy to give whatever you recommend a try...
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-12 8:52 ` Ryan Roberts
@ 2026-01-12 9:57 ` Peter Zijlstra
2026-01-12 10:27 ` Ryan Roberts
2026-01-13 6:31 ` K Prateek Nayak
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2026-01-12 9:57 UTC (permalink / raw)
To: Ryan Roberts
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV
On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote:
> On 12/01/2026 07:47, Peter Zijlstra wrote:
> > On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> >
> >> Here are the updated results, now including column for "revert #1 & #2".
> >>
> >> 6-18-0 (base) (baseline)
> >> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
> >> revert #1 & #2 (NEXT_BUDDY disabled)
> >> revert #2 (Old NEXT_BUDDY implementation enabled)
> >>
> >>
> >> The regressions that are fixed by "revert #2" (as originally reported) are still
> >> fixed in "revert #1 & #2". Interestingly, performance actually improves further
> >> for the latter in the multi-node mysql benchmark (which is our VIP workload).
> >> There are a couple of hackbench cases (sockets with high thread counts) that
> >> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
> >>
> >> Let me know if I can usefully do anything else.
> >
> > > If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
> > defining characteristic of BATCH is that it fully ignores wakeup
> > preemption.
>
> Is there a way I can force all future tasks to use SCHED_BATCH at the system
> level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for
> me to do. But if I need to invoke the top level command with chrt -b and hope
> that nothing in the workload explicitly changes the scheduling policy that would
> be both trickier for me to do and (I guess) higher risk that it ends up not
> doing what I expected. Happy to give whatever you recommend a try...
No fancy things here, chrt/schedtool are it.
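For reference, a minimal wrapper along those lines might look like this (assuming util-linux chrt is installed; the function name and the fall-through when chrt fails are my assumptions):

```shell
# Launch a workload under SCHED_BATCH via chrt. SCHED_BATCH takes static
# priority 0 and is inherited by children unless they reset the policy.
run_batch() {
    if command -v chrt >/dev/null 2>&1; then
        # Lowering from SCHED_OTHER to SCHED_BATCH needs no privileges;
        # fall back to a plain run if the policy change is refused.
        chrt -b 0 "$@" 2>/dev/null || "$@"
    else
        echo "chrt not found; running without SCHED_BATCH" >&2
        "$@"
    fi
}
run_batch echo "workload started"
```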
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-12 9:57 ` Peter Zijlstra
@ 2026-01-12 10:27 ` Ryan Roberts
0 siblings, 0 replies; 44+ messages in thread
From: Ryan Roberts @ 2026-01-12 10:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV
On 12/01/2026 09:57, Peter Zijlstra wrote:
> On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote:
>> On 12/01/2026 07:47, Peter Zijlstra wrote:
>>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>>
>>>> Here are the updated results, now including column for "revert #1 & #2".
>>>>
>>>> 6-18-0 (base) (baseline)
>>>> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
>>>> revert #1 & #2 (NEXT_BUDDY disabled)
>>>> revert #2 (Old NEXT_BUDDY implementation enabled)
>>>>
>>>>
>>>> The regressions that are fixed by "revert #2" (as originally reported) are still
>>>> fixed in "revert #1 & #2". Interestingly, performance actually improves further
>>>> for the latter in the multi-node mysql benchmark (which is our VIP workload).
>>>> There are a couple of hackbench cases (sockets with high thread counts) that
>>>> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
>>>>
>>>> Let me know if I can usefully do anything else.
>>>
>>> If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
>>> defining characteristic of BATCH is that it fully ignores wakeup
>>> preemption.
>>
>> Is there a way I can force all future tasks to use SCHED_BATCH at the system
>> level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for
>> me to do. But if I need to invoke the top level command with chrt -b and hope
>> that nothing in the workload explicitly changes the scheduling policy that would
>> be both trickier for me to do and (I guess) higher risk that it ends up not
>> doing what I expected. Happy to give whatever you recommend a try...
>
> No fancy things here, chrt/schedtool are it.
OK I'll figure out how to butcher this into my workflow and get back to you with
results. It probably won't be until Wednesday though.
Thanks,
Ryan
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-12 8:52 ` Ryan Roberts
2026-01-12 9:57 ` Peter Zijlstra
@ 2026-01-13 6:31 ` K Prateek Nayak
1 sibling, 0 replies; 44+ messages in thread
From: K Prateek Nayak @ 2026-01-13 6:31 UTC (permalink / raw)
To: Ryan Roberts, Peter Zijlstra
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV
Hello Ryan,
On 1/12/2026 2:22 PM, Ryan Roberts wrote:
> On 12/01/2026 07:47, Peter Zijlstra wrote:
>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>
>>> Here are the updated results, now including column for "revert #1 & #2".
>>>
>>> 6-18-0 (base) (baseline)
>>> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
>>> revert #1 & #2 (NEXT_BUDDY disabled)
>>> revert #2 (Old NEXT_BUDDY implementation enabled)
>>>
>>>
>>> The regressions that are fixed by "revert #2" (as originally reported) are still
>>> fixed in "revert #1 & #2". Interestingly, performance actually improves further
>>> for the latter in the multi-node mysql benchmark (which is our VIP workload).
>>> There are a couple of hackbench cases (sockets with high thread counts) that
>>> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
>>>
>>> Let me know if I can usefully do anything else.
>>
>> If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The
>> defining characteristic of BATCH is that it fully ignores wakeup
>> preemption.
>
> Is there a way I can force all future tasks to use SCHED_BATCH at the system
> level?
One shortcut is to echo "NO_WAKEUP_PREEMPTION" into
/sys/kernel/debug/sched/features but note it'll disable wakeup preemption
for all tasks, including kthreads, which might adversely affect
performance and is not an exact equivalent to only running the workload
under SCHED_BATCH.
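As a sketch of that shortcut (root and a mounted debugfs assumed; the helper name is illustrative, and the restore step matters because the toggle affects every task):

```shell
# Flip the WAKEUP_PREEMPTION scheduler feature globally, then restore it.
toggle_wakeup_preemption() {
    feat=/sys/kernel/debug/sched/features
    if [ -w "$feat" ]; then
        echo NO_WAKEUP_PREEMPTION > "$feat"   # disable for ALL tasks
        # ... run the benchmark here ...
        echo WAKEUP_PREEMPTION > "$feat"      # restore the default
        echo "toggled"
    else
        echo "skipped: $feat not writable (need root + debugfs)"
    fi
}
toggle_wakeup_preemption
```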
For repro-collection/mysql-workload, (which I presume is [1]), there is
a "WORKLOAD_SCHED_POLICY" environment variable that can be overridden [2]
which controls the "CPUSchedulingPolicy" of the mysqld service.
[1] https://github.com/aws/repro-collection/tree/main/workloads/mysql
[2] https://github.com/aws/repro-collection/blob/a2cdf0455bd3422c9c1fc689ceac32971223b984/repros/repro-mysql-EEVDF-regression/main.sh#L102
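For services not driven by that variable, the same effect can be had with a systemd drop-in; a sketch (the drop-in path and unit name are illustrative, not from this thread):

```ini
# /etc/systemd/system/mysqld.service.d/batch.conf  (illustrative path)
[Service]
CPUSchedulingPolicy=batch
```

followed by `systemctl daemon-reload` and a restart of the unit.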
--
Thanks and Regards,
Prateek
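As a per-task alternative to the global feature toggle, the workload itself can be launched under SCHED_BATCH (what Peter asked Ryan to run); a sketch using util-linux's chrt, not something from this thread:

```shell
# Run a command under SCHED_BATCH without touching global knobs.
# chrt is part of util-linux; 0 is the only valid priority for batch.
if command -v chrt >/dev/null 2>&1; then
    # Report the policy of a child started under SCHED_BATCH.
    policy=$(chrt --batch 0 sh -c 'chrt -p $$')
    echo "$policy"
else
    policy="chrt not installed (util-linux)"
    echo "$policy"
fi
```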
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2026-01-09 10:15 ` Ryan Roberts
2026-01-12 7:47 ` Peter Zijlstra
@ 2026-01-15 10:16 ` Mel Gorman
1 sibling, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2026-01-15 10:16 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dietmar Eggemann, Peter Zijlstra (Intel), x86, linux-kernel,
Aishwarya TCV, Madadi Vineeth Reddy
On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> On 08/01/2026 13:15, Ryan Roberts wrote:
> > On 08/01/2026 08:50, Mel Gorman wrote:
> >> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> >>> On 05.01.26 12:45, Ryan Roberts wrote:
> >>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >>>>> On 02.01.26 13:38, Ryan Roberts wrote:
> >>>
> >>> [...]
> >>>
> >>
> >> Sorry for slow responses. I'm not back from holidays yet and unfortunately
> >> do not have access to test machines right now, so I cannot revalidate any
> >> of the results against 6.19-rc*.
> >
> > No problem, thanks for getting back to me!
> >
> >>
> >>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean
> >>>>>>> statistically significant regression/improvement, where "statistically
> >>>>>>> significant" means the 95% confidence intervals do not overlap".
> >>>>>
> >>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>>>>
> >>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >>>>
> >>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful?
> >>>
> >>> Well, I assume this would be more valuable.
> >>
> >> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
> >> an issue with EEVDF in these cases or the specific implementation. The
> >> comparison between
> >>
> >> 6.18A (baseline)
> >> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> >> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> >> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
> >
> > OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully
> > tomorrow. Then we can take it from there.
>
> Hi Mel, Dietmar,
>
> Here are the updated results, now including column for "revert #1 & #2".
>
> 6-18-0 (base) (baseline)
> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled)
> revert #1 & #2 (NEXT_BUDDY disabled)
> revert #2 (Old NEXT_BUDDY implementation enabled)
>
Thanks.
>
> The regressions that are fixed by "revert #2" (as originally reported) are still
> fixed in "revert #1 & #2". Interestingly, performance actually improves further
> for the latter in the multi-node mysql benchmark (which is our VIP workload).
It suggests that NEXT_BUDDY in general is harmful to this workload. In an
ideal world, this would also be checked against the NEXT_BUDDY implementation
in CFS, but that would be a waste of time for many reasons. I find it
particularly interesting that it is only measurable with the 2-machine test,
as that suggests, but does not prove, that the problem may be related to
WF_SYNC wakeups from the network layer.
> There are a couple of hackbench cases (sockets with high thread counts) that
> showed an improvement with "revert #2" but which is gone with "revert #1 & #2".
>
> Let me know if I can usefully do anything else.
>
> Multi-node SUT (workload running across 2 machines):
>
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% |
> | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
Ok, fairly clear there.
> Single-node SUT (workload running on single machine):
>
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +=================================+====================================================+===============+=============+============+================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% |
> | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
I assume this is specjbb2015. I'm a little cautious of these results as
specjbb2015 focuses on peak performance. It starts with low CPU usage and
scales up to find the point where performance peaks. This metric can be
gamed, and what works for specjbb, particularly as the machine approaches
heavy utilisation and transitions to overload, can be problematic.
Can you look at the detailed results for specjbb2015 and determine whether
the peak was picked from different load points?
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% |
A few points of concern but nothing as severe as the mysql Multi-node
SUT. The worst regressions are when the number of clients exceeds the number
of CPUs and at that point any wakeup preemption is potentially harmful.
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% |
So hackbench is all over the place with a mix of gains and losses; no
clear winner.
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | -0.61% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | 0.57% |
> +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+
>
And this is the main winner. The results confirm that NEXT_BUDDY is not
a universal win but the mysql results and Daytrader results from Madadi
are a concern.
I still don't have access to test machines to investigate this properly
and may not have access for 1-2 weeks. I think the best approach for now
is to disable NEXT_BUDDY by default again until it's determined exactly
why mysql multi-host and daytrader suffered.
Can you test this to be sure please?
--8<--
sched/fair: Disable scheduler feature NEXT_BUDDY
NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8 ("sched/fair:
Reimplement NEXT_BUDDY to align with EEVDF goals"). It was not expected
that this would be a universal win without a crystal ball instruction
but the reported regressions are a concern [1][2] even if gains were
also reported. Specifically:
o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses
The mysql case is realistic and a concern. It needs to be confirmed whether
specjbb is simply shifting the load point at which peak performance is
measured, but it is still a concern. daytrader is considered representative
of a real workload.
Access to test machines is currently problematic for verifying any fix to
this problem. Disable NEXT_BUDDY for now by default until the root causes
are addressed.
Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..136a6584be79 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
* wakeup-preemption), since its likely going to consume data we
* touched, increases cache locality.
*/
-SCHED_FEAT(NEXT_BUDDY, true)
+SCHED_FEAT(NEXT_BUDDY, false)
/*
* Allow completely ignoring cfs_rq->next; which can be set from various
^ permalink raw reply related [flat|nested] 44+ messages in thread
end of thread, other threads:[~2026-01-15 10:35 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:13 ` Peter Zijlstra
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts
2026-01-02 12:38 ` Ryan Roberts
2026-01-02 15:52 ` Dietmar Eggemann
2026-01-05 11:45 ` Ryan Roberts
2026-01-05 14:38 ` Shrikanth Hegde
2026-01-05 16:33 ` Ryan Roberts
2026-01-07 15:30 ` Dietmar Eggemann
2026-01-08 8:50 ` Mel Gorman
2026-01-08 13:15 ` Ryan Roberts
2026-01-09 10:15 ` Ryan Roberts
2026-01-12 7:47 ` Peter Zijlstra
2026-01-12 8:52 ` Ryan Roberts
2026-01-12 9:57 ` Peter Zijlstra
2026-01-12 10:27 ` Ryan Roberts
2026-01-13 6:31 ` K Prateek Nayak
2026-01-15 10:16 ` Mel Gorman
2026-01-08 10:01 ` [REGRESSION] [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Madadi Vineeth Reddy
-- strict thread matches above, loose matches on Subject: below --
2025-11-03 11:04 [PATCH 0/2 v4] " Mel Gorman
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-03 14:14 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
2025-11-07 8:53 ` Mel Gorman
[not found] <20251027133915.4103633-1-mgorman@techsingularity.net>
2025-10-27 13:39 ` Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
2025-10-31 9:46 ` Mel Gorman
2025-10-28 15:09 ` Peter Zijlstra
2025-10-31 9:48 ` Mel Gorman
2025-10-28 15:33 ` Peter Zijlstra
2025-10-28 15:47 ` Peter Zijlstra
2025-10-30 9:10 ` Peter Zijlstra
2025-10-31 10:27 ` Mel Gorman
2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-10-23 6:29 ` K Prateek Nayak
[not found] <20250714134429.19624-1-mgorman@techsingularity.net>
2025-07-14 13:44 ` Mel Gorman