* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY [not found] <20251027133915.4103633-1-mgorman@techsingularity.net> @ 2025-10-27 13:39 ` Mel Gorman 2025-10-28 14:37 ` Peter Zijlstra 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last wakee to be scheduled sooner on the assumption that the waker/wakee share cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on the assumption that the pair of tasks still share data but also relied on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good results. NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear, but vruntime spread is mentioned, so the expectation is that NEXT_BUDDY had an impact on overall fairness. It was not noted why LAST_BUDDY was removed but it is assumed that it is very difficult to reason about what LAST_BUDDY's correct and effective behaviour should be while still respecting EEVDF's goals. NEXT_BUDDY is easier to reason about given that it's a point-in-time decision on the wakee's deadline and eligibility relative to the waker. Enable NEXT_BUDDY as a preparation patch to document that the decision to ignore the current implementation is deliberate. While not presented, the results were at best neutral and often much more variable. 
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/features.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 3c12d9f93331..0607def744af 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true) * wakeup-preemption), since its likely going to consume data we * touched, increases cache locality. */ -SCHED_FEAT(NEXT_BUDDY, false) +SCHED_FEAT(NEXT_BUDDY, true) /* * Allow completely ignoring cfs_rq->next; which can be set from various -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
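The commit message's claim that NEXT_BUDDY is a "point-in-time decision on the wakee's deadline and eligibility" rests on two small predicates. The following is a standalone sketch of them, not the kernel code: the names mirror kernel/sched/fair.c, but the kernel's entity_eligible() compares against a weighted average vruntime computed in avg_vruntime() with fixed-point arithmetic, which a plain average stands in for here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified userspace model of the two EEVDF checks the buddy logic
 * relies on. Real entities carry weights and lag; this model does not.
 */
struct entity {
	int64_t vruntime;	/* virtual runtime consumed so far */
	int64_t deadline;	/* virtual deadline of the current request */
};

/* entity_before(): the entity with the earlier virtual deadline wins */
static bool entity_before(const struct entity *a, const struct entity *b)
{
	return a->deadline < b->deadline;
}

/*
 * entity_eligible(): under EEVDF a task may be picked only when its
 * vruntime is not ahead of the runqueue's (weighted) average vruntime,
 * i.e. it has not already received more than its fair share.
 */
static bool entity_eligible(int64_t avg_vruntime, const struct entity *se)
{
	return se->vruntime <= avg_vruntime;
}
```

Both checks are O(1) against the current runqueue state, which is why a wakeup-time buddy decision based on them is easy to reason about compared with LAST_BUDDY's switch-back behaviour.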
* Re: [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY 2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman @ 2025-10-28 14:37 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 14:37 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:14PM +0000, Mel Gorman wrote: > NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve > latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 > ("sched/fair: Commit to EEVDF"). The reasoning is not clear but as vruntime > spread is mentioned so the expectation is that NEXT_BUDDY had an impact > on overall fairness. It was not noted why LAST_BUDDY was removed but it > is assumed that it's very difficult to reason what LAST_BUDDY's correct > and effective behaviour should be while still respecting EEVDFs goals. I think I was just struggling to make sense of things and figured less is more and axed it. I have vague memories trying to work through the dynamics of a wakeup-stack and the EEVDF latency requirements and getting a head-ache. ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals [not found] <20251027133915.4103633-1-mgorman@techsingularity.net> 2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman @ 2025-10-27 13:39 ` Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected. Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, preemption is throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption. For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run. While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a poor metric but it is the default. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. 
It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems. dbench4 Throughput (misleading but traditional) 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Hmean 1 1268.80 ( 0.00%) 1285.93 * 1.35%* Hmean 4 3971.74 ( 0.00%) 3851.10 * -3.04%* Hmean 7 5548.23 ( 0.00%) 5507.07 ( -0.74%) Hmean 12 7310.86 ( 0.00%) 7205.37 ( -1.44%) Hmean 21 8874.53 ( 0.00%) 9193.66 * 3.60%* Hmean 30 9361.93 ( 0.00%) 10552.03 * 12.71%* Hmean 48 9540.14 ( 0.00%) 11936.22 * 25.12%* Hmean 79 9208.74 ( 0.00%) 12367.06 * 34.30%* Hmean 110 8573.12 ( 0.00%) 12114.01 * 41.30%* Hmean 141 7791.33 ( 0.00%) 11391.60 * 46.21%* Hmean 160 7666.60 ( 0.00%) 10789.23 * 40.73%* As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds. dbench4 Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 14.62 ( 0.00%) 14.21 ( 2.80%) Amean 4 18.76 ( 0.00%) 19.45 ( -3.67%) Amean 7 23.71 ( 0.00%) 23.82 ( -0.48%) Amean 12 31.25 ( 0.00%) 31.78 ( -1.69%) Amean 21 45.12 ( 0.00%) 43.53 ( 3.52%) Amean 30 61.07 ( 0.00%) 54.13 ( 11.37%) Amean 48 95.91 ( 0.00%) 76.51 ( 20.23%) Amean 79 163.38 ( 0.00%) 121.46 ( 25.66%) Amean 110 243.91 ( 0.00%) 172.56 ( 29.25%) Amean 141 343.47 ( 0.00%) 236.07 ( 31.27%) Amean 160 401.15 ( 0.00%) 283.64 ( 29.29%) Stddev 1 0.52 ( 0.00%) 0.44 ( 15.50%) Stddev 4 1.36 ( 0.00%) 1.42 ( -4.91%) Stddev 7 1.88 ( 0.00%) 1.84 ( 2.27%) Stddev 12 3.06 ( 0.00%) 2.55 ( 16.57%) Stddev 21 5.78 ( 0.00%) 3.70 ( 35.90%) Stddev 30 9.85 ( 0.00%) 5.10 ( 48.26%) Stddev 48 22.31 ( 0.00%) 8.30 ( 62.79%) Stddev 79 35.96 ( 0.00%) 16.79 ( 53.31%) Stddev 110 59.04 ( 0.00%) 30.08 ( 49.04%) Stddev 141 85.38 ( 0.00%) 47.05 ( 44.89%) Stddev 160 96.38 ( 0.00%) 47.27 ( 50.95%) That is still looking good and the variance is reduced quite a bit. 
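For reference, the Hmean/Amean/Stddev rows follow the usual mmtests convention (an assumption, since the report format is not defined in this thread): arithmetic mean for durations, harmonic mean for rate-style metrics such as throughput, because a harmonic mean is dragged down by slow outliers rather than hidden by fast ones. A minimal sketch of the two statistics:

```c
#include <math.h>
#include <stddef.h>

/* Arithmetic mean: appropriate for durations such as loadfile times. */
static double amean(const double *v, size_t n)
{
	double sum = 0.0;
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Harmonic mean: appropriate for rates such as throughput, where one
 * slow sample should pull the summary down noticeably. */
static double hmean(const double *v, size_t n)
{
	double inv = 0.0;
	for (size_t i = 0; i < n; i++)
		inv += 1.0 / v[i];
	return (double)n / inv;
}
```

For the samples {2, 4} the arithmetic mean is 3 while the harmonic mean is 8/3, illustrating how the harmonic mean weights the slower sample more heavily.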
Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile. This one is tricky because dbench makes no effort to synchronise clients so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results. dbench4 All Clients Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%) Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%) Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%) Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%) Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%) Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%) Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%) Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%) Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%) Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%) Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%) Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%) Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%) Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%) Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%) Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%) Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%) Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%) Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%) Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%) Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%) Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%) This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but this is still important. It's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench. 
hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%) Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%) Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%) Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%) Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%) Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%* Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%* Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%* Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%* Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%* Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%* Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%* Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%) Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%* Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%* Processes using pipes are impacted but the variance (not presented) is bad enough that it's close to noise. These results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%) Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%) Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%) Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%) Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%) Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%) Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%) Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%) Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%) Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%) Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%) Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%) Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%) Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%) Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%) As schbench has been mentioned in numerous bugs recently, the results are interesting. 
A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnextr1 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%* Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%* Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%* Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnextr1 Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%* Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%) Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 117 insertions(+), 20 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..158e0430449b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. 
+ */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static void set_next_buddy(struct sched_entity *se); + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follows fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if (!sched_feat(NEXT_BUDDY) || + (wake_flags & WF_FORK) || + (pse->sched_delayed)) { + return PREEMPT_WAKEUP_NONE; + } + + /* Reschedule if waker is no longer eligible. */ + if (!entity_eligible(cfs_rq, se)) + return PREEMPT_WAKEUP_RESCHED; + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The downside is that the older buddy may be cache cold + * but that is unpredictable whereas an earlier deadline + * is absolute. + */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NONE; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if (wake_flags & WF_SYNC) { + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; + + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); + + if ((s64)delta < 0) + delta = 0; + + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. 
+ */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. + */ + if (pse_before && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + /* * Used by other classes to account runtime. */ @@ -5512,16 +5607,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness. - */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. 
@@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + do_preempt_short = PREEMPT_WAKEUP_NEXT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. + */ + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se); + } + + switch (do_preempt_short) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) update_protect_slice(cfs_rq, se); return; -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
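The WF_SYNC handling in __do_preempt_buddy() above boils down to a small pure decision on the waker's accumulated runtime. The following is a hypothetical standalone model of it, not the kernel code: sysctl_sched_migration_cost defaults to 500000ns in the kernel, and the >>2 divisor is the patch's arbitrary reduction for WF_RQ_SELECTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_MIGRATION_COST_NS	500000ULL	/* kernel default */

/*
 * Model of the WF_SYNC throttle: a sync wakee may preempt only if it
 * sorts before the waker by deadline AND the waker has run long enough
 * that any batch of wakeups has probably been issued already.
 */
static bool sync_wakeup_preempts(bool wakee_before_waker,
				 uint64_t waker_runtime_ns,
				 bool rq_selected)
{
	uint64_t threshold = SCHED_MIGRATION_COST_NS;

	/* Tasks stacked on one CPU: preempt sooner, as the waker more
	 * likely needs the wakee to finish. */
	if (rq_selected)
		threshold >>= 2;

	/* WF_SYNC is a suggestion, not a contract: allow some runtime
	 * for batched wakeups before rescheduling. */
	return wakee_before_waker && waker_runtime_ns >= threshold;
}
```

With the default threshold, a waker that has run for only 100us keeps running after a sync wakeup, while one that has run 500us or more (or 125us when the tasks are stacked) is preempted.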
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-10-28 15:05 ` Peter Zijlstra 2025-10-31 9:46 ` Mel Gorman 2025-10-28 15:09 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:05 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: Still going through this; just a few early comments. > kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- > 1 file changed, 117 insertions(+), 20 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bc0b7ce8a65d..158e0430449b 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) > return delta_exec; > } > > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; > + > +static void set_next_buddy(struct sched_entity *se); > + > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. 
*/ > + if (!entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} All this seems weirdly placed inside the file. Is there a reason this is placed so far away from its only caller? > /* > * Used by other classes to account runtime. 
> */ > @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > } > > -static void set_next_buddy(struct sched_entity *se); > - > /* > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > * failing half-way through and resume the dequeue later. > @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; naming seems off; I'm not sure what this still has to do with short. Perhaps just preempt_action or whatever? > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } > > @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. 
> */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (do_preempt_short) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) > update_protect_slice(cfs_rq, se); WAKEUP_NONE did a return above, I don't think you can get here with WAKEUP_NONE, making the above condition always true. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:05 ` Peter Zijlstra @ 2025-10-31 9:46 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 9:46 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:05:45PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- > > 1 file changed, 117 insertions(+), 20 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index bc0b7ce8a65d..158e0430449b 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) > > return delta_exec; > > } > > > > +enum preempt_wakeup_action { > > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > > + * before rescheduling. > > + */ > > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > > +}; > > + > > <SNIP> > > All this seems weirdly placed inside the file. Is there a reason this is > placed so far away from its only caller? > No, I don't recall why I placed it there and whether it was simply required by an early prototype. > > /* > > * Used by other classes to account runtime. > > */ > > > @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > hrtick_update(rq); > > } > > > > -static void set_next_buddy(struct sched_entity *se); > > - > > /* > > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > > * failing half-way through and resume the dequeue later. 
> > @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > struct sched_entity *se = &donor->se, *pse = &p->se; > > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > > int cse_is_idle, pse_is_idle; > > - bool do_preempt_short = false; > > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; > > naming seems off; I'm not sure what this still has to do with short. > Perhaps just preempt_action or whatever? > Good as name as any, may change it to a more meaningful name once I go through the rest of the review. > > > > if (unlikely(se == pse)) > > return; > > @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > if (task_is_throttled(p)) > > return; > > > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > > - set_next_buddy(pse); > > - } > > - > > /* > > * We can come here with TIF_NEED_RESCHED already set from new task > > * wake up path. > > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * When non-idle entity preempt an idle entity, > > * don't give idle entity slice protection. > > */ > > - do_preempt_short = true; > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > goto preempt; > > } > > > > @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * If @p has a shorter slice than current and @p is eligible, override > > * current's slice protection in order to allow preemption. > > */ > > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > + } else { > > + /* > > + * If @p potentially is completing work required by current then > > + * consider preemption. 
> > + */ > > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > > + pse, se); > > + } > > + > > + switch (do_preempt_short) { > > + case PREEMPT_WAKEUP_NONE: > > + return; > > + case PREEMPT_WAKEUP_RESCHED: > > + goto preempt; > > + case PREEMPT_WAKEUP_NEXT: > > + break; > > + } > > > > /* > > * If @p has become the most eligible task, force preemption. > > @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > > goto preempt; > > > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) > > update_protect_slice(cfs_rq, se); > > WAKEUP_NONE did a return above, I don't think you can get here with > WAKEUP_NONE, making the above condition always true. Correct. This was not always the case during development but now the rename alone highlights issues beyond this oddity. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra @ 2025-10-28 15:09 ` Peter Zijlstra 2025-10-31 9:48 ` Mel Gorman 2025-10-28 15:33 ` Peter Zijlstra 2025-10-30 9:10 ` Peter Zijlstra 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:09 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; In pre-existing code that isn't modified by this patch, we have: if (do_preempt_short) Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make that explicit in the enum above. ^ permalink raw reply [flat|nested] 23+ messages in thread
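The hazard Peter points out — boolean tests like `if (do_preempt_short)` silently depending on the first enumerator being zero — can be shown in isolation. This is a standalone sketch, not the kernel code:

```c
#include <assert.h>

enum preempt_wakeup_action {
	PREEMPT_WAKEUP_NONE = 0,	/* explicit: truth tests rely on this */
	PREEMPT_WAKEUP_NEXT,
	PREEMPT_WAKEUP_RESCHED,
};

/*
 * An enum used in a boolean context tests "!= 0", so the "no action"
 * value must be pinned to zero for `if (action)` to keep meaning
 * "some action requested" as the enum grows.
 */
static int action_requested(enum preempt_wakeup_action a)
{
	return a != PREEMPT_WAKEUP_NONE;
}
```

Making the `= 0` explicit costs nothing and documents the invariant at the definition site instead of at every use.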
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:09 ` Peter Zijlstra @ 2025-10-31 9:48 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:09:51PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > +enum preempt_wakeup_action { > > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > > + * before rescheduling. > > + */ > > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > > +}; > > In pre-existing code that isn't modified by this patch, we have: > > if (do_preempt_short) > > Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make > that explicit in the enum above. Will do. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra 2025-10-28 15:09 ` Peter Zijlstra @ 2025-10-28 15:33 ` Peter Zijlstra 2025-10-28 15:47 ` Peter Zijlstra 2025-10-30 9:10 ` Peter Zijlstra 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:33 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } I'm confused, should this not be WAKEUP_RESCHED? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:33 ` Peter Zijlstra @ 2025-10-28 15:47 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:47 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:33:51PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * When non-idle entity preempt an idle entity, > > * don't give idle entity slice protection. > > */ > > - do_preempt_short = true; > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > goto preempt; > > } > > I'm confused, should this not be WAKEUP_RESCHED? It doesn't matter, you cannot end up with !do_preempt_short at preempt:, so that condition is always true and can thus be deleted, at which point the value of do_preempt_short is irrelevant and doesn't need to be set. A little something like so perhaps... --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8727,7 +8727,7 @@ static void set_next_buddy(struct sched_ } enum preempt_wakeup_action { - PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NONE = 0, /* No action on the buddy */ PREEMPT_WAKEUP_NEXT, /* Check next is most eligible * before rescheduling. 
*/ @@ -8814,7 +8814,7 @@ __do_preempt_buddy(struct rq *rq, struct */ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { - enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; struct task_struct *donor = rq->donor; struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); @@ -8863,7 +8863,6 @@ static void check_preempt_wakeup_fair(st * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8883,17 +8882,17 @@ static void check_preempt_wakeup_fair(st * current's slice protection in order to allow preemption. */ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { - do_preempt_short = PREEMPT_WAKEUP_NEXT; + preempt_action = PREEMPT_WAKEUP_NEXT; } else { /* * If @p potentially is completing work required by current then * consider preemption. */ - do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se); } - switch (do_preempt_short) { + switch (preempt_action) { case PREEMPT_WAKEUP_NONE: return; case PREEMPT_WAKEUP_RESCHED: @@ -8905,18 +8904,16 @@ static void check_preempt_wakeup_fair(st /* * If @p has become the most eligible task, force preemption. */ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, false) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) - cancel_protect_slice(se); - + cancel_protect_slice(se); resched_curr_lazy(rq); } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman ` (2 preceding siblings ...) 2025-10-28 15:33 ` Peter Zijlstra @ 2025-10-30 9:10 ` Peter Zijlstra 2025-10-31 10:27 ` Mel Gorman 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-30 9:10 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || This seems wrong, that would mean wakeup preemption gets killed the moment you disable NEXT_BUDDY, that can't be right. > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. */ > + if (!entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; That comment isn't accurate, unless you add: && in_task(). That is, if this is an interrupt doing the wakeup, it has nothing to do with current. > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; But if previously we set next and didn't preempt, we should try again, maybe it has more success now. 
That is, should this not be _NEXT? > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); s/|/&/ ? > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} Add to this that AFAICT your patch ends up doing: __pick_eevdf(.protect = false) == pse which unconditionally disables the slice protection feature. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-30 9:10 ` Peter Zijlstra @ 2025-10-31 10:27 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 10:27 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Thu, Oct 30, 2025 at 10:10:58AM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > +static inline enum preempt_wakeup_action > > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > > + struct sched_entity *pse, struct sched_entity *se) > > +{ > > + bool pse_before; > > + > > + /* > > + * Ignore wakee preemption on WF_WORK as it is less likely that > > + * there is shared data as exec often follow fork. Do not > > + * preempt for tasks that are sched_delayed as it would violate > > + * EEVDF to forcibly queue an ineligible task. > > + */ > > + if (!sched_feat(NEXT_BUDDY) || > > This seems wrong, that would mean wakeup preemption gets killed the > moment you disable NEXT_BUDDY, that can't be right. > Correct, the check is bogus. > > + (wake_flags & WF_FORK) || > > + (pse->sched_delayed)) { > > + return PREEMPT_WAKEUP_NONE; > > + } > > + > > + /* Reschedule if waker is no longer eligible. */ > > + if (!entity_eligible(cfs_rq, se)) > > + return PREEMPT_WAKEUP_RESCHED; > > That comment isn't accurate, unless you add: && in_task(). That is, if > this is an interrupt doing the wakeup, it has nothing to do with > current. > That was a complete oversight. > > + /* > > + * Keep existing buddy if the deadline is sooner than pse. > > + * The downside is that the older buddy may be cache cold > > + * but that is unpredictable where as an earlier deadline > > + * is absolute. 
> > + */ > > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > > + return PREEMPT_WAKEUP_NONE; > > But if previously we set next and didn't preempt, we should try again, > maybe it has more success now. That is, should this not be _NEXT? > The context of why the original buddy was set is now lost but you're right, it is more straight-forward to reconsider the old buddy. It's more in line with EEVDF objectives and cache residency and future hotness requires crystal ball instructions. > > + > > + set_next_buddy(pse); > > + > > + /* > > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > > + * strictly enforced because the hint is either misunderstood or > > + * multiple tasks must be woken up. > > + */ > > + pse_before = entity_before(pse, se); > > + if (wake_flags & WF_SYNC) { > > + u64 delta = rq_clock_task(rq) - se->exec_start; > > + u64 threshold = sysctl_sched_migration_cost; > > + > > + /* > > + * WF_SYNC without WF_TTWU is not expected so warn if it > > + * happens even though it is likely harmless. > > + */ > > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); > > s/|/&/ ? > Bah, thanks. > > + if ((s64)delta < 0) > > + delta = 0; > > + > > + /* > > + * WF_RQ_SELECTED implies the tasks are stacking on a > > + * CPU when they could run on other CPUs. Reduce the > > + * threshold before preemption is allowed to an > > + * arbitrary lower value as it is more likely (but not > > + * guaranteed) the waker requires the wakee to finish. > > + */ > > + if (wake_flags & WF_RQ_SELECTED) > > + threshold >>= 2; > > + > > + /* > > + * As WF_SYNC is not strictly obeyed, allow some runtime for > > + * batch wakeups to be issued. 
> > + */ > > + if (pse_before && delta >= threshold) > > + return PREEMPT_WAKEUP_RESCHED; > > + > > + return PREEMPT_WAKEUP_NONE; > > + } > > + > > + return PREEMPT_WAKEUP_NEXT; > > +} > > Add to this that AFAICT your patch ends up doing: > > __pick_eevdf(.protect = false) == pse > > which unconditionally disables the slice protection feature. > Yes, trying to converge PREEMPT_SHORT with NEXT_BUDDY during prototyping was a poor decision because it led to mistakes like this. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-12 12:25 Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman
Changes since v4
o Split out decisions into separate functions (peterz)
o Flow clarity (peterz)
Changes since v3
o Place new code near first consumer (peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduling issues recently, like many
others and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some discussions recently have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
Mel Gorman (2):
sched/fair: Enable scheduler feature NEXT_BUDDY
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 131 insertions(+), 23 deletions(-)
--
2.51.0
^ permalink raw reply	[flat|nested] 23+ messages in thread
[parent not found: <20251112122521.1331238-3-mgorman@techsingularity.net>]
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals [not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net> @ 2025-11-12 14:48 ` Peter Zijlstra 2025-11-13 8:26 ` Madadi Vineeth Reddy 2025-11-13 9:04 ` Mel Gorman 0 siblings, 2 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-12 14:48 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > + /* Prefer picking wakee soon if appropriate. */ > + if (sched_feat(NEXT_BUDDY) && > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > + > + /* > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > + * buddies are ignored as they may not be relevant to the > + * waker and less likely to be cache hot. > + */ > + if (wake_flags & WF_SYNC) > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > + } Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to depend on buddies. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra @ 2025-11-13 8:26 ` Madadi Vineeth Reddy 2025-11-13 9:04 ` Mel Gorman 1 sibling, 0 replies; 23+ messages in thread From: Madadi Vineeth Reddy @ 2025-11-13 8:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel, Madadi Vineeth Reddy Hi Peter, On 12/11/25 20:18, Peter Zijlstra wrote: > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > >> + /* Prefer picking wakee soon if appropriate. */ >> + if (sched_feat(NEXT_BUDDY) && >> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { >> + >> + /* >> + * Decide whether to obey WF_SYNC hint for a new buddy. Old >> + * buddies are ignored as they may not be relevant to the >> + * waker and less likely to be cache hot. >> + */ >> + if (wake_flags & WF_SYNC) >> + preempt_action = preempt_sync(rq, wake_flags, pse, se); >> + } > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > depend on buddies. IIUC, calling preempt_sync() without NEXT_BUDDY would force a reschedule after the threshold, but no buddy would be set. This means pick_eevdf() would run with normal EEVDF deadline selection, potentially picking a different task instead of the wakee. The forced reschedule would accomplish nothing for the wakee. That said, I see your point that the WF_SYNC threshold check could still reduce context switch overhead even without guaranteeing wakee selection. Thanks, Madadi Vineeth Reddy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra 2025-11-13 8:26 ` Madadi Vineeth Reddy @ 2025-11-13 9:04 ` Mel Gorman 2025-11-14 12:13 ` Peter Zijlstra 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2025-11-13 9:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote: > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > > > + /* Prefer picking wakee soon if appropriate. */ > > + if (sched_feat(NEXT_BUDDY) && > > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > > + > > + /* > > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > > + * buddies are ignored as they may not be relevant to the > > + * waker and less likely to be cache hot. > > + */ > > + if (wake_flags & WF_SYNC) > > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > > + } > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > depend on buddies. There isn't a direct relation, but there is an indirect one. I know from your previous review that you separated out the WF_SYNC but after a while, I did not find a good reason to separate it completely from NEXT_BUDDY. NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker relationship between two tasks and potentially share data that may still be cache resident after a context switch. WF_SYNC indicates there may be a strict relationship between those two tasks that the waker may need the wakee to do some work before it can make progress. If NEXT_BUDDY does not set cfs_rq->next in the current waking context then the wakee may only be picked next by coincidence under normal EEVDF rules. 
WF_SYNC could still reschedule if the wakee is not selected as a buddy but
the benefit, if any, would be marginal -- if the waker does not go to sleep
then the WF_SYNC contract is violated and if the data becomes cache cold
after a wakeup delay then the shared data may already be evicted from
cache. With NEXT_BUDDY, there is a chance that the cost of a reschedule
and/or a context switch will be offset by reduced overall latency (e.g.
fewer cache misses). Without NEXT_BUDDY, WF_SYNC may only incur costs due
to context switching.

I considered the possibility of WF_SYNC being applied if pse is already a
buddy due to yield or some other factor but there is no reason to assume
any shared data is still cache resident and it's not easy to reason about.
I considered applying WF_SYNC if pse was already set and using it as a
two-pass filter but again, there is no obvious benefit, nor a reason why
the second wakeup is more important than the first wakeup. I considered
WF_SYNC being applied if any buddy is set but it's not clear why a SYNC
wakeup between tasks A,B should instead pick C to run ASAP outside of the
normal EEVDF rules.

I think it's straight-forward if the logic is

o If NEXT_BUDDY sets the wakee as cfs_rq->next then schedule the wakee soon
o If the wakee is to be selected soon and WF_SYNC is also set then pick the
  wakee ASAP

but less straight-forward if

o If WF_SYNC is set, reschedule now and maybe the wakee will be picked,
  maybe the waker will run again, maybe something else will run and
  sometimes it'll be a gain overall.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-13 9:04 ` Mel Gorman @ 2025-11-14 12:13 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-14 12:13 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Thu, Nov 13, 2025 at 09:04:38AM +0000, Mel Gorman wrote: > On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote: > > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > > > > > + /* Prefer picking wakee soon if appropriate. */ > > > + if (sched_feat(NEXT_BUDDY) && > > > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > > > + > > > + /* > > > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > > > + * buddies are ignored as they may not be relevant to the > > > + * waker and less likely to be cache hot. > > > + */ > > > + if (wake_flags & WF_SYNC) > > > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > > > + } > > > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > > depend on buddies. > > There isn't a direct relation, but there is an indirect one. I know from > your previous review that you separated out the WF_SYNC but after a while, > I did not find a good reason to separate it completely from NEXT_BUDDY. > > NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker > relationship between two tasks and potentially share data that may still > be cache resident after a context switch. WF_SYNC indicates there may be > a strict relationship between those two tasks that the waker may need the > wakee to do some work before it can make progress. If NEXT_BUDDY does not > set cfs_rq->next in the current waking context then the wakee may only be > picked next by coincidence under normal EEVDF rules. Aah, fair enough. Perhaps the comment could've been clearer but whatever. 
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-03 11:04 Mel Gorman
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Mel Gorman
Changes since v3
o Place new code near first consumer (peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduling issues recently, like many
others, and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some discussions recently have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
kernel/sched/fair.c | 145 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 124 insertions(+), 23 deletions(-)
--
2.51.0
^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 11:04 [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
@ 2025-11-03 11:04 ` Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
0 siblings, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event multiple
buddies could be considered, the one with the earliest deadline is
selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated as
a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
preemption is throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before
preemption. For all other wakeups, preemption happens if the wakee has an
earlier deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption. First is the dbench throughput data; even though it
is a poor metric, it is the default one. The test machine is a 2-socket
machine and the backing filesystem is XFS as a lot of the IO work is
dispatched to kernel threads. It's important to note that these results
are not representative across all machines, especially Zen machines, as
different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Hmean     1      1268.80 (   0.00%)     1268.92 (   0.01%)
Hmean     4      3971.74 (   0.00%)     3952.44 (  -0.49%)
Hmean     7      5548.23 (   0.00%)     5375.49 *  -3.11%*
Hmean     12     7310.86 (   0.00%)     7080.80 (  -3.15%)
Hmean     21     8874.53 (   0.00%)     9044.30 (   1.91%)
Hmean     30     9361.93 (   0.00%)    10430.80 *  11.42%*
Hmean     48     9540.14 (   0.00%)    11786.28 *  23.54%*
Hmean     79     9208.74 (   0.00%)    11910.07 *  29.33%*
Hmean     110    8573.12 (   0.00%)    11571.03 *  34.97%*
Hmean     141    7791.33 (   0.00%)    11151.69 *  43.13%*
Hmean     160    7666.60 (   0.00%)    10661.53 *  39.06%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        14.62 (   0.00%)       14.69 (  -0.50%)
Amean     4        18.76 (   0.00%)       18.88 (  -0.63%)
Amean     7        23.71 (   0.00%)       24.52 (  -3.43%)
Amean     12       31.25 (   0.00%)       32.20 (  -3.03%)
Amean     21       45.12 (   0.00%)       44.19 (   2.05%)
Amean     30       61.07 (   0.00%)       54.74 (  10.35%)
Amean     48       95.91 (   0.00%)       77.52 (  19.18%)
Amean     79      163.38 (   0.00%)      126.04 (  22.85%)
Amean     110     243.91 (   0.00%)      179.93 (  26.23%)
Amean     141     343.47 (   0.00%)      240.03 (  30.12%)
Amean     160     401.15 (   0.00%)      284.92 (  28.98%)
That is still looking good and the variance is reduced quite a bit with
the caveat that different CPU families may yield different results.
Fairness is a concern so the next report tracks how many milliseconds it
takes for all clients to complete a workfile. This one is tricky because
dbench makes no effort to synchronise clients so the durations at
benchmark start time differ substantially from typical runtimes. Depending
on the configuration, the benchmark can be dominated by the timing of
fsync of different clients. This problem could be mitigated by warming up
the benchmark for a number of minutes but it's a matter of opinion whether
that counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        15.06 (   0.00%)       15.10 (  -0.27%)
Amean     4       603.81 (   0.00%)     1451.94 (-140.46%)
Amean     7       855.32 (   0.00%)     1116.68 ( -30.56%)
Amean     12     1890.02 (   0.00%)     2075.98 (  -9.84%)
Amean     21     3195.23 (   0.00%)     2508.86 (  21.48%)
Amean     30    13919.53 (   0.00%)     3324.58 (  76.12%)
Amean     48    25246.07 (   0.00%)     4146.10 (  83.58%)
Amean     79    29701.84 (   0.00%)    21338.68 (  28.16%)
Amean     110   22803.03 (   0.00%)    29502.47 ( -29.38%)
Amean     141   36356.07 (   0.00%)    25292.07 (  30.43%)
Amean     160   17046.71 (   0.00%)    27043.67 ( -58.64%)
Stddev    1         0.47 (   0.00%)        0.46 (   2.84%)
Stddev    4       395.24 (   0.00%)      782.96 ( -98.10%)
Stddev    7       467.24 (   0.00%)      738.26 ( -58.00%)
Stddev    12     1071.43 (   0.00%)     1124.78 (  -4.98%)
Stddev    21     1694.50 (   0.00%)     1463.27 (  13.65%)
Stddev    30     7945.63 (   0.00%)     1889.19 (  76.22%)
Stddev    48    14339.51 (   0.00%)     2226.59 (  84.47%)
Stddev    79    16620.91 (   0.00%)    12027.14 (  27.64%)
Stddev    110   12912.15 (   0.00%)    16601.77 ( -28.57%)
Stddev    141   20700.13 (   0.00%)    13624.50 (  34.18%)
Stddev    160    9079.16 (   0.00%)    15729.63 ( -73.25%)
This is more of a mixed bag but it at least shows that fairness is not
crippled. The hackbench results are more neutral but this is still
important. It's possible to boost the dbench figures by a large amount but
only by crippling the performance of a workload like hackbench.
hackbench-process-pipes
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        0.2657 (   0.00%)      0.2120 (  20.20%)
Amean     4        0.6107 (   0.00%)      0.6120 (  -0.22%)
Amean     7        0.7923 (   0.00%)      0.7307 (   7.78%)
Amean     12       1.1500 (   0.00%)      1.1480 (   0.17%)
Amean     21       1.7950 (   0.00%)      1.8207 (  -1.43%)
Amean     30       2.3207 (   0.00%)      2.5340 *  -9.19%*
Amean     48       3.5023 (   0.00%)      4.0047 * -14.34%*
Amean     79       4.8093 (   0.00%)      5.3137 * -10.49%*
Amean     110      6.1160 (   0.00%)      6.5287 *  -6.75%*
Amean     141      7.4763 (   0.00%)      7.9553 *  -6.41%*
Amean     172      8.9560 (   0.00%)      9.3977 *  -4.93%*
Amean     203     10.4783 (   0.00%)     11.0160 *  -5.13%*
Amean     234     12.4977 (   0.00%)     13.5510 (  -8.43%)
Amean     265     14.7003 (   0.00%)     15.6950 *  -6.77%*
Amean     296     16.1007 (   0.00%)     17.0860 *  -6.12%*
Processes using pipes are impacted but the variance (not presented) is bad
enough that it's close to noise. These results are not always
reproducible. If executed across multiple reboots, it may show neutral or
small gains so the worst measured results are presented. Hackbench using
sockets is more reliably neutral as the wakeup mechanisms are different
between sockets and pipes.
hackbench-process-sockets
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        0.3073 (   0.00%)      0.3187 (  -3.69%)
Amean     4        0.7863 (   0.00%)      0.7850 (   0.17%)
Amean     7        1.3670 (   0.00%)      1.3747 (  -0.56%)
Amean     12       2.1337 (   0.00%)      2.1450 (  -0.53%)
Amean     21       3.4683 (   0.00%)      3.4293 (   1.12%)
Amean     30       4.7247 (   0.00%)      4.8000 (  -1.59%)
Amean     48       7.6097 (   0.00%)      7.6943 (  -1.11%)
Amean     79      14.7957 (   0.00%)     15.3353 (  -3.65%)
Amean     110     21.3413 (   0.00%)     21.5753 (  -1.10%)
Amean     141     29.0503 (   0.00%)     28.8740 (   0.61%)
Amean     172     36.4660 (   0.00%)     35.7207 (   2.04%)
Amean     203     39.7177 (   0.00%)     39.7343 (  -0.04%)
Amean     234     42.1120 (   0.00%)     43.0177 (  -2.15%)
Amean     265     45.7830 (   0.00%)     46.9863 (  -2.63%)
Amean     296     50.7043 (   0.00%)     50.7017 (   0.01%)
As schbench has been mentioned in numerous bugs recently, the results are
interesting.
A test case that represents the default schbench behaviour is:
schbench Wakeup Latency (usec)
                             6.18.0-rc1             6.18.0-rc1
                                vanilla   sched-preemptnext-v4
Amean Wakeup-50th-80       7.17 (   0.00%)        6.00 *  16.28%*
Amean Wakeup-90th-80      46.56 (   0.00%)       19.61 *  57.88%*
Amean Wakeup-99th-80     119.61 (   0.00%)       87.28 *  27.03%*
Amean Wakeup-99.9th-80  3193.78 (   0.00%)      367.00 *  88.51%*
Amean Wakeup-max-80     3874.28 (   0.00%)     3951.39 (  -1.99%)
schbench Requests Per Second (ops/sec)
                         6.18.0-rc1               6.18.0-rc1
                            vanilla   sched-preemptnext-v4r1
Hmean RPS-20th-80    8900.91 (   0.00%)     9167.86 *   3.00%*
Hmean RPS-50th-80    8987.41 (   0.00%)     9213.06 *   2.51%*
Hmean RPS-90th-80    9123.73 (   0.00%)     9263.80 (   1.54%)
Hmean RPS-max-80     9193.50 (   0.00%)     9282.18 (   0.96%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 145 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 123 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..688bf010f444 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +static void set_next_buddy(struct sched_entity *se); + /* * Used by other classes to account runtime. */ @@ -5512,16 +5524,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness.
- */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7030,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se) } } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_SHORT, /* Override current's slice + * protection to allow + * preemption. + */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. + */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follow fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if ((wake_flags & WF_FORK) || pse->sched_delayed) + return PREEMPT_WAKEUP_NONE; + + /* Reschedule if waker is no longer eligible. */ + if (in_task() && !entity_eligible(cfs_rq, se)) + return PREEMPT_WAKEUP_RESCHED; + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The older buddy may be cache cold and completely unrelated + * to the current wakeup but that is unpredictable where as + * obeying the deadline is more in line with EEVDF objectives. 
+ */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NEXT; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if (wake_flags & WF_SYNC) { + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; + + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); + + if ((s64)delta < 0) + delta = 0; + + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. + */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (pse_before && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + + /* * Preempt the current task with a newly woken task if needed: */ @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + preempt_action = PREEMPT_WAKEUP_SHORT; goto preempt; } @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + preempt_action = PREEMPT_WAKEUP_SHORT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. 
+ */ + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se); + } + + switch (preempt_action) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_SHORT: + fallthrough; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. */ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) + if (preempt_action == PREEMPT_WAKEUP_SHORT) cancel_protect_slice(se); resched_curr_lazy(rq); -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
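The WF_SYNC handling in __do_preempt_buddy() above amounts to a small decision table: a sync wakee only forces a reschedule once the waker has accumulated runtime past a threshold, and the threshold shrinks when the tasks are stacked on one runqueue. The following userspace sketch models just that branch for illustration; the enum values, the flag bits and the function name are stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum preempt_wakeup_action {
	PREEMPT_WAKEUP_NONE,	/* No action on the buddy */
	PREEMPT_WAKEUP_NEXT,	/* Leave the decision to the picker */
	PREEMPT_WAKEUP_RESCHED,	/* Plain reschedule */
};

/* Stand-ins for the kernel's wake flags; the values are arbitrary here. */
#define WF_SYNC		0x01
#define WF_RQ_SELECTED	0x02

/*
 * Model of the WF_SYNC throttle: 'delta' is the waker's runtime since
 * exec_start, 'threshold' plays the role of sysctl_sched_migration_cost
 * and 'pse_before' stands in for entity_before(pse, se).
 */
static enum preempt_wakeup_action
sync_wakeup_action(unsigned int wake_flags, int64_t delta,
		   uint64_t threshold, bool pse_before)
{
	if (!(wake_flags & WF_SYNC))
		return PREEMPT_WAKEUP_NEXT;

	if (delta < 0)
		delta = 0;

	/* Stacked tasks: lower the bar before preemption is allowed. */
	if (wake_flags & WF_RQ_SELECTED)
		threshold >>= 2;

	if (pse_before && (uint64_t)delta >= threshold)
		return PREEMPT_WAKEUP_RESCHED;

	return PREEMPT_WAKEUP_NONE;
}
```

With a threshold of 500000ns, a sync wakee with an earlier deadline is only granted a reschedule after the waker has run for 500us, or 125us when WF_RQ_SELECTED indicates stacking.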
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-11-03 14:07 ` Peter Zijlstra 2025-11-03 14:14 ` Peter Zijlstra 2025-11-05 21:48 ` Madadi Vineeth Reddy 1 sibling, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-11-03 14:07 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Mon, Nov 03, 2025 at 11:04:45AM +0000, Mel Gorman wrote: > @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se) > } > } > > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_SHORT, /* Override current's slice > + * protection to allow > + * preemption. > + */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; Not really a fan of that comment style. While noodling, I've managed to 'accidentally' rename NEXT to PICK, since its more about letting __pick_eevdf() decide. > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_FORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > + return PREEMPT_WAKEUP_NONE; > + > + /* Reschedule if waker is no longer eligible. */ > + if (in_task() && !entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. 
> + * The older buddy may be cache cold and completely unrelated > + * to the current wakeup but that is unpredictable where as > + * obeying the deadline is more in line with EEVDF objectives. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NEXT; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} This seems to do 3 things: - that reschedule waker on !eligible - set the buddy (while losing NEXT_BUDDY) - the WF_SYNC thing Let's split that out. 
> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; I'm thinking NONE is the wrong default > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } This was the only NEXT_BUDDY site and none were returned in trade. > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + preempt_action = PREEMPT_WAKEUP_SHORT; > goto preempt; > } > > @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. > */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + preempt_action = PREEMPT_WAKEUP_SHORT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. 
> + */ > + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (preempt_action) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_SHORT: > + fallthrough; (this is redundant) > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > */ > - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY)) > update_protect_slice(cfs_rq, se); > > return; > > preempt: > - if (do_preempt_short) > + if (preempt_action == PREEMPT_WAKEUP_SHORT) > cancel_protect_slice(se); > > resched_curr_lazy(rq); Right, much better. But since I was noddling, how about something like so on top? --- --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8727,23 +8727,16 @@ static void set_next_buddy(struct sched_ } enum preempt_wakeup_action { - PREEMPT_WAKEUP_NONE, /* No action on the buddy */ - PREEMPT_WAKEUP_SHORT, /* Override current's slice - * protection to allow - * preemption. - */ - PREEMPT_WAKEUP_NEXT, /* Check next is most eligible - * before rescheduling. - */ - PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ + PREEMPT_WAKEUP_NONE, /* No preemption. */ + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */ + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */ + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. 
*/ }; -static inline enum preempt_wakeup_action -__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, - struct sched_entity *pse, struct sched_entity *se) +static inline void +set_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) { - bool pse_before; - /* * Ignore wakee preemption on WF_FORK as it is less likely that * there is shared data as exec often follow fork. Do not @@ -8751,11 +8744,7 @@ __do_preempt_buddy(struct rq *rq, struct * EEVDF to forcibly queue an ineligible task. */ if ((wake_flags & WF_FORK) || pse->sched_delayed) - return PREEMPT_WAKEUP_NONE; - - /* Reschedule if waker is no longer eligible. */ - if (in_task() && !entity_eligible(cfs_rq, se)) - return PREEMPT_WAKEUP_RESCHED; + return; /* * Keep existing buddy if the deadline is sooner than pse. @@ -8764,63 +8753,62 @@ __do_preempt_buddy(struct rq *rq, struct * obeying the deadline is more in line with EEVDF objectives. */ if (cfs_rq->next && entity_before(cfs_rq->next, pse)) - return PREEMPT_WAKEUP_NEXT; + return; set_next_buddy(pse); +} - /* - * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not - * strictly enforced because the hint is either misunderstood or - * multiple tasks must be woken up. - */ - pse_before = entity_before(pse, se); - if (wake_flags & WF_SYNC) { - u64 delta = rq_clock_task(rq) - se->exec_start; - u64 threshold = sysctl_sched_migration_cost; - - /* - * WF_SYNC without WF_TTWU is not expected so warn if it - * happens even though it is likely harmless. - */ - WARN_ON_ONCE(!(wake_flags & WF_TTWU)); +/* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. 
+ */ +static inline enum preempt_wakeup_action +preempt_sync(struct rq *rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; - if ((s64)delta < 0) - delta = 0; + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); - /* - * WF_RQ_SELECTED implies the tasks are stacking on a - * CPU when they could run on other CPUs. Reduce the - * threshold before preemption is allowed to an - * arbitrary lower value as it is more likely (but not - * guaranteed) the waker requires the wakee to finish. - */ - if (wake_flags & WF_RQ_SELECTED) - threshold >>= 2; + if ((s64)delta < 0) + delta = 0; - /* - * As WF_SYNC is not strictly obeyed, allow some runtime for - * batch wakeups to be issued. - */ - if (pse_before && delta >= threshold) - return PREEMPT_WAKEUP_RESCHED; + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. + */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; - return PREEMPT_WAKEUP_NONE; - } + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (entity_before(pse, se) && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; - return PREEMPT_WAKEUP_NEXT; + return PREEMPT_WAKEUP_NONE; } - /* * Preempt the current task with a newly woken task if needed: */ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; struct task_struct *donor = rq->donor; struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8886,26 +8874,40 @@ static void check_preempt_wakeup_fair(st */ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { preempt_action = PREEMPT_WAKEUP_SHORT; - } else { - /* - * If @p potentially is completing work required by current then - * consider preemption. - */ - preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, - pse, se); + goto pick; + } + /* + * If @p potentially is completing work required by current + * then consider preemption. + */ + if (in_task() && !entity_eligible(cfs_rq, se)) { + /* Reschedule if waker is no longer eligible. */ + preempt_action = PREEMPT_WAKEUP_RESCHED; + goto preempt; + + } + + if (sched_feat(NEXT_BUDDY)) + set_preempt_buddy(rq, cfs_rq, wake_flags, pse, se); + + if (wake_flags & WF_SYNC) + preempt_action = preempt_sync(rq, wake_flags, pse, se); + switch (preempt_action) { case PREEMPT_WAKEUP_NONE: return; + case PREEMPT_WAKEUP_RESCHED: goto preempt; + case PREEMPT_WAKEUP_SHORT: - fallthrough; - case PREEMPT_WAKEUP_NEXT: - break; + case PREEMPT_WAKEUP_PICK: + goto pick; } +pick: /* * If @p has become the most eligible task, force preemption. */ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 14:07 ` Peter Zijlstra @ 2025-11-03 14:14 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-03 14:14 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Mon, Nov 03, 2025 at 03:07:11PM +0100, Peter Zijlstra wrote: > > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > struct sched_entity *se = &donor->se, *pse = &p->se; > > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > > int cse_is_idle, pse_is_idle; > > - bool do_preempt_short = false; > > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; > > I'm thinking NONE is the wrong default That was unfinished, kid interrupted me at an inopportune moment and I lost my train ;-) Anyway, the current code defaults to what will be 'pick'. And I suppose we could make the default depend on WAKEUP_PREEMPT but meh. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-11-03 14:07 ` Peter Zijlstra @ 2025-11-05 21:48 ` Madadi Vineeth Reddy 2025-11-07 8:53 ` Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Madadi Vineeth Reddy @ 2025-11-05 21:48 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel, Madadi Vineeth Reddy Hi Mel, On 03/11/25 16:34, Mel Gorman wrote: > Reimplement NEXT_BUDDY preemption to take into account the deadline and > eligibility of the wakee with respect to the waker. In the event > multiple buddies could be considered, the one with the earliest deadline > is selected. > > Sync wakeups are treated differently to every other type of wakeup. The > WF_SYNC assumption is that the waker promises to sleep in the very near > future. This is violated in enough cases that WF_SYNC should be treated > as a suggestion instead of a contract. If a waker does go to sleep almost > immediately then the delay in wakeup is negligible. In all other cases, > it's throttled based on the accumulated runtime of the waker so there is > a chance that some batched wakeups have been issued before preemption. [..snip..] > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_FORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > + return PREEMPT_WAKEUP_NONE; > + > + /* Reschedule if waker is no longer eligible. 
*/ > + if (in_task() && !entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The older buddy may be cache cold and completely unrelated > + * to the current wakeup but that is unpredictable where as > + * obeying the deadline is more in line with EEVDF objectives. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NEXT; IIUC, the logic attempts to maintain deadline ordering among buddies, but this doesn't address tasks already on the runqueue. So with frequent wakeups, queued tasks (even with earlier deadlines) may be unfairly delayed. I understand that this would fade away quickly as the woken up task that got to run due to buddy preference would accumulate negative lag and would not be eligible to run again, but the starvation could be higher if wakeups are very high. To test this, I ran schbench (many message and worker threads) together with stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by around 64%. This shows a significant regression for CPU-bound tasks under heavy wakeup loads. Thoughts? I also ran schbench and hackbench. All these were run on a Power11 System containing 4 sockets and 160 CPUs spread across 4 NUMA nodes. schbench(new) 99.0th latency (lower is better) ======== load baseline[pct imp](std%) With patch[pct imp]( std%) 20mt, 10wt 1.00 [ 0.00]( 0.24) 0.97 [ +3.00]( 0.18) 20mt, 20wt 1.00 [ 0.00]( 0.33) 1.00 [ 0.00]( 0.12) 20mt, 40wt 1.00 [ 0.00]( 2.84) 0.76 [ +24.0]( 0.32) 20mt, 80wt 1.00 [ 0.00]( 3.66) 0.66 [ +34.0]( 0.72) 20mt, 160wt 1.00 [ 0.00](12.92) 0.88 [ +12.0]( 6.77) mt=message threads ; wt=worker threads schbench being a wakeup sensitive workload showed good improvement. 
hackbench (lower is better) ======== case load baseline[pct imp](std%) With patch[pct imp]( std%) process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50) process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27) process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78) process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86) process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03) process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67) process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64) process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00) process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21) process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26) process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63) process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48) process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36) process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21) Didn't notice significant improvement or regression in Hackbench. Mostly in the noise range. Thanks, Madadi Vineeth Reddy > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. 
Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} > + > + > /* > * Preempt the current task with a newly woken task if needed: > */ > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + preempt_action = PREEMPT_WAKEUP_SHORT; > goto preempt; > } > > @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. 
> */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + preempt_action = PREEMPT_WAKEUP_SHORT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (preempt_action) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_SHORT: > + fallthrough; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > */ > - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY)) > update_protect_slice(cfs_rq, se); > > return; > > preempt: > - if (do_preempt_short) > + if (preempt_action == PREEMPT_WAKEUP_SHORT) > cancel_protect_slice(se); > > resched_curr_lazy(rq); ^ permalink raw reply [flat|nested] 23+ messages in thread
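The buddy-replacement rule quoted in the review above can be stated as a tiny pure function: a newly woken task only becomes cfs_rq->next if the existing buddy does not already have an earlier virtual deadline. A userspace sketch, with an invented struct in place of sched_entity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ent {
	int64_t deadline;	/* virtual deadline; lower expires sooner */
};

/* Analogue of entity_before(): the earlier virtual deadline wins. */
static bool ent_before(const struct ent *a, const struct ent *b)
{
	return (a->deadline - b->deadline) < 0;
}

/*
 * Sketch of the keep-or-replace decision: the existing buddy is kept
 * when its deadline is sooner than the wakee's, otherwise the wakee
 * becomes the next buddy.
 */
static const struct ent *next_buddy(const struct ent *cur_next,
				    const struct ent *pse)
{
	if (cur_next && ent_before(cur_next, pse))
		return cur_next;
	return pse;
}
```

Note this only orders buddies among themselves; as discussed in the thread, tasks already queued are protected by the eligibility check at pick time, not by this comparison.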
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-05 21:48 ` Madadi Vineeth Reddy @ 2025-11-07 8:53 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-11-07 8:53 UTC (permalink / raw) To: Madadi Vineeth Reddy Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Thu, Nov 06, 2025 at 03:18:30AM +0530, Madadi Vineeth Reddy wrote: > Hi Mel, > > On 03/11/25 16:34, Mel Gorman wrote: > > Reimplement NEXT_BUDDY preemption to take into account the deadline and > > eligibility of the wakee with respect to the waker. In the event > > multiple buddies could be considered, the one with the earliest deadline > > is selected. > > > > Sync wakeups are treated differently to every other type of wakeup. The > > WF_SYNC assumption is that the waker promises to sleep in the very near > > future. This is violated in enough cases that WF_SYNC should be treated > > as a suggestion instead of a contract. If a waker does go to sleep almost > > immediately then the delay in wakeup is negligible. In all other cases, > > it's throttled based on the accumulated runtime of the waker so there is > > a chance that some batched wakeups have been issued before preemption. > > [..snip..] > > > +static inline enum preempt_wakeup_action > > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > > + struct sched_entity *pse, struct sched_entity *se) > > +{ > > + bool pse_before; > > + > > + /* > > + * Ignore wakee preemption on WF_FORK as it is less likely that > > + * there is shared data as exec often follow fork. Do not > > + * preempt for tasks that are sched_delayed as it would violate > > + * EEVDF to forcibly queue an ineligible task. > > + */ > > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > > + return PREEMPT_WAKEUP_NONE; > > + > > + /* Reschedule if waker is no longer eligible. 
*/
> > +	if (in_task() && !entity_eligible(cfs_rq, se))
> > +		return PREEMPT_WAKEUP_RESCHED;
> > +
> > +	/*
> > +	 * Keep existing buddy if the deadline is sooner than pse.
> > +	 * The older buddy may be cache cold and completely unrelated
> > +	 * to the current wakeup but that is unpredictable where as
> > +	 * obeying the deadline is more in line with EEVDF objectives.
> > +	 */
> > +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > +		return PREEMPT_WAKEUP_NEXT;
>
> IIUC, the logic attempts to maintain deadline ordering among buddies, but
> this doesn't address tasks already on the runqueue.
>

It's addressed in that a buddy is only selected if it is eligible to
run. Buddies in this context are receiving preferential treatment in
terms of ordering but the other tasks on the queue should not be starved
either.

> So with frequent wakeups, queued tasks (even with earlier deadlines) may be
> unfairly delayed. I understand that this would fade away quickly as the
> woken up task that got to run due to buddy preference would accumulate negative
> lag and would not be eligible to run again, but the starvation could be higher if
> wakeups are very high.
>

They shouldn't get starved as such, only delayed as the buddies become
eligible while other tasks on the runqueue have positive lag.

> To test this, I ran schbench (many message and worker threads) together with
> stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
> around 64%.
>

Stress-NG bogo-ops are, by definition, bogus ops. The amount of work
executed depends on the stressor and the timing of when they execute.
Hence, a drop of 64% may or may not matter to the general case because
the drop may be due to a different mix of "operations", some of which
may take 1ms and others that take a minute but are both "1 operation".
I 100% accept that NEXT_BUDDY can affect timings of workloads but
stress-ng is not the best workload to use for performance testing at all
because the bogo-ops metric is, by definition, bogus. There may be a
good reason to revisit if PICK_BUDDY should have been moved to
__pick_eevdf or if PICK_BUDDY should be applied if the slice is not
protected but stress-ng is a terrible workload to justify a decision
either way.

> I also ran schbench and hackbench. All these were run on a Power11 System
> containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.
>
> schbench(new) 99.0th latency (lower is better)
> ========
> load		baseline[pct imp](std%)	With patch[pct imp]( std%)
> 20mt, 10wt	1.00 [ 0.00]( 0.24)	0.97 [ +3.00]( 0.18)
> 20mt, 20wt	1.00 [ 0.00]( 0.33)	1.00 [  0.00]( 0.12)
> 20mt, 40wt	1.00 [ 0.00]( 2.84)	0.76 [ +24.0]( 0.32)
> 20mt, 80wt	1.00 [ 0.00]( 3.66)	0.66 [ +34.0]( 0.72)
> 20mt, 160wt	1.00 [ 0.00](12.92)	0.88 [ +12.0]( 6.77)
>
> mt=message threads ; wt=worker threads
>
> schbench being a wakeup sensitive workload showed good improvement.
>

Good news because NEXT_BUDDY is primarily about prioritising an eligible
wakee over another eligible task to preserve hotness.
> > hackbench (lower is better) > ======== > case load baseline[pct imp](std%) With patch[pct imp]( std%) > process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50) > process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27) > process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78) > process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86) > process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03) > process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67) > process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64) > > process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00) > process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21) > process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26) > process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63) > process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48) > process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36) > process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21) > > Didn't notice significant improvement or regression in Hackbench. Mostly in the noise > range. > Expected for hackbench because the degree of overload is so generally high and cache hotness has limited benefit for it as so little data is shared. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
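The eligibility argument in the reply above (buddies are only picked while their lag is non-negative) can be made concrete. Under EEVDF, an entity is eligible when its vruntime does not exceed the load-weighted average vruntime of the queue. The sketch below illustrates that definition with an invented data layout; the kernel's entity_eligible() performs the equivalent comparison incrementally on cfs_rq state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct entity {
	int64_t vruntime;
	int64_t weight;
};

/*
 * Eligible iff vruntime <= weighted average vruntime, i.e. lag >= 0.
 * Multiply through by the total weight to avoid division.
 */
static bool entity_eligible_sketch(const struct entity *queue, int nr,
				   const struct entity *se)
{
	int64_t wsum = 0, w = 0;

	for (int i = 0; i < nr; i++) {
		wsum += queue[i].vruntime * queue[i].weight;
		w += queue[i].weight;
	}

	return w > 0 && se->vruntime * w <= wsum;
}
```

A task that overruns via buddy preference pushes its vruntime past the average, fails this test, and cannot be picked again until the average catches up, which is why the expected effect is delay rather than starvation.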
* [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2
@ 2025-10-21 14:28 Mel Gorman
  2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason, Mel Gorman

I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as

1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads

Taking topology into account means that there is a lot of
machine-specific behaviour, which may explain why some recent
discussions have had reproduction problems. Nevertheless, the removal of
LAST_BUDDY and the disabling of NEXT_BUDDY have an impact on wakeup
latencies. This RFC is to determine whether it is a valid approach to
prefer selecting a wakee if it is eligible to run even though other,
unrelated tasks are more eligible.

 kernel/sched/fair.c     | 131 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   2 +-
 2 files changed, 112 insertions(+), 21 deletions(-)

--
2.51.0

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
@ 2025-10-21 14:28 ` Mel Gorman
  2025-10-23  6:29   ` K Prateek Nayak
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman

Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected.

Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run.

While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a poor metric but it is the default one. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Hmean     1     1268.80 (  0.00%)     1285.93 *   1.35%*
Hmean     4     3971.74 (  0.00%)     3851.10 *  -3.04%*
Hmean     7     5548.23 (  0.00%)     5507.07 (  -0.74%)
Hmean     12    7310.86 (  0.00%)     7205.37 (  -1.44%)
Hmean     21    8874.53 (  0.00%)     9193.66 *   3.60%*
Hmean     30    9361.93 (  0.00%)    10552.03 *  12.71%*
Hmean     48    9540.14 (  0.00%)    11936.22 *  25.12%*
Hmean     79    9208.74 (  0.00%)    12367.06 *  34.30%*
Hmean     110   8573.12 (  0.00%)    12114.01 *  41.30%*
Hmean     141   7791.33 (  0.00%)    11391.60 *  46.21%*
Hmean     160   7666.60 (  0.00%)    10789.23 *  40.73%*

As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Amean     1       14.62 (  0.00%)       14.21 (   2.80%)
Amean     4       18.76 (  0.00%)       19.45 (  -3.67%)
Amean     7       23.71 (  0.00%)       23.82 (  -0.48%)
Amean     12      31.25 (  0.00%)       31.78 (  -1.69%)
Amean     21      45.12 (  0.00%)       43.53 (   3.52%)
Amean     30      61.07 (  0.00%)       54.13 (  11.37%)
Amean     48      95.91 (  0.00%)       76.51 (  20.23%)
Amean     79     163.38 (  0.00%)      121.46 (  25.66%)
Amean     110    243.91 (  0.00%)      172.56 (  29.25%)
Amean     141    343.47 (  0.00%)      236.07 (  31.27%)
Amean     160    401.15 (  0.00%)      283.64 (  29.29%)
Stddev    1        0.52 (  0.00%)        0.44 (  15.50%)
Stddev    4        1.36 (  0.00%)        1.42 (  -4.91%)
Stddev    7        1.88 (  0.00%)        1.84 (   2.27%)
Stddev    12       3.06 (  0.00%)        2.55 (  16.57%)
Stddev    21       5.78 (  0.00%)        3.70 (  35.90%)
Stddev    30       9.85 (  0.00%)        5.10 (  48.26%)
Stddev    48      22.31 (  0.00%)        8.30 (  62.79%)
Stddev    79      35.96 (  0.00%)       16.79 (  53.31%)
Stddev    110     59.04 (  0.00%)       30.08 (  49.04%)
Stddev    141     85.38 (  0.00%)       47.05 (  44.89%)
Stddev    160     96.38 (  0.00%)       47.27 (  50.95%)

That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile.
This one is tricky because dbench makes no effort to synchronise clients, so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Amean     1       14.93 (  0.00%)       14.91 (   0.11%)
Amean     4      348.88 (  0.00%)      277.06 (  20.59%)
Amean     7      722.94 (  0.00%)      991.70 ( -37.18%)
Amean     12    2055.72 (  0.00%)     2684.48 ( -30.59%)
Amean     21    4393.85 (  0.00%)     2625.79 (  40.24%)
Amean     30    6119.84 (  0.00%)     2491.15 (  59.29%)
Amean     48   20600.85 (  0.00%)     6717.61 (  67.39%)
Amean     79   22677.38 (  0.00%)    21866.80 (   3.57%)
Amean     110  35937.71 (  0.00%)    22517.63 (  37.34%)
Amean     141  25104.66 (  0.00%)    29897.08 ( -19.09%)
Amean     160  23843.74 (  0.00%)    23106.66 (   3.09%)
Stddev    1        0.50 (  0.00%)        0.46 (   6.67%)
Stddev    4      201.33 (  0.00%)      130.13 (  35.36%)
Stddev    7      471.94 (  0.00%)      641.69 ( -35.97%)
Stddev    12    1401.94 (  0.00%)     1750.14 ( -24.84%)
Stddev    21    2519.12 (  0.00%)     1416.77 (  43.76%)
Stddev    30    3469.05 (  0.00%)     1293.37 (  62.72%)
Stddev    48   11521.49 (  0.00%)     3846.34 (  66.62%)
Stddev    79   12849.21 (  0.00%)    12275.89 (   4.46%)
Stddev    110  20362.88 (  0.00%)    12989.46 (  36.21%)
Stddev    141  13768.42 (  0.00%)    17108.34 ( -24.26%)
Stddev    160  13196.34 (  0.00%)    13029.75 (   1.26%)

This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but still important: it's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench.
hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%) Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%) Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%) Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%) Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%) Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%* Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%* Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%* Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%* Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%* Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%* Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%* Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%) Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%* Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%* Processes using pipes are impacted but the variance (not presented) is bad enough that it's close to noise. These results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%) Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%) Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%) Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%) Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%) Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%) Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%) Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%) Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%) Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%) Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%) Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%) Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%) Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%) Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%) As schbench has been mentioned in numerous bugs recently, the results are interesting. 
A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v2r1 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%* Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%* Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%* Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v2r1 Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%* Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%) Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 131 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 111 insertions(+), 20 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..26413062009b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,82 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. 
+ */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static void set_next_buddy(struct sched_entity *se); + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se, + s64 delta) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_WORK as it is less likely that + * there is shared data as exec often follow fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if (!sched_feat(NEXT_BUDDY) || + (wake_flags & WF_FORK) || + (pse->sched_delayed)) { + return PREEMPT_WAKEUP_NONE; + } + + /* Reschedule if waker is no longer eligible. */ + if (!entity_eligible(cfs_rq, se)) { + resched_curr_lazy(rq); + return PREEMPT_WAKEUP_RESCHED; + } + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The downside is that the older buddy may be cache cold + * but that is unpredictable where as an earlier deadline + * is absolute. + */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NONE; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) { + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Only consider + * reschedule if pse deadline expires before se. + */ + if ((wake_flags & WF_RQ_SELECTED) && + delta < sysctl_sched_migration_cost && pse_before) { + return PREEMPT_WAKEUP_NONE; + } + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (pse_before && delta >= sysctl_sched_migration_cost) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + /* * Used by other classes to account runtime. */ @@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness. - */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; + s64 delta; if (unlikely(se == pse)) return; @@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. @@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. 
*/ - do_preempt_short = true; + do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int return; cfs_rq = cfs_rq_of(se); + delta = rq_clock_task(rq) - se->exec_start; update_curr(cfs_rq); /* * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + do_preempt_short = PREEMPT_WAKEUP_NEXT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. + */ + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se, delta); + } + + switch (do_preempt_short) { + case PREEMPT_WAKEUP_NONE: + goto update_slice; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. @@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) +update_slice: + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) update_protect_slice(cfs_rq, se); return; -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-10-23 6:29 ` K Prateek Nayak 0 siblings, 0 replies; 23+ messages in thread From: K Prateek Nayak @ 2025-10-23 6:29 UTC (permalink / raw) To: Mel Gorman, linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason Hello Mel, On 10/21/2025 7:58 PM, Mel Gorman wrote: > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se, > + s64 delta) "delta" is only used within __do_preempt_buddy(). Might as well compute it here. > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. */ > + if (!entity_eligible(cfs_rq, se)) { > + resched_curr_lazy(rq); This is unnecessary since PREEMPT_WAKEUP_RESCHED case already does a "goto preempt" which does a resched_curr_lazy(). > + return PREEMPT_WAKEUP_RESCHED; Should we update the next buddy before returning here if entity_before(pse, cfs_rq->next)? > + } > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. 
> + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) { Do we have !TTWU cases that set WF_SYNC? > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Only consider > + * reschedule if pse deadline expires before se. > + */ Code doesn't seem to match that comment. We are foregoing preemption if the current task has run for less than "sysctl_sched_migration_cost" even on pse_before. Also. looking at the comment in check_preempt_wakeup_fair(): If @p potentially is completing work required by current then consider preemption. Shouldn't this check if "se" is indeed the waker task? Despite WF_SYNC, wake_affine() can still return the CPU where @p was previously running which now might be running another unrelated task. > + if ((wake_flags & WF_RQ_SELECTED) && > + delta < sysctl_sched_migration_cost && pse_before) { > + return PREEMPT_WAKEUP_NONE; > + } Above checks seems unnecessary since we return "PREEMPT_WAKEUP_NONE" if we don't enter the below condition. The two cannot have any overlap based on my reading. > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= sysctl_sched_migration_cost) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} > + > /* > * Used by other classes to account runtime. > */ > @@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) > { > struct sched_entity *se; > > - /* > - * Picking the ->next buddy will affect latency but not fairness. 
> - */ > - if (sched_feat(PICK_BUDDY) && > - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { > - /* ->next will never be delayed */ > - WARN_ON_ONCE(cfs_rq->next->sched_delayed); > - return cfs_rq->next; > - } > - > se = pick_eevdf(cfs_rq); > if (se->sched_delayed) { > dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); > @@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > } > > -static void set_next_buddy(struct sched_entity *se); > - > /* > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > * failing half-way through and resume the dequeue later. > @@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; > + s64 delta; > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. 
> */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } > > @@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > return; > > cfs_rq = cfs_rq_of(se); > + delta = rq_clock_task(rq) - se->exec_start; > update_curr(cfs_rq); > /* > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. > */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se, delta); > + } > + > + switch (do_preempt_short) { > + case PREEMPT_WAKEUP_NONE: > + goto update_slice; You can just return from here since update_protect_slice() is only done on "do_preempt_short != PREEMPT_WAKEUP_NONE". With that, the "update_slice" label becomes unnecessary ... > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > @@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > +update_slice: > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) ... and since do_preempt_short can only be "PREEMPT_WAKEUP_NEXT" which is non-zero, changes in this hunk can be dropped. > update_protect_slice(cfs_rq, se); > > return; -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  [not found] <20250714134429.19624-1-mgorman@techsingularity.net>
@ 2025-07-14 13:44 ` Mel Gorman
  0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2025-07-14 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Mel Gorman

Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected.

Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a mild suggestion instead of a hard rule. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run.

While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a terrible metric for dbench but it is the default one reported. The test machine is a 2-socket CascadeLake machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Hmean     1     1286.83 (  0.00%)     1281.73 (  -0.40%)
Hmean     4     4017.50 (  0.00%)     3934.85 *  -2.06%*
Hmean     7     5536.45 (  0.00%)     5453.55 *  -1.50%*
Hmean     12    7251.59 (  0.00%)     7217.25 (  -0.47%)
Hmean     21    8957.92 (  0.00%)     9188.07 (   2.57%)
Hmean     30    9403.41 (  0.00%)    10523.72 *  11.91%*
Hmean     48    9320.12 (  0.00%)    11496.27 *  23.35%*
Hmean     79    8962.30 (  0.00%)    11555.71 *  28.94%*
Hmean     110   8066.52 (  0.00%)    11307.26 *  40.18%*
Hmean     141   7605.20 (  0.00%)    10622.52 *  39.67%*
Hmean     160   7422.56 (  0.00%)    10250.78 *  38.10%*

As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1       14.35 (  0.00%)       14.27 (   0.57%)
Amean     4       18.58 (  0.00%)       19.01 (  -2.35%)
Amean     7       23.83 (  0.00%)       24.18 (  -1.48%)
Amean     12      31.59 (  0.00%)       31.77 (  -0.55%)
Amean     21      44.65 (  0.00%)       43.44 (   2.71%)
Amean     30      60.73 (  0.00%)       54.21 (  10.74%)
Amean     48      98.25 (  0.00%)       79.41 (  19.17%)
Amean     79     168.34 (  0.00%)      130.06 (  22.74%)
Amean     110    261.03 (  0.00%)      185.04 (  29.11%)
Amean     141    353.98 (  0.00%)      251.55 (  28.94%)
Amean     160    410.66 (  0.00%)      296.87 (  27.71%)
Stddev    1        0.51 (  0.00%)        0.48 (   6.67%)
Stddev    4        1.14 (  0.00%)        1.21 (  -6.78%)
Stddev    7        1.63 (  0.00%)        1.58 (   3.12%)
Stddev    12       2.62 (  0.00%)        2.38 (   9.05%)
Stddev    21       5.21 (  0.00%)        3.87 (  25.70%)
Stddev    30      10.03 (  0.00%)        6.65 (  33.65%)
Stddev    48      22.31 (  0.00%)       12.26 (  45.05%)
Stddev    79      41.14 (  0.00%)       29.11 (  29.25%)
Stddev    110     70.55 (  0.00%)       47.71 (  32.38%)
Stddev    141     98.12 (  0.00%)       66.83 (  31.89%)
Stddev    160    139.37 (  0.00%)       67.73 (  51.40%)

That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile.
This one is tricky because dbench makes no effort to synchronise clients, so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1       14.93 (  0.00%)       14.91 (   0.11%)
Amean     4      348.88 (  0.00%)      277.06 (  20.59%)
Amean     7      722.94 (  0.00%)      991.70 ( -37.18%)
Amean     12    2055.72 (  0.00%)     2684.48 ( -30.59%)
Amean     21    4393.85 (  0.00%)     2625.79 (  40.24%)
Amean     30    6119.84 (  0.00%)     2491.15 (  59.29%)
Amean     48   20600.85 (  0.00%)     6717.61 (  67.39%)
Amean     79   22677.38 (  0.00%)    21866.80 (   3.57%)
Amean     110  35937.71 (  0.00%)    22517.63 (  37.34%)
Amean     141  25104.66 (  0.00%)    29897.08 ( -19.09%)
Amean     160  23843.74 (  0.00%)    23106.66 (   3.09%)
Stddev    1        0.50 (  0.00%)        0.46 (   6.67%)
Stddev    4      201.33 (  0.00%)      130.13 (  35.36%)
Stddev    7      471.94 (  0.00%)      641.69 ( -35.97%)
Stddev    12    1401.94 (  0.00%)     1750.14 ( -24.84%)
Stddev    21    2519.12 (  0.00%)     1416.77 (  43.76%)
Stddev    30    3469.05 (  0.00%)     1293.37 (  62.72%)
Stddev    48   11521.49 (  0.00%)     3846.34 (  66.62%)
Stddev    79   12849.21 (  0.00%)    12275.89 (   4.46%)
Stddev    110  20362.88 (  0.00%)    12989.46 (  36.21%)
Stddev    141  13768.42 (  0.00%)    17108.34 ( -24.26%)
Stddev    160  13196.34 (  0.00%)    13029.75 (   1.26%)

This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but still important: it's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench.
hackbench-process-pipes
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1        0.2183 (  0.00%)      0.2223 (  -1.83%)
Amean     4        0.5780 (  0.00%)      0.5413 (   6.34%)
Amean     7        0.7727 (  0.00%)      0.7093 (   8.20%)
Amean     12       1.1220 (  0.00%)      1.1170 (   0.45%)
Amean     21       1.7470 (  0.00%)      1.7713 (  -1.39%)
Amean     30       2.2940 (  0.00%)      2.6957 * -17.51%*
Amean     48       3.7337 (  0.00%)      4.1003 *  -9.82%*
Amean     79       4.9310 (  0.00%)      5.1417 *  -4.27%*
Amean     110      6.1800 (  0.00%)      6.5370 *  -5.78%*
Amean     141      7.5737 (  0.00%)      8.0060 *  -5.71%*
Amean     172      9.0820 (  0.00%)      9.4767 *  -4.35%*
Amean     203     10.6053 (  0.00%)     10.8870 (  -2.66%)
Amean     234     12.3380 (  0.00%)     13.1290 *  -6.41%*
Amean     265     14.5900 (  0.00%)     15.3547 *  -5.24%*
Amean     296     16.1937 (  0.00%)     17.1533 *  -5.93%*

Processes using pipes are impacted and it's outside the noise as the coefficient of variance is roughly 3%. These results are not always reproducible. If executed across multiple reboots, the test may show neutral or small gains, so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes.

hackbench-process-sockets
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1        0.3217 (  0.00%)      0.3053 (   5.08%)
Amean     4        0.8967 (  0.00%)      0.9007 (  -0.45%)
Amean     7        1.4780 (  0.00%)      1.5067 (  -1.94%)
Amean     12       2.1977 (  0.00%)      2.2693 (  -3.26%)
Amean     21       3.4983 (  0.00%)      3.6667 *  -4.81%*
Amean     30       4.9270 (  0.00%)      5.1207 *  -3.93%*
Amean     48       7.6250 (  0.00%)      7.9667 *  -4.48%*
Amean     79      15.7477 (  0.00%)     15.4177 (   2.10%)
Amean     110     21.8070 (  0.00%)     21.9563 (  -0.68%)
Amean     141     29.4813 (  0.00%)     29.2327 (   0.84%)
Amean     172     36.7433 (  0.00%)     35.9043 (   2.28%)
Amean     203     40.8823 (  0.00%)     40.3467 (   1.31%)
Amean     234     43.1627 (  0.00%)     43.0343 (   0.30%)
Amean     265     49.6417 (  0.00%)     49.9030 (  -0.53%)
Amean     296     51.3137 (  0.00%)     51.9310 (  -1.20%)

At the time of writing, other tests are still running but most are either neutral or show relatively small gains.
In general, the other workloads are less wakeup-intensive than dbench or
hackbench.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..62fa036b0c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -936,6 +936,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1205,6 +1215,83 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
 	return false;
 }
 
+enum preempt_buddy_action {
+	PREEMPT_BUDDY_NONE,		/* No action on the buddy */
+	PREEMPT_BUDDY_NEXT,		/* Check next is most eligible
+					 * before rescheduling.
+					 */
+	PREEMPT_BUDDY_RESCHED,		/* Plain reschedule */
+	PREEMPT_BUDDY_IMMEDIATE		/* Remove slice protection
+					 * and reschedule
+					 */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_buddy_action
+do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+		 struct sched_entity *pse, struct sched_entity *se,
+		 s64 delta, bool did_short)
+{
+	bool pse_before, pse_eligible;
+
+	if (!sched_feat(NEXT_BUDDY) ||
+	    (wake_flags & WF_FORK) ||
+	    (pse->sched_delayed)) {
+		BUILD_BUG_ON(PREEMPT_BUDDY_NONE + 1 != PREEMPT_BUDDY_NEXT);
+		return PREEMPT_BUDDY_NONE + did_short;
+	}
+
+	/* Reschedule if waker is no longer eligible */
+	if (!entity_eligible(cfs_rq, se))
+		return PREEMPT_BUDDY_RESCHED;
+
+	/* Keep existing buddy if the deadline is sooner than pse */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return PREEMPT_BUDDY_NONE;
+
+	set_next_buddy(pse);
+	pse_before = entity_before(pse, se);
+	pse_eligible = entity_eligible(cfs_rq, pse);
+
+	/*
+	 * WF_SYNC implies that waker will sleep soon but it is not enforced
+	 * because the hint is often abused or misunderstood.
+	 */
+	if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+		/*
+		 * WF_RQ_SELECTED implies the tasks are stacking on a
+		 * CPU. Only consider reschedule if pse deadline expires
+		 * before se.
+		 */
+		if ((wake_flags & WF_RQ_SELECTED) &&
+		    delta < sysctl_sched_migration_cost) {
+
+			if (!pse_before)
+				return PREEMPT_BUDDY_NONE;
+
+			/* Fall through to pse deadline. */
+		}
+
+		/*
+		 * Reschedule if pse's deadline is sooner and there is a chance
+		 * that some wakeup batching has completed.
+		 */
+		if (pse_before &&
+		    delta >= (sysctl_sched_migration_cost >> 6)) {
+			return PREEMPT_BUDDY_IMMEDIATE;
+		}
+
+		return PREEMPT_BUDDY_NONE;
+	}
+
+	/* Check eligibility of buddy to start now. */
+	if (pse_before && pse_eligible)
+		return PREEMPT_BUDDY_IMMEDIATE;
+
+	return PREEMPT_BUDDY_NEXT;
+}
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5589,16 +5676,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7056,8 +7133,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8767,6 +8842,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
+	bool did_short;
+	s64 delta;
 
 	if (unlikely(se == pse))
 		return;
@@ -8780,10 +8857,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8829,6 +8902,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		return;
 
 	cfs_rq = cfs_rq_of(se);
+	delta = rq_clock_task(rq) - se->exec_start;
 	update_curr(cfs_rq);
 	/*
 	 * If @p has a shorter slice than current and @p is eligible, override
@@ -8837,9 +8911,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * Note that even if @p does not turn out to be the most eligible
 	 * task at this moment, current's slice protection will be lost.
 	 */
-	if (do_preempt_short(cfs_rq, pse, se))
+	did_short = do_preempt_short(cfs_rq, pse, se);
+	if (did_short)
 		cancel_protect_slice(se);
 
+	switch (do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se, delta, did_short)) {
+	case PREEMPT_BUDDY_NONE:
+		return;
+		break;
+	case PREEMPT_BUDDY_IMMEDIATE:
+		cancel_protect_slice(se);
+		fallthrough;
+	case PREEMPT_BUDDY_RESCHED:
+		goto preempt;
+		break;
+	case PREEMPT_BUDDY_NEXT:
+		break;
+	}
+
 	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-- 
2.43.0
Thread overview: 23+ messages
[not found] <20251027133915.4103633-1-mgorman@techsingularity.net>
2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
2025-10-28 14:37 ` Peter Zijlstra
2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
2025-10-31 9:46 ` Mel Gorman
2025-10-28 15:09 ` Peter Zijlstra
2025-10-31 9:48 ` Mel Gorman
2025-10-28 15:33 ` Peter Zijlstra
2025-10-28 15:47 ` Peter Zijlstra
2025-10-30 9:10 ` Peter Zijlstra
2025-10-31 10:27 ` Mel Gorman
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:13 ` Peter Zijlstra
-- strict thread matches above, loose matches on Subject: below --
2025-11-03 11:04 [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-03 14:14 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
2025-11-07 8:53 ` Mel Gorman
2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-10-23 6:29 ` K Prateek Nayak
[not found] <20250714134429.19624-1-mgorman@techsingularity.net>
2025-07-14 13:44 ` Mel Gorman