* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY [not found] <20251027133915.4103633-1-mgorman@techsingularity.net> @ 2025-10-27 13:39 ` Mel Gorman 2025-10-28 14:37 ` Peter Zijlstra 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last wakee to be scheduled sooner on the assumption that the waker/wakee share cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on the assumption that the pair of tasks still share data but also relied on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good results. NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear, but vruntime spread is mentioned, so the expectation is that NEXT_BUDDY had an impact on overall fairness. It was not noted why LAST_BUDDY was removed but it is assumed that it is very difficult to reason about what LAST_BUDDY's correct and effective behaviour should be while still respecting EEVDF's goals. NEXT_BUDDY is easier to reason about given that it's a point-in-time decision on the wakee's deadline and eligibility relative to the waker. Enable NEXT_BUDDY as a preparation patch to document that the decision to ignore the current implementation is deliberate. While not presented, the results were at best neutral and often much more variable. 
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/features.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 3c12d9f93331..0607def744af 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true) * wakeup-preemption), since its likely going to consume data we * touched, increases cache locality. */ -SCHED_FEAT(NEXT_BUDDY, false) +SCHED_FEAT(NEXT_BUDDY, true) /* * Allow completely ignoring cfs_rq->next; which can be set from various -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
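The commit message's claim that NEXT_BUDDY is a "point-in-time decision on the wakee's deadline and eligibility" rests on two small predicates. The following is a standalone sketch of them, not the kernel code: the names mirror kernel/sched/fair.c, but the kernel's entity_eligible() compares against a weighted average vruntime computed in avg_vruntime() with fixed-point arithmetic, which a plain average stands in for here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified userspace model of the two EEVDF checks the buddy logic
 * relies on. Real entities carry weights and lag; this model does not.
 */
struct entity {
	int64_t vruntime;	/* virtual runtime consumed so far */
	int64_t deadline;	/* virtual deadline of the current request */
};

/* entity_before(): the entity with the earlier virtual deadline wins */
static bool entity_before(const struct entity *a, const struct entity *b)
{
	return a->deadline < b->deadline;
}

/*
 * entity_eligible(): under EEVDF a task may be picked only when its
 * vruntime is not ahead of the runqueue's (weighted) average vruntime,
 * i.e. it has not already received more than its fair share.
 */
static bool entity_eligible(int64_t avg_vruntime, const struct entity *se)
{
	return se->vruntime <= avg_vruntime;
}
```

Both checks are O(1) against the current runqueue state, which is why a wakeup-time buddy decision based on them is easy to reason about compared with LAST_BUDDY's switch-back behaviour.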
* Re: [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY 2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman @ 2025-10-28 14:37 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 14:37 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:14PM +0000, Mel Gorman wrote: > NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve > latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 > ("sched/fair: Commit to EEVDF"). The reasoning is not clear but as vruntime > spread is mentioned so the expectation is that NEXT_BUDDY had an impact > on overall fairness. It was not noted why LAST_BUDDY was removed but it > is assumed that it's very difficult to reason what LAST_BUDDY's correct > and effective behaviour should be while still respecting EEVDFs goals. I think I was just struggling to make sense of things and figured less is more and axed it. I have vague memories trying to work through the dynamics of a wakeup-stack and the EEVDF latency requirements and getting a head-ache. ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals [not found] <20251027133915.4103633-1-mgorman@techsingularity.net> 2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman @ 2025-10-27 13:39 ` Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-27 13:39 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected. Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, preemption is throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption. For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run. While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a poor metric but it is the default. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. 
It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems. dbench4 Throughput (misleading but traditional) 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Hmean 1 1268.80 ( 0.00%) 1285.93 * 1.35%* Hmean 4 3971.74 ( 0.00%) 3851.10 * -3.04%* Hmean 7 5548.23 ( 0.00%) 5507.07 ( -0.74%) Hmean 12 7310.86 ( 0.00%) 7205.37 ( -1.44%) Hmean 21 8874.53 ( 0.00%) 9193.66 * 3.60%* Hmean 30 9361.93 ( 0.00%) 10552.03 * 12.71%* Hmean 48 9540.14 ( 0.00%) 11936.22 * 25.12%* Hmean 79 9208.74 ( 0.00%) 12367.06 * 34.30%* Hmean 110 8573.12 ( 0.00%) 12114.01 * 41.30%* Hmean 141 7791.33 ( 0.00%) 11391.60 * 46.21%* Hmean 160 7666.60 ( 0.00%) 10789.23 * 40.73%* As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds. dbench4 Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 14.62 ( 0.00%) 14.21 ( 2.80%) Amean 4 18.76 ( 0.00%) 19.45 ( -3.67%) Amean 7 23.71 ( 0.00%) 23.82 ( -0.48%) Amean 12 31.25 ( 0.00%) 31.78 ( -1.69%) Amean 21 45.12 ( 0.00%) 43.53 ( 3.52%) Amean 30 61.07 ( 0.00%) 54.13 ( 11.37%) Amean 48 95.91 ( 0.00%) 76.51 ( 20.23%) Amean 79 163.38 ( 0.00%) 121.46 ( 25.66%) Amean 110 243.91 ( 0.00%) 172.56 ( 29.25%) Amean 141 343.47 ( 0.00%) 236.07 ( 31.27%) Amean 160 401.15 ( 0.00%) 283.64 ( 29.29%) Stddev 1 0.52 ( 0.00%) 0.44 ( 15.50%) Stddev 4 1.36 ( 0.00%) 1.42 ( -4.91%) Stddev 7 1.88 ( 0.00%) 1.84 ( 2.27%) Stddev 12 3.06 ( 0.00%) 2.55 ( 16.57%) Stddev 21 5.78 ( 0.00%) 3.70 ( 35.90%) Stddev 30 9.85 ( 0.00%) 5.10 ( 48.26%) Stddev 48 22.31 ( 0.00%) 8.30 ( 62.79%) Stddev 79 35.96 ( 0.00%) 16.79 ( 53.31%) Stddev 110 59.04 ( 0.00%) 30.08 ( 49.04%) Stddev 141 85.38 ( 0.00%) 47.05 ( 44.89%) Stddev 160 96.38 ( 0.00%) 47.27 ( 50.95%) That is still looking good and the variance is reduced quite a bit. 
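For reference, the Hmean/Amean/Stddev rows follow the usual mmtests convention (an assumption, since the report format is not defined in this thread): arithmetic mean for durations, harmonic mean for rate-style metrics such as throughput, because a harmonic mean is dragged down by slow outliers rather than hidden by fast ones. A minimal sketch of the two statistics:

```c
#include <math.h>
#include <stddef.h>

/* Arithmetic mean: appropriate for durations such as loadfile times. */
static double amean(const double *v, size_t n)
{
	double sum = 0.0;
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Harmonic mean: appropriate for rates such as throughput, where one
 * slow sample should pull the summary down noticeably. */
static double hmean(const double *v, size_t n)
{
	double inv = 0.0;
	for (size_t i = 0; i < n; i++)
		inv += 1.0 / v[i];
	return (double)n / inv;
}
```

For the samples {2, 4} the arithmetic mean is 3 while the harmonic mean is 8/3, illustrating how the harmonic mean weights the slower sample more heavily.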
Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile. This one is tricky because dbench makes no effort to synchronise clients so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results. dbench4 All Clients Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 14.93 ( 0.00%) 14.91 ( 0.11%) Amean 4 348.88 ( 0.00%) 277.06 ( 20.59%) Amean 7 722.94 ( 0.00%) 991.70 ( -37.18%) Amean 12 2055.72 ( 0.00%) 2684.48 ( -30.59%) Amean 21 4393.85 ( 0.00%) 2625.79 ( 40.24%) Amean 30 6119.84 ( 0.00%) 2491.15 ( 59.29%) Amean 48 20600.85 ( 0.00%) 6717.61 ( 67.39%) Amean 79 22677.38 ( 0.00%) 21866.80 ( 3.57%) Amean 110 35937.71 ( 0.00%) 22517.63 ( 37.34%) Amean 141 25104.66 ( 0.00%) 29897.08 ( -19.09%) Amean 160 23843.74 ( 0.00%) 23106.66 ( 3.09%) Stddev 1 0.50 ( 0.00%) 0.46 ( 6.67%) Stddev 4 201.33 ( 0.00%) 130.13 ( 35.36%) Stddev 7 471.94 ( 0.00%) 641.69 ( -35.97%) Stddev 12 1401.94 ( 0.00%) 1750.14 ( -24.84%) Stddev 21 2519.12 ( 0.00%) 1416.77 ( 43.76%) Stddev 30 3469.05 ( 0.00%) 1293.37 ( 62.72%) Stddev 48 11521.49 ( 0.00%) 3846.34 ( 66.62%) Stddev 79 12849.21 ( 0.00%) 12275.89 ( 4.46%) Stddev 110 20362.88 ( 0.00%) 12989.46 ( 36.21%) Stddev 141 13768.42 ( 0.00%) 17108.34 ( -24.26%) Stddev 160 13196.34 ( 0.00%) 13029.75 ( 1.26%) This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but this is still important. It's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench. 
hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%) Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%) Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%) Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%) Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%) Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%* Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%* Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%* Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%* Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%* Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%* Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%* Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%) Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%* Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%* Processes using pipes are impacted but the variance (not presented) is bad enough that it's close to noise. These results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%) Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%) Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%) Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%) Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%) Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%) Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%) Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%) Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%) Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%) Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%) Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%) Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%) Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%) Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%) As schbench has been mentioned in numerous bugs recently, the results are interesting. 
A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnextr1 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%* Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%* Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%* Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnextr1 Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%* Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%) Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 117 insertions(+), 20 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..158e0430449b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. 
+ */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static void set_next_buddy(struct sched_entity *se); + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follows fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if (!sched_feat(NEXT_BUDDY) || + (wake_flags & WF_FORK) || + (pse->sched_delayed)) { + return PREEMPT_WAKEUP_NONE; + } + + /* Reschedule if waker is no longer eligible. */ + if (!entity_eligible(cfs_rq, se)) + return PREEMPT_WAKEUP_RESCHED; + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The downside is that the older buddy may be cache cold + * but that is unpredictable whereas an earlier deadline + * is absolute. + */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NONE; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if (wake_flags & WF_SYNC) { + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; + + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); + + if ((s64)delta < 0) + delta = 0; + + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. 
+ */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. + */ + if (pse_before && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + /* * Used by other classes to account runtime. */ @@ -5512,16 +5607,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness. - */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. 
@@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + do_preempt_short = PREEMPT_WAKEUP_NEXT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. + */ + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se); + } + + switch (do_preempt_short) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) update_protect_slice(cfs_rq, se); return; -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
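The WF_SYNC handling in __do_preempt_buddy() above boils down to a small pure decision on the waker's accumulated runtime. The following is a hypothetical standalone model of it, not the kernel code: sysctl_sched_migration_cost defaults to 500000ns in the kernel, and the >>2 divisor is the patch's arbitrary reduction for WF_RQ_SELECTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCHED_MIGRATION_COST_NS	500000ULL	/* kernel default */

/*
 * Model of the WF_SYNC throttle: a sync wakee may preempt only if it
 * sorts before the waker by deadline AND the waker has run long enough
 * that any batch of wakeups has probably been issued already.
 */
static bool sync_wakeup_preempts(bool wakee_before_waker,
				 uint64_t waker_runtime_ns,
				 bool rq_selected)
{
	uint64_t threshold = SCHED_MIGRATION_COST_NS;

	/* Tasks stacked on one CPU: preempt sooner, as the waker more
	 * likely needs the wakee to finish. */
	if (rq_selected)
		threshold >>= 2;

	/* WF_SYNC is a suggestion, not a contract: allow some runtime
	 * for batched wakeups before rescheduling. */
	return wakee_before_waker && waker_runtime_ns >= threshold;
}
```

With the default threshold, a waker that has run for only 100us keeps running after a sync wakeup, while one that has run 500us or more (or 125us when the tasks are stacked) is preempted.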
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-10-28 15:05 ` Peter Zijlstra 2025-10-31 9:46 ` Mel Gorman 2025-10-28 15:09 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:05 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: Still going through this; just a few early comments. > kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- > 1 file changed, 117 insertions(+), 20 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bc0b7ce8a65d..158e0430449b 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) > return delta_exec; > } > > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; > + > +static void set_next_buddy(struct sched_entity *se); > + > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. 
*/ > + if (!entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} All this seems weirdly placed inside the file. Is there a reason this is placed so far away from its only caller? > /* > * Used by other classes to account runtime. 
> */ > @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > } > > -static void set_next_buddy(struct sched_entity *se); > - > /* > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > * failing half-way through and resume the dequeue later. > @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; naming seems off; I'm not sure what this still has to do with short. Perhaps just preempt_action or whatever? > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } > > @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. 
> */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (do_preempt_short) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) > update_protect_slice(cfs_rq, se); WAKEUP_NONE did a return above, I don't think you can get here with WAKEUP_NONE, making the above condition always true. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:05 ` Peter Zijlstra @ 2025-10-31 9:46 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 9:46 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:05:45PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++++++------- > > 1 file changed, 117 insertions(+), 20 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index bc0b7ce8a65d..158e0430449b 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1193,6 +1203,91 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) > > return delta_exec; > > } > > > > +enum preempt_wakeup_action { > > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > > + * before rescheduling. > > + */ > > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > > +}; > > + > > <SNIP> > > All this seems weirdly placed inside the file. Is there a reason this is > placed so far away from its only caller? > No, I don't recall why I placed it there and whether it was simply required by an early prototype. > > /* > > * Used by other classes to account runtime. > > */ > > > @@ -7028,8 +7113,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > hrtick_update(rq); > > } > > > > -static void set_next_buddy(struct sched_entity *se); > > - > > /* > > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > > * failing half-way through and resume the dequeue later. 
> > @@ -8734,7 +8817,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > struct sched_entity *se = &donor->se, *pse = &p->se; > > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > > int cse_is_idle, pse_is_idle; > > - bool do_preempt_short = false; > > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; > > naming seems off; I'm not sure what this still has to do with short. > Perhaps just preempt_action or whatever? > Good as name as any, may change it to a more meaningful name once I go through the rest of the review. > > > > if (unlikely(se == pse)) > > return; > > @@ -8748,10 +8831,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > if (task_is_throttled(p)) > > return; > > > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > > - set_next_buddy(pse); > > - } > > - > > /* > > * We can come here with TIF_NEED_RESCHED already set from new task > > * wake up path. > > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * When non-idle entity preempt an idle entity, > > * don't give idle entity slice protection. > > */ > > - do_preempt_short = true; > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > goto preempt; > > } > > > > @@ -8802,7 +8881,25 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * If @p has a shorter slice than current and @p is eligible, override > > * current's slice protection in order to allow preemption. > > */ > > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > + } else { > > + /* > > + * If @p potentially is completing work required by current then > > + * consider preemption. 
> > + */ > > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > > + pse, se); > > + } > > + > > + switch (do_preempt_short) { > > + case PREEMPT_WAKEUP_NONE: > > + return; > > + case PREEMPT_WAKEUP_RESCHED: > > + goto preempt; > > + case PREEMPT_WAKEUP_NEXT: > > + break; > > + } > > > > /* > > * If @p has become the most eligible task, force preemption. > > @@ -8810,7 +8907,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > > goto preempt; > > > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) > > update_protect_slice(cfs_rq, se); > > WAKEUP_NONE did a return above, I don't think you can get here with > WAKEUP_NONE, making the above condition always true. Correct. This was not always the case during development but now the rename alone highlights issues beyond this oddity. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra @ 2025-10-28 15:09 ` Peter Zijlstra 2025-10-31 9:48 ` Mel Gorman 2025-10-28 15:33 ` Peter Zijlstra 2025-10-30 9:10 ` Peter Zijlstra 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:09 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; In pre-existing code that isn't modified by this patch, we have: if (do_preempt_short) Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make that explicit in the enum above. ^ permalink raw reply [flat|nested] 23+ messages in thread
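The hazard Peter points out — boolean tests like `if (do_preempt_short)` silently depending on the first enumerator being zero — can be shown in isolation. This is a standalone sketch, not the kernel code:

```c
#include <assert.h>

enum preempt_wakeup_action {
	PREEMPT_WAKEUP_NONE = 0,	/* explicit: truth tests rely on this */
	PREEMPT_WAKEUP_NEXT,
	PREEMPT_WAKEUP_RESCHED,
};

/*
 * An enum used in a boolean context tests "!= 0", so the "no action"
 * value must be pinned to zero for `if (action)` to keep meaning
 * "some action requested" as the enum grows.
 */
static int action_requested(enum preempt_wakeup_action a)
{
	return a != PREEMPT_WAKEUP_NONE;
}
```

Making the `= 0` explicit costs nothing and documents the invariant at the definition site instead of at every use.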
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:09 ` Peter Zijlstra @ 2025-10-31 9:48 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:09:51PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > +enum preempt_wakeup_action { > > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > > + * before rescheduling. > > + */ > > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > > +}; > > In pre-existing code that isn't modified by this patch, we have: > > if (do_preempt_short) > > Which seems to hard rely on PREEMPT_WAKEUP_NONE being 0, please make > that explicit in the enum above. Will do. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-10-28 15:05 ` Peter Zijlstra 2025-10-28 15:09 ` Peter Zijlstra @ 2025-10-28 15:33 ` Peter Zijlstra 2025-10-28 15:47 ` Peter Zijlstra 2025-10-30 9:10 ` Peter Zijlstra 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:33 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } I'm confused, should this not be WAKEUP_RESCHED? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-28 15:33 ` Peter Zijlstra @ 2025-10-28 15:47 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-10-28 15:47 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Tue, Oct 28, 2025 at 04:33:51PM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > @@ -8783,7 +8862,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > * When non-idle entity preempt an idle entity, > > * don't give idle entity slice protection. > > */ > > - do_preempt_short = true; > > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > > goto preempt; > > } > > I'm confused, should this not be WAKEUP_RESCHED? It doesn't matter, you cannot end up with !do_preempt_short at preempt:, so that condition is always true and can thus be deleted, at which point the value of do_preempt_short is irrelevant and doesn't need to be set. A little something like so perhaps... --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8727,7 +8727,7 @@ static void set_next_buddy(struct sched_ } enum preempt_wakeup_action { - PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NONE = 0, /* No action on the buddy */ PREEMPT_WAKEUP_NEXT, /* Check next is most eligible * before rescheduling. 
*/ @@ -8814,7 +8814,7 @@ __do_preempt_buddy(struct rq *rq, struct */ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { - enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; struct task_struct *donor = rq->donor; struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); @@ -8863,7 +8863,6 @@ static void check_preempt_wakeup_fair(st * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8883,17 +8882,17 @@ static void check_preempt_wakeup_fair(st * current's slice protection in order to allow preemption. */ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { - do_preempt_short = PREEMPT_WAKEUP_NEXT; + preempt_action = PREEMPT_WAKEUP_NEXT; } else { /* * If @p potentially is completing work required by current then * consider preemption. */ - do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se); } - switch (do_preempt_short) { + switch (preempt_action) { case PREEMPT_WAKEUP_NONE: return; case PREEMPT_WAKEUP_RESCHED: @@ -8905,18 +8904,16 @@ static void check_preempt_wakeup_fair(st /* * If @p has become the most eligible task, force preemption. */ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, false) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) - cancel_protect_slice(se); - + cancel_protect_slice(se); resched_curr_lazy(rq); } ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman ` (2 preceding siblings ...) 2025-10-28 15:33 ` Peter Zijlstra @ 2025-10-30 9:10 ` Peter Zijlstra 2025-10-31 10:27 ` Mel Gorman 3 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-10-30 9:10 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || This seems wrong, that would mean wakeup preemption gets killed the moment you disable NEXT_BUDDY, that can't be right. > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. */ > + if (!entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; That comment isn't accurate, unless you add: && in_task(). That is, if this is an interrupt doing the wakeup, it has nothing to do with current. > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; But if previously we set next and didn't preempt, we should try again, maybe it has more success now. 
That is, should this not be _NEXT? > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); s/|/&/ ? > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} Add to this that AFAICT your patch ends up doing: __pick_eevdf(.protect = false) == pse which unconditionally disables the slice protection feature. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-30 9:10 ` Peter Zijlstra @ 2025-10-31 10:27 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-10-31 10:27 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason On Thu, Oct 30, 2025 at 10:10:58AM +0100, Peter Zijlstra wrote: > On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote: > > +static inline enum preempt_wakeup_action > > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > > + struct sched_entity *pse, struct sched_entity *se) > > +{ > > + bool pse_before; > > + > > + /* > > + * Ignore wakee preemption on WF_WORK as it is less likely that > > + * there is shared data as exec often follow fork. Do not > > + * preempt for tasks that are sched_delayed as it would violate > > + * EEVDF to forcibly queue an ineligible task. > > + */ > > + if (!sched_feat(NEXT_BUDDY) || > > This seems wrong, that would mean wakeup preemption gets killed the > moment you disable NEXT_BUDDY, that can't be right. > Correct, the check is bogus. > > + (wake_flags & WF_FORK) || > > + (pse->sched_delayed)) { > > + return PREEMPT_WAKEUP_NONE; > > + } > > + > > + /* Reschedule if waker is no longer eligible. */ > > + if (!entity_eligible(cfs_rq, se)) > > + return PREEMPT_WAKEUP_RESCHED; > > That comment isn't accurate, unless you add: && in_task(). That is, if > this is an interrupt doing the wakeup, it has nothing to do with > current. > That was a complete oversight. > > + /* > > + * Keep existing buddy if the deadline is sooner than pse. > > + * The downside is that the older buddy may be cache cold > > + * but that is unpredictable where as an earlier deadline > > + * is absolute. 
> > + */ > > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > > + return PREEMPT_WAKEUP_NONE; > > But if previously we set next and didn't preempt, we should try again, > maybe it has more success now. That is, should this not be _NEXT? > The context of why the original buddy was set is now lost but you're right, it is more straight-forward to reconsider the old buddy. It's more in line with EEVDF objectives and cache residency and future hotness requires crystal ball instructions. > > + > > + set_next_buddy(pse); > > + > > + /* > > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > > + * strictly enforced because the hint is either misunderstood or > > + * multiple tasks must be woken up. > > + */ > > + pse_before = entity_before(pse, se); > > + if (wake_flags & WF_SYNC) { > > + u64 delta = rq_clock_task(rq) - se->exec_start; > > + u64 threshold = sysctl_sched_migration_cost; > > + > > + /* > > + * WF_SYNC without WF_TTWU is not expected so warn if it > > + * happens even though it is likely harmless. > > + */ > > + WARN_ON_ONCE(!(wake_flags | WF_TTWU)); > > s/|/&/ ? > Bah, thanks. > > + if ((s64)delta < 0) > > + delta = 0; > > + > > + /* > > + * WF_RQ_SELECTED implies the tasks are stacking on a > > + * CPU when they could run on other CPUs. Reduce the > > + * threshold before preemption is allowed to an > > + * arbitrary lower value as it is more likely (but not > > + * guaranteed) the waker requires the wakee to finish. > > + */ > > + if (wake_flags & WF_RQ_SELECTED) > > + threshold >>= 2; > > + > > + /* > > + * As WF_SYNC is not strictly obeyed, allow some runtime for > > + * batch wakeups to be issued. 
> > + */ > > + if (pse_before && delta >= threshold) > > + return PREEMPT_WAKEUP_RESCHED; > > + > > + return PREEMPT_WAKEUP_NONE; > > + } > > + > > + return PREEMPT_WAKEUP_NEXT; > > +} > > Add to this that AFAICT your patch ends up doing: > > __pick_eevdf(.protect = false) == pse > > which unconditionally disables the slice protection feature. > Yes, trying to converge PREEMPT_SHORT with NEXT_BUDDY during prototyping was a poor decision because it led to mistakes like this. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-12 12:25 Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman
Changes since v4
o Split out decisions into separate functions (peterz)
o Flow clarity (peterz)
Changes since v3
o Place new code near first consumer (peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduling issues recently, like many
others and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some discussions recently have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
Mel Gorman (2):
sched/fair: Enable scheduler feature NEXT_BUDDY
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 131 insertions(+), 23 deletions(-)
--
2.51.0
^ permalink raw reply	[flat|nested] 23+ messages in thread
[parent not found: <20251112122521.1331238-3-mgorman@techsingularity.net>]
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals [not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net> @ 2025-11-12 14:48 ` Peter Zijlstra 2025-11-13 8:26 ` Madadi Vineeth Reddy 2025-11-13 9:04 ` Mel Gorman 0 siblings, 2 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-12 14:48 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > + /* Prefer picking wakee soon if appropriate. */ > + if (sched_feat(NEXT_BUDDY) && > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > + > + /* > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > + * buddies are ignored as they may not be relevant to the > + * waker and less likely to be cache hot. > + */ > + if (wake_flags & WF_SYNC) > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > + } Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to depend on buddies. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra @ 2025-11-13 8:26 ` Madadi Vineeth Reddy 2025-11-13 9:04 ` Mel Gorman 1 sibling, 0 replies; 23+ messages in thread From: Madadi Vineeth Reddy @ 2025-11-13 8:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel, Madadi Vineeth Reddy Hi Peter, On 12/11/25 20:18, Peter Zijlstra wrote: > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > >> + /* Prefer picking wakee soon if appropriate. */ >> + if (sched_feat(NEXT_BUDDY) && >> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { >> + >> + /* >> + * Decide whether to obey WF_SYNC hint for a new buddy. Old >> + * buddies are ignored as they may not be relevant to the >> + * waker and less likely to be cache hot. >> + */ >> + if (wake_flags & WF_SYNC) >> + preempt_action = preempt_sync(rq, wake_flags, pse, se); >> + } > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > depend on buddies. IIUC, calling preempt_sync() without NEXT_BUDDY would force a reschedule after the threshold, but no buddy would be set. This means pick_eevdf() would run with normal EEVDF deadline selection, potentially picking a different task instead of the wakee. The forced reschedule would accomplish nothing for the wakee. That said, I see your point that the WF_SYNC threshold check could still reduce context switch overhead even without guaranteeing wakee selection. Thanks, Madadi Vineeth Reddy ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra 2025-11-13 8:26 ` Madadi Vineeth Reddy @ 2025-11-13 9:04 ` Mel Gorman 2025-11-14 12:13 ` Peter Zijlstra 1 sibling, 1 reply; 23+ messages in thread From: Mel Gorman @ 2025-11-13 9:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote: > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > > > + /* Prefer picking wakee soon if appropriate. */ > > + if (sched_feat(NEXT_BUDDY) && > > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > > + > > + /* > > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > > + * buddies are ignored as they may not be relevant to the > > + * waker and less likely to be cache hot. > > + */ > > + if (wake_flags & WF_SYNC) > > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > > + } > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > depend on buddies. There isn't a direct relation, but there is an indirect one. I know from your previous review that you separated out the WF_SYNC but after a while, I did not find a good reason to separate it completely from NEXT_BUDDY. NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker relationship between two tasks and potentially share data that may still be cache resident after a context switch. WF_SYNC indicates there may be a strict relationship between those two tasks that the waker may need the wakee to do some work before it can make progress. If NEXT_BUDDY does not set cfs_rq->next in the current waking context then the wakee may only be picked next by coincidence under normal EEVDF rules. 
WF_SYNC could still reschedule if the wakee is not selected as a buddy but
the benefit, if any, would be marginal -- if the waker does not go to sleep
then the WF_SYNC contract is violated and if the data becomes cache cold
after a wakeup delay then the shared data may already be evicted from
cache. With NEXT_BUDDY, there is a chance that the cost of a reschedule
and/or a context switch will be offset by reduced overall latency (e.g.
fewer cache misses). Without NEXT_BUDDY, WF_SYNC may only incur costs due
to context switching.

I considered the possibility of WF_SYNC being applied if pse is already a
buddy due to yield or some other factor but there is no reason to assume
any shared data is still cache resident and it's not easy to reason about.
I considered applying WF_SYNC if pse was already set and using it as a
two-pass filter but again, there is no obvious benefit, nor a reason why
the second wakeup is more important than the first wakeup. I considered
WF_SYNC being applied if any buddy is set but it's not clear why a SYNC
wakeup between tasks A,B should instead pick C to run ASAP outside of the
normal EEVDF rules.

I think it's straight-forward if the logic is

o If NEXT_BUDDY sets the wakee as cfs_rq->next then schedule the wakee soon
o If the wakee is to be selected soon and WF_SYNC is also set then pick the
  wakee ASAP

but less straight-forward if

o If WF_SYNC is set, reschedule now and maybe the wakee will be picked,
  maybe the waker will run again, maybe something else will run and
  sometimes it'll be a gain overall.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-13 9:04 ` Mel Gorman @ 2025-11-14 12:13 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-14 12:13 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Madadi Vineeth Reddy, linux-kernel On Thu, Nov 13, 2025 at 09:04:38AM +0000, Mel Gorman wrote: > On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote: > > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote: > > > > > + /* Prefer picking wakee soon if appropriate. */ > > > + if (sched_feat(NEXT_BUDDY) && > > > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > > > + > > > + /* > > > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > > > + * buddies are ignored as they may not be relevant to the > > > + * waker and less likely to be cache hot. > > > + */ > > > + if (wake_flags & WF_SYNC) > > > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > > > + } > > > > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to > > depend on buddies. > > There isn't a direct relation, but there is an indirect one. I know from > your previous review that you separated out the WF_SYNC but after a while, > I did not find a good reason to separate it completely from NEXT_BUDDY. > > NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a waker > relationship between two tasks and potentially share data that may still > be cache resident after a context switch. WF_SYNC indicates there may be > a strict relationship between those two tasks that the waker may need the > wakee to do some work before it can make progress. If NEXT_BUDDY does not > set cfs_rq->next in the current waking context then the wakee may only be > picked next by coincidence under normal EEVDF rules. Aah, fair enough. Perhaps the comment could've been clearer but whatever. 
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-03 11:04 Mel Gorman
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Mel Gorman
Changes since v3
o Place new code near first consumer (peterz)
o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduling issues recently, like many
others, and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour which may explain why some discussions recently have reproduction
problems. Nevertheless, the removal of LAST_BUDDY and the disabling of
NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY and may select a wakee if it's eligible to
run even though other unrelated tasks may have an earlier deadline.
kernel/sched/fair.c | 145 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 124 insertions(+), 23 deletions(-)
--
2.51.0
^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
2025-11-03 11:04 [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
@ 2025-11-03 11:04 ` Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
0 siblings, 2 replies; 23+ messages in thread
From: Mel Gorman @ 2025-11-03 11:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, linux-kernel, Mel Gorman
Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event multiple
buddies could be considered, the one with the earliest deadline is
selected.
Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated as
a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In all other cases,
preemption is throttled based on the accumulated runtime of the waker so
there is a chance that some batched wakeups have been issued before
preemption. For all other wakeups, preemption happens if the wakee has an
earlier deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption. First is the dbench throughput data; even though it
is a poor metric, it is the default one. The test machine is a 2-socket
machine and the backing filesystem is XFS as a lot of the IO work is
dispatched to kernel threads. It's important to note that these results
are not representative across all machines, especially Zen machines, as
different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Hmean     1      1268.80 (   0.00%)     1268.92 (   0.01%)
Hmean     4      3971.74 (   0.00%)     3952.44 (  -0.49%)
Hmean     7      5548.23 (   0.00%)     5375.49 *  -3.11%*
Hmean     12     7310.86 (   0.00%)     7080.80 (  -3.15%)
Hmean     21     8874.53 (   0.00%)     9044.30 (   1.91%)
Hmean     30     9361.93 (   0.00%)    10430.80 *  11.42%*
Hmean     48     9540.14 (   0.00%)    11786.28 *  23.54%*
Hmean     79     9208.74 (   0.00%)    11910.07 *  29.33%*
Hmean     110    8573.12 (   0.00%)    11571.03 *  34.97%*
Hmean     141    7791.33 (   0.00%)    11151.69 *  43.13%*
Hmean     160    7666.60 (   0.00%)    10661.53 *  39.06%*
As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        14.62 (   0.00%)       14.69 (  -0.50%)
Amean     4        18.76 (   0.00%)       18.88 (  -0.63%)
Amean     7        23.71 (   0.00%)       24.52 (  -3.43%)
Amean     12       31.25 (   0.00%)       32.20 (  -3.03%)
Amean     21       45.12 (   0.00%)       44.19 (   2.05%)
Amean     30       61.07 (   0.00%)       54.74 (  10.35%)
Amean     48       95.91 (   0.00%)       77.52 (  19.18%)
Amean     79      163.38 (   0.00%)      126.04 (  22.85%)
Amean     110     243.91 (   0.00%)      179.93 (  26.23%)
Amean     141     343.47 (   0.00%)      240.03 (  30.12%)
Amean     160     401.15 (   0.00%)      284.92 (  28.98%)
That is still looking good and the variance is reduced quite a bit with
the caveat that different CPU families may yield different results.
Fairness is a concern so the next report tracks how many milliseconds it
takes for all clients to complete a workfile. This one is tricky because
dbench makes no effort to synchronise clients so the durations at
benchmark start time differ substantially from typical runtimes. Depending
on the configuration, the benchmark can be dominated by the timing of
fsync of different clients. This problem could be mitigated by warming up
the benchmark for a number of minutes but it's a matter of opinion whether
that counts as an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        15.06 (   0.00%)       15.10 (  -0.27%)
Amean     4       603.81 (   0.00%)     1451.94 (-140.46%)
Amean     7       855.32 (   0.00%)     1116.68 ( -30.56%)
Amean     12     1890.02 (   0.00%)     2075.98 (  -9.84%)
Amean     21     3195.23 (   0.00%)     2508.86 (  21.48%)
Amean     30    13919.53 (   0.00%)     3324.58 (  76.12%)
Amean     48    25246.07 (   0.00%)     4146.10 (  83.58%)
Amean     79    29701.84 (   0.00%)    21338.68 (  28.16%)
Amean     110   22803.03 (   0.00%)    29502.47 ( -29.38%)
Amean     141   36356.07 (   0.00%)    25292.07 (  30.43%)
Amean     160   17046.71 (   0.00%)    27043.67 ( -58.64%)
Stddev    1         0.47 (   0.00%)        0.46 (   2.84%)
Stddev    4       395.24 (   0.00%)      782.96 ( -98.10%)
Stddev    7       467.24 (   0.00%)      738.26 ( -58.00%)
Stddev    12     1071.43 (   0.00%)     1124.78 (  -4.98%)
Stddev    21     1694.50 (   0.00%)     1463.27 (  13.65%)
Stddev    30     7945.63 (   0.00%)     1889.19 (  76.22%)
Stddev    48    14339.51 (   0.00%)     2226.59 (  84.47%)
Stddev    79    16620.91 (   0.00%)    12027.14 (  27.64%)
Stddev    110   12912.15 (   0.00%)    16601.77 ( -28.57%)
Stddev    141   20700.13 (   0.00%)    13624.50 (  34.18%)
Stddev    160    9079.16 (   0.00%)    15729.63 ( -73.25%)
This is more of a mixed bag but it at least shows that fairness is not
crippled. The hackbench results are more neutral but this is still
important. It's possible to boost the dbench figures by a large amount but
only by crippling the performance of a workload like hackbench.
hackbench-process-pipes
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        0.2657 (   0.00%)      0.2120 (  20.20%)
Amean     4        0.6107 (   0.00%)      0.6120 (  -0.22%)
Amean     7        0.7923 (   0.00%)      0.7307 (   7.78%)
Amean     12       1.1500 (   0.00%)      1.1480 (   0.17%)
Amean     21       1.7950 (   0.00%)      1.8207 (  -1.43%)
Amean     30       2.3207 (   0.00%)      2.5340 *  -9.19%*
Amean     48       3.5023 (   0.00%)      4.0047 * -14.34%*
Amean     79       4.8093 (   0.00%)      5.3137 * -10.49%*
Amean     110      6.1160 (   0.00%)      6.5287 *  -6.75%*
Amean     141      7.4763 (   0.00%)      7.9553 *  -6.41%*
Amean     172      8.9560 (   0.00%)      9.3977 *  -4.93%*
Amean     203     10.4783 (   0.00%)     11.0160 *  -5.13%*
Amean     234     12.4977 (   0.00%)     13.5510 (  -8.43%)
Amean     265     14.7003 (   0.00%)     15.6950 *  -6.77%*
Amean     296     16.1007 (   0.00%)     17.0860 *  -6.12%*
Processes using pipes are impacted but the variance (not presented) is bad
enough that it's close to noise. These results are not always
reproducible. If executed across multiple reboots, it may show neutral or
small gains so the worst measured results are presented. Hackbench using
sockets is more reliably neutral as the wakeup mechanisms are different
between sockets and pipes.
hackbench-process-sockets
                          6.18-rc1               6.18-rc1
                           vanilla   sched-preemptnext-v4
Amean     1        0.3073 (   0.00%)      0.3187 (  -3.69%)
Amean     4        0.7863 (   0.00%)      0.7850 (   0.17%)
Amean     7        1.3670 (   0.00%)      1.3747 (  -0.56%)
Amean     12       2.1337 (   0.00%)      2.1450 (  -0.53%)
Amean     21       3.4683 (   0.00%)      3.4293 (   1.12%)
Amean     30       4.7247 (   0.00%)      4.8000 (  -1.59%)
Amean     48       7.6097 (   0.00%)      7.6943 (  -1.11%)
Amean     79      14.7957 (   0.00%)     15.3353 (  -3.65%)
Amean     110     21.3413 (   0.00%)     21.5753 (  -1.10%)
Amean     141     29.0503 (   0.00%)     28.8740 (   0.61%)
Amean     172     36.4660 (   0.00%)     35.7207 (   2.04%)
Amean     203     39.7177 (   0.00%)     39.7343 (  -0.04%)
Amean     234     42.1120 (   0.00%)     43.0177 (  -2.15%)
Amean     265     45.7830 (   0.00%)     46.9863 (  -2.63%)
Amean     296     50.7043 (   0.00%)     50.7017 (   0.01%)
As schbench has been mentioned in numerous bugs recently, the results are
interesting.
A test case that represents the default schbench behaviour is:
schbench Wakeup Latency (usec)
                             6.18.0-rc1             6.18.0-rc1
                                vanilla   sched-preemptnext-v4
Amean Wakeup-50th-80       7.17 (   0.00%)        6.00 *  16.28%*
Amean Wakeup-90th-80      46.56 (   0.00%)       19.61 *  57.88%*
Amean Wakeup-99th-80     119.61 (   0.00%)       87.28 *  27.03%*
Amean Wakeup-99.9th-80  3193.78 (   0.00%)      367.00 *  88.51%*
Amean Wakeup-max-80     3874.28 (   0.00%)     3951.39 (  -1.99%)
schbench Requests Per Second (ops/sec)
                         6.18.0-rc1               6.18.0-rc1
                            vanilla   sched-preemptnext-v4r1
Hmean RPS-20th-80    8900.91 (   0.00%)     9167.86 *   3.00%*
Hmean RPS-50th-80    8987.41 (   0.00%)     9213.06 *   2.51%*
Hmean RPS-90th-80    9123.73 (   0.00%)     9263.80 (   1.54%)
Hmean RPS-max-80     9193.50 (   0.00%)     9282.18 (   0.96%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 145 +++++++++++++++++++++++++++++++++++++-------
1 file changed, 123 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..688bf010f444 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +static void set_next_buddy(struct sched_entity *se); + /* * Used by other classes to account runtime. */ @@ -5512,16 +5524,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness.
- */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7030,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se) } } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_SHORT, /* Override current's slice + * protection to allow + * preemption. + */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. + */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follow fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if ((wake_flags & WF_FORK) || pse->sched_delayed) + return PREEMPT_WAKEUP_NONE; + + /* Reschedule if waker is no longer eligible. */ + if (in_task() && !entity_eligible(cfs_rq, se)) + return PREEMPT_WAKEUP_RESCHED; + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The older buddy may be cache cold and completely unrelated + * to the current wakeup but that is unpredictable where as + * obeying the deadline is more in line with EEVDF objectives. 
+ */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NEXT; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if (wake_flags & WF_SYNC) { + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; + + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); + + if ((s64)delta < 0) + delta = 0; + + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. + */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (pse_before && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + + /* * Preempt the current task with a newly woken task if needed: */ @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + preempt_action = PREEMPT_WAKEUP_SHORT; goto preempt; } @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + preempt_action = PREEMPT_WAKEUP_SHORT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. 
+ */ + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se); + } + + switch (preempt_action) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_SHORT: + fallthrough; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. */ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) + if (preempt_action == PREEMPT_WAKEUP_SHORT) cancel_protect_slice(se); resched_curr_lazy(rq); -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
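The WF_SYNC handling in __do_preempt_buddy() above amounts to a small decision table: a sync wakee only forces a reschedule once the waker has accumulated runtime past a threshold, and the threshold shrinks when the tasks are stacked on one runqueue. The following userspace sketch models just that branch for illustration; the enum values, the flag bits and the function name are stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum preempt_wakeup_action {
	PREEMPT_WAKEUP_NONE,	/* No action on the buddy */
	PREEMPT_WAKEUP_NEXT,	/* Leave the decision to the picker */
	PREEMPT_WAKEUP_RESCHED,	/* Plain reschedule */
};

/* Stand-ins for the kernel's wake flags; the values are arbitrary here. */
#define WF_SYNC		0x01
#define WF_RQ_SELECTED	0x02

/*
 * Model of the WF_SYNC throttle: 'delta' is the waker's runtime since
 * exec_start, 'threshold' plays the role of sysctl_sched_migration_cost
 * and 'pse_before' stands in for entity_before(pse, se).
 */
static enum preempt_wakeup_action
sync_wakeup_action(unsigned int wake_flags, int64_t delta,
		   uint64_t threshold, bool pse_before)
{
	if (!(wake_flags & WF_SYNC))
		return PREEMPT_WAKEUP_NEXT;

	if (delta < 0)
		delta = 0;

	/* Stacked tasks: lower the bar before preemption is allowed. */
	if (wake_flags & WF_RQ_SELECTED)
		threshold >>= 2;

	if (pse_before && (uint64_t)delta >= threshold)
		return PREEMPT_WAKEUP_RESCHED;

	return PREEMPT_WAKEUP_NONE;
}
```

With a threshold of 500000ns, a sync wakee with an earlier deadline is only granted a reschedule after the waker has run for 500us, or 125us when WF_RQ_SELECTED indicates stacking.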
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-11-03 14:07 ` Peter Zijlstra 2025-11-03 14:14 ` Peter Zijlstra 2025-11-05 21:48 ` Madadi Vineeth Reddy 1 sibling, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2025-11-03 14:07 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Mon, Nov 03, 2025 at 11:04:45AM +0000, Mel Gorman wrote: > @@ -8725,6 +8725,91 @@ static void set_next_buddy(struct sched_entity *se) > } > } > > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ > + PREEMPT_WAKEUP_SHORT, /* Override current's slice > + * protection to allow > + * preemption. > + */ > + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible > + * before rescheduling. > + */ > + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ > +}; Not really a fan of that comment style. While noodling, I've managed to 'accidentally' rename NEXT to PICK, since its more about letting __pick_eevdf() decide. > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_FORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > + return PREEMPT_WAKEUP_NONE; > + > + /* Reschedule if waker is no longer eligible. */ > + if (in_task() && !entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. 
> + * The older buddy may be cache cold and completely unrelated > + * to the current wakeup but that is unpredictable where as > + * obeying the deadline is more in line with EEVDF objectives. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NEXT; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} This seems to do 3 things: - that reschedule waker on !eligible - set the buddy (while losing NEXT_BUDDY) - the WF_SYNC thing Let's split that out. 
> @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; I'm thinking NONE is the wrong default > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } This was the only NEXT_BUDDY site and none were returned in trade. > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + preempt_action = PREEMPT_WAKEUP_SHORT; > goto preempt; > } > > @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. > */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + preempt_action = PREEMPT_WAKEUP_SHORT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. 
> + */ > + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (preempt_action) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_SHORT: > + fallthrough; (this is redundant) > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > */ > - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY)) > update_protect_slice(cfs_rq, se); > > return; > > preempt: > - if (do_preempt_short) > + if (preempt_action == PREEMPT_WAKEUP_SHORT) > cancel_protect_slice(se); > > resched_curr_lazy(rq); Right, much better. But since I was noddling, how about something like so on top? --- --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8727,23 +8727,16 @@ static void set_next_buddy(struct sched_ } enum preempt_wakeup_action { - PREEMPT_WAKEUP_NONE, /* No action on the buddy */ - PREEMPT_WAKEUP_SHORT, /* Override current's slice - * protection to allow - * preemption. - */ - PREEMPT_WAKEUP_NEXT, /* Check next is most eligible - * before rescheduling. - */ - PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ + PREEMPT_WAKEUP_NONE, /* No preemption. */ + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */ + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */ + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. 
*/ }; -static inline enum preempt_wakeup_action -__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, - struct sched_entity *pse, struct sched_entity *se) +static inline void +set_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) { - bool pse_before; - /* * Ignore wakee preemption on WF_FORK as it is less likely that * there is shared data as exec often follow fork. Do not @@ -8751,11 +8744,7 @@ __do_preempt_buddy(struct rq *rq, struct * EEVDF to forcibly queue an ineligible task. */ if ((wake_flags & WF_FORK) || pse->sched_delayed) - return PREEMPT_WAKEUP_NONE; - - /* Reschedule if waker is no longer eligible. */ - if (in_task() && !entity_eligible(cfs_rq, se)) - return PREEMPT_WAKEUP_RESCHED; + return; /* * Keep existing buddy if the deadline is sooner than pse. @@ -8764,63 +8753,62 @@ __do_preempt_buddy(struct rq *rq, struct * obeying the deadline is more in line with EEVDF objectives. */ if (cfs_rq->next && entity_before(cfs_rq->next, pse)) - return PREEMPT_WAKEUP_NEXT; + return; set_next_buddy(pse); +} - /* - * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not - * strictly enforced because the hint is either misunderstood or - * multiple tasks must be woken up. - */ - pse_before = entity_before(pse, se); - if (wake_flags & WF_SYNC) { - u64 delta = rq_clock_task(rq) - se->exec_start; - u64 threshold = sysctl_sched_migration_cost; - - /* - * WF_SYNC without WF_TTWU is not expected so warn if it - * happens even though it is likely harmless. - */ - WARN_ON_ONCE(!(wake_flags & WF_TTWU)); +/* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. 
+ */ +static inline enum preempt_wakeup_action +preempt_sync(struct rq *rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + u64 delta = rq_clock_task(rq) - se->exec_start; + u64 threshold = sysctl_sched_migration_cost; - if ((s64)delta < 0) - delta = 0; + /* + * WF_SYNC without WF_TTWU is not expected so warn if it + * happens even though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); - /* - * WF_RQ_SELECTED implies the tasks are stacking on a - * CPU when they could run on other CPUs. Reduce the - * threshold before preemption is allowed to an - * arbitrary lower value as it is more likely (but not - * guaranteed) the waker requires the wakee to finish. - */ - if (wake_flags & WF_RQ_SELECTED) - threshold >>= 2; + if ((s64)delta < 0) + delta = 0; - /* - * As WF_SYNC is not strictly obeyed, allow some runtime for - * batch wakeups to be issued. - */ - if (pse_before && delta >= threshold) - return PREEMPT_WAKEUP_RESCHED; + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Reduce the + * threshold before preemption is allowed to an + * arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. + */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; - return PREEMPT_WAKEUP_NONE; - } + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (entity_before(pse, se) && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; - return PREEMPT_WAKEUP_NEXT; + return PREEMPT_WAKEUP_NONE; } - /* * Preempt the current task with a newly woken task if needed: */ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; struct task_struct *donor = rq->donor; struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; if (unlikely(se == pse)) return; @@ -8886,26 +8874,40 @@ static void check_preempt_wakeup_fair(st */ if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { preempt_action = PREEMPT_WAKEUP_SHORT; - } else { - /* - * If @p potentially is completing work required by current then - * consider preemption. - */ - preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, - pse, se); + goto pick; + } + /* + * If @p potentially is completing work required by current + * then consider preemption. + */ + if (in_task() && !entity_eligible(cfs_rq, se)) { + /* Reschedule if waker is no longer eligible. */ + preempt_action = PREEMPT_WAKEUP_RESCHED; + goto preempt; + + } + + if (sched_feat(NEXT_BUDDY)) + set_preempt_buddy(rq, cfs_rq, wake_flags, pse, se); + + if (wake_flags & WF_SYNC) + preempt_action = preempt_sync(rq, wake_flags, pse, se); + switch (preempt_action) { case PREEMPT_WAKEUP_NONE: return; + case PREEMPT_WAKEUP_RESCHED: goto preempt; + case PREEMPT_WAKEUP_SHORT: - fallthrough; - case PREEMPT_WAKEUP_NEXT: - break; + case PREEMPT_WAKEUP_PICK: + goto pick; } +pick: /* * If @p has become the most eligible task, force preemption. */ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 14:07 ` Peter Zijlstra @ 2025-11-03 14:14 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2025-11-03 14:14 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Mon, Nov 03, 2025 at 03:07:11PM +0100, Peter Zijlstra wrote: > > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > > struct sched_entity *se = &donor->se, *pse = &p->se; > > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > > int cse_is_idle, pse_is_idle; > > - bool do_preempt_short = false; > > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; > > I'm thinking NONE is the wrong default That was unfinished, kid interrupted me at an inopportune moment and I lost my train ;-) Anyway, the current code defaults to what will be 'pick'. And I suppose we could make the default depend on WAKEUP_PREEMPT but meh. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman 2025-11-03 14:07 ` Peter Zijlstra @ 2025-11-05 21:48 ` Madadi Vineeth Reddy 2025-11-07 8:53 ` Mel Gorman 1 sibling, 1 reply; 23+ messages in thread From: Madadi Vineeth Reddy @ 2025-11-05 21:48 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel, Madadi Vineeth Reddy Hi Mel, On 03/11/25 16:34, Mel Gorman wrote: > Reimplement NEXT_BUDDY preemption to take into account the deadline and > eligibility of the wakee with respect to the waker. In the event > multiple buddies could be considered, the one with the earliest deadline > is selected. > > Sync wakeups are treated differently to every other type of wakeup. The > WF_SYNC assumption is that the waker promises to sleep in the very near > future. This is violated in enough cases that WF_SYNC should be treated > as a suggestion instead of a contract. If a waker does go to sleep almost > immediately then the delay in wakeup is negligible. In all other cases, > it's throttled based on the accumulated runtime of the waker so there is > a chance that some batched wakeups have been issued before preemption. [..snip..] > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_FORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > + return PREEMPT_WAKEUP_NONE; > + > + /* Reschedule if waker is no longer eligible. 
*/ > + if (in_task() && !entity_eligible(cfs_rq, se)) > + return PREEMPT_WAKEUP_RESCHED; > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The older buddy may be cache cold and completely unrelated > + * to the current wakeup but that is unpredictable where as > + * obeying the deadline is more in line with EEVDF objectives. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NEXT; IIUC, the logic attempts to maintain deadline ordering among buddies, but this doesn't address tasks already on the runqueue. So with frequent wakeups, queued tasks (even with earlier deadlines) may be unfairly delayed. I understand that this would fade away quickly as the woken up task that got to run due to buddy preference would accumulate negative lag and would not be eligible to run again, but the starvation could be higher if wakeups are very high. To test this, I ran schbench (many message and worker threads) together with stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by around 64%. This shows a significant regression for CPU-bound tasks under heavy wakeup loads. Thoughts? I also ran schbench and hackbench. All these were run on a Power11 System containing 4 sockets and 160 CPUs spread across 4 NUMA nodes. schbench(new) 99.0th latency (lower is better) ======== load baseline[pct imp](std%) With patch[pct imp]( std%) 20mt, 10wt 1.00 [ 0.00]( 0.24) 0.97 [ +3.00]( 0.18) 20mt, 20wt 1.00 [ 0.00]( 0.33) 1.00 [ 0.00]( 0.12) 20mt, 40wt 1.00 [ 0.00]( 2.84) 0.76 [ +24.0]( 0.32) 20mt, 80wt 1.00 [ 0.00]( 3.66) 0.66 [ +34.0]( 0.72) 20mt, 160wt 1.00 [ 0.00](12.92) 0.88 [ +12.0]( 6.77) mt=message threads ; wt=worker threads schbench being a wakeup sensitive workload showed good improvement. 
hackbench (lower is better) ======== case load baseline[pct imp](std%) With patch[pct imp]( std%) process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50) process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27) process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78) process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86) process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03) process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67) process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64) process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00) process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21) process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26) process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63) process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48) process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36) process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21) Didn't notice significant improvement or regression in Hackbench. Mostly in the noise range. Thanks, Madadi Vineeth Reddy > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if (wake_flags & WF_SYNC) { > + u64 delta = rq_clock_task(rq) - se->exec_start; > + u64 threshold = sysctl_sched_migration_cost; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it > + * happens even though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); > + > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. 
Reduce the > + * threshold before preemption is allowed to an > + * arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} > + > + > /* > * Preempt the current task with a newly woken task if needed: > */ > @@ -8734,7 +8819,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_NONE; > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8833,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8864,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + preempt_action = PREEMPT_WAKEUP_SHORT; > goto preempt; > } > > @@ -8802,21 +8883,41 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. 
> */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + preempt_action = PREEMPT_WAKEUP_SHORT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + preempt_action = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se); > + } > + > + switch (preempt_action) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_SHORT: > + fallthrough; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > */ > - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY)) > update_protect_slice(cfs_rq, se); > > return; > > preempt: > - if (do_preempt_short) > + if (preempt_action == PREEMPT_WAKEUP_SHORT) > cancel_protect_slice(se); > > resched_curr_lazy(rq); ^ permalink raw reply [flat|nested] 23+ messages in thread
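The buddy-replacement rule quoted in the review above can be stated as a tiny pure function: a newly woken task only becomes cfs_rq->next if the existing buddy does not already have an earlier virtual deadline. A userspace sketch, with an invented struct in place of sched_entity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ent {
	int64_t deadline;	/* virtual deadline; lower expires sooner */
};

/* Analogue of entity_before(): the earlier virtual deadline wins. */
static bool ent_before(const struct ent *a, const struct ent *b)
{
	return (a->deadline - b->deadline) < 0;
}

/*
 * Sketch of the keep-or-replace decision: the existing buddy is kept
 * when its deadline is sooner than the wakee's, otherwise the wakee
 * becomes the next buddy.
 */
static const struct ent *next_buddy(const struct ent *cur_next,
				    const struct ent *pse)
{
	if (cur_next && ent_before(cur_next, pse))
		return cur_next;
	return pse;
}
```

Note this only orders buddies among themselves; as discussed in the thread, tasks already queued are protected by the eligibility check at pick time, not by this comparison.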
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-05 21:48 ` Madadi Vineeth Reddy @ 2025-11-07 8:53 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2025-11-07 8:53 UTC (permalink / raw) To: Madadi Vineeth Reddy Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, linux-kernel On Thu, Nov 06, 2025 at 03:18:30AM +0530, Madadi Vineeth Reddy wrote: > Hi Mel, > > On 03/11/25 16:34, Mel Gorman wrote: > > Reimplement NEXT_BUDDY preemption to take into account the deadline and > > eligibility of the wakee with respect to the waker. In the event > > multiple buddies could be considered, the one with the earliest deadline > > is selected. > > > > Sync wakeups are treated differently to every other type of wakeup. The > > WF_SYNC assumption is that the waker promises to sleep in the very near > > future. This is violated in enough cases that WF_SYNC should be treated > > as a suggestion instead of a contract. If a waker does go to sleep almost > > immediately then the delay in wakeup is negligible. In all other cases, > > it's throttled based on the accumulated runtime of the waker so there is > > a chance that some batched wakeups have been issued before preemption. > > [..snip..] > > > +static inline enum preempt_wakeup_action > > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > > + struct sched_entity *pse, struct sched_entity *se) > > +{ > > + bool pse_before; > > + > > + /* > > + * Ignore wakee preemption on WF_FORK as it is less likely that > > + * there is shared data as exec often follow fork. Do not > > + * preempt for tasks that are sched_delayed as it would violate > > + * EEVDF to forcibly queue an ineligible task. > > + */ > > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > > + return PREEMPT_WAKEUP_NONE; > > + > > + /* Reschedule if waker is no longer eligible. 
*/
> > +	if (in_task() && !entity_eligible(cfs_rq, se))
> > +		return PREEMPT_WAKEUP_RESCHED;
> > +
> > +	/*
> > +	 * Keep existing buddy if the deadline is sooner than pse.
> > +	 * The older buddy may be cache cold and completely unrelated
> > +	 * to the current wakeup but that is unpredictable where as
> > +	 * obeying the deadline is more in line with EEVDF objectives.
> > +	 */
> > +	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> > +		return PREEMPT_WAKEUP_NEXT;
>
> IIUC, the logic attempts to maintain deadline ordering among buddies, but
> this doesn't address tasks already on the runqueue.
>

It's addressed in that a buddy is only selected if it is eligible to
run. Buddies in this context are receiving preferential treatment in
terms of ordering but the other tasks on the queue should not be starved
either.

> So with frequent wakeups, queued tasks (even with earlier deadlines) may be
> unfairly delayed. I understand that this would fade away quickly as the
> woken up task that got to run due to buddy preference would accumulate negative
> lag and would not be eligible to run again, but the starvation could be higher if
> wakeups are very high.
>

They shouldn't get starved as such, only delayed as the buddies become
eligible while other tasks on the runqueue have positive lag.

> To test this, I ran schbench (many message and worker threads) together with
> stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput dropped by
> around 64%.
>

Stress-NG bogo-ops are, by definition, bogus ops. The amount of work
executed depends on the stressor and the timing of when they execute.
Hence, a drop of 64% may or may not matter to the general case because
the drop may be due to a different mix of "operations", some of which
may take 1ms and others that take a minute but are both "1 operation".
I 100% accept that NEXT_BUDDY can affect timings of workloads but
stress-ng is not the best workload to use for performance testing at all
because the bogo-ops metric is, by definition, bogus. There may be a
good reason to revisit if PICK_BUDDY should have been moved to
__pick_eevdf or if PICK_BUDDY should be applied if the slice is not
protected but stress-ng is a terrible workload to justify a decision
either way.

> I also ran schbench and hackbench. All these were run on a Power11 System
> containing 4 sockets and 160 CPUs spread across 4 NUMA nodes.
>
> schbench(new) 99.0th latency (lower is better)
> ========
> load		baseline[pct imp](std%)	With patch[pct imp]( std%)
> 20mt, 10wt	1.00 [ 0.00]( 0.24)	0.97 [ +3.00]( 0.18)
> 20mt, 20wt	1.00 [ 0.00]( 0.33)	1.00 [  0.00]( 0.12)
> 20mt, 40wt	1.00 [ 0.00]( 2.84)	0.76 [ +24.0]( 0.32)
> 20mt, 80wt	1.00 [ 0.00]( 3.66)	0.66 [ +34.0]( 0.72)
> 20mt, 160wt	1.00 [ 0.00](12.92)	0.88 [ +12.0]( 6.77)
>
> mt=message threads ; wt=worker threads
>
> schbench being a wakeup sensitive workload showed good improvement.
>

Good news because NEXT_BUDDY is primarily about prioritising an eligible
wakee over another eligible task to preserve hotness.
> > hackbench (lower is better) > ======== > case load baseline[pct imp](std%) With patch[pct imp]( std%) > process-sockets 1-groups 1.00 [ 0.00]( 5.21) 0.91 [ +9.00]( 5.50) > process-sockets 4-groups 1.00 [ 0.00]( 7.30) 1.01 [ -1.00]( 4.27) > process-sockets 12-groups 1.00 [ 0.00]( 2.44) 1.00 [ 0.00]( 1.78) > process-sockets 30-groups 1.00 [ 0.00]( 2.05) 1.04 [ -4.00]( 0.86) > process-sockets 48-groups 1.00 [ 0.00]( 2.25) 1.04 [ -4.00]( 1.03) > process-sockets 79-groups 1.00 [ 0.00]( 2.28) 1.05 [ -5.00]( 1.67) > process-sockets 110-groups 1.00 [ 0.00]( 11.17) 1.04 [ -4.00]( 8.64) > > process-pipe 1-groups 1.00 [ 0.00]( 8.21) 0.84 [+16.00](13.00) > process-pipe 4-groups 1.00 [ 0.00]( 5.54) 0.95 [ +5.00]( 4.21) > process-pipe 12-groups 1.00 [ 0.00]( 3.96) 1.04 [ -4.00]( 2.26) > process-pipe 30-groups 1.00 [ 0.00]( 7.64) 1.20 [ -20.0]( 3.63) > process-pipe 48-groups 1.00 [ 0.00]( 6.28) 1.04 [ -4.00]( 8.48) > process-pipe 79-groups 1.00 [ 0.00]( 6.19) 1.01 [ -1.00]( 4.36) > process-pipe 110-groups 1.00 [ 0.00]( 10.23) 0.94 [ +6.00]( 5.21) > > Didn't notice significant improvement or regression in Hackbench. Mostly in the noise > range. > Expected for hackbench because the degree of overload is so generally high and cache hotness has limited benefit for it as so little data is shared. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 23+ messages in thread
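The eligibility argument in the reply above (buddies are only picked while their lag is non-negative) can be made concrete. Under EEVDF, an entity is eligible when its vruntime does not exceed the load-weighted average vruntime of the queue. The sketch below illustrates that definition with an invented data layout; the kernel's entity_eligible() performs the equivalent comparison incrementally on cfs_rq state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct entity {
	int64_t vruntime;
	int64_t weight;
};

/*
 * Eligible iff vruntime <= weighted average vruntime, i.e. lag >= 0.
 * Multiply through by the total weight to avoid division.
 */
static bool entity_eligible_sketch(const struct entity *queue, int nr,
				   const struct entity *se)
{
	int64_t wsum = 0, w = 0;

	for (int i = 0; i < nr; i++) {
		wsum += queue[i].vruntime * queue[i].weight;
		w += queue[i].weight;
	}

	return w > 0 && se->vruntime * w <= wsum;
}
```

A task that overruns via buddy preference pushes its vruntime past the average, fails this test, and cannot be picked again until the average catches up, which is why the expected effect is delay rather than starvation.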
* [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2
@ 2025-10-21 14:28 Mel Gorman
  2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
	Valentin Schneider, Chris Mason, Mel Gorman

I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as

1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads

Taking topology into account means that there is a lot of
machine-specific behaviour, which may explain why some recent
discussions have had reproduction problems. Nevertheless, the removal of
LAST_BUDDY and the disabling of NEXT_BUDDY have an impact on wakeup
latencies. This RFC is to determine whether it is a valid approach to
prefer selecting a wakee if it is eligible to run even though other,
unrelated tasks are more eligible.

 kernel/sched/fair.c     | 131 ++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |   2 +-
 2 files changed, 112 insertions(+), 21 deletions(-)

--
2.51.0

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
@ 2025-10-21 14:28 ` Mel Gorman
  2025-10-23  6:29   ` K Prateek Nayak
  0 siblings, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2025-10-21 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason, Mel Gorman

Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected.

Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run.

While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a poor metric but it is the default one. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Hmean     1     1268.80 (  0.00%)     1285.93 *   1.35%*
Hmean     4     3971.74 (  0.00%)     3851.10 *  -3.04%*
Hmean     7     5548.23 (  0.00%)     5507.07 (  -0.74%)
Hmean     12    7310.86 (  0.00%)     7205.37 (  -1.44%)
Hmean     21    8874.53 (  0.00%)     9193.66 *   3.60%*
Hmean     30    9361.93 (  0.00%)    10552.03 *  12.71%*
Hmean     48    9540.14 (  0.00%)    11936.22 *  25.12%*
Hmean     79    9208.74 (  0.00%)    12367.06 *  34.30%*
Hmean     110   8573.12 (  0.00%)    12114.01 *  41.30%*
Hmean     141   7791.33 (  0.00%)    11391.60 *  46.21%*
Hmean     160   7666.60 (  0.00%)    10789.23 *  40.73%*

As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Amean     1       14.62 (  0.00%)       14.21 (   2.80%)
Amean     4       18.76 (  0.00%)       19.45 (  -3.67%)
Amean     7       23.71 (  0.00%)       23.82 (  -0.48%)
Amean     12      31.25 (  0.00%)       31.78 (  -1.69%)
Amean     21      45.12 (  0.00%)       43.53 (   3.52%)
Amean     30      61.07 (  0.00%)       54.13 (  11.37%)
Amean     48      95.91 (  0.00%)       76.51 (  20.23%)
Amean     79     163.38 (  0.00%)      121.46 (  25.66%)
Amean     110    243.91 (  0.00%)      172.56 (  29.25%)
Amean     141    343.47 (  0.00%)      236.07 (  31.27%)
Amean     160    401.15 (  0.00%)      283.64 (  29.29%)
Stddev    1        0.52 (  0.00%)        0.44 (  15.50%)
Stddev    4        1.36 (  0.00%)        1.42 (  -4.91%)
Stddev    7        1.88 (  0.00%)        1.84 (   2.27%)
Stddev    12       3.06 (  0.00%)        2.55 (  16.57%)
Stddev    21       5.78 (  0.00%)        3.70 (  35.90%)
Stddev    30       9.85 (  0.00%)        5.10 (  48.26%)
Stddev    48      22.31 (  0.00%)        8.30 (  62.79%)
Stddev    79      35.96 (  0.00%)       16.79 (  53.31%)
Stddev    110     59.04 (  0.00%)       30.08 (  49.04%)
Stddev    141     85.38 (  0.00%)       47.05 (  44.89%)
Stddev    160     96.38 (  0.00%)       47.27 (  50.95%)

That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile.
This one is tricky because dbench makes no effort to synchronise clients, so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                     6.18-rc1              6.18-rc1
                      vanilla  sched-preemptnext-v2
Amean     1       14.93 (  0.00%)       14.91 (   0.11%)
Amean     4      348.88 (  0.00%)      277.06 (  20.59%)
Amean     7      722.94 (  0.00%)      991.70 ( -37.18%)
Amean     12    2055.72 (  0.00%)     2684.48 ( -30.59%)
Amean     21    4393.85 (  0.00%)     2625.79 (  40.24%)
Amean     30    6119.84 (  0.00%)     2491.15 (  59.29%)
Amean     48   20600.85 (  0.00%)     6717.61 (  67.39%)
Amean     79   22677.38 (  0.00%)    21866.80 (   3.57%)
Amean     110  35937.71 (  0.00%)    22517.63 (  37.34%)
Amean     141  25104.66 (  0.00%)    29897.08 ( -19.09%)
Amean     160  23843.74 (  0.00%)    23106.66 (   3.09%)
Stddev    1        0.50 (  0.00%)        0.46 (   6.67%)
Stddev    4      201.33 (  0.00%)      130.13 (  35.36%)
Stddev    7      471.94 (  0.00%)      641.69 ( -35.97%)
Stddev    12    1401.94 (  0.00%)     1750.14 ( -24.84%)
Stddev    21    2519.12 (  0.00%)     1416.77 (  43.76%)
Stddev    30    3469.05 (  0.00%)     1293.37 (  62.72%)
Stddev    48   11521.49 (  0.00%)     3846.34 (  66.62%)
Stddev    79   12849.21 (  0.00%)    12275.89 (   4.46%)
Stddev    110  20362.88 (  0.00%)    12989.46 (  36.21%)
Stddev    141  13768.42 (  0.00%)    17108.34 ( -24.26%)
Stddev    160  13196.34 (  0.00%)    13029.75 (   1.26%)

This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but still important: it's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench.
hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.2657 ( 0.00%) 0.2180 ( 17.94%) Amean 4 0.6107 ( 0.00%) 0.5237 ( 14.25%) Amean 7 0.7923 ( 0.00%) 0.7357 ( 7.15%) Amean 12 1.1500 ( 0.00%) 1.1210 ( 2.52%) Amean 21 1.7950 ( 0.00%) 1.8220 ( -1.50%) Amean 30 2.3207 ( 0.00%) 2.5337 * -9.18%* Amean 48 3.5023 ( 0.00%) 4.0057 * -14.37%* Amean 79 4.8093 ( 0.00%) 5.1827 * -7.76%* Amean 110 6.1160 ( 0.00%) 6.5667 * -7.37%* Amean 141 7.4763 ( 0.00%) 7.9413 * -6.22%* Amean 172 8.9560 ( 0.00%) 9.5543 * -6.68%* Amean 203 10.4783 ( 0.00%) 11.3620 * -8.43%* Amean 234 12.4977 ( 0.00%) 13.6183 ( -8.97%) Amean 265 14.7003 ( 0.00%) 15.3720 * -4.57%* Amean 296 16.1007 ( 0.00%) 17.2463 * -7.12%* Processes using pipes are impacted but the variance (not presented) is bad enough that it's close to noise. These results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.3073 ( 0.00%) 0.3333 ( -8.46%) Amean 4 0.7863 ( 0.00%) 0.7350 ( 6.53%) Amean 7 1.3670 ( 0.00%) 1.3580 ( 0.66%) Amean 12 2.1337 ( 0.00%) 2.1320 ( 0.08%) Amean 21 3.4683 ( 0.00%) 3.4677 ( 0.02%) Amean 30 4.7247 ( 0.00%) 4.8200 ( -2.02%) Amean 48 7.6097 ( 0.00%) 7.8383 ( -3.00%) Amean 79 14.7957 ( 0.00%) 15.2863 ( -3.32%) Amean 110 21.3413 ( 0.00%) 21.7297 ( -1.82%) Amean 141 29.0503 ( 0.00%) 29.1197 ( -0.24%) Amean 172 36.4660 ( 0.00%) 36.3173 ( 0.41%) Amean 203 39.7177 ( 0.00%) 40.2843 ( -1.43%) Amean 234 42.1120 ( 0.00%) 43.8480 ( -4.12%) Amean 265 45.7830 ( 0.00%) 48.1233 ( -5.11%) Amean 296 50.7043 ( 0.00%) 54.2533 ( -7.00%) As schbench has been mentioned in numerous bugs recently, the results are interesting. 
A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v2r1 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 * 16.28%* Amean Wakeup-90th-80 46.56 ( 0.00%) 20.56 * 55.85%* Amean Wakeup-99th-80 119.61 ( 0.00%) 88.83 ( 25.73%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 339.78 * 89.36%* Amean Wakeup-max-80 3874.28 ( 0.00%) 3942.06 ( -1.75%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v2r1 Hmean RPS-20th-80 8900.91 ( 0.00%) 9148.18 * 2.78%* Hmean RPS-50th-80 8987.41 ( 0.00%) 9199.72 ( 2.36%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9233.56 ( 1.20%) Hmean RPS-max-80 9193.50 ( 0.00%) 9249.84 ( 0.61%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 131 +++++++++++++++++++++++++++++++++++++------- 1 file changed, 111 insertions(+), 20 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc0b7ce8a65d..26413062009b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -955,6 +955,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1193,6 +1203,82 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No action on the buddy */ + PREEMPT_WAKEUP_NEXT, /* Check next is most eligible + * before rescheduling. 
+ */ + PREEMPT_WAKEUP_RESCHED, /* Plain reschedule */ +}; + +static void set_next_buddy(struct sched_entity *se); + +static inline enum preempt_wakeup_action +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se, + s64 delta) +{ + bool pse_before; + + /* + * Ignore wakee preemption on WF_WORK as it is less likely that + * there is shared data as exec often follow fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if (!sched_feat(NEXT_BUDDY) || + (wake_flags & WF_FORK) || + (pse->sched_delayed)) { + return PREEMPT_WAKEUP_NONE; + } + + /* Reschedule if waker is no longer eligible. */ + if (!entity_eligible(cfs_rq, se)) { + resched_curr_lazy(rq); + return PREEMPT_WAKEUP_RESCHED; + } + + /* + * Keep existing buddy if the deadline is sooner than pse. + * The downside is that the older buddy may be cache cold + * but that is unpredictable where as an earlier deadline + * is absolute. + */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return PREEMPT_WAKEUP_NONE; + + set_next_buddy(pse); + + /* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. + */ + pse_before = entity_before(pse, se); + if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) { + /* + * WF_RQ_SELECTED implies the tasks are stacking on a + * CPU when they could run on other CPUs. Only consider + * reschedule if pse deadline expires before se. + */ + if ((wake_flags & WF_RQ_SELECTED) && + delta < sysctl_sched_migration_cost && pse_before) { + return PREEMPT_WAKEUP_NONE; + } + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for + * batch wakeups to be issued. 
+ */ + if (pse_before && delta >= sysctl_sched_migration_cost) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; + } + + return PREEMPT_WAKEUP_NEXT; +} + /* * Used by other classes to account runtime. */ @@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness. - */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; + s64 delta; if (unlikely(se == pse)) return; @@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. @@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. 
*/ - do_preempt_short = true; + do_preempt_short = PREEMPT_WAKEUP_NEXT; goto preempt; } @@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int return; cfs_rq = cfs_rq_of(se); + delta = rq_clock_task(rq) - se->exec_start; update_curr(cfs_rq); /* * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + do_preempt_short = PREEMPT_WAKEUP_NEXT; + } else { + /* + * If @p potentially is completing work required by current then + * consider preemption. + */ + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, + pse, se, delta); + } + + switch (do_preempt_short) { + case PREEMPT_WAKEUP_NONE: + goto update_slice; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_NEXT: + break; + } /* * If @p has become the most eligible task, force preemption. @@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) +update_slice: + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) update_protect_slice(cfs_rq, se); return; -- 2.51.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman @ 2025-10-23 6:29 ` K Prateek Nayak 0 siblings, 0 replies; 23+ messages in thread From: K Prateek Nayak @ 2025-10-23 6:29 UTC (permalink / raw) To: Mel Gorman, linux-kernel Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Chris Mason Hello Mel, On 10/21/2025 7:58 PM, Mel Gorman wrote: > +static inline enum preempt_wakeup_action > +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se, > + s64 delta) "delta" is only used within __do_preempt_buddy(). Might as well compute it here. > +{ > + bool pse_before; > + > + /* > + * Ignore wakee preemption on WF_WORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if (!sched_feat(NEXT_BUDDY) || > + (wake_flags & WF_FORK) || > + (pse->sched_delayed)) { > + return PREEMPT_WAKEUP_NONE; > + } > + > + /* Reschedule if waker is no longer eligible. */ > + if (!entity_eligible(cfs_rq, se)) { > + resched_curr_lazy(rq); This is unnecessary since PREEMPT_WAKEUP_RESCHED case already does a "goto preempt" which does a resched_curr_lazy(). > + return PREEMPT_WAKEUP_RESCHED; Should we update the next buddy before returning here if entity_before(pse, cfs_rq->next)? > + } > + > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The downside is that the older buddy may be cache cold > + * but that is unpredictable where as an earlier deadline > + * is absolute. 
> + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return PREEMPT_WAKEUP_NONE; > + > + set_next_buddy(pse); > + > + /* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. > + */ > + pse_before = entity_before(pse, se); > + if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) { Do we have !TTWU cases that set WF_SYNC? > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a > + * CPU when they could run on other CPUs. Only consider > + * reschedule if pse deadline expires before se. > + */ Code doesn't seem to match that comment. We are foregoing preemption if the current task has run for less than "sysctl_sched_migration_cost" even on pse_before. Also. looking at the comment in check_preempt_wakeup_fair(): If @p potentially is completing work required by current then consider preemption. Shouldn't this check if "se" is indeed the waker task? Despite WF_SYNC, wake_affine() can still return the CPU where @p was previously running which now might be running another unrelated task. > + if ((wake_flags & WF_RQ_SELECTED) && > + delta < sysctl_sched_migration_cost && pse_before) { > + return PREEMPT_WAKEUP_NONE; > + } Above checks seems unnecessary since we return "PREEMPT_WAKEUP_NONE" if we don't enter the below condition. The two cannot have any overlap based on my reading. > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for > + * batch wakeups to be issued. > + */ > + if (pse_before && delta >= sysctl_sched_migration_cost) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > + } > + > + return PREEMPT_WAKEUP_NEXT; > +} > + > /* > * Used by other classes to account runtime. > */ > @@ -5512,16 +5598,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) > { > struct sched_entity *se; > > - /* > - * Picking the ->next buddy will affect latency but not fairness. 
> - */ > - if (sched_feat(PICK_BUDDY) && > - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { > - /* ->next will never be delayed */ > - WARN_ON_ONCE(cfs_rq->next->sched_delayed); > - return cfs_rq->next; > - } > - > se = pick_eevdf(cfs_rq); > if (se->sched_delayed) { > dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); > @@ -7028,8 +7104,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > } > > -static void set_next_buddy(struct sched_entity *se); > - > /* > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > * failing half-way through and resume the dequeue later. > @@ -8734,7 +8808,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > + enum preempt_wakeup_action do_preempt_short = PREEMPT_WAKEUP_NONE; > + s64 delta; > > if (unlikely(se == pse)) > return; > @@ -8748,10 +8823,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. > @@ -8783,7 +8854,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. 
> */ > - do_preempt_short = true; > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > goto preempt; > } > > @@ -8797,12 +8868,31 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > return; > > cfs_rq = cfs_rq_of(se); > + delta = rq_clock_task(rq) - se->exec_start; > update_curr(cfs_rq); > /* > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. > */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + do_preempt_short = PREEMPT_WAKEUP_NEXT; > + } else { > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + */ > + do_preempt_short = __do_preempt_buddy(rq, cfs_rq, wake_flags, > + pse, se, delta); > + } > + > + switch (do_preempt_short) { > + case PREEMPT_WAKEUP_NONE: > + goto update_slice; You can just return from here since update_protect_slice() is only done on "do_preempt_short != PREEMPT_WAKEUP_NONE". With that, the "update_slice" label becomes unnecessary ... > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_NEXT: > + break; > + } > > /* > * If @p has become the most eligible task, force preemption. > @@ -8810,7 +8900,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > +update_slice: > + if (sched_feat(RUN_TO_PARITY) && do_preempt_short != PREEMPT_WAKEUP_NONE) ... and since do_preempt_short can only be "PREEMPT_WAKEUP_NEXT" which is non-zero, changes in this hunk can be dropped. > update_protect_slice(cfs_rq, se); > > return; -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  [not found] <20250714134429.19624-1-mgorman@techsingularity.net>
@ 2025-07-14 13:44 ` Mel Gorman
  0 siblings, 0 replies; 23+ messages in thread
From: Mel Gorman @ 2025-07-14 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider, Mel Gorman

Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected.

Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a mild suggestion instead of a hard rule. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In all other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run.

While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; it is a terrible metric for dbench but it is the default one reported. The test machine is a 2-socket CascadeLake machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems.
dbench4 Throughput (misleading but traditional)
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Hmean     1     1286.83 (  0.00%)     1281.73 (  -0.40%)
Hmean     4     4017.50 (  0.00%)     3934.85 *  -2.06%*
Hmean     7     5536.45 (  0.00%)     5453.55 *  -1.50%*
Hmean     12    7251.59 (  0.00%)     7217.25 (  -0.47%)
Hmean     21    8957.92 (  0.00%)     9188.07 (   2.57%)
Hmean     30    9403.41 (  0.00%)    10523.72 *  11.91%*
Hmean     48    9320.12 (  0.00%)    11496.27 *  23.35%*
Hmean     79    8962.30 (  0.00%)    11555.71 *  28.94%*
Hmean     110   8066.52 (  0.00%)    11307.26 *  40.18%*
Hmean     141   7605.20 (  0.00%)    10622.52 *  39.67%*
Hmean     160   7422.56 (  0.00%)    10250.78 *  38.10%*

As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds.

dbench4 Loadfile Execution Time
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1       14.35 (  0.00%)       14.27 (   0.57%)
Amean     4       18.58 (  0.00%)       19.01 (  -2.35%)
Amean     7       23.83 (  0.00%)       24.18 (  -1.48%)
Amean     12      31.59 (  0.00%)       31.77 (  -0.55%)
Amean     21      44.65 (  0.00%)       43.44 (   2.71%)
Amean     30      60.73 (  0.00%)       54.21 (  10.74%)
Amean     48      98.25 (  0.00%)       79.41 (  19.17%)
Amean     79     168.34 (  0.00%)      130.06 (  22.74%)
Amean     110    261.03 (  0.00%)      185.04 (  29.11%)
Amean     141    353.98 (  0.00%)      251.55 (  28.94%)
Amean     160    410.66 (  0.00%)      296.87 (  27.71%)
Stddev    1        0.51 (  0.00%)        0.48 (   6.67%)
Stddev    4        1.14 (  0.00%)        1.21 (  -6.78%)
Stddev    7        1.63 (  0.00%)        1.58 (   3.12%)
Stddev    12       2.62 (  0.00%)        2.38 (   9.05%)
Stddev    21       5.21 (  0.00%)        3.87 (  25.70%)
Stddev    30      10.03 (  0.00%)        6.65 (  33.65%)
Stddev    48      22.31 (  0.00%)       12.26 (  45.05%)
Stddev    79      41.14 (  0.00%)       29.11 (  29.25%)
Stddev    110     70.55 (  0.00%)       47.71 (  32.38%)
Stddev    141     98.12 (  0.00%)       66.83 (  31.89%)
Stddev    160    139.37 (  0.00%)       67.73 (  51.40%)

That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile.
This one is tricky because dbench makes no effort to synchronise clients, so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results.

dbench4 All Clients Loadfile Execution Time
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1       14.93 (  0.00%)       14.91 (   0.11%)
Amean     4      348.88 (  0.00%)      277.06 (  20.59%)
Amean     7      722.94 (  0.00%)      991.70 ( -37.18%)
Amean     12    2055.72 (  0.00%)     2684.48 ( -30.59%)
Amean     21    4393.85 (  0.00%)     2625.79 (  40.24%)
Amean     30    6119.84 (  0.00%)     2491.15 (  59.29%)
Amean     48   20600.85 (  0.00%)     6717.61 (  67.39%)
Amean     79   22677.38 (  0.00%)    21866.80 (   3.57%)
Amean     110  35937.71 (  0.00%)    22517.63 (  37.34%)
Amean     141  25104.66 (  0.00%)    29897.08 ( -19.09%)
Amean     160  23843.74 (  0.00%)    23106.66 (   3.09%)
Stddev    1        0.50 (  0.00%)        0.46 (   6.67%)
Stddev    4      201.33 (  0.00%)      130.13 (  35.36%)
Stddev    7      471.94 (  0.00%)      641.69 ( -35.97%)
Stddev    12    1401.94 (  0.00%)     1750.14 ( -24.84%)
Stddev    21    2519.12 (  0.00%)     1416.77 (  43.76%)
Stddev    30    3469.05 (  0.00%)     1293.37 (  62.72%)
Stddev    48   11521.49 (  0.00%)     3846.34 (  66.62%)
Stddev    79   12849.21 (  0.00%)    12275.89 (   4.46%)
Stddev    110  20362.88 (  0.00%)    12989.46 (  36.21%)
Stddev    141  13768.42 (  0.00%)    17108.34 ( -24.26%)
Stddev    160  13196.34 (  0.00%)    13029.75 (   1.26%)

This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but still important: it's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench.
hackbench-process-pipes
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1        0.2183 (  0.00%)      0.2223 (  -1.83%)
Amean     4        0.5780 (  0.00%)      0.5413 (   6.34%)
Amean     7        0.7727 (  0.00%)      0.7093 (   8.20%)
Amean     12       1.1220 (  0.00%)      1.1170 (   0.45%)
Amean     21       1.7470 (  0.00%)      1.7713 (  -1.39%)
Amean     30       2.2940 (  0.00%)      2.6957 * -17.51%*
Amean     48       3.7337 (  0.00%)      4.1003 *  -9.82%*
Amean     79       4.9310 (  0.00%)      5.1417 *  -4.27%*
Amean     110      6.1800 (  0.00%)      6.5370 *  -5.78%*
Amean     141      7.5737 (  0.00%)      8.0060 *  -5.71%*
Amean     172      9.0820 (  0.00%)      9.4767 *  -4.35%*
Amean     203     10.6053 (  0.00%)     10.8870 (  -2.66%)
Amean     234     12.3380 (  0.00%)     13.1290 *  -6.41%*
Amean     265     14.5900 (  0.00%)     15.3547 *  -5.24%*
Amean     296     16.1937 (  0.00%)     17.1533 *  -5.93%*

Processes using pipes are impacted and it's outside the noise as the coefficient of variance is roughly 3%. These results are not always reproducible. If executed across multiple reboots, the test may show neutral or small gains, so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes.

hackbench-process-sockets
                   6.16.0-rc5            6.16.0-rc5
                      vanilla  sched-preemptnext-v1r8
Amean     1        0.3217 (  0.00%)      0.3053 (   5.08%)
Amean     4        0.8967 (  0.00%)      0.9007 (  -0.45%)
Amean     7        1.4780 (  0.00%)      1.5067 (  -1.94%)
Amean     12       2.1977 (  0.00%)      2.2693 (  -3.26%)
Amean     21       3.4983 (  0.00%)      3.6667 *  -4.81%*
Amean     30       4.9270 (  0.00%)      5.1207 *  -3.93%*
Amean     48       7.6250 (  0.00%)      7.9667 *  -4.48%*
Amean     79      15.7477 (  0.00%)     15.4177 (   2.10%)
Amean     110     21.8070 (  0.00%)     21.9563 (  -0.68%)
Amean     141     29.4813 (  0.00%)     29.2327 (   0.84%)
Amean     172     36.7433 (  0.00%)     35.9043 (   2.28%)
Amean     203     40.8823 (  0.00%)     40.3467 (   1.31%)
Amean     234     43.1627 (  0.00%)     43.0343 (   0.30%)
Amean     265     49.6417 (  0.00%)     49.9030 (  -0.53%)
Amean     296     51.3137 (  0.00%)     51.9310 (  -1.20%)

At the time of writing, other tests are still running but most are either neutral or show relatively small gains.
In general, the other workloads are less wakeup-intensive than dbench or
hackbench.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 106 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..62fa036b0c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -936,6 +936,16 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1205,6 +1215,83 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
 	return false;
 }
 
+enum preempt_buddy_action {
+	PREEMPT_BUDDY_NONE,		/* No action on the buddy */
+	PREEMPT_BUDDY_NEXT,		/* Check next is most eligible
+					 * before rescheduling.
+					 */
+	PREEMPT_BUDDY_RESCHED,		/* Plain reschedule */
+	PREEMPT_BUDDY_IMMEDIATE		/* Remove slice protection
+					 * and reschedule
+					 */
+};
+
+static void set_next_buddy(struct sched_entity *se);
+
+static inline enum preempt_buddy_action
+do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
+		 struct sched_entity *pse, struct sched_entity *se,
+		 s64 delta, bool did_short)
+{
+	bool pse_before, pse_eligible;
+
+	if (!sched_feat(NEXT_BUDDY) ||
+	    (wake_flags & WF_FORK) ||
+	    (pse->sched_delayed)) {
+		BUILD_BUG_ON(PREEMPT_BUDDY_NONE + 1 != PREEMPT_BUDDY_NEXT);
+		return PREEMPT_BUDDY_NONE + did_short;
+	}
+
+	/* Reschedule if waker is no longer eligible */
+	if (!entity_eligible(cfs_rq, se))
+		return PREEMPT_BUDDY_RESCHED;
+
+	/* Keep existing buddy if the deadline is sooner than pse */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return PREEMPT_BUDDY_NONE;
+
+	set_next_buddy(pse);
+	pse_before = entity_before(pse, se);
+	pse_eligible = entity_eligible(cfs_rq, pse);
+
+	/*
+	 * WF_SYNC implies that waker will sleep soon but it is not enforced
+	 * because the hint is often abused or misunderstood.
+	 */
+	if ((wake_flags & (WF_TTWU|WF_SYNC)) == (WF_TTWU|WF_SYNC)) {
+		/*
+		 * WF_RQ_SELECTED implies the tasks are stacking on a
+		 * CPU. Only consider reschedule if pse deadline expires
+		 * before se.
+		 */
+		if ((wake_flags & WF_RQ_SELECTED) &&
+		    delta < sysctl_sched_migration_cost) {
+
+			if (!pse_before)
+				return PREEMPT_BUDDY_NONE;
+
+			/* Fall through to pse deadline. */
+		}
+
+		/*
+		 * Reschedule if pse's deadline is sooner and there is a chance
+		 * that some wakeup batching has completed.
+		 */
+		if (pse_before &&
+		    delta >= (sysctl_sched_migration_cost >> 6)) {
+			return PREEMPT_BUDDY_IMMEDIATE;
+		}
+
+		return PREEMPT_BUDDY_NONE;
+	}
+
+	/* Check eligibility of buddy to start now. */
+	if (pse_before && pse_eligible)
+		return PREEMPT_BUDDY_IMMEDIATE;
+
+	return PREEMPT_BUDDY_NEXT;
+}
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5589,16 +5676,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -7056,8 +7133,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8767,6 +8842,8 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
+	bool did_short;
+	s64 delta;
 
 	if (unlikely(se == pse))
 		return;
@@ -8780,10 +8857,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8829,6 +8902,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		return;
 
 	cfs_rq = cfs_rq_of(se);
+	delta = rq_clock_task(rq) - se->exec_start;
 	update_curr(cfs_rq);
 	/*
 	 * If @p has a shorter slice than current and @p is eligible, override
@@ -8837,9 +8911,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * Note that even if @p does not turn out to be the most eligible
 	 * task at this moment, current's slice protection will be lost.
 	 */
-	if (do_preempt_short(cfs_rq, pse, se))
+	did_short = do_preempt_short(cfs_rq, pse, se);
+	if (did_short)
 		cancel_protect_slice(se);
 
+	switch (do_preempt_buddy(rq, cfs_rq, wake_flags, pse, se, delta, did_short)) {
+	case PREEMPT_BUDDY_NONE:
+		return;
+		break;
+	case PREEMPT_BUDDY_IMMEDIATE:
+		cancel_protect_slice(se);
+		fallthrough;
+	case PREEMPT_BUDDY_RESCHED:
+		goto preempt;
+		break;
+	case PREEMPT_BUDDY_NEXT:
+		break;
+	}
+
 	/*
 	 * If @p has become the most eligible task, force preemption.
 	 */
-- 
2.43.0
Thread overview: 23+ messages
[not found] <20251027133915.4103633-1-mgorman@techsingularity.net>
2025-10-27 13:39 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
2025-10-28 14:37 ` Peter Zijlstra
2025-10-27 13:39 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-10-28 15:05 ` Peter Zijlstra
2025-10-31 9:46 ` Mel Gorman
2025-10-28 15:09 ` Peter Zijlstra
2025-10-31 9:48 ` Mel Gorman
2025-10-28 15:33 ` Peter Zijlstra
2025-10-28 15:47 ` Peter Zijlstra
2025-10-30 9:10 ` Peter Zijlstra
2025-10-31 10:27 ` Mel Gorman
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:13 ` Peter Zijlstra
-- strict thread matches above, loose matches on Subject: below --
2025-11-03 11:04 [PATCH 0/2 v4] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
2025-11-03 11:04 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-11-03 14:07 ` Peter Zijlstra
2025-11-03 14:14 ` Peter Zijlstra
2025-11-05 21:48 ` Madadi Vineeth Reddy
2025-11-07 8:53 ` Mel Gorman
2025-10-21 14:28 [RFC PATCH 0/2] Reintroduce NEXT_BUDDY for EEVDF v2 Mel Gorman
2025-10-21 14:28 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Mel Gorman
2025-10-23 6:29 ` K Prateek Nayak
[not found] <20250714134429.19624-1-mgorman@techsingularity.net>
2025-07-14 13:44 ` Mel Gorman