* [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
@ 2025-11-12 12:25 Mel Gorman
From: Mel Gorman @ 2025-11-12 12:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman
Changes since v4
o Split out decisions into separate functions (peterz)
o Flow clarity (peterz)
Changes since v3
o Place new code near first consumer (peterz)
o Separate PREEMPT_SHORT and NEXT_BUDDY (peterz)
o Naming and code flow clarity (peterz)
o Restore slice protection (peterz)
Changes since v2
o Review feedback applied from Prateek
I've been chasing down a number of scheduler issues recently, like many
others, and found they were broadly grouped as
1. Failure to boost CPU frequency with powersave/ondemand governors
2. Processors entering idle states that are too deep
3. Differences in wakeup latencies for wakeup-intensive workloads
Taking topology into account means that there is a lot of machine-specific
behaviour, which may explain why some recent discussions have had
reproduction problems. Nevertheless, the removal of LAST_BUDDY and the
disabling of NEXT_BUDDY have an impact on wakeup latencies.
This series enables NEXT_BUDDY so that a wakee may be selected if it is
eligible to run even though other, unrelated tasks may have an earlier deadline.
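For illustration, the selection rule can be sketched outside the kernel. This
is a minimal userspace model only, not kernel code: struct entity, eligible()
and pick() are invented stand-ins for the kernel's sched_entity,
entity_eligible() and the buddy check in __pick_eevdf(), and eligibility is
simplified to an equal-weight vruntime average:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of a runnable entity; not the kernel's structure. */
struct entity {
        long vruntime;  /* virtual runtime consumed so far */
        long deadline;  /* virtual deadline under EEVDF */
};

/* An entity is eligible if its vruntime is not ahead of the
 * (equal-weight) average vruntime of the runqueue. */
static int eligible(const struct entity *tasks, int n, const struct entity *e)
{
        long sum = 0;

        for (int i = 0; i < n; i++)
                sum += tasks[i].vruntime;
        /* e->vruntime * n <= sum  <=>  e->vruntime <= average */
        return e->vruntime * n <= sum;
}

/* Prefer the next buddy if it is set and eligible, even when another
 * task has an earlier deadline; otherwise fall back to the core EEVDF
 * rule of the earliest-deadline eligible entity. */
static const struct entity *
pick(const struct entity *tasks, int n, const struct entity *next_buddy)
{
        const struct entity *best = NULL;

        if (next_buddy && eligible(tasks, n, next_buddy))
                return next_buddy;

        for (int i = 0; i < n; i++) {
                if (!eligible(tasks, n, &tasks[i]))
                        continue;
                if (!best || tasks[i].deadline < best->deadline)
                        best = &tasks[i];
        }
        return best;
}
```

The key property is that preferring an eligible buddy affects latency but not
fairness: an ineligible buddy is never forced in ahead of its turn.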
Mel Gorman (2):
sched/fair: Enable scheduler feature NEXT_BUDDY
sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++------
kernel/sched/features.h | 2 +-
2 files changed, 131 insertions(+), 23 deletions(-)
--
2.51.0
* [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY
From: Mel Gorman @ 2025-11-12 12:25 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
    Chris Mason, Madadi Vineeth Reddy, linux-kernel, Mel Gorman

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data, but it also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good
results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit
5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear,
but as vruntime spread is mentioned, the expectation is that NEXT_BUDDY
had an impact on overall fairness. It was not noted why LAST_BUDDY was
removed, but it is assumed that it's very difficult to reason about what
LAST_BUDDY's correct and effective behaviour should be while still
respecting EEVDF's goals. Peter Zijlstra noted during review:

	I think I was just struggling to make sense of things and
	figured less is more and axed it. I have vague memories trying
	to work through the dynamics of a wakeup-stack and the EEVDF
	latency requirements and getting a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.

Enable NEXT_BUDDY as a preparation patch to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..0607def744af 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
 
 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
--
2.51.0
* [tip: sched/core] sched/fair: Enable scheduler feature NEXT_BUDDY
From: tip-bot2 for Mel Gorman @ 2025-11-14 12:19 UTC
To: linux-tip-commits
Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8f839c9c55f2a034867b5d382950f6bc9acd1a3f
Gitweb:        https://git.kernel.org/tip/8f839c9c55f2a034867b5d382950f6bc9acd1a3f
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Wed, 12 Nov 2025 12:25:20
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:06 +01:00

sched/fair: Enable scheduler feature NEXT_BUDDY

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data, but it also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good
results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit
5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear,
but as vruntime spread is mentioned, the expectation is that NEXT_BUDDY
had an impact on overall fairness. It was not noted why LAST_BUDDY was
removed, but it is assumed that it's very difficult to reason about what
LAST_BUDDY's correct and effective behaviour should be while still
respecting EEVDF's goals. Peter Zijlstra noted during review:

	I think I was just struggling to make sense of things and
	figured less is more and axed it. I have vague memories trying
	to work through the dynamics of a wakeup-stack and the EEVDF
	latency requirements and getting a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.

Enable NEXT_BUDDY as a preparation patch to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f..0607def 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
 
 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
* [tip: sched/core] sched/fair: Enable scheduler feature NEXT_BUDDY
From: tip-bot2 for Mel Gorman @ 2025-11-17 16:23 UTC
To: linux-tip-commits
Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     aceccac58ad76305d147165788ea6b939bef179b
Gitweb:        https://git.kernel.org/tip/aceccac58ad76305d147165788ea6b939bef179b
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Wed, 12 Nov 2025 12:25:20
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00

sched/fair: Enable scheduler feature NEXT_BUDDY

The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last
wakee to be scheduled sooner on the assumption that the waker/wakee share
cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on
the assumption that the pair of tasks still share data, but it also relied
on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good
results.

NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve
latencies and throughput") and LAST_BUDDY was removed in commit
5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear,
but as vruntime spread is mentioned, the expectation is that NEXT_BUDDY
had an impact on overall fairness. It was not noted why LAST_BUDDY was
removed, but it is assumed that it's very difficult to reason about what
LAST_BUDDY's correct and effective behaviour should be while still
respecting EEVDF's goals. Peter Zijlstra noted during review:

	I think I was just struggling to make sense of things and
	figured less is more and axed it. I have vague memories trying
	to work through the dynamics of a wakeup-stack and the EEVDF
	latency requirements and getting a head-ache.

NEXT_BUDDY is easier to reason about given that it's a point-in-time
decision on the wakee's deadline and eligibility relative to the waker.

Enable NEXT_BUDDY as a preparation patch to document that the decision to
ignore the current implementation is deliberate. While not presented, the
results were at best neutral and often much more variable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f..0607def 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, false)
+SCHED_FEAT(NEXT_BUDDY, true)
 
 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
From: Peter Zijlstra @ 2025-11-12 14:48 UTC
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
    Chris Mason, Madadi Vineeth Reddy, linux-kernel

On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:

> +	/* Prefer picking wakee soon if appropriate. */
> +	if (sched_feat(NEXT_BUDDY) &&
> +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> +
> +		/*
> +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
> +		 * buddies are ignored as they may not be relevant to the
> +		 * waker and less likely to be cache hot.
> +		 */
> +		if (wake_flags & WF_SYNC)
> +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
> +	}

Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
depend on buddies.
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
From: Madadi Vineeth Reddy @ 2025-11-13 8:26 UTC
To: Peter Zijlstra
Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Dietmar Eggemann,
    Valentin Schneider, Chris Mason, linux-kernel, Madadi Vineeth Reddy

Hi Peter,

On 12/11/25 20:18, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
>
>> +	/* Prefer picking wakee soon if appropriate. */
>> +	if (sched_feat(NEXT_BUDDY) &&
>> +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
>> +
>> +		/*
>> +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
>> +		 * buddies are ignored as they may not be relevant to the
>> +		 * waker and less likely to be cache hot.
>> +		 */
>> +		if (wake_flags & WF_SYNC)
>> +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
>> +	}
>
> Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> depend on buddies.

IIUC, calling preempt_sync() without NEXT_BUDDY would force a reschedule
after the threshold, but no buddy would be set. This means pick_eevdf()
would run with normal EEVDF deadline selection, potentially picking a
different task instead of the wakee. The forced reschedule would
accomplish nothing for the wakee.

That said, I see your point that the WF_SYNC threshold check could still
reduce context switch overhead even without guaranteeing wakee selection.

Thanks,
Madadi Vineeth Reddy
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
From: Mel Gorman @ 2025-11-13 9:04 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
    Chris Mason, Madadi Vineeth Reddy, linux-kernel

On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
>
> > +	/* Prefer picking wakee soon if appropriate. */
> > +	if (sched_feat(NEXT_BUDDY) &&
> > +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > +
> > +		/*
> > +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
> > +		 * buddies are ignored as they may not be relevant to the
> > +		 * waker and less likely to be cache hot.
> > +		 */
> > +		if (wake_flags & WF_SYNC)
> > +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
> > +	}
>
> Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> depend on buddies.

There isn't a direct relation, but there is an indirect one. I know from
your previous review that you separated out the WF_SYNC handling but,
after a while, I did not find a good reason to separate it completely
from NEXT_BUDDY.

NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a
waker relationship between two tasks that potentially share data which
may still be cache resident after a context switch. WF_SYNC indicates
there may be a strict relationship between those two tasks in that the
waker may need the wakee to do some work before it can make progress. If
NEXT_BUDDY does not set cfs_rq->next in the current waking context then
the wakee may only be picked next by coincidence under normal EEVDF
rules.

WF_SYNC could still reschedule if the wakee is not selected as a buddy,
but the benefit, if any, would be marginal -- if the waker does not go to
sleep then the WF_SYNC contract is violated, and after a wakeup delay the
shared data may already be evicted from cache. With NEXT_BUDDY, there is
a chance that the cost of a reschedule and/or a context switch will be
offset by reduced overall latency (e.g. fewer cache misses). Without
NEXT_BUDDY, WF_SYNC may only incur costs due to context switching.

I considered the possibility of WF_SYNC being applied if pse is already a
buddy due to yield or some other factor, but there is no reason to assume
any shared data is still cache resident and it's not easy to reason
about. I considered applying WF_SYNC if pse was already set and using it
as a two-pass filter but, again, there is no obvious benefit and no
reason why the second wakeup is more important than the first wakeup. I
considered WF_SYNC being applied if any buddy is set, but it's not clear
why a SYNC wakeup between tasks A,B should instead pick C to run ASAP
outside of the normal EEVDF rules.

I think it's straight-forward if the logic is

o If NEXT_BUDDY sets the wakee as cfs_rq->next then schedule the wakee
  soon
o If the wakee is to be selected soon and WF_SYNC is also set then pick
  the wakee ASAP

but less straight-forward if

o If WF_SYNC is set, reschedule now and maybe the wakee will be picked,
  maybe the waker will run again, maybe something else will run and
  sometimes it'll be a gain overall.

-- 
Mel Gorman
SUSE Labs
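The two straight-forward rules above can be condensed into a small decision
function. This is an illustrative sketch of the reasoning only, not the
kernel implementation: the WF_SYNC value, the enum and wakeup_action() are
hypothetical stand-ins, and the preempt_sync() runtime throttle is omitted:

```c
#include <assert.h>

/* Hypothetical flag value standing in for the kernel's WF_SYNC. */
#define WF_SYNC 0x1

enum action {
        PICK_EEVDF,      /* no buddy: normal EEVDF deadline selection */
        PICK_WAKEE_SOON, /* buddy set: wakee preferred at the next pick */
        PICK_WAKEE_ASAP  /* buddy set + sync hint: reschedule now */
};

/* buddy_set models set_preempt_buddy() succeeding, i.e. the wakee
 * became cfs_rq->next in the current waking context. */
static enum action wakeup_action(int buddy_set, int wake_flags)
{
        if (!buddy_set)
                return PICK_EEVDF;
        if (wake_flags & WF_SYNC)
                return PICK_WAKEE_ASAP;
        return PICK_WAKEE_SOON;
}
```

The point of the structure is that WF_SYNC only upgrades an existing buddy
decision; on its own it never forces a reschedule.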
* Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
From: Peter Zijlstra @ 2025-11-14 12:13 UTC
To: Mel Gorman
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
    Chris Mason, Madadi Vineeth Reddy, linux-kernel

On Thu, Nov 13, 2025 at 09:04:38AM +0000, Mel Gorman wrote:
> On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
> >
> > > +	/* Prefer picking wakee soon if appropriate. */
> > > +	if (sched_feat(NEXT_BUDDY) &&
> > > +	    set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > > +
> > > +		/*
> > > +		 * Decide whether to obey WF_SYNC hint for a new buddy. Old
> > > +		 * buddies are ignored as they may not be relevant to the
> > > +		 * waker and less likely to be cache hot.
> > > +		 */
> > > +		if (wake_flags & WF_SYNC)
> > > +			preempt_action = preempt_sync(rq, wake_flags, pse, se);
> > > +	}
> >
> > Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> > depend on buddies.
>
> There isn't a direct relation, but there is an indirect one. I know from
> your previous review that you separated out the WF_SYNC handling but,
> after a while, I did not find a good reason to separate it completely
> from NEXT_BUDDY.
>
> NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a
> waker relationship between two tasks that potentially share data which
> may still be cache resident after a context switch. WF_SYNC indicates
> there may be a strict relationship between those two tasks in that the
> waker may need the wakee to do some work before it can make progress. If
> NEXT_BUDDY does not set cfs_rq->next in the current waking context then
> the wakee may only be picked next by coincidence under normal EEVDF
> rules.

Aah, fair enough. Perhaps the comment could've been clearer but whatever.
* [tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
From: tip-bot2 for Mel Gorman @ 2025-11-14 12:19 UTC
To: linux-tip-commits
Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     523afd46059d80351189630355ab7e6eccd6451e
Gitweb:        https://git.kernel.org/tip/523afd46059d80351189630355ab7e6eccd6451e
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Wed, 12 Nov 2025 12:25:21
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:07 +01:00

sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

Reimplement NEXT_BUDDY preemption to take into account the deadline and
eligibility of the wakee with respect to the waker. In the event that
multiple buddies could be considered, the one with the earliest deadline
is selected.

Sync wakeups are treated differently to every other type of wakeup. The
WF_SYNC assumption is that the waker promises to sleep in the very near
future. This is violated in enough cases that WF_SYNC should be treated
as a suggestion instead of a contract. If a waker does go to sleep almost
immediately then the delay in wakeup is negligible. In other cases, it's
throttled based on the accumulated runtime of the waker so there is a
chance that some batched wakeups have been issued before preemption.

For all other wakeups, preemption happens if the wakee has an earlier
deadline than the waker and is eligible to run.
While many workloads were tested, the two main targets were a modified
dbench4 benchmark and hackbench because they are on opposite ends of the
spectrum -- one prefers throughput by avoiding preemption and the other
relies on preemption.

First is the dbench throughput data; even though it is a poor metric, it
is the default one. The test machine is a 2-socket machine and the
backing filesystem is XFS as a lot of the IO work is dispatched to kernel
threads. It's important to note that these results are not representative
across all machines, especially Zen machines, as different bottlenecks
are exposed on different machines and filesystems.

dbench4 Throughput (misleading but traditional)
                       6.18-rc1               6.18-rc1
                        vanilla   sched-preemptnext-v5
Hmean     1     1268.80 (  0.00%)     1269.74 (  0.07%)
Hmean     4     3971.74 (  0.00%)     3950.59 ( -0.53%)
Hmean     7     5548.23 (  0.00%)     5420.08 ( -2.31%)
Hmean     12    7310.86 (  0.00%)     7165.57 ( -1.99%)
Hmean     21    8874.53 (  0.00%)     9149.04 (  3.09%)
Hmean     30    9361.93 (  0.00%)    10530.04 ( 12.48%)
Hmean     48    9540.14 (  0.00%)    11820.40 ( 23.90%)
Hmean     79    9208.74 (  0.00%)    12193.79 ( 32.42%)
Hmean     110   8573.12 (  0.00%)    11933.72 ( 39.20%)
Hmean     141   7791.33 (  0.00%)    11273.90 ( 44.70%)
Hmean     160   7666.60 (  0.00%)    10768.72 ( 40.46%)

As throughput is misleading, the benchmark is modified to use a short
loadfile and report the completion time in milliseconds.
dbench4 Loadfile Execution Time
                       6.18-rc1               6.18-rc1
                        vanilla   sched-preemptnext-v5
Amean     1       14.62 (  0.00%)       14.69 ( -0.46%)
Amean     4       18.76 (  0.00%)       18.85 ( -0.45%)
Amean     7       23.71 (  0.00%)       24.38 ( -2.82%)
Amean     12      31.25 (  0.00%)       31.87 ( -1.97%)
Amean     21      45.12 (  0.00%)       43.69 (  3.16%)
Amean     30      61.07 (  0.00%)       54.33 ( 11.03%)
Amean     48      95.91 (  0.00%)       77.22 ( 19.49%)
Amean     79     163.38 (  0.00%)      123.08 ( 24.66%)
Amean     110    243.91 (  0.00%)      175.11 ( 28.21%)
Amean     141    343.47 (  0.00%)      239.10 ( 30.39%)
Amean     160    401.15 (  0.00%)      283.73 ( 29.27%)
Stddev    1        0.52 (  0.00%)        0.51 (  2.45%)
Stddev    4        1.36 (  0.00%)        1.30 (  4.04%)
Stddev    7        1.88 (  0.00%)        1.87 (  0.72%)
Stddev    12       3.06 (  0.00%)        2.45 ( 19.83%)
Stddev    21       5.78 (  0.00%)        3.87 ( 33.06%)
Stddev    30       9.85 (  0.00%)        5.25 ( 46.76%)
Stddev    48      22.31 (  0.00%)        8.64 ( 61.27%)
Stddev    79      35.96 (  0.00%)       18.07 ( 49.76%)
Stddev    110     59.04 (  0.00%)       30.93 ( 47.61%)
Stddev    141     85.38 (  0.00%)       40.93 ( 52.06%)
Stddev    160     96.38 (  0.00%)       39.72 ( 58.79%)

That is still looking good and the variance is reduced quite a bit.
Finally, fairness is a concern so the next report tracks how many
milliseconds it takes for all clients to complete a workfile. This one is
tricky because dbench makes no effort to synchronise clients, so the
durations at benchmark start time differ substantially from typical
runtimes. This problem could be mitigated by warming up the benchmark for
a number of minutes but it's a matter of opinion whether that counts as
an evasion of inconvenient results.
dbench4 All Clients Loadfile Execution Time
                       6.18-rc1               6.18-rc1
                        vanilla   sched-preemptnext-v5
Amean     1       15.06 (  0.00%)       15.07 ( -0.03%)
Amean     4      603.81 (  0.00%)      524.29 ( 13.17%)
Amean     7      855.32 (  0.00%)     1331.07 ( -55.62%)
Amean     12    1890.02 (  0.00%)     2323.97 ( -22.96%)
Amean     21    3195.23 (  0.00%)     2009.29 ( 37.12%)
Amean     30   13919.53 (  0.00%)     4579.44 ( 67.10%)
Amean     48   25246.07 (  0.00%)     5705.46 ( 77.40%)
Amean     79   29701.84 (  0.00%)    15509.26 ( 47.78%)
Amean     110  22803.03 (  0.00%)    23782.08 ( -4.29%)
Amean     141  36356.07 (  0.00%)    25074.20 ( 31.03%)
Amean     160  17046.71 (  0.00%)    13247.62 ( 22.29%)
Stddev    1        0.47 (  0.00%)        0.49 ( -3.74%)
Stddev    4      395.24 (  0.00%)      254.18 ( 35.69%)
Stddev    7      467.24 (  0.00%)      764.42 ( -63.60%)
Stddev    12    1071.43 (  0.00%)     1395.90 ( -30.28%)
Stddev    21    1694.50 (  0.00%)     1204.89 ( 28.89%)
Stddev    30    7945.63 (  0.00%)     2552.59 ( 67.87%)
Stddev    48   14339.51 (  0.00%)     3227.55 ( 77.49%)
Stddev    79   16620.91 (  0.00%)     8422.15 ( 49.33%)
Stddev    110  12912.15 (  0.00%)    13560.95 ( -5.02%)
Stddev    141  20700.13 (  0.00%)    14544.51 ( 29.74%)
Stddev    160   9079.16 (  0.00%)     7400.69 ( 18.49%)

This is more of a mixed bag but it at least shows that fairness is not
crippled.

The hackbench results are more neutral but this is still important. It's
possible to boost the dbench figures by a large amount but only by
crippling the performance of a workload like hackbench. The WF_SYNC
behaviour is important for these workloads and is why the WF_SYNC changes
are not a separate patch.
hackbench-process-pipes
                       6.18-rc1               6.18-rc1
                        vanilla   sched-preemptnext-v5
Amean     1       0.2657 (  0.00%)      0.2150 ( 19.07%)
Amean     4       0.6107 (  0.00%)      0.6060 (  0.76%)
Amean     7       0.7923 (  0.00%)      0.7440 (  6.10%)
Amean     12      1.1500 (  0.00%)      1.1263 (  2.06%)
Amean     21      1.7950 (  0.00%)      1.7987 ( -0.20%)
Amean     30      2.3207 (  0.00%)      2.5053 ( -7.96%)
Amean     48      3.5023 (  0.00%)      3.9197 ( -11.92%)
Amean     79      4.8093 (  0.00%)      5.2247 ( -8.64%)
Amean     110     6.1160 (  0.00%)      6.6650 ( -8.98%)
Amean     141     7.4763 (  0.00%)      7.8973 ( -5.63%)
Amean     172     8.9560 (  0.00%)      9.3593 ( -4.50%)
Amean     203    10.4783 (  0.00%)     10.8347 ( -3.40%)
Amean     234    12.4977 (  0.00%)     13.0177 ( -4.16%)
Amean     265    14.7003 (  0.00%)     15.5630 ( -5.87%)
Amean     296    16.1007 (  0.00%)     17.4023 ( -8.08%)

Processes using pipes are impacted but the variance (not presented)
indicates it's close to noise and the results are not always
reproducible. If executed across multiple reboots, it may show neutral or
small gains, so the worst measured results are presented. Hackbench using
sockets is more reliably neutral as the wakeup mechanisms are different
between sockets and pipes.

hackbench-process-sockets
                       6.18-rc1               6.18-rc1
                        vanilla   sched-preemptnext-v2
Amean     1       0.3073 (  0.00%)      0.3263 ( -6.18%)
Amean     4       0.7863 (  0.00%)      0.7930 ( -0.85%)
Amean     7       1.3670 (  0.00%)      1.3537 (  0.98%)
Amean     12      2.1337 (  0.00%)      2.1903 ( -2.66%)
Amean     21      3.4683 (  0.00%)      3.4940 ( -0.74%)
Amean     30      4.7247 (  0.00%)      4.8853 ( -3.40%)
Amean     48      7.6097 (  0.00%)      7.8197 ( -2.76%)
Amean     79     14.7957 (  0.00%)     16.1000 ( -8.82%)
Amean     110    21.3413 (  0.00%)     21.9997 ( -3.08%)
Amean     141    29.0503 (  0.00%)     29.0353 (  0.05%)
Amean     172    36.4660 (  0.00%)     36.1433 (  0.88%)
Amean     203    39.7177 (  0.00%)     40.5910 ( -2.20%)
Amean     234    42.1120 (  0.00%)     43.5527 ( -3.42%)
Amean     265    45.7830 (  0.00%)     50.0560 ( -9.33%)
Amean     296    50.7043 (  0.00%)     54.3657 ( -7.22%)

As schbench has been mentioned in numerous bugs recently, the results are
interesting.
A test case that represents the default schbench behaviour is:

schbench Wakeup Latency (usec)
                              6.18.0-rc1             6.18.0-rc1
                                 vanilla   sched-preemptnext-v5
Amean     Wakeup-50th-80      7.17 (  0.00%)        6.00 ( 16.28%)
Amean     Wakeup-90th-80     46.56 (  0.00%)       19.78 ( 57.52%)
Amean     Wakeup-99th-80    119.61 (  0.00%)       89.94 ( 24.80%)
Amean     Wakeup-99.9th-80 3193.78 (  0.00%)      328.22 ( 89.72%)

schbench Requests Per Second (ops/sec)
                              6.18.0-rc1             6.18.0-rc1
                                 vanilla   sched-preemptnext-v5
Hmean     RPS-20th-80      8900.91 (  0.00%)     9176.78 (  3.10%)
Hmean     RPS-50th-80      8987.41 (  0.00%)     9217.89 (  2.56%)
Hmean     RPS-90th-80      9123.73 (  0.00%)     9273.25 (  1.64%)
Hmean     RPS-max-80       9193.50 (  0.00%)     9301.47 (  1.17%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 130 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 320eedd..11d480e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect)
 	if (cfs_rq->nr_queued == 1)
 		return curr && curr->on_rq ? curr : se;
 
+	/*
+	 * Picking the ->next buddy will affect latency but not fairness.
+	 */
+	if (sched_feat(PICK_BUDDY) &&
+	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+		/* ->next will never be delayed */
+		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
+		return cfs_rq->next;
+	}
+
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
@@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+static void set_next_buddy(struct sched_entity *se);
+
 /*
  * Used by other classes to account runtime.
  */
@@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
-	/*
-	 * Picking the ->next buddy will affect latency but not fairness.
-	 */
-	if (sched_feat(PICK_BUDDY) &&
-	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
-		/* ->next will never be delayed */
-		WARN_ON_ONCE(cfs_rq->next->sched_delayed);
-		return cfs_rq->next;
-	}
-
 	se = pick_eevdf(cfs_rq);
 	if (se->sched_delayed) {
 		dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
@@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-static void set_next_buddy(struct sched_entity *se);
-
 /*
  * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
  * failing half-way through and resume the dequeue later.
@@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se)
 	}
 }
 
+enum preempt_wakeup_action {
+	PREEMPT_WAKEUP_NONE,	/* No preemption. */
+	PREEMPT_WAKEUP_SHORT,	/* Ignore slice protection. */
+	PREEMPT_WAKEUP_PICK,	/* Let __pick_eevdf() decide. */
+	PREEMPT_WAKEUP_RESCHED,	/* Force reschedule. */
+};
+
+static inline bool
+set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags,
+		  struct sched_entity *pse, struct sched_entity *se)
+{
+	/*
+	 * Keep existing buddy if the deadline is sooner than pse.
+	 * The older buddy may be cache cold and completely unrelated
+	 * to the current wakeup but that is unpredictable whereas
+	 * obeying the deadline is more in line with EEVDF objectives.
+	 */
+	if (cfs_rq->next && entity_before(cfs_rq->next, pse))
+		return false;
+
+	set_next_buddy(pse);
+	return true;
+}
+
+/*
+ * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
+ * strictly enforced because the hint is either misunderstood or
+ * multiple tasks must be woken up.
+ */
+static inline enum preempt_wakeup_action
+preempt_sync(struct rq *rq, int wake_flags,
+	     struct sched_entity *pse, struct sched_entity *se)
+{
+	u64 threshold, delta;
+
+	/*
+	 * WF_SYNC without WF_TTWU is not expected so warn if it happens even
+	 * though it is likely harmless.
+	 */
+	WARN_ON_ONCE(!(wake_flags & WF_TTWU));
+
+	threshold = sysctl_sched_migration_cost;
+	delta = rq_clock_task(rq) - se->exec_start;
+	if ((s64)delta < 0)
+		delta = 0;
+
+	/*
+	 * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they
+	 * could run on other CPUs. Reduce the threshold before preemption is
+	 * allowed to an arbitrary lower value as it is more likely (but not
+	 * guaranteed) the waker requires the wakee to finish.
+	 */
+	if (wake_flags & WF_RQ_SELECTED)
+		threshold >>= 2;
+
+	/*
+	 * As WF_SYNC is not strictly obeyed, allow some runtime for batch
+	 * wakeups to be issued.
+	 */
+	if (entity_before(pse, se) && delta >= threshold)
+		return PREEMPT_WAKEUP_RESCHED;
+
+	return PREEMPT_WAKEUP_NONE;
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
+	enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *se = &donor->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
-	bool do_preempt_short = false;
 
 	if (unlikely(se == pse))
 		return;
@@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	if (task_is_throttled(p))
 		return;
 
-	if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
-		set_next_buddy(pse);
-	}
-
 	/*
 	 * We can come here with TIF_NEED_RESCHED already set from new task
 	 * wake up path.
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + preempt_action = PREEMPT_WAKEUP_SHORT; goto preempt; } @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + preempt_action = PREEMPT_WAKEUP_SHORT; + goto pick; + } /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follows fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if ((wake_flags & WF_FORK) || pse->sched_delayed) + return; + + /* + * If @p is potentially completing work required by current then + * consider preemption. + * + * Reschedule if waker is no longer eligible. */ + if (in_task() && !entity_eligible(cfs_rq, se)) { + preempt_action = PREEMPT_WAKEUP_RESCHED; + goto preempt; + } + + /* Prefer picking wakee soon if appropriate. */ + if (sched_feat(NEXT_BUDDY) && + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { + + /* + * Decide whether to obey WF_SYNC hint for a new buddy. Old + * buddies are ignored as they may not be relevant to the + * waker and less likely to be cache hot. + */ + if (wake_flags & WF_SYNC) + preempt_action = preempt_sync(rq, wake_flags, pse, se); + } + + switch (preempt_action) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_SHORT: + fallthrough; + case PREEMPT_WAKEUP_PICK: + break; + } + +pick: + /* * If @p has become the most eligible task, force preemption. 
*/ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) + if (preempt_action == PREEMPT_WAKEUP_SHORT) cancel_protect_slice(se); resched_curr_lazy(rq); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [tip: sched/core] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals [not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net> 2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra 2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman @ 2025-11-17 16:23 ` tip-bot2 for Mel Gorman 2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts 2 siblings, 1 reply; 27+ messages in thread From: tip-bot2 for Mel Gorman @ 2025-11-17 16:23 UTC (permalink / raw) To: linux-tip-commits; +Cc: Mel Gorman, Peter Zijlstra (Intel), x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291 Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291 Author: Mel Gorman <mgorman@techsingularity.net> AuthorDate: Wed, 12 Nov 2025 12:25:21 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00 sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected. Sync wakeups are treated differently from every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption. For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run. 
While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data; throughput is a poor metric, but it is the default one. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems. dbench4 Throughput (misleading but traditional) 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%) Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%) Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%) Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%) Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%) Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%) Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%) Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%) Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%) Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%) Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%) As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds. 
dbench4 Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%) Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%) Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%) Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%) Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%) Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%) Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%) Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%) Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%) Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%) Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%) Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%) Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%) Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%) Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%) Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%) Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%) Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%) Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%) Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%) Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%) Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%) That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile. This one is tricky because dbench makes no effort to synchronise clients so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results. 
dbench4 All Clients Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%) Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%) Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%) Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%) Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%) Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%) Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%) Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%) Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%) Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%) Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%) Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%) Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%) Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%) Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%) Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%) Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%) Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%) Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%) Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%) Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%) Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%) This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but this is still important. It's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench. The WF_SYNC behaviour is important for these workloads and is why the WF_SYNC changes are not a separate patch. 
hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%) Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%) Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%) Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%) Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%) Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%) Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%) Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%) Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%) Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%) Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%) Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%) Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%) Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%) Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%) Processes using pipes are impacted but the variance (not presented) indicates it's close to noise and the results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%) Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%) Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%) Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%) Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%) Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%) Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%) Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%) Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%) Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%) Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%) Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%) Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%) Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%) Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%) As schbench has been mentioned in numerous bugs recently, the results are interesting. 
A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v5 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%) Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%) Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v5 Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%) Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%) Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net --- kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 130 insertions(+), 22 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 071e07f..c6e5c64 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) if (cfs_rq->nr_queued == 1) return curr && curr->on_rq ? curr : se; + /* + * Picking the ->next buddy will affect latency but not fairness. + */ + if (sched_feat(PICK_BUDDY) && + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + WARN_ON_ONCE(cfs_rq->next->sched_delayed); + return cfs_rq->next; + } + if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) curr = NULL; @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) return delta_exec; } +static void set_next_buddy(struct sched_entity *se); + /* * Used by other classes to account runtime. 
*/ @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { struct sched_entity *se; - /* - * Picking the ->next buddy will affect latency but not fairness. - */ - if (sched_feat(PICK_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { - /* ->next will never be delayed */ - WARN_ON_ONCE(cfs_rq->next->sched_delayed); - return cfs_rq->next; - } - se = pick_eevdf(cfs_rq); if (se->sched_delayed) { dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) hrtick_update(rq); } -static void set_next_buddy(struct sched_entity *se); - /* * Basically dequeue_task_fair(), except it can deal with dequeue_entity() * failing half-way through and resume the dequeue later. @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se) } } +enum preempt_wakeup_action { + PREEMPT_WAKEUP_NONE, /* No preemption. */ + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */ + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */ + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */ +}; + +static inline bool +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + /* + * Keep the existing buddy if its deadline is sooner than pse's. + * The older buddy may be cache cold and completely unrelated + * to the current wakeup but that is unpredictable whereas + * obeying the deadline is more in line with EEVDF objectives. + */ + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) + return false; + + set_next_buddy(pse); + return true; +} + +/* + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not + * strictly enforced because the hint is either misunderstood or + * multiple tasks must be woken up. 
+ */ +static inline enum preempt_wakeup_action +preempt_sync(struct rq *rq, int wake_flags, + struct sched_entity *pse, struct sched_entity *se) +{ + u64 threshold, delta; + + /* + * WF_SYNC without WF_TTWU is not expected so warn if it happens even + * though it is likely harmless. + */ + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); + + threshold = sysctl_sched_migration_cost; + delta = rq_clock_task(rq) - se->exec_start; + if ((s64)delta < 0) + delta = 0; + + /* + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they + * could run on other CPUs. Reduce the threshold before preemption is + * allowed to an arbitrary lower value as it is more likely (but not + * guaranteed) the waker requires the wakee to finish. + */ + if (wake_flags & WF_RQ_SELECTED) + threshold >>= 2; + + /* + * As WF_SYNC is not strictly obeyed, allow some runtime for batch + * wakeups to be issued. + */ + if (entity_before(pse, se) && delta >= threshold) + return PREEMPT_WAKEUP_RESCHED; + + return PREEMPT_WAKEUP_NONE; +} + /* * Preempt the current task with a newly woken task if needed: */ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) { + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; struct task_struct *donor = rq->donor; struct sched_entity *se = &donor->se, *pse = &p->se; struct cfs_rq *cfs_rq = task_cfs_rq(donor); int cse_is_idle, pse_is_idle; - bool do_preempt_short = false; if (unlikely(se == pse)) return; @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (task_is_throttled(p)) return; - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { - set_next_buddy(pse); - } - /* * We can come here with TIF_NEED_RESCHED already set from new task * wake up path. 
@@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * When non-idle entity preempt an idle entity, * don't give idle entity slice protection. */ - do_preempt_short = true; + preempt_action = PREEMPT_WAKEUP_SHORT; goto preempt; } @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * If @p has a shorter slice than current and @p is eligible, override * current's slice protection in order to allow preemption. */ - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { + preempt_action = PREEMPT_WAKEUP_SHORT; + goto pick; + } /* + * Ignore wakee preemption on WF_FORK as it is less likely that + * there is shared data as exec often follows fork. Do not + * preempt for tasks that are sched_delayed as it would violate + * EEVDF to forcibly queue an ineligible task. + */ + if ((wake_flags & WF_FORK) || pse->sched_delayed) + return; + + /* + * If @p is potentially completing work required by current then + * consider preemption. + * + * Reschedule if waker is no longer eligible. */ + if (in_task() && !entity_eligible(cfs_rq, se)) { + preempt_action = PREEMPT_WAKEUP_RESCHED; + goto preempt; + } + + /* Prefer picking wakee soon if appropriate. */ + if (sched_feat(NEXT_BUDDY) && + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { + + /* + * Decide whether to obey WF_SYNC hint for a new buddy. Old + * buddies are ignored as they may not be relevant to the + * waker and less likely to be cache hot. + */ + if (wake_flags & WF_SYNC) + preempt_action = preempt_sync(rq, wake_flags, pse, se); + } + + switch (preempt_action) { + case PREEMPT_WAKEUP_NONE: + return; + case PREEMPT_WAKEUP_RESCHED: + goto preempt; + case PREEMPT_WAKEUP_SHORT: + fallthrough; + case PREEMPT_WAKEUP_PICK: + break; + } + +pick: + /* * If @p has become the most eligible task, force preemption. 
*/ - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) goto preempt; - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) + if (sched_feat(RUN_TO_PARITY)) update_protect_slice(cfs_rq, se); return; preempt: - if (do_preempt_short) + if (preempt_action == PREEMPT_WAKEUP_SHORT) cancel_protect_slice(se); resched_curr_lazy(rq); ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-11-17 16:23 ` tip-bot2 for Mel Gorman @ 2025-12-22 10:57 ` Ryan Roberts 2026-01-02 12:38 ` Ryan Roberts 0 siblings, 1 reply; 27+ messages in thread From: Ryan Roberts @ 2025-12-22 10:57 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra (Intel); +Cc: x86, linux-kernel, Aishwarya TCV Hi Mel, Peter, We are building out a kernel performance regression monitoring lab at Arm, and I've noticed some fairly large performance regressions in real-world workloads, for which bisection has fingered this patch. We are looking at performance changes between v6.18 and v6.19-rc1, and by reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan to move the testing to linux-next over the next couple of quarters so hopefully we will be able to deliver this sort of news prior to merging in future). All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean statistically significant regression/improvement, where "statistically significant" means the 95% confidence intervals do not overlap. The below is a large scale mysql workload, running across 2 AWS instances (a load generator and the mysql server). We have a partner for whom this is a very important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1 (where the patch is added). 
By reverting the patch, the regression is not only fixed but performance is now nearly 6% better than v6.18: +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | +=================================+====================================================+=================+==============+===================+ | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ Next are a bunch of benchmarks all running on a single system. specjbb is the SPEC Java Business Benchmark. The mysql one is the same as above but this time both loadgen and server are on the same system. pgbench is the PostgreSQL benchmark. I'm showing hackbench for completeness, but I don't consider it a high priority issue. Interestingly, nginx improves significantly with the patch. 
+---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | +=================================+====================================================+=================+==============+===================+ | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | | | Scale: 100 Clients: 
1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | | | 
hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | | | 
hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ All of the benchmarks have been run multiple times and I have high confidence in the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though. I'm not providing the data, but we also see similar regressions on AmpereOne (another arm64 server system). 
And we have seen a few functional tests (kvm selftests) that have started to timeout due to this patch slowing things down on arm64. I'm hoping you can advise on the best way to proceed? We have a bigger library than what I'm showing, but the only improvement I see due to this patch is nginx. So based on that, my preference would be to revert the patch upstream until the issues can be worked out. I'm guessing the story is quite different for x86 though? Thanks, Ryan On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote: > The following commit has been merged into the sched/core branch of tip: > > Commit-ID: e837456fdca81899a3c8e47b3fd39e30eae6e291 > Gitweb: https://git.kernel.org/tip/e837456fdca81899a3c8e47b3fd39e30eae6e291 > Author: Mel Gorman <mgorman@techsingularity.net> > AuthorDate: Wed, 12 Nov 2025 12:25:21 > Committer: Peter Zijlstra <peterz@infradead.org> > CommitterDate: Mon, 17 Nov 2025 17:13:15 +01:00 > > sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals > > Reimplement NEXT_BUDDY preemption to take into account the deadline and > eligibility of the wakee with respect to the waker. In the event > multiple buddies could be considered, the one with the earliest deadline > is selected. > > Sync wakeups are treated differently to every other type of wakeup. The > WF_SYNC assumption is that the waker promises to sleep in the very near > future. This is violated in enough cases that WF_SYNC should be treated > as a suggestion instead of a contract. If a waker does go to sleep almost > immediately then the delay in wakeup is negligible. In other cases, it's > throttled based on the accumulated runtime of the waker so there is a > chance that some batched wakeups have been issued before preemption. > > For all other wakeups, preemption happens if the wakee has a earlier > deadline than the waker and eligible to run. 
> > While many workloads were tested, the two main targets were a modified > dbench4 benchmark and hackbench because the are on opposite ends of the > spectrum -- one prefers throughput by avoiding preemption and the other > relies on preemption. > > First is the dbench throughput data even though it is a poor metric but > it is the default metric. The test machine is a 2-socket machine and the > backing filesystem is XFS as a lot of the IO work is dispatched to kernel > threads. It's important to note that these results are not representative > across all machines, especially Zen machines, as different bottlenecks > are exposed on different machines and filesystems. > > dbench4 Throughput (misleading but traditional) > 6.18-rc1 6.18-rc1 > vanilla sched-preemptnext-v5 > Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%) > Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%) > Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%) > Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%) > Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%) > Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%) > Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%) > Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%) > Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%) > Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%) > Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%) > > As throughput is misleading, the benchmark is modified to use a short > loadfile report the completion time duration in milliseconds. 
> > dbench4 Loadfile Execution Time > 6.18-rc1 6.18-rc1 > vanilla sched-preemptnext-v5 > Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%) > Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%) > Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%) > Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%) > Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%) > Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%) > Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%) > Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%) > Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%) > Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%) > Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%) > Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%) > Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%) > Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%) > Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%) > Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%) > Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%) > Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%) > Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%) > Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%) > Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%) > Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%) > > That is still looking good and the variance is reduced quite a bit. > Finally, fairness is a concern so the next report tracks how many > milliseconds it takes for all clients to complete a workfile. This > one is tricky because dbench makes no effort to synchronise clients so > the durations at benchmark start time differ substantially from typical > runtimes. This problem could be mitigated by warming up the benchmark > for a number of minutes but it's a matter of opinion whether that > counts as an evasion of inconvenient results. 
> > dbench4 All Clients Loadfile Execution Time > 6.18-rc1 6.18-rc1 > vanilla sched-preemptnext-v5 > Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%) > Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%) > Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%) > Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%) > Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%) > Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%) > Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%) > Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%) > Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%) > Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%) > Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%) > Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%) > Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%) > Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%) > Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%) > Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%) > Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%) > Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%) > Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%) > Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%) > Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%) > Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%) > > This is more of a mixed bag but it at least shows that fairness > is not crippled. > > The hackbench results are more neutral but this is still important. > It's possible to boost the dbench figures by a large amount but only by > crippling the performance of a workload like hackbench. The WF_SYNC > behaviour is important for these workloads and is why the WF_SYNC > changes are not a separate patch. 
> > hackbench-process-pipes > 6.18-rc1 6.18-rc1 > vanilla sched-preemptnext-v5 > Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%) > Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%) > Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%) > Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%) > Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%) > Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%) > Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%) > Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%) > Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%) > Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%) > Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%) > Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%) > Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%) > Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%) > Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%) > > Processes using pipes are impacted but the variance (not presented) indicates > it's close to noise and the results are not always reproducible. If executed > across multiple reboots, it may show neutral or small gains so the worst > measured results are presented. > > Hackbench using sockets is more reliably neutral as the wakeup > mechanisms are different between sockets and pipes. > > hackbench-process-sockets > 6.18-rc1 6.18-rc1 > vanilla sched-preemptnext-v2 > Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%) > Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%) > Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%) > Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%) > Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%) > Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%) > Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%) > Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%) > Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%) > Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%) > Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%) > Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%) > Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%) > Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%) > Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%) > > As schbench has been mentioned in numerous bugs recently, the results > are interesting. 
A test case that represents the default schbench > behaviour is > > schbench Wakeup Latency (usec) > 6.18.0-rc1 6.18.0-rc1 > vanilla sched-preemptnext-v5 > Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%) > Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%) > Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%) > Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%) > > schbench Requests Per Second (ops/sec) > 6.18.0-rc1 6.18.0-rc1 > vanilla sched-preemptnext-v5 > Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%) > Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%) > Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%) > Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%) > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net > --- > kernel/sched/fair.c | 152 ++++++++++++++++++++++++++++++++++++------- > 1 file changed, 130 insertions(+), 22 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 071e07f..c6e5c64 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -929,6 +929,16 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq, bool protect) > if (cfs_rq->nr_queued == 1) > return curr && curr->on_rq ? curr : se; > > + /* > + * Picking the ->next buddy will affect latency but not fairness. > + */ > + if (sched_feat(PICK_BUDDY) && > + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { > + /* ->next will never be delayed */ > + WARN_ON_ONCE(cfs_rq->next->sched_delayed); > + return cfs_rq->next; > + } > + > if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr))) > curr = NULL; > > @@ -1167,6 +1177,8 @@ static s64 update_se(struct rq *rq, struct sched_entity *se) > return delta_exec; > } > > +static void set_next_buddy(struct sched_entity *se); > + > /* > * Used by other classes to account runtime. 
> */ > @@ -5466,16 +5478,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) > { > struct sched_entity *se; > > - /* > - * Picking the ->next buddy will affect latency but not fairness. > - */ > - if (sched_feat(PICK_BUDDY) && > - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { > - /* ->next will never be delayed */ > - WARN_ON_ONCE(cfs_rq->next->sched_delayed); > - return cfs_rq->next; > - } > - > se = pick_eevdf(cfs_rq); > if (se->sched_delayed) { > dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); > @@ -6988,8 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > } > > -static void set_next_buddy(struct sched_entity *se); > - > /* > * Basically dequeue_task_fair(), except it can deal with dequeue_entity() > * failing half-way through and resume the dequeue later. > @@ -8676,16 +8676,81 @@ static void set_next_buddy(struct sched_entity *se) > } > } > > +enum preempt_wakeup_action { > + PREEMPT_WAKEUP_NONE, /* No preemption. */ > + PREEMPT_WAKEUP_SHORT, /* Ignore slice protection. */ > + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */ > + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */ > +}; > + > +static inline bool > +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + /* > + * Keep existing buddy if the deadline is sooner than pse. > + * The older buddy may be cache cold and completely unrelated > + * to the current wakeup but that is unpredictable where as > + * obeying the deadline is more in line with EEVDF objectives. > + */ > + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) > + return false; > + > + set_next_buddy(pse); > + return true; > +} > + > +/* > + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not > + * strictly enforced because the hint is either misunderstood or > + * multiple tasks must be woken up. 
> + */ > +static inline enum preempt_wakeup_action > +preempt_sync(struct rq *rq, int wake_flags, > + struct sched_entity *pse, struct sched_entity *se) > +{ > + u64 threshold, delta; > + > + /* > + * WF_SYNC without WF_TTWU is not expected so warn if it happens even > + * though it is likely harmless. > + */ > + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); > + > + threshold = sysctl_sched_migration_cost; > + delta = rq_clock_task(rq) - se->exec_start; > + if ((s64)delta < 0) > + delta = 0; > + > + /* > + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they > + * could run on other CPUs. Reduce the threshold before preemption is > + * allowed to an arbitrary lower value as it is more likely (but not > + * guaranteed) the waker requires the wakee to finish. > + */ > + if (wake_flags & WF_RQ_SELECTED) > + threshold >>= 2; > + > + /* > + * As WF_SYNC is not strictly obeyed, allow some runtime for batch > + * wakeups to be issued. > + */ > + if (entity_before(pse, se) && delta >= threshold) > + return PREEMPT_WAKEUP_RESCHED; > + > + return PREEMPT_WAKEUP_NONE; > +} > + > /* > * Preempt the current task with a newly woken task if needed: > */ > static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) > { > + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; > struct task_struct *donor = rq->donor; > struct sched_entity *se = &donor->se, *pse = &p->se; > struct cfs_rq *cfs_rq = task_cfs_rq(donor); > int cse_is_idle, pse_is_idle; > - bool do_preempt_short = false; > > if (unlikely(se == pse)) > return; > @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (task_is_throttled(p)) > return; > > - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { > - set_next_buddy(pse); > - } > - > /* > * We can come here with TIF_NEED_RESCHED already set from new task > * wake up path. 
> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * When non-idle entity preempt an idle entity, > * don't give idle entity slice protection. > */ > - do_preempt_short = true; > + preempt_action = PREEMPT_WAKEUP_SHORT; > goto preempt; > } > > @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * If @p has a shorter slice than current and @p is eligible, override > * current's slice protection in order to allow preemption. > */ > - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); > + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { > + preempt_action = PREEMPT_WAKEUP_SHORT; > + goto pick; > + } > > /* > + * Ignore wakee preemption on WF_FORK as it is less likely that > + * there is shared data as exec often follow fork. Do not > + * preempt for tasks that are sched_delayed as it would violate > + * EEVDF to forcibly queue an ineligible task. > + */ > + if ((wake_flags & WF_FORK) || pse->sched_delayed) > + return; > + > + /* > + * If @p potentially is completing work required by current then > + * consider preemption. > + * > + * Reschedule if waker is no longer eligible. */ > + if (in_task() && !entity_eligible(cfs_rq, se)) { > + preempt_action = PREEMPT_WAKEUP_RESCHED; > + goto preempt; > + } > + > + /* Prefer picking wakee soon if appropriate. */ > + if (sched_feat(NEXT_BUDDY) && > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { > + > + /* > + * Decide whether to obey WF_SYNC hint for a new buddy. Old > + * buddies are ignored as they may not be relevant to the > + * waker and less likely to be cache hot. 
> + */ > + if (wake_flags & WF_SYNC) > + preempt_action = preempt_sync(rq, wake_flags, pse, se); > + } > + > + switch (preempt_action) { > + case PREEMPT_WAKEUP_NONE: > + return; > + case PREEMPT_WAKEUP_RESCHED: > + goto preempt; > + case PREEMPT_WAKEUP_SHORT: > + fallthrough; > + case PREEMPT_WAKEUP_PICK: > + break; > + } > + > +pick: > + /* > * If @p has become the most eligible task, force preemption. > */ > - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) > + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) > goto preempt; > > - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) > + if (sched_feat(RUN_TO_PARITY)) > update_protect_slice(cfs_rq, se); > > return; > > preempt: > - if (do_preempt_short) > + if (preempt_action == PREEMPT_WAKEUP_SHORT) > cancel_protect_slice(se); > > resched_curr_lazy(rq); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts @ 2026-01-02 12:38 ` Ryan Roberts 2026-01-02 15:52 ` Dietmar Eggemann 0 siblings, 1 reply; 27+ messages in thread From: Ryan Roberts @ 2026-01-02 12:38 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra (Intel); +Cc: x86, linux-kernel, Aishwarya TCV Hi, I appreciate I sent this report just before Xmas so most likely you haven't had a chance to look, but wanted to bring it back to the top of your mailbox in case it was missed. Happy new year! Thanks, Ryan On 22/12/2025 10:57, Ryan Roberts wrote: > Hi Mel, Peter, > > We are building out a kernel performance regression monitoring lab at Arm, and > I've noticed some fairly large performance regressions in real-world workloads, > for which bisection has fingered this patch. > > We are looking at performance changes between v6.18 and v6.19-rc1, and by > reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan > to move the testing to linux-next over the next couple of quarters so hopefully > we will be able to deliver this sort of news prior to merging in future). > > All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean > statistically significant regression/improvement, where "statistically > significant" means the 95% confidence intervals do not overlap. > > The below is a large scale mysql workload, running across 2 AWS instances (a > load generator and the mysql server). We have a partner for whom this is a very > important workload. Performance regresses by 1.3% between 6.18 and 6.19-rc1 > (where the patch is added). 
By reverting the patch, the regression is not only > fixed but performance is now nearly 6% better than v6.18: > > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | > +=================================+====================================================+=================+==============+===================+ > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | > | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > > > Next are a bunch of benchmarks all running on a single system. specjbb is the > SPEC Java Business Benchmark. The mysql one is the same as above but this time > both loadgen and server are on the same system. pgbench is the PostgreSQL > benchmark. > > I'm showing hackbench for completeness, but I don't consider it a high priority > issue. > > Interestingly, nginx improves significantly with the patch. 
> > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert-next-buddy | > +=================================+====================================================+=================+==============+===================+ > | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | > | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | > | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | > | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | > | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | > | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | > | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | > | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | > | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | > | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | > | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | > | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | > | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | > | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | > | | Scale: 100 Clients: 1 Read Only (TPS) | 
57170.33 | 1.68% | 0.10% | > | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | > | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | > | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | > | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | > | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | > | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | > | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | > | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | > | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | > | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | > | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | > | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | > | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | > | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | > | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | > | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | > | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | > | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | > | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | > | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | > | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | > | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | > 
| | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | > | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | > | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | > | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | > | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | > | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | > | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | > | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | > | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | > | | hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | > | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | > | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | > | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | > | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | > | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | > | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | > | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | > | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | > | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | > | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | > | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | > | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | > | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | > | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | > | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | > | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | > | | 
hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | > | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | > | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | > | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | > | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | > | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | > | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | > | | hackbench-thread-sockets-12 (seconds) | 1.87 | (I) 12.65% | (I) 12.26% | > | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | > | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | > | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | > | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | > | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | > | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | > | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | > | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | > | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | > | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | > | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | > +---------------------------------+----------------------------------------------------+-----------------+--------------+-------------------+ > > All of the benchmarks have been run multiple times and I have high confidence in > the results. I can share min/mean/max/stdev/ci95 stats if that's helpful though. 
> > I'm not providing the data, but we also see similar regressions on AmpereOne > (another arm64 server system). And we have seen a few functional tests (kvm > selftests) that have started to timeout due to this patch slowing things down on > arm64. > > I'm hoping you can advise on the best way to proceed? We have a bigger library > than what I'm showing, but the only improvement I see due to this patch is > nginx. So based on that, my preference would be to revert the patch upstream > until the issues can be worked out. I'm guessing the story is quite different > for x86 though? > > Thanks, > Ryan > > On 17/11/2025 16:23, tip-bot2 for Mel Gorman wrote: >> [...]
*/ >> + PREEMPT_WAKEUP_PICK, /* Let __pick_eevdf() decide. */ >> + PREEMPT_WAKEUP_RESCHED, /* Force reschedule. */ >> +}; >> + >> +static inline bool >> +set_preempt_buddy(struct cfs_rq *cfs_rq, int wake_flags, >> + struct sched_entity *pse, struct sched_entity *se) >> +{ >> + /* >> + * Keep existing buddy if the deadline is sooner than pse. >> + * The older buddy may be cache cold and completely unrelated >> + * to the current wakeup but that is unpredictable where as >> + * obeying the deadline is more in line with EEVDF objectives. >> + */ >> + if (cfs_rq->next && entity_before(cfs_rq->next, pse)) >> + return false; >> + >> + set_next_buddy(pse); >> + return true; >> +} >> + >> +/* >> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not >> + * strictly enforced because the hint is either misunderstood or >> + * multiple tasks must be woken up. >> + */ >> +static inline enum preempt_wakeup_action >> +preempt_sync(struct rq *rq, int wake_flags, >> + struct sched_entity *pse, struct sched_entity *se) >> +{ >> + u64 threshold, delta; >> + >> + /* >> + * WF_SYNC without WF_TTWU is not expected so warn if it happens even >> + * though it is likely harmless. >> + */ >> + WARN_ON_ONCE(!(wake_flags & WF_TTWU)); >> + >> + threshold = sysctl_sched_migration_cost; >> + delta = rq_clock_task(rq) - se->exec_start; >> + if ((s64)delta < 0) >> + delta = 0; >> + >> + /* >> + * WF_RQ_SELECTED implies the tasks are stacking on a CPU when they >> + * could run on other CPUs. Reduce the threshold before preemption is >> + * allowed to an arbitrary lower value as it is more likely (but not >> + * guaranteed) the waker requires the wakee to finish. >> + */ >> + if (wake_flags & WF_RQ_SELECTED) >> + threshold >>= 2; >> + >> + /* >> + * As WF_SYNC is not strictly obeyed, allow some runtime for batch >> + * wakeups to be issued. 
>> + */ >> + if (entity_before(pse, se) && delta >= threshold) >> + return PREEMPT_WAKEUP_RESCHED; >> + >> + return PREEMPT_WAKEUP_NONE; >> +} >> + >> /* >> * Preempt the current task with a newly woken task if needed: >> */ >> static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags) >> { >> + enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK; >> struct task_struct *donor = rq->donor; >> struct sched_entity *se = &donor->se, *pse = &p->se; >> struct cfs_rq *cfs_rq = task_cfs_rq(donor); >> int cse_is_idle, pse_is_idle; >> - bool do_preempt_short = false; >> >> if (unlikely(se == pse)) >> return; >> @@ -8699,10 +8764,6 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> if (task_is_throttled(p)) >> return; >> >> - if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) { >> - set_next_buddy(pse); >> - } >> - >> /* >> * We can come here with TIF_NEED_RESCHED already set from new task >> * wake up path. >> @@ -8734,7 +8795,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> * When non-idle entity preempt an idle entity, >> * don't give idle entity slice protection. >> */ >> - do_preempt_short = true; >> + preempt_action = PREEMPT_WAKEUP_SHORT; >> goto preempt; >> } >> >> @@ -8753,21 +8814,68 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int >> * If @p has a shorter slice than current and @p is eligible, override >> * current's slice protection in order to allow preemption. >> */ >> - do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice); >> + if (sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice)) { >> + preempt_action = PREEMPT_WAKEUP_SHORT; >> + goto pick; >> + } >> >> /* >> + * Ignore wakee preemption on WF_FORK as it is less likely that >> + * there is shared data as exec often follow fork. 
Do not >> + * preempt for tasks that are sched_delayed as it would violate >> + * EEVDF to forcibly queue an ineligible task. >> + */ >> + if ((wake_flags & WF_FORK) || pse->sched_delayed) >> + return; >> + >> + /* >> + * If @p potentially is completing work required by current then >> + * consider preemption. >> + * >> + * Reschedule if waker is no longer eligible. */ >> + if (in_task() && !entity_eligible(cfs_rq, se)) { >> + preempt_action = PREEMPT_WAKEUP_RESCHED; >> + goto preempt; >> + } >> + >> + /* Prefer picking wakee soon if appropriate. */ >> + if (sched_feat(NEXT_BUDDY) && >> + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) { >> + >> + /* >> + * Decide whether to obey WF_SYNC hint for a new buddy. Old >> + * buddies are ignored as they may not be relevant to the >> + * waker and less likely to be cache hot. >> + */ >> + if (wake_flags & WF_SYNC) >> + preempt_action = preempt_sync(rq, wake_flags, pse, se); >> + } >> + >> + switch (preempt_action) { >> + case PREEMPT_WAKEUP_NONE: >> + return; >> + case PREEMPT_WAKEUP_RESCHED: >> + goto preempt; >> + case PREEMPT_WAKEUP_SHORT: >> + fallthrough; >> + case PREEMPT_WAKEUP_PICK: >> + break; >> + } >> + >> +pick: >> + /* >> * If @p has become the most eligible task, force preemption. >> */ >> - if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse) >> + if (__pick_eevdf(cfs_rq, preempt_action != PREEMPT_WAKEUP_SHORT) == pse) >> goto preempt; >> >> - if (sched_feat(RUN_TO_PARITY) && do_preempt_short) >> + if (sched_feat(RUN_TO_PARITY)) >> update_protect_slice(cfs_rq, se); >> >> return; >> >> preempt: >> - if (do_preempt_short) >> + if (preempt_action == PREEMPT_WAKEUP_SHORT) >> cancel_protect_slice(se); >> >> resched_curr_lazy(rq); >> > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-02 12:38 ` Ryan Roberts @ 2026-01-02 15:52 ` Dietmar Eggemann 2026-01-05 11:45 ` Ryan Roberts 0 siblings, 1 reply; 27+ messages in thread From: Dietmar Eggemann @ 2026-01-02 15:52 UTC (permalink / raw) To: Ryan Roberts, Mel Gorman, Peter Zijlstra (Intel) Cc: x86, linux-kernel, Aishwarya TCV On 02.01.26 13:38, Ryan Roberts wrote: > Hi, I appreciate I sent this report just before Xmas so most likely you haven't > had a chance to look, but wanted to bring it back to the top of your mailbox in > case it was missed. > > Happy new year! > > Thanks, > Ryan > > On 22/12/2025 10:57, Ryan Roberts wrote: >> Hi Mel, Peter, >> >> We are building out a kernel performance regression monitoring lab at Arm, and >> I've noticed some fairly large perofrmance regressions in real-world workloads, >> for which bisection has fingered this patch. >> >> We are looking at performance changes between v6.18 and v6.19-rc1, and by >> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan >> to move the testing to linux-next over the next couple of quarters so hopefully >> we will be able to deliver this sort of news prior to merging in future). >> >> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean >> statistically significant regression/improvement, where "statistically >> significant" means the 95% confidence intervals do not overlap". You mentioned that you reverted this patch 'patch 2/2 'sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals'. Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? --- Mel mentioned that he tested on a 2-socket machine. 
So I guess something like my Intel Xeon Silver 4314: cpu0 0 0 domain0 SMT 00000001,00000001 domain1 MC 55555555,55555555 domain2 NUMA ffffffff,ffffffff node distances: node 0 1 0: 10 20 1: 20 10 Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC domain? I guess topology has influence in benchmark numbers here as well. --- There was also a lot of improvement on schbench (wakeup latency) on higher percentiles (>= 99.0th) on the 2-socket machine with those 2 patches. I guess you haven't seen those on Grav3? [...] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-02 15:52 ` Dietmar Eggemann @ 2026-01-05 11:45 ` Ryan Roberts 2026-01-05 14:38 ` Shrikanth Hegde 2026-01-07 15:30 ` Dietmar Eggemann 0 siblings, 2 replies; 27+ messages in thread From: Ryan Roberts @ 2026-01-05 11:45 UTC (permalink / raw) To: Dietmar Eggemann, Mel Gorman, Peter Zijlstra (Intel) Cc: x86, linux-kernel, Aishwarya TCV On 02/01/2026 15:52, Dietmar Eggemann wrote: > On 02.01.26 13:38, Ryan Roberts wrote: >> Hi, I appreciate I sent this report just before Xmas so most likely you haven't >> had a chance to look, but wanted to bring it back to the top of your mailbox in >> case it was missed. >> >> Happy new year! >> >> Thanks, >> Ryan >> >> On 22/12/2025 10:57, Ryan Roberts wrote: >>> Hi Mel, Peter, >>> >>> We are building out a kernel performance regression monitoring lab at Arm, and >>> I've noticed some fairly large perofrmance regressions in real-world workloads, >>> for which bisection has fingered this patch. >>> >>> We are looking at performance changes between v6.18 and v6.19-rc1, and by >>> reverting this patch on top of v6.19-rc1 many regressions are resolved. (We plan >>> to move the testing to linux-next over the next couple of quarters so hopefully >>> we will be able to deliver this sort of news prior to merging in future). >>> >>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean >>> statistically significant regression/improvement, where "statistically >>> significant" means the 95% confidence intervals do not overlap". > > You mentioned that you reverted this patch 'patch 2/2 'sched/fair: > Reimplement NEXT_BUDDY to align with EEVDF goals'. > > Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted > patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful? 
> > --- > > Mel mentioned that he tested on a 2-socket machine. So I guess something > like my Intel Xeon Silver 4314: > > cpu0 0 0 > domain0 SMT 00000001,00000001 > domain1 MC 55555555,55555555 > domain2 NUMA ffffffff,ffffffff > > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 > > Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC > domain? I guess topology has influence in benchmark numbers here as well. I can't easily enable scheduler debugging right now (which I think is needed to get this info directly?). But that's what I'd expect, yes. lscpu confirms there is a single NUMA node and topology for cpu0 gives this if it helps: /sys/devices/system/cpu/cpu0/topology$ grep "" -r . ./cluster_cpus:ffffffff,ffffffff ./cluster_cpus_list:0-63 ./physical_package_id:0 ./core_cpus_list:0 ./core_siblings:ffffffff,ffffffff ./cluster_id:0 ./core_siblings_list:0-63 ./package_cpus:ffffffff,ffffffff ./package_cpus_list:0-63 ./thread_siblings_list:0 ./core_id:0 ./core_cpus:00000000,00000001 ./thread_siblings:00000000,00000001 > > --- > > There was also a lot of improvement on schbench (wakeup latency) on > higher percentiles (>= 99.0th) on the 2-socket machine with those 2 > patches. I guess you haven't seen those on Grav3? > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for revert-next-buddy. 
The means have moved a bit but there are only a couple of cases that we consider statistically significant (marked (R)egression / (I)mprovement):

+----------------------------+------------------------------------------------------+-------------+-------------------+
| Benchmark                  | Result Class                                         | 6-19-0-rc1  | revert-next-buddy |
+============================+======================================================+=============+===================+
| schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec)          | 1263.97     | -6.43%            |
|                            | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     | 15088.00    | -0.28%            |
|                            | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 3.00        | 0.00%             |
|                            | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec)          | 6433.07     | -10.99%           |
|                            | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     | 15088.00    | -0.39%            |
|                            | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 4.17        | (R) -16.67%       |
|                            | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec)         | 1458.33     | -1.57%            |
|                            | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 813056.00   | 15.46%            |
|                            | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00    | -5.97%            |
|                            | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec)         | 434.22      | 3.21%             |
|                            | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 11354112.00 | 2.92%             |
|                            | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00    | -2.87%            |
|                            | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec)          | 2828.63     | 2.58%             |
|                            | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     | 15088.00    | 0.00%             |
|                            | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 3.00        | 0.00%             |
|                            | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec)          | 3182.15     | 5.18%             |
|                            | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     | 116266.67   | 8.22%             |
|                            | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 6186.67     | (R) -5.34%        |
|                            | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec)         | 749.20      | 2.91%             |
|                            | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 3702784.00  | (I) 13.76%        |
|                            | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67    | 0.24%             |
|                            | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec)         | 392.23      | 3.42%             |
|                            | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 16695296.00 | (I) 5.82%         |
|                            | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67   | -3.22%            |
|                            | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec)          | 5951.15     | 5.02%             |
|                            | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec)     | 15157.33    | 0.42%             |
|                            | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 3.67        | -4.35%            |
|                            | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec)          | 1510.23     | -1.38%            |
|                            | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec)     | 802816.00   | 13.73%            |
|                            | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec)  | 14890.67    | -10.44%           |
|                            | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec)         | 458.87      | 4.60%             |
|                            | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec)    | 11348650.67 | (I) 2.67%         |
|                            | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33    | (R) -5.48%        |
|                            | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec)         | 541.33      | 2.65%             |
|                            | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec)    | 36743850.67 | (I) 10.95%        |
|                            | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67   | -1.94%            |
+----------------------------+------------------------------------------------------+-------------+-------------------+

I could get the results for 6.18 if useful, but I think what I have probably shows enough of the picture: This patch has not impacted schbench much on this HW.

Thanks,
Ryan
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-05 11:45 ` Ryan Roberts @ 2026-01-05 14:38 ` Shrikanth Hegde 2026-01-05 16:33 ` Ryan Roberts 2026-01-07 15:30 ` Dietmar Eggemann 1 sibling, 1 reply; 27+ messages in thread From: Shrikanth Hegde @ 2026-01-05 14:38 UTC (permalink / raw) To: Ryan Roberts Cc: x86, linux-kernel, Aishwarya TCV, Dietmar Eggemann, Mel Gorman, Peter Zijlstra (Intel) Hi Ryan, >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC >> domain? I guess topology has influence in benchmark numbers here as well. > > I can't easily enable scheduler debugging right now (which I think is needed to > get this info directly?). But that's what I'd expect, yes. lscpu confirms there > is a single NUMA node and topology for cpu0 gives this if it helps: If you dump /proc/schedstat it should give you topology info as well. (you will need to parse it depending on which CPU you are looking this from) ^ permalink raw reply [flat|nested] 27+ messages in thread
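[Editorial note: the /proc/schedstat parsing suggested above can be sketched as a short awk script. The embedded sample input is illustrative only; the exact field layout depends on the schedstat version reported in the file's header (recent versions carry the domain name after the domainN token, older ones go straight to the cpumask), so adjust the printed fields to match the version on the machine.]

```shell
# Group the domainN lines under the cpuN line they follow, printing
# CPU, domain, and (for recent schedstat versions) name and cpumask.
# On a live system, feed the real file:  awk '...' /proc/schedstat
out=$(awk '/^cpu/ { cpu = $1 } /^domain/ { print cpu, $1, $2, $3 }' <<'EOF'
version 17
timestamp 4294892985
cpu0 0 0 0 0 0 0 1234 5678 90
domain0 MC ffffffff,ffffffff 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 1234 5678 90
domain0 MC ffffffff,ffffffff 0 0 0 0 0 0 0 0
EOF
)
printf '%s\n' "$out"
```

For the sample above this prints one `cpuN domainN ...` line per CPU, matching the "domain0 MC ffffffff,ffffffff" summary Ryan reports below.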
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-05 14:38 ` Shrikanth Hegde @ 2026-01-05 16:33 ` Ryan Roberts 0 siblings, 0 replies; 27+ messages in thread From: Ryan Roberts @ 2026-01-05 16:33 UTC (permalink / raw) To: Shrikanth Hegde Cc: x86, linux-kernel, Aishwarya TCV, Dietmar Eggemann, Mel Gorman, Peter Zijlstra (Intel) On 05/01/2026 14:38, Shrikanth Hegde wrote: > > Hi Ryan, > >>> node distances: >>> node 0 1 >>> 0: 10 20 >>> 1: 20 10 >>> >>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC >>> domain? I guess topology has influence in benchmark numbers here as well. >> >> I can't easily enable scheduler debugging right now (which I think is needed to >> get this info directly?). But that's what I'd expect, yes. lscpu confirms there >> is a single NUMA node and topology for cpu0 gives this if it helps: > > If you dump /proc/schedstat it should give you topology info as well. > > (you will need to parse it depending on which CPU you are looking this from) Ahh yes, thanks! Every cpu is reported as being in "domain0 MC ffffffff,ffffffff". So I guess that means there is a single MC domain as Dietmar suggests. Thanks, Ryan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-05 11:45 ` Ryan Roberts 2026-01-05 14:38 ` Shrikanth Hegde @ 2026-01-07 15:30 ` Dietmar Eggemann 2026-01-08 8:50 ` Mel Gorman 1 sibling, 1 reply; 27+ messages in thread From: Dietmar Eggemann @ 2026-01-07 15:30 UTC (permalink / raw) To: Ryan Roberts, Mel Gorman, Peter Zijlstra (Intel) Cc: x86, linux-kernel, Aishwarya TCV On 05.01.26 12:45, Ryan Roberts wrote: > On 02/01/2026 15:52, Dietmar Eggemann wrote: >> On 02.01.26 13:38, Ryan Roberts wrote: [...] >>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean >>>> statistically significant regression/improvement, where "statistically >>>> significant" means the 95% confidence intervals do not overlap". >> >> You mentioned that you reverted this patch 'patch 2/2 'sched/fair: >> Reimplement NEXT_BUDDY to align with EEVDF goals'. >> >> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted >> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? > > Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful? Well, I assume this would be more valuable. Before this patch-set (e.g v6.18), NEXT_BUDDY was disabled and this is what people are running. Now (>= v6.19-rc1) we have NEXT_BUDDY=true (1/2) and 'NEXT_BUDDY aligned to EEVDF' (2/2). This is what people will run when they switch to v6.19 later. But patch 2/2 changes more than the 'if (sched_feat(NEXT_BUDDY) ...' condition. So testing 'w/o 2/2' vs. 'w/ 2/2' and 'NEXT_BUDDY=false' could be helpful as well. >> --- >> >> Mel mentioned that he tested on a 2-socket machine. So I guess something >> like my Intel Xeon Silver 4314: >> >> cpu0 0 0 >> domain0 SMT 00000001,00000001 >> domain1 MC 55555555,55555555 >> domain2 NUMA ffffffff,ffffffff >> >> node distances: >> node 0 1 >> 0: 10 20 >> 1: 20 10 >> >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC >> domain? 
I guess topology has influence in benchmark numbers here as well. > > I can't easily enable scheduler debugging right now (which I think is needed to > get this info directly?). But that's what I'd expect, yes. lscpu confirms there > is a single NUMA node and topology for cpu0 gives this if it helps: > > /sys/devices/system/cpu/cpu0/topology$ grep "" -r . > ./cluster_cpus:ffffffff,ffffffff > ./cluster_cpus_list:0-63 > ./physical_package_id:0 > ./core_cpus_list:0 > ./core_siblings:ffffffff,ffffffff > ./cluster_id:0 > ./core_siblings_list:0-63 > ./package_cpus:ffffffff,ffffffff > ./package_cpus_list:0-63 [...] OK, so single (flat) MC domain with 64 CPUs. >> There was also a lot of improvement on schbench (wakeup latency) on >> higher percentiles (>= 99.0th) on the 2-socket machine with those 2 >> patches. I guess you haven't seen those on Grav3? >> > > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for > revert-next-buddy. The means have moved a bit but there are only a couple of > cases that we consisder statistically significant (marked (R)egression / > (I)mprovement): > > +----------------------------+------------------------------------------------------+-------------+-------------------+ > | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy | > +============================+======================================================+=============+===================+ > | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% | > | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% | > | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | > | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% | > | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% | > | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% | > | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% | > | | -m 16 -t 16 -r 10 
-s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% | > | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% | > | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% | > | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% | > | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% | > | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% | > | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% | > | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | > | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% | > | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% | > | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% | > | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% | > | | -m 32 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% | > | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% | > | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% | > | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% | > | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% | > | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% | > | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% | > | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% | > | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% | > | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% | > | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% | > | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% | > | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% | > | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% | > | | -m 64 -t 64 -r 
10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% | > | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% | > | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% | > +----------------------------+------------------------------------------------------+-------------+-------------------+ > > I could get the results for 6.18 if useful, but I think what I have probably > shows enough of the picture: This patch has not impacted schbench much on > this HW. I see. IMHO, task scheduler tests are all about putting the right amount of stress onto the system, not too little and not too much. I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system. Not sure which parameter set Mel was using on his 2 socket machine. And I still assume he tested w/o (base) against with these 2 patches. The other test Mel was using is this modified dbench4 (prefers throughput (less preemption)). Not sure if this is part of the MmTests suite? It would be nice to be able to run the same tests on different machines (with a parameter set adapted to the number of CPUs), so we have only the arch and the topology as variables. But there is definitely more variety (e.g. used filesystem, etc) ... so this is not trivial. [...] ^ permalink raw reply [flat|nested] 27+ messages in thread
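[Editorial note: the "right amount of stress" point above can be made concrete by sizing the run from the CPU count rather than hard-coding it. A sketch follows; the -m/-t/-r/-s flags match the invocations in the tables above, but the scaling policy of one message thread per CPU is only an assumption, not necessarily what either lab uses. The commands are printed rather than executed so the sketch works even where schbench is not installed.]

```shell
# Print schbench invocations scaled to this machine's CPU count,
# sweeping the per-message worker thread count as in the runs above.
nr_cpus=$(getconf _NPROCESSORS_ONLN)
for t in 1 4 16 64; do
    echo "schbench -m $nr_cpus -t $t -r 10 -s 1000"
done
```

Piping the output to `sh` would run the sweep on a machine where schbench is on the PATH.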
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-07 15:30 ` Dietmar Eggemann @ 2026-01-08 8:50 ` Mel Gorman 2026-01-08 13:15 ` Ryan Roberts 0 siblings, 1 reply; 27+ messages in thread From: Mel Gorman @ 2026-01-08 8:50 UTC (permalink / raw) To: Dietmar Eggemann Cc: Ryan Roberts, Peter Zijlstra (Intel), x86, linux-kernel, Aishwarya TCV On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote: > On 05.01.26 12:45, Ryan Roberts wrote: > > On 02/01/2026 15:52, Dietmar Eggemann wrote: > >> On 02.01.26 13:38, Ryan Roberts wrote: > > [...] > Sorry for slow responses. I'm only just back from holidays and unfortunately do not have access to test machines right now, so I cannot revalidate any of the results against 6.19-rc*. > >>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean > >>>> statistically significant regression/improvement, where "statistically > >>>> significant" means the 95% confidence intervals do not overlap". > >> > >> You mentioned that you reverted this patch 'patch 2/2 'sched/fair: > >> Reimplement NEXT_BUDDY to align with EEVDF goals'. > >> > >> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted > >> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? > > > > Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful? > > Well, I assume this would be more valuable. Agreed because we need to know if it's NEXT_BUDDY that is conceptually an issue with EEVDF in these cases or the specific implementation. The comparison needed is between
	6.18				(baseline)
	6.19-rcN vanilla		(New NEXT_BUDDY implementation enabled)
	6.19-rcN revert patches 1+2	(NEXT_BUDDY disabled)
	6.19-rcN revert patch 2 only	(Old NEXT_BUDDY implementation enabled)
It was known that NEXT_BUDDY was always a tradeoff but one that is workload, architecture and specific arch implementation dependent.
If it cannot be sanely reconciled then it may be best to completely remove NEXT_BUDDY from EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as it existed in CFS can be sanely implemented against EEVDF so it'll never be equivalent. Related to that I doubt anyone has good data on NEXT_BUDDY vs !NEXT_BUDDY even on CFS as it was enabled for so long. > >> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC > >> domain? I guess topology has influence in benchmark numbers here as well. > > > > I can't easily enable scheduler debugging right now (which I think is needed to > > get this info directly?). But that's what I'd expect, yes. lscpu confirms there > > is a single NUMA node and topology for cpu0 gives this if it helps: > > > > /sys/devices/system/cpu/cpu0/topology$ grep "" -r . > > ./cluster_cpus:ffffffff,ffffffff > > ./cluster_cpus_list:0-63 > > ./physical_package_id:0 > > ./core_cpus_list:0 > > ./core_siblings:ffffffff,ffffffff > > ./cluster_id:0 > > ./core_siblings_list:0-63 > > ./package_cpus:ffffffff,ffffffff > > ./package_cpus_list:0-63 > > [...] > > OK, so single (flat) MC domain with 64 CPUs. > That is what the OS sees but does it reflect reality? e.g. does Graviton3 have multiple caches that are simply not advertised to the OS? > >> There was also a lot of improvement on schbench (wakeup latency) on > >> higher percentiles (>= 99.0th) on the 2-socket machine with those 2 > >> patches. I guess you haven't seen those on Grav3? > >> > > > > I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for > > revert-next-buddy. 
The means have moved a bit but there are only a couple of > > cases that we consisder statistically significant (marked (R)egression / > > (I)mprovement): > > > > +----------------------------+------------------------------------------------------+-------------+-------------------+ > > | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy | > > +============================+======================================================+=============+===================+ > > | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% | > > | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% | > > | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | > > | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% | > > | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% | > > | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% | > > | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% | > > | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% | > > | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% | > > | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% | > > | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% | > > | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% | > > | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% | > > | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% | > > | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | > > | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% | > > | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% | > > | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% | > > | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% | > > | | -m 32 -t 16 -r 
10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% | > > | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% | > > | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% | > > | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% | > > | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% | > > | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% | > > | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% | > > | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% | > > | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% | > > | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% | > > | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% | > > | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% | > > | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% | > > | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% | > > | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% | > > | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% | > > | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% | > > +----------------------------+------------------------------------------------------+-------------+-------------------+ > > > > I could get the results for 6.18 if useful, but I think what I have probably > > shows enough of the picture: This patch has not impacted schbench much on > > this HW. > > I see. IMHO, task scheduler tests are all about putting the right about > of stress onto the system, not too little and not too much. > > I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system. Agreed. It's not the full picture but it's a valuable part. > Not sure which parameter set Mel was using on his 2 socket machine. 
And > I still assume he tested w/o (base) against with these 2 patches. > He did. The most comparable test I used was NUM_CPUS so 64 for Graviton is ok. > The other test Mel was using is this modified dbench4 (prefers > throughput (less preemption)). Not sure if this is part of the MmTests > suite? > It is. The modifications are not extensive. dbench by default reports overall throughput over time which masks actual throughput at a point in time. The new metric tracks time taken to process "loadfiles" over time which is more sensible to analyse. Other metrics such as loadfiles processed per client could easily be extracted but aren't at the moment as dbench itself is not designed for measuring fairness of forward progress as such. > It would be nice to be able to run the same tests on different machines > (with a parameter set adapted to the number of CPUs), so we have only > the arch and the topology as variables). But there is definitely more > variety (e.g. used filesystem, etc) ... so this is not trivial. > From a topology perspective it is fairly trivial though. For example, MMTESTS has a schbench configuration that runs one message thread per NUMA node communicating with nr_cpus/nr_nodes to evaluate placement. A similar configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed properly or nr_llcs could also be used fairly trivially. You're right that once filesystems are involved then it all gets more interesting. ext4 and xfs use kernel threads differently (jbd vs kworkers), the underlying storage is a factor, workset size vs RAM impacts dirty throttling and reclaim, and NUMA sizes all play a part. dbench is useful in this regard because while it interacts with the filesystem and wakeups between userspace and kernel threads get exercised, the amount of IO is relatively small. Let's start with getting figures for 6.18, new-NEXT, old-NEXT and no-NEXT side-by-side. 
-- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
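The topology-adapted parameter selection Mel describes (one schbench message thread per NUMA node, nr_cpus/nr_nodes workers each) can be sketched as a small shell helper. This is a hypothetical illustration, not the actual MMTests configuration; the fallback when sysfs lacks node information is also an assumption:

```shell
# Hypothetical sketch: derive schbench parameters from topology, along
# the lines of the MMTests configuration described above: one message
# thread per NUMA node, nr_cpus/nr_nodes worker threads each.
nr_cpus=$(getconf _NPROCESSORS_ONLN)
nr_nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)
[ "$nr_nodes" -ge 1 ] || nr_nodes=1   # fall back if sysfs lacks node info
workers=$((nr_cpus / nr_nodes))
echo "schbench -m $nr_nodes -t $workers -r 10 -s 1000"
# A "packed" variant would use $((workers - 1)) to leave one CPU per
# node free, as suggested above.
```

The same derivation extends to per-LLC message threads by counting shared cache domains instead of NUMA nodes.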
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-08 8:50 ` Mel Gorman @ 2026-01-08 13:15 ` Ryan Roberts 2026-01-09 10:15 ` Ryan Roberts 0 siblings, 1 reply; 27+ messages in thread From: Ryan Roberts @ 2026-01-08 13:15 UTC (permalink / raw) To: Mel Gorman, Dietmar Eggemann Cc: Peter Zijlstra (Intel), x86, linux-kernel, Aishwarya TCV On 08/01/2026 08:50, Mel Gorman wrote: > On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote: >> On 05.01.26 12:45, Ryan Roberts wrote: >>> On 02/01/2026 15:52, Dietmar Eggemann wrote: >>>> On 02.01.26 13:38, Ryan Roberts wrote: >> >> [...] >> > > Sorry for slow responses. I'm still not back from holidays and unfortunately > do not have access to test machines right now, so I cannot revalidate any of > the results against 6.19-rc*. No problem, thanks for getting back to me! > >>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean >>>>>> statistically significant regression/improvement, where "statistically >>>>>> significant" means the 95% confidence intervals do not overlap. >>>> >>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair: >>>> Reimplement NEXT_BUDDY to align with EEVDF goals'. >>>> >>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted >>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? >>> >>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful? >> >> Well, I assume this would be more valuable. > > Agreed because we need to know if it's NEXT_BUDDY that is conceptually > an issue with EEVDF in these cases or the specific implementation. The > comparison between > > 6.18A (baseline) > 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled) > 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled) > 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled) OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully tomorrow. 
Then we can take it from there. I appreciate your time on this! Thanks, Ryan > > It was known that NEXT_BUDDY was always a tradeoff but one that is workload, > architecture and specific arch implementation dependent. If it cannot be > sanely reconciled then it may be best to completely remove NEXT_BUDDY from > EEVDF or disable it by default again for now. I don't think NEXT_BUDDY as > it existed in CFS can be sanely implemented against EEVDF so it'll never > be equivalent. Related to that I doubt anyone has good data on NEXT_BUDDY > vs !NEXT_BUDDY even on CFS as it was enabled for so long. > >>>> Whereas I assume the Graviton3 has 64 CPUs (cores) flat in a single MC >>>> domain? I guess topology has influence in benchmark numbers here as well. >>> >>> I can't easily enable scheduler debugging right now (which I think is needed to >>> get this info directly?). But that's what I'd expect, yes. lscpu confirms there >>> is a single NUMA node and topology for cpu0 gives this if it helps: >>> >>> /sys/devices/system/cpu/cpu0/topology$ grep "" -r . >>> ./cluster_cpus:ffffffff,ffffffff >>> ./cluster_cpus_list:0-63 >>> ./physical_package_id:0 >>> ./core_cpus_list:0 >>> ./core_siblings:ffffffff,ffffffff >>> ./cluster_id:0 >>> ./core_siblings_list:0-63 >>> ./package_cpus:ffffffff,ffffffff >>> ./package_cpus_list:0-63 >> >> [...] >> >> OK, so single (flat) MC domain with 64 CPUs. >> > > That is what the OS sees but does it reflect reality? e.g. does Graviton3 > have multiple caches that are simply not advertised to the OS? > >>>> There was also a lot of improvement on schbench (wakeup latency) on >>>> higher percentiles (>= 99.0th) on the 2-socket machine with those 2 >>>> patches. I guess you haven't seen those on Grav3? >>>> >>> >>> I don't have schbench results for 6.18 but I do have them for 6.19-rc1 and for >>> revert-next-buddy. 
The means have moved a bit but there are only a couple of >>> cases that we consisder statistically significant (marked (R)egression / >>> (I)mprovement): >>> >>> +----------------------------+------------------------------------------------------+-------------+-------------------+ >>> | Benchmark | Result Class | 6-19-0-rc1 | revert-next-buddy | >>> +============================+======================================================+=============+===================+ >>> | schbench/thread-contention | -m 16 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 1263.97 | -6.43% | >>> | | -m 16 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.28% | >>> | | -m 16 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | >>> | | -m 16 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 6433.07 | -10.99% | >>> | | -m 16 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | -0.39% | >>> | | -m 16 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 4.17 | (R) -16.67% | >>> | | -m 16 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 1458.33 | -1.57% | >>> | | -m 16 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 813056.00 | 15.46% | >>> | | -m 16 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14240.00 | -5.97% | >>> | | -m 16 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 434.22 | 3.21% | >>> | | -m 16 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 11354112.00 | 2.92% | >>> | | -m 16 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63168.00 | -2.87% | >>> | | -m 32 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 2828.63 | 2.58% | >>> | | -m 32 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15088.00 | 0.00% | >>> | | -m 32 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.00 | 0.00% | >>> | | -m 32 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 3182.15 | 5.18% | >>> | | -m 32 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 116266.67 | 8.22% | >>> | | -m 32 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 6186.67 | (R) -5.34% | >>> | | -m 32 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 749.20 | 2.91% | >>> | | -m 32 -t 16 -r 
10 -s 1000, req_latency_p99 (usec) | 3702784.00 | (I) 13.76% | >>> | | -m 32 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 33514.67 | 0.24% | >>> | | -m 32 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 392.23 | 3.42% | >>> | | -m 32 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 16695296.00 | (I) 5.82% | >>> | | -m 32 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 120618.67 | -3.22% | >>> | | -m 64 -t 1 -r 10 -s 1000, avg_rps (req/sec) | 5951.15 | 5.02% | >>> | | -m 64 -t 1 -r 10 -s 1000, req_latency_p99 (usec) | 15157.33 | 0.42% | >>> | | -m 64 -t 1 -r 10 -s 1000, wakeup_latency_p99 (usec) | 3.67 | -4.35% | >>> | | -m 64 -t 4 -r 10 -s 1000, avg_rps (req/sec) | 1510.23 | -1.38% | >>> | | -m 64 -t 4 -r 10 -s 1000, req_latency_p99 (usec) | 802816.00 | 13.73% | >>> | | -m 64 -t 4 -r 10 -s 1000, wakeup_latency_p99 (usec) | 14890.67 | -10.44% | >>> | | -m 64 -t 16 -r 10 -s 1000, avg_rps (req/sec) | 458.87 | 4.60% | >>> | | -m 64 -t 16 -r 10 -s 1000, req_latency_p99 (usec) | 11348650.67 | (I) 2.67% | >>> | | -m 64 -t 16 -r 10 -s 1000, wakeup_latency_p99 (usec) | 63445.33 | (R) -5.48% | >>> | | -m 64 -t 64 -r 10 -s 1000, avg_rps (req/sec) | 541.33 | 2.65% | >>> | | -m 64 -t 64 -r 10 -s 1000, req_latency_p99 (usec) | 36743850.67 | (I) 10.95% | >>> | | -m 64 -t 64 -r 10 -s 1000, wakeup_latency_p99 (usec) | 211370.67 | -1.94% | >>> +----------------------------+------------------------------------------------------+-------------+-------------------+ >>> >>> I could get the results for 6.18 if useful, but I think what I have probably >>> shows enough of the picture: This patch has not impacted schbench much on >>> this HW. >> >> I see. IMHO, task scheduler tests are all about putting the right about >> of stress onto the system, not too little and not too much. >> >> I guess, these ~ `-m 64 -t 4` tests should be fine for a 64 CPUs system. > > Agreed. It's not the full picture but it's a valuable part. > >> Not sure which parameter set Mel was using on his 2 socket machine. 
And >> I still assume he tested w/o (base) against with these 2 patches. >> > > He did. The most comparable test I used was NUM_CPUS so 64 for Graviton > is ok. > >> The other test Mel was using is this modified dbench4 (prefers >> throughput (less preemption)). Not sure if this is part of the MmTests >> suite? >> > > It is. The modifications are not extensive. dbench by default reports overall > throughput over time which masks actual throughput at a point in time. The > new metric tracks time taken to process "loadfiles" over time which is > more sensible to analyse. Other metrics such as loadfiles processed per > client would be easily extracted but isn't at the moment as dbench itself > is not designed for measuring fairness of forward progress as such. > >> It would be nice to be able to run the same tests on different machines >> (with a parameter set adapted to the number of CPUs), so we have only >> the arch and the topology as variables). But there is definitely more >> variety (e.g. used filesystem, etc) ... so this is not trivial. >> > > From a topology perspective it is fairly trivial though. For example, > MMTESTS has a schbench configuration that runs one message thread per NUMA > node communicating with nr_cpus/nr_nodes to evaluate placement. A similar > configuration could use (nr_cpus/nr_nodes)-1 to ensure tasks are packed > properly or nr_llcs could also be used fairly trivially. You're right that > once filesystems are involved then it all gets more interesting. ext4 and > xfs use kernel threads differently (jbd vs kworkers), the underlying storage > is a factor, workset size vs RAM impacts dirty throttling and reclaim and > NUMA sizes all play a part. dbench is useful in this regard because while > it interacts with the filesystem a wakeups between userspace and kernel > threads get exercised, the amount of IO is relatively small. > > Lets start with getting figures for 6.18, new-NEXT, old-NEXT and > no-NEXT side-by-side. 
Ideally I'd do the same across a range of x86-64 > machines but I can't start that yet. > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-08 13:15 ` Ryan Roberts @ 2026-01-09 10:15 ` Ryan Roberts 2026-01-12 7:47 ` Peter Zijlstra 2026-01-15 10:16 ` Mel Gorman 0 siblings, 2 replies; 27+ messages in thread From: Ryan Roberts @ 2026-01-09 10:15 UTC (permalink / raw) To: Mel Gorman, Dietmar Eggemann Cc: Peter Zijlstra (Intel), x86, linux-kernel, Aishwarya TCV On 08/01/2026 13:15, Ryan Roberts wrote: > On 08/01/2026 08:50, Mel Gorman wrote: >> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote: >>> On 05.01.26 12:45, Ryan Roberts wrote: >>>> On 02/01/2026 15:52, Dietmar Eggemann wrote: >>>>> On 02.01.26 13:38, Ryan Roberts wrote: >>> >>> [...] >>> >> >> Sorry for slow responses. I'm still back from holidays yet and unfortunately >> do not have access to test machines right now cannot revalidate any of >> the results against 6.19-rc*. > > No problem, thanks for getting back to me! > >> >>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems. (R)/(I) mean >>>>>>> statistically significant regression/improvement, where "statistically >>>>>>> significant" means the 95% confidence intervals do not overlap". >>>>> >>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair: >>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'. >>>>> >>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted >>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well? >>>> >>>> Yes that's correct; patch 1 is still present. I could revert that as well and rerun if useful? >>> >>> Well, I assume this would be more valuable. >> >> Agreed because we need to know if it's NEXT_BUDDY that is conceptually >> an issue with EEVDF in these cases or the specific implementation. 
The >> comparison between >> >> 6.18A (baseline) >> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled) >> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled) >> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled) > > OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you - hopefully > tomorrow. Then we can take it from there. Hi Mel, Dietmar, Here are the updated results, now including column for "revert #1 & #2". 6-18-0 (base) (baseline) 6-19-0-rc1 (New NEXT_BUDDY implementation enabled) revert #1 & #2 (NEXT_BUDDY disabled) revert #2 (Old NEXT_BUDDY implementation enabled) The regressions that are fixed by "revert #2" (as originally reported) are still fixed in "revert #1 & #2". Interestingly, performance actually improves further for the latter in the multi-node mysql benchmark (which is our VIP workload). There are a couple of hackbench cases (sockets with high thread counts) that showed an improvement with "revert #2" but which is gone with "revert #1 & #2". Let me know if I can usefully do anything else. 
Multi-node SUT (workload running across 2 machines): +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | +=================================+====================================================+===============+=============+============+================+ | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33 | (R) -1.33% | (I) 5.87% | (I) 7.63% | | | new order rate (orders/min) | 213256.50 | (R) -1.32% | (I) 5.87% | (I) 7.64% | +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ Single-node SUT (workload running on single machine): +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | Benchmark | Result Class | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 | +=================================+====================================================+===============+=============+============+================+ | specjbb/composite | critical-jOPS (jOPS) | 94700.00 | (R) -5.10% | -0.90% | -0.37% | | | max-jOPS (jOPS) | 113984.50 | (R) -3.90% | -0.65% | 0.65% | +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% | | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% | +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% | | | Scale: 1 Clients: 1 Read 
Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% | | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% | | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% | | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% | | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% | | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% | | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% | | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% | | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% | | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% | | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% | | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% | | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% | | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% | | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% | | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% | | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% | | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% | | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% | | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% | | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% | | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% | | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% | 
+---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% | | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% | | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% | | | hackbench-process-pipes-12 (seconds) | 1.24 | (R) -19.89% | -0.45% | (R) -2.23% | | | hackbench-process-pipes-21 (seconds) | 1.81 | (R) -8.41% | -1.22% | (R) -2.46% | | | hackbench-process-pipes-30 (seconds) | 2.39 | (R) -9.06% | (R) -2.95% | -1.62% | | | hackbench-process-pipes-48 (seconds) | 3.18 | (R) -11.68% | (R) -4.10% | -0.26% | | | hackbench-process-pipes-79 (seconds) | 3.84 | (R) -9.74% | (R) -3.25% | (R) -2.45% | | | hackbench-process-pipes-110 (seconds) | 4.68 | (R) -6.57% | (R) -2.12% | (R) -2.25% | | | hackbench-process-pipes-141 (seconds) | 5.75 | (R) -5.86% | (R) -3.44% | (R) -2.89% | | | hackbench-process-pipes-172 (seconds) | 6.80 | (R) -4.28% | (R) -2.81% | (R) -2.44% | | | hackbench-process-pipes-203 (seconds) | 7.94 | (R) -4.01% | (R) -3.00% | (R) -2.17% | | | hackbench-process-pipes-234 (seconds) | 9.02 | (R) -3.52% | (R) -2.81% | (R) -2.20% | | | hackbench-process-pipes-256 (seconds) | 9.78 | (R) -3.24% | (R) -2.81% | (R) -2.74% | | | hackbench-process-sockets-1 (seconds) | 0.29 | 0.50% | 0.26% | 0.03% | | | hackbench-process-sockets-4 (seconds) | 0.76 | (I) 17.44% | (I) 16.31% | (I) 19.09% | | | hackbench-process-sockets-7 (seconds) | 1.16 | (I) 12.10% | (I) 9.78% | (I) 11.83% | | | hackbench-process-sockets-12 (seconds) | 1.86 | (I) 10.19% | (I) 9.83% | (I) 11.21% | | | hackbench-process-sockets-21 (seconds) | 3.12 | (I) 9.38% | (I) 9.20% | (I) 10.30% | | | hackbench-process-sockets-30 (seconds) | 4.30 | (I) 6.43% | (I) 6.11% | (I) 7.22% | | | hackbench-process-sockets-48 (seconds) | 6.58 | (I) 3.00% | (I) 2.19% | (I) 2.85% | | | 
hackbench-process-sockets-79 (seconds) | 10.56 | (I) 2.87% | (I) 3.31% | 3.10% | | | hackbench-process-sockets-110 (seconds) | 13.85 | -1.15% | (I) 2.33% | 0.22% | | | hackbench-process-sockets-141 (seconds) | 19.23 | -1.40% | (I) 14.53% | 2.64% | | | hackbench-process-sockets-172 (seconds) | 26.33 | (I) 3.52% | (I) 30.37% | (I) 4.32% | | | hackbench-process-sockets-203 (seconds) | 30.27 | 1.10% | (I) 27.20% | 0.32% | | | hackbench-process-sockets-234 (seconds) | 35.12 | 1.60% | (I) 28.24% | 1.28% | | | hackbench-process-sockets-256 (seconds) | 38.74 | 0.70% | (I) 28.74% | 0.53% | | | hackbench-thread-pipes-1 (seconds) | 0.17 | -1.32% | -0.76% | -0.67% | | | hackbench-thread-pipes-4 (seconds) | 0.45 | (I) 6.91% | (I) 7.64% | (I) 9.08% | | | hackbench-thread-pipes-7 (seconds) | 0.74 | (R) -7.51% | (I) 5.26% | (I) 2.82% | | | hackbench-thread-pipes-12 (seconds) | 1.32 | (R) -8.40% | (I) 2.32% | -0.53% | | | hackbench-thread-pipes-21 (seconds) | 1.95 | (R) -2.95% | 0.91% | (R) -2.00% | | | hackbench-thread-pipes-30 (seconds) | 2.50 | (R) -4.61% | 1.47% | -1.63% | | | hackbench-thread-pipes-48 (seconds) | 3.32 | (R) -5.45% | (I) 2.15% | 0.81% | | | hackbench-thread-pipes-79 (seconds) | 4.04 | (R) -5.53% | 1.85% | -0.53% | | | hackbench-thread-pipes-110 (seconds) | 4.94 | (R) -2.33% | 1.51% | 0.59% | | | hackbench-thread-pipes-141 (seconds) | 6.04 | (R) -2.47% | 1.15% | 0.24% | | | hackbench-thread-pipes-172 (seconds) | 7.15 | -0.91% | 1.48% | 0.45% | | | hackbench-thread-pipes-203 (seconds) | 8.31 | -1.29% | 0.77% | 0.40% | | | hackbench-thread-pipes-234 (seconds) | 9.49 | -1.03% | 0.77% | 0.65% | | | hackbench-thread-pipes-256 (seconds) | 10.30 | -0.80% | 0.42% | 0.30% | | | hackbench-thread-sockets-1 (seconds) | 0.31 | 0.05% | -0.05% | -0.43% | | | hackbench-thread-sockets-4 (seconds) | 0.79 | (I) 18.91% | (I) 16.82% | (I) 19.79% | | | hackbench-thread-sockets-7 (seconds) | 1.16 | (I) 12.57% | (I) 10.63% | (I) 12.95% | | | hackbench-thread-sockets-12 (seconds) | 1.87 
| (I) 12.65% | (I) 12.26% | (I) 13.90% | | | hackbench-thread-sockets-21 (seconds) | 3.16 | (I) 11.62% | (I) 12.74% | (I) 13.89% | | | hackbench-thread-sockets-30 (seconds) | 4.32 | (I) 7.35% | (I) 8.89% | (I) 9.51% | | | hackbench-thread-sockets-48 (seconds) | 6.45 | (I) 2.69% | (I) 3.06% | (I) 3.74% | | | hackbench-thread-sockets-79 (seconds) | 10.15 | (I) 3.30% | 1.98% | (I) 2.76% | | | hackbench-thread-sockets-110 (seconds) | 13.45 | -0.25% | (I) 3.68% | 0.44% | | | hackbench-thread-sockets-141 (seconds) | 17.87 | (R) -2.18% | (I) 8.46% | 1.51% | | | hackbench-thread-sockets-172 (seconds) | 24.38 | 1.02% | (I) 24.33% | 1.38% | | | hackbench-thread-sockets-203 (seconds) | 28.38 | -0.99% | (I) 24.20% | 0.57% | | | hackbench-thread-sockets-234 (seconds) | 32.75 | -0.42% | (I) 24.35% | 0.72% | | | hackbench-thread-sockets-256 (seconds) | 36.49 | -1.30% | (I) 26.22% | 0.81% | +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | -0.61% | | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | 0.57% | +---------------------------------+----------------------------------------------------+---------------+-------------+------------+----------------+ Thanks, Ryan ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-09 10:15 ` Ryan Roberts @ 2026-01-12 7:47 ` Peter Zijlstra 2026-01-12 8:52 ` Ryan Roberts 2026-01-15 10:16 ` Mel Gorman 1 sibling, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2026-01-12 7:47 UTC (permalink / raw) To: Ryan Roberts Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote: > Here are the updated results, now including column for "revert #1 & #2". > > 6-18-0 (base) (baseline) > 6-19-0-rc1 (New NEXT_BUDDY implementation enabled) > revert #1 & #2 (NEXT_BUDDY disabled) > revert #2 (Old NEXT_BUDDY implementation enabled) > > > The regressions that are fixed by "revert #2" (as originally reported) are still > fixed in "revert #1 & #2". Interestingly, performance actually improves further > for the latter in the multi-node mysql benchmark (which is our VIP workload). > There are a couple of hackbench cases (sockets with high thread counts) that > showed an improvement with "revert #2" but which is gone with "revert #1 & #2". > > Let me know if I can usefully do anything else. If it's not too much bother, could you run 6.19-rc with SCHED_BATCH? The defining characteristic of BATCH is that it fully ignores wakeup preemption. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-12 7:47 ` Peter Zijlstra @ 2026-01-12 8:52 ` Ryan Roberts 2026-01-12 9:57 ` Peter Zijlstra 2026-01-13 6:31 ` K Prateek Nayak 0 siblings, 2 replies; 27+ messages in thread From: Ryan Roberts @ 2026-01-12 8:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV On 12/01/2026 07:47, Peter Zijlstra wrote: > On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote: > >> Here are the updated results, now including column for "revert #1 & #2". >> >> 6-18-0 (base) (baseline) >> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled) >> revert #1 & #2 (NEXT_BUDDY disabled) >> revert #2 (Old NEXT_BUDDY implementation enabled) >> >> >> The regressions that are fixed by "revert #2" (as originally reported) are still >> fixed in "revert #1 & #2". Interestingly, performance actually improves further >> for the latter in the multi-node mysql benchmark (which is our VIP workload). >> There are a couple of hackbench cases (sockets with high thread counts) that >> showed an improvement with "revert #2" but which is gone with "revert #1 & #2". >> >> Let me know if I can usefully do anything else. > > If its not too much bother, could you run 6.19-rc with SCHED_BATCH ? The > defining characteristic of BATCH is that it fully ignores wakeup > preemption. Is there a way I can force all future tasks to use SCHED_BATCH at the system level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for me to do. But if I need to invoke the top level command with chrt -b and hope that nothing in the workload explicitly changes the scheduling policy that would be both trickier for me to do and (I guess) higher risk that it ends up not doing what I expected. Happy to give whatever you recommend a try... ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals 2026-01-12 8:52 ` Ryan Roberts @ 2026-01-12 9:57 ` Peter Zijlstra 2026-01-12 10:27 ` Ryan Roberts 2026-01-13 6:31 ` K Prateek Nayak 1 sibling, 1 reply; 27+ messages in thread From: Peter Zijlstra @ 2026-01-12 9:57 UTC (permalink / raw) To: Ryan Roberts Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote: > On 12/01/2026 07:47, Peter Zijlstra wrote: > > On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote: > > > >> Here are the updated results, now including column for "revert #1 & #2". > >> > >> 6-18-0 (base) (baseline) > >> 6-19-0-rc1 (New NEXT_BUDDY implementation enabled) > >> revert #1 & #2 (NEXT_BUDDY disabled) > >> revert #2 (Old NEXT_BUDDY implementation enabled) > >> > >> > >> The regressions that are fixed by "revert #2" (as originally reported) are still > >> fixed in "revert #1 & #2". Interestingly, performance actually improves further > >> for the latter in the multi-node mysql benchmark (which is our VIP workload). > >> There are a couple of hackbench cases (sockets with high thread counts) that > >> showed an improvement with "revert #2" but which is gone with "revert #1 & #2". > >> > >> Let me know if I can usefully do anything else. > > > > If its not too much bother, could you run 6.19-rc with SCHED_BATCH ? The > > defining characteristic of BATCH is that it fully ignores wakeup > > preemption. > > Is there a way I can force all future tasks to use SCHED_BATCH at the system > level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would be simple for > me to do. But if I need to invoke the top level command with chrt -b and hope > that nothing in the workload explicitly changes the scheduling policy that would > be both trickier for me to do and (I guess) higher risk that it ends up not > doing what I expected. Happy to give whatever you recommend a try... 
No fancy things here, chrt/schedtool are it. ^ permalink raw reply [flat|nested] 27+ messages in thread
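A minimal sketch of the chrt approach (assumes util-linux chrt is installed; the actual workload launcher is a placeholder, and children inherit the policy unless they explicitly change it):

```shell
# List the priority range chrt knows for each policy; SCHED_BATCH
# only accepts static priority 0.
chrt -m

# Launch the top-level command under SCHED_BATCH. The inner
# "chrt -p $$" just reports the policy the shell actually got, as a
# sanity check; replace it with the real workload launcher.
chrt -b 0 sh -c 'chrt -p $$' || echo "sched_setscheduler not permitted here"
```

Spot-checking a long-running task with `chrt -p <pid>` afterwards confirms nothing in the workload switched itself back to SCHED_OTHER.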
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  2026-01-12  9:57 ` Peter Zijlstra
@ 2026-01-12 10:27 ` Ryan Roberts
  0 siblings, 0 replies; 27+ messages in thread
From: Ryan Roberts @ 2026-01-12 10:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV

On 12/01/2026 09:57, Peter Zijlstra wrote:
> On Mon, Jan 12, 2026 at 08:52:17AM +0000, Ryan Roberts wrote:
>> On 12/01/2026 07:47, Peter Zijlstra wrote:
>>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>>
>>>> Here are the updated results, now including column for "revert #1 & #2".
>>>>
>>>> 6-18-0 (base)  (baseline)
>>>> 6-19-0-rc1     (New NEXT_BUDDY implementation enabled)
>>>> revert #1 & #2 (NEXT_BUDDY disabled)
>>>> revert #2      (Old NEXT_BUDDY implementation enabled)
>>>>
>>>> The regressions that are fixed by "revert #2" (as originally reported)
>>>> are still fixed in "revert #1 & #2". Interestingly, performance actually
>>>> improves further for the latter in the multi-node mysql benchmark (which
>>>> is our VIP workload). There are a couple of hackbench cases (sockets
>>>> with high thread counts) that showed an improvement with "revert #2" but
>>>> which is gone with "revert #1 & #2".
>>>>
>>>> Let me know if I can usefully do anything else.
>>>
>>> If its not too much bother, could you run 6.19-rc with SCHED_BATCH ? The
>>> defining characteristic of BATCH is that it fully ignores wakeup
>>> preemption.
>>
>> Is there a way I can force all future tasks to use SCHED_BATCH at the
>> system level? (a Kconfig, cmdline arg or sysfs toggle?) If so that would
>> be simple for me to do. But if I need to invoke the top level command
>> with chrt -b and hope that nothing in the workload explicitly changes the
>> scheduling policy that would be both trickier for me to do and (I guess)
>> higher risk that it ends up not doing what I expected. Happy to give
>> whatever you recommend a try...
>
> No fancy things here, chrt/schedtool are it.

OK I'll figure out how to butcher this into my workflow and get back to you
with results. It probably won't be until Wednesday though.

Thanks,
Ryan
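[Editorial note: a minimal sketch of the chrt approach Peter suggests, for
readers unfamiliar with it. The `sleep 2` stand-in and the variable names are
illustrative, not from the thread; substitute the real benchmark driver.]

```shell
# chrt(1) from util-linux launches a command under a given scheduling
# policy. -b selects SCHED_BATCH, which requires static priority 0.
# "sleep 2" stands in for the real benchmark driver here.
if command -v chrt >/dev/null 2>&1; then
    chrt -b 0 sleep 2 &
    WORKLOAD_PID=$!
    # Children inherit SCHED_BATCH, but anything in the workload that
    # explicitly calls sched_setscheduler() overrides it, which is
    # exactly the risk Ryan raises above.
    POLICY=$(chrt -p "$WORKLOAD_PID" | head -n 1)
    echo "$POLICY"
    wait "$WORKLOAD_PID"
else
    # Dry run on systems without util-linux chrt installed.
    POLICY="SCHED_BATCH (chrt not installed; dry run)"
fi
```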
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  2026-01-12  8:52 ` Ryan Roberts
  2026-01-12  9:57 ` Peter Zijlstra
@ 2026-01-13  6:31 ` K Prateek Nayak
  1 sibling, 0 replies; 27+ messages in thread
From: K Prateek Nayak @ 2026-01-13 6:31 UTC (permalink / raw)
To: Ryan Roberts, Peter Zijlstra
Cc: Mel Gorman, Dietmar Eggemann, x86, linux-kernel, Aishwarya TCV

Hello Ryan,

On 1/12/2026 2:22 PM, Ryan Roberts wrote:
> On 12/01/2026 07:47, Peter Zijlstra wrote:
>> On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
>>
>>> Here are the updated results, now including column for "revert #1 & #2".
>>>
>>> 6-18-0 (base)  (baseline)
>>> 6-19-0-rc1     (New NEXT_BUDDY implementation enabled)
>>> revert #1 & #2 (NEXT_BUDDY disabled)
>>> revert #2      (Old NEXT_BUDDY implementation enabled)
>>>
>>> The regressions that are fixed by "revert #2" (as originally reported)
>>> are still fixed in "revert #1 & #2". Interestingly, performance actually
>>> improves further for the latter in the multi-node mysql benchmark (which
>>> is our VIP workload). There are a couple of hackbench cases (sockets
>>> with high thread counts) that showed an improvement with "revert #2" but
>>> which is gone with "revert #1 & #2".
>>>
>>> Let me know if I can usefully do anything else.
>>
>> If its not too much bother, could you run 6.19-rc with SCHED_BATCH ? The
>> defining characteristic of BATCH is that it fully ignores wakeup
>> preemption.
>
> Is there a way I can force all future tasks to use SCHED_BATCH at the
> system level?

One shortcut is to echo "NO_WAKEUP_PREEMPTION" into
/sys/kernel/debug/sched/features but note it'll disable wakeup preemption
for all tasks, including kthreads, which might adversely affect performance
and is not an exact equivalent to only running the workload under
SCHED_BATCH.

For repro-collection/mysql-workload (which I presume is [1]), there is a
"WORKLOAD_SCHED_POLICY" environment variable that can be overridden [2]
which controls the "CPUSchedulingPolicy" of the mysqld service.

[1] https://github.com/aws/repro-collection/tree/main/workloads/mysql
[2] https://github.com/aws/repro-collection/blob/a2cdf0455bd3422c9c1fc689ceac32971223b984/repros/repro-mysql-EEVDF-regression/main.sh#L102

--
Thanks and Regards,
Prateek
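[Editorial note: a sketch of the two knobs Prateek describes. The debugfs
path is the one he names; the systemd drop-in path shown in the comments is
illustrative, not from the thread, and assumes a service named mysqld.]

```shell
# Globally disable wakeup preemption. This affects every task, kthreads
# included, so it is only an approximation of running one workload as
# SCHED_BATCH. Requires root and debugfs mounted at /sys/kernel/debug.
FEATURES=/sys/kernel/debug/sched/features
if [ -w "$FEATURES" ]; then
    echo NO_WAKEUP_PREEMPTION > "$FEATURES"
    cat "$FEATURES"   # lists NO_WAKEUP_PREEMPTION while disabled
fi

# Per-service alternative: have systemd start mysqld as SCHED_BATCH,
# which is what WORKLOAD_SCHED_POLICY plumbs through to. Hypothetical
# drop-in path:
#   /etc/systemd/system/mysqld.service.d/sched.conf
#     [Service]
#     CPUSchedulingPolicy=batch
# then: systemctl daemon-reload && systemctl restart mysqld
```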
* Re: [REGRESSION] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
  2026-01-09 10:15 ` Ryan Roberts
  2026-01-12  7:47 ` Peter Zijlstra
@ 2026-01-15 10:16 ` Mel Gorman
  1 sibling, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2026-01-15 10:16 UTC (permalink / raw)
To: Ryan Roberts
Cc: Dietmar Eggemann, Peter Zijlstra (Intel), x86, linux-kernel,
	Aishwarya TCV, Madadi Vineeth Reddy

On Fri, Jan 09, 2026 at 10:15:46AM +0000, Ryan Roberts wrote:
> On 08/01/2026 13:15, Ryan Roberts wrote:
> > On 08/01/2026 08:50, Mel Gorman wrote:
> >> On Wed, Jan 07, 2026 at 04:30:09PM +0100, Dietmar Eggemann wrote:
> >>> On 05.01.26 12:45, Ryan Roberts wrote:
> >>>> On 02/01/2026 15:52, Dietmar Eggemann wrote:
> >>>>> On 02.01.26 13:38, Ryan Roberts wrote:
> >>>
> >>> [...]
> >>>
> >>
> >> Sorry for slow responses. I'm just back from holidays and unfortunately
> >> do not have access to test machines right now so cannot revalidate any
> >> of the results against 6.19-rc*.
> >
> > No problem, thanks for getting back to me!
> >
> >>
> >>>>>>> All testing is done on AWS Graviton3 (arm64) bare metal systems.
> >>>>>>> (R)/(I) mean statistically significant regression/improvement,
> >>>>>>> where "statistically significant" means the 95% confidence
> >>>>>>> intervals do not overlap".
> >>>>>
> >>>>> You mentioned that you reverted this patch 'patch 2/2 'sched/fair:
> >>>>> Reimplement NEXT_BUDDY to align with EEVDF goals'.
> >>>>>
> >>>>> Does this mean NEXT_BUDDY is still enabled, i.e. you haven't reverted
> >>>>> patch 1/2 'sched/fair: Enable scheduler feature NEXT_BUDDY' as well?
> >>>>
> >>>> Yes that's correct; patch 1 is still present. I could revert that as
> >>>> well and rerun if useful?
> >>>
> >>> Well, I assume this would be more valuable.
> >>
> >> Agreed because we need to know if it's NEXT_BUDDY that is conceptually
> >> an issue with EEVDF in these cases or the specific implementation. The
> >> comparison between
> >>
> >> 6.18A (baseline)
> >> 6.19-rcN vanilla (New NEXT_BUDDY implementation enabled)
> >> 6.19-rcN revert patches 1+2 (NEXT_BUDDY disabled)
> >> 6.19-rcN revert patch 2 only (Old NEXT_BUDDY implementation enabled)
> >
> > OK, I've already got 1, 2 and 4. Let me grab 3 and come back to you -
> > hopefully tomorrow. Then we can take it from there.
>
> Hi Mel, Dietmar,
>
> Here are the updated results, now including column for "revert #1 & #2".
>
> 6-18-0 (base)  (baseline)
> 6-19-0-rc1     (New NEXT_BUDDY implementation enabled)
> revert #1 & #2 (NEXT_BUDDY disabled)
> revert #2      (Old NEXT_BUDDY implementation enabled)

Thanks.

> The regressions that are fixed by "revert #2" (as originally reported) are
> still fixed in "revert #1 & #2". Interestingly, performance actually
> improves further for the latter in the multi-node mysql benchmark (which
> is our VIP workload).

It suggests that NEXT_BUDDY in general is harmful to this workload. In an
ideal world, this would also be checked against the NEXT_BUDDY
implementation in CFS but it would be a waste of time for many reasons. I
find it particularly interesting that it is only measurable with the
2-machine test as it suggests, but not proves, that the problem may be
related to WF_SYNC wakeups from the network layer.

> There are a couple of hackbench cases (sockets with high thread counts)
> that showed an improvement with "revert #2" but which is gone with
> "revert #1 & #2".
>
> Let me know if I can usefully do anything else.
>
> Multi-node SUT (workload running across 2 machines):
>
> +---------------------------------+----------------------------------------+---------------+------------+-----------+----------------+
> | Benchmark                       | Result Class                           | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +=================================+========================================+===============+============+===========+================+
> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 646267.33     | (R) -1.33% | (I) 5.87% | (I) 7.63%      |
> |                                 | new order rate (orders/min)            | 213256.50     | (R) -1.32% | (I) 5.87% | (I) 7.64%      |
> +---------------------------------+----------------------------------------+---------------+------------+-----------+----------------+

Ok, fairly clear there.

> Single-node SUT (workload running on single machine):
>
> +-------------------+----------------------+---------------+------------+-----------+----------------+
> | Benchmark         | Result Class         | 6-18-0 (base) | 6-19-0-rc1 | revert #2 | revert #1 & #2 |
> +===================+======================+===============+============+===========+================+
> | specjbb/composite | critical-jOPS (jOPS) | 94700.00      | (R) -5.10% | -0.90%    | -0.37%         |
> |                   | max-jOPS (jOPS)      | 113984.50     | (R) -3.90% | -0.65%    | 0.65%          |
> +-------------------+----------------------+---------------+------------+-----------+----------------+

I assume this is specjbb2015. I'm a little cautious of these results as
specjbb2015 focuses on peak performance. It starts with low CPU usage and
scales up to find the point where performance reaches a peak. This metric
can be gamed and what works for specjbb, particularly as the machine
approaches being heavily utilised and transitions to overloaded, can be
problematic.
Can you look at the detailed results for specjbb2015 and determine if the
peak was picked from different load points?

> | repro-collection/mysql-workload | db transaction rate (transactions/min) | 245438.25 | (R) -3.88% | -0.13% | 0.24% |
> | | new order rate (orders/min) | 80985.75 | (R) -3.78% | -0.07% | 0.29% |
> +---------------------------------+-----------------------------------------+-----------+-------------+-----------+--------+
> | pts/pgbench | Scale: 1 Clients: 1 Read Only (TPS) | 63124.00 | (I) 2.90% | 0.74% | 0.85% |
> | | Scale: 1 Clients: 1 Read Only - Latency (ms) | 0.016 | (I) 5.49% | 1.05% | 1.05% |
> | | Scale: 1 Clients: 1 Read Write (TPS) | 974.92 | 0.11% | -0.08% | -0.03% |
> | | Scale: 1 Clients: 1 Read Write - Latency (ms) | 1.03 | 0.12% | -0.06% | -0.06% |
> | | Scale: 1 Clients: 250 Read Only (TPS) | 1915931.58 | (R) -2.25% | (I) 2.12% | 1.62% |
> | | Scale: 1 Clients: 250 Read Only - Latency (ms) | 0.13 | (R) -2.37% | (I) 2.09% | 1.69% |
> | | Scale: 1 Clients: 250 Read Write (TPS) | 855.67 | -1.36% | -0.14% | -0.12% |
> | | Scale: 1 Clients: 250 Read Write - Latency (ms) | 292.39 | -1.31% | -0.08% | -0.08% |
> | | Scale: 1 Clients: 1000 Read Only (TPS) | 1534130.08 | (R) -11.37% | 0.08% | 0.48% |
> | | Scale: 1 Clients: 1000 Read Only - Latency (ms) | 0.65 | (R) -11.38% | 0.08% | 0.44% |
> | | Scale: 1 Clients: 1000 Read Write (TPS) | 578.75 | -1.11% | 2.15% | -0.96% |
> | | Scale: 1 Clients: 1000 Read Write - Latency (ms) | 1736.98 | -1.26% | 2.47% | -0.90% |
> | | Scale: 100 Clients: 1 Read Only (TPS) | 57170.33 | 1.68% | 0.10% | 0.22% |
> | | Scale: 100 Clients: 1 Read Only - Latency (ms) | 0.018 | 1.94% | 0.00% | 0.96% |
> | | Scale: 100 Clients: 1 Read Write (TPS) | 836.58 | -0.37% | -0.41% | 0.07% |
> | | Scale: 100 Clients: 1 Read Write - Latency (ms) | 1.20 | -0.37% | -0.40% | 0.06% |
> | | Scale: 100 Clients: 250 Read Only (TPS) | 1773440.67 | -1.61% | 1.67% | 1.34% |
> | | Scale: 100 Clients: 250 Read Only - Latency (ms) | 0.14 | -1.40% | 1.56% | 1.20% |
> | | Scale: 100 Clients: 250 Read Write (TPS) | 5505.50 | -0.17% | -0.86% | -1.66% |
> | | Scale: 100 Clients: 250 Read Write - Latency (ms) | 45.42 | -0.17% | -0.85% | -1.67% |
> | | Scale: 100 Clients: 1000 Read Only (TPS) | 1393037.50 | (R) -10.31% | -0.19% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Only - Latency (ms) | 0.72 | (R) -10.30% | -0.17% | 0.53% |
> | | Scale: 100 Clients: 1000 Read Write (TPS) | 5085.92 | 0.27% | 0.07% | -0.79% |
> | | Scale: 100 Clients: 1000 Read Write - Latency (ms) | 196.79 | 0.23% | 0.05% | -0.81% |

A few points of concern but nothing as severe as the mysql multi-node SUT.
The worst regressions are when the number of clients exceeds the number of
CPUs, and at that point any wakeup preemption is potentially harmful.

> +---------------------------------+-----------------------------------------+-----------+-------------+-----------+--------+
> | mmtests/hackbench | hackbench-process-pipes-1 (seconds) | 0.14 | -1.51% | -1.05% | -1.51% |
> | | hackbench-process-pipes-4 (seconds) | 0.44 | (I) 6.49% | (I) 5.42% | (I) 6.06% |
> | | hackbench-process-pipes-7 (seconds) | 0.68 | (R) -18.36% | (I) 3.40% | -0.41% |

So hackbench is all over the place with a mix of gains and losses, so no
clear winner.

> +---------------------------------+-----------------------------------------+-----------+-------------+-----------+--------+
> | pts/nginx | Connections: 200 (Requests Per Second) | 252332.60 | (I) 17.54% | -0.53% | -0.61% |
> | | Connections: 1000 (Requests Per Second) | 248591.29 | (I) 20.41% | 0.10% | 0.57% |
> +---------------------------------+-----------------------------------------+-----------+-------------+-----------+--------+

And this is the main winner. The results confirm that NEXT_BUDDY is not a
universal win but the mysql results and Daytrader results from Madadi are a
concern.
I still don't have access to test machines to investigate this properly and
may not have access for 1-2 weeks. I think the best approach for now is to
disable NEXT_BUDDY by default again until it's determined exactly why mysql
multi-host and daytrader suffered. Can you test this to be sure please?

--8<--
sched/fair: Disable scheduler feature NEXT_BUDDY

NEXT_BUDDY was disabled with the introduction of EEVDF and enabled again
after NEXT_BUDDY was rewritten for EEVDF by commit e837456fdca8
("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals"). It was
not expected that this would be a universal win without a crystal ball
instruction but the reported regressions are a concern [1][2] even if
gains were also reported. Specifically:

o mysql with client/server running on different servers regresses
o specjbb reports lower peak metrics
o daytrader regresses

The mysql case is realistic and a concern. It needs to be confirmed whether
specjbb is simply shifting the point where peak performance is measured,
but it is still a concern. daytrader is considered to be representative of
a real workload. Access to test machines is currently problematic for
verifying any fix to this problem. Disable NEXT_BUDDY by default for now
until the root causes are addressed.

Link: https://lore.kernel.org/lkml/4b96909a-f1ac-49eb-b814-97b8adda6229@arm.com [1]
Link: https://lore.kernel.org/lkml/ec3ea66f-3a0d-4b5a-ab36-ce778f159b5b@linux.ibm.com [2]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/features.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..136a6584be79 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -29,7 +29,7 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  * wakeup-preemption), since its likely going to consume data we
  * touched, increases cache locality.
  */
-SCHED_FEAT(NEXT_BUDDY, true)
+SCHED_FEAT(NEXT_BUDDY, false)
 
 /*
  * Allow completely ignoring cfs_rq->next; which can be set from various
* Re: [REGRESSION] [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF
  2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
  2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
  [not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
@ 2026-01-08 10:01 ` Madadi Vineeth Reddy
  2 siblings, 0 replies; 27+ messages in thread
From: Madadi Vineeth Reddy @ 2026-01-08 10:01 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Dietmar Eggemann, Valentin Schneider,
	Chris Mason, linux-kernel, Madadi Vineeth Reddy

On 12/11/25 17:55, Mel Gorman wrote:
> Changes since v4
> o Splitout decisions into separate functions (peterz)
> o Flow clarity (peterz)
>
> Changes since v3
> o Place new code near first consumer (peterz)
> o Separate between PREEMPT_SHORT and NEXT_BUDDY (peterz)
> o Naming and code flow clarity (peterz)
> o Restore slice protection (peterz)
>
> Changes since v2
> o Review feedback applied from Prateek
>
> I've been chasing down a number of schedule issues recently like many
> others and found they were broadly grouped as
>
> 1. Failure to boost CPU frequency with powersave/ondemand governors
> 2. Processors entering idle states that are too deep
> 3. Differences in wakeup latencies for wakeup-intensive workloads
>
> Adding topology into account means that there is a lot of machine-specific
> behaviour which may explain why some discussions recently have reproduction
> problems. Nevertheless, the removal of LAST_BUDDY and NEXT_BUDDY being
> disabled has an impact on wakeup latencies.
>
> This series enables NEXT_BUDDY and may select a wakee if it's eligible to
> run even though other unrelated tasks may have an earlier deadline.
>
> Mel Gorman (2):
>   sched/fair: Enable scheduler feature NEXT_BUDDY
>   sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals
>
>  kernel/sched/fair.c     | 152 ++++++++++++++++++++++++++++++++++------
>  kernel/sched/features.h |   2 +-
>  2 files changed, 131 insertions(+), 23 deletions(-)

Hi Mel, Peter,

During internal testing, I noticed an approximately 7% regression in a
real-world workload called DayTrader. Git bisect pointed to this patch:
"sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals"

Before this patch was merged, I reported a regression in v4 with schbench
and stress-ng. From that discussion:
https://lore.kernel.org/all/ddfde793-ad6e-4517-96b8-662dcb78acc8@linux.ibm.com/#t

```
So with frequent wakeups, queued tasks (even with earlier deadlines) may be
unfairly delayed. I understand that this would fade away quickly as the
woken up task that got to run due to buddy preference would accumulate
negative lag and would not be eligible to run again, but the starvation
could be higher if wakeups are very high.

To test this, I ran schbench (many message and worker threads) together
with stress-ng (CPU-bound), and observed stress-ng's bogo-ops throughput
dropped by around 64%. This shows a significant regression for CPU-bound
tasks under heavy wakeup loads.
```

I understand that stress-ng bogo-ops is not a reliable metric. However, the
problem appears to be real, as DayTrader also shows a regression with this
patch.

To check whether the WF_SYNC-related change is the issue, I tried
decreasing the threshold with
`echo 50000 > /sys/kernel/debug/sched/migration_cost_ns` so that the waker
could preempt quickly in the WF_SYNC case. This helped, but I understand
that it changes a lot of code paths that use migration_cost_ns, and when I
decreased only the threshold in this patch, the performance didn't improve.

So, I think the problem is that making tasks with earlier deadlines wait in
the presence of frequent wakeups hurts CPU-intensive workloads. Any
thoughts/ideas?
Meanwhile, I will also spend time working around this patch to see if the
performance can be improved.

Thanks,
Vineeth
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-12 12:25 [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Mel Gorman
2025-11-12 12:25 ` [PATCH 1/2] sched/fair: Enable scheduler feature NEXT_BUDDY Mel Gorman
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
[not found] ` <20251112122521.1331238-3-mgorman@techsingularity.net>
2025-11-12 14:48 ` [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals Peter Zijlstra
2025-11-13 8:26 ` Madadi Vineeth Reddy
2025-11-13 9:04 ` Mel Gorman
2025-11-14 12:13 ` Peter Zijlstra
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Mel Gorman
2025-11-17 16:23 ` tip-bot2 for Mel Gorman
2025-12-22 10:57 ` [REGRESSION] " Ryan Roberts
2026-01-02 12:38 ` Ryan Roberts
2026-01-02 15:52 ` Dietmar Eggemann
2026-01-05 11:45 ` Ryan Roberts
2026-01-05 14:38 ` Shrikanth Hegde
2026-01-05 16:33 ` Ryan Roberts
2026-01-07 15:30 ` Dietmar Eggemann
2026-01-08 8:50 ` Mel Gorman
2026-01-08 13:15 ` Ryan Roberts
2026-01-09 10:15 ` Ryan Roberts
2026-01-12 7:47 ` Peter Zijlstra
2026-01-12 8:52 ` Ryan Roberts
2026-01-12 9:57 ` Peter Zijlstra
2026-01-12 10:27 ` Ryan Roberts
2026-01-13 6:31 ` K Prateek Nayak
2026-01-15 10:16 ` Mel Gorman
2026-01-08 10:01 ` [REGRESSION] [PATCH 0/2 v5] Reintroduce NEXT_BUDDY for EEVDF Madadi Vineeth Reddy