From: Mario Roy <marioeroy@gmail.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Mason <clm@meta.com>,
Joseph Salisbury <joseph.salisbury@oracle.com>,
Adam Li <adamli@os.amperecomputing.com>,
Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>,
Josh Don <joshdon@google.com>,
mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, linux-kernel@vger.kernel.org,
kprateek.nayak@amd.com, shubhang@os.amperecomputing.com,
arighi@nvidia.com
Subject: Re: [PATCH 4/4] sched/fair: Proportional newidle balance
Date: Thu, 29 Jan 2026 20:44:50 -0500 [thread overview]
Message-ID: <5b99dd88-94ae-4469-a34a-24c32aa3af81@gmail.com> (raw)
In-Reply-To: <20260127151748.GA1079264@noisy.programming.kicks-ass.net>
Peter, thank you for your fix to improve EEVDF.
Cc'd Andrea Righi; thank you for the is_idle_core() function and help. [0]
Cc'd Shubhang Kaushik
Your patch inspired me to do trial-and-error testing, which has now
become the 0280 patch in the CachyMod GitHub repo. [0] Together with
CachyOS community members, we concluded that prefcore + prefer-idle-core
is surreal. I enjoy the EEVDF scheduler a lot more now that it favors
the SMT siblings less.
For comparison, I added results for sched-ext cosmos.
Limited CPU saturation can reveal potential scheduler issues. Testing
covers 100%, 50%, 31.25%, and 25% CPU saturation. All kernels were
built with GCC to factor out Clang/AutoFDO effects.
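For reference, the tested thread counts map onto the 48 logical CPUs of
the machine like this (a quick sketch; 24 cores / 48 threads as on the
Threadripper 9960X used here):

```python
# Saturation level for each tested thread count on a 48-thread machine.
ncpus = 48
for n in (48, 24, 15, 12):
    print(f"{n} threads -> {100 * n / ncpus:.2f}% saturation")
```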
A) 6.18.8-rc1
with sched/fair: Proportional newidle balance
                48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
algorithm3 [1]       9.462s       14.181s        20.311s       24.498s
darktable  [2]       2.811s        3.715s         5.315s        6.434s
easywave   [3]      19.747s       10.804s        20.207s       21.571s
stress-ng  [4]     37632.06      56220.21       41694.50      34740.58
B) 6.18.8-rc1
Peter Z's fix for sched/fair: Proportional newidle balance
                48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
algorithm3 [1]       9.340s       14.733s        21.339s       25.069s
darktable  [2]       2.493s        3.616s         5.148s        5.968s
easywave   [3]      11.357s      13.312s *       18.483s       20.741s
stress-ng  [4]     37533.24      55419.85       39452.17      32217.55
algorithm3 and stress-ng regressed, possibly a limited-CPU-saturation
anomaly
easywave (*): weird result; repeatable, but all over the place
C) 6.18.8-rc1
Revert sched/fair: Proportional newidle balance
                48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
algorithm3 [1]       9.286s       15.101s        21.417s       25.126s
darktable  [2]       2.484s        3.531s         5.185s        6.002s
easywave   [3]      11.517s       12.300s        18.466s       20.428s
stress-ng  [4]     42231.92     47306.18 *     32438.03 *    28820.83 *
stress-ng (*): lackluster with limited CPU saturation
D) 6.18.8-rc1
Revert sched/fair: Proportional newidle balance
Plus apply the prefer-idle-core patch [0]
                48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
algorithm3 [1]       9.312s       11.292s        17.243s       21.811s
darktable  [2]       2.418s       3.711s *       5.499s *      6.510s *
easywave   [3]      10.035s        9.832s        15.738s       18.805s
stress-ng  [4]     44837.41      63364.56       55646.26      48202.58
darktable (*): lower performance with limited CPU saturation;
noticeably better performance otherwise
E) scx_cosmos -m 0-5 -s 800 -l 8000 -f -c 1 -p 0 [5]
                48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
algorithm3 [1]       9.218s       11.188s        17.045s       21.130s
darktable  [2]       2.365s        3.900s         4.626s        5.664s
easywave   [3]       9.187s      16.528s *       15.933s       16.991s
stress-ng  [4]     21065.70      36417.65       27185.95      23141.87
easywave (*): sched-ext cosmos appears to favor SMT siblings
---
[0] https://github.com/marioroy/cachymod
    the prefer-idle-core patch is 0280-prefer-prevcpu-for-wakeup.patch
    it is more about being mindful of limited CPU saturation than about
    getting the patch accepted
    the surreal combination is prefcore + prefer-idle-core, which
    improves many workloads
[1] https://github.com/marioroy/mce-sandbox
    ./algorithm3.pl 1e12 --threads=N
    algorithm3.pl is akin to a server/client application; chatty
    primesieve.pl is more CPU-bound; less chatty
    optionally, compare with the primesieve binary (fully CPU-bound,
    not chatty): https://github.com/kimwalisch/primesieve
[2] https://math.dartmouth.edu/~sarunas/darktable_bench.html
OMP_NUM_THREADS=N darktable-cli setubal.orf setubal.orf.xmp test.jpg \
--core --disable-opencl -d perf
result: pixel pipeline processing took {...} secs
[3] https://openbenchmarking.org/test/pts/easywave
OMP_NUM_THREADS=N ./src/easywave \
-grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt \
-time 600
result: Model time = 10:00:00, elapsed: {...} msec
[4] https://openbenchmarking.org/test/pts/stress-ng
stress-ng -t 30 --metrics-brief --sock N --no-rand-seed --sock-zerocopy
    result: bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
                      (secs)     (secs)    (secs)    (real time)  (usr+sys time)
            {...}
    this involves 2x N CPUs due to the { writer, reader } thread pair
    per sock, hence the added 12-CPU result (12 x 2 = 24 <= 50%
    saturation)
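The doubling can be sketched as a quick sanity check:

```python
# stress-ng --sock N spawns a { writer, reader } thread pair per socket
# instance, so N sockets keep roughly 2*N logical CPUs busy.
def sock_saturation(n_socks, ncpus=48):
    """Percentage of logical CPUs occupied by n_socks --sock instances."""
    return 100 * (2 * n_socks) / ncpus

print(sock_saturation(12))  # 12 socks -> 24 busy CPUs -> 50.0
```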
[5] https://github.com/sched-ext/scx
cargo build --release -p scx_cosmos
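For my own understanding, the rate term from the patch quoted below can
be sketched in userspace like this (assuming sched_clock() deltas are
in nanoseconds; the shift of 22 divides by ~4.19e6, so a window of
roughly 4.3 seconds between 1024-call updates alone yields a ratio of
1024):

```python
def newidle_ratio(delta_ns, success, shift=22):
    """Sketch of the NI_RATE computation in the quoted patch.
    The time term (delta_ns >> shift) dominates at low newidle rates,
    so infrequent newidle saturates the ratio at 1024 (always balance);
    at high rates the success count dominates instead."""
    delta_ns = max(0, delta_ns)            # mirror the delta < 0 clamp
    ratio = (delta_ns >> shift) + success  # time term plus success term
    return min(1024, ratio)

# 1024 calls spread over ~4.3 s with ~10% success: always balance.
assert newidle_ratio(int(4.3e9), 102) == 1024
# 1024 calls in 0.1 s (high rate): the success ratio dominates.
assert newidle_ratio(int(1e8), 102) == 125
```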
On 1/27/26 10:17 AM, Peter Zijlstra wrote:
> On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
>>> On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
>>>> On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
>>>>> The patch "Proportional newidle balance" introduced a regression
>>>>> with Linux 6.12.65 and 6.18.5. There is noticeable regression with
>>>>> easyWave testing. [1]
>>>>>
>>>>> The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
>>>>> to install easyWave [2]. That is fetching the two tar.gz archives.
>>>> What is the actual configuration of that chip? Is it like 3*8 or 4*6
>>>> (CCX wise). A quick google couldn't find me the answer :/
>>> Obviously I found it right after sending this. It's a 4x6 config.
>>> Meaning it needs newidle to balance between those 4 domains.
>> So with the below patch on top of my Xeon w7-2495X (which is 24-core
>> 48-thread) I too have 4 LLC :-)
>>
>> And I think I can see a slight difference, but nowhere near as terrible.
>>
>> Let me go stick some tracing on.
> Does this help some?
>
> Turns out, this easywave thing has a very low newidle rate, but then
> also a fairly low success rate. But since it doesn't do it that often,
> the cost isn't that significant so we might as well always do it etc..
>
> This adds a second term to the ratio computation that takes time into
> account. For low-rate newidle this term will dominate, while for higher
> rates the success ratio is more important.
>
> Chris, afaict this still DTRT for schbench, but if this works for Mario,
> could you also re-run things at your end?
>
> [ the 4 'second' thing is a bit random, but looking at the timings
> between easywave and schbench this seems to be a reasonable middle
> ground. Although I think 8 'seconds' -- 23 shift -- would also work.
>
> That would give:
>
> 1024 - 8 s - 64 Hz
> 512 - 4 s - 128 Hz
> 256 - 2 s - 256 Hz
> 128 - 1 s - 512 Hz
> 64 - .5 s - 1024 Hz
> 32 - .25 s - 2048 Hz
> ]
>
> ---
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 45c0022b91ce..a1e1032426dc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -95,6 +95,7 @@ struct sched_domain {
> unsigned int newidle_call;
> unsigned int newidle_success;
> unsigned int newidle_ratio;
> + u64 newidle_stamp;
> u64 max_newidle_lb_cost;
> unsigned long last_decay_max_lb_cost;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca642295c4b..ab9cf06c6a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
> sd->newidle_call++;
> sd->newidle_success += success;
>
> if (sd->newidle_call >= 1024) {
> - sd->newidle_ratio = sd->newidle_success;
> + u64 now = sched_clock();
> + s64 delta = now - sd->newidle_stamp;
> + sd->newidle_stamp = now;
> + int ratio = 0;
> +
> + if (delta < 0)
> + delta = 0;
> +
> + if (sched_feat(NI_RATE)) {
> + /*
> + * ratio delta freq
> + *
> + * 1024 - 4 s - 128 Hz
> + * 512 - 2 s - 256 Hz
> + * 256 - 1 s - 512 Hz
> + * 128 - .5 s - 1024 Hz
> + * 64 - .25 s - 2048 Hz
> + */
> + ratio = delta >> 22;
> + }
> +
> + ratio += sd->newidle_success;
> +
> + sd->newidle_ratio = min(1024, ratio);
> sd->newidle_call /= 2;
> sd->newidle_success /= 2;
> }
> @@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> unsigned int weight = 1;
>
> - if (sched_feat(NI_RANDOM)) {
> + if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
> /*
> * Throw a 1k sided dice; and only run
> * newidle_balance according to the success
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..7aba7523c6c1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
> * Do newidle balancing proportional to its success rate using randomization.
> */
> SCHED_FEAT(NI_RANDOM, true)
> +SCHED_FEAT(NI_RATE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..05741f18f334 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -4,6 +4,7 @@
> */
>
> #include <linux/sched/isolation.h>
> +#include <linux/sched/clock.h>
> #include <linux/bsearch.h>
> #include "sched.h"
>
> @@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
> struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
> int sd_id, sd_weight, sd_flags = 0;
> struct cpumask *sd_span;
> + u64 now = sched_clock();
>
> sd_weight = cpumask_weight(tl->mask(tl, cpu));
>
> @@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
> .newidle_call = 512,
> .newidle_success = 256,
> .newidle_ratio = 512,
> + .newidle_stamp = now,
>
> .max_newidle_lb_cost = 0,
> .last_decay_max_lb_cost = jiffies,