Message-ID: <5b99dd88-94ae-4469-a34a-24c32aa3af81@gmail.com>
Date: Thu, 29 Jan 2026 20:44:50 -0500
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] sched/fair: Proportional newidle balance
To: Peter Zijlstra
Cc: Chris Mason, Joseph Salisbury, Adam Li, Hazem Mohamed Abuelfotoh,
 Josh Don, mingo@redhat.com, juri.lelli@redhat.com,
 vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org,
 bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
 linux-kernel@vger.kernel.org, kprateek.nayak@amd.com,
 shubhang@os.amperecomputing.com, arighi@nvidia.com
References: <20251107160645.929564468@infradead.org>
 <20251107161739.770122091@infradead.org>
 <8760001e-0274-454c-a4e4-1f38a9695b88@gmail.com>
 <20260123105046.GM171111@noisy.programming.kicks-ass.net>
 <20260123110306.GA217302@noisy.programming.kicks-ass.net>
 <20260127104041.GD217302@noisy.programming.kicks-ass.net>
 <20260127151748.GA1079264@noisy.programming.kicks-ass.net>
From: Mario Roy
In-Reply-To: <20260127151748.GA1079264@noisy.programming.kicks-ass.net>

Peter, thank you for your fix to improve EEVDF.

Cc'd Andrea Righi
Thank you for the is_idle_core() function and help. [0]

Cc'd Shubhang Kaushik
Your patch inspired me to perform trial and error testing.
That work has now become the 0280 patch in the CachyMod GitHub repo. [0]

Together with the help of CachyOS community members, we found the
combination of prefcore + prefer-idle-core to be surreal. I enjoy the
EEVDF scheduler a lot more now that it favors the SMT siblings less.

For comparison, I added results for sched-ext cosmos.

Limited CPU saturation can reveal potential scheduler issues.
Testing covers 100%, 50%, 31.25%, and 25% CPU saturation.
All kernels were built with GCC to factor out Clang/AutoFDO.

A) 6.18.8-rc1
   with sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]         9.462s      14.181s         20.311s      24.498s
   darktable  [2]         2.811s       3.715s          5.315s       6.434s
   easywave   [3]        19.747s      10.804s         20.207s      21.571s
   stress-ng  [4]       37632.06     56220.21        41694.50     34740.58

B) 6.18.8-rc1
   Peter Z's fix for sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]         9.340s      14.733s         21.339s      25.069s
   darktable  [2]         2.493s       3.616s          5.148s       5.968s
   easywave   [3]        11.357s      13.312s *       18.483s      20.741s
   stress-ng  [4]       37533.24     55419.85        39452.17     32217.55

   algorithm3 and stress-ng regressed, possibly a limited CPU saturation anomaly
   easywave (*) weird result, repeatable and all over the place

C) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]         9.286s      15.101s         21.417s      25.126s
   darktable  [2]         2.484s       3.531s          5.185s       6.002s
   easywave   [3]        11.517s      12.300s         18.466s      20.428s
   stress-ng  [4]       42231.92     47306.18 *      32438.03 *   28820.83 *

   stress-ng (*) lackluster with limited CPU saturation

D) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance
   Plus apply the prefer-idle-core patch [0]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]         9.312s      11.292s         17.243s      21.811s
   darktable  [2]         2.418s       3.711s *        5.499s *     6.510s *
   easywave   [3]        10.035s       9.832s         15.738s      18.805s
   stress-ng  [4]       44837.41     63364.56        55646.26     48202.58

   darktable (*) lower performance with limited CPU saturation
   noticeably better performance otherwise

E) scx_cosmos -m 0-5 -s 800 -l 8000 -f -c 1 -p 0 [5]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]         9.218s      11.188s         17.045s      21.130s
   darktable  [2]         2.365s       3.900s          4.626s       5.664s
   easywave   [3]         9.187s      16.528s *       15.933s      16.991s
   stress-ng  [4]       21065.70     36417.65        27185.95     23141.87

   easywave (*) sched-ext cosmos appears to favor SMT siblings

---
[0] https://github.com/marioroy/cachymod
    the prefer-idle-core patch is 0280-prefer-prevcpu-for-wakeup.patch
    more about mindfulness for limited CPU saturation than about accepting the patch
    surreal is prefcore + prefer-idle-core, improving many workloads

[1] https://github.com/marioroy/mce-sandbox
    ./algorithm3.pl 1e12 --threads=N
    algorithm3.pl is akin to a server/client application; chatty
    primesieve.pl is more CPU-bound; less chatty
    optionally, compare with the primesieve binary (fully CPU-bound, not chatty)
    https://github.com/kimwalisch/primesieve

[2] https://math.dartmouth.edu/~sarunas/darktable_bench.html
    OMP_NUM_THREADS=N darktable-cli setubal.orf setubal.orf.xmp test.jpg \
    --core --disable-opencl -d perf
    result: pixel pipeline processing took {...} secs

[3] https://openbenchmarking.org/test/pts/easywave
    OMP_NUM_THREADS=N ./src/easywave \
    -grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt \
    -time 600
    result: Model time = 10:00:00,   elapsed: {...} msec

[4] https://openbenchmarking.org/test/pts/stress-ng
    stress-ng -t 30 --metrics-brief --sock N --no-rand-seed --sock-zerocopy
    result: bogo ops real time  usr time  sys time   bogo ops/s  bogo ops/s
                       (secs)    (secs)    (secs)   (real time) (usr+sys time)
                                                       {...}
    this involves 2x NCPUs due to { writer, reader } threads per sock
    thus the reason for adding the 12cpus result (12 x 2 = 24 <= 50% saturation)

[5] https://github.com/sched-ext/scx
    cargo build --release -p scx_cosmos

On 1/27/26 10:17 AM, Peter Zijlstra wrote:
> On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
>>> On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
>>>> On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
>>>>> The patch "Proportional newidle balance" introduced a regression
>>>>> with Linux 6.12.65 and 6.18.5. There is noticeable regression with
>>>>> easyWave testing. [1]
>>>>>
>>>>> The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
>>>>> to install easyWave [2]. That is fetching the two tar.gz archives.
>>>> What is the actual configuration of that chip? Is it like 3*8 or 4*6
>>>> (CCX wise). A quick google couldn't find me the answer :/
>>> Obviously I found it right after sending this. It's a 4x6 config.
>>> Meaning it needs newidle to balance between those 4 domains.
>> So with the below patch on top of my Xeon w7-2495X (which is 24-core
>> 48-thread) I too have 4 LLC :-)
>>
>> And I think I can see a slight difference, but nowhere near as terrible.
>>
>> Let me go stick some tracing on.
> Does this help some?
>
> Turns out, this easywave thing has a very low newidle rate, but then
> also a fairly low success rate. But since it doesn't do it that often,
> the cost isn't that significant so we might as well always do it etc..
>
> This adds a second term to the ratio computation that takes time into
> account, For low rate newidle this term will dominate, while for higher
> rate the success ratio is more important.
>
> Chris, afaict this still DTRT for schbench, but if this works for Mario,
> could you also re-run things at your end?
>
> [ the 4 'second' thing is a bit random, but looking at the timings
>   between easywave and schbench this seems to be a reasonable middle
>   ground. Although I think 8 'seconds' -- 23 shift -- would also work.
>
>   That would give:
>
>     1024 -   8 s -   64 Hz
>      512 -   4 s -  128 Hz
>      256 -   2 s -  256 Hz
>      128 -   1 s -  512 Hz
>       64 -  .5 s - 1024 Hz
>       32 - .25 s - 2048 Hz
> ]
>
> ---
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 45c0022b91ce..a1e1032426dc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -95,6 +95,7 @@ struct sched_domain {
>  	unsigned int newidle_call;
>  	unsigned int newidle_success;
>  	unsigned int newidle_ratio;
> +	u64 newidle_stamp;
>  	u64 max_newidle_lb_cost;
>  	unsigned long last_decay_max_lb_cost;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca642295c4b..ab9cf06c6a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
>  	sd->newidle_call++;
>  	sd->newidle_success += success;
>
>  	if (sd->newidle_call >= 1024) {
> -		sd->newidle_ratio = sd->newidle_success;
> +		u64 now = sched_clock();
> +		s64 delta = now - sd->newidle_stamp;
> +		sd->newidle_stamp = now;
> +		int ratio = 0;
> +
> +		if (delta < 0)
> +			delta = 0;
> +
> +		if (sched_feat(NI_RATE)) {
> +			/*
> +			 * ratio  delta   freq
> +			 *
> +			 * 1024 -  4 s  -  128 Hz
> +			 *  512 -  2 s  -  256 Hz
> +			 *  256 -  1 s  -  512 Hz
> +			 *  128 - .5 s  - 1024 Hz
> +			 *   64 - .25 s - 2048 Hz
> +			 */
> +			ratio = delta >> 22;
> +		}
> +
> +		ratio += sd->newidle_success;
> +
> +		sd->newidle_ratio = min(1024, ratio);
>  		sd->newidle_call /= 2;
>  		sd->newidle_success /= 2;
>  	}
> @@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  			unsigned int weight = 1;
>
> -			if (sched_feat(NI_RANDOM)) {
> +			if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
>  				/*
>  				 * Throw a 1k sided dice; and only run
>  				 * newidle_balance according to the success
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..7aba7523c6c1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
>   * Do newidle balancing proportional to its success rate using randomization.
>   */
>  SCHED_FEAT(NI_RANDOM, true)
> +SCHED_FEAT(NI_RATE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..05741f18f334 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -4,6 +4,7 @@
>   */
>
>  #include
> +#include
>  #include
>  #include "sched.h"
>
> @@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
>  	int sd_id, sd_weight, sd_flags = 0;
>  	struct cpumask *sd_span;
> +	u64 now = sched_clock();
>
>  	sd_weight = cpumask_weight(tl->mask(tl, cpu));
>
> @@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		.newidle_call = 512,
>  		.newidle_success = 256,
>  		.newidle_ratio = 512,
> +		.newidle_stamp = now,
>
>  		.max_newidle_lb_cost = 0,
>  		.last_decay_max_lb_cost = jiffies,
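
For readers following the arithmetic, the time term in the quoted patch can
be sketched in plain userspace C. This is only a rough sketch under stated
assumptions, not the kernel code: newidle_ratio_sketch(), newidle_rate_hz()
and NEWIDLE_WINDOW_CALLS are hypothetical names introduced here. It also
assumes steady state, where newidle_call is halved from 1024 to 512 on each
update, so about 512 further calls separate two updates.

```c
#include <stdint.h>

/*
 * Assumption: in steady state newidle_call is halved from 1024 to 512,
 * so roughly 512 newidle calls pass between two ratio updates.
 */
#define NEWIDLE_WINDOW_CALLS 512

/*
 * Hypothetical userspace mirror of the ratio update in the quoted hunk.
 *
 * delta_ns: time since the previous update, in nanoseconds
 * success:  successful newidle balances counted in the window
 * shift:    22 in the quoted patch; 23 is the alternative from the mail
 */
static inline int newidle_ratio_sketch(int64_t delta_ns, unsigned int success,
				       unsigned int shift)
{
	int64_t ratio;

	/* Guard against a clock that appears to run backwards. */
	if (delta_ns < 0)
		delta_ns = 0;

	/* Time term plus success term, clamped to the 1024 scale. */
	ratio = (delta_ns >> shift) + success;

	return ratio > 1024 ? 1024 : (int)ratio;
}

/* Approximate newidle call rate implied by a window of delta_ns. */
static inline double newidle_rate_hz(int64_t delta_ns)
{
	if (delta_ns <= 0)
		return 0.0;

	return NEWIDLE_WINDOW_CALLS * 1e9 / (double)delta_ns;
}
```

With shift 22, a window of 2^32 ns (about 4.3 s) already yields a time term
of 1024, which corresponds to roughly 119 calls per second. That matches the
"1024 - 4 s - 128 Hz" row in the patch comment, which rounds to powers of
two. A busy domain with a 1 ms window gets a time term of 0, so only the
success count matters there.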