public inbox for linux-kernel@vger.kernel.org
From: "Chen, Yu C" <yu.c.chen@intel.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tim Chen <tim.c.chen@linux.intel.com>
Cc: Pan Deng <pan.deng@intel.com>, <mingo@kernel.org>,
	<linux-kernel@vger.kernel.org>, <tianyou.li@intel.com>
Subject: Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
Date: Thu, 2 Apr 2026 11:15:03 +0800	[thread overview]
Message-ID: <64649c85-29ab-4f70-a0c4-3c83cbdae2fc@intel.com> (raw)
In-Reply-To: <22072ef8-5aec-49ac-9cc4-8a80bec14261@amd.com>

Hello Prateek,

On 3/31/2026 6:19 PM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 3/31/2026 11:07 AM, Chen, Yu C wrote:
>> update of the test:
>> With above change, I did a simple hackbench test on
>> a system with multiple LLCs within 1 node, so the benefit
>> is significant(+12%~+30%) when system is under-loaded, while
>> some regression when overloaded(-10%)(need to figure out)
> 
> Could it be because of how we are traversing the CPUs now for idle load
> balancing? Since we use the first set bit for ilb_cpu and also start
> balancing from that very CPU, we might just stop after a successful
> balance on the ilb_cpu.
> 
> Would something like below on top of Peter's suggestion + your fix help?
> 
>    (lightly tested; Has survived sched messaging on baremetal)
> 
> diff --git a/include/linux/sbm.h b/include/linux/sbm.h
> index 8beade6c0585..98c4c1866534 100644
> --- a/include/linux/sbm.h
> +++ b/include/linux/sbm.h
> @@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
>   	return __sbm_op(sbm, test_bit);
>   }
>   
> +static __always_inline
> +int sbm_find_next_bit_wrap(struct sbm *sbm, int start)
> +{
> +	int bit = sbm_find_next_bit(sbm, start);
> +
> +	if (bit >= 0 || start == 0)
> +		return bit;
> +
> +	bit = sbm_find_next_bit(sbm, 0);
> +	return bit < start ? bit : -1;
> +}
> +
> +static __always_inline
> +int __sbm_for_each_wrap(struct sbm *sbm, int start, int n)
> +{
> +	int bit;
> +
> +	/* If not wrapped around */
> +	if (n > start) {
> +		/* and have a bit, just return it. */
> +		bit = sbm_find_next_bit(sbm, n);
> +		if (bit >= 0)
> +			return bit;
> +
> +		/* Otherwise, wrap around and ... */
> +		n = 0;
> +	}
> +
> +	/* Search the other part. */
> +	bit = sbm_find_next_bit(sbm, n);
> +	return bit < start ? bit : -1;
> +}
> +
>   #define sbm_for_each_set_bit(sbm, idx) \
>   	for (int idx = sbm_find_next_bit(sbm, 0); \
>   	     idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
>   
> +#define sbm_for_each_set_bit_wrap(sbm, idx, start) \
> +	for (int idx = sbm_find_next_bit_wrap(sbm, start); \
> +	     idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1))
> +
>   #endif /* _LINUX_SBM_H */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3a423c4706e..f485afb6286d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
>   	int this_cpu = this_rq->cpu;
>   	int balance_cpu;
>   	struct rq *rq;
> +	u32 start;
>   
>   	WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>   
> @@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
>   	 * Start with the next CPU after this_cpu so we will end with this_cpu and let a
>   	 * chance for other idle cpu to pull load.
>   	 */
> -	sbm_for_each_set_bit(nohz.sbm, idx) {
> +	start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids);
> +	sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) {
>   		balance_cpu = arch_sbm_idx_to_cpu(idx);
>   
>   		if (!idle_cpu(balance_cpu))
> ---
> 
> This is pretty much giving me similar performance as tip for sched
> messaging runs under heavy load but your mileage may vary :-)
> 

Thanks very much for providing this optimization. It should help more
nohz idle CPUs, beyond just the currently selected ilb_cpu, assist in
offloading work. However, when I applied this patch and reran the
test, it appeared to introduce some regressions (both underloaded and
overloaded) compared to the baseline without Peter's sbm applied.

One suspicion is that with sbm enabled (without your patch), more
tasks are "aggregated" onto the first CPU (or the front portion) of
nohz.sbm, because sbm_for_each_set_bit() always picks the first
idle CPU to pull work. As we already know, hackbench on our
platform strongly prefers tasks being aggregated rather than
spread across different LLCs. So with the spreading fix, hackbench
tasks might be placed on different LLCs. Anyway, I'll run more
rounds of testing to check whether this is consistent or merely
run-to-run variance, and I'll try other workloads besides
hackbench. Or do you have a suggestion for a workload that is
sensitive to nohz cpumask access? (I chose hackbench because I
found Shrikanth was using hackbench for nohz evaluation in
commit 5d86d542f6.)

thanks,
Chenyu



  reply	other threads:[~2026-04-02  3:15 UTC|newest]

Thread overview: 41+ messages
2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2026-03-20 10:09   ` Peter Zijlstra
2026-03-24  9:36     ` Deng, Pan
2026-03-24 12:11       ` Peter Zijlstra
2026-03-27 10:17         ` Deng, Pan
2026-04-02 10:37           ` Deng, Pan
2026-04-02 10:43           ` Peter Zijlstra
2026-04-08 10:16   ` Chen, Yu C
2026-04-09 11:47     ` Deng, Pan
2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
2026-03-20 10:18   ` Peter Zijlstra
2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2026-03-20 10:24   ` Peter Zijlstra
2026-03-23 18:09     ` Tim Chen
2026-03-24 12:16       ` Peter Zijlstra
2026-03-24 22:40         ` Tim Chen
2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
2026-03-20 12:40   ` Peter Zijlstra
2026-03-23 18:45     ` Tim Chen
2026-03-24 12:00       ` Peter Zijlstra
2026-03-31  5:37         ` Chen, Yu C
2026-03-31 10:19           ` K Prateek Nayak
2026-04-02  3:15             ` Chen, Yu C [this message]
2026-04-02  4:41               ` K Prateek Nayak
2026-04-02 10:55                 ` Peter Zijlstra
2026-04-02 11:06                   ` K Prateek Nayak
2026-04-03  5:46                     ` Chen, Yu C
2026-04-03  8:13                       ` K Prateek Nayak
2026-04-07 20:35                       ` Tim Chen
2026-04-08  3:06                         ` K Prateek Nayak
2026-04-08 11:35                           ` Chen, Yu C
2026-04-08 15:52                             ` K Prateek Nayak
2026-04-09  5:17                               ` K Prateek Nayak
2026-04-09 23:09                                 ` Tim Chen
2026-04-10  5:51                                   ` Chen, Yu C
2026-04-10  6:02                                     ` K Prateek Nayak
2026-04-08  9:25                         ` Chen, Yu C
2026-04-08 16:47                           ` Tim Chen
2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
2026-03-20 12:50   ` Peter Zijlstra
