public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched/fair: Avoid false sharing in nohz struct
@ 2025-12-11  5:56 Wangyang Guo
  2025-12-21 13:05 ` Shrikanth Hegde
  0 siblings, 1 reply; 7+ messages in thread
From: Wangyang Guo @ 2025-12-11  5:56 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
  Cc: linux-kernel, Wangyang Guo, Benjamin Lei, Tim Chen, Tianyou Li

There are two potential false sharing issues in the nohz struct:
1. idle_cpus_mask is a read-mostly field, but shares the same cacheline
   with the frequently updated nr_cpus.
2. Data following nohz still shares the same cacheline and has a
   potential false sharing issue.

This patch tries to resolve the above two problems by isolating the
frequently updated fields in a single cacheline.

Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 kernel/sched/fair.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..bcc2766b7986 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 #ifdef CONFIG_NO_HZ_COMMON
 
 static struct {
-	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
+	/* Isolate frequently updated fields in a cacheline to avoid false sharing issue. */
+	atomic_t nr_cpus ____cacheline_aligned;
 	int has_blocked;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
 	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
-} nohz ____cacheline_aligned;
+	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
+} nohz;
 
 #endif /* CONFIG_NO_HZ_COMMON */
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
@ 2025-12-19  1:38 Shubhang Kaushik Prasanna Kumar
  2025-12-19 13:56 ` Guo, Wangyang
  0 siblings, 1 reply; 7+ messages in thread
From: Shubhang Kaushik Prasanna Kumar @ 2025-12-19  1:38 UTC (permalink / raw)
  To: wangyang.guo@intel.com
  Cc: benjamin.lei@intel.com, bsegall@google.com,
	dietmar.eggemann@arm.com, juri.lelli@redhat.com,
	linux-kernel@vger.kernel.org, mgorman@suse.de, mingo@redhat.com,
	peterz@infradead.org, rostedt@goodmis.org, tianyou.li@intel.com,
	tim.c.chen@linux.intel.com, vincent.guittot@linaro.org,
	vschneid@redhat.com

Hi Wangyang Guo,

While the intuition behind isolating the `nr_cpus` counter seems correct, could you please justify the added padding? As this is a high-contention path in the scheduler, we shouldn't be inflating global structures with padding on logic alone. I'd like to see some benchmarking, such as `perf c2c` results from a multi-core system, showing that the false sharing scenario is a measurable bottleneck.

I am also concerned about the internal layout. By sandwiching the timer fields between two `____cacheline_aligned` boundaries, we might just be shifting the contention rather than fixing it. Please verify that fields like `next_balance` aren't being squeezed into a new conflict zone. I would like to review the benchmark data and the struct layout before we move forward.

Thanks,
Shubhang Kaushik

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
  2025-12-19  1:38 Shubhang Kaushik Prasanna Kumar
@ 2025-12-19 13:56 ` Guo, Wangyang
  0 siblings, 0 replies; 7+ messages in thread
From: Guo, Wangyang @ 2025-12-19 13:56 UTC (permalink / raw)
  To: Shubhang Kaushik Prasanna Kumar
  Cc: benjamin.lei, bsegall, dietmar.eggemann, juri.lelli, linux-kernel,
	mgorman, mingo, peterz, rostedt, tianyou.li, tim.c.chen,
	vincent.guittot, vschneid

On 12/19/2025 9:38 AM, Shubhang Kaushik Prasanna Kumar wrote:

> While the intuition behind isolating the `nr_cpus` counter seems correct, could you please justify the added padding ? As this is a high contention path in the scheduler, we shouldn't be inflating global structures with padding on logic alone. I’d like to see some benchmarking such as `perf c2c` results from a multi-core system proving the `false sharing` scenario as a measurable bottleneck.

Here is the cache line view in c2c report:
     Num RmtHitm LclHitm  Offset records             Symbol
   6.25%   0.00%   0.00%   0x0       4   [k] _nohz_idle_balance.isra.0
  18.75% 100.00%   0.00%   0x8      14   [k] nohz_balance_exit_idle
   6.25%   0.00%   0.00%   0x8       8   [k] nohz_balance_enter_idle
   6.25%   0.00%   0.00%   0xc       8   [k] sched_balance_newidle
   6.25%   0.00%   0.00%  0x10      31   [k] nohz_balancer_kick
   6.25%   0.00%   0.00%  0x20      16   [k] sched_balance_newidle
  37.50%   0.00%   0.00%  0x38      50   [k] irqtime_account_irq
   6.25%   0.00%   0.00%  0x38      47   [k] account_process_tick
   6.25%   0.00%   0.00%  0x38      12   [k] account_idle_ticks

Offsets:
*  0x0 -- nohz.idle_cpus_mask (r)
*  0x8 -- nohz.nr_cpus (w)
* 0x38 -- sched_clock_irqtime (r), not in nohz, but shares the cacheline

The layout in /proc/kallsyms can also confirm that:
ffffffff88600d40 b nohz
ffffffff88600d68 B arch_needs_tick_broadcast
ffffffff88600d6c b __key.264
ffffffff88600d6c b __key.265
ffffffff88600d70 b dl_generation
ffffffff88600d78 b sched_clock_irqtime

In perf cycle hotspots, we noticed that irqtime_account_irq has 3% 
overhead caused by false sharing. With the patch applied, that hotspot 
disappears.

> I am also concerned about the internal layout. By sandwiching the timer fields between two `__cacheline_aligned` boundaries, we might just be shifting the contention rather than fixing it.  See to it that fields like `next_balance` aren't being squeezed into a new conflict zone. Would like to review the benchmark data and the struct layout before we move forward.

Before the patch, all the nohz members stay in the same cache line. 
After the layout fix, fields like `next_balance` are still in the same 
cache line, so no new conflict zone is created.

Since nohz is global, the size inflation is minimal: less than 64 bytes 
in total.

BR
Wangyang



* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
  2025-12-11  5:56 [PATCH] sched/fair: Avoid false sharing in nohz struct Wangyang Guo
@ 2025-12-21 13:05 ` Shrikanth Hegde
  2025-12-22  2:21   ` Guo, Wangyang
  0 siblings, 1 reply; 7+ messages in thread
From: Shrikanth Hegde @ 2025-12-21 13:05 UTC (permalink / raw)
  To: Wangyang Guo
  Cc: linux-kernel, Benjamin Lei, Tim Chen, Tianyou Li, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hi Wangyang,

On 12/11/25 11:26 AM, Wangyang Guo wrote:
> There are two potential false sharing issue in nohz struct:
> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>     with frequently updated nr_cpus.

Updates to idle_cpus_mask are not in the same cacheline, even though it is updated alongside nr_cpus.

with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
Updates to it happen in another cacheline.

with CPUMASK_OFFSTACK=n, idle_cpus_mask is the mask itself and its length
depends on NR_CPUS. With typical values of 512/2048/8192 it can span a few
cachelines, so updates to it are likely in a different cacheline than nr_cpus.

see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/


Likely in your case, the nr_cpus updates are the costly ones.
Try the patch below and see if it helps fix your issue too:
https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
I should send out a new version soon.

> 2. Data followed by nohz still share the same cacheline and has
>     potential false sharing issue.
> 

How does your patch handle this?
I don't see any other struct apart from nohz being changed.


> This patch tries to resolve the above two problems by isolating the
> frequently updated fields in a single cacheline.
> 
> Reported-by: Benjamin Lei <benjamin.lei@intel.com>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Tianyou Li <tianyou.li@intel.com>
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>   kernel/sched/fair.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b752324270b..bcc2766b7986 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>   #ifdef CONFIG_NO_HZ_COMMON
>   
>   static struct {
> -	cpumask_var_t idle_cpus_mask;
> -	atomic_t nr_cpus;
> +	/* Isolate frequently updated fields in a cacheline to avoid false sharing issue. */
> +	atomic_t nr_cpus ____cacheline_aligned;
>   	int has_blocked;		/* Idle CPUS has blocked load */
>   	int needs_update;		/* Newly idle CPUs need their next_balance collated */
>   	unsigned long next_balance;     /* in jiffy units */
>   	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
> -} nohz ____cacheline_aligned;
> +	cpumask_var_t idle_cpus_mask ____cacheline_aligned;
> +} nohz;
>

This can cause a lot of space wastage;
for example, powerpc has a 128-byte cacheline.


* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
  2025-12-21 13:05 ` Shrikanth Hegde
@ 2025-12-22  2:21   ` Guo, Wangyang
  2025-12-23  7:27     ` Shrikanth Hegde
  0 siblings, 1 reply; 7+ messages in thread
From: Guo, Wangyang @ 2025-12-22  2:21 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, Benjamin Lei, Tim Chen, Tianyou Li, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 12/21/2025 9:05 PM, Shrikanth Hegde wrote:
> Hi Wangyang,
> 
> On 12/11/25 11:26 AM, Wangyang Guo wrote:
>> There are two potential false sharing issue in nohz struct:
>> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>>     with frequently updated nr_cpus.
> 
> Updates to idle_cpus_mask is not same cacheline. it is updated alongside 
> nr_cpus.
> 
> with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
> Updates to it happen in another cacheline.
> 
> with CPUMASK_OFFSTACK=n, idle_cpus_mask is on the stack and its length
> depends on NR_CPUS. typical value being 512/2048/8192 it can span a few
> cachelines. So updates to it likely in different cacheline compared to 
> nr_cpus.
> 
> see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/
> 
This patch mainly targets the case where idle_cpus_mask is a pointer, 
which is the default for many distro kernels.

> 
> Likely in your case, nr_cpus updates are the costly ones.
> Try below and see if it helps to fix your issue too.
> https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
> I Should send out new version soon.
> 
>> 2. Data followed by nohz still share the same cacheline and has
>>     potential false sharing issue.
>>
> 
> How does your patch handle this?
> I don't see any other struct apart from nohz being changed.

The data following nohz is implicit and determined by the compiler and linker.
For example, this is the layout from /proc/kallsyms on my machine:
ffffffff88600d40 b nohz
ffffffff88600d68 B arch_needs_tick_broadcast
ffffffff88600d6c b __key.264
ffffffff88600d6c b __key.265
ffffffff88600d70 b dl_generation
ffffffff88600d78 b sched_clock_irqtime

What we can do is place the read-mostly `idle_cpus_mask` pointer in a new 
cacheline, so data following nohz is not affected by nr_cpus updates.

> 
>> This patch tries to resolve the above two problems by isolating the
>> frequently updated fields in a single cacheline.
>>
>> Reported-by: Benjamin Lei <benjamin.lei@intel.com>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reviewed-by: Tianyou Li <tianyou.li@intel.com>
>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>> ---
>>   kernel/sched/fair.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5b752324270b..bcc2766b7986 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, 
>> should_we_balance_tmpmask);
>>   #ifdef CONFIG_NO_HZ_COMMON
>>   static struct {
>> -    cpumask_var_t idle_cpus_mask;
>> -    atomic_t nr_cpus;
>> +    /* Isolate frequently updated fields in a cacheline to avoid 
>> false sharing issue. */
>> +    atomic_t nr_cpus ____cacheline_aligned;
>>       int has_blocked;        /* Idle CPUS has blocked load */
>>       int needs_update;        /* Newly idle CPUs need their 
>> next_balance collated */
>>       unsigned long next_balance;     /* in jiffy units */
>>       unsigned long next_blocked;    /* Next update of blocked load in 
>> jiffies */
>> -} nohz ____cacheline_aligned;
>> +    cpumask_var_t idle_cpus_mask ____cacheline_aligned;
>> +} nohz;
>>
> 
> This can cause a lot of space wastage.
> for exp: powerpc has 128 byte cacheline.
> 

nohz is global; only one exists. The size inflation is minimal, less 
than one cacheline.

BR
Wangyang


* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
  2025-12-22  2:21   ` Guo, Wangyang
@ 2025-12-23  7:27     ` Shrikanth Hegde
  2025-12-23  8:03       ` Guo, Wangyang
  0 siblings, 1 reply; 7+ messages in thread
From: Shrikanth Hegde @ 2025-12-23  7:27 UTC (permalink / raw)
  To: Guo, Wangyang
  Cc: linux-kernel, Benjamin Lei, Tim Chen, Tianyou Li, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider



On 12/22/25 7:51 AM, Guo, Wangyang wrote:
> On 12/21/2025 9:05 PM, Shrikanth Hegde wrote:
>> Hi Wangyang,
>>
>> On 12/11/25 11:26 AM, Wangyang Guo wrote:
>>> There are two potential false sharing issue in nohz struct:
>>> 1. idle_cpus_mask is a read-mostly field, but share the same cacheline
>>>     with frequently updated nr_cpus.
>>
>> Updates to idle_cpus_mask is not same cacheline. it is updated 
>> alongside nr_cpus.
>>
>> with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
>> Updates to it happen in another cacheline.
>>
>> with CPUMASK_OFFSTACK=n, idle_cpus_mask is on the stack and its length
>> depends on NR_CPUS. typical value being 512/2048/8192 it can span a few
>> cachelines. So updates to it likely in different cacheline compared to 
>> nr_cpus.
>>
>> see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@gmail.com/
>>
> This patch is mainly target for idle_cpus_mask as a pointer, which is 
> default for many distro OS.
> 

Not on all archs.

>>
>> Likely in your case, nr_cpus updates are the costly ones.
>> Try below and see if it helps to fix your issue too.
>> https://lore.kernel.org/all/20251201183146.74443-1-sshegde@linux.ibm.com/
>> I Should send out new version soon.
>>
>>> 2. Data followed by nohz still share the same cacheline and has
>>>     potential false sharing issue.
>>>
>>
>> How does your patch handle this?
>> I don't see any other struct apart from nohz being changed.
> 
> The data follow by nohz is implicit and determined by compiler.
> For example, this is the layout from /proc/kallsyms in my machine:
> ffffffff88600d40 b nohz
> ffffffff88600d68 B arch_needs_tick_broadcast
> ffffffff88600d6c b __key.264
> ffffffff88600d6c b __key.265
> ffffffff88600d70 b dl_generation
> ffffffff88600d78 b sched_clock_irqtime
> 
> What we can do is placing read-mostly `idle_cpus_mask` pointer in a new 
> cacheline, so data followed by nohz would not be affected by nr_cpus.
> 

That's a concern. If it is compiler dependent, then sometimes it helps, sometimes it won't.

It should be done the other way around rather than changing nohz:
if there is a variable with a lot of reads/updates, it should go into its
own cacheline instead.

i.e. in your case sched_clock_irqtime should go into its own cacheline.

---
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 4f97896887ec..29f9438f9f03 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -25,7 +25,7 @@
   */
  DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
  
-int sched_clock_irqtime;
+int sched_clock_irqtime __cacheline_aligned;
  
  void enable_sched_clock_irqtime(void)
  {



* Re: [PATCH] sched/fair: Avoid false sharing in nohz struct
  2025-12-23  7:27     ` Shrikanth Hegde
@ 2025-12-23  8:03       ` Guo, Wangyang
  0 siblings, 0 replies; 7+ messages in thread
From: Guo, Wangyang @ 2025-12-23  8:03 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: linux-kernel, Benjamin Lei, Tim Chen, Tianyou Li, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 12/23/2025 3:27 PM, Shrikanth Hegde wrote:
>>>
>>> Likely in your case, nr_cpus updates are the costly ones.
>>> Try below and see if it helps to fix your issue too.
>>> https://lore.kernel.org/all/20251201183146.74443-1- 
>>> sshegde@linux.ibm.com/
>>> I Should send out new version soon.
>>>
>>>> 2. Data followed by nohz still share the same cacheline and has
>>>>     potential false sharing issue.
>>>>
>>>
>>> How does your patch handle this?
>>> I don't see any other struct apart from nohz being changed.
>>
>> The data follow by nohz is implicit and determined by compiler.
>> For example, this is the layout from /proc/kallsyms in my machine:
>> ffffffff88600d40 b nohz
>> ffffffff88600d68 B arch_needs_tick_broadcast
>> ffffffff88600d6c b __key.264
>> ffffffff88600d6c b __key.265
>> ffffffff88600d70 b dl_generation
>> ffffffff88600d78 b sched_clock_irqtime
>>
>> What we can do is placing read-mostly `idle_cpus_mask` pointer in a 
>> new cacheline, so data followed by nohz would not be affected by nr_cpus.
>>
> 
> That's a concern. If it is compiler dependent, then sometime it helps, 
> sometime it wont.

This patch wants to make the fix independent of compiler/linker 
placement. Sometimes the neighbor may be sched_clock_irqtime, sometimes 
other nearby variables. Changing nohz makes sure it would not affect 
others, no matter what data follows.

> It should done other way around rather than changing the nohz.
> If there is structure which has a lot of read/updates, it should go into 
> its
> own cacheline rather.
> 
> i.e in your case sched_clock_irqtime should go into its own cacheline.

If that is preferred in the kernel, I can resubmit the patch so that it 
only adds alignment on sched_clock_irqtime.

> ---
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 4f97896887ec..29f9438f9f03 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -25,7 +25,7 @@
>    */
>   DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
> 
> -int sched_clock_irqtime;
> +int sched_clock_irqtime __cacheline_aligned;
> 
>   void enable_sched_clock_irqtime(void)
>   {
> 
> 

BR
Wangyang

