public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
@ 2025-10-28 20:23 Tim Chen
  2025-10-29  4:37 ` Shrikanth Hegde
  2025-10-29  8:47 ` Peter Zijlstra
  0 siblings, 2 replies; 4+ messages in thread
From: Tim Chen @ 2025-10-28 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
	linux-kernel, Vincent Guittot, Shrikanth Hegde, K Prateek Nayak

The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.

Currently, each MC group leader in a NUMA domain attempts to acquire
the global sched_balance_running flag via cmpxchg() before checking
whether load balancing is due or whether it is the designated leader for
that NUMA domain. On systems with a large number of cores, this causes
significant cache contention on the shared sched_balance_running flag.

This patch reduces unnecessary cmpxchg() operations by first checking
whether the balance interval has expired. If load balancing is not due,
the attempt to acquire sched_balance_running is skipped entirely.
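
In userspace terms, the reordering can be sketched roughly as follows
(a simplified illustration only: C11 atomics stand in for the kernel's
atomic_cmpxchg_acquire()/atomic_set_release(), and jiffies, last_balance
and interval are plain parameters rather than the real scheduler state):

```c
#include <stdatomic.h>

static atomic_int sched_balance_running;

/* Mimics the kernel's time_before() jiffies comparison. */
static int time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Returns 1 if balancing ran, 0 if it was skipped.  The point of the
 * patch: the cheap per-domain interval check now happens before any
 * contention on the shared sched_balance_running flag. */
int sketch_balance(unsigned long jiffies, unsigned long last_balance,
		   unsigned long interval, int need_serialize)
{
	if (time_before(jiffies, last_balance + interval))
		return 0;	/* not due: no global cmpxchg at all */

	if (need_serialize) {
		int expected = 0;

		if (!atomic_compare_exchange_strong(&sched_balance_running,
						    &expected, 1))
			return 0;	/* another NUMA balance in flight */
	}

	/* ... sched_balance_rq() would run here ... */

	if (need_serialize)
		atomic_store(&sched_balance_running, 0);
	return 1;
}
```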

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were spent in
sched_balance_domains() contending on sched_balance_running before
this change.

         : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
         : 105              {
         : 106              return arch_cmpxchg(&v->counter, old, new);
    0.00 :   ffffffff81326e6c:       xor    %eax,%eax
    0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
    0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
         : 110              sched_balance_domains():
         : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
   99.39 :   ffffffff81326e7b:       test   %eax,%eax
    0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
         : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
    0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
    0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
    0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax

After applying this fix, sched_balance_domains() is gone from
the profile and there is an 8% throughput improvement.

v2:
1. Rearrange the patch to get rid of an indent level per Peter's
   suggestion.
2. Updated the data from a new run by the OLTP team.

link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/

Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..a10c95e11ea5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12226,6 +12226,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 
 		interval = get_sd_balance_interval(sd, busy);
+		if (time_before(jiffies, sd->last_balance + interval))
+			goto out;
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 		if (need_serialize) {
@@ -12233,19 +12235,18 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 				goto out;
 		}
 
-		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
-				/*
-				 * The LBF_DST_PINNED logic could have changed
-				 * env->dst_cpu, so we can't know our idle
-				 * state even if we migrated tasks. Update it.
-				 */
-				idle = idle_cpu(cpu);
-				busy = !idle && !sched_idle_cpu(cpu);
-			}
-			sd->last_balance = jiffies;
-			interval = get_sd_balance_interval(sd, busy);
+		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
+			/*
+			 * The LBF_DST_PINNED logic could have changed
+			 * env->dst_cpu, so we can't know our idle
+			 * state even if we migrated tasks. Update it.
+			 */
+			idle = idle_cpu(cpu);
+			busy = !idle && !sched_idle_cpu(cpu);
 		}
+		sd->last_balance = jiffies;
+		interval = get_sd_balance_interval(sd, busy);
+
 		if (need_serialize)
 			atomic_set_release(&sched_balance_running, 0);
 out:
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  2025-10-28 20:23 [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due Tim Chen
@ 2025-10-29  4:37 ` Shrikanth Hegde
  2025-10-29  8:47 ` Peter Zijlstra
  1 sibling, 0 replies; 4+ messages in thread
From: Shrikanth Hegde @ 2025-10-29  4:37 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
	Vincent Guittot, K Prateek Nayak



On 10/29/25 1:53 AM, Tim Chen wrote:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
> 
> Currently, each MC group leader in a NUMA domain attempts to acquire

Could be MC or PKG, depending on topology.

> the global sched_balance_running flag via cmpxchg() before checking
> whether load balancing is due or whether it is the designated leader for
> that NUMA domain. On systems with a large number of cores, this causes
> significant cache contention on the shared sched_balance_running flag.
> 
> This patch reduces unnecessary cmpxchg() operations by first checking
> whether the balance interval has expired. If load balancing is not due,
> the attempt to acquire sched_balance_running is skipped entirely.
> 
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> running an OLTP workload, 7.8% of total CPU cycles were spent in
> sched_balance_domains() contending on sched_balance_running before
> this change.
> 
>           : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
>           : 105              {
>           : 106              return arch_cmpxchg(&v->counter, old, new);
>      0.00 :   ffffffff81326e6c:       xor    %eax,%eax
>      0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
>      0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
>           : 110              sched_balance_domains():
>           : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>     99.39 :   ffffffff81326e7b:       test   %eax,%eax
>      0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
>           : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
>      0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
>      0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
>      0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax
> 
> After applying this fix, sched_balance_domains() is gone from
> the profile and there is an 8% throughput improvement.
> 
> v2:
> 1. Rearrange the patch to get rid of an indent level per Peter's
>     suggestion.
> 2. Updated the data from new run by OLTP team.
> 
> link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/
> 
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   kernel/sched/fair.c | 25 +++++++++++++------------
>   1 file changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 25970dbbb279..a10c95e11ea5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12226,6 +12226,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   		}
>   
>   		interval = get_sd_balance_interval(sd, busy);
> +		if (time_before(jiffies, sd->last_balance + interval))
> +			goto out;
>   
>   		need_serialize = sd->flags & SD_SERIALIZE;
>   		if (need_serialize) {
> @@ -12233,19 +12235,18 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   				goto out;
>   		}
>   
> -		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> -			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> -				/*
> -				 * The LBF_DST_PINNED logic could have changed
> -				 * env->dst_cpu, so we can't know our idle
> -				 * state even if we migrated tasks. Update it.
> -				 */
> -				idle = idle_cpu(cpu);
> -				busy = !idle && !sched_idle_cpu(cpu);
> -			}
> -			sd->last_balance = jiffies;
> -			interval = get_sd_balance_interval(sd, busy);
> +		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> +			/*
> +			 * The LBF_DST_PINNED logic could have changed
> +			 * env->dst_cpu, so we can't know our idle
> +			 * state even if we migrated tasks. Update it.
> +			 */
> +			idle = idle_cpu(cpu);
> +			busy = !idle && !sched_idle_cpu(cpu);
>   		}
> +		sd->last_balance = jiffies;
> +		interval = get_sd_balance_interval(sd, busy);
> +
>   		if (need_serialize)
>   			atomic_set_release(&sched_balance_running, 0);
>   out:

Hi Tim,

I still prefer the method where we move this check after should_we_balance.
https://lore.kernel.org/all/a5d5ce5e-9f98-4c0d-a4ed-5e4a8a6f7b86@linux.ibm.com/


The reason being:

1. A CPU acquires sched_balance_running but won't do balancing since
    should_we_balance() says no, while a legit CPU that was supposed to do
    the balance bails out since it couldn't acquire it.

2. Newidle balance doesn't honor SD_SERIALIZE as of now, and that too will
    come under the same checks. Iterating through a NUMA domain's rq's is
    costly, so this could help avoid unnecessary bus traffic.
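
As a rough userspace sketch of that ordering (hypothetical and
simplified; swb stands in for the result of should_we_balance(), and
C11 atomics stand in for the kernel atomics):

```c
#include <stdatomic.h>

static atomic_int sched_balance_running;

/* swb: nonzero if this CPU is the designated balancer for the domain.
 * Returns 1 if balancing ran, 0 if skipped.  CPUs that should not
 * balance never touch the shared flag at all. */
int serialized_balance(int swb)
{
	int expected = 0;

	if (!swb)
		return 0;	/* not the designated CPU: no contention */

	if (!atomic_compare_exchange_strong(&sched_balance_running,
					    &expected, 1))
		return 0;	/* another NUMA balance in flight */

	/* ... the actual balancing would happen here ... */

	atomic_store(&sched_balance_running, 0);
	return 1;
}
```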

* Re: [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  2025-10-28 20:23 [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due Tim Chen
  2025-10-29  4:37 ` Shrikanth Hegde
@ 2025-10-29  8:47 ` Peter Zijlstra
  2025-10-29 18:00   ` Tim Chen
  1 sibling, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2025-10-29  8:47 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
	Vincent Guittot, Shrikanth Hegde, K Prateek Nayak

On Tue, Oct 28, 2025 at 01:23:30PM -0700, Tim Chen wrote:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
> 
> Currently, each MC group leader in a NUMA domain attempts to acquire
> the global sched_balance_running flag via cmpxchg() before checking
> whether load balancing is due or whether it is the designated leader for
> that NUMA domain. On systems with a large number of cores, this causes
> significant cache contention on the shared sched_balance_running flag.
> 
> This patch reduces unnecessary cmpxchg() operations by first checking
> whether the balance interval has expired. If load balancing is not due,
> the attempt to acquire sched_balance_running is skipped entirely.
> 
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> running an OLTP workload, 7.8% of total CPU cycles were spent in
> sched_balance_domains() contending on sched_balance_running before
> this change.
> 
>          : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
>          : 105              {
>          : 106              return arch_cmpxchg(&v->counter, old, new);
>     0.00 :   ffffffff81326e6c:       xor    %eax,%eax
>     0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
>     0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
>          : 110              sched_balance_domains():
>          : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>    99.39 :   ffffffff81326e7b:       test   %eax,%eax
>     0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
>          : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
>     0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
>     0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
>     0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax
> 
> After applying this fix, sched_balance_domains() is gone from
> the profile and there is an 8% throughput improvement.
> 

this..

> v2:
> 1. Rearrange the patch to get rid of an indent level per Peter's
>    suggestion.
> 2. Updated the data from new run by OLTP team.
> 
> link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/

... stuff goes under the '---' sign.

Also, what happened to my other suggestion:

  https://lkml.kernel.org/r/20251014092436.GK4067720@noisy.programming.kicks-ass.net

? That seemed like a better place to put things.

* Re: [PATCH v2] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
  2025-10-29  8:47 ` Peter Zijlstra
@ 2025-10-29 18:00   ` Tim Chen
  0 siblings, 0 replies; 4+ messages in thread
From: Tim Chen @ 2025-10-29 18:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
	Vincent Guittot, Shrikanth Hegde, K Prateek Nayak

On Wed, 2025-10-29 at 09:47 +0100, Peter Zijlstra wrote:
> On Tue, Oct 28, 2025 at 01:23:30PM -0700, Tim Chen wrote:
> > The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> > only one NUMA load balancing operation to run system-wide at a time.
> > 
> > Currently, each MC group leader in a NUMA domain attempts to acquire
> > the global sched_balance_running flag via cmpxchg() before checking
> > whether load balancing is due or whether it is the designated leader for
> > that NUMA domain. On systems with a large number of cores, this causes
> > significant cache contention on the shared sched_balance_running flag.
> > 
> > This patch reduces unnecessary cmpxchg() operations by first checking
> > whether the balance interval has expired. If load balancing is not due,
> > the attempt to acquire sched_balance_running is skipped entirely.
> > 
> > On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> > running an OLTP workload, 7.8% of total CPU cycles were spent in
> > sched_balance_domains() contending on sched_balance_running before
> > this change.
> > 
> >          : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
> >          : 105              {
> >          : 106              return arch_cmpxchg(&v->counter, old, new);
> >     0.00 :   ffffffff81326e6c:       xor    %eax,%eax
> >     0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
> >     0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
> >          : 110              sched_balance_domains():
> >          : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >    99.39 :   ffffffff81326e7b:       test   %eax,%eax
> >     0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
> >          : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
> >     0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
> >     0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
> >     0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax
> > 
> > After applying this fix, sched_balance_domains() is gone from
> > the profile and there is an 8% throughput improvement.
> > 
> 
> this..
> 
> > v2:
> > 1. Rearrange the patch to get rid of an indent level per Peter's
> >    suggestion.
> > 2. Updated the data from new run by OLTP team.
> > 
> > link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/
> 
> ... stuff goes under the '---' sign.
> 
> Also, what happened to my other suggestion:
> 
>   https://lkml.kernel.org/r/20251014092436.GK4067720@noisy.programming.kicks-ass.net
> 
> ? That seemed like a better place to put things.

Yes, I agree it is better to lock sched_balance_running after
should_we_balance(). That will keep all the CPUs that shouldn't do
balancing from locking sched_balance_running, as Shrikanth also pointed
out. Let me put that version of the patch through some OLTP testing to
validate it.

Tim
