* [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
@ 2025-11-10 18:47 Tim Chen
2025-11-11 6:24 ` Shrikanth Hegde
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Tim Chen @ 2025-11-10 18:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, Shrikanth Hegde, K Prateek Nayak
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.
Currently, each sched group leader directly under a NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.
This patch reduces unnecessary cmpxchg() operations by first checking
via should_we_balance() that this CPU is the designated balancer for
the NUMA domain, and that the balance interval has expired, before
trying to acquire sched_balance_running to load balance that NUMA
domain.
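In outline, the ordering change looks like this (a simplified sketch
assembled from the hunks below, not the literal code):

/* Before, in sched_balance_domains(): every group leader under the
 * NUMA domain hits the shared cacheline before checking whether the
 * balance is even due: */
if (sd->flags & SD_SERIALIZE) {
        if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
                goto out;
}
if (time_after_eq(jiffies, sd->last_balance + interval))
        sched_balance_rq(cpu, rq, sd, idle, &continue_balancing);

/* After, inside sched_balance_rq(), which is reached only once the
 * interval has expired; the cmpxchg runs only on the designated CPU: */
if (!should_we_balance(&env)) {
        *continue_balancing = 0;
        goto out_balanced;
}
if (sd->flags & SD_SERIALIZE) {
        if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
                goto out_balanced;
        need_unlock = true;
}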
On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were spent in
sched_balance_domains() contending on sched_balance_running before
this change.
: 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
: 105 {
: 106 return arch_cmpxchg(&v->counter, old, new);
0.00 : ffffffff81326e6c: xor %eax,%eax
0.00 : ffffffff81326e6e: mov $0x1,%ecx
0.00 : ffffffff81326e73: lock cmpxchg %ecx,0x2394195(%rip) # ffffffff836bb010 <sched_balance_running>
: 110 sched_balance_domains():
: 12234 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
99.39 : ffffffff81326e7b: test %eax,%eax
0.00 : ffffffff81326e7d: jne ffffffff81326e99 <sched_balance_domains+0x209>
: 12238 if (time_after_eq(jiffies, sd->last_balance + interval)) {
0.00 : ffffffff81326e7f: mov 0x14e2b3a(%rip),%rax # ffffffff828099c0 <jiffies_64>
0.00 : ffffffff81326e86: sub 0x48(%r14),%rax
0.00 : ffffffff81326e8a: cmp %rdx,%rax
After applying this fix, sched_balance_domains() is gone from the profile
and there is a 5% throughput improvement.
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
v4:
1. Allow NEWLY_IDLE balance with SD_SERIALIZE domain to be serialized.
2. Reorganize need_unlock to be bool and initialize it again for the
redo case.
link to v3: https://lore.kernel.org/lkml/52fcd1e8582a6b014a70f0ce7795ce0d88cd63a8.1762470554.git.tim.c.chen@linux.intel.com/
v3:
1. Move the balance interval check to after should_we_balance()
link to v2: https://lore.kernel.org/lkml/248b775fc9030989c829d4061f6f85ae33dabe45.1761682932.git.tim.c.chen@linux.intel.com/
v2:
1. Rearrange the patch to get rid of an indent level per Peter's
suggestion.
2. Updated the data from new run by OLTP team.
link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/
---
kernel/sched/fair.c | 57 ++++++++++++++++++++++++---------------------
1 file changed, 31 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..43c5ec039633 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11732,6 +11732,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
}
}
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -11757,17 +11772,26 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
+ bool need_unlock;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
+ need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
}
+ if (sd->flags & SD_SERIALIZE) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+ goto out_balanced;
+ }
+ need_unlock = true;
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_unlock)
+ atomic_set_release(&sched_balance_running, 0);
+
goto redo;
}
goto out_all_pinned;
@@ -12008,6 +12035,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (need_unlock)
+ atomic_set_release(&sched_balance_running, 0);
+
return ld_moved;
}
@@ -12132,21 +12162,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}
-/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- * is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- * execution, as non-SD_SERIALIZE domains will still be
- * load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
/*
* Scale the max sched_balance_rq interval with the number of CPUs in the system.
* This trades load-balance latency on larger machines for less cross talk.
@@ -12202,7 +12217,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12226,13 +12241,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
interval = get_sd_balance_interval(sd, busy);
-
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -12246,9 +12254,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
-out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
--
2.32.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-10 18:47 [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due Tim Chen
@ 2025-11-11 6:24 ` Shrikanth Hegde
2025-11-12 8:02 ` Srikar Dronamraju
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Tim Chen
2 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-11 6:24 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
Vincent Guittot, K Prateek Nayak, Peter Zijlstra
Hi Tim,
On 11/11/25 12:17 AM, Tim Chen wrote:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
>
> Currently, each sched group leader directly under a NUMA domain attempts
> to acquire the global sched_balance_running flag via cmpxchg() before
> checking whether load balancing is due or whether it is the designated
> load balancer for that NUMA domain. On systems with a large number
> of cores, this causes significant cache contention on the shared
> sched_balance_running flag.
>
> This patch reduces unnecessary cmpxchg() operations by first checking
> via should_we_balance() that this CPU is the designated balancer for
> the NUMA domain, and that the balance interval has expired, before
> trying to acquire sched_balance_running to load balance that NUMA
> domain.
>
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> running an OLTP workload, 7.8% of total CPU cycles were spent in
> sched_balance_domains() contending on sched_balance_running before
> this change.
Looks good to me. Thanks for getting this into its current shape.
I see hackbench improving slightly across its variations. So,
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-10 18:47 [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due Tim Chen
2025-11-11 6:24 ` Shrikanth Hegde
@ 2025-11-12 8:02 ` Srikar Dronamraju
2025-11-12 10:37 ` Peter Zijlstra
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Tim Chen
2 siblings, 1 reply; 25+ messages in thread
From: Srikar Dronamraju @ 2025-11-12 8:02 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, Chen Yu, Doug Nelson,
Mohini Narkhede, linux-kernel, Vincent Guittot, Shrikanth Hegde,
K Prateek Nayak
* Tim Chen <tim.c.chen@linux.intel.com> [2025-11-10 10:47:35]:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
>
> @@ -11757,17 +11772,26 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> + bool need_unlock;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> redo:
> + need_unlock = false;
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> goto out_balanced;
> }
>
> + if (sd->flags & SD_SERIALIZE) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> + goto out_balanced;
> + }
> + need_unlock = true;
> + }
> +
Moving the serialize check to sched_balance_rq() is better, since we only
take the lock when it's really needed. Previously, we could have skipped
the balancing for the appropriate CPU.
> group = sched_balance_find_src_group(&env);
> if (!group) {
> schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> + if (need_unlock)
> + atomic_set_release(&sched_balance_running, 0);
> +
One nit:
While the current code is good, would conditionally resetting
need_unlock just after resetting the atomic variable be better than
the unconditional reset that we do now?
> goto redo;
Otherwise looks good to me.
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 8:02 ` Srikar Dronamraju
@ 2025-11-12 10:37 ` Peter Zijlstra
2025-11-12 10:45 ` Peter Zijlstra
2025-11-12 10:53 ` Shrikanth Hegde
0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-12 10:37 UTC (permalink / raw)
To: Srikar Dronamraju, Linus Torvalds
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, Shrikanth Hegde, K Prateek Nayak
On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
> > group = sched_balance_find_src_group(&env);
> > if (!group) {
> > schedstat_inc(sd->lb_nobusyg[idle]);
> > @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > if (!cpumask_subset(cpus, env.dst_grpmask)) {
> > env.loop = 0;
> > env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > + if (need_unlock)
> > + atomic_set_release(&sched_balance_running, 0);
> > +
>
> One nit:
> While the current code is good, would conditionally resetting
> need_unlock just after resetting the atomic variable be better than
> the unconditional reset that we do now?
Right, I had the same thought when I grabbed the patch yesterday, but
ignored it.
But perhaps something like so?
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,14 +11717,20 @@ static int sched_balance_rq(int this_cpu
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
+ if (0) {
redo:
- need_unlock = false;
+ if (need_unlock) {
+ atomic_set_release(&sched_balance_running, 0);
+ need_unlock = false;
+ }
+ }
+
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
@@ -11861,9 +11867,6 @@ static int sched_balance_rq(int this_cpu
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
goto redo;
}
goto out_all_pinned;
---
The other option is something like this, but Linus hated on this pattern
when we started with things.
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11692,6 +11692,8 @@ static void update_lb_imbalance_stat(str
*/
static atomic_t sched_balance_running = ATOMIC_INIT(0);
+DEFINE_FREE(balance_unlock, atomic_t *, if (_T) atomic_set_release(_T, 0));
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -11717,24 +11719,23 @@ static int sched_balance_rq(int this_cpu
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
- goto out_balanced;
+ return 0;
}
+ atomic_t *lock __free(balance_unlock) = NULL;
if (sd->flags & SD_SERIALIZE) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
- goto out_balanced;
- }
- need_unlock = true;
+ if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ return 0;
+
+ lock = &sched_balance_running;
}
group = sched_balance_find_src_group(&env);
@@ -11861,9 +11862,6 @@ static int sched_balance_rq(int this_cpu
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
goto redo;
}
goto out_all_pinned;
@@ -11980,9 +11978,6 @@ static int sched_balance_rq(int this_cpu
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
return ld_moved;
}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 10:37 ` Peter Zijlstra
@ 2025-11-12 10:45 ` Peter Zijlstra
2025-11-12 11:09 ` Shrikanth Hegde
2025-11-12 11:25 ` Srikar Dronamraju
2025-11-12 10:53 ` Shrikanth Hegde
1 sibling, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-12 10:45 UTC (permalink / raw)
To: Srikar Dronamraju, Linus Torvalds
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, Shrikanth Hegde, K Prateek Nayak
On Wed, Nov 12, 2025 at 11:37:40AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
> > > group = sched_balance_find_src_group(&env);
> > > if (!group) {
> > > schedstat_inc(sd->lb_nobusyg[idle]);
> > > @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > > if (!cpumask_subset(cpus, env.dst_grpmask)) {
> > > env.loop = 0;
> > > env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > > + if (need_unlock)
> > > + atomic_set_release(&sched_balance_running, 0);
> > > +
> >
> > One nit:
> > While the current code is good, would conditionally resetting
> > need_unlock just after resetting the atomic variable be better than
> > the unconditional reset that we do now?
>
> Right, I had the same thought when I grabbed the patch yesterday, but
> ignored it.
>
Hmm, should we not redo while keeping the lock? Doesn't make much sense
to drop and try to reacquire things here.
So perhaps this is the better option -- or did I overlook something with
should_we_balance? It doesn't look like that will make a different
decision on the retry.
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,26 +11717,25 @@ static int sched_balance_rq(int this_cpu
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
-redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
}
if (sd->flags & SD_SERIALIZE) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+ if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
goto out_balanced;
- }
+
need_unlock = true;
}
+redo:
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
goto redo;
}
goto out_all_pinned;
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 10:45 ` Peter Zijlstra
@ 2025-11-12 11:09 ` Shrikanth Hegde
2025-11-12 11:21 ` Peter Zijlstra
2025-11-12 11:25 ` Srikar Dronamraju
1 sibling, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-12 11:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On 11/12/25 4:15 PM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 11:37:40AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
>>>> group = sched_balance_find_src_group(&env);
>>>> if (!group) {
>>>> schedstat_inc(sd->lb_nobusyg[idle]);
>>>> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>>> if (!cpumask_subset(cpus, env.dst_grpmask)) {
>>>> env.loop = 0;
>>>> env.loop_break = SCHED_NR_MIGRATE_BREAK;
>>>> + if (need_unlock)
>>>> + atomic_set_release(&sched_balance_running, 0);
>>>> +
>>>
>>> One nit:
>>> While the current code is good, would conditionally resetting
>>> need_unlock just after resetting the atomic variable be better than
>>> the unconditional reset that we do now?
>>
>> Right, I had the same thought when I grabbed the patch yesterday, but
>> ignored it.
>>
>
> Hmm, should we not redo while keeping the lock? Doesn't make much sense
> to drop and try to reacquire things here.
>
> So perhaps this is the better option -- or did I overlook something with
> should_we_balance? It doesn't look like that will make a different
> decision on the retry.
>
I think in newidle balance, these checks are there in swb to bail out of load balance.
The redo logic catches it, right?
env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
ttwu_pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?
if (env->idle == CPU_NEWLY_IDLE) {
if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
return 0;
return 1;
}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 11:09 ` Shrikanth Hegde
@ 2025-11-12 11:21 ` Peter Zijlstra
2025-11-12 21:10 ` Tim Chen
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-12 11:21 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
>
>
> > So perhaps this is the better option -- or did I overlook something with
> > should_we_balance? It doesn't look like that will make a different
> > decision on the retry.
> >
>
> I think in newidle balance, these checks are there in swb to bail out of load balance.
> The redo logic catches it, right?
Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
should make this 2 patches, one moving the serializing and one adding it
to newidle.
> env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
> ttwu_pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?
>
> if (env->idle == CPU_NEWLY_IDLE) {
> if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> return 0;
> return 1;
> }
Right, that could get tickled.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 11:21 ` Peter Zijlstra
@ 2025-11-12 21:10 ` Tim Chen
2025-11-13 4:25 ` Shrikanth Hegde
0 siblings, 1 reply; 25+ messages in thread
From: Tim Chen @ 2025-11-12 21:10 UTC (permalink / raw)
To: Peter Zijlstra, Shrikanth Hegde
Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
> >
> >
>
> > > So perhaps this is the better option -- or did I overlook something with
> > > should_we_balance? It doesn't look like that will make a different
> > > decision on the retry.
> > >
> >
> > I think in newidle balance, these checks are there in swb to bail out of load balance.
> > The redo logic catches it, right?
>
> Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
> should make this 2 patches, one moving the serializing and one adding it
> to newidle.
>
> > env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
> > ttwu_pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?
> >
> > if (env->idle == CPU_NEWLY_IDLE) {
> > if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> > return 0;
> > return 1;
> > }
>
> Right, that could get tickled.
How about something like the following on top of v4 patch?
This will avoid releasing the lock and take care of the NEWLY_IDLE case.
Tim
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43c5ec039633..26179f4b77f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
@@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
+ if (env.idle == CPU_NEWLY_IDLE &&
+ (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
+ goto out;
goto redo;
}
goto out_all_pinned;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 21:10 ` Tim Chen
@ 2025-11-13 4:25 ` Shrikanth Hegde
2025-11-13 17:49 ` Tim Chen
0 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-13 4:25 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra
Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On 11/13/25 2:40 AM, Tim Chen wrote:
> On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
>> On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
>>>
>>>
>>
>>>> So perhaps this is the better option -- or did I overlook something with
>>>> should_we_balance? It doesn't look like that will make a different
>>>> decision on the retry.
>>>>
>>>
>>> I think in newidle balance, these checks are there in swb to bail out of load balance.
>>> The redo logic catches it, right?
>>
>> Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
>> should make this 2 patches, one moving the serializing and one adding it
>> to newidle.
>>
>>> env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
>>> ttwu_pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?
>>>
>>> if (env->idle == CPU_NEWLY_IDLE) {
>>> if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
>>> return 0;
>>> return 1;
>>> }
>>
>> Right, that could get tickled.
>
> How about something like the following on top of v4 patch?
> This will avoid releasing the lock and take care of the NEWLY_IDLE case.
>
> Tim
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 43c5ec039633..26179f4b77f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> - bool need_unlock;
> + bool need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> redo:
> - need_unlock = false;
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> goto out_balanced;
> @@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> > + if (env.idle == CPU_NEWLY_IDLE &&
> > + (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
> + goto out;
IIUC, if we come here, it means a busiest cpu was found, but due to
affinity restrictions none of its tasks can come to this cpu.
So a redo is done excluding that busiest cpu, if there are cpus other
than the group_mask of this cpu. So doing a redo does make sense, especially
for newidle.
So bailing out might be wrong.
> goto redo;
> }
> goto out_all_pinned;
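For reference, the surrounding all-pinned/redo logic in mainline
sched_balance_rq() is roughly the following (a sketch of the code around
LBF_ALL_PINNED in kernel/sched/fair.c, trimmed for brevity):

if (unlikely(env.flags & LBF_ALL_PINNED)) {
        /* Exclude the pinned busiest CPU and retry with the rest. */
        __cpumask_clear_cpu(cpu_of(busiest), cpus);
        /*
         * Redo only if candidate CPUs remain outside this CPU's own
         * group; otherwise give up on this domain entirely.
         */
        if (!cpumask_subset(cpus, env.dst_grpmask)) {
                env.loop = 0;
                env.loop_break = SCHED_NR_MIGRATE_BREAK;
                goto redo;
        }
        goto out_all_pinned;
}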
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-13 4:25 ` Shrikanth Hegde
@ 2025-11-13 17:49 ` Tim Chen
0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2025-11-13 17:49 UTC (permalink / raw)
To: Shrikanth Hegde, Peter Zijlstra
Cc: Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede, linux-kernel,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On Thu, 2025-11-13 at 09:55 +0530, Shrikanth Hegde wrote:
>
> On 11/13/25 2:40 AM, Tim Chen wrote:
> > On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
> > > >
> > > >
> > >
> > > > > So perhaps this is the better option -- or did I overlook something with
> > > > > should_we_balance? It doesn't look like that will make a different
> > > > > decision on the retry.
> > > > >
> > > >
> > > > I think in newidle balance, these checks are there in swb to bail out of load balance.
> > > > The redo logic catches it, right?
> > >
> > > Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
> > > should make this 2 patches, one moving the serializing and one adding it
> > > to newidle.
> > >
> > > > env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
> > > > ttwu_pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?
> > > >
> > > > if (env->idle == CPU_NEWLY_IDLE) {
> > > > if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> > > > return 0;
> > > > return 1;
> > > > }
> > >
> > > Right, that could get tickled.
> >
> > How about something like the following on top of v4 patch?
> > This will avoid releasing the lock and take care of the NEWLY_IDLE case.
> >
> > Tim
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 43c5ec039633..26179f4b77f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > .fbq_type = all,
> > .tasks = LIST_HEAD_INIT(env.tasks),
> > };
> > - bool need_unlock;
> > + bool need_unlock = false;
> >
> > cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
> >
> > schedstat_inc(sd->lb_count[idle]);
> >
> > redo:
> > - need_unlock = false;
> > if (!should_we_balance(&env)) {
> > *continue_balancing = 0;
> > goto out_balanced;
> > @@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > if (!cpumask_subset(cpus, env.dst_grpmask)) {
> > env.loop = 0;
> > env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > - if (need_unlock)
> > - atomic_set_release(&sched_balance_running, 0);
> > -
> > + if (env.idle == CPU_NEWLY_IDLE &&
> > + (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
> > + goto out;
>
> IIUC, if we come here, it means a busiest cpu was found, but due to
> affinity restrictions none of its tasks can come to this cpu.
>
> So a redo is done excluding that busiest cpu, if there are cpus other
> than the group_mask of this cpu. So doing a redo does make sense, especially
> for newidle.
>
> So bailing out might be wrong.
My understanding is that the reason for the idle balancing is that the
dst_rq became idle. If we see that the dst_rq already has something to run,
why add latency searching for more tasks to pull, as the dst_rq is likely
the current cpu.
Tim
>
> > goto redo;
> > }
> > goto out_all_pinned;
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 10:45 ` Peter Zijlstra
2025-11-12 11:09 ` Shrikanth Hegde
@ 2025-11-12 11:25 ` Srikar Dronamraju
2025-11-12 13:39 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: Srikar Dronamraju @ 2025-11-12 11:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson,
Mohini Narkhede, linux-kernel, Vincent Guittot, Shrikanth Hegde,
K Prateek Nayak
* Peter Zijlstra <peterz@infradead.org> [2025-11-12 11:45:55]:
> >
> > Right, I had the same thought when I grabbed the patch yesterday, but
> > ignored it.
> >
>
> Hmm, should we not redo while keeping the lock? Doesn't make much sense
> to drop and try to reacquire things here.
>
> So perhaps this is the better option -- or did I overlook something with
> should_we_balance? It doesn't look like that will make a different
> decision on the retry.
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,26 +11717,25 @@ static int sched_balance_rq(int this_cpu
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> - bool need_unlock;
> + bool need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> -redo:
> - need_unlock = false;
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> goto out_balanced;
> }
>
> if (sd->flags & SD_SERIALIZE) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> goto out_balanced;
> - }
> +
> need_unlock = true;
> }
>
> +redo:
> group = sched_balance_find_src_group(&env);
> if (!group) {
> schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> goto redo;
> }
> goto out_all_pinned;
If the CPU that was doing the balance was not the first CPU of the domain
span, but it was doing the balance since the first CPU was busy, and the
first CPU now happens to be idle at redo, the scheduler would have chosen the
first CPU to do the balance. However, it will now choose the CPU that had the atomic.
I think this is better because
- The first CPU may have tried just before this CPU dropped the atomic and
hence we may miss the balance opportunity.
- The first CPU and the other CPU may not be sharing cache and hence there
may be a cache-miss, which we are avoiding by doing this.
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 11:25 ` Srikar Dronamraju
@ 2025-11-12 13:39 ` Peter Zijlstra
2025-11-12 13:44 ` Peter Zijlstra
2025-11-12 16:02 ` Srikar Dronamraju
0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-12 13:39 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Linus Torvalds, Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson,
Mohini Narkhede, linux-kernel, Vincent Guittot, Shrikanth Hegde,
K Prateek Nayak
On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
> If the CPU that was doing the balance was not the first CPU of the domain
> span, but it was doing the balance since the first CPU was busy, and the
> first CPU now happens to be idle at redo, the scheduler would have chosen the
> first CPU to do the balance. However, it will now choose the CPU that had the atomic.
>
> I think this is better because
> - The first CPU may have tried just before this CPU dropped the atomic and
> hence we may miss the balance opportunity.
> - The first CPU and the other CPU may not be sharing cache and hence there
> may be a cache-miss, which we are avoiding by doing this.
I'm not sure I understand what you're arguing for. Are you saying it
would be better to retain the lock where possible?
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,23 +11717,22 @@ static int sched_balance_rq(int this_cpu
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
}
- if (sd->flags & SD_SERIALIZE) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+ if (!need_unlock && sd->flags & SD_SERIALIZE) {
+ if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
goto out_balanced;
- }
+
need_unlock = true;
}
@@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
goto redo;
}
goto out_all_pinned;
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 13:39 ` Peter Zijlstra
@ 2025-11-12 13:44 ` Peter Zijlstra
2025-11-12 16:02 ` Srikar Dronamraju
1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-12 13:44 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Linus Torvalds, Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson,
Mohini Narkhede, linux-kernel, Vincent Guittot, Shrikanth Hegde,
K Prateek Nayak
On Wed, Nov 12, 2025 at 02:39:37PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
>
> > If the CPU that was doing the balance was not the first CPU of the domain
> > span, but it was doing the balance since the first CPU was busy, and the
> > first CPU now happens to be idle at redo, the scheduler would have chosen the
> > first CPU to do the balance. However, it will now choose the CPU that had the atomic.
> >
> > I think this is better because
> > - The first CPU may have tried just before this CPU dropped the atomic and
> > hence we may miss the balance opportunity.
> > - The first CPU and the other CPU may not be sharing cache and hence there
> > may be a cache-miss, which we are avoiding by doing this.
>
> I'm not sure I understand what you're arguing for. Are you saying it
> would be better to retain the lock where possible?
FWIW, this is also the behaviour we had before this patch: with the lock
taken in the caller, it would have covered the whole redo case as
well.
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,23 +11717,22 @@ static int sched_balance_rq(int this_cpu
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> - bool need_unlock;
> + bool need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> redo:
> - need_unlock = false;
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> goto out_balanced;
> }
>
> - if (sd->flags & SD_SERIALIZE) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> + if (!need_unlock && sd->flags & SD_SERIALIZE) {
> + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> goto out_balanced;
> - }
> +
> need_unlock = true;
> }
>
> @@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> goto redo;
> }
> goto out_all_pinned;
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 13:39 ` Peter Zijlstra
2025-11-12 13:44 ` Peter Zijlstra
@ 2025-11-12 16:02 ` Srikar Dronamraju
1 sibling, 0 replies; 25+ messages in thread
From: Srikar Dronamraju @ 2025-11-12 16:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson,
Mohini Narkhede, linux-kernel, Vincent Guittot, Shrikanth Hegde,
K Prateek Nayak
* Peter Zijlstra <peterz@infradead.org> [2025-11-12 14:39:37]:
> On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
>
> > If the CPU that was doing the balance was not the first CPU of the domain
> > span, but it was doing the balance since the first CPU was busy, and the
> > first CPU now happens to be idle at redo, the scheduler would have chosen the
> > first CPU to do the balance. However, it will now choose the CPU that had the atomic.
> >
> > I think this is better because
> > - The first CPU may have tried just before this CPU dropped the atomic and
> > hence we may miss the balance opportunity.
> > - The first CPU and the other CPU may not be sharing cache and hence there
> > may be a cache-miss, which we are avoiding by doing this.
>
> I'm not sure I understand what you're arguing for. Are you saying it
> would be better to retain the lock where possible?
>
Yes, I was arguing for keeping the lock, and for not re-checking
should_we_balance() with the lock held.
Let's say CPU2 enters sched_balance_rq(), should_we_balance() succeeds, and
CPU2 takes the lock. It goes to redo, and this time should_we_balance() may
not succeed for CPU2 (since CPU0/1 is idle). However, CPU0 may have already
raced with CPU2, tried to take the lock before CPU2 released it, and bailed
out. So we miss a balancing opportunity.
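A rough timeline of the race (illustration only, assuming the variant
where the redo path drops the flag):

/*
 * CPU2 (not first in span)            CPU0 (first CPU of the span)
 * ------------------------            ----------------------------
 * should_we_balance() -> yes
 *   (CPU0 still busy)
 * cmpxchg 0->1 succeeds
 * balance attempt, all pinned         becomes idle
 *                                     should_we_balance() -> yes
 *                                     cmpxchg fails (CPU2 holds it),
 *                                     bails out
 * drops the flag, goto redo
 * should_we_balance() -> no
 *   (CPU0 is now the designated CPU)
 *
 * => neither CPU completes the balance for this interval.
 */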
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-12 10:37 ` Peter Zijlstra
2025-11-12 10:45 ` Peter Zijlstra
@ 2025-11-12 10:53 ` Shrikanth Hegde
1 sibling, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-12 10:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, Chen Yu, Doug Nelson, Mohini Narkhede,
linux-kernel, Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Linus Torvalds
On 11/12/25 4:07 PM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
>>> group = sched_balance_find_src_group(&env);
>>> if (!group) {
>>> schedstat_inc(sd->lb_nobusyg[idle]);
>>> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>> if (!cpumask_subset(cpus, env.dst_grpmask)) {
>>> env.loop = 0;
>>> env.loop_break = SCHED_NR_MIGRATE_BREAK;
>>> + if (need_unlock)
>>> + atomic_set_release(&sched_balance_running, 0);
>>> +
>>
>> One nit:
>> While the current code is good, would conditionally resetting
>> need_unlock just after resetting the atomic variable be better than
>> the unconditional reset that we do now?
>
> Right, I had the same thought when I grabbed the patch yesterday, but
> ignored it.
>
> But perhaps something like so?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,14 +11717,20 @@ static int sched_balance_rq(int this_cpu
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> - bool need_unlock;
> + bool need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> + if (0) {
> redo:
> - need_unlock = false;
> + if (need_unlock) {
> + atomic_set_release(&sched_balance_running, 0);
> + need_unlock = false;
> + }
> + }
> +
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> goto out_balanced;
> @@ -11861,9 +11867,6 @@ static int sched_balance_rq(int this_cpu
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> goto redo;
> }
> goto out_all_pinned;
>
>
> ---
>
> The other option is something like this, but Linus hated on this pattern
> when we started with things.
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11692,6 +11692,8 @@ static void update_lb_imbalance_stat(str
> */
> static atomic_t sched_balance_running = ATOMIC_INIT(0);
>
> +DEFINE_FREE(balance_unlock, atomic_t *, if (_T) atomic_set_release(_T, 0));
> +
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11717,24 +11719,23 @@ static int sched_balance_rq(int this_cpu
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> - bool need_unlock;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> schedstat_inc(sd->lb_count[idle]);
>
> redo:
> - need_unlock = false;
> if (!should_we_balance(&env)) {
> *continue_balancing = 0;
> - goto out_balanced;
> + return 0;
> }
>
> + atomic_t *lock __free(balance_unlock) = NULL;
> if (sd->flags & SD_SERIALIZE) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> - goto out_balanced;
> - }
> - need_unlock = true;
> + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + return 0;
> +
> + lock = &sched_balance_running;
> }
>
> group = sched_balance_find_src_group(&env);
> @@ -11861,9 +11862,6 @@ static int sched_balance_rq(int this_cpu
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> goto redo;
> }
> goto out_all_pinned;
> @@ -11980,9 +11978,6 @@ static int sched_balance_rq(int this_cpu
> sd->balance_interval < sd->max_interval)
> sd->balance_interval *= 2;
> out:
> - if (need_unlock)
> - atomic_set_release(&sched_balance_running, 0);
> -
> return ld_moved;
> }
>
The second one is difficult to understand.
Is this an option too?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8e1df9c5199..64e2aeacd65e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11720,14 +11720,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
@@ -11864,8 +11863,10 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
+ if (need_unlock) {
atomic_set_release(&sched_balance_running, 0);
+ need_unlock = false;
+ }
goto redo;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
@ 2025-11-14 12:19 ` tip-bot2 for Tim Chen
2025-11-15 20:56 ` Shrikanth Hegde
0 siblings, 1 reply; 25+ messages in thread
From: tip-bot2 for Tim Chen @ 2025-11-14 12:19 UTC (permalink / raw)
To: linux-tip-commits
Cc: Tim Chen, Peter Zijlstra (Intel), Chen Yu, Vincent Guittot,
Shrikanth Hegde, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
Gitweb: https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
Author: Tim Chen <tim.c.chen@linux.intel.com>
AuthorDate: Mon, 10 Nov 2025 10:47:35 -08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
sched/fair: Skip sched_balance_running cmpxchg when balance is not due
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.
Currently, each sched group leader directly under a NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.
This patch reduces unnecessary cmpxchg() operations by first checking
via should_we_balance() that this CPU is the designated balancer for
the NUMA domain, and that the balance interval has expired, before
trying to acquire sched_balance_running to load balance that NUMA
domain.
On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were spent in
sched_balance_domains() contending on sched_balance_running before
this change.
: 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
: 105 {
: 106 return arch_cmpxchg(&v->counter, old, new);
0.00 : ffffffff81326e6c: xor %eax,%eax
0.00 : ffffffff81326e6e: mov $0x1,%ecx
0.00 : ffffffff81326e73: lock cmpxchg %ecx,0x2394195(%rip) # ffffffff836bb010 <sched_balance_running>
: 110 sched_balance_domains():
: 12234 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
99.39 : ffffffff81326e7b: test %eax,%eax
0.00 : ffffffff81326e7d: jne ffffffff81326e99 <sched_balance_domains+0x209>
: 12238 if (time_after_eq(jiffies, sd->last_balance + interval)) {
0.00 : ffffffff81326e7f: mov 0x14e2b3a(%rip),%rax # ffffffff828099c0 <jiffies_64>
0.00 : ffffffff81326e86: sub 0x48(%r14),%rax
0.00 : ffffffff81326e8a: cmp %rdx,%rax
After applying this fix, sched_balance_domains() is gone from the profile
and there is a 5% throughput improvement.
[peterz: made it so that redo retains the 'lock' and split out the
CPU_NEWLY_IDLE change to a separate patch]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
---
kernel/sched/fair.c | 53 ++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d6..5feb5a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11681,6 +11681,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
}
/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
+/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
*/
@@ -11705,6 +11720,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
@@ -11716,6 +11732,13 @@ redo:
goto out_balanced;
}
+ if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
+ if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ goto out_balanced;
+
+ need_unlock = true;
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11956,6 +11979,9 @@ out_one_pinned:
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (need_unlock)
+ atomic_set_release(&sched_balance_running, 0);
+
return ld_moved;
}
@@ -12081,21 +12107,6 @@ out_unlock:
}
/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- * is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- * execution, as non-SD_SERIALIZE domains will still be
- * load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
-/*
* Scale the max sched_balance_rq interval with the number of CPUs in the system.
* This trades load-balance latency on larger machines for less cross talk.
*/
@@ -12150,7 +12161,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12174,13 +12185,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
interval = get_sd_balance_interval(sd, busy);
-
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -12194,9 +12198,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
-out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Tim Chen
@ 2025-11-15 20:56 ` Shrikanth Hegde
2025-11-17 18:55 ` Tim Chen
2025-11-17 19:06 ` Borislav Petkov
0 siblings, 2 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-11-15 20:56 UTC (permalink / raw)
To: linux-kernel, Peter Zijlstra (Intel)
Cc: Tim Chen, linux-tip-commits, Chen Yu, Vincent Guittot,
K Prateek Nayak, Srikar Dronamraju, Mohini Narkhede, x86
Hi Peter.
On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> Gitweb: https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> Author: Tim Chen <tim.c.chen@linux.intel.com>
> AuthorDate: Mon, 10 Nov 2025 10:47:35 -08:00
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
>
> sched/fair: Skip sched_balance_running cmpxchg when balance is not due
>
>
> + if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
This should be atomic_cmpxchg_acquire?
I booted the system with the latest sched/core and it crashes at boot.
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc0000000001db57c
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
Modules linked in:
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
Call Trace:
[c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
[c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
[c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
[c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
[c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
[c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
[c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
[c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
Bisect pointed to:
git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
# first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
I wondered what is really different, since Tim's v4 boots fine.
The tip version uses the try variant instead; I think that is what breaks it,
since we are likely dereferencing 0 (NULL).
With this diff it boots fine.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aaa47ece6a8e..01814b10b833 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
}
if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
- if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
goto out_balanced;
need_unlock = true;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-15 20:56 ` Shrikanth Hegde
@ 2025-11-17 18:55 ` Tim Chen
2025-11-17 19:00 ` K Prateek Nayak
2025-11-18 9:54 ` Peter Zijlstra
2025-11-17 19:06 ` Borislav Petkov
1 sibling, 2 replies; 25+ messages in thread
From: Tim Chen @ 2025-11-17 18:55 UTC (permalink / raw)
To: Shrikanth Hegde, linux-kernel, Peter Zijlstra (Intel)
Cc: linux-tip-commits, Chen Yu, Vincent Guittot, K Prateek Nayak,
Srikar Dronamraju, Mohini Narkhede, x86
On Sun, 2025-11-16 at 02:26 +0530, Shrikanth Hegde wrote:
> Hi Peter.
>
> On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Gitweb: https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Author: Tim Chen <tim.c.chen@linux.intel.com>
> > AuthorDate: Mon, 10 Nov 2025 10:47:35 -08:00
> > Committer: Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
> >
> > sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> >
> >
>
>
> > + if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> > + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
> This should be atomic_cmpxchg_acquire?
>
> I booted the system with the latest sched/core and it crashes at boot.
>
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc0000000001db57c
> Oops: Kernel access of bad area, sig: 7 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in:
> CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
> NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
> LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
> Call Trace:
> [c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
> [c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
> [c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
> [c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
> [c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
> [c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
> [c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
> [c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
>
>
> Bisect pointed to:
> git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> # first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
>
>
> I wondered what is really different, since Tim's v4 boots fine.
> The tip version uses the try variant instead; I think that is what breaks it,
> since we are likely dereferencing 0.
>
>
> With this diff it boots fine.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aaa47ece6a8e..01814b10b833 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> }
>
> if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
is "int old". So the above check would result in NULL pointer access. Probably have
to do something like the following to use atomic_try_cmpxchg_acquire()
int zero = 0;
if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
Otherwise we should do atomic_cmpxchg_acquire() as below
> + if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
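To illustrate (a minimal userspace sketch modeling the two kernel APIs with
GCC's __atomic builtins; the model_* names are mine, not kernel code):

#include <stdbool.h>

/*
 * Model of atomic_try_cmpxchg_acquire(): "old" is a pointer to the
 * expected value and is dereferenced unconditionally, so a literal 0
 * passed for it is taken as a null pointer constant and we read from
 * address 0 -- the crash in the trace above.
 */
static bool model_try_cmpxchg(int *v, int *old, int new)
{
	return __atomic_compare_exchange_n(v, old, new, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/*
 * Model of atomic_cmpxchg_acquire(): "old" is passed by value and the
 * previous value is returned, so a literal 0 is perfectly fine here.
 */
static int model_cmpxchg(int *v, int old, int new)
{
	__atomic_compare_exchange_n(v, &old, new, false,
				    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
	return old;	/* on failure, updated to the current value */
}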
Tim
> goto out_balanced;
>
> need_unlock = true;
>
>
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-17 18:55 ` Tim Chen
@ 2025-11-17 19:00 ` K Prateek Nayak
2025-11-27 14:09 ` Peter Zijlstra
2025-11-18 9:54 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-11-17 19:00 UTC (permalink / raw)
To: Tim Chen, Shrikanth Hegde, linux-kernel, Peter Zijlstra (Intel)
Cc: linux-tip-commits, Chen Yu, Vincent Guittot, Srikar Dronamraju,
Mohini Narkhede, x86
On 11/18/2025 12:25 AM, Tim Chen wrote:
>> I wondered what is really different, since Tim's v4 boots fine.
>> The tip version uses the try variant instead; I think that is what breaks it,
>> since we are likely dereferencing 0.
>>
>>
>> With this diff it boots fine.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aaa47ece6a8e..01814b10b833 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>> }
>>
>> if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
>> - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
> The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> is "int old". So the above check would result in NULL pointer access. Probably have
> to do something like the following to use atomic_try_cmpxchg_acquire()
>
> int zero = 0;
> if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
Peter seems to have refreshed tip:sched/core with the above, but is
there any advantage of using atomic_try_cmpxchg_acquire() as
opposed to plain old atomic_cmpxchg_acquire() and then checking
the old value it returns?
That zero variable serves no other purpose and is a bit of an
eyesore IMO.
>
> Otherwise we should do atomic_cmpxchg_acquire() as below
>
>> + if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
--
Thanks and Regards,
Prateek
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-17 19:00 ` K Prateek Nayak
@ 2025-11-27 14:09 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-27 14:09 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Tim Chen, Shrikanth Hegde, linux-kernel, linux-tip-commits,
Chen Yu, Vincent Guittot, Srikar Dronamraju, Mohini Narkhede, x86
On Tue, Nov 18, 2025 at 12:30:36AM +0530, K Prateek Nayak wrote:
> On 11/18/2025 12:25 AM, Tim Chen wrote:
> >> I wondered what is really different, since Tim's v4 boots fine.
> >> The tip version uses the try variant instead; I think that is what breaks it,
> >> since we are likely dereferencing 0.
> >>
> >>
> >> With this diff it boots fine.
> >>
> >> ---
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index aaa47ece6a8e..01814b10b833 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >> }
> >>
> >> if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> >> - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access. Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> >
> > int zero = 0;
> > if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
>
> Peter seems to have refreshed tip:sched/core with above but is
> there any advantage of using atomic_try_cmpxchg_acquire() as
> opposed to plain old atomic_cmpxchg_acquire() and then checking
> the old value it returns?
>
> That zero variable serves no other purpose and is a bit of an
> eyesore IMO.
Yeah, it's not ideal. It should generate slightly saner code, since the
compiler can now use the condition codes set by cmpxchg instead of
having to do an extra compare.
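For reference, the case where the try variant genuinely pays off is a retry
loop (a sketch against the kernel atomic API, not code from this patch): on
failure the expected value has already been refreshed, so the loop needs
neither a re-read nor a separate compare.

/* Hypothetical lock-free increment using the try_cmpxchg idiom. */
static void inc_model(atomic_t *v)
{
	int old = atomic_read(v);

	do {
		/* attempt to install old + 1; on failure old is updated */
	} while (!atomic_try_cmpxchg(v, &old, old + 1));
}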
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-17 18:55 ` Tim Chen
2025-11-17 19:00 ` K Prateek Nayak
@ 2025-11-18 9:54 ` Peter Zijlstra
2025-11-18 9:56 ` Peter Zijlstra
2025-11-21 6:26 ` Nathan Chancellor
1 sibling, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-18 9:54 UTC (permalink / raw)
To: Tim Chen
Cc: Shrikanth Hegde, linux-kernel, linux-tip-commits, Chen Yu,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86
On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
> > if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
> The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> is "int old". So the above check would result in NULL pointer access. Probably have
> to do something like the following to use atomic_try_cmpxchg_acquire()
>
> int zero = 0;
> if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
>
> Otherwise we should do atomic_cmpxchg_acquire() as below
Yes, and I'm all mightily miffed all the compilers accept 0 (which is
int) for an 'int *' argument without so much as a warning :/
Nathan, you looked into this a bit yesterday, afaict there is:
-Wzero-as-null-pointer-constant
which is supposed to issue a warn here, but I can't get clang-22 to
object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
that's the problem?).
And then there is modernize-use-nullptr, but that objects to using NULL,
although I suppose we could do:
#define NULL nullptr
to get around that. Except I also cannot get clangd to report on the
issue.
Help?
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-18 9:54 ` Peter Zijlstra
@ 2025-11-18 9:56 ` Peter Zijlstra
2025-11-21 6:26 ` Nathan Chancellor
1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-18 9:56 UTC (permalink / raw)
To: Tim Chen, nathan
Cc: Shrikanth Hegde, linux-kernel, linux-tip-commits, Chen Yu,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86
And now with Nathan as a recipient...
On Tue, Nov 18, 2025 at 10:54:33AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
>
> > > if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access. Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> >
> > int zero = 0;
> > if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> >
> > Otherwise we should do atomic_cmpxchg_acquire() as below
>
> Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> int) for an 'int *' argument without so much as a warning :/
>
> Nathan, you looked into this a bit yesterday, afaict there is:
>
> -Wzero-as-null-pointer-constant
>
> which is supposed to issue a warn here, but I can't get clang-22 to
> object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> that's the problem?).
>
> And then there is modernize-use-nullptr, but that objects to using NULL,
> although I suppose we could do:
>
> #define NULL nullptr
>
> to get around that. Except I also cannot get clangd to report on the
> issue.
>
> Help?
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-18 9:54 ` Peter Zijlstra
2025-11-18 9:56 ` Peter Zijlstra
@ 2025-11-21 6:26 ` Nathan Chancellor
2025-11-21 9:00 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: Nathan Chancellor @ 2025-11-21 6:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Shrikanth Hegde, linux-kernel, linux-tip-commits,
Chen Yu, Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86
On Tue, Nov 18, 2025 at 10:54:32AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
>
> > > if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access. Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> >
> > int zero = 0;
> > if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> >
> > Otherwise we should do atomic_cmpxchg_acquire() as below
>
> Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> int) for an 'int *' argument without so much as a warning :/
The C11 standard says in 6.3.2.3p3
An integer constant expression with the value 0, or such an expression
cast to type void *, is called a null pointer constant.
which seems to indicate to me that
int *foo = 0;
and
#define NULL (void *)0
int *foo = NULL;
have to be treated the same way :/ I think that is a big part of the
motivation to bring nullptr into C in C23:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3042.htm
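A quick way to see that in action (hedged: this is the usual default
behaviour of GCC and clang, with no extra -W flags):

void f(int *p);

void g(void)
{
	f(0);		/* silently accepted: null pointer constant     */
	f(1);		/* -Wint-conversion: makes pointer from integer */
	f((void *)0);	/* also silent: the classic NULL spelling       */
}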
> Nathan, you looked into this a bit yesterday, afaict there is:
>
> -Wzero-as-null-pointer-constant
>
> which is supposed to issue a warn here, but I can't get clang-22 to
> object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> that's the problem?).
Right, it appears to be the same case for clang, notice the comment in
diagnoseZeroToNullptrConversion():
https://github.com/llvm/llvm-project/commit/d7ba86b6bf54740dd4007e65a927151cb9f510b4
That warning should probably be updated to work for C23 but that does
not really help us now because nullptr is not available in older
standards (and I think the support for C23 is only solid in really
recent compilers IIUC).
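For completeness, the C23 spelling, assuming a toolchain where that support
is actually solid:

void f(int *p);

void g(void)
{
	f(nullptr);	/* C23 keyword: a true null pointer, never an int */
}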
> Help?
Maybe we could have something like -Wnon-literal-null-conversion-strict
in clang that would behave like -Wnon-literal-null-conversion but warn
even in the literal zero conversion case (i.e., require a 'void *'
cast)... That does not really help GCC though since it does not warn on
any case of implicit conversion to NULL:
https://godbolt.org/z/M5WE5covz
Cheers,
Nathan
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-21 6:26 ` Nathan Chancellor
@ 2025-11-21 9:00 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-11-21 9:00 UTC (permalink / raw)
To: Nathan Chancellor
Cc: Tim Chen, Shrikanth Hegde, linux-kernel, linux-tip-commits,
Chen Yu, Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86
On Thu, Nov 20, 2025 at 11:26:00PM -0700, Nathan Chancellor wrote:
> On Tue, Nov 18, 2025 at 10:54:32AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
> >
> > > > if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > > - if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > >
> > > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > > is "int old". So the above check would result in NULL pointer access. Probably have
> > > to do something like the following to use atomic_try_cmpxchg_acquire()
> > >
> > > int zero = 0;
> > > if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> > >
> > > Otherwise we should do atomic_cmpxchg_acquire() as below
> >
> > Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> > int) for an 'int *' argument without so much as a warning :/
>
> The C11 standard says in 6.3.2.3p3
>
> An integer constant expression with the value 0, or such an expression
> cast to type void *, is called a null pointer constant.
That's just bloody ludicrous :-(, I mean, the explicit cast to void*
sure, but the implicit conversion is just idiotic. I realize there's
legacy here, but urgh.
> which seems to indicate to me that
>
> int *foo = 0;
>
> and
>
> #define NULL (void *)0
> int *foo = NULL;
>
> have to be treated the same way :/ I think that is a big part of the
> motivation to bring nullptr into C in C23:
>
> https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3042.htm
Even without that, just dropping the implicit conversion is a giant step
forward.
> > Nathan, you looked into this a bit yesterday, afaict there is:
> >
> > -Wzero-as-null-pointer-constant
> >
> > which is supposed to issue a warn here, but I can't get clang-22 to
> > object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> > that's the problem?).
>
> Right, it appears to be the same case for clang, notice the comment in
> diagnoseZeroToNullptrConversion():
>
> https://github.com/llvm/llvm-project/commit/d7ba86b6bf54740dd4007e65a927151cb9f510b4
>
> That warning should probably be updated to work for C23 but that does
> not really help us now because nullptr is not available in older
> standards (and I think the support for C23 is only solid in really
> recent compilers IIUC).
So personally I really don't see a problem with '(void *)0'; what, if
anything, does nullptr actually bring over that?
> > Help?
>
> Maybe we could have something like -Wnon-literal-null-conversion-strict
> in clang that would behave like -Wnon-literal-null-conversion but warn
> even in the literal zero conversion case (i.e., require a 'void *'
> cast)... That does not really help GCC though since it does not warn on
> any case of implicit conversion to NULL:
Yes, that makes sense. Perhaps we can even convince GCC folks to also
implement this ;-)
Just having it in clang would mean clangd will have the warning, and thus
all the LSP-enabled editors will show it, which is a win.
* Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
2025-11-15 20:56 ` Shrikanth Hegde
2025-11-17 18:55 ` Tim Chen
@ 2025-11-17 19:06 ` Borislav Petkov
1 sibling, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2025-11-17 19:06 UTC (permalink / raw)
To: Shrikanth Hegde, Peter Zijlstra (Intel)
Cc: linux-kernel, Tim Chen, linux-tip-commits, Chen Yu,
Vincent Guittot, K Prateek Nayak, Srikar Dronamraju,
Mohini Narkhede, x86
On Sun, Nov 16, 2025 at 02:26:13AM +0530, Shrikanth Hegde wrote:
>
> Hi Peter.
>
> On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> > The following commit has been merged into the sched/core branch of tip:
> >
> > Commit-ID: 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Gitweb: https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Author: Tim Chen <tim.c.chen@linux.intel.com>
> > AuthorDate: Mon, 10 Nov 2025 10:47:35 -08:00
> > Committer: Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
> >
> > sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> >
> >
>
> > + if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> > + if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>
> This should be atomic_cmpxchg_acquire?
>
> I booted the system with the latest sched/core and it crashes at boot.
>
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc0000000001db57c
> Oops: Kernel access of bad area, sig: 7 [#1]
> LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in:
> CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
> NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
> LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
> Call Trace:
> [c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
> [c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
> [c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
> [c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
> [c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
> [c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
> [c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
> [c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
>
>
> Bisect pointed to:
> git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> # first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Dammit, I spent a whole day bisecting exactly the same issue and I missed your
mail.
Oh well, it is fixed now... should pay more attention next time.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
end of thread, other threads: [~2025-11-27 14:09 UTC | newest]
Thread overview: 25+ messages
2025-11-10 18:47 [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due Tim Chen
2025-11-11 6:24 ` Shrikanth Hegde
2025-11-12 8:02 ` Srikar Dronamraju
2025-11-12 10:37 ` Peter Zijlstra
2025-11-12 10:45 ` Peter Zijlstra
2025-11-12 11:09 ` Shrikanth Hegde
2025-11-12 11:21 ` Peter Zijlstra
2025-11-12 21:10 ` Tim Chen
2025-11-13 4:25 ` Shrikanth Hegde
2025-11-13 17:49 ` Tim Chen
2025-11-12 11:25 ` Srikar Dronamraju
2025-11-12 13:39 ` Peter Zijlstra
2025-11-12 13:44 ` Peter Zijlstra
2025-11-12 16:02 ` Srikar Dronamraju
2025-11-12 10:53 ` Shrikanth Hegde
2025-11-14 12:19 ` [tip: sched/core] " tip-bot2 for Tim Chen
2025-11-15 20:56 ` Shrikanth Hegde
2025-11-17 18:55 ` Tim Chen
2025-11-17 19:00 ` K Prateek Nayak
2025-11-27 14:09 ` Peter Zijlstra
2025-11-18 9:54 ` Peter Zijlstra
2025-11-18 9:56 ` Peter Zijlstra
2025-11-21 6:26 ` Nathan Chancellor
2025-11-21 9:00 ` Peter Zijlstra
2025-11-17 19:06 ` Borislav Petkov