Date: Tue, 2 Dec 2025 08:54:19 +0100
From: Ingo Molnar
To: Shrikanth Hegde
Cc: peterz@infradead.org, vincent.guittot@linaro.org,
	linux-kernel@vger.kernel.org, kprateek.nayak@amd.com,
	dietmar.eggemann@arm.com, vschneid@redhat.com, rostedt@goodmis.org,
	tglx@linutronix.de, tim.c.chen@linux.intel.com, Frederic Weisbecker
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
References: <20251201183146.74443-1-sshegde@linux.ibm.com>
 <20251201183146.74443-5-sshegde@linux.ibm.com>
 <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>
In-Reply-To: <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>

* Shrikanth Hegde wrote:

> > That the nr_cpus modification is an atomic op doesn't change the
> > situation much in terms of cacheline bouncing, because the
> > cacheline dirtying will still cause comparable levels of bouncing
> > on modern CPUs with modern cache coherency protocols.
> >
> > If idle_cpus_mask and nohz.nr_cpus are in separate cachelines,
> > then this patch might eliminate about half of the bounces - but
> > AFAICS they are right next to each other, so unless it's off-stack
> > cpumasks, they should be in the same cacheline. Half of 'bad
> > bouncing' is still kinda 'bad bouncing'. :-)
>
> You are right. If we have to get rid of cacheline bouncing then we
> need to fix nohz.idle_cpus_mask too.
>
> I forgot about CPUMASK_OFFSTACK.
>
> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and nr_cpus are in
> the same cacheline, right? The data in the cover letter is with =y.
> In that case, switching to cpumask_empty() will give minimal gains,
> by removing an additional atomic inc/dec operation.
>
> If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
> In that case the gains should be better. Very likely our performance
> team would have tested with =n. IIRC, on powerpc we select it based
> on NR_CPUS. On x86 the NR_CPUS default depends on it.

Well, it's the other way around: in the 'off stack' case the
cpumask_var_t is moved "off the stack" because it's too large - i.e.
we allocate it separately, and it ends up in a separate cacheline as
a side effect.

Even if the main cpumask pointer is next to nohz.nr_cpus, the mask
itself is behind an indirect pointer, see include/linux/cpumask.h:

  #ifdef CONFIG_CPUMASK_OFFSTACK
  typedef struct cpumask *cpumask_var_t;
  #else
  typedef struct cpumask cpumask_var_t[1];
  #endif

Note that idle_cpus_mask is defined as a cpumask_var_t and is thus
affected by CONFIG_CPUMASK_OFFSTACK and may be allocated dynamically:

  kernel/sched/fair.c:	cpumask_var_t idle_cpus_mask;
  ...
  kernel/sched/fair.c:	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);

So I think it's quite possible that the performance measurements were
done with CONFIG_CPUMASK_OFFSTACK=y: it's commonly enabled in
server/enterprise distros - and even Ubuntu enables it on their
desktop kernel - so the reduction in cacheline ping-pong is probably
real and the change makes sense in that context.

But even with OFFSTACK=n, if NR_CPUS=512 it's possible that a fair
chunk of the cpumask ends up on a previous (separate) cacheline from
where nr_cpus is, with a resulting observable reduction of the cache
bouncing rate:

  static struct {
	cpumask_var_t	idle_cpus_mask;
	atomic_t	nr_cpus;

Note that since 'nohz' is ____cacheline_aligned, in the NR_CPUS=512
case idle_cpus_mask will take up a full cacheline, and nr_cpus will
always be on a separate cacheline. If CONFIG_NR_CPUS is 128 or
smaller, then idle_cpus_mask and nr_cpus will be on the same
cacheline.
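( BTW., for layout questions like this I find a quick userspace
  mock-up handy. The struct below only imitates the one in
  kernel/sched/fair.c - atomic_t approximated by int,
  ____cacheline_aligned by aligned(64) - so treat it as a sketch,
  not as authoritative:

	/* layout.c - mock of the nohz layout question above.
	 * Build with: gcc -DNR_CPUS=512 layout.c && ./a.out
	 */
	#include <stdio.h>
	#include <stddef.h>

	#ifndef NR_CPUS
	#define NR_CPUS 512
	#endif

	/* OFFSTACK=n flavor: the mask is embedded directly: */
	struct cpumask {
		unsigned long bits[NR_CPUS / (8 * sizeof(long))];
	};

	struct nohz_mock {
		struct cpumask	idle_cpus_mask;
		int		nr_cpus;	/* stands in for atomic_t */
		int		has_blocked;
	} __attribute__((aligned(64)));		/* ____cacheline_aligned */

	int main(void)
	{
		printf("idle_cpus_mask: offset %3zu, size %3zu bytes\n",
		       offsetof(struct nohz_mock, idle_cpus_mask),
		       sizeof(struct cpumask));
		printf("nr_cpus:        offset %3zu => cacheline #%zu\n",
		       offsetof(struct nohz_mock, nr_cpus),
		       offsetof(struct nohz_mock, nr_cpus) / 64);
		return 0;
	}

  With -DNR_CPUS=512 it reports nr_cpus on cacheline #1; with
  -DNR_CPUS=128 it reports cacheline #0 - matching the above. )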
Anyway, if the reduction in cache ping-pong is higher than 50%, then
either something weird is going on, or I'm missing something. :-)

But the measurement data you provided:

  baseline: tip sched/core at 3eb593560146

    1.01%  [k] nohz_balance_exit_idle
    0.31%  [k] nohz_balancer_kick
    0.05%  [k] nohz_balance_enter_idle

  With series:

    0.45%  [k] nohz_balance_exit_idle
    0.18%  [k] nohz_balancer_kick
    0.01%  [k] nohz_balance_enter_idle

... is roughly in the 50% reduction range, if profiled overhead is a
good proxy for cache bounce overhead (which it may be), which
supports my hypothesis that the tests were run with
CONFIG_CPUMASK_OFFSTACK=y and that the cache ping-pong rate in these
functions got roughly halved.

BTW., I'd expect _nohz_idle_balance() to show up in the profile too.

> arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
> powerpc/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS >= 8192
> x86/Kconfig:		select CPUMASK_OFFSTACK
> x86/Kconfig:		default 8192 if SMP && CPUMASK_OFFSTACK
> x86/Kconfig:		default  512 if SMP && !CPUMASK_OFFSTACK

Yeah, we make the cpumask a direct mask up to 512 bits (64 bytes) -
it's allocated indirectly from that point onwards.
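( And just to make sure we're talking about the same end result,
  here's the gist of the conversion in a trivial userspace mock -
  invented names, a plain bitmap instead of cpumask_var_t, no memory
  ordering - i.e. a sketch of the idea, not of the actual patch:

	/* nohz_mock.c - the counter is redundant, the mask has it all */
	#include <stdio.h>
	#include <stdbool.h>

	static unsigned long idle_cpus_mask;	/* nohz.idle_cpus_mask stand-in */

	static void nohz_balance_enter_idle(int cpu)
	{
		idle_cpus_mask |= 1UL << cpu;
		/* old scheme would also do: atomic_inc(&nohz.nr_cpus); */
	}

	static void nohz_balance_exit_idle(int cpu)
	{
		idle_cpus_mask &= ~(1UL << cpu);
		/* old scheme would also do: atomic_dec(&nohz.nr_cpus); */
	}

	static bool nohz_balancer_kick(void)
	{
		/* old scheme: if (!atomic_read(&nohz.nr_cpus)) return false; */
		if (!idle_cpus_mask)		/* cpumask_empty() equivalent */
			return false;
		return true;
	}

	int main(void)
	{
		nohz_balance_enter_idle(3);
		printf("kick: %d, weight: %d\n", nohz_balancer_kick(),
		       __builtin_popcountl(idle_cpus_mask));
		nohz_balance_exit_idle(3);
		printf("kick: %d, weight: %d\n", nohz_balancer_kick(),
		       __builtin_popcountl(idle_cpus_mask));
		return 0;
	}

  I.e. nr_cpus is always derivable as cpumask_weight(), and the only
  thing the kick path really needs is a cpumask_empty() check. )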
> In either case, if we think:
>
>     nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>
> Since it is not a correctness issue here, at worst we will lose a
> chance to do an idle load balance.

Yeah, I don't think it's a correctness issue: removing nr_cpus should
not change the ordering of modifications to nohz.idle_cpus_mask and
nohz.has_blocked. ( The nohz.nr_cpus and nohz.idle_cpus_mask
modifications were not ordered against each other previously to begin
with - they are only ordered against nohz.has_blocked. )

> Let me re-write the changelog. I'll also look into it a bit more.

Thank you!

Note that the fundamental scalability challenge with the
nohz_balancer_kick(), nohz_balance_enter_idle() and
nohz_balance_exit_idle() functions is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or
     nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
     any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() modify
     (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask) and
     nohz.has_blocked.

The characteristic frequencies are the following:

 (1) happens at scheduler (busy-)tick frequency on every CPU. This is
     a relatively constant frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes
     idle. This is workload dependent, but can easily be hundreds of
     kHz for IO-bound loads and high CPU counts. Ie. it can be orders
     of magnitude higher than (1), in which case a cachemiss at every
     invocation of (1) is almost inevitable.

     [ Ie. the cost of getting really long NOHZ idling times is the
       extra overhead of the exit/enter nohz cycles for partially
       idle CPUs on high-rate IO workloads. ]

There are two types of costs from these functions:

 (A) Scheduler-tick cost via (1): this happens on busy CPUs too, and
     is thus a primary scalability cost. But the rate here is
     constant and typically much lower than that of (2), hence the
     absolute benefit to workload scalability will be lower as well.

 (B) Idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more
     than they impact scalability. (Ie. while 'wasting idle time'
     isn't good, it often doesn't hurt scalability, at least as long
     as it's done for a good reason and done in moderation.) But in
     terms of absolute cost this scales up with nr_cpus as well, at a
     much faster rate, and thus may also approach and negatively
     impact system limits like memory bus/fabric bandwidth.

So I'd argue that reductions in both (A) and (B) are useful, but for
different reasons.

The *real* breakthrough in this area would be to reduce the unbounded
upwards frequency of (2), by fundamentally changing the model of NOHZ
idle balancing:

For example, by measuring the rate (frequency) of idle cycles on each
CPU (this can be done without any cross-CPU logic), we would turn off
NOHZ-idle for that CPU when the rate goes beyond a threshold. The
resulting regular idle load-balancing passes will be rate-limited by
balance intervals and won't be as aggressive as
nohz_balance_enter+exit_idle(). (I hope...)

Truly idle CPUs would go into NOHZ mode automatically, as their
measured rate of idling drops below the threshold.

Thoughts?

	Ingo
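PS., to make that last idea a bit less hand-wavy, here's a toy model
of the rate measurement I have in mind - all names, the EWMA math and
the threshold are invented for illustration, no claim that this is
how it would look in fair.c:

	/* idle_rate.c - per-CPU idle-entry rate estimate + threshold */
	#include <stdio.h>
	#include <stdbool.h>

	#define RATE_THRESHOLD_HZ	1000.0	/* invented cutoff */

	struct cpu_idle_stats {
		double rate_hz;		/* EWMA of idle entries/sec */
		double last_enter;	/* previous idle-entry timestamp */
		bool nohz_enabled;
	};

	/* Called every time this CPU goes idle - purely CPU-local: */
	static void idle_enter(struct cpu_idle_stats *s, double now)
	{
		double delta = now - s->last_enter;

		s->last_enter = now;
		if (delta <= 0.0)
			return;

		/* EWMA: ~7/8 old estimate, ~1/8 instantaneous rate: */
		s->rate_hz = (7.0 * s->rate_hz + 1.0 / delta) / 8.0;

		/*
		 * CPUs flipping in and out of idle at a high rate stop
		 * participating in NOHZ idle balancing; truly idle CPUs
		 * (low rate) participate automatically:
		 */
		s->nohz_enabled = s->rate_hz < RATE_THRESHOLD_HZ;
	}

	int main(void)
	{
		struct cpu_idle_stats s = { .nohz_enabled = true };
		double t = 0.0;
		int i;

		for (i = 0; i < 20; i++) {	/* 10 kHz idle churn */
			t += 0.0001;
			idle_enter(&s, t);
		}
		printf("churny: rate %8.1f Hz, nohz: %d\n", s.rate_hz, s.nohz_enabled);

		for (i = 0; i < 20; i++) {	/* 10 Hz, mostly idle */
			t += 0.1;
			idle_enter(&s, t);
		}
		printf("idle:   rate %8.1f Hz, nohz: %d\n", s.rate_hz, s.nohz_enabled);
		return 0;
	}

The churny CPU's estimated rate settles far above the threshold and it
drops out of NOHZ idle balancing; once it's mostly idle the estimate
decays below the threshold and it rejoins automatically.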