Date: Tue, 2 Dec 2025 08:54:19 +0100
From: Ingo Molnar
To: Shrikanth Hegde
Cc: peterz@infradead.org, vincent.guittot@linaro.org,
	linux-kernel@vger.kernel.org, kprateek.nayak@amd.com,
	dietmar.eggemann@arm.com, vschneid@redhat.com, rostedt@goodmis.org,
	tglx@linutronix.de, tim.c.chen@linux.intel.com, Frederic Weisbecker
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead
References: <20251201183146.74443-1-sshegde@linux.ibm.com>
 <20251201183146.74443-5-sshegde@linux.ibm.com>
 <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>
In-Reply-To: <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>

* Shrikanth Hegde wrote:

> > That the nr_cpus modification is an atomic op doesn't change the
> > situation much in terms of cacheline bouncing, because the
> > cacheline dirtying will still cause comparable levels of bouncing
> > on modern CPUs with modern cache coherency protocols.
> >
> > If idle_cpus_mask and nohz.nr_cpus are in separate cachelines,
> > then this patch might eliminate about half of the bounces - but
> > AFAICS they are right next to each other, so unless it's off-stack
> > cpumasks, they should be in the same cacheline. Half of 'bad
> > bouncing' is still kinda 'bad bouncing'. :-)
>
> You are right. If we have to get rid of cacheline bouncing then we
> need to fix nohz.idle_cpus_mask too.
>
> I forgot about CPUMASK_OFFSTACK.
>
> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and nr_cpus are in
> the same cacheline, right? The data in the cover letter is with =y.
> In that case, switching to cpumask_empty() will give minimal gains,
> by removing an additional atomic inc/dec operation.
>
> If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
> In that case the gains should be better. Very likely our performance
> team would have tested with =n. IIRC, on powerpc we select it based
> on NR_CPUS. On x86 the NR_CPUS default depends on it.

Well, it's the other way around: in the 'off stack' case the
cpumask_var_t is moved "off the stack" because it's too large - i.e.
we allocate it separately, and it ends up in a separate cacheline as
a side effect.

Even if the main cpumask pointer is next to nohz.nr_cpus, the mask
itself is behind an indirect pointer, see include/linux/cpumask.h:

  #ifdef CONFIG_CPUMASK_OFFSTACK
  typedef struct cpumask *cpumask_var_t;
  #else
  typedef struct cpumask cpumask_var_t[1];
  #endif

Note that idle_cpus_mask is defined as a cpumask_var_t and is thus
affected by CONFIG_CPUMASK_OFFSTACK and may be allocated dynamically:

  kernel/sched/fair.c:	cpumask_var_t idle_cpus_mask;
  ...
  kernel/sched/fair.c:	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);

So I think it's quite possible that the performance measurements were
done with CONFIG_CPUMASK_OFFSTACK=y: it's commonly enabled in
server/enterprise distros - and even Ubuntu enables it on their
desktop kernel - so the reduction in cacheline ping-pong is probably
real and the change makes sense in that context.

But even with OFFSTACK=n, if NR_CPUS=512 it's possible that a fair
chunk of the cpumask ends up on a previous (separate) cacheline from
where nr_cpus is, with a resulting observable reduction of the cache
bouncing rate:

  static struct {
	cpumask_var_t	idle_cpus_mask;
	atomic_t	nr_cpus;

Note that since 'nohz' is ____cacheline_aligned, in the NR_CPUS=512
case idle_cpus_mask will take up a full cacheline, and nr_cpus will
always be on a separate cacheline. If CONFIG_NR_CPUS is 128 or
smaller, then idle_cpus_mask and nr_cpus will be on the same
cacheline.
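( BTW., for layout questions like this I find a quick userspace
  mock-up handy. The struct below only imitates the one in
  kernel/sched/fair.c - atomic_t approximated by int,
  ____cacheline_aligned by aligned(64) - so treat it as a sketch,
  not as authoritative:

	/* layout.c - mock of the nohz layout question above.
	 * Build with: gcc -DNR_CPUS=512 layout.c && ./a.out
	 */
	#include <stdio.h>
	#include <stddef.h>

	#ifndef NR_CPUS
	#define NR_CPUS 512
	#endif

	/* OFFSTACK=n flavor: the mask is embedded directly: */
	struct cpumask {
		unsigned long bits[NR_CPUS / (8 * sizeof(long))];
	};

	struct nohz_mock {
		struct cpumask	idle_cpus_mask;
		int		nr_cpus;	/* stands in for atomic_t */
		int		has_blocked;
	} __attribute__((aligned(64)));		/* ____cacheline_aligned */

	int main(void)
	{
		printf("idle_cpus_mask: offset %3zu, size %3zu bytes\n",
		       offsetof(struct nohz_mock, idle_cpus_mask),
		       sizeof(struct cpumask));
		printf("nr_cpus:        offset %3zu => cacheline #%zu\n",
		       offsetof(struct nohz_mock, nr_cpus),
		       offsetof(struct nohz_mock, nr_cpus) / 64);
		return 0;
	}

  With -DNR_CPUS=512 it reports nr_cpus on cacheline #1; with
  -DNR_CPUS=128 it reports cacheline #0 - matching the above. )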
Anyway, if the reduction in cache ping-pong is higher than 50%, then
either something weird is going on, or I'm missing something. :-)

But the measurement data you provided:

  baseline: tip sched/core at 3eb593560146

    1.01%  [k] nohz_balance_exit_idle
    0.31%  [k] nohz_balancer_kick
    0.05%  [k] nohz_balance_enter_idle

  With series:

    0.45%  [k] nohz_balance_exit_idle
    0.18%  [k] nohz_balancer_kick
    0.01%  [k] nohz_balance_enter_idle

... is roughly in the 50% reduction range, if profiled overhead is a
good proxy for cache bounce overhead (which it may be), which
supports my hypothesis that the tests were run with
CONFIG_CPUMASK_OFFSTACK=y and that the cache ping-pong rate in these
functions got roughly halved.

BTW., I'd expect _nohz_idle_balance() to show up in the profile too.

> arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
> powerpc/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS >= 8192
> x86/Kconfig:		select CPUMASK_OFFSTACK
> x86/Kconfig:		default 8192 if SMP && CPUMASK_OFFSTACK
> x86/Kconfig:		default  512 if SMP && !CPUMASK_OFFSTACK

Yeah, we make the cpumask a direct mask up to 512 bits (64 bytes) -
it's allocated indirectly from that point onwards.
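( And just to make sure we're talking about the same end result,
  here's the gist of the conversion in a trivial userspace mock -
  invented names, a plain bitmap instead of cpumask_var_t, no memory
  ordering - i.e. a sketch of the idea, not of the actual patch:

	/* nohz_mock.c - the counter is redundant, the mask has it all */
	#include <stdio.h>
	#include <stdbool.h>

	static unsigned long idle_cpus_mask;	/* nohz.idle_cpus_mask stand-in */

	static void nohz_balance_enter_idle(int cpu)
	{
		idle_cpus_mask |= 1UL << cpu;
		/* old scheme would also do: atomic_inc(&nohz.nr_cpus); */
	}

	static void nohz_balance_exit_idle(int cpu)
	{
		idle_cpus_mask &= ~(1UL << cpu);
		/* old scheme would also do: atomic_dec(&nohz.nr_cpus); */
	}

	static bool nohz_balancer_kick(void)
	{
		/* old scheme: if (!atomic_read(&nohz.nr_cpus)) return false; */
		if (!idle_cpus_mask)		/* cpumask_empty() equivalent */
			return false;
		return true;
	}

	int main(void)
	{
		nohz_balance_enter_idle(3);
		printf("kick: %d, weight: %d\n", nohz_balancer_kick(),
		       __builtin_popcountl(idle_cpus_mask));
		nohz_balance_exit_idle(3);
		printf("kick: %d, weight: %d\n", nohz_balancer_kick(),
		       __builtin_popcountl(idle_cpus_mask));
		return 0;
	}

  I.e. nr_cpus is always derivable as cpumask_weight(), and the only
  thing the kick path really needs is a cpumask_empty() check. )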
> In either case, if we think:
>
>     nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>
> Since it is not a correctness issue here, at worst we will lose a
> chance to do an idle load balance.

Yeah, I don't think it's a correctness issue: removing nr_cpus should
not change the ordering of modifications to nohz.idle_cpus_mask and
nohz.has_blocked. ( The nohz.nr_cpus and nohz.idle_cpus_mask
modifications were not ordered against each other previously to begin
with - they are only ordered against nohz.has_blocked. )

> Let me re-write the changelog. I'll also look into it a bit more.

Thank you!

Note that the fundamental scalability challenge with the
nohz_balancer_kick(), nohz_balance_enter_idle() and
nohz_balance_exit_idle() functions is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or
     nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
     any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() modify
     (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask) and
     nohz.has_blocked.

The characteristic frequencies are the following:

 (1) happens at scheduler (busy-)tick frequency on every CPU. This is
     a relatively constant frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes
     idle. This is workload dependent, but can easily be hundreds of
     kHz for IO-bound loads and high CPU counts. Ie. it can be orders
     of magnitude higher than (1), in which case a cachemiss at every
     invocation of (1) is almost inevitable.

     [ Ie. the cost of getting really long NOHZ idling times is the
       extra overhead of the exit/enter nohz cycles for partially
       idle CPUs on high-rate IO workloads. ]

There are two types of costs from these functions:

 (A) Scheduler-tick cost via (1): this happens on busy CPUs too, and
     is thus a primary scalability cost. But the rate here is
     constant and typically much lower than that of (2), hence the
     absolute benefit to workload scalability will be lower as well.

 (B) Idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more
     than they impact scalability. (Ie. while 'wasting idle time'
     isn't good, it often doesn't hurt scalability, at least as long
     as it's done for a good reason and done in moderation.) But in
     terms of absolute cost this scales up with nr_cpus as well, at a
     much faster rate, and thus may also approach and negatively
     impact system limits like memory bus/fabric bandwidth.

So I'd argue that reductions in both (A) and (B) are useful, but for
different reasons.

The *real* breakthrough in this area would be to reduce the unbounded
upwards frequency of (2), by fundamentally changing the model of NOHZ
idle balancing:

For example, by measuring the rate (frequency) of idle cycles on each
CPU (this can be done without any cross-CPU logic), we would turn off
NOHZ-idle for that CPU when the rate goes beyond a threshold. The
resulting regular idle load-balancing passes will be rate-limited by
balance intervals and won't be as aggressive as
nohz_balance_enter+exit_idle(). (I hope...)

Truly idle CPUs would go into NOHZ mode automatically, as their
measured rate of idling drops below the threshold.

Thoughts?

	Ingo
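PS., to make that last idea a bit less hand-wavy, here's a toy model
of the rate measurement I have in mind - all names, the EWMA math and
the threshold are invented for illustration, no claim that this is
how it would look in fair.c:

	/* idle_rate.c - per-CPU idle-entry rate estimate + threshold */
	#include <stdio.h>
	#include <stdbool.h>

	#define RATE_THRESHOLD_HZ	1000.0	/* invented cutoff */

	struct cpu_idle_stats {
		double rate_hz;		/* EWMA of idle entries/sec */
		double last_enter;	/* previous idle-entry timestamp */
		bool nohz_enabled;
	};

	/* Called every time this CPU goes idle - purely CPU-local: */
	static void idle_enter(struct cpu_idle_stats *s, double now)
	{
		double delta = now - s->last_enter;

		s->last_enter = now;
		if (delta <= 0.0)
			return;

		/* EWMA: ~7/8 old estimate, ~1/8 instantaneous rate: */
		s->rate_hz = (7.0 * s->rate_hz + 1.0 / delta) / 8.0;

		/*
		 * CPUs flipping in and out of idle at a high rate stop
		 * participating in NOHZ idle balancing; truly idle CPUs
		 * (low rate) participate automatically:
		 */
		s->nohz_enabled = s->rate_hz < RATE_THRESHOLD_HZ;
	}

	int main(void)
	{
		struct cpu_idle_stats s = { .nohz_enabled = true };
		double t = 0.0;
		int i;

		for (i = 0; i < 20; i++) {	/* 10 kHz idle churn */
			t += 0.0001;
			idle_enter(&s, t);
		}
		printf("churny: rate %8.1f Hz, nohz: %d\n", s.rate_hz, s.nohz_enabled);

		for (i = 0; i < 20; i++) {	/* 10 Hz, mostly idle */
			t += 0.1;
			idle_enter(&s, t);
		}
		printf("idle:   rate %8.1f Hz, nohz: %d\n", s.rate_hz, s.nohz_enabled);
		return 0;
	}

The churny CPU's estimated rate settles far above the threshold and it
drops out of NOHZ idle balancing; once it's mostly idle the estimate
decays below the threshold and it rejoins automatically.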