Date: Tue, 24 Feb 2026 19:43:10 -0600
From: Kyle Meyer
To: Peter Zijlstra
Cc: tim.c.chen@linux.intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
 mingo@redhat.com, tglx@kernel.org, vinicius.gomes@intel.com,
 brgerst@gmail.com, hpa@zytor.com, kprateek.nayak@amd.com,
 linux-kernel@vger.kernel.org, patryk.wlazlyn@linux.intel.com,
 rafael.j.wysocki@intel.com, russ.anderson@hpe.com, x86@kernel.org,
 yu.c.chen@intel.com, zhao1.liu@intel.com
Subject: Re: [PATCH v2] sched/topology: Check average distances to remote packages
References: <20260223170314.GU1395266@noisy.programming.kicks-ass.net>
In-Reply-To: <20260223170314.GU1395266@noisy.programming.kicks-ass.net>

On Mon, Feb 23, 2026 at 06:03:14PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
> > Granite Rapids
> > (GNR) and Clearwater Forest (CWF) average distances to
> > remote packages to fix scheduler domains, see [1] for more information.
> >
> > A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> > enabled and there are more than 2 packages because the average distances
> > to remote packages could be different, skewing the single average remote
> > distance.
>
> But earlier Tim said these systems will not have more than 2 packages.
> So what's what?

We have Intel customer reference boards with 2, 4, and 8 sockets.

> So what do these new systems look like?

Here's an 8 socket (2 chassis) HPE system with SNC enabled:

node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10

10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis

Each processor connects to an ASIC (XNC) that acts as a multiplexer,
extending the UPI interconnect across the entire system.

We don't experience the scheduler domain issue reported by Tim because
our SLIT provides symmetric distances to remote NUMA nodes, but we
trigger the WARN_ONCE because we exceed 2 packages.

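For illustration only, here is a small userspace sketch (not the kernel
code) of the per-package check applied to the table above. It assumes
REMOTE_DISTANCE is 20 as in the generic kernel headers, and that nodes 2k
and 2k+1 share a package, which is inferred from the "12 = Same chassis and
socket (SNC)" entries. slit_distance(), avg_remote_distance() and
pkg_avgs_match() are hypothetical names used only for this example.

```c
#include <assert.h>

#define NR_NODES		16
#define NODES_PER_CHASSIS	8
#define NODES_PER_PKG		2	/* assumption: nodes 2k and 2k+1 share a package */
#define NR_PKGS			(NR_NODES / NODES_PER_PKG)
#define REMOTE_DISTANCE		20	/* assumption: generic kernel value */
#define CHASSIS_DISTANCE	40	/* "Different chassis" in the table */

/* Distances within one chassis, copied from the table above. */
static const int chassis_slit[NODES_PER_CHASSIS][NODES_PER_CHASSIS] = {
	{ 10, 12, 16, 16, 16, 16, 18, 18 },
	{ 12, 10, 16, 16, 16, 16, 18, 18 },
	{ 16, 16, 10, 12, 18, 18, 16, 16 },
	{ 16, 16, 12, 10, 18, 18, 16, 16 },
	{ 16, 16, 18, 18, 10, 12, 16, 16 },
	{ 16, 16, 18, 18, 12, 10, 16, 16 },
	{ 18, 18, 16, 16, 16, 16, 10, 12 },
	{ 18, 18, 16, 16, 16, 16, 12, 10 },
};

/* Hypothetical stand-in for node_distance(). */
int slit_distance(int i, int j)
{
	assert(i >= 0 && i < NR_NODES && j >= 0 && j < NR_NODES);

	if (i / NODES_PER_CHASSIS != j / NODES_PER_CHASSIS)
		return CHASSIS_DISTANCE;

	return chassis_slit[i % NODES_PER_CHASSIS][j % NODES_PER_CHASSIS];
}

/*
 * Fill pkg_avg[] with the average distance to each remote package and
 * return the single average over all remote node pairs.
 */
int avg_remote_distance(int pkg_avg[NR_PKGS])
{
	int pkg_total[NR_PKGS] = { 0 }, pkg_nr[NR_PKGS] = { 0 };
	int total = 0, nr = 0;
	int i, j, pkg, distance;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			distance = slit_distance(i, j);
			if (distance < REMOTE_DISTANCE)
				continue;

			total += distance;
			nr++;

			/* Dense index by construction, unlike a raw package ID. */
			pkg = j / NODES_PER_PKG;
			pkg_total[pkg] += distance;
			pkg_nr[pkg]++;
		}
	}

	for (pkg = 0; pkg < NR_PKGS; pkg++)
		pkg_avg[pkg] = pkg_nr[pkg] ? pkg_total[pkg] / pkg_nr[pkg] : 0;

	return nr ? total / nr : REMOTE_DISTANCE;
}

/* Return 1 if every per-package average equals the single average. */
int pkg_avgs_match(void)
{
	int pkg_avg[NR_PKGS];
	int avg = avg_remote_distance(pkg_avg);
	int pkg;

	for (pkg = 0; pkg < NR_PKGS; pkg++) {
		if (pkg_avg[pkg] && pkg_avg[pkg] != avg)
			return 0;
	}

	return 1;
}
```

With this table the only distances at or above REMOTE_DISTANCE are the 40s
between chassis, so the single average and every per-package average are
40 and pkg_avgs_match() returns 1.
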
> > This is unnecessary when the average distances to remote packages are
> > the same.
> >
> > Support single average remote distance on systems with more than 2
> > packages, preventing unnecessary warnings and backtraces, by checking if
> > average distances to remote packages are the same.
> >
>
> > ---
> >  arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
> >  1 file changed, 50 insertions(+), 19 deletions(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 5cd6950ab672..dc8f15bd2e19 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
> >  {
> >  	int i, j;
> >  	int distance, nr_remote, total_distance;
> > +	int max_pkgs = topology_max_packages();
> > +	int cpu, pkg, pkg_avg_distance;
> > +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
>
> Can you make that the normal reverse xmas thing?

Yes.

> >  	if (sched_avg_remote_distance > 0)
> >  		return sched_avg_remote_distance;
> >
> > +	sched_avg_remote_distance = REMOTE_DISTANCE;
> > +
> >  	nr_remote = 0;
> >  	total_distance = 0;
> > +
> > +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_total_distance)
> > +		goto cleanup;
> > +
> > +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_nr_remote)
> > +		goto cleanup;
> > +
> >  	for_each_node_state(i, N_CPU) {
> >  		for_each_node_state(j, N_CPU) {
> >  			distance = node_distance(i, j);
> >
> > -			if (distance >= REMOTE_DISTANCE) {
> > -				nr_remote++;
> > -				total_distance += distance;
> > -			}
> > +			if (distance < REMOTE_DISTANCE)
> > +				continue;
> > +
> > +			nr_remote++;
> > +			total_distance += distance;
> > +
> > +			cpu = cpumask_first(cpumask_of_node(j));
> > +			if (cpu >= nr_cpu_ids)
> > +				continue;
> > +
> > +			pkg = topology_physical_package_id(cpu);
> > +			pkg_total_distance[pkg] += distance;
> > +			pkg_nr_remote[pkg]++;
>
> This is broken, physical_package_id is not
> guaranteed to be dense.

Thank you, I'll fix this.

> >  		}
> >  	}
> > -	if (nr_remote)
> > -		sched_avg_remote_distance = total_distance / nr_remote;
> > -	else
> > -		sched_avg_remote_distance = REMOTE_DISTANCE;
> >
> > +	if (!nr_remote)
> > +		goto cleanup;
> > +
> > +	sched_avg_remote_distance = total_distance / nr_remote;
> > +
> > +	/*
> > +	 * Single average remote distance won't be appropriate if different
> > +	 * packages have different distances to remote packages.
> > +	 */
> > +	for (i = 0; i < max_pkgs; i++) {
> > +		if (!pkg_nr_remote[i])
> > +			continue;
> > +
> > +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> > +
> > +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> > +
> > +		if (pkg_avg_distance != sched_avg_remote_distance)
> > +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> > +	}
>
> This is pretty yuck.
>
> Also, what's with the pr_debug() stuff?
>
> Anyway, that function was fairly magical, and now it is nearly
> impenetrable. If we want this, it needs comments. Definitely more
> comments, with nice pictures on.

OK, thank you for the feedback, I'll work on a v3.

Thanks,
Kyle Meyer