From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 25 Feb 2026 10:41:38 -0600
From: Kyle Meyer
To: Peter Zijlstra
Cc: tim.c.chen@linux.intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
	mingo@redhat.com, tglx@kernel.org, vinicius.gomes@intel.com,
	brgerst@gmail.com, hpa@zytor.com, kprateek.nayak@amd.com,
	linux-kernel@vger.kernel.org, patryk.wlazlyn@linux.intel.com,
	rafael.j.wysocki@intel.com, russ.anderson@hpe.com, x86@kernel.org,
	yu.c.chen@intel.com, zhao1.liu@intel.com
Subject: Re: [PATCH v2] sched/topology: Check average distances to remote packages
References: <20260223170314.GU1395266@noisy.programming.kicks-ass.net>
	<20260225123052.GN3016024@noisy.programming.kicks-ass.net>
In-Reply-To: <20260225123052.GN3016024@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Wed, Feb 25, 2026 at
01:30:52PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:
>
> > Here's an 8 socket (2 chassis) HPE system with SNC enabled:
> >
> > node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> >   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
> >   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
> >   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
> >   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
> >   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
> >   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
> >   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
> >   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
> >   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
> >  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
> >  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
> >  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
> >  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
> >  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
> >  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
> >
> > 10 = Same chassis and socket
> > 12 = Same chassis and socket (SNC)
> > 16 = Same chassis and adjacent socket
> > 18 = Same chassis and non-adjacent socket
> > 40 = Different chassis
> >
> > Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> > the UPI interconnect across the entire system.
> >
> > We don't experience the scheduler domain issue reported by Tim because our SLIT
> > provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> > because we exceed 2 packages.
>
> The original case was for SNC-3, the above looks to be SNC-2. Does your
> system also support SNC-3?

We do not currently use SKUs that support SNC-3. That distance would be set to
12:

node   0   1   2
  0:  10  12  12
  1:  12  10  12
  2:  12  12  10

That might be changed if there's actually a difference in distance.
Distances to adjacent sockets, non-adjacent sockets, and different chassis would
remain the same.

> Anyway, yes your SLIT table looks sane (unlike that SNC-3 monster Tim
> showed earlier).
>
> And it also shows that using REMOTE_DISTANCE (20) was completely random
> and 'wrong'.
>
> So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
>
> Tim's original crazy SNC-3 SLIT table was:
>
> node distances:
> node   0   1   2   3   4   5
>   0:  10  15  17  21  28  26
>   1:  15  10  15  23  26  23
>   2:  17  15  10  26  23  21
>   3:  21  28  26  10  15  17
>   4:  23  26  23  15  10  15
>   5:  26  23  21  17  15  10
>
> And per:
>
>   https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/
>
> My suggestion was to average the off-trace clusters to restore sanity.
>
> So how about we go about implementing that without reference to magical
> numbers, something like so. This obviously needs a little TLC, but it
> might just work.
>
> Hmm?
>
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..cba3e4b14250 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
>  }
>
>  #ifdef CONFIG_NUMA
> -static int sched_avg_remote_distance;
> -static int avg_remote_numa_distance(void)
> +
> +/*
> + * Find the largest symmetric cluster in an attempt to identify the unit size.
> + *
> + * XXX doesn't respect N_CPU node classes and such.
> + */
> +static int slit_cluster_size(void)
>  {
> -	int i, j;
> -	int distance, nr_remote, total_distance;
> +	int i, j, n, m = num_possible_nodes();
>
> -	if (sched_avg_remote_distance > 0)
> -		return sched_avg_remote_distance;
> -
> -	nr_remote = 0;
> -	total_distance = 0;
> -	for_each_node_state(i, N_CPU) {
> -		for_each_node_state(j, N_CPU) {
> -			distance = node_distance(i, j);
> -
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> +	for (n = 2; n < m; n++) {
> +		for (i = 0; i < n; i++) {
> +			for (j = i; j < n; j++) {
> +				if (node_distance(i, j) != node_distance(j, i))
> +					return n - 1;
>  			}
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>
> -	return sched_avg_remote_distance;
> +	return m;
> +}
> +
> +static int slit_cluster_distance(int i, int j)
> +{
> +	static int u = 0;
> +	long d = 0;
> +	int x, y;
> +
> +	if (!u)
> +		u = slit_cluster_size();
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);
> +
> +	/*
> +	 * Off-trace cluster, return average of the cluster to force symmetry.
> +	 */
> +	x = i - (i % u);
> +	y = j - (j % u);
> +
> +	for (i = x; i < x + u; i++) {
> +		for (j = y; j < y + u; j++)
> +			d += node_distance(i, j);
> +	}
> +
> +	return d / (u*u);
>  }
>
>  int arch_sched_node_distance(int from, int to)
> @@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
>  	case INTEL_GRANITERAPIDS_X:
>  	case INTEL_ATOM_DARKMONT_X:
>
> -		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
> -		    d < REMOTE_DISTANCE)
> +		if (!x86_has_numa_in_package || topology_max_packages() == 1)
>  			return d;
>
>  		/*
> @@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * packages as average distance to different remote packages
>  		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
> -		d = avg_remote_numa_distance();
> +		return slit_cluster_distance(from, to);
>  	}
>  	return d;
>  }
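
For illustration only, here is a minimal userspace sketch of the averaging done
by the quoted patch, applied to Tim's SNC-3 table quoted above. The slit[]
array, the NR_NODES constant, the node_distance() stub, and the bounds assert
are assumptions added so it builds outside the kernel; they are not part of the
patch. The two slit_*() functions otherwise follow the quoted diff.

```c
#include <assert.h>

#define NR_NODES 6	/* hypothetical stand-in for num_possible_nodes() */

/* Tim's SNC-3 SLIT table, copied from the quoted mail. */
static const int slit[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/* Userspace stub for the kernel's node_distance(). */
static int node_distance(int i, int j)
{
	return slit[i][j];
}

/* Grow an n x n corner of the table until it stops being symmetric. */
static int slit_cluster_size(void)
{
	int i, j, n, m = NR_NODES;

	for (n = 2; n < m; n++) {
		for (i = 0; i < n; i++) {
			for (j = i; j < n; j++) {
				if (node_distance(i, j) != node_distance(j, i))
					return n - 1;
			}
		}
	}

	return m;
}

static int slit_cluster_distance(int i, int j)
{
	static int u = 0;
	long d = 0;
	int x, y;

	if (!u)
		u = slit_cluster_size();

	/* Same u x u block on the trace: keep the raw SLIT value. */
	if ((i / u) == (j / u))
		return node_distance(i, j);

	/* Off-trace block: average all u*u entries of that block. */
	x = i - (i % u);
	y = j - (j % u);

	/* Added guard for this sketch: the block must fit in the table. */
	assert(x + u <= NR_NODES && y + u <= NR_NODES);

	for (i = x; i < x + u; i++) {
		for (j = y; j < y + u; j++)
			d += node_distance(i, j);
	}

	return d / (u * u);
}
```

With this table the size search stops at 3, because node_distance(1, 3) is 23
while node_distance(3, 1) is 28. Both off-trace 3x3 blocks sum to 217, so
slit_cluster_distance(0, 3) and slit_cluster_distance(3, 0) both come out as
217 / 9 = 24 under integer division, while on-trace pairs such as (0, 1) keep
their SLIT value of 15.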