Date: Wed, 25 Feb 2026 13:30:52 +0100
From: Peter Zijlstra
To: Kyle Meyer
Cc: tim.c.chen@linux.intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
	mingo@redhat.com, tglx@kernel.org, vinicius.gomes@intel.com,
	brgerst@gmail.com, hpa@zytor.com, kprateek.nayak@amd.com,
	linux-kernel@vger.kernel.org, patryk.wlazlyn@linux.intel.com,
	rafael.j.wysocki@intel.com, russ.anderson@hpe.com, x86@kernel.org,
	yu.c.chen@intel.com, zhao1.liu@intel.com
Subject: Re: [PATCH v2] sched/topology: Check average distances to remote packages
Message-ID: <20260225123052.GN3016024@noisy.programming.kicks-ass.net>
References: <20260223170314.GU1395266@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:

> Here's an 8 socket (2 chassis) HPE system with SNC enabled:
>
> node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
>   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
>   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
>   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
>   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
>   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
>   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
>   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
>  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
>  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
>  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
>  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
>  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
>  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
>
> 10 = Same chassis and socket
> 12 = Same chassis and socket (SNC)
> 16 = Same chassis and adjacent socket
> 18 = Same chassis and non-adjacent socket
> 40 = Different chassis
>
> Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> the UPI interconnect across the entire system.
>
> We don't experience the scheduler domain issue reported by Tim because our SLIT
> provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> because we exceed 2 packages.

The original case was for SNC-3, the above looks to be SNC-2. Does your
system also support SNC-3?

Anyway, yes your SLIT table looks sane (unlike that SNC-3 monster Tim
showed earlier). And it also shows that using REMOTE_DISTANCE (20) was
completely random and 'wrong'.

So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for
GNR, CWF in SNC-3 mode") Tim's original crazy SNC-3 SLIT table was:

node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10

And per:

  https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/

My suggestion was to average the off-trace clusters to restore sanity.

So how about we go about implementing that without reference to magical
numbers, something like so. This obviously needs a little TLC, but it
might just work. Hmm?

---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..cba3e4b14250 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+
+/*
+ * Find the largest symmetric cluster in an attempt to identify the unit size.
+ *
+ * XXX doesn't respect N_CPU node classes and such.
+ */
+static int slit_cluster_size(void)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
+	int i, j, n, m = num_possible_nodes();
 
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
+	for (n = 2; n < m; n++) {
+		for (i = 0; i < n; i++) {
+			for (j = i; j < n; j++) {
+				if (node_distance(i, j) != node_distance(j, i))
+					return n - 1;
 			}
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return m;
+}
+
+static int slit_cluster_distance(int i, int j)
+{
+	static int u = 0;
+	long d = 0;
+	int x, y;
+
+	if (!u)
+		u = slit_cluster_size();
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++)
+			d += node_distance(i, j);
+	}
+
+	return d / (u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
 
-		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (!x86_has_numa_in_package || topology_max_packages() == 1)
 			return d;
 
 		/*
@@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
 		 * packages as average distance to different remote packages
 		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }
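
For anyone who wants to check the arithmetic outside the kernel, below is a
minimal userspace sketch of the same idea, run against Tim's SNC-3 table
quoted above. It is an illustration under assumptions, not the patch itself:
NR_NODES, the slit[] array and the local node_distance() are stand-ins made
up for this example, and the cluster size is passed in as a parameter rather
than cached in a static.

```c
#include <assert.h>

#define NR_NODES 6	/* hypothetical: node count of the SNC-3 example */

/* Tim's SNC-3 SLIT as quoted from 4d6dd05d07d0, hardcoded for illustration. */
static const int slit[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/* Stand-in for the kernel's node_distance(); here it is a plain table lookup. */
static int node_distance(int i, int j)
{
	return slit[i][j];
}

/* Grow the leading n x n block until it stops being symmetric. */
static int slit_cluster_size(void)
{
	int i, j, n;

	for (n = 2; n < NR_NODES; n++) {
		for (i = 0; i < n; i++) {
			for (j = i; j < n; j++) {
				if (node_distance(i, j) != node_distance(j, i))
					return n - 1;
			}
		}
	}

	return NR_NODES;
}

/* Pairs inside one cluster keep their SLIT value; others get the block average. */
static int slit_cluster_distance(int i, int j, int u)
{
	long d = 0;
	int x, y;

	/* The sketch assumes clusters tile the table exactly. */
	assert(u > 0 && NR_NODES % u == 0);

	if ((i / u) == (j / u))
		return node_distance(i, j);

	/* First node of the clusters containing i and j. */
	x = i - (i % u);
	y = j - (j % u);

	for (i = x; i < x + u; i++) {
		for (j = y; j < y + u; j++)
			d += node_distance(i, j);
	}

	return d / (u * u);
}
```

With this table slit_cluster_size() comes out as 3, since the first
asymmetric pair, (1,3) = 23 vs (3,1) = 28, only appears once the block grows
to 4x4. Calling slit_cluster_distance() with u = 3 then maps both (0,4) = 28
and (4,0) = 23 to 217 / 9 = 24 (integer division), while on-package
distances such as (0,2) = 17 pass through unchanged.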