Message-ID: <20260226105052.737712686@infradead.org>
User-Agent: quilt/0.68
Date: Thu, 26 Feb 2026 11:49:14 +0100
From: Peter Zijlstra
To: x86@kernel.org, tglx@kernel.org
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
 tim.c.chen@linux.intel.com, yu.c.chen@intel.com, kyle.meyer@hpe.com,
 vinicius.gomes@intel.com, brgerst@gmail.com, hpa@zytor.com,
 kprateek.nayak@amd.com, patryk.wlazlyn@linux.intel.com,
 rafael.j.wysocki@intel.com, russ.anderson@hpe.com, zhao1.liu@intel.com,
 tony.luck@intel.com
Subject: [RFC][PATCH 5/6] x86/topo: Fix SNC topology mess
References: <20260226104909.675623579@infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")

The original crazy SNC-3 SLIT table was:

node distances:
node     0    1    2    3    4    5
    0:   10   15   17   21   28   26
    1:   15   10   15   23   26   23
    2:   17   15   10   26   23   21
    3:   21   28   26   10   15   17
    4:   23   26   23   15   10   15
    5:   26   23   21   17   15   10

And per:

  https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/

My suggestion was to average the off-trace clusters to restore sanity.

However, 4d6dd05d07d0 implements this under various assumptions:

 - there will never be more than 2 packages;
 - the off-trace cluster will have distance >20

And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks like this:

Here's an 8 socket (2 chassis) HPE system with SNC enabled:

node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10

 10 = Same chassis and socket
 12 = Same chassis and socket (SNC)
 16 = Same chassis and adjacent socket
 18 = Same chassis and non-adjacent socket
 40 = Different chassis

*However* this is SNC-2.

This completely invalidates all the earlier assumptions and trips
WARNs.

Now that the topology code has a sensible measure of
nodes-per-package, we can use that to divine the SNC mode at hand,
and only fix up SNC-3 topologies.

The only remaining assumption is that there are no CPU-less nodes -- is
this a valid assumption?

Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Chen Yu
Tested-by: Zhang Rui
---
 arch/x86/kernel/smpboot.c |   64 +++++++++++++++++-----------------------------
 1 file changed, 25 insertions(+), 39 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -506,33 +506,32 @@ static void __init build_sched_topology(
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+static int slit_cluster_distance(int i, int j)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
-
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
-			}
+	int u = __num_nodes_per_package;
+	long d = 0;
+	int x, y;
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++) {
+			d += node_distance(i, j);
+			d += node_distance(j, i);
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return d / (2*u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -542,13 +541,11 @@ int arch_sched_node_distance(int from, i
 	switch (boot_cpu_data.x86_vfm) {
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
-
-		if (topology_max_packages() == 1 || __num_nodes_per_package == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (topology_max_packages() == 1 || __num_nodes_per_package < 3)
 			return d;
 
 		/*
-		 * With SNC enabled, there could be too many levels of remote
+		 * With SNC-3 enabled, there could be too many levels of remote
 		 * NUMA node distances, creating NUMA domain levels
 		 * including local nodes and partial remote nodes.
 		 *
@@ -557,19 +554,8 @@ int arch_sched_node_distance(int from, i
 		 * in the remote package in the same sched group.
 		 * Simplify NUMA domains and avoid extra NUMA levels including
 		 * different remote NUMA nodes and local nodes.
-		 *
-		 * GNR and CWF don't expect systems with more than 2 packages
-		 * and more than 2 hops between packages. Single average remote
-		 * distance won't be appropriate if there are more than 2
-		 * packages as average distance to different remote packages
-		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }
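
For anyone who wants to poke at the averaging outside the kernel, here is a
rough userspace sketch of slit_cluster_distance() run against the SNC-3 SLIT
table quoted in the changelog. It is not part of the patch. The slit[] array,
NR_NODES, the local node_distance() stand-in and the extra 'u' parameter
(replacing __num_nodes_per_package) are all assumptions made for illustration;
it also assumes nodes are numbered contiguously per package, as the patch does.

```c
#include <assert.h>

#define NR_NODES 6

/* SNC-3 SLIT table from the changelog: 2 packages, 3 nodes each. */
static const int slit[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/* Hypothetical stand-in for the kernel's node_distance(). */
static int node_distance(int i, int j)
{
	return slit[i][j];
}

/*
 * Userspace mirror of the patch's slit_cluster_distance().
 * 'u' is the number of nodes per package, passed in explicitly here
 * instead of being read from __num_nodes_per_package.
 */
int slit_cluster_distance(int i, int j, int u)
{
	long d = 0;
	int x, y;

	assert(u > 0 && NR_NODES % u == 0);

	/* Same package: cluster on the trace, keep the raw distance. */
	if ((i / u) == (j / u))
		return node_distance(i, j);

	/* Off-trace cluster: average both directions over the u*u block. */
	x = i - (i % u);
	y = j - (j % u);

	for (i = x; i < x + u; i++) {
		for (j = y; j < y + u; j++) {
			d += node_distance(i, j);
			d += node_distance(j, i);
		}
	}

	return d / (2 * u * u);
}
```

With u = 3, intra-package pairs keep their SLIT values (for example 15 for
nodes 0 and 1), while every cross-package pair collapses to 434 / 18 = 24 after
integer truncation, identical in both directions.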