Message-ID: <20260303110100.238361290@infradead.org>
User-Agent: quilt/0.68
Date: Tue, 03 Mar 2026 11:55:43 +0100
From: Peter Zijlstra
To: x86@kernel.org, tglx@kernel.org
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
 tim.c.chen@linux.intel.com, yu.c.chen@intel.com, kyle.meyer@hpe.com,
 vinicius.gomes@intel.com, brgerst@gmail.com, hpa@zytor.com,
 kprateek.nayak@amd.com, patryk.wlazlyn@linux.intel.com,
 rafael.j.wysocki@intel.com, russ.anderson@hpe.com, zhao1.liu@intel.com,
 tony.luck@intel.com, Zhang Rui
Subject: [PATCH v2 4/5] x86/topo: Fix SNC topology mess
References: <20260303105539.428037056@infradead.org>

Per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF
in SNC-3 mode"), the original crazy SNC-3 SLIT table was:

node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10

And per:

  https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/

The suggestion was to average the off-trace clusters to restore sanity.

However, 4d6dd05d07d0 implements this under various assumptions:

 - anything GNR/CWF with numa_in_package;
 - there will never be more than 2 packages;
 - the off-trace cluster will have distance >20

And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks like this:

Here's an 8 socket (2 chassis) HPE system with SNC enabled:

node    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10

 10 = Same chassis and socket
 12 = Same chassis and socket (SNC)
 16 = Same chassis and adjacent socket
 18 = Same chassis and non-adjacent socket
 40 = Different chassis

Turns out, the 'max 2 packages' thing is only relevant to the SNC-3 parts;
the smaller parts do 8 sockets (like usual). The above SLIT table is sane,
but violates the previous assumptions and trips a WARN.

Now that the topology code has a sensible measure of nodes-per-package, we
can use that to divine the SNC mode at hand, and only fix up SNC-3
topologies.

There is a 'healthy' amount of paranoia code validating the assumptions on
the SLIT table, a simple pr_err(FW_BUG) print on failure and a fallback to
using the regular table.

Let's see how long this lasts :-)

Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Reported-by: Kyle Meyer
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: K Prateek Nayak
Tested-by: Zhang Rui
Tested-by: Chen Yu
Tested-by: Kyle Meyer
---
 arch/x86/kernel/smpboot.c |  185 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 140 insertions(+), 45 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -506,33 +506,148 @@ static void __init build_sched_topology(
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+/*
+ * Test if the on-trace cluster at (N,N) is symmetric.
+ * Uses upper triangle iteration to avoid obvious duplicates.
+ */
+static bool slit_cluster_symmetric(int N)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
+	int u = topology_num_nodes_per_package();
 
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
+	for (int k = 0; k < u; k++) {
+		for (int l = k; l < u; l++) {
+			if (node_distance(N + k, N + l) !=
+			    node_distance(N + l, N + k))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Return the package-id of the cluster, or ~0 if indeterminate.
+ * Each node in the on-trace cluster should have the same package-id.
+ */
+static u32 slit_cluster_package(int N)
+{
+	int u = topology_num_nodes_per_package();
+	u32 pkg_id = ~0;
+
+	for (int n = 0; n < u; n++) {
+		const struct cpumask *cpus = cpumask_of_node(N + n);
+		int cpu;
+
+		for_each_cpu(cpu, cpus) {
+			u32 id = topology_logical_package_id(cpu);
+			if (pkg_id == ~0)
+				pkg_id = id;
+			if (pkg_id != id)
+				return ~0;
+		}
+	}
+
+	return pkg_id;
+}
+
+/*
+ * Validate the SLIT table is of the form expected for SNC-3, specifically:
+ *
+ *  - each on-trace cluster should be symmetric,
+ *  - each on-trace cluster should have a unique package-id.
+ *
+ * If you NUMA_EMU on top of SNC, you get to keep the pieces.
+ */
+static bool slit_validate(void)
+{
+	int u = topology_num_nodes_per_package();
+	u32 pkg_id, prev_pkg_id = ~0;
 
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
-			}
+	for (int pkg = 0; pkg < topology_max_packages(); pkg++) {
+		int n = pkg * u;
+
+		/*
+		 * Ensure the on-trace cluster is symmetric and each cluster
+		 * has a different package id.
+		 */
+		if (!slit_cluster_symmetric(n))
+			return false;
+		pkg_id = slit_cluster_package(n);
+		if (pkg_id == ~0)
+			return false;
+		if (pkg && pkg_id == prev_pkg_id)
+			return false;
+
+		prev_pkg_id = pkg_id;
+	}
+
+	return true;
+}
+
+/*
+ * Compute a sanitized SLIT table for SNC; notably SNC-3 can end up with
+ * asymmetric off-trace clusters, reflecting physical asymmetries. However
+ * this leads to 'unfortunate' sched_domain configurations.
+ *
+ * For example dual socket GNR with SNC-3:
+ *
+ * node distances:
+ * node   0   1   2   3   4   5
+ *   0:  10  15  17  21  28  26
+ *   1:  15  10  15  23  26  23
+ *   2:  17  15  10  26  23  21
+ *   3:  21  28  26  10  15  17
+ *   4:  23  26  23  15  10  15
+ *   5:  26  23  21  17  15  10
+ *
+ * Fix things up by averaging out the off-trace clusters; resulting in:
+ *
+ * node   0   1   2   3   4   5
+ *   0:  10  15  17  24  24  24
+ *   1:  15  10  15  24  24  24
+ *   2:  17  15  10  24  24  24
+ *   3:  24  24  24  10  15  17
+ *   4:  24  24  24  15  10  15
+ *   5:  24  24  24  17  15  10
+ */
+static int slit_cluster_distance(int i, int j)
+{
+	static int slit_valid = -1;
+	int u = topology_num_nodes_per_package();
+	long d = 0;
+	int x, y;
+
+	if (slit_valid < 0) {
+		slit_valid = slit_validate();
+		if (!slit_valid)
+			pr_err(FW_BUG "SLIT table doesn't have the expected form for SNC -- fixup disabled!\n");
+		else
+			pr_info("Fixing up SNC SLIT table.\n");
+	}
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u) || !slit_valid)
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster.
+	 *
+	 * Notably average out the symmetric pair of off-trace clusters to
+	 * ensure the resulting SLIT table is symmetric.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++) {
+			d += node_distance(i, j);
+			d += node_distance(j, i);
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return d / (2*u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -542,34 +657,14 @@ int arch_sched_node_distance(int from, i
 	switch (boot_cpu_data.x86_vfm) {
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
-
-		if (topology_max_packages() == 1 || topology_num_nodes_per_package() == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (topology_max_packages() == 1 ||
+		    topology_num_nodes_per_package() < 3)
 			return d;
 
 		/*
-		 * With SNC enabled, there could be too many levels of remote
-		 * NUMA node distances, creating NUMA domain levels
-		 * including local nodes and partial remote nodes.
-		 *
-		 * Trim finer distance tuning for NUMA nodes in remote package
-		 * for the purpose of building sched domains. Group NUMA nodes
-		 * in the remote package in the same sched group.
-		 * Simplify NUMA domains and avoid extra NUMA levels including
-		 * different remote NUMA nodes and local nodes.
-		 *
-		 * GNR and CWF don't expect systems with more than 2 packages
-		 * and more than 2 hops between packages. Single average remote
-		 * distance won't be appropriate if there are more than 2
-		 * packages as average distance to different remote packages
-		 * could be different.
+		 * Handle SNC-3 asymmetries.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }
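
For anyone who wants to check the averaging arithmetic outside the kernel,
here is a minimal userspace sketch fed with the dual socket SNC-3 table from
the changelog. It is an illustration only and not part of the patch:
NR_NODES, NODES_PER_PKG, slit[], cluster_distance() and dump_table() are
made-up names, node_distance() is a local stand-in for the kernel helper of
the same name, and the validation and fallback path are left out.

```c
#include <assert.h>
#include <stdio.h>

#define NR_NODES	6
#define NODES_PER_PKG	3	/* assumption: stands in for topology_num_nodes_per_package() */

/* The dual socket GNR SNC-3 SLIT table quoted in the changelog. */
static const int slit[NR_NODES][NR_NODES] = {
	{ 10, 15, 17, 21, 28, 26 },
	{ 15, 10, 15, 23, 26, 23 },
	{ 17, 15, 10, 26, 23, 21 },
	{ 21, 28, 26, 10, 15, 17 },
	{ 23, 26, 23, 15, 10, 15 },
	{ 26, 23, 21, 17, 15, 10 },
};

/* Userspace stand-in for the kernel's node_distance(); reads the table above. */
int node_distance(int i, int j)
{
	assert(i >= 0 && i < NR_NODES && j >= 0 && j < NR_NODES);
	return slit[i][j];
}

/*
 * Same averaging as slit_cluster_distance() in the patch, minus the
 * slit_validate() step: on-trace clusters are returned unchanged, an
 * off-trace cluster is averaged together with its mirror image.
 */
int cluster_distance(int i, int j)
{
	int u = NODES_PER_PKG;
	long d = 0;
	int x, y;

	if ((i / u) == (j / u))
		return node_distance(i, j);

	/* First node of the cluster that i, respectively j, belongs to. */
	x = i - (i % u);
	y = j - (j % u);

	for (i = x; i < x + u; i++) {
		for (j = y; j < y + u; j++) {
			d += node_distance(i, j);
			d += node_distance(j, i);
		}
	}

	return d / (2 * u * u);
}

/* Print the fixed-up table in the same layout as the changelog. */
void dump_table(void)
{
	for (int i = 0; i < NR_NODES; i++) {
		printf("%3d:", i);
		for (int j = 0; j < NR_NODES; j++)
			printf(" %3d", cluster_distance(i, j));
		printf("\n");
	}
}
```

Each off-trace 3x3 block of the example sums to 217, so the pair averages to
434 / 18, which truncates to 24. Called from a main(), dump_table() should
therefore print 24 for every off-trace entry and leave the on-trace clusters
untouched, matching the fixed-up table in the comment above
slit_cluster_distance().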