So, per commit 4d6dd05d07d0 ("sched/topology: Fix sched domain build
error for GNR, CWF in SNC-3 mode"), the original crazy SNC-3 SLIT
table was:
node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10
And per:
https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/
My suggestion was to average the off-trace clusters to restore sanity.
However, 4d6dd05d07d0 implements this under various assumptions:
 - there will never be more than 2 packages;
 - the off-trace cluster will have distance > 20.
And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks nothing like that.
Here's an 8-socket (2-chassis) HPE system with SNC enabled:
node    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:   10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:   12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:   16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:   16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:   16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:   16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:   18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:   18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:   40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:   40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:   40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:   40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:   40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:   40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:   40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:   40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis
*However* this is SNC-2.
This completely invalidates all the earlier assumptions and trips
WARNs.
Now that the topology code has a sensible measure of
nodes-per-package, we can use that to divine the SNC mode at hand,
and only fix up SNC-3 topologies.
The only remaining assumption is that there are no CPU-less nodes --
is this a valid assumption?
Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/kernel/smpboot.c | 64 +++++++++++++++++-----------------------------
1 file changed, 25 insertions(+), 39 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -506,33 +506,32 @@ static void __init build_sched_topology(
}
#ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+static int slit_cluster_distance(int i, int j)
{
- int i, j;
- int distance, nr_remote, total_distance;
-
- if (sched_avg_remote_distance > 0)
- return sched_avg_remote_distance;
-
- nr_remote = 0;
- total_distance = 0;
- for_each_node_state(i, N_CPU) {
- for_each_node_state(j, N_CPU) {
- distance = node_distance(i, j);
-
- if (distance >= REMOTE_DISTANCE) {
- nr_remote++;
- total_distance += distance;
- }
+ int u = __num_nodes_per_package;
+ long d = 0;
+ int x, y;
+
+ /*
+ * Is this a unit cluster on the trace?
+ */
+ if ((i / u) == (j / u))
+ return node_distance(i, j);
+
+ /*
+ * Off-trace cluster, return average of the cluster to force symmetry.
+ */
+ x = i - (i % u);
+ y = j - (j % u);
+
+ for (i = x; i < x + u; i++) {
+ for (j = y; j < y + u; j++) {
+ d += node_distance(i, j);
+ d += node_distance(j, i);
}
}
- if (nr_remote)
- sched_avg_remote_distance = total_distance / nr_remote;
- else
- sched_avg_remote_distance = REMOTE_DISTANCE;
- return sched_avg_remote_distance;
+ return d / (2*u*u);
}
int arch_sched_node_distance(int from, int to)
@@ -542,13 +541,11 @@ int arch_sched_node_distance(int from, i
switch (boot_cpu_data.x86_vfm) {
case INTEL_GRANITERAPIDS_X:
case INTEL_ATOM_DARKMONT_X:
-
- if (topology_max_packages() == 1 || __num_nodes_per_package == 1 ||
- d < REMOTE_DISTANCE)
+ if (topology_max_packages() == 1 || __num_nodes_per_package < 3)
return d;
/*
- * With SNC enabled, there could be too many levels of remote
+ * With SNC-3 enabled, there could be too many levels of remote
* NUMA node distances, creating NUMA domain levels
* including local nodes and partial remote nodes.
*
@@ -557,19 +554,8 @@ int arch_sched_node_distance(int from, i
* in the remote package in the same sched group.
* Simplify NUMA domains and avoid extra NUMA levels including
* different remote NUMA nodes and local nodes.
- *
- * GNR and CWF don't expect systems with more than 2 packages
- * and more than 2 hops between packages. Single average remote
- * distance won't be appropriate if there are more than 2
- * packages as average distance to different remote packages
- * could be different.
*/
- WARN_ONCE(topology_max_packages() > 2,
- "sched: Expect only up to 2 packages for GNR or CWF, "
- "but saw %d packages when building sched domains.",
- topology_max_packages());
-
- d = avg_remote_numa_distance();
+ return slit_cluster_distance(from, to);
}
return d;
}
Hi Peter,

On 2/26/2026 6:49 PM, Peter Zijlstra wrote:
> +	int u = __num_nodes_per_package;

Yes, this is much simpler, thanks for the patch!

> +	long d = 0;
> +	int x, y;
> +
> +	/*
> +	 * Is this a unit cluster on the trace?
> +	 */
> +	if ((i / u) == (j / u))
> +		return node_distance(i, j);

If the number of nodes per package is 3, we assume that
every 3 consecutive nodes are SNC siblings (on the same
trace): node0, node1, and node2 are SNC siblings, while
node3, node4, and node5 form another group of SNC siblings.

I have a curious thought: could it be possible that
node0, node2, and node4 are SNC siblings, and node1,
node3, and node5 are another set of SNC siblings instead?

Then I studied the code a little more: node ids are dynamically
allocated via acpi_map_pxm_to_node(), so the assignment of node
ids depends on the order in which each processor affinity structure
is listed in the SRAT table. For example, suppose CPU0 belongs to
package0 and CPU1 belongs to package1, but their entries are placed
consecutively in the SRAT. In this case, the Proximity Domain of
CPU0 would be mapped to node0 via acpi_map_pxm_to_node(), and CPU1's
Proximity Domain would be assigned node1. The logic above would
then treat them as belonging to the same package, even though they
are physically in different packages. However, I believe such a
scenario is unlikely to occur in practice in the BIOS, and if it
happens it should be a BIOS bug, if I understand correctly.

thanks,
Chenyu
On Fri, Feb 27, 2026 at 01:07:40AM +0800, Chen, Yu C wrote:
> Hi Peter,
>
> On 2/26/2026 6:49 PM, Peter Zijlstra wrote:
> > +	int u = __num_nodes_per_package;
>
> Yes, this is much simpler, thanks for the patch!
>
> > +	long d = 0;
> > +	int x, y;
> > +
> > +	/*
> > +	 * Is this a unit cluster on the trace?
> > +	 */
> > +	if ((i / u) == (j / u))
> > +		return node_distance(i, j);
>
> If the number of nodes per package is 3, we assume that
> every 3 consecutive nodes are SNC siblings (on the same
> trace): node0, node1, and node2 are SNC siblings, while
> node3, node4, and node5 form another group of SNC siblings.
>
> I have a curious thought: could it be possible that
> node0, node2, and node4 are SNC siblings, and node1,
> node3, and node5 are another set of SNC siblings instead?

Yes, give a BIOS guy enough bong-hits and this can be. That said (and
knock on wood), I've so far never seen this (and please people, don't
take this as a challenge).

> Then I studied the code a little more: node ids are dynamically
> allocated via acpi_map_pxm_to_node(), so the assignment of node
> ids depends on the order in which each processor affinity structure
> is listed in the SRAT table. For example, suppose CPU0 belongs to
> package0 and CPU1 belongs to package1, but their entries are placed
> consecutively in the SRAT. In this case, the Proximity Domain of
> CPU0 would be mapped to node0 via acpi_map_pxm_to_node(), and CPU1's
> Proximity Domain would be assigned node1. The logic above would
> then treat them as belonging to the same package, even though they
> are physically in different packages. However, I believe such a
> scenario is unlikely to occur in practice in the BIOS, and if it
> happens it should be a BIOS bug, if I understand correctly.

Just so. The thing I worried about is getting memory only nodes
iterated in between or something.
But as long as the CPU enumeration happens before the 'other' crud, then the CPU node mappings should be the consecutive low numbers and it all just works.
On Fri, 2026-02-27 at 01:07 +0800, Chen, Yu C wrote:
> Hi Peter,
>
> On 2/26/2026 6:49 PM, Peter Zijlstra wrote:
> > +	int u = __num_nodes_per_package;
>
> Yes, this is much simpler, thanks for the patch!
>
> > +	long d = 0;
> > +	int x, y;
> > +
> > +	/*
> > +	 * Is this a unit cluster on the trace?
> > +	 */
> > +	if ((i / u) == (j / u))
> > +		return node_distance(i, j);
>
> If the number of nodes per package is 3, we assume that
> every 3 consecutive nodes are SNC siblings (on the same
> trace): node0, node1, and node2 are SNC siblings, while
> node3, node4, and node5 form another group of SNC siblings.
>
> I have a curious thought: could it be possible that
> node0, node2, and node4 are SNC siblings, and node1,
> node3, and node5 are another set of SNC siblings instead?
>
> Then I studied the code a little more: node ids are dynamically
> allocated via acpi_map_pxm_to_node(), so the assignment of node
> ids depends on the order in which each processor affinity structure
> is listed in the SRAT table. For example, suppose CPU0 belongs to
> package0 and CPU1 belongs to package1, but their entries are placed
> consecutively in the SRAT. In this case, the Proximity Domain of
> CPU0 would be mapped to node0 via acpi_map_pxm_to_node(), and CPU1's
> Proximity Domain would be assigned node1. The logic above would
> then treat them as belonging to the same package, even though they
> are physically in different packages. However, I believe such a
> scenario is unlikely to occur in practice in the BIOS, and if it
> happens it should be a BIOS bug, if I understand correctly.

May be a good idea to sanity check that the nodes in the first unit
cluster have the same package id and give a WARNING if that's not the
case.

Tim
On Thu, Feb 26, 2026 at 11:00:38AM -0800, Tim Chen wrote:
> May be a good idea to sanity check that the nodes in the first unit cluster
> has the same package id and give a WARNING if that's not the case.
But then we'd also have check the second cluster is another package. And
if we're checking that, we might as well check they're symmetric.
Is this sufficiently paranoid for you? :-)
---
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -61,6 +61,7 @@
#include <linux/cpuhotplug.h>
#include <linux/mc146818rtc.h>
#include <linux/acpi.h>
+#include <linux/once_lite.h>
#include <asm/acpi.h>
#include <asm/cacheinfo.h>
@@ -506,12 +507,58 @@ static void __init build_sched_topology(
}
#ifdef CONFIG_NUMA
+static bool slit_cluster_symmetric(int N)
+{
+ for (int k = 0; k < __num_nodes_per_package; k++) {
+ for (int l = k; l < __num_nodes_per_package; l++) {
+ if (node_distance(N + k, N + l) !=
+ node_distance(N + l, N + k))
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static u32 slit_cluster_package(int N)
+{
+ u32 pkg_id = ~0;
+
+ for (int n = 0; n < __num_nodes_per_package; n++) {
+ const struct cpumask *cpus = cpumask_of_node(N + n);
+ int cpu;
+
+ for_each_cpu(cpu, cpus) {
+ u32 id = topology_logical_package_id(cpu);
+ if (pkg_id == ~0)
+ pkg_id = id;
+ if (pkg_id != id)
+ return ~0;
+ }
+ }
+
+ return pkg_id;
+}
+
+/* If you NUMA_EMU on top of SNC, you get to keep the pieces */
+static void slit_validate(void)
+{
+ u32 pkg1 = slit_cluster_package(0);
+ u32 pkg2 = slit_cluster_package(__num_nodes_per_package);
+ WARN_ON(pkg1 == ~0 || pkg2 == ~0 || pkg1 == pkg2);
+
+ WARN_ON(!slit_cluster_symmetric(0));
+ WARN_ON(!slit_cluster_symmetric(__num_nodes_per_package));
+}
+
static int slit_cluster_distance(int i, int j)
{
int u = __num_nodes_per_package;
long d = 0;
int x, y;
+ DO_ONCE_LITE(slit_validate);
+
/*
* Is this a unit cluster on the trace?
*/
On Fri, 2026-02-27 at 14:01 +0100, Peter Zijlstra wrote:
> On Thu, Feb 26, 2026 at 11:00:38AM -0800, Tim Chen wrote:
> > May be a good idea to sanity check that the nodes in the first unit cluster
> > has the same package id and give a WARNING if that's not the case.
>
> But then we'd also have check the second cluster is another package. And
> if we're checking that, we might as well check they're symmetric.
>
> Is this sufficiently paranoid for you? :-)
>
Thanks. Looks pretty good and hopefully those warnings will
never be triggered.
Tim
>
> ---
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -61,6 +61,7 @@
> #include <linux/cpuhotplug.h>
> #include <linux/mc146818rtc.h>
> #include <linux/acpi.h>
> +#include <linux/once_lite.h>
>
> #include <asm/acpi.h>
> #include <asm/cacheinfo.h>
> @@ -506,12 +507,58 @@ static void __init build_sched_topology(
> }
>
> #ifdef CONFIG_NUMA
> +static bool slit_cluster_symmetric(int N)
> +{
> + for (int k = 0; k < __num_nodes_per_package; k++) {
> + for (int l = k; l < __num_nodes_per_package; l++) {
> + if (node_distance(N + k, N + l) !=
> + node_distance(N + l, N + k))
> + return false;
> + }
> + }
> +
> + return true;
> +}
> +
> +static u32 slit_cluster_package(int N)
> +{
> + u32 pkg_id = ~0;
> +
> + for (int n = 0; n < __num_nodes_per_package; n++) {
> + const struct cpumask *cpus = cpumask_of_node(N + n);
> + int cpu;
> +
> + for_each_cpu(cpu, cpus) {
> + u32 id = topology_logical_package_id(cpu);
> + if (pkg_id == ~0)
> + pkg_id = id;
> + if (pkg_id != id)
> + return ~0;
> + }
> + }
> +
> + return pkg_id;
> +}
> +
> +/* If you NUMA_EMU on top of SNC, you get to keep the pieces */
> +static void slit_validate(void)
> +{
> + u32 pkg1 = slit_cluster_package(0);
> + u32 pkg2 = slit_cluster_package(__num_nodes_per_package);
> + WARN_ON(pkg1 == ~0 || pkg2 == ~0 || pkg1 == pkg2);
> +
> + WARN_ON(!slit_cluster_symmetric(0));
> + WARN_ON(!slit_cluster_symmetric(__num_nodes_per_package));
> +}
> +
> static int slit_cluster_distance(int i, int j)
> {
> int u = __num_nodes_per_package;
> long d = 0;
> int x, y;
>
> + DO_ONCE_LITE(slit_validate);
> +
> /*
> * Is this a unit cluster on the trace?
> */
>
On 2/28/2026 3:23 AM, Tim Chen wrote:
> On Fri, 2026-02-27 at 14:01 +0100, Peter Zijlstra wrote:
>> On Thu, Feb 26, 2026 at 11:00:38AM -0800, Tim Chen wrote:
>>> May be a good idea to sanity check that the nodes in the first unit cluster
>>> has the same package id and give a WARNING if that's not the case.
>>
>> But then we'd also have check the second cluster is another package. And
>> if we're checking that, we might as well check they're symmetric.
>>
>> Is this sufficiently paranoid for you? :-)
>>
>
> Thanks. Looks pretty good and hopefully those warnings will
> never be triggered.
>
>> +
>> +/* If you NUMA_EMU on top of SNC, you get to keep the pieces */
>> +static void slit_validate(void)
>> +{
>> + u32 pkg1 = slit_cluster_package(0);
>> + u32 pkg2 = slit_cluster_package(__num_nodes_per_package);
>> + WARN_ON(pkg1 == ~0 || pkg2 == ~0 || pkg1 == pkg2);
>> +
>> + WARN_ON(!slit_cluster_symmetric(0));
>> + WARN_ON(!slit_cluster_symmetric(__num_nodes_per_package));
>> +}
>> +
Here we check packages0 and 1, should we check all the packages?
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8735f1968b00..91bac3e2e7fd 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -543,12 +543,13 @@ static u32 slit_cluster_package(int N)
/* If you NUMA_EMU on top of SNC, you get to keep the pieces */
static void slit_validate(void)
{
- u32 pkg1 = slit_cluster_package(0);
- u32 pkg2 = slit_cluster_package(__num_nodes_per_package);
- WARN_ON(pkg1 == ~0 || pkg2 == ~0 || pkg1 == pkg2);
+ for (int pkg = 0; pkg < topology_max_packages(); pkg++) {
+ int node = pkg * __num_nodes_per_package;
+ u32 pkg_id = slit_cluster_package(node);
- WARN_ON(!slit_cluster_symmetric(0));
- WARN_ON(!slit_cluster_symmetric(__num_nodes_per_package));
+ WARN_ON(pkg_id == ~0);
+ WARN_ON(!slit_cluster_symmetric(node));
+ }
}
nt On Sat, Feb 28, 2026 at 03:35:26PM +0800, Chen, Yu C wrote:

> Here we check packages0 and 1, should we check all the packages?

Might as well I suppose.

Could you boot queue/x86/topo on an snc-3 machine to verify it doesn't
explode with all the paranoia on?
On Mon, 2026-03-02 at 17:43 +0100, Peter Zijlstra wrote:
> nt On Sat, Feb 28, 2026 at 03:35:26PM +0800, Chen, Yu C wrote:
>
> > Here we check packages0 and 1, should we check all the packages?
>
> Might as well I suppose.
>
> Could you boot queue/x86/topo on an snc-3 machine to verify it
> doesn't explode with all the paranoia on?

Hi, Peter,

regarding slit_validate() in
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/topo&id=24ca94ac4b72803a7164b7ad84f06f0e9f0c49df

I suppose we want to use
	WARN_ON_ONCE(!slit_cluster_symmetric(n))
rather than
	WARN_ON_ONCE(slit_cluster_symmetric(n));
right?

I tested the queue/x86/topo plus above change:

1. on GNR 4 sockets with SNC2, no difference about sched_domains in
   /proc/schedstat compared with upstream 7.0-rc1, plus the below
   warning is also gone

   [   11.439633] ------------[ cut here ]------------
   [   11.440491] sched: Expect only up to 2 packages for GNR or CWF, but saw 4 packages when building sched domains.
   [   11.440493] WARNING: arch/x86/kernel/smpboot.c:574 at arch_sched_node_distance+0x133/0x140, CPU#0: swapper/0/1

2. on CWF 2 sockets with SNC3 and CWF 1 socket with SNC3, no
   difference about sched_domains in /proc/schedstat compared with
   upstream 7.0-rc1

So

Tested-by: Zhang Rui <rui.zhang@intel.com>

-rui
On Tue, Mar 03, 2026 at 02:31:37PM +0800, Zhang Rui wrote:
> On Mon, 2026-03-02 at 17:43 +0100, Peter Zijlstra wrote:
> > nt On Sat, Feb 28, 2026 at 03:35:26PM +0800, Chen, Yu C wrote:
> >
> > > Here we check packages0 and 1, should we check all the packages?
> >
> > Might as well I suppose.
> >
> > Could you boot queue/x86/topo on an snc-3 machine to verify it
> > doesn't explode with all the paranoia on?
>
> Hi, Peter,
>
> regarding slit_validate() in
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/topo&id=24ca94ac4b72803a7164b7ad84f06f0e9f0c49df
>
> I suppose we want to use
> 	WARN_ON_ONCE(!slit_cluster_symmetric(n))
> rather than
> 	WARN_ON_ONCE(slit_cluster_symmetric(n));
> right?

D'0h yes, very much so! Thanks!

Let me go expand the Changelogs some post and then probably stick it
in tip somewhere.
On 3/3/2026 2:31 PM, Zhang Rui wrote:
> On Mon, 2026-03-02 at 17:43 +0100, Peter Zijlstra wrote:
>> nt On Sat, Feb 28, 2026 at 03:35:26PM +0800, Chen, Yu C wrote:
>>
>>> Here we check packages0 and 1, should we check all the packages?
>>
>> Might as well I suppose.
>>
>> Could you boot queue/x86/topo on an snc-3 machine to verify it
>> doesn't
>> explode with all the paranoia on?
>
> Hi, Peter,
>
> regarding slit_validate() in
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=x86/topo&id=24ca94ac4b72803a7164b7ad84f06f0e9f0c49df
>
> I suppose we want to use
> WARN_ON_ONCE(!slit_cluster_symmetric(n))
> rather than
> WARN_ON_ONCE(slit_cluster_symmetric(n));
With Rui's fix, tested on GNR ,2 sockets, SNC3, NUMA distance symmetric
platform,
it works as expected,
smp: Brought up 6 nodes, 384 CPUs
domain-0: span=0,192 level=SMT
domain-1: span=0-31,192-223 level=MC
domain-2: span=0-95,192-287 level=NUMA
groups: 0:{ span=0-31,192-223 cap=65536 }, 32:{ span=32-63,224-255
cap=65536 }, 64:{ span=64-95,256-287 cap=65536 }
domain-3: span=0-383 level=NUMA
groups: 0:{ span=0-95,192-287 cap=196608 }, 96:{
span=96-191,288-383 cap=196608 }
resctrl also detects SNC3
root@a4bf018d7604:/sys/fs/resctrl/mon_data# tree mon_L3_00/
mon_L3_00/
├── llc_occupancy
├── mbm_local_bytes
├── mbm_total_bytes
├── mon_sub_L3_00
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
├── mon_sub_L3_01
│ ├── llc_occupancy
│ ├── mbm_local_bytes
│ └── mbm_total_bytes
└── mon_sub_L3_02
├── llc_occupancy
├── mbm_local_bytes
└── mbm_total_bytes
> So
>
> Tested-by: Zhang Rui <rui.zhang@intel.com>
>
For this series,
Tested-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
On Thu, 2026-02-26 at 11:00 -0800, Tim Chen wrote:
> On Fri, 2026-02-27 at 01:07 +0800, Chen, Yu C wrote:
> > Hi Peter,
> >
> > On 2/26/2026 6:49 PM, Peter Zijlstra wrote:
> > > + int u = __num_nodes_per_package;
> >
> > Yes, this is much simpler, thanks for the patch!
> >
> > > + long d = 0;
> > > + int x, y;
> > > +
> > > + /*
> > > + * Is this a unit cluster on the trace?
> > > + */
> > > + if ((i / u) == (j / u))
> > > + return node_distance(i, j);
> >
> > If the number of nodes per package is 3, we assume that
> > every 3 consecutive nodes are SNC siblings (on the same
> > trace):node0, node1, and node2 are SNC siblings, while
> > node3, node4, and node5 form another group of SNC siblings.
> >
> > I have a curious thought: could it be possible that
> > node0, node2, and node4 are SNC siblings, and node1,
> > node3, and node5 are another set of SNC siblings instead?
> >
> > Then I studied the code a little more, node ids are dynamically
> > allocated via the acpi_map_pxm_to_node, so the assignment of node
> > ids depends on the order in which each processor affinity structure
> > is listed in the SRAT table. For example, suppose CPU0 belongs to
> > package0 and CPU1 belongs to package1, but their entries are placed
> > consecutively in the SRAT. In this case, the Proximity Domain of
> > CPU0 would be mapped to node0 via acpi_map_pxm_to_node, and CPU1’s
> > Proximity Domain would be assigned node1. The logic above would
> > then treat them as belonging to the same package, even though they
> > are physically in different packages. However, I believe such a
> > scenario is unlikely to occur in practice in the BIOS and if it
> > happens it should be a BIOS bug if I understand correctly.
> >
> >
>
> May be a good idea to sanity check that the nodes in the first unit cluster
> has the same package id and give a WARNING if that's not the case.
>
Perhaps something like below
Tim
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index d97f8f4e014c..38384ea5253a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -511,6 +511,24 @@ static int slit_cluster_distance(int i, int j)
int u = __num_nodes_per_package;
long d = 0;
int x, y;
+ static int valid_slit = 0;
+
+ if (valid_slit == -1)
+ return node_distance(i, j);
+
+ if (valid_slit == 0) {
+ /* Check first nodes in package are grouped together consecutively */
+ for (x = 0; x < u-1 ; x++) {
+ if (topology_physical_package_id(x) !=
+ topology_physical_package_id(x+1)) {
+ pr_warn("Expect nodes %d and %d to be in the same package",
+ x, x+1);
+ valid_slit = -1;
+ return node_distance(i, j);
+ }
+ }
+ valid_slit = 1;
+ }
/*
* Is this a unit cluster on the trace?
On Thu, 2026-02-26 at 14:11 -0800, Tim Chen wrote:
> On Thu, 2026-02-26 at 11:00 -0800, Tim Chen wrote:
> > On Fri, 2026-02-27 at 01:07 +0800, Chen, Yu C wrote:
> > > Hi Peter,
> > >
> > > On 2/26/2026 6:49 PM, Peter Zijlstra wrote:
> > > > + int u = __num_nodes_per_package;
> > >
> > > Yes, this is much simpler, thanks for the patch!
> > >
> > > > + long d = 0;
> > > > + int x, y;
> > > > +
> > > > + /*
> > > > + * Is this a unit cluster on the trace?
> > > > + */
> > > > + if ((i / u) == (j / u))
> > > > + return node_distance(i, j);
> > >
> > > If the number of nodes per package is 3, we assume that
> > > every 3 consecutive nodes are SNC siblings (on the same
> > > trace):node0, node1, and node2 are SNC siblings, while
> > > node3, node4, and node5 form another group of SNC siblings.
> > >
> > > I have a curious thought: could it be possible that
> > > node0, node2, and node4 are SNC siblings, and node1,
> > > node3, and node5 are another set of SNC siblings instead?
> > >
> > > Then I studied the code a little more, node ids are dynamically
> > > allocated via the acpi_map_pxm_to_node, so the assignment of node
> > > ids depends on the order in which each processor affinity structure
> > > is listed in the SRAT table. For example, suppose CPU0 belongs to
> > > package0 and CPU1 belongs to package1, but their entries are placed
> > > consecutively in the SRAT. In this case, the Proximity Domain of
> > > CPU0 would be mapped to node0 via acpi_map_pxm_to_node, and CPU1’s
> > > Proximity Domain would be assigned node1. The logic above would
> > > then treat them as belonging to the same package, even though they
> > > are physically in different packages. However, I believe such a
> > > scenario is unlikely to occur in practice in the BIOS and if it
> > > happens it should be a BIOS bug if I understand correctly.
> > >
> > >
> >
> > May be a good idea to sanity check that the nodes in the first unit cluster
> > has the same package id and give a WARNING if that's not the case.
> >
> Perhaps something like below
>
> Tim
>
> ---
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index d97f8f4e014c..38384ea5253a 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -511,6 +511,24 @@ static int slit_cluster_distance(int i, int j)
> int u = __num_nodes_per_package;
> long d = 0;
> int x, y;
> + static int valid_slit = 0;
> +
> + if (valid_slit == -1)
> + return node_distance(i, j);
> +
> + if (valid_slit == 0) {
> + /* Check first nodes in package are grouped together consecutively */
> + for (x = 0; x < u-1 ; x++) {
> + if (topology_physical_package_id(x) !=
> + topology_physical_package_id(x+1)) {
This won't work because topology_physical_package_id() takes cpu
as argument. Will need to find the first cpu of x and x+1 and pass to it.
> + pr_warn("Expect nodes %d and %d to be in the same package",
> + x, x+1);
> + valid_slit = -1;
> + return node_distance(i, j);
> + }
> + }
> + valid_slit = 1;
> + }
>
> /*
> * Is this a unit cluster on the trace?
>