It is possible for Granite Rapids X (GNR) and Clearwater Forest X
(CWF) to have up to 3 dies per package. When sub-NUMA clustering (SNC-3)
is enabled, each die becomes a separate NUMA node in the package, with
different distances between dies within the same package.

For example, on GNR-X, we see the following NUMA distances for a
2-socket system with 3 dies per socket:

       package 1            package 2
   ----------------     ----------------
   |              |     |              |
   |  ---------   |     |  ---------   |
   |  |   0   |   |     |  |   3   |   |
   |  ---------   |     |  ---------   |
   |              |     |              |
   |  ---------   |     |  ---------   |
   |  |   1   |   |     |  |   4   |   |
   |  ---------   |     |  ---------   |
   |              |     |              |
   |  ---------   |     |  ---------   |
   |  |   2   |   |     |  |   5   |   |
   |  ---------   |     |  ---------   |
   |              |     |              |
   ----------------     ----------------

node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10

The node distances above lead to 2 problems:

1. Asymmetric routes taken between nodes in different packages lead to
   an asymmetric scheduler domain perspective depending on which node
   you are on. The current scheduler code fails to build domains
   properly with asymmetric distances.

2. Multiple remote distances to the respective tiles on the remote
   package create too many levels of domain hierarchy, grouping
   different nodes between remote packages.

For example, the above GNR-X topology leads to the NUMA domains below.

Sched domains from the perspective of a CPU in node 0, where the
numbers in brackets represent node numbers:

NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3]
NUMA-level 3 [0,1,2,3] [5]
NUMA-level 4 [0,1,2,3,5] [4]

Sched domains from the perspective of a CPU in node 4:
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,2]
NUMA-level 3 [0,2,3,4,5] [1]

The scheduler group peers for load balancing from the perspective of a
CPU in node 0 and a CPU in node 4 are therefore different. An improper
task could be chosen for load balancing between groups such as
[0,2,3,4,5] [1]. Ideally, nodes 0 or 2, which are in the same package
as node 1, should be chosen first. But instead, tasks in the remote
package nodes 3, 4 and 5 could be chosen with an equal chance, which
could lead to excessive remote-package migrations and load imbalance
between packages. We should not group partial remote nodes and local
nodes together.

Simplify the remote distances for CWF-X and GNR-X for the purpose of
building sched domains. This maintains symmetry and leads to a more
reasonable load balancing hierarchy.
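
For illustration, rounding each remote distance down to a multiple of
10 as the patch does (d = (d / 10) * 10, with intra-package distances
below REMOTE_DISTANCE left untouched) collapses every cross-package
entry in the table above to 20. The effective matrix used for domain
building then becomes:

node   0   1   2   3   4   5
  0:  10  15  17  20  20  20
  1:  15  10  15  20  20  20
  2:  17  15  10  20  20  20
  3:  20  20  20  10  15  17
  4:  20  20  20  15  10  15
  5:  20  20  20  17  15  10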

The sched domains from the perspective of a CPU in node 0 are now:

NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3,4,5]

The sched domains from the perspective of a CPU in node 4 are now:

NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,1,2]

We have the same balancing perspective from node 0 or node 4. Loads are
now balanced equally between packages.
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
---
arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 25 +++++++++++++++++++------
3 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..c425e84c88b5 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
set_sched_topology(topology);
}
+int sched_node_distance(int from, int to)
+{
+ int d = node_distance(from, to);
+
+ if (!x86_has_numa_in_package)
+ return d;
+
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_GRANITERAPIDS_X:
+ case INTEL_ATOM_DARKMONT_X:
+ if (d < REMOTE_DISTANCE)
+ return d;
+
+ /*
+ * Trim finer distance tuning for nodes in remote package
+ * for the purpose of building sched domains.
+ * Put NUMA nodes in each remote package in a single sched group.
+ * Simplify NUMA domains and avoid extra NUMA levels including different
+ * NUMA nodes in remote packages.
+ *
+ * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
+ * turned on.
+ */
+ d = (d / 10) * 10;
+ }
+ return d;
+}
+
void set_cpu_sibling_map(int cpu)
{
bool has_smt = __max_threads_per_core > 1;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..3b62226394af 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
#endif
extern int arch_asym_cpu_priority(int cpu);
+extern int sched_node_distance(int from, int to);
struct sched_domain_attr {
int relax_domain_level;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9a7ac67e3d63..3f485da994a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
bool found = false;
int i, *distances;
- if (distance == node_distance(0, 0))
+ if (distance == sched_node_distance(0, 0))
return true;
rcu_read_lock();
@@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
+/*
+ * Architecture could simplify NUMA distance, to avoid
+ * creating too many NUMA levels when SNC is turned on.
+ */
+int __weak sched_node_distance(int from, int to)
+{
+ return node_distance(from, to);
+}
+
void sched_init_numa(int offline_node)
{
struct sched_domain_topology_level *tl;
@@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
int nr_levels = 0;
int i, j;
int *distances;
+ int max_dist = 0;
struct cpumask ***masks;
/*
@@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
bitmap_zero(distance_map, NR_DISTANCE_VALUES);
for_each_cpu_node_but(i, offline_node) {
for_each_cpu_node_but(j, offline_node) {
- int distance = node_distance(i, j);
+ int distance = sched_node_distance(i, j);
+
+ if (node_distance(i,j) > max_dist)
+ max_dist = node_distance(i,j);
if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
sched_numa_warn("Invalid distance value range");
@@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
masks[i][j] = mask;
for_each_cpu_node_but(k, offline_node) {
- if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
+ if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
- if (node_distance(j, k) > sched_domains_numa_distance[i])
+ if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
continue;
cpumask_or(mask, mask, cpumask_of_node(k));
@@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
sched_domain_topology = tl;
sched_domains_numa_levels = nr_levels;
- WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
+ WRITE_ONCE(sched_max_numa_distance, max_dist);
init_numa_topology_type(offline_node);
}
@@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
continue;
/* Set ourselves in the remote node's masks */
- if (node_distance(j, node) <= sched_domains_numa_distance[i])
+ if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
}
}
--
2.32.0
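
A minimal user-space sketch (not part of the patch, purely to eyeball
the rounding rule) that applies the same transformation to the example
SLIT matrix from the commit message. The 6x6 table and the
REMOTE_DISTANCE cutoff of 20 come from the patch and commit message;
everything else here is illustrative.

	/* gcc -o snc_dist snc_dist.c && ./snc_dist */
	#include <stdio.h>

	#define REMOTE_DISTANCE	20	/* same cutoff the kernel uses */

	/* Example SLIT distances for the 2-socket, 3-die GNR-X system. */
	static const int slit[6][6] = {
		{ 10, 15, 17, 21, 28, 26 },
		{ 15, 10, 15, 23, 26, 23 },
		{ 17, 15, 10, 26, 23, 21 },
		{ 21, 28, 26, 10, 15, 17 },
		{ 23, 26, 23, 15, 10, 15 },
		{ 26, 23, 21, 17, 15, 10 },
	};

	/* Same rounding as the patch: keep intra-package distances,
	 * strip the fine-grained remote tuning down to a multiple of 10. */
	static int sched_node_distance(int from, int to)
	{
		int d = slit[from][to];

		if (d < REMOTE_DISTANCE)
			return d;

		return (d / 10) * 10;
	}

	int main(void)
	{
		for (int i = 0; i < 6; i++) {
			for (int j = 0; j < 6; j++)
				printf(" %3d", sched_node_distance(i, j));
			printf("\n");
		}
		return 0;
	}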

On 8/23/2025 4:14 AM, Tim Chen wrote:
> It is possible for Granite Rapids X (GNR) and Clearwater Forest X
> (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
> is enabled, each die will become a separate NUMA node in the package
> with different distances between dies within the same package.
>
> [... snip ...]
>
> +	switch (boot_cpu_data.x86_vfm) {
> +	case INTEL_GRANITERAPIDS_X:
> +	case INTEL_ATOM_DARKMONT_X:
> +		if (d < REMOTE_DISTANCE)
> +			return d;
> +
> +		d = (d / 10) * 10;

Does the '10' here mean that the distance of the hierarchy socket is
10 from the SLIT table? For example, from a socket0 point of view, the
distance of socket1 to socket0 is within [20, 29), the distance of
socket2 to socket0 is within [30, 39), and so on. If this is the case,
maybe add a comment above for future reference.

> @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
>  	bool found = false;
>  	int i, *distances;
>
> -	if (distance == node_distance(0, 0))
> +	if (distance == sched_node_distance(0, 0))
>  		return true;

If I understand correctly, this patch is trying to fix the sched
domain issue during load balancing, and the NUMA balancing logic
should not be changed because NUMA balancing is not based on sched
domains?

That is to say, since find_numa_distance() is only used by NUMA
balancing, should we keep find_numa_distance() using node_distance()?

> @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
>  	sched_domain_topology = tl;
>
>  	sched_domains_numa_levels = nr_levels;
> -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> +	WRITE_ONCE(sched_max_numa_distance, max_dist);

The above change uses the original node_distance() rather than
sched_node_distance() for sched_max_numa_distance, and
sched_max_numa_distance is only used by NUMA balancing to figure out
the NUMA topology type as well as to scale the NUMA fault statistics
for remote nodes.

So I think we might want to keep it aligned by using node_distance()
in find_numa_distance().

thanks,
Chenyu

On Mon, 2025-08-25 at 13:08 +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> >
> > [... snip ...]
> >
> > +		d = (d / 10) * 10;
>
> Does the '10' here mean that the distance of the hierarchy socket is
> 10 from the SLIT table?

Yes.

> For example, from a socket0 point of view, the distance of socket1 to
> socket0 is within [20, 29), the distance of socket2 to socket0 is
> within [30, 39), and so on. If this is the case, maybe add a comment
> above for future reference.

We don't expect to have more than 2 sockets for GNR and CWF. So the
case of 2 hops like [30, 39) should not happen.

> > -	if (distance == node_distance(0, 0))
> > +	if (distance == sched_node_distance(0, 0))
> >  		return true;
>
> If I understand correctly, this patch is trying to fix the sched
> domain issue during load balancing, and the NUMA balancing logic
> should not be changed because NUMA balancing is not based on sched
> domains?
>
> That is to say, since find_numa_distance() is only used by NUMA
> balancing, should we keep find_numa_distance() using node_distance()?

The procedure here is using the distance matrix that's initialized
with sched_node_distance(), hence the change. Otherwise we could keep
a separate sched_distance matrix and use only node_distance() here.
Did not do that to minimize the change.

Tim

On Mon, Aug 25, 2025 at 01:08:39PM +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> >
> > [... snip ...]
> >
> > node distances:
> > node   0   1   2   3   4   5
> >   0:  10  15  17  21  28  26
> >   1:  15  10  15  23  26  23
> >   2:  17  15  10  26  23  21
> >   3:  21  28  26  10  15  17
> >   4:  23  26  23  15  10  15
> >   5:  26  23  21  17  15  10
> >
> > [... snip ...]
> >
> > +		d = (d / 10) * 10;
>
> Does the '10' here mean that the distance of the hierarchy socket is
> 10 from the SLIT table? For example, from a socket0 point of view, the
> distance of socket1 to socket0 is within [20, 29), the distance of
> socket2 to socket0 is within [30, 39), and so on. If this is the case,
> maybe add a comment above for future reference.

This is all because of the ACPI SLIT distance definitions I suppose,
10 for local and 20 for remote (which IMO is actively wrong, since it
mandates distances that are not relative performance).

Additionally, the table above magically has all the remote distances
in the range of [20,29] and so the strip-the-1s thing works.

The problem of course is that the SLIT table is fully under control of
the BIOS, and a random BIOS monkey could cause this to not be so,
making the above code not work as intended. E.g. if the remote
distances end up being in the range of [20,35] or whatever, then it
all goes sideways.

( There is a history of manipulating the SLIT table to influence
  scheduler behaviour of the OS of choice :-/ )

Similarly, when doing a 4-node system, it is possible the 2-hop
distances don't align nicely with the 10s and we're up a creek again.

This is all very fragile. A much better way would be to allocate a new
SLIT table, identify the (local) clusters and replace all remote
instances with an average.

Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:

  node   0   1   2   3   4   5
    0:  10  15  17  24  24  24
    1:  15  10  15  24  24  24
    2:  17  15  10  24  24  24
    3:  24  24  24  10  15  17
    4:  24  24  24  15  10  15
    5:  24  24  24  17  15  10

On Mon, 2025-08-25 at 09:56 +0200, Peter Zijlstra wrote:
>
> [... snip ...]
>
> Similarly, when doing a 4-node system, it is possible the 2-hop
> distances don't align nicely with the 10s and we're up a creek again.

We don't expect 4-node systems for GNR nor CWF. So hopefully we don't
need to worry about them. Otherwise we may need additional code to
check for 2 hops.

> This is all very fragile. A much better way would be to allocate a new
> SLIT table, identify the (local) clusters and replace all remote
> instances with an average.

Are you suggesting to have one SLIT distance table that's simplified
for scheduler domain building and another for the true node distance?

> Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:
>
>   node   0   1   2   3   4   5
>     0:  10  15  17  24  24  24
>     1:  15  10  15  24  24  24
>     2:  17  15  10  26  23  21
>     3:  24  24  24  10  15  17
>     4:  24  24  24  15  10  15
>     5:  24  24  24  17  15  10

Will take a closer look at using the average for nodes in the remote
package.

Tim
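
A rough sketch (purely illustrative, not from the patch or the thread)
of the averaging idea discussed above: build a shadow distance table
where every cross-package distance is replaced by the average remote
distance between the two packages. package_of(), build_sched_slit()
and the fixed 6x6 layout are assumptions for the 2-socket, 3-die
example, not kernel APIs.

	/* Map a node to its package: nodes 0-2 -> 0, nodes 3-5 -> 1. */
	static int package_of(int node)
	{
		return node / 3;
	}

	/*
	 * Replace every cross-package entry of a 6x6 SLIT table with the
	 * average distance between the two packages involved.  For the
	 * example table this gives 217 / 9 ~= 24 for all remote entries.
	 */
	static void build_sched_slit(const int slit[6][6], int out[6][6])
	{
		for (int i = 0; i < 6; i++) {
			for (int j = 0; j < 6; j++) {
				int d = slit[i][j];

				if (package_of(i) != package_of(j)) {
					int sum = 0, n = 0;

					for (int a = 0; a < 6; a++)
						for (int b = 0; b < 6; b++)
							if (package_of(a) == package_of(i) &&
							    package_of(b) == package_of(j)) {
								sum += slit[a][b];
								n++;
							}
					d = sum / n;
				}
				out[i][j] = d;
			}
		}
	}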