It is possible for Granite Rapids (GNR) and Clearwater Forest
(CWF) to have up to 3 dies per package. When sub-NUMA clustering
(SNC-3) is enabled, each die becomes a separate NUMA node in the
package, with different distances between dies within the same
package.

For example, on GNR, we see the following NUMA distances for a
2-socket system with 3 dies per socket:
      package 1               package 2

       ------------------------------
      |                              |
   ---------                    ---------
  |    0    |                  |    3    |
   ---------                    ---------
      |                              |
   ---------                    ---------
  |    1    |                  |    4    |
   ---------                    ---------
      |                              |
   ---------                    ---------
  |    2    |                  |    5    |
   ---------                    ---------
      |                              |
       ------------------------------
node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10
The node distances above led to 2 problems:

1. Routes taken between nodes in different packages are asymmetric,
   giving an asymmetric scheduler-domain perspective depending on which
   node you are on. The current scheduler code fails to build domains
   properly from asymmetric distances.

2. The multiple distinct distances to the respective tiles of the
   remote package create too many levels of domain hierarchy, each
   grouping a different mix of nodes across the two packages.
For example, the above GNR-X topology leads to the NUMA domains below.

Sched domains from the perspective of a CPU in node 0, where the
numbers in brackets represent node numbers:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3]
NUMA-level 3 [0,1,2,3] [5]
NUMA-level 4 [0,1,2,3,5] [4]
Sched domains from the perspective of a CPU in node 4:
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,2]
NUMA-level 3 [0,2,3,4,5] [1]
The scheduler group peers for load balancing from the perspective of
CPUs in nodes 0 and 4 are different, so an improper task could be
chosen when balancing between groups such as [0,2,3,4,5] and [1].
Ideally, node 0 or node 2, which are in the same package as node 1,
should be chosen first. Instead, tasks in the remote package's nodes
3, 4 and 5 could be chosen with equal probability, which can lead to
excessive remote-package migrations and an imbalance of load between
packages. Partial sets of remote nodes should not be grouped together
with local nodes.
Simplify the remote distances on GNR-X and CWF-X for the purpose of
building sched domains. This maintains symmetry and leads to a more
reasonable load-balancing hierarchy.
The sched domains from the perspective of a CPU in node 0 are now:
NUMA-level 1 [0,1] [2]
NUMA-level 2 [0,1,2] [3,4,5]
The sched domains from the perspective of a CPU in node 4 are now:
NUMA-level 1 [4] [3,5]
NUMA-level 2 [3,4,5] [0,1,2]
We have the same balancing perspective from node 0 or node 4. Loads are
now balanced equally between packages.
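Assuming the remote average for the table above rounds to 24 (434/18
with integer division), the distance matrix that the domain-building
code effectively sees would be:

node   0   1   2   3   4   5
  0:  10  15  17  24  24  24
  1:  15  10  15  24  24  24
  2:  17  15  10  24  24  24
  3:  24  24  24  10  15  17
  4:  24  24  24  15  10  15
  5:  24  24  24  17  15  10

Every cross-package entry is now identical, so each node sees the same
remote grouping and the asymmetry disappears.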
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
arch/x86/kernel/smpboot.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 33e166f6ab12..3f894c525e49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
 	set_sched_topology(topology);
 }
 
+int arch_sched_node_distance(int from, int to)
+{
+	int d = node_distance(from, to);
+
+	if (!x86_has_numa_in_package)
+		return d;
+
+	switch (boot_cpu_data.x86_vfm) {
+	case INTEL_GRANITERAPIDS_X:
+	case INTEL_ATOM_DARKMONT_X:
+		if (d < REMOTE_DISTANCE)
+			return d;
+
+		/*
+		 * Trim finer distance tuning for nodes in a remote
+		 * package when building sched domains. Put the NUMA
+		 * nodes of each remote package in the same sched group.
+		 * This simplifies the NUMA domains and avoids extra
+		 * NUMA levels grouping different nodes across packages.
+		 *
+		 * GNR and CWF systems are not expected to have more
+		 * than 2 packages or more than 2 hops between packages.
+		 */
+		d = sched_avg_remote_numa_distance;
+	}
+	return d;
+}
+
 void set_cpu_sibling_map(int cpu)
 {
 	bool has_smt = __max_threads_per_core > 1;
--
2.32.0
On Thu, Sep 11, 2025 at 11:30:57AM -0700, Tim Chen wrote:
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 33e166f6ab12..3f894c525e49 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
>  	set_sched_topology(topology);
>  }
>
> +int arch_sched_node_distance(int from, int to)
> +{
> +	int d = node_distance(from, to);
> +
> +	if (!x86_has_numa_in_package)
> +		return d;
> +
> +	switch (boot_cpu_data.x86_vfm) {
> +	case INTEL_GRANITERAPIDS_X:
> +	case INTEL_ATOM_DARKMONT_X:
> +		if (d < REMOTE_DISTANCE)
> +			return d;
> +
[snip]
> +		d = sched_avg_remote_numa_distance;

So all of that avg_remote crap should live here, and in this patch. It
really should not be in generic code.

You really need to assert this 'expectation', otherwise weird stuff
will happen. The whole 'avg_remote' thing hard relies on there being a
single remote package.

> +	}
> +	return d;
> +}
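A sketch of how that expectation could be asserted inside
arch_sched_node_distance() (using topology_max_packages() as the hook
is an assumption, not part of the posted patch):

  	case INTEL_GRANITERAPIDS_X:
  	case INTEL_ATOM_DARKMONT_X:
  		/*
  		 * Sketch only: the averaging below hard-relies on a
  		 * single remote package; warn and fall back to the raw
  		 * SLIT distance if that assumption is ever violated.
  		 */
  		if (WARN_ON_ONCE(topology_max_packages() > 2))
  			return d;

  		if (d < REMOTE_DISTANCE)
  			return d;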
On 9/12/2025 2:30 AM, Tim Chen wrote:

[snip]

> +int arch_sched_node_distance(int from, int to)
> +{
> +	int d = node_distance(from, to);
> +
> +	if (!x86_has_numa_in_package)
> +		return d;
> +
> +	switch (boot_cpu_data.x86_vfm) {
> +	case INTEL_GRANITERAPIDS_X:
> +	case INTEL_ATOM_DARKMONT_X:
> +		if (d < REMOTE_DISTANCE)
> +			return d;
> +
[snip]
> +		d = sched_avg_remote_numa_distance;

sched_avg_remote_numa_distance is defined in topology.c with
CONFIG_NUMA controlled, should we make arch_sched_node_distance()
be controlled under CONFIG_NUMA too?

thanks,
Chenyu

> +	}
On 9/12/2025 11:09 AM, Chen, Yu C wrote:
> sched_avg_remote_numa_distance is defined in topology.c with
> CONFIG_NUMA controlled, should we make arch_sched_node_distance()
> be controlled under CONFIG_NUMA too?

Good catch! Given node_distance() too is behind CONFIG_NUMA, I
think we can put this behind CONFIG_NUMA too (including those
declarations in include/linux/sched/topology.h)

--
Thanks and Regards,
Prateek
On 9/12/2025 5:23 PM, K Prateek Nayak wrote:
> On 9/12/2025 11:09 AM, Chen, Yu C wrote:
>> sched_avg_remote_numa_distance is defined in topology.c with
>> CONFIG_NUMA controlled, should we make arch_sched_node_distance()
>> be controlled under CONFIG_NUMA too?
>
> Good catch! Given node_distance() too is behind CONFIG_NUMA, I
> think we can put this behind CONFIG_NUMA too (including those
> declarations in include/linux/sched/topology.h)

Exactly, only NUMA would use this function.

Thanks,
Chenyu
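A sketch of the guard being discussed (the exact placement and header
arrangement are guesses, not posted code):

  /* arch/x86/kernel/smpboot.c */
  #ifdef CONFIG_NUMA
  int arch_sched_node_distance(int from, int to)
  {
  	/* ... body as in the patch above ... */
  }
  #endif /* CONFIG_NUMA */

  /* include/linux/sched/topology.h */
  #ifdef CONFIG_NUMA
  extern int arch_sched_node_distance(int from, int to);
  #endif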
Hello Tim,

On 9/12/2025 12:00 AM, Tim Chen wrote:
> It is possible for Granite Rapids (GNR) and Clearwater Forest
> (CWF) to have up to 3 dies per package. When sub-NUMA clustering
> (SNC-3) is enabled, each die becomes a separate NUMA node in the
> package, with different distances between dies within the same
> package.

[snip]

> We have the same balancing perspective from node 0 or node 4. Loads are
> now balanced equally between packages.
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
> Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

Feel free to include:

Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek
On Fri, 2025-09-12 at 10:38 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 9/12/2025 12:00 AM, Tim Chen wrote:

[snip]

> Feel free to include:
>
> Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks for reviewing and testing.

Tim