[PATCH] sched/topology: Correctly propagate NUMA flag to scheduling domains
Posted by Andrea Righi 2 weeks ago
When the CPUs of a scheduling domain perfectly overlap with those of
its parent NUMA domain, the parent is degenerated (destroyed) without
the child inheriting its SD_NUMA flag.

This can result in a single NUMA domain that includes all CPUs, even
when the CPUs are spread across multiple NUMA nodes, which may lead to
sub-optimal scheduling decisions.
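
For reference, the degeneration path in cpu_attach_domain() looks
roughly like the sketch below (simplified from the code around the
first hunk of this patch; only SD_PREFER_SIBLING is carried down, so
SD_NUMA dies with the parent):

	for (tmp = sd; tmp; ) {
		struct sched_domain *parent = tmp->parent;

		if (!parent)
			break;

		if (sd_parent_degenerate(tmp, parent)) {
			/* Splice the child over the degenerate parent. */
			tmp->parent = parent->parent;
			if (parent->flags & SD_PREFER_SIBLING)
				tmp->flags |= SD_PREFER_SIBLING;
			/* SD_NUMA is lost here along with the parent. */
			destroy_sched_domain(parent);
		} else
			tmp = tmp->parent;
	}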

Example:

$ vng -v --cpu 16,sockets=4,cores=2,threads=2 \
      -m 4G --numa 2G,cpus=0-7 --numa 2G,cpus=8-15
 ...
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0          yes
  1    0      0    0 0:0:0:0          yes
  2    0      0    1 1:1:1:0          yes
  3    0      0    1 1:1:1:0          yes
  4    0      1    2 2:2:2:1          yes
  5    0      1    2 2:2:2:1          yes
  6    0      1    3 3:3:3:1          yes
  7    0      1    3 3:3:3:1          yes
  8    1      2    4 4:4:4:2          yes
  9    1      2    4 4:4:4:2          yes
 10    1      2    5 5:5:5:2          yes
 11    1      2    5 5:5:5:2          yes
 12    1      3    6 6:6:6:3          yes
 13    1      3    6 6:6:6:3          yes
 14    1      3    7 7:7:7:3          yes
 15    1      3    7 7:7:7:3          yes

Without this change:
  sd_llc[cpu0] spans cpus=0-3
  sd_numa[cpu0] spans cpus=0-15
  ...
  sd_llc[cpu15] spans cpus=12-15
  sd_numa[cpu15] spans cpus=0-15

With this change:
  sd_llc[cpu0] spans cpus=0-3
  sd_numa[cpu0] spans cpus=0-7
  ...
  sd_llc[cpu15] spans cpus=12-15
  sd_numa[cpu15] spans cpus=8-15

This also allows the sched_ext built-in CPU idle selection policy to
reuse sd_numa, instead of relying on the NUMA cpumasks [1].

[1] https://lore.kernel.org/lkml/20241108000136.184909-1-arighi@nvidia.com/
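
For illustration, reusing sd_numa could look roughly like the sketch
below (untested; pick_idle_cpu_numa() is a made-up helper name, not the
actual sched_ext code):

	static int pick_idle_cpu_numa(int cpu, const struct cpumask *allowed)
	{
		struct sched_domain *sd;
		int i, idle_cpu = -1;

		rcu_read_lock();
		sd = rcu_dereference(per_cpu(sd_numa, cpu));
		if (sd) {
			/* Scan the local NUMA node for an idle CPU first. */
			for_each_cpu_and(i, sched_domain_span(sd), allowed) {
				if (available_idle_cpu(i)) {
					idle_cpu = i;
					break;
				}
			}
		}
		rcu_read_unlock();

		return idle_cpu;
	}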

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/topology.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..e0fe493b7ae0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -755,6 +755,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
+			/*
+			 * Transfer SD_NUMA to the child in case of a
+			 * degenerate NUMA parent.
+			 */
+			if (parent->flags & SD_NUMA)
+				tmp->flags |= SD_NUMA;
+
 			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
@@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
 	 */
 	tl[i++] = (struct sched_domain_topology_level){
 		.mask = sd_numa_mask,
+		.sd_flags = cpu_numa_flags,
 		.numa_level = 0,
 		SD_INIT_NAME(NODE)
 	};
-- 
2.47.0
Re: [PATCH] sched/topology: Correctly propagate NUMA flag to scheduling domains
Posted by Peter Zijlstra 1 week, 5 days ago
On Sat, Nov 09, 2024 at 03:56:28PM +0100, Andrea Righi wrote:

> @@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
>  	 */
>  	tl[i++] = (struct sched_domain_topology_level){
>  		.mask = sd_numa_mask,
> +		.sd_flags = cpu_numa_flags,
>  		.numa_level = 0,
>  		SD_INIT_NAME(NODE)
>  	};

This doesn't seem right. This level is a single node, and IIRC we only
expect SD_NUMA on cross-node domains.
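
For context, the cross-node levels appended right after this one
already carry cpu_numa_flags -- roughly, from memory, so it may not
match your tree exactly:

	for (j = 1; j < nr_levels; i++, j++) {
		tl[i] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.sd_flags = cpu_numa_flags,
			.flags = SDTL_OVERLAP,
			.numa_level = j,
			SD_INIT_NAME(NUMA)
		};
	}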
Re: [PATCH] sched/topology: Correctly propagate NUMA flag to scheduling domains
Posted by Andrea Righi 1 week, 5 days ago
Hi Peter,

On Mon, Nov 11, 2024 at 10:22:46AM +0100, Peter Zijlstra wrote:
> On Sat, Nov 09, 2024 at 03:56:28PM +0100, Andrea Righi wrote:
> 
> > @@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
> >  	 */
> >  	tl[i++] = (struct sched_domain_topology_level){
> >  		.mask = sd_numa_mask,
> > +		.sd_flags = cpu_numa_flags,
> >  		.numa_level = 0,
> >  		SD_INIT_NAME(NODE)
> >  	};
> 
> This doesn't seem right. This level is a single node, and IIRC we only
> expect SD_NUMA on cross-node domains.

Ah! This is the part that I was missing, thanks for clarifying it.
Basically I need to look at sd->groups of the SD_NUMA domain to figure
out the individual nodes.
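
Something like this, I guess (untested sketch, just to illustrate
walking the groups of the NUMA domain, where each group span should
cover one node):

	struct sched_domain *sd;
	struct sched_group *sg;

	rcu_read_lock();
	sd = rcu_dereference(per_cpu(sd_numa, cpu));
	if (sd) {
		sg = sd->groups;
		do {
			/* Each group span covers one node's CPUs. */
			pr_info("node span: %*pbl\n",
				cpumask_pr_args(sched_group_span(sg)));
			sg = sg->next;
		} while (sg != sd->groups);
	}
	rcu_read_unlock();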

Please ignore this patch then.

Thanks,
-Andrea