A scheduling domain can degenerate its parent NUMA domain when their
CPUs perfectly overlap, without inheriting the SD_NUMA flag from it.

This can result in a single NUMA domain that spans all CPUs, even when
the CPUs are spread across multiple NUMA nodes, which may lead to
sub-optimal scheduling decisions.
Example:
$ vng -v --cpu 16,sockets=4,cores=2,threads=2 \
-m 4G --numa 2G,cpus=0-7 --numa 2G,cpus=8-15
...
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 0 0:0:0:0 yes
2 0 0 1 1:1:1:0 yes
3 0 0 1 1:1:1:0 yes
4 0 1 2 2:2:2:1 yes
5 0 1 2 2:2:2:1 yes
6 0 1 3 3:3:3:1 yes
7 0 1 3 3:3:3:1 yes
8 1 2 4 4:4:4:2 yes
9 1 2 4 4:4:4:2 yes
10 1 2 5 5:5:5:2 yes
11 1 2 5 5:5:5:2 yes
12 1 3 6 6:6:6:3 yes
13 1 3 6 6:6:6:3 yes
14 1 3 7 7:7:7:3 yes
15 1 3 7 7:7:7:3 yes
Without this change:
 - sd_llc[cpu0] spans cpus=0-3
 - sd_numa[cpu0] spans cpus=0-15
 ...
 - sd_llc[cpu15] spans cpus=12-15
 - sd_numa[cpu15] spans cpus=0-15

With this change:
 - sd_llc[cpu0] spans cpus=0-3
 - sd_numa[cpu0] spans cpus=0-7
 ...
 - sd_llc[cpu15] spans cpus=12-15
 - sd_numa[cpu15] spans cpus=8-15
This also allows the sched_ext built-in CPU idle selection policy to
reuse sd_numa, instead of relying on the NUMA cpumasks [1].
[1] https://lore.kernel.org/lkml/20241108000136.184909-1-arighi@nvidia.com/
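
For illustration only (not part of this patch): a minimal sketch of how
such a reuse could look, assuming the kernel/sched internal headers;
numa_span() is a hypothetical helper, and the caller is assumed to hold
rcu_read_lock().

static const struct cpumask *numa_span(int cpu)
{
	struct sched_domain *sd;

	/* Lowest domain above @cpu that carries SD_NUMA, if any. */
	sd = rcu_dereference(per_cpu(sd_numa, cpu));
	if (!sd)
		/* No NUMA domain: fall back to the node cpumask. */
		return cpumask_of_node(cpu_to_node(cpu));

	return sched_domain_span(sd);
}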
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/topology.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..e0fe493b7ae0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -755,6 +755,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 */
 			if (parent->flags & SD_PREFER_SIBLING)
 				tmp->flags |= SD_PREFER_SIBLING;
+			/*
+			 * Transfer SD_NUMA to the child in case of a
+			 * degenerate NUMA parent.
+			 */
+			if (parent->flags & SD_NUMA)
+				tmp->flags |= SD_NUMA;
+
 			destroy_sched_domain(parent);
 		} else
 			tmp = tmp->parent;
@@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
 	 */
 	tl[i++] = (struct sched_domain_topology_level){
 		.mask = sd_numa_mask,
+		.sd_flags = cpu_numa_flags,
 		.numa_level = 0,
 		SD_INIT_NAME(NODE)
 	};
--
2.47.0
On Sat, Nov 09, 2024 at 03:56:28PM +0100, Andrea Righi wrote:

> @@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
>  	 */
>  	tl[i++] = (struct sched_domain_topology_level){
>  		.mask = sd_numa_mask,
> +		.sd_flags = cpu_numa_flags,
>  		.numa_level = 0,
>  		SD_INIT_NAME(NODE)
>  	};

This doesn't seem right. This level is a single node, and IIRC we only
expect SD_NUMA on cross-node domains.
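
For context, a paraphrased sketch (not an exact upstream excerpt) of the
cross-node levels that sched_init_numa() builds just after the hunk
above: only the levels spanning more than one node pass cpu_numa_flags,
which is why SD_NUMA is expected only on cross-node domains.

	for (j = 1; j < nr_levels; i++, j++) {
		tl[i] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.sd_flags = cpu_numa_flags,	/* SD_NUMA comes from here */
			.numa_level = j,
			SD_INIT_NAME(NUMA)
		};
	}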
Hi Peter,

On Mon, Nov 11, 2024 at 10:22:46AM +0100, Peter Zijlstra wrote:
> On Sat, Nov 09, 2024 at 03:56:28PM +0100, Andrea Righi wrote:
>
> > @@ -1974,6 +1981,7 @@ void sched_init_numa(int offline_node)
> >  	 */
> >  	tl[i++] = (struct sched_domain_topology_level){
> >  		.mask = sd_numa_mask,
> > +		.sd_flags = cpu_numa_flags,
> >  		.numa_level = 0,
> >  		SD_INIT_NAME(NODE)
> >  	};
>
> This doesn't seem right. This level is a single node, and IIRC we only
> expect SD_NUMA on cross-node domains.

Ah! This is the part that I was missing, thanks for clarifying it.

Basically I need to look at sd->groups of the SD_NUMA domain to figure
out the individual nodes.

Please ignore this patch then.

Thanks,
-Andrea
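
A rough sketch of the alternative described above, i.e. walking the
groups of the first SD_NUMA domain above a CPU to recover the per-node
spans; print_numa_group_spans() is a hypothetical helper, assumes the
kernel/sched internal headers, and the caller must hold rcu_read_lock().

static void print_numa_group_spans(int cpu)
{
	struct sched_domain *sd;
	struct sched_group *sg;

	/* Find the lowest domain above @cpu that has SD_NUMA set. */
	for_each_domain(cpu, sd)
		if (sd->flags & SD_NUMA)
			break;
	if (!sd)
		return;

	/* Each group of that domain covers one node (or set of nodes). */
	sg = sd->groups;
	do {
		pr_info("cpu%d: NUMA group spans %*pbl\n",
			cpu, cpumask_pr_args(sched_group_span(sg)));
		sg = sg->next;
	} while (sg != sd->groups);
}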