kernel/sched/topology.c | 9 +++++++++ 1 file changed, 9 insertions(+)
The updated topology_span_sane() algorithm in commit ce29a7da84cd
("sched/topology: Refinement to topology_span_sane speedup") works on
the "sched_domain_topology_level" hierarchy to detect overlap in
!SDTL_OVERLAP domains using the tl->mask() as opposed to the
sched_domain hierarchy (and the sched_domain_span()) in the previous
approach.
For NODE domain, the cpumask retunred by tl->mask() depends on the
"sched_domains_curr_level". Unless the "sched_domains_curr_level" is
reset during topology_span_sane(), it reflects the "numa_level" of the
last sched_domain created which is incorrect.
Reset the "sched_domains_curr_level" to the "tl->numa_level" during
topology_span_sane(). Although setting "topology_span_sane" to 0 in
topology_span_sane() should be enough since all domains with
numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
"tl->numa_level" makes it more explicit and future proof against changes
in the same area.
Cc: Steve Wahl <steve.wahl@hpe.com>
Reported-by: Leon Romanovsky <leon@kernel.org>
Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
This issue can be reproduced on a setup with the following NUMA topology
shared by Leon:
$ sudo numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: 2927 MB
node 0 free: 1603 MB
node 1 cpus: 2 3
node 1 size: 3023 MB
node 1 free: 3008 MB
node 2 cpus: 4 5
node 2 size: 3023 MB
node 2 free: 3007 MB
node 3 cpus: 6 7
node 3 size: 3023 MB
node 3 free: 3002 MB
node 4 cpus: 8 9
node 4 size: 3022 MB
node 4 free: 2718 MB
node distances:
node 0 1 2 3 4
0: 10 39 38 37 36
1: 39 10 38 37 36
2: 38 38 10 37 36
3: 37 37 37 10 36
4: 36 36 36 36 10
This topology can be emulated using QEMU with the following cmdline used
in my testing:
sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
-cpu host \
-m 20G -smp cpus=10,sockets=10 -machine q35 \
-object memory-backend-ram,size=4G,id=m0 \
-object memory-backend-ram,size=4G,id=m1 \
-object memory-backend-ram,size=4G,id=m2 \
-object memory-backend-ram,size=4G,id=m3 \
-object memory-backend-ram,size=4G,id=m4 \
-numa node,cpus=0-1,memdev=m0,nodeid=0 \
-numa node,cpus=2-3,memdev=m1,nodeid=1 \
-numa node,cpus=4-5,memdev=m2,nodeid=2 \
-numa node,cpus=6-7,memdev=m3,nodeid=3 \
-numa node,cpus=8-9,memdev=m4,nodeid=4 \
-numa dist,src=0,dst=1,val=39 \
-numa dist,src=0,dst=2,val=38 \
-numa dist,src=0,dst=3,val=37 \
-numa dist,src=0,dst=4,val=36 \
-numa dist,src=1,dst=0,val=39 \
-numa dist,src=1,dst=2,val=38 \
-numa dist,src=1,dst=3,val=37 \
-numa dist,src=1,dst=4,val=36 \
-numa dist,src=2,dst=0,val=38 \
-numa dist,src=2,dst=1,val=38 \
-numa dist,src=2,dst=3,val=37 \
-numa dist,src=2,dst=4,val=36 \
-numa dist,src=3,dst=0,val=37 \
-numa dist,src=3,dst=1,val=37 \
-numa dist,src=3,dst=2,val=37 \
-numa dist,src=3,dst=4,val=36 \
-numa dist,src=4,dst=0,val=36 \
-numa dist,src=4,dst=1,val=36 \
-numa dist,src=4,dst=2,val=36 \
-numa dist,src=4,dst=3,val=36 \
...
Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
---
kernel/sched/topology.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..1d634862c8df 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
cpumask_clear(covered);
cpumask_clear(id_seen);
+#ifdef CONFIG_NUMA
+ /*
+ * Reuse the sched_domains_curr_level hack since
+ * tl->mask() below can resolve to sd_numa_mask()
+ * for the NODE domain.
+ */
+ sched_domains_curr_level = tl->numa_level;
+#endif
+
/*
* Non-NUMA levels cannot partially overlap - they must be either
* completely equal or completely disjoint. Otherwise we can end up
base-commit: 914873bc7df913db988284876c16257e6ab772c6
--
2.34.1
On Tue, Jun 24, 2025 at 04:12:35AM +0000, K Prateek Nayak wrote: > The updated topology_span_sane() algorithm in commit ce29a7da84cd > ("sched/topology: Refinement to topology_span_sane speedup") works on > the "sched_domain_topology_level" hierarchy to detect overlap in > !SDTL_OVERLAP domains using the tl->mask() as opposed to the > sched_domain hierarchy (and the sched_domain_span()) in the previous > approach. > > For NODE domain, the cpumask retunred by tl->mask() depends on the > "sched_domains_curr_level". Unless the "sched_domains_curr_level" is > reset during topology_span_sane(), it reflects the "numa_level" of the > last sched_domain created which is incorrect. > > Reset the "sched_domains_curr_level" to the "tl->numa_level" during > topology_span_sane(). Although setting "topology_span_sane" to 0 in > topology_span_sane() should be enough since all domains with > numa_level > 0 currently set SDTL_OVERLAP flag, setting it to > "tl->numa_level" makes it more explicit and future proof against changes > in the same area. <...> > Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag > 'x86-build-2025-05-25' of > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") > --- > kernel/sched/topology.c | 9 +++++++++ > 1 file changed, 9 insertions(+) Sorry for the delay, we had internal issues with our verification cloud :(. Thanks
Hey, First of all, thanks for looking into this! On 24/06/25 04:12, K Prateek Nayak wrote: > The updated topology_span_sane() algorithm in commit ce29a7da84cd > ("sched/topology: Refinement to topology_span_sane speedup") works on > the "sched_domain_topology_level" hierarchy to detect overlap in > !SDTL_OVERLAP domains using the tl->mask() as opposed to the > sched_domain hierarchy (and the sched_domain_span()) in the previous > approach. > The previous approach also used tl->mask() directly, but it happened to be called *before* the build_sched_domain() loop, so the NODE iteration happened with sched_domain_curr_level at its default static value of 0... For the first SD build that is, I assume that was then broken for any subsequent rebuild. > For NODE domain, the cpumask retunred by tl->mask() depends on the > "sched_domains_curr_level". Unless the "sched_domains_curr_level" is > reset during topology_span_sane(), it reflects the "numa_level" of the > last sched_domain created which is incorrect. > > Reset the "sched_domains_curr_level" to the "tl->numa_level" during > topology_span_sane(). Although setting "topology_span_sane" to 0 in > topology_span_sane() should be enough since all domains with > numa_level > 0 currently set SDTL_OVERLAP flag, setting it to > "tl->numa_level" makes it more explicit and future proof against changes > in the same area. > > Cc: Steve Wahl <steve.wahl@hpe.com> > Reported-by: Leon Romanovsky <leon@kernel.org> > Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ > Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup") Per the above, this could probably be: Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> > --- > This issue can be reproduced on a setup with the following NUMA topology > shared by Leon: > > $ sudo numactl -H > available: 5 nodes (0-4) > node 0 cpus: 0 1 > node 0 size: 2927 MB > node 0 free: 1603 MB > node 1 cpus: 2 3 > node 1 size: 3023 MB > node 1 free: 3008 MB > node 2 cpus: 4 5 > node 2 size: 3023 MB > node 2 free: 3007 MB > node 3 cpus: 6 7 > node 3 size: 3023 MB > node 3 free: 3002 MB > node 4 cpus: 8 9 > node 4 size: 3022 MB > node 4 free: 2718 MB > node distances: > node 0 1 2 3 4 > 0: 10 39 38 37 36 > 1: 39 10 38 37 36 > 2: 38 38 10 37 36 > 3: 37 37 37 10 36 > 4: 36 36 36 36 10 > > > This topology can be emulated using QEMU with the following cmdline used > in my testing: > > sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \ > -cpu host \ > -m 20G -smp cpus=10,sockets=10 -machine q35 \ > -object memory-backend-ram,size=4G,id=m0 \ > -object memory-backend-ram,size=4G,id=m1 \ > -object memory-backend-ram,size=4G,id=m2 \ > -object memory-backend-ram,size=4G,id=m3 \ > -object memory-backend-ram,size=4G,id=m4 \ > -numa node,cpus=0-1,memdev=m0,nodeid=0 \ > -numa node,cpus=2-3,memdev=m1,nodeid=1 \ > -numa node,cpus=4-5,memdev=m2,nodeid=2 \ > -numa node,cpus=6-7,memdev=m3,nodeid=3 \ > -numa node,cpus=8-9,memdev=m4,nodeid=4 \ > -numa dist,src=0,dst=1,val=39 \ > -numa dist,src=0,dst=2,val=38 \ > -numa dist,src=0,dst=3,val=37 \ > -numa dist,src=0,dst=4,val=36 \ > -numa dist,src=1,dst=0,val=39 \ > -numa dist,src=1,dst=2,val=38 \ > -numa dist,src=1,dst=3,val=37 \ > -numa dist,src=1,dst=4,val=36 \ > -numa dist,src=2,dst=0,val=38 \ > -numa dist,src=2,dst=1,val=38 \ > -numa dist,src=2,dst=3,val=37 \ > -numa dist,src=2,dst=4,val=36 \ > -numa dist,src=3,dst=0,val=37 \ > -numa dist,src=3,dst=1,val=37 \ > -numa dist,src=3,dst=2,val=37 \ > -numa dist,src=3,dst=4,val=36 \ > -numa dist,src=4,dst=0,val=36 \ > -numa dist,src=4,dst=1,val=36 \ > -numa dist,src=4,dst=2,val=36 \ > -numa dist,src=4,dst=3,val=36 \ > ... > It's a bit of a mouthful but I would keep that in the changelog itself given that it's part of reproducing the bug. > > Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag > 'x86-build-2025-05-25' of > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") > --- > kernel/sched/topology.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index a2a38e1b6f18..1d634862c8df 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map) > cpumask_clear(covered); > cpumask_clear(id_seen); > > +#ifdef CONFIG_NUMA > + /* > + * Reuse the sched_domains_curr_level hack since > + * tl->mask() below can resolve to sd_numa_mask() > + * for the NODE domain. > + */ > + sched_domains_curr_level = tl->numa_level; > +#endif > + Urgh... Given this is now invoked after the build_sched_domain() loop, what if we directly used the sched_domain_span(), instead, i.e. use sched_domain_mask(per_cpu_ptr(tl->data->sd, cpu)) instead of tl->mask(id) which means no indrect use of sched_domains_curr_level. Note that I'm currently running out of brain juice so this might be a really stupid idea :-) > /* > * Non-NUMA levels cannot partially overlap - they must be either > * completely equal or completely disjoint. Otherwise we can end up > > base-commit: 914873bc7df913db988284876c16257e6ab772c6 > -- > 2.34.1
Hello Valentin, Thank you for taking a look at the patch. On 6/25/2025 9:02 PM, Valentin Schneider wrote: > Hey, > > First of all, thanks for looking into this! > > On 24/06/25 04:12, K Prateek Nayak wrote: >> The updated topology_span_sane() algorithm in commit ce29a7da84cd >> ("sched/topology: Refinement to topology_span_sane speedup") works on >> the "sched_domain_topology_level" hierarchy to detect overlap in >> !SDTL_OVERLAP domains using the tl->mask() as opposed to the >> sched_domain hierarchy (and the sched_domain_span()) in the previous >> approach. >> > > The previous approach also used tl->mask() directly, but it happened > to be called *before* the build_sched_domain() loop, so the NODE iteration > happened with sched_domain_curr_level at its default static value of > 0... For the first SD build that is, I assume that was then broken for any > subsequent rebuild. Oh right! I read it wrong. My bad. topology_span_sane() being called before build_sched_domain() always worked on "sched_domains_curr_level" of the previous domain and could potentially cause an issue during rebuild if we start from the NODE domain. > >> For NODE domain, the cpumask retunred by tl->mask() depends on the >> "sched_domains_curr_level". Unless the "sched_domains_curr_level" is >> reset during topology_span_sane(), it reflects the "numa_level" of the >> last sched_domain created which is incorrect. >> >> Reset the "sched_domains_curr_level" to the "tl->numa_level" during >> topology_span_sane(). Although setting "topology_span_sane" to 0 in >> topology_span_sane() should be enough since all domains with >> numa_level > 0 currently set SDTL_OVERLAP flag, setting it to >> "tl->numa_level" makes it more explicit and future proof against changes >> in the same area. >> >> Cc: Steve Wahl <steve.wahl@hpe.com> >> Reported-by: Leon Romanovsky <leon@kernel.org> >> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ >> Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup") > > Per the above, this could probably be: > > Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") I'll update this in the next version. > >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> >> --- >> This issue can be reproduced on a setup with the following NUMA topology >> shared by Leon: >> >> $ sudo numactl -H >> available: 5 nodes (0-4) >> node 0 cpus: 0 1 >> node 0 size: 2927 MB >> node 0 free: 1603 MB >> node 1 cpus: 2 3 >> node 1 size: 3023 MB >> node 1 free: 3008 MB >> node 2 cpus: 4 5 >> node 2 size: 3023 MB >> node 2 free: 3007 MB >> node 3 cpus: 6 7 >> node 3 size: 3023 MB >> node 3 free: 3002 MB >> node 4 cpus: 8 9 >> node 4 size: 3022 MB >> node 4 free: 2718 MB >> node distances: >> node 0 1 2 3 4 >> 0: 10 39 38 37 36 >> 1: 39 10 38 37 36 >> 2: 38 38 10 37 36 >> 3: 37 37 37 10 36 >> 4: 36 36 36 36 10 >> >> >> This topology can be emulated using QEMU with the following cmdline used >> in my testing: >> >> sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \ >> -cpu host \ >> -m 20G -smp cpus=10,sockets=10 -machine q35 \ >> -object memory-backend-ram,size=4G,id=m0 \ >> -object memory-backend-ram,size=4G,id=m1 \ >> -object memory-backend-ram,size=4G,id=m2 \ >> -object memory-backend-ram,size=4G,id=m3 \ >> -object memory-backend-ram,size=4G,id=m4 \ >> -numa node,cpus=0-1,memdev=m0,nodeid=0 \ >> -numa node,cpus=2-3,memdev=m1,nodeid=1 \ >> -numa node,cpus=4-5,memdev=m2,nodeid=2 \ >> -numa node,cpus=6-7,memdev=m3,nodeid=3 \ >> -numa node,cpus=8-9,memdev=m4,nodeid=4 \ >> -numa dist,src=0,dst=1,val=39 \ >> -numa dist,src=0,dst=2,val=38 \ >> -numa dist,src=0,dst=3,val=37 \ >> -numa dist,src=0,dst=4,val=36 \ >> -numa dist,src=1,dst=0,val=39 \ >> -numa dist,src=1,dst=2,val=38 \ >> -numa dist,src=1,dst=3,val=37 \ >> -numa dist,src=1,dst=4,val=36 \ >> -numa dist,src=2,dst=0,val=38 \ >> -numa dist,src=2,dst=1,val=38 \ >> -numa dist,src=2,dst=3,val=37 \ >> -numa dist,src=2,dst=4,val=36 \ >> -numa dist,src=3,dst=0,val=37 \ >> -numa dist,src=3,dst=1,val=37 \ >> -numa dist,src=3,dst=2,val=37 \ >> -numa dist,src=3,dst=4,val=36 \ >> -numa dist,src=4,dst=0,val=36 \ >> -numa dist,src=4,dst=1,val=36 \ >> -numa dist,src=4,dst=2,val=36 \ >> -numa dist,src=4,dst=3,val=36 \ >> ... >> > > It's a bit of a mouthful but I would keep that in the changelog itself > given that it's part of reproducing the bug. Sure thing! > >> >> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag >> 'x86-build-2025-05-25' of >> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") >> --- >> kernel/sched/topology.c | 9 +++++++++ >> 1 file changed, 9 insertions(+) >> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c >> index a2a38e1b6f18..1d634862c8df 100644 >> --- a/kernel/sched/topology.c >> +++ b/kernel/sched/topology.c >> @@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map) >> cpumask_clear(covered); >> cpumask_clear(id_seen); >> >> +#ifdef CONFIG_NUMA >> + /* >> + * Reuse the sched_domains_curr_level hack since >> + * tl->mask() below can resolve to sd_numa_mask() >> + * for the NODE domain. >> + */ >> + sched_domains_curr_level = tl->numa_level; >> +#endif >> + > > Urgh... Given this is now invoked after the build_sched_domain() loop, what > if we directly used the sched_domain_span(), instead, i.e. use > > sched_domain_mask(per_cpu_ptr(tl->data->sd, cpu)) > > instead of > > tl->mask(id) > > which means no indrect use of sched_domains_curr_level. Note that I'm > currently running out of brain juice so this might be a really stupid idea :-) Let me go try that! It should also help detect overlap in case build_sched_domain() has fixed up domain spans when the child's span is not a subset of the parent's span. > >> /* >> * Non-NUMA levels cannot partially overlap - they must be either >> * completely equal or completely disjoint. Otherwise we can end up >> >> base-commit: 914873bc7df913db988284876c16257e6ab772c6 >> -- >> 2.34.1 > -- Thanks and Regards, Prateek
© 2016 - 2025 Red Hat, Inc.