sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

[PATCH] sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

Posted by K Prateek Nayak 7 months, 2 weeks ago

The updated topology_span_sane() algorithm in commit ce29a7da84cd
("sched/topology: Refinement to topology_span_sane speedup") works on
the "sched_domain_topology_level" hierarchy to detect overlap in
!SDTL_OVERLAP domains using the tl->mask() as opposed to the
sched_domain hierarchy (and the sched_domain_span()) in the previous
approach.

For NODE domain, the cpumask retunred by tl->mask() depends on the
"sched_domains_curr_level". Unless the "sched_domains_curr_level" is
reset during topology_span_sane(), it reflects the "numa_level" of the
last sched_domain created which is incorrect.

Reset the "sched_domains_curr_level" to the "tl->numa_level" during
topology_span_sane(). Although setting "topology_span_sane" to 0 in
topology_span_sane() should be enough since all domains with
numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
"tl->numa_level" makes it more explicit and future proof against changes
in the same area.

Cc: Steve Wahl <steve.wahl@hpe.com>
Reported-by: Leon Romanovsky <leon@kernel.org>
Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
This issue can be reproduced on a setup with the following NUMA topology
shared by Leon:

    $ sudo numactl -H
    available: 5 nodes (0-4)
    node 0 cpus: 0 1
    node 0 size: 2927 MB
    node 0 free: 1603 MB
    node 1 cpus: 2 3
    node 1 size: 3023 MB
    node 1 free: 3008 MB
    node 2 cpus: 4 5
    node 2 size: 3023 MB
    node 2 free: 3007 MB
    node 3 cpus: 6 7
    node 3 size: 3023 MB
    node 3 free: 3002 MB
    node 4 cpus: 8 9
    node 4 size: 3022 MB
    node 4 free: 2718 MB
    node distances:
    node   0   1   2   3   4 
      0:  10  39  38  37  36 
      1:  39  10  38  37  36 
      2:  38  38  10  37  36 
      3:  37  37  37  10  36 
      4:  36  36  36  36  10


This topology can be emulated using QEMU with the following cmdline used
in my testing:

     sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
     -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...


Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
---
 kernel/sched/topology.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..1d634862c8df 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 		cpumask_clear(covered);
 		cpumask_clear(id_seen);
 
+#ifdef CONFIG_NUMA
+		/*
+		 * Reuse the sched_domains_curr_level hack since
+		 * tl->mask() below can resolve to sd_numa_mask()
+		 * for the NODE domain.
+		 */
+		sched_domains_curr_level = tl->numa_level;
+#endif
+
 		/*
 		 * Non-NUMA levels cannot partially overlap - they must be either
 		 * completely equal or completely disjoint. Otherwise we can end up

base-commit: 914873bc7df913db988284876c16257e6ab772c6
-- 
2.34.1

Re: [PATCH] sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

Posted by Leon Romanovsky 7 months, 2 weeks ago

On Tue, Jun 24, 2025 at 04:12:35AM +0000, K Prateek Nayak wrote:
> The updated topology_span_sane() algorithm in commit ce29a7da84cd
> ("sched/topology: Refinement to topology_span_sane speedup") works on
> the "sched_domain_topology_level" hierarchy to detect overlap in
> !SDTL_OVERLAP domains using the tl->mask() as opposed to the
> sched_domain hierarchy (and the sched_domain_span()) in the previous
> approach.
> 
> For NODE domain, the cpumask retunred by tl->mask() depends on the
> "sched_domains_curr_level". Unless the "sched_domains_curr_level" is
> reset during topology_span_sane(), it reflects the "numa_level" of the
> last sched_domain created which is incorrect.
> 
> Reset the "sched_domains_curr_level" to the "tl->numa_level" during
> topology_span_sane(). Although setting "topology_span_sane" to 0 in
> topology_span_sane() should be enough since all domains with
> numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
> "tl->numa_level" makes it more explicit and future proof against changes
> in the same area.

<...>

> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> ---
>  kernel/sched/topology.c | 9 +++++++++
>  1 file changed, 9 insertions(+)

Sorry for the delay, we had internal issues with our verification cloud :(.

Thanks

Re: [PATCH] sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

Posted by Valentin Schneider 7 months, 2 weeks ago

Hey,

First of all, thanks for looking into this!

On 24/06/25 04:12, K Prateek Nayak wrote:
> The updated topology_span_sane() algorithm in commit ce29a7da84cd
> ("sched/topology: Refinement to topology_span_sane speedup") works on
> the "sched_domain_topology_level" hierarchy to detect overlap in
> !SDTL_OVERLAP domains using the tl->mask() as opposed to the
> sched_domain hierarchy (and the sched_domain_span()) in the previous
> approach.
>

The previous approach also used tl->mask() directly, but it happened
to be called *before* the build_sched_domain() loop, so the NODE iteration
happened with sched_domain_curr_level at its default static value of
0... For the first SD build that is, I assume that was then broken for any
subsequent rebuild.

> For NODE domain, the cpumask retunred by tl->mask() depends on the
> "sched_domains_curr_level". Unless the "sched_domains_curr_level" is
> reset during topology_span_sane(), it reflects the "numa_level" of the
> last sched_domain created which is incorrect.
>
> Reset the "sched_domains_curr_level" to the "tl->numa_level" during
> topology_span_sane(). Although setting "topology_span_sane" to 0 in
> topology_span_sane() should be enough since all domains with
> numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
> "tl->numa_level" makes it more explicit and future proof against changes
> in the same area.
>
> Cc: Steve Wahl <steve.wahl@hpe.com>
> Reported-by: Leon Romanovsky <leon@kernel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
> Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")

Per the above, this could probably be:

Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")

> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> This issue can be reproduced on a setup with the following NUMA topology
> shared by Leon:
>
>     $ sudo numactl -H
>     available: 5 nodes (0-4)
>     node 0 cpus: 0 1
>     node 0 size: 2927 MB
>     node 0 free: 1603 MB
>     node 1 cpus: 2 3
>     node 1 size: 3023 MB
>     node 1 free: 3008 MB
>     node 2 cpus: 4 5
>     node 2 size: 3023 MB
>     node 2 free: 3007 MB
>     node 3 cpus: 6 7
>     node 3 size: 3023 MB
>     node 3 free: 3002 MB
>     node 4 cpus: 8 9
>     node 4 size: 3022 MB
>     node 4 free: 2718 MB
>     node distances:
>     node   0   1   2   3   4
>       0:  10  39  38  37  36
>       1:  39  10  38  37  36
>       2:  38  38  10  37  36
>       3:  37  37  37  10  36
>       4:  36  36  36  36  10
>
>
> This topology can be emulated using QEMU with the following cmdline used
> in my testing:
>
>      sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
>      -cpu host \
>      -m 20G -smp cpus=10,sockets=10 -machine q35 \
>      -object memory-backend-ram,size=4G,id=m0 \
>      -object memory-backend-ram,size=4G,id=m1 \
>      -object memory-backend-ram,size=4G,id=m2 \
>      -object memory-backend-ram,size=4G,id=m3 \
>      -object memory-backend-ram,size=4G,id=m4 \
>      -numa node,cpus=0-1,memdev=m0,nodeid=0 \
>      -numa node,cpus=2-3,memdev=m1,nodeid=1 \
>      -numa node,cpus=4-5,memdev=m2,nodeid=2 \
>      -numa node,cpus=6-7,memdev=m3,nodeid=3 \
>      -numa node,cpus=8-9,memdev=m4,nodeid=4 \
>      -numa dist,src=0,dst=1,val=39 \
>      -numa dist,src=0,dst=2,val=38 \
>      -numa dist,src=0,dst=3,val=37 \
>      -numa dist,src=0,dst=4,val=36 \
>      -numa dist,src=1,dst=0,val=39 \
>      -numa dist,src=1,dst=2,val=38 \
>      -numa dist,src=1,dst=3,val=37 \
>      -numa dist,src=1,dst=4,val=36 \
>      -numa dist,src=2,dst=0,val=38 \
>      -numa dist,src=2,dst=1,val=38 \
>      -numa dist,src=2,dst=3,val=37 \
>      -numa dist,src=2,dst=4,val=36 \
>      -numa dist,src=3,dst=0,val=37 \
>      -numa dist,src=3,dst=1,val=37 \
>      -numa dist,src=3,dst=2,val=37 \
>      -numa dist,src=3,dst=4,val=36 \
>      -numa dist,src=4,dst=0,val=36 \
>      -numa dist,src=4,dst=1,val=36 \
>      -numa dist,src=4,dst=2,val=36 \
>      -numa dist,src=4,dst=3,val=36 \
>      ...
>

It's a bit of a mouthful but I would keep that in the changelog itself
given that it's part of reproducing the bug.

>
> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> ---
>  kernel/sched/topology.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a2a38e1b6f18..1d634862c8df 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>               cpumask_clear(covered);
>               cpumask_clear(id_seen);
>
> +#ifdef CONFIG_NUMA
> +		/*
> +		 * Reuse the sched_domains_curr_level hack since
> +		 * tl->mask() below can resolve to sd_numa_mask()
> +		 * for the NODE domain.
> +		 */
> +		sched_domains_curr_level = tl->numa_level;
> +#endif
> +

Urgh... Given this is now invoked after the build_sched_domain() loop, what
if we directly used the sched_domain_span(), instead, i.e. use

   sched_domain_mask(per_cpu_ptr(tl->data->sd, cpu))

instead of

   tl->mask(id)

which means no indrect use of sched_domains_curr_level. Note that I'm
currently running out of brain juice so this might be a really stupid idea :-)

>               /*
>                * Non-NUMA levels cannot partially overlap - they must be either
>                * completely equal or completely disjoint. Otherwise we can end up
>
> base-commit: 914873bc7df913db988284876c16257e6ab772c6
> --
> 2.34.1

Re: [PATCH] sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

Posted by K Prateek Nayak 7 months, 2 weeks ago

Hello Valentin,

Thank you for taking a look at the patch.

On 6/25/2025 9:02 PM, Valentin Schneider wrote:
> Hey,
> 
> First of all, thanks for looking into this!
> 
> On 24/06/25 04:12, K Prateek Nayak wrote:
>> The updated topology_span_sane() algorithm in commit ce29a7da84cd
>> ("sched/topology: Refinement to topology_span_sane speedup") works on
>> the "sched_domain_topology_level" hierarchy to detect overlap in
>> !SDTL_OVERLAP domains using the tl->mask() as opposed to the
>> sched_domain hierarchy (and the sched_domain_span()) in the previous
>> approach.
>>
> 
> The previous approach also used tl->mask() directly, but it happened
> to be called *before* the build_sched_domain() loop, so the NODE iteration
> happened with sched_domain_curr_level at its default static value of
> 0... For the first SD build that is, I assume that was then broken for any
> subsequent rebuild.

Oh right! I read it wrong. My bad. topology_span_sane() being called
before build_sched_domain() always worked on "sched_domains_curr_level"
of the previous domain and could potentially cause an issue during
rebuild if we start from the NODE domain.

> 
>> For NODE domain, the cpumask retunred by tl->mask() depends on the
>> "sched_domains_curr_level". Unless the "sched_domains_curr_level" is
>> reset during topology_span_sane(), it reflects the "numa_level" of the
>> last sched_domain created which is incorrect.
>>
>> Reset the "sched_domains_curr_level" to the "tl->numa_level" during
>> topology_span_sane(). Although setting "topology_span_sane" to 0 in
>> topology_span_sane() should be enough since all domains with
>> numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
>> "tl->numa_level" makes it more explicit and future proof against changes
>> in the same area.
>>
>> Cc: Steve Wahl <steve.wahl@hpe.com>
>> Reported-by: Leon Romanovsky <leon@kernel.org>
>> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
>> Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")
> 
> Per the above, this could probably be:
> 
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")

I'll update this in the next version.

> 
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> This issue can be reproduced on a setup with the following NUMA topology
>> shared by Leon:
>>
>>      $ sudo numactl -H
>>      available: 5 nodes (0-4)
>>      node 0 cpus: 0 1
>>      node 0 size: 2927 MB
>>      node 0 free: 1603 MB
>>      node 1 cpus: 2 3
>>      node 1 size: 3023 MB
>>      node 1 free: 3008 MB
>>      node 2 cpus: 4 5
>>      node 2 size: 3023 MB
>>      node 2 free: 3007 MB
>>      node 3 cpus: 6 7
>>      node 3 size: 3023 MB
>>      node 3 free: 3002 MB
>>      node 4 cpus: 8 9
>>      node 4 size: 3022 MB
>>      node 4 free: 2718 MB
>>      node distances:
>>      node   0   1   2   3   4
>>        0:  10  39  38  37  36
>>        1:  39  10  38  37  36
>>        2:  38  38  10  37  36
>>        3:  37  37  37  10  36
>>        4:  36  36  36  36  10
>>
>>
>> This topology can be emulated using QEMU with the following cmdline used
>> in my testing:
>>
>>       sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
>>       -cpu host \
>>       -m 20G -smp cpus=10,sockets=10 -machine q35 \
>>       -object memory-backend-ram,size=4G,id=m0 \
>>       -object memory-backend-ram,size=4G,id=m1 \
>>       -object memory-backend-ram,size=4G,id=m2 \
>>       -object memory-backend-ram,size=4G,id=m3 \
>>       -object memory-backend-ram,size=4G,id=m4 \
>>       -numa node,cpus=0-1,memdev=m0,nodeid=0 \
>>       -numa node,cpus=2-3,memdev=m1,nodeid=1 \
>>       -numa node,cpus=4-5,memdev=m2,nodeid=2 \
>>       -numa node,cpus=6-7,memdev=m3,nodeid=3 \
>>       -numa node,cpus=8-9,memdev=m4,nodeid=4 \
>>       -numa dist,src=0,dst=1,val=39 \
>>       -numa dist,src=0,dst=2,val=38 \
>>       -numa dist,src=0,dst=3,val=37 \
>>       -numa dist,src=0,dst=4,val=36 \
>>       -numa dist,src=1,dst=0,val=39 \
>>       -numa dist,src=1,dst=2,val=38 \
>>       -numa dist,src=1,dst=3,val=37 \
>>       -numa dist,src=1,dst=4,val=36 \
>>       -numa dist,src=2,dst=0,val=38 \
>>       -numa dist,src=2,dst=1,val=38 \
>>       -numa dist,src=2,dst=3,val=37 \
>>       -numa dist,src=2,dst=4,val=36 \
>>       -numa dist,src=3,dst=0,val=37 \
>>       -numa dist,src=3,dst=1,val=37 \
>>       -numa dist,src=3,dst=2,val=37 \
>>       -numa dist,src=3,dst=4,val=36 \
>>       -numa dist,src=4,dst=0,val=36 \
>>       -numa dist,src=4,dst=1,val=36 \
>>       -numa dist,src=4,dst=2,val=36 \
>>       -numa dist,src=4,dst=3,val=36 \
>>       ...
>>
> 
> It's a bit of a mouthful but I would keep that in the changelog itself
> given that it's part of reproducing the bug.

Sure thing!

> 
>>
>> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
>> 'x86-build-2025-05-25' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>> ---
>>   kernel/sched/topology.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index a2a38e1b6f18..1d634862c8df 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>>                cpumask_clear(covered);
>>                cpumask_clear(id_seen);
>>
>> +#ifdef CONFIG_NUMA
>> +		/*
>> +		 * Reuse the sched_domains_curr_level hack since
>> +		 * tl->mask() below can resolve to sd_numa_mask()
>> +		 * for the NODE domain.
>> +		 */
>> +		sched_domains_curr_level = tl->numa_level;
>> +#endif
>> +
> 
> Urgh... Given this is now invoked after the build_sched_domain() loop, what
> if we directly used the sched_domain_span(), instead, i.e. use
> 
>     sched_domain_mask(per_cpu_ptr(tl->data->sd, cpu))
> 
> instead of
> 
>     tl->mask(id)
> 
> which means no indrect use of sched_domains_curr_level. Note that I'm
> currently running out of brain juice so this might be a really stupid idea :-)

Let me go try that! It should also help detect overlap in case
build_sched_domain() has fixed up domain spans when the child's span is
not a subset of the parent's span.

> 
>>                /*
>>                 * Non-NUMA levels cannot partially overlap - they must be either
>>                 * completely equal or completely disjoint. Otherwise we can end up
>>
>> base-commit: 914873bc7df913db988284876c16257e6ab772c6
>> --
>> 2.34.1
> 

-- 
Thanks and Regards,
Prateek