The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8e8e23dea43e64ddafbd1246644c3219209be113
Gitweb: https://git.kernel.org/tip/8e8e23dea43e64ddafbd1246644c3219209be113
Author: K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate: Thu, 12 Mar 2026 04:44:26
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 18 Mar 2026 09:06:47 +01:00
sched/topology: Compute sd_weight considering cpuset partitions
The "sd_weight" used for calculating the load balancing interval, and
its limits, considers the span weight of the entire topology level
without accounting for cpuset partitions.
For example, consider a large system of 128 CPUs divided into 8
partitions of 16 CPUs each, which is typical when deploying virtual machines:
[ PKG Domain: 128CPUs ]
[Partition0: 16CPUs][Partition1: 16CPUs] ... [Partition7: 16CPUs]
Although each partition only contains 16 CPUs, the load balancing
interval is set to a minimum of 128 jiffies based on the span of the
entire 128-CPU domain, which can leave imbalances within a partition
standing longer even though balancing across 16 CPUs is cheaper.
Compute the "sd_weight" after computing the "sd_span" considering the
cpu_map covered by the partition, and set the load balancing interval,
and its limits accordingly.
For the above example, the balancing intervals for a partition's PKG
domain change as follows:
                 before   after
balance_interval    128      16
min_interval        128      16
max_interval        256      32
Intervals are now proportional to the CPUs in the partitioned domain as
was intended by the original formula.
Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-2-kprateek.nayak@amd.com
---
kernel/sched/topology.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c8..79bab80 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
struct cpumask *sd_span;
u64 now = sched_clock();
- sd_weight = cpumask_weight(tl->mask(tl, cpu));
+ sd_span = sched_domain_span(sd);
+ cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+ sd_weight = cpumask_weight(sd_span);
+ sd_id = cpumask_first(sd_span);
if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
- "wrong sd_flags in topology description\n"))
+ "wrong sd_flags in topology description\n"))
sd_flags &= TOPOLOGY_SD_FLAGS;
+ sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
*sd = (struct sched_domain){
.min_interval = sd_weight,
@@ -1689,12 +1693,6 @@ sd_init(struct sched_domain_topology_level *tl,
.name = tl->name,
};
- sd_span = sched_domain_span(sd);
- cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
- sd_id = cpumask_first(sd_span);
-
- sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
-
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
(SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
"CPU capacity asymmetry not supported on SMT\n");
Hi all,
On Wed, Mar 18, 2026 at 08:08:44AM -0000, tip-bot2 for K Prateek Nayak wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 8e8e23dea43e64ddafbd1246644c3219209be113
> Gitweb: https://git.kernel.org/tip/8e8e23dea43e64ddafbd1246644c3219209be113
> Author: K Prateek Nayak <kprateek.nayak@amd.com>
> AuthorDate: Thu, 12 Mar 2026 04:44:26
> Committer: Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 18 Mar 2026 09:06:47 +01:00
>
> sched/topology: Compute sd_weight considering cpuset partitions
>
> The "sd_weight" used for calculating the load balancing interval, and
> its limits, considers the span weight of the entire topology level
> without accounting for cpuset partitions.
>
> For example, consider a large system of 128CPUs divided into 8 * 16CPUs
> partition which is typical when deploying virtual machines:
>
> [ PKG Domain: 128CPUs ]
>
> [Partition0: 16CPUs][Partition1: 16CPUs] ... [Partition7: 16CPUs]
>
> Although each partition only contains 16CPUs, the load balancing
> interval is set to a minimum of 128 jiffies considering the span of the
> entire domain with 128CPUs which can lead to longer imbalances within
> the partition although balancing within is cheaper with 16CPUs.
>
> Compute the "sd_weight" after computing the "sd_span" considering the
> cpu_map covered by the partition, and set the load balancing interval,
> and its limits accordingly.
>
> For the above example, the balancing intervals for the partitions PKG
> domain changes as follows:
>
> before after
> balance_interval 128 16
> min_interval 128 16
> max_interval 256 32
>
> Intervals are now proportional to the CPUs in the partitioned domain as
> was intended by the original formula.
>
> Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Link: https://patch.msgid.link/20260312044434.1974-2-kprateek.nayak@amd.com
> ---
> kernel/sched/topology.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c8..79bab80 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1645,13 +1645,17 @@ sd_init(struct sched_domain_topology_level *tl,
> struct cpumask *sd_span;
> u64 now = sched_clock();
>
> - sd_weight = cpumask_weight(tl->mask(tl, cpu));
> + sd_span = sched_domain_span(sd);
> + cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> + sd_weight = cpumask_weight(sd_span);
> + sd_id = cpumask_first(sd_span);
>
> if (tl->sd_flags)
> sd_flags = (*tl->sd_flags)();
> if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
> - "wrong sd_flags in topology description\n"))
> + "wrong sd_flags in topology description\n"))
> sd_flags &= TOPOLOGY_SD_FLAGS;
> + sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
>
> *sd = (struct sched_domain){
> .min_interval = sd_weight,
> @@ -1689,12 +1693,6 @@ sd_init(struct sched_domain_topology_level *tl,
> .name = tl->name,
> };
>
> - sd_span = sched_domain_span(sd);
> - cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> - sd_id = cpumask_first(sd_span);
> -
> - sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
> -
> WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
> (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
> "CPU capacity asymmetry not supported on SMT\n");
Apologies if this has already been reported or addressed, but I am
seeing a crash when booting certain ARM configurations after this
change landed in -next. I reduced it down to:
$ cat kernel/configs/schedstats.config
CONFIG_SCHEDSTATS=y
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- mrproper defconfig schedstats.config zImage
$ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/arm-rootfs.cpio.zst | zstd -d >rootfs.cpio
$ qemu-system-arm \
-display none \
-nodefaults \
-no-reboot \
-machine virt \
-append 'console=ttyAMA0 earlycon' \
-kernel arch/arm/boot/zImage \
-initrd rootfs.cpio \
-m 1G \
-serial mon:stdio
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 7.0.0-rc4-00017-g8e8e23dea43e (nathan@framework-amd-ryzen-maxplus-395) (arm-linux-gnueabi-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Fri Mar 20 16:12:05 MST 2026
...
[ 0.031929] 8<--- cut here ---
[ 0.031999] Unable to handle kernel NULL pointer dereference at virtual address 00000000 when write
[ 0.032172] [00000000] *pgd=00000000
[ 0.032459] Internal error: Oops: 805 [#1] SMP ARM
[ 0.032902] Modules linked in:
[ 0.033466] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00017-g8e8e23dea43e #1 VOLUNTARY
[ 0.033658] Hardware name: Generic DT based system
[ 0.033770] PC is at build_sched_domains+0x7d0/0x1628
[ 0.034091] LR is at build_sched_domains+0x78c/0x1628
[ 0.034166] pc : [<c03c54bc>] lr : [<c03c5478>] psr: 20000053
[ 0.034255] sp : f080dec0 ip : 00000000 fp : c1e244a4
[ 0.034339] r10: c1e04fd4 r9 : c1e24518 r8 : 00000000
[ 0.034415] r7 : c2088f20 r6 : c28db924 r5 : c1e051ec r4 : 00000010
[ 0.034508] r3 : 00000000 r2 : 00000000 r1 : 00000010 r0 : 00000010
[ 0.034623] Flags: nzCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none
[ 0.034730] Control: 10c5387d Table: 4020406a DAC: 00000051
[ 0.034819] Register r0 information: zero-size pointer
[ 0.034990] Register r1 information: zero-size pointer
[ 0.035064] Register r2 information: NULL pointer
[ 0.035133] Register r3 information: NULL pointer
[ 0.035198] Register r4 information: zero-size pointer
[ 0.035266] Register r5 information: non-slab/vmalloc memory
[ 0.035376] Register r6 information: slab kmalloc-512 start c28db800 pointer offset 292 size 512
[ 0.035623] Register r7 information: non-slab/vmalloc memory
[ 0.035703] Register r8 information: NULL pointer
[ 0.035769] Register r9 information: non-slab/vmalloc memory
[ 0.035848] Register r10 information: non-slab/vmalloc memory
[ 0.035928] Register r11 information: non-slab/vmalloc memory
[ 0.036006] Register r12 information: NULL pointer
[ 0.036083] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
[ 0.036243] Stack: (0xf080dec0 to 0xf080e000)
[ 0.036339] dec0: 00000000 c139a06c 00000001 00000000 c1e243f4 c28db924 c28db800 00000000
[ 0.036450] dee0: 00000000 ffff8ad3 00000000 00000001 c18f9f1c 00000000 c1e03d80 c1a8d4d0
[ 0.036559] df00: 00000000 c2073b8f c28e3180 00000000 c20d0050 c1d7ea64 c28b8800 f4b63fe3
[ 0.036665] df20: c1e22714 c2969480 c1e22714 00000000 c2074620 c1a8a0e8 00000000 00000000
[ 0.036772] df40: f080df6c c1c1d724 20000053 c0303d80 f080df64 f4b63fe3 c1d703dc c1d703dc
[ 0.036878] df60: c1d703dc 00000000 00000000 c1c01368 c2969480 f080df74 f080df74 f4b63fe3
[ 0.036989] df80: 00000000 c1e04f80 c13979fc 00000000 00000000 00000000 00000000 00000000
[ 0.037097] dfa0: 00000000 c1397a14 00000000 c03001ac 00000000 00000000 00000000 00000000
[ 0.037206] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 0.037316] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 0.037447] Call trace:
[ 0.037698] build_sched_domains from sched_init_smp+0x80/0x108
[ 0.037943] sched_init_smp from kernel_init_freeable+0xe8/0x24c
[ 0.038029] kernel_init_freeable from kernel_init+0x18/0x12c
[ 0.038122] kernel_init from ret_from_fork+0x14/0x28
[ 0.038209] Exception stack(0xf080dfb0 to 0xf080dff8)
[ 0.038277] dfa0: 00000000 00000000 00000000 00000000
[ 0.038386] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 0.038495] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 0.038640] Code: e58d3020 e58d300c e59d3020 e59d200c (e5832000)
[ 0.038903] ---[ end trace 0000000000000000 ]---
[ 0.039275] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 0.039628] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
If there is any more information I can provide or patches I can test, I
am more than happy to do so.
Cheers,
Nathan
# bad: [b5d083a3ed1e2798396d5e491432e887da8d4a06] Add linux-next specific files for 20260319
# good: [8a30aeb0d1b4e4aaf7f7bae72f20f2ae75385ccb] Merge tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
git bisect start 'b5d083a3ed1e2798396d5e491432e887da8d4a06' '8a30aeb0d1b4e4aaf7f7bae72f20f2ae75385ccb'
# good: [21fbd87ec0afe2af5457f5a7f9acbee4bf5db891] Merge branch 'main' of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
git bisect good 21fbd87ec0afe2af5457f5a7f9acbee4bf5db891
# good: [bffa4391cf4ee844778893a781f14faa55c75cce] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
git bisect good bffa4391cf4ee844778893a781f14faa55c75cce
# bad: [a360efb89caee066919156db3921e616093c43b6] Merge branch 'for-leds-next' of https://git.kernel.org/pub/scm/linux/kernel/git/lee/leds.git
git bisect bad a360efb89caee066919156db3921e616093c43b6
# good: [77f1b9e1181ac53ae9ce7c3c0e52002d02495c5e] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good 77f1b9e1181ac53ae9ce7c3c0e52002d02495c5e
# bad: [d0b3afea83e48990083c0367c10f02af751166b4] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect bad d0b3afea83e48990083c0367c10f02af751166b4
# good: [fe58c95c6f191a8c45dc183a2348a3b4caa77ed8] Merge branch into tip/master: 'perf/core'
git bisect good fe58c95c6f191a8c45dc183a2348a3b4caa77ed8
# bad: [90924d8b73ac96a1a8b1cb9ba6cae36e193061a1] Merge branch into tip/master: 'timers/vdso'
git bisect bad 90924d8b73ac96a1a8b1cb9ba6cae36e193061a1
# bad: [91396a53d7c7cb694627c665e0dbd2589c99eb0a] Merge branch into tip/master: 'timers/core'
git bisect bad 91396a53d7c7cb694627c665e0dbd2589c99eb0a
# bad: [fe7171d0d5dfbe189e41db99580ebacafc3c09ce] sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
git bisect bad fe7171d0d5dfbe189e41db99580ebacafc3c09ce
# good: [54a66e431eeacf23e1dc47cb3507f2d0c068aaf0] sched/headers: Inline raw_spin_rq_unlock()
git bisect good 54a66e431eeacf23e1dc47cb3507f2d0c068aaf0
# bad: [1cc8a33ca7e8d38f962b64ece2a42c411a67bc76] sched/topology: Allocate per-CPU sched_domain_shared in s_data
git bisect bad 1cc8a33ca7e8d38f962b64ece2a42c411a67bc76
# good: [786244f70322e41c937e69f0f935bfd11a9611bf] Merge tag 'v7.0-rc4' into sched/core, to pick up scheduler fixes
git bisect good 786244f70322e41c937e69f0f935bfd11a9611bf
# bad: [5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1] sched/topology: Extract "imb_numa_nr" calculation into a separate helper
git bisect bad 5a7b576b3ec1acc2694c5b58f80cd1d44a11b2c1
# bad: [8e8e23dea43e64ddafbd1246644c3219209be113] sched/topology: Compute sd_weight considering cpuset partitions
git bisect bad 8e8e23dea43e64ddafbd1246644c3219209be113
# first bad commit: [8e8e23dea43e64ddafbd1246644c3219209be113] sched/topology: Compute sd_weight considering cpuset partitions
Hello Nathan,
Thank you for the report.
On 3/21/2026 5:28 AM, Nathan Chancellor wrote:
> $ cat kernel/configs/schedstats.config
> CONFIG_SCHEDSTATS=y
Is the "schedstats.config" available somewhere? I tried these
steps on my end but couldn't reproduce the crash with my config.
Also, are you saying it is necessary to enable CONFIG_SCHEDSTATS
to observe the crash?
>
> $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- mrproper defconfig schedstats.config zImage
>
> $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/arm-rootfs.cpio.zst | zstd -d >rootfs.cpio
>
> $ qemu-system-arm \
> -display none \
> -nodefaults \
> -no-reboot \
> -machine virt \
> -append 'console=ttyAMA0 earlycon' \
> -kernel arch/arm/boot/zImage \
> -initrd rootfs.cpio \
> -m 1G \
> -serial mon:stdio
> [ 0.000000] Booting Linux on physical CPU 0x0
> [ 0.000000] Linux version 7.0.0-rc4-00017-g8e8e23dea43e (nathan@framework-amd-ryzen-maxplus-395) (arm-linux-gnueabi-gcc (GCC) 15.2.0, GNU ld (GNU Binutils) 2.45) #1 SMP Fri Mar 20 16:12:05 MST 2026
> ...
> [ 0.031929] 8<--- cut here ---
> [ 0.031999] Unable to handle kernel NULL pointer dereference at virtual address 00000000 when write
> [ 0.032172] [00000000] *pgd=00000000
> [ 0.032459] Internal error: Oops: 805 [#1] SMP ARM
> [ 0.032902] Modules linked in:
> [ 0.033466] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00017-g8e8e23dea43e #1 VOLUNTARY
> [ 0.033658] Hardware name: Generic DT based system
> [ 0.033770] PC is at build_sched_domains+0x7d0/0x1628
For me, this points to:
$ scripts/faddr2line vmlinux build_sched_domains+0x7d0/0x1628
build_sched_domains+0x7d0/0x1628:
find_next_bit_wrap at include/linux/find.h:455
(inlined by) build_sched_groups at kernel/sched/topology.c:1255
(inlined by) build_sched_domains at kernel/sched/topology.c:2603
which is the:
span = sched_domain_span(sd);
for_each_cpu_wrap(i, span, cpu) /* Here */ {
...
}
in build_sched_groups(), so we are likely running off the allocated
cpumask size. But before that, we do this in the caller:
sd->span_weight = cpumask_weight(sched_domain_span(sd));
which should have crashed too if we had a NULL pointer in the
cpumask range. So I'm at a loss. Maybe the pc points to a
different location in your build?
--
Thanks and Regards,
Prateek
On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
> Hello Nathan,
>
> Thank you for the report.
>
> On 3/21/2026 5:28 AM, Nathan Chancellor wrote:
>> $ cat kernel/configs/schedstats.config
>> CONFIG_SCHEDSTATS=y
>
> Is the "schedstats.config" available somewhere? I tried these
> steps on my end but couldn't reproduce the crash with my config.
>
> Also, are you saying it is necessary to enable CONFIG_SCHEDSTATS
> to observe the crash?
>
>> [ ... reproduction steps and boot log quoted in full, snipped ... ]
>
> For me, this points to:
>
> $ scripts/faddr2line vmlinux build_sched_domains+0x7d0/0x1628
I suppose we might need to use arm-linux-gnueabi-addr2line, just
in case of a mismatch.
> build_sched_domains+0x7d0/0x1628:
> find_next_bit_wrap at include/linux/find.h:455
> (inlined by) build_sched_groups at kernel/sched/topology.c:1255
> (inlined by) build_sched_domains at kernel/sched/topology.c:2603
>
> which is the:
>
> span = sched_domain_span(sd);
>
> for_each_cpu_wrap(i, span, cpu) /* Here */ {
> ...
> }
>
> in build_sched_groups() so we are likely going off the allocated
> cpumask size but before that, we do this in the caller:
>
> sd->span_weight = cpumask_weight(sched_domain_span(sd));
>
> which should have crashed too if we had a NULL pointer in the
> cpumask range. So I'm at a loss. Maybe the pc points to a
> different location in your build?
>
A wild guess: the major change is that we access sd->span before
initializing the sd structure with *sd = { ... }. The sd is allocated
uninitialized via alloc_percpu(), so the span at the end of the
structure remains uninitialized, and it is unclear how
cpumask_weight(sd->span) might be affected by that. Before this patch,
after *sd = { ... } was executed, the contents of sd->span were
explicitly set to 0, which might be safer?
Thanks,
Chenyu
On 3/21/2026 3:33 PM, Chen, Yu C wrote:
> On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
>> sd->span_weight = cpumask_weight(sched_domain_span(sd));
>>
>> which should have crashed too if we had a NULL pointer in the
>> cpumask range. So I'm at a loss. Maybe the pc points to a
>> different location in your build?
>>
>
> A wild guess, the major change is that we access sd->span, before
> initializing the sd structure with *sd = { ... }. The sd is allocated
> via alloc_percpu() uninitialized, the span at the end of the sd structure
> remain uninitialized. It is unclear how cpumask_weight(sd->span) might be
> affected by this uninitialized state. Before this patch, after *sd =
> { ... }
> is executed, the contents of sd->span are explicitly set to 0, which might
> be safer?
>
I replied too fast, please ignore the above comments; sd->span should
have been set via cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu)).
Hello Chenyu,
On 3/21/2026 1:17 PM, Chen, Yu C wrote:
> On 3/21/2026 3:33 PM, Chen, Yu C wrote:
>> On 3/21/2026 11:36 AM, K Prateek Nayak wrote:
>>> sd->span_weight = cpumask_weight(sched_domain_span(sd));
>>>
>>> which should have crashed too if we had a NULL pointer in the
>>> cpumask range. So I'm at a loss. Maybe the pc points to a
>>> different location in your build?
>>>
>>
>> A wild guess, the major change is that we access sd->span, before
>> initializing the sd structure with *sd = { ... }. The sd is allocated
>> via alloc_percpu() uninitialized, the span at the end of the sd structure
>> remain uninitialized. It is unclear how cpumask_weight(sd->span) might be
>> affected by this uninitialized state. Before this patch, after *sd = { ... }
>> is executed, the contents of sd->span are explicitly set to 0, which might
>> be safer?
>>
>
> I replied too fast, please ignore above comments, the sd->span should have been
> set via cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu))
So I managed to reproduce the crash and it is actually crashing at:
last->next = first;
in build_sched_groups(). If I print the span before and after we do
the *sd = { ... }, I see:
[ 0.056301] span before: 0
[ 0.056559] span after:
[ 0.056686] span double check:
The double check does a cpumask_pr_args(sched_domain_span(sd)).
This solves the crash on top of this patch:
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79bab80af8f2..b347ae5d2786 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1693,6 +1693,8 @@ sd_init(struct sched_domain_topology_level *tl,
.name = tl->name,
};
+ cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
(SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
"CPU capacity asymmetry not supported on SMT\n");
---
And I see:
[ 0.056479] span before: 0
[ 0.056749] span after: 0
[ 0.056881] span double check: 0
But since span[] is a flexible array member at the end of struct
sched_domain, doing *sd = { ... } shouldn't modify it: its size isn't
known at compile time, so the compiler should only overwrite the fixed
fields. Is there a compiler angle I'm missing here?
The cpumask_and() that comes first looks like:
@ kernel/sched/topology.c:1649: cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
ldr r3, [r9] @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317], MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ kernel/sched/topology.c:1646: u64 now = sched_clock();
strd r0, [sp, #28] @,,
@ kernel/sched/topology.c:1649: cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
mov r1, r6 @, i
mov r0, r9 @, ivtmp.1798
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
mov r4, fp @ tmp740, sd
@ kernel/sched/topology.c:1649: cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
blx r3 @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
ldr r3, [r0] @ MEM[(const long unsigned int *)_356], MEM[(const long unsigned int *)_356]
ldr r0, [r7] @ MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)cpu_map_104(D)]
and r0, r0, r3 @ tmp736, MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)_356]
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
uxth r0, r0 @ _360, tmp736
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
str r0, [r4, #292]! @ _360, MEM[(long unsigned int *)sd_352 + 292B]
---
The *sd assignment looks as follows in my disassembly:
.L1867:
@ kernel/sched/topology.c:1660: *sd = (struct sched_domain){
ldr ip, [sp, #48] @ tmp1203, %sfp
mov r2, #296 @,
mov r0, fp @, sd
mov r1, #0 @,
ldr r3, [ip] @ jiffies.324_453, jiffies
str r3, [sp, #36] @ jiffies.324_453, %sfp
ldr ip, [ip] @ jiffies.326_454, jiffies
@ kernel/sched/topology.c:1693: .name = tl->name,
ldr r3, [r9, #28] @ _455, MEM[(char * *)tl_317 + 28B]
str r3, [sp, #16] @ _455, %sfp
@ kernel/sched/topology.c:1660: *sd = (struct sched_domain){
str ip, [sp, #8] @ jiffies.326_454, %sfp
bl memset @
ldr r3, [sp, #36] @ jiffies.324_453, %sfp
ldr r2, [sp, #28] @ now, %sfp
str r3, [fp, #48] @ jiffies.324_453, sd_352->last_balance
ldr r3, [sp, #16] @ _455, %sfp
ldr ip, [sp, #8] @ jiffies.326_454, %sfp
str r2, [fp, #72] @ now, sd_352->newidle_stamp
str r3, [fp, #272] @ _455, sd_352->name
mov r3, #16 @ tmp1502,
ldr r2, [sp, #32] @ now, %sfp
str r3, [fp, #20] @ tmp1502, sd_352->busy_factor
@ kernel/sched/topology.c:1678: | sd_flags
orr r3, r4, #4096 @ _452, sd_flags,
@ kernel/sched/topology.c:1696: WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
and r4, r4, #160 @ tmp779, sd_flags,
@ kernel/sched/topology.c:1678: | sd_flags
orr r3, r3, #23 @ _452, _452,
@ kernel/sched/topology.c:1660: *sd = (struct sched_domain){
str r2, [fp, #76] @ now, sd_352->newidle_stamp
@ kernel/sched/topology.c:1696: WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
cmp r4, #160 @ tmp779,
@ kernel/sched/topology.c:1660: *sd = (struct sched_domain){
mov r2, #512 @ tmp776,
str ip, [fp, #88] @ jiffies.326_454, sd_352->last_decay_max_lb_cost
str r2, [fp, #60] @ tmp776, sd_352->newidle_call
str r2, [fp, #68] @ tmp776, sd_352->newidle_ratio
@ kernel/sched/topology.c:1662: .max_interval = 2*sd_weight,
lsl r2, r10, #1 @ tmp773, _484,
@ kernel/sched/topology.c:1660: *sd = (struct sched_domain){
str r5, [fp, #4] @ sd, sd_352->child
str r2, [fp, #16] @ tmp773, sd_352->max_interval
mov r2, #117 @ tmp775,
str r10, [fp, #12] @ _484, sd_352->min_interval
str r2, [fp, #24] @ tmp775, sd_352->imbalance_pct
mov r2, #256 @ tmp777,
str r10, [fp, #52] @ _484, sd_352->balance_interval
str r3, [fp, #40] @ _452, sd_352->flags
str r2, [fp, #64] @ tmp777, sd_352->newidle_success
---
If I add the new cpumask_and() I get the following after *sd assignment:
@ kernel/sched/topology.c:1696: cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
ldr r3, [r9] @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317], MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
blx r3 @ MEM[(const struct cpumask * (*<T2127>) (struct sched_domain_topology_level *, int) *)tl_317]
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
ldr r3, [r7] @ MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)cpu_map_104(D)]
ldr r2, [r0] @ MEM[(const long unsigned int *)_457], MEM[(const long unsigned int *)_457]
and r3, r3, r2 @ tmp788, MEM[(const long unsigned int *)cpu_map_104(D)], MEM[(const long unsigned int *)_457]
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
uxth r3, r3 @ tmp791, tmp788
@ ./include/linux/bitmap.h:329: return (*dst = *src1 & *src2 & BITMAP_LAST_WORD_MASK(nbits)) != 0;
str r3, [fp, #292] @ tmp791, MEM[(long unsigned int *)sd_352 + 292B]
---
Both cpumask_and() calls seem to store to:
MEM[(long unsigned int *)sd_352 + 292B]
So I'm at a loss as to why this happens. Let me dig a little more.
--
Thanks and Regards,
Prateek
Hello folks,
On 3/21/2026 2:29 PM, K Prateek Nayak wrote:
> So I managed to reproduce the crash and it is actually crashing at:
>
> last->next = first;
>
> in build_sched_groups(). If I print the span before and after we do
> the *sd = { ... }, I see:
>
> [ ... span debug prints and fix diff quoted in full, snipped ... ]
>
>
> But since span[] is a variable array at the end of sched_domain struct,
> doing a *sd = { ... } shouldn't modify it since the size isn't known at
> compile time and the compiler will only overwrite the fixed fields.
>
> Is there a compiler angle I'm missing here?
So this is what I've found: By default we have:
cpumask_size: 4
struct sched_domain size: 296
If I do:
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032426dc..f0bebce274f7 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -148,7 +148,7 @@ struct sched_domain {
* by attaching extra space to the end of the structure,
* depending on how many CPUs the kernel has booted up with)
*/
- unsigned long span[];
+ unsigned long span[1];
};
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
---
I still see:
cpumask_size: 4
struct sched_domain size: 296
Which means we are overwriting sd->span during the *sd assignment even
with the variable-length array at the end :-(
--
Thanks and Regards,
Prateek
On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
> So this is what I've found: By default we have:
>
> cpumask_size: 4
> struct sched_domain size: 296
>
> If I do:
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..f0bebce274f7 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
> * by attaching extra space to the end of the structure,
> * depending on how many CPUs the kernel has booted up with)
> */
> - unsigned long span[];
> + unsigned long span[1];
> };
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
>
> I still see:
>
> cpumask_size: 4
> struct sched_domain size: 296
>
> Which means we are overwriting the sd->span during *sd assignment even
> with the variable length array at the end :-(
>
And more evidence - by default we have:
sched_domain size: 296
offset of sd_span: 292
sizeof() seems to account for 4 bytes of tail padding in the struct,
which places the offset of sd->span inside the struct size.
To resolve this, we can also do:
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a1e1032426dc..48bea2f7f750 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -148,7 +148,7 @@ struct sched_domain {
* by attaching extra space to the end of the structure,
* depending on how many CPUs the kernel has booted up with)
*/
- unsigned long span[];
+ unsigned long span[] __aligned(2 * sizeof(int));
};
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
---
and the kernel boots fine with the sd_span offset aligned with
sched_domain struct size:
sched_domain size: 296
offset of sd_span: 296
So Peter, which solution do you prefer?
1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
2. Align sd->span to an 8-byte boundary.
--
Thanks and Regards,
Prateek
Hi Prateek.
On 3/21/26 3:43 PM, K Prateek Nayak wrote:
> On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
>> So this is what I've found: By default we have:
>>
>> cpumask_size: 4
>> struct sched_domain size: 296
>>
>> If I do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..f0bebce274f7 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>> * by attaching extra space to the end of the structure,
>> * depending on how many CPUs the kernel has booted up with)
>> */
>> - unsigned long span[];
>> + unsigned long span[1];
>> };
>>
>> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> I still see:
>>
>> cpumask_size: 4
>> struct sched_domain size: 296
>>
>> Which means we are overwriting the sd->span during *sd assignment even
>> with the variable length array at the end :-(
>>
>
> And more evidence - by default we have:
>
> sched_domain size: 296
> offset of sd_span: 292
>
> sizeof() seems to account some sort of 4-byte padding for the struct which
> pushes the offset of sd->span into the struct size.
>
> To resolve this, we can also do:
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..48bea2f7f750 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
> * by attaching extra space to the end of the structure,
> * depending on how many CPUs the kernel has booted up with)
> */
> - unsigned long span[];
> + unsigned long span[] __aligned(2 * sizeof(int));
> };
Wouldn't that be susceptible to changes in sched_domain somewhere in between?
Right now it may be aligning to 296 since it is 8-byte aligned.
But let's say someone adds a new int in between; then the size of sched_domain would be 300,
but span would still be at 296 since it is 8-byte aligned?
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
>
> and the kernel boots fine with the sd_span offset aligned with
> sched_domain struct size:
>
> sched_domain size: 296
> offset of sd_span: 296
>
>
> So Peter, which solution do you prefer?
>
> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
>
> 2. Align sd->span to an 8-byte boundary.
>
Only update sd_weight and leave everything as it was earlier?
sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));
Hello Shrikanth,
On 3/21/2026 7:43 PM, Shrikanth Hegde wrote:
>> And more evidence - by default we have:
>>
>> sched_domain size: 296
>> offset of sd_span: 292
>>
>> sizeof() seems to account some sort of 4-byte padding for the struct which
>> pushes the offset of sd->span into the struct size.
>>
>> To resolve this, we can also do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..48bea2f7f750 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>> * by attaching extra space to the end of the structure,
>> * depending on how many CPUs the kernel has booted up with)
>> */
>> - unsigned long span[];
>> + unsigned long span[] __aligned(2 * sizeof(int));
>> };
>
> Wouldn't that be susceptible to change in sched_domain somewhere in between?
> Right now, it maybe aligning to 296 since it is 8 byte aligned.
>
> But lets say someone adds a new int in between. Then size of sched_domain would be 300.
> but span would still be 296 since it 8 bit aligned?
So the official GCC specification for "Arrays of Length Zero" [1] says:
Although the size of a zero-length array is zero, an array member of
this kind may increase the size of the enclosing type as a result of
tail padding.
so you can either have:
struct sched_domain {
...
unsigned int span_length; /* 288 4 */
unsigned long span[]; /* 292 0 */
/* XXX 4 byte tail padding */
/* size: 296 */
}
or:
struct sched_domain {
...
unsigned int span_length; /* 288 4 */
/* XXX 4 bytes hole, try to pack */
unsigned long span[]; /* 296 0 */
/* size: 296 */
}
If the variable length array is aligned, there is no need for a tail
padding and in both cases, length of span[] is always 0.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
But ...
>
>> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> and the kernel boots fine with the sd_span offset aligned with
>> sched_domain struct size:
>>
>> sched_domain size: 296
>> offset of sd_span: 296
>>
>>
>> So Peter, which solution do you prefer?
>>
>> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
>>
>> 2. Align sd->span to an 8-byte boundary.
>>
>
> Only update sd_weight and leave everything as it was earlier?
>
> sd_weight = cpumask_and_weight(cpu_map, tl->mask(tl, cpu));
... I agree with you and Chenyu that this approach is better, since
padding and alignment are again dependent on the compiler.
We already do a cpumask_weight() for sd_weight, and cpumask_weight_and()
being of the same complexity shouldn't add any more overhead.
While we are at it, we can also remove that "sd->span_weight" assignment
before the build_sched_groups() loop since we already have it here.
--
Thanks and Regards,
Prateek
On 3/21/2026 6:13 PM, K Prateek Nayak wrote:
> On 3/21/2026 3:15 PM, K Prateek Nayak wrote:
>> So this is what I've found: By default we have:
>>
>> cpumask_size: 4
>> struct sched_domain size: 296
>>
>> If I do:
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a1e1032426dc..f0bebce274f7 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -148,7 +148,7 @@ struct sched_domain {
>> * by attaching extra space to the end of the structure,
>> * depending on how many CPUs the kernel has booted up with)
>> */
>> - unsigned long span[];
>> + unsigned long span[1];
>> };
>>
>> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
>> ---
>>
>> I still see:
>>
>> cpumask_size: 4
>> struct sched_domain size: 296
>>
>> Which means we are overwriting the sd->span during *sd assignment even
>> with the variable length array at the end :-(
>>
Ah, that's right.
>
> And more evidence - by default we have:
>
> sched_domain size: 296
> offset of sd_span: 292
>
> sizeof() seems to account some sort of 4-byte padding for the struct which
> pushes the offset of sd->span into the struct size.
>
In your disassembly for *sd = {...}
mov r2, #296
mov r0, fp
mov r1, #0
...
bl memset <-- oops!
> To resolve this, we can also do:
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a1e1032426dc..48bea2f7f750 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -148,7 +148,7 @@ struct sched_domain {
> * by attaching extra space to the end of the structure,
> * depending on how many CPUs the kernel has booted up with)
> */
> - unsigned long span[];
> + unsigned long span[] __aligned(2 * sizeof(int));
> };
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> ---
>
> and the kernel boots fine with the sd_span offset aligned with
> sched_domain struct size:
>
> sched_domain size: 296
> offset of sd_span: 296
>
>
> So Peter, which solution do you prefer?
>
> 1. Doing cpumask_and() after the *sd = { ... } initialization. (or)
>
> 2. Align sd->span to an 8-byte boundary.
>
I vote for option 1, as option 2 relies on how the compiler
interprets sizeof() and the offset of each member within
the structure IMO. Initializing the values after *sd = {} seems
safer and more generic, but the decision is up to Peter : )
thanks,
Chenyu
Hello Chenyu,
On 3/21/2026 6:18 PM, Chen, Yu C wrote:
>> And more evidence - by default we have:
>>
>> sched_domain size: 296
>> offset of sd_span: 292
>>
>> sizeof() seems to account some sort of 4-byte padding for the struct which
>> pushes the offset of sd->span into the struct size.
>>
>
> In your disassembly for *sd = {...}
>
> mov r2, #296
> mov r0, fp
> mov r1, #0
> ...
> bl memset <-- oops!
Ah! I was not able to see this correctly on Saturday. Thank you for
pointing it out.
--
Thanks and Regards,
Prateek
Nathan reported a kernel panic on his ARM builds after commit
8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
partitions") which was root caused to the compiler zeroing out the first
few bytes of sd->span.
During the debug [1], it was discovered that, on some configs,
offsetof(struct sched_domain, span) at 292 was less than
sizeof(struct sched_domain) at 296 resulting in:
*sd = { ... }
assignment clearing out first 4 bytes of sd->span which was initialized
before.
The official GCC specification for "Arrays of Length Zero" [2] says:
Although the size of a zero-length array is zero, an array member of
this kind may increase the size of the enclosing type as a result of
tail padding.
which means the relative offset of the variable-length array at the end
of the struct can indeed be less than sizeof() the struct as a result of
tail padding, so whole-struct initialization overwrites the part of the
flexible array that overlaps the padding.
Partially revert commit 8e8e23dea43e ("sched/topology: Compute sd_weight
considering cpuset partitions") to initialize sd_span after the fixed
members of sd.
Use
cpumask_weight_and(cpu_map, tl->mask(tl, cpu))
to calculate span_weight before initializing the sd_span.
cpumask_weight_and() is of the same complexity as cpumask_and(), so the
additional overhead is negligible.
While at it, also initialize sd->span_weight in sd_init() since
sd_weight now captures the cpu_map constraints. Fix up
sd->span_weight whenever sd_span is fixed up by the generic topology
layer.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20260320235824.GA1176840@ax162/
Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
Link: https://lore.kernel.org/all/a8c125fd-960d-4b35-b640-95a33584eb08@amd.com/ [1]
Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Nathan, can you please check if this fixes the issue you are observing -
it at least fixed one that I'm observing ;-)
Peter, if you would like to keep revert and enhancements separate, let
me know and I'll spin a v2.
---
kernel/sched/topology.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 43150591914b..721ed9b883b8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1669,17 +1669,13 @@ sd_init(struct sched_domain_topology_level *tl,
struct cpumask *sd_span;
u64 now = sched_clock();
- sd_span = sched_domain_span(sd);
- cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
- sd_weight = cpumask_weight(sd_span);
- sd_id = cpumask_first(sd_span);
+ sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));
if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
"wrong sd_flags in topology description\n"))
sd_flags &= TOPOLOGY_SD_FLAGS;
- sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
*sd = (struct sched_domain){
.min_interval = sd_weight,
@@ -1715,8 +1711,15 @@ sd_init(struct sched_domain_topology_level *tl,
.last_decay_max_lb_cost = jiffies,
.child = child,
.name = tl->name,
+ .span_weight = sd_weight,
};
+ sd_span = sched_domain_span(sd);
+ cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
+ sd_id = cpumask_first(sd_span);
+
+ sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
+
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
(SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
"CPU capacity asymmetry not supported on SMT\n");
@@ -2518,6 +2521,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
cpumask_or(sched_domain_span(sd),
sched_domain_span(sd),
sched_domain_span(child));
+
+ sd->span_weight = cpumask_weight(sched_domain_span(sd));
}
}
@@ -2697,7 +2702,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Build the groups for the domains */
for_each_cpu(i, cpu_map) {
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- sd->span_weight = cpumask_weight(sched_domain_span(sd));
if (sd->flags & SD_NUMA) {
if (build_overlap_sched_groups(sd, i))
goto error;
base-commit: fe7171d0d5dfbe189e41db99580ebacafc3c09ce
--
2.34.1
On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
> Nathan reported a kernel panic on his ARM builds after commit
> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
> partitions") which was root caused to the compiler zeroing out the first
> few bytes of sd->span.
>
> During the debug [1], it was discovered that, on some configs,
> offsetof(struct sched_domain, span) at 292 was less than
> sizeof(struct sched_domain) at 296 resulting in:
>
> *sd = { ... }
>
> assignment clearing out first 4 bytes of sd->span which was initialized
> before.
>
> The official GCC specification for "Arrays of Length Zero" [2] says:
>
> Although the size of a zero-length array is zero, an array member of
> this kind may increase the size of the enclosing type as a result of
> tail padding.
>
> which means the relative offset of the variable length array at the end
> of the sturct can indeed be less than sizeof() the struct as a result of
> tail padding thus overwriting that data of the flexible array that
> overlapped with the padding whenever the struct is initialized as whole.
WTF! that's terrible :(
Why is this allowed, this makes no bloody sense :/
However the way we allocate space for flex arrays is: sizeof(*obj) +
count * sizeof(*obj->member); this means that we do have sufficient
space, irrespective of this extra padding.
Does this work?
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 51c29581f15e..defa86ed9b06 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -153,7 +153,21 @@ struct sched_domain {
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
- return to_cpumask(sd->span);
+ /*
+ * Because C is an absolutely broken piece of shit, it is allowed for
+ * offsetof(*sd, span) < sizeof(*sd), this means that structure
+ * initialzation *sd = { ... }; which will clear every unmentioned
+ * member, can over-write the start of the flexible array member.
+ *
+ * Luckily, the way we allocate the flexible array is by:
+ *
+ * sizeof(*sd) + count * sizeof(*sd->span)
+ *
+ * this means that we have sufficient space for the whole flex array
+ * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
+ */
+ unsigned long *bitmap = (void *)sd + sizeof(*sd);
+ return to_cpumask(bitmap);
}
extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
On Mon, Mar 23, 2026 at 10:36:27AM +0100, Peter Zijlstra wrote:
> Does this work?
Yes, that avoids the initial panic I reported.
Tested-by: Nathan Chancellor <nathan@kernel.org>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> {
> - return to_cpumask(sd->span);
> + /*
> + * Because C is an absolutely broken piece of shit, it is allowed for
> + * offsetof(*sd, span) < sizeof(*sd), this means that structure
> + * initialzation *sd = { ... }; which will clear every unmentioned
> + * member, can over-write the start of the flexible array member.
> + *
> + * Luckily, the way we allocate the flexible array is by:
> + *
> + * sizeof(*sd) + count * sizeof(*sd->span)
> + *
> + * this means that we have sufficient space for the whole flex array
> + * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> + */
> + unsigned long *bitmap = (void *)sd + sizeof(*sd);
> + return to_cpumask(bitmap);
> }
>
> extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
Hello Peter,
On 3/23/2026 3:06 PM, Peter Zijlstra wrote:
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
>
> Does this work?
Solves the panic on the setup shared by Nathan, and KASAN hasn't noted
anything in my baremetal testing, so feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
On 3/23/2026 5:36 PM, Peter Zijlstra wrote:
> On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
>> Nathan reported a kernel panic on his ARM builds after commit
>> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
>> partitions") which was root caused to the compiler zeroing out the first
>> few bytes of sd->span.
>>
>> During the debug [1], it was discovered that, on some configs,
>> offsetof(struct sched_domain, span) at 292 was less than
>> sizeof(struct sched_domain) at 296 resulting in:
>>
>> *sd = { ... }
>>
>> assignment clearing out first 4 bytes of sd->span which was initialized
>> before.
>>
>> The official GCC specification for "Arrays of Length Zero" [2] says:
>>
>> Although the size of a zero-length array is zero, an array member of
>> this kind may increase the size of the enclosing type as a result of
>> tail padding.
>>
>> which means the relative offset of the variable length array at the end
>> of the sturct can indeed be less than sizeof() the struct as a result of
>> tail padding thus overwriting that data of the flexible array that
>> overlapped with the padding whenever the struct is initialized as whole.
>
> WTF! that's terrible :(
>
> Why is this allowed, this makes no bloody sense :/
>
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
>
>
> Does this work?
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> {
> - return to_cpumask(sd->span);
> + /*
> + * Because C is an absolutely broken piece of shit, it is allowed for
> + * offsetof(*sd, span) < sizeof(*sd), this means that structure
> + * initialzation *sd = { ... }; which will clear every unmentioned
> + * member, can over-write the start of the flexible array member.
> + *
> + * Luckily, the way we allocate the flexible array is by:
> + *
> + * sizeof(*sd) + count * sizeof(*sd->span)
> + *
> + * this means that we have sufficient space for the whole flex array
> + * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> + */
> + unsigned long *bitmap = (void *)sd + sizeof(*sd);
> + return to_cpumask(bitmap);
> }
>
> extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
While I still wonder if it is risky to initialize the structure members
before *sd = { ... }, this patch could keep the current sd_init()
unchanged.
According to the tests on GNR, it works as expected with no regressions
noticed on top of sched/core commit 349edbba1125 ("sched/fair: Simplify
SIS_UTIL handling in select_idle_cpu()"),
Tested-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
Hi Peter,
On 23/03/2026 09:36, Peter Zijlstra wrote:
> On Sat, Mar 21, 2026 at 04:38:52PM +0000, K Prateek Nayak wrote:
>> Nathan reported a kernel panic on his ARM builds after commit
>> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
>> partitions") which was root caused to the compiler zeroing out the first
>> few bytes of sd->span.
>>
>> During the debug [1], it was discovered that, on some configs,
>> offsetof(struct sched_domain, span) at 292 was less than
>> sizeof(struct sched_domain) at 296 resulting in:
>>
>> *sd = { ... }
>>
>> assignment clearing out first 4 bytes of sd->span which was initialized
>> before.
>>
>> The official GCC specification for "Arrays of Length Zero" [2] says:
>>
>> Although the size of a zero-length array is zero, an array member of
>> this kind may increase the size of the enclosing type as a result of
>> tail padding.
>>
>> which means the relative offset of the variable length array at the end
>> of the sturct can indeed be less than sizeof() the struct as a result of
>> tail padding thus overwriting that data of the flexible array that
>> overlapped with the padding whenever the struct is initialized as whole.
>
> WTF! that's terrible :(
>
> Why is this allowed, this makes no bloody sense :/
>
> However the way we allocate space for flex arrays is: sizeof(*obj) +
> count * sizeof(*obj->member); this means that we do have sufficient
> space, irrespective of this extra padding.
>
>
> Does this work?
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 51c29581f15e..defa86ed9b06 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -153,7 +153,21 @@ struct sched_domain {
>
> static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
> {
> - return to_cpumask(sd->span);
> + /*
> + * Because C is an absolutely broken piece of shit, it is allowed for
> + * offsetof(*sd, span) < sizeof(*sd), this means that structure
> + * initialzation *sd = { ... }; which will clear every unmentioned
> + * member, can over-write the start of the flexible array member.
> + *
> + * Luckily, the way we allocate the flexible array is by:
> + *
> + * sizeof(*sd) + count * sizeof(*sd->span)
> + *
> + * this means that we have sufficient space for the whole flex array
> + * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
> + */
> + unsigned long *bitmap = (void *)sd + sizeof(*sd);
> + return to_cpumask(bitmap);
> }
>
> extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
I noticed the same issue that Nathan reported on 32-bit Tegra and the
above does fix it for me.
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Thanks!
Jon
--
nvpublic
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e379dce8af11d8d6040b4348316a499bfd174bfb
Gitweb: https://git.kernel.org/tip/e379dce8af11d8d6040b4348316a499bfd174bfb
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 23 Mar 2026 10:36:27 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Mar 2026 10:07:04 +01:00
sched/topology: Fix sched_domain_span()
Commit 8e8e23dea43e ("sched/topology: Compute sd_weight considering
cpuset partitions") ends up relying on the fact that structure
initialization should not touch the flexible array.
However, the official GCC specification for "Arrays of Length Zero"
[*] says:
Although the size of a zero-length array is zero, an array member of
this kind may increase the size of the enclosing type as a result of
tail padding.
Additionally, structure initialization will zero tail padding. With
the end result that since offsetof(*type, member) < sizeof(*type),
array initialization will clobber the flex array.
Luckily, the way flexible array sizes are calculated is:
sizeof(*type) + count * sizeof(*type->member)
This means we have the complete size of the flex array *outside* of
sizeof(*type), so use that instead of relying on the broken flex array
definition.
[*] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260323093627.GY3738010@noisy.programming.kicks-ass.net
---
include/linux/sched/topology.h | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 51c2958..36553e1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -142,18 +142,30 @@ struct sched_domain {
unsigned int span_weight;
/*
- * Span of all CPUs in this domain.
+ * See sched_domain_span(), on why flex arrays are broken.
*
- * NOTE: this field is variable length. (Allocated dynamically
- * by attaching extra space to the end of the structure,
- * depending on how many CPUs the kernel has booted up with)
- */
unsigned long span[];
+ */
};
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
- return to_cpumask(sd->span);
+ /*
+ * Turns out that C flexible arrays are fundamentally broken since it
+ * is allowed for offsetof(*sd, span) < sizeof(*sd), this means that
+ * structure initialzation *sd = { ... }; which writes every byte
+ * inside sizeof(*type), will over-write the start of the flexible
+ * array.
+ *
+ * Luckily, the way we allocate sched_domain is by:
+ *
+ * sizeof(*sd) + cpumask_size()
+ *
+ * this means that we have sufficient space for the whole flex array
+ * *outside* of sizeof(*sd). So use that, and avoid using sd->span.
+ */
+ unsigned long *bitmap = (void *)sd + sizeof(*sd);
+ return to_cpumask(bitmap);
}
extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
On 3/21/26 10:08 PM, K Prateek Nayak wrote:
> Nathan reported a kernel panic on his ARM builds after commit
> 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset
> partitions") which was root caused to the compiler zeroing out the first
> few bytes of sd->span.
>
> During the debug [1], it was discovered that, on some configs,
> offsetof(struct sched_domain, span) at 292 was less than
> sizeof(struct sched_domain) at 296 resulting in:
>
> *sd = { ... }
>
> assignment clearing out first 4 bytes of sd->span which was initialized
> before.
>
> The official GCC specification for "Arrays of Length Zero" [2] says:
>
> Although the size of a zero-length array is zero, an array member of
> this kind may increase the size of the enclosing type as a result of
> tail padding.
>
> which means the relative offset of the variable length array at the end
> of the sturct can indeed be less than sizeof() the struct as a result of
> tail padding thus overwriting that data of the flexible array that
> overlapped with the padding whenever the struct is initialized as whole.
>
> Partially revert commit 8e8e23dea43e ("sched/topology: Compute sd_weight
> considering cpuset partitions") to initialize sd_span after the fixed
> memebers of sd.
>
> Use
>
> cpumask_weight_and(cpu_map, tl->mask(tl, cpu))
>
> to calculate span_weight before initializing the sd_span.
> cpumask_and_weight() is of same complexity as cpumask_and() and the
> additional overhead is negligible.
>
> While at it, also initialize sd->span_weight in sd_init() since
> sd_weight now captures the cpu_map constraints. Fixup the
> sd->span_weight whenever sd_span is fixed up by the generic topology
> layer.
>
This description is a bit confusing. The fixup happens naturally since
cpu_map now reflects the changes, right?
Maybe mention the removal in build_sched_domains instead?
> Reported-by: Nathan Chancellor <nathan@kernel.org>
> Closes: https://lore.kernel.org/all/20260320235824.GA1176840@ax162/
> Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions")
> Link: https://lore.kernel.org/all/a8c125fd-960d-4b35-b640-95a33584eb08@amd.com/ [1]
> Link: https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Nathan, can you please check if this fixes the issue you are observing -
> it at least fixed one that I'm observing ;-)
>
> Peter, if you would like to keep revert and enhancements separate, let
> me know and I'll spin a v2.
> ---
> kernel/sched/topology.c | 16 ++++++++++------
> 1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 43150591914b..721ed9b883b8 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1669,17 +1669,13 @@ sd_init(struct sched_domain_topology_level *tl,
> struct cpumask *sd_span;
> u64 now = sched_clock();
>
> - sd_span = sched_domain_span(sd);
> - cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> - sd_weight = cpumask_weight(sd_span);
> - sd_id = cpumask_first(sd_span);
> + sd_weight = cpumask_weight_and(cpu_map, tl->mask(tl, cpu));
>
> if (tl->sd_flags)
> sd_flags = (*tl->sd_flags)();
> if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
> "wrong sd_flags in topology description\n"))
> sd_flags &= TOPOLOGY_SD_FLAGS;
> - sd_flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
>
> *sd = (struct sched_domain){
> .min_interval = sd_weight,
> @@ -1715,8 +1711,15 @@ sd_init(struct sched_domain_topology_level *tl,
> .last_decay_max_lb_cost = jiffies,
> .child = child,
> .name = tl->name,
> + .span_weight = sd_weight,
> };
>
> + sd_span = sched_domain_span(sd);
> + cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
> + sd_id = cpumask_first(sd_span);
> +
> + sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
> +
> WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
> (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
> "CPU capacity asymmetry not supported on SMT\n");
> @@ -2518,6 +2521,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
> cpumask_or(sched_domain_span(sd),
> sched_domain_span(sd),
> sched_domain_span(child));
> +
> + sd->span_weight = cpumask_weight(sched_domain_span(sd));
> }
>
> }
> @@ -2697,7 +2702,6 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> /* Build the groups for the domains */
> for_each_cpu(i, cpu_map) {
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - sd->span_weight = cpumask_weight(sched_domain_span(sd));
> if (sd->flags & SD_NUMA) {
> if (build_overlap_sched_groups(sd, i))
> goto error;
>
> base-commit: fe7171d0d5dfbe189e41db99580ebacafc3c09ce
Other than nits in changelog:
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
PS: b4 am -Q was quite confused which patch to pick for 0001,
maybe since it was a reply to the thread. Not sure. So I pulled
each patch separately and applied.
Hello Shrikanth,
On 3/23/2026 2:38 PM, Shrikanth Hegde wrote:
>> While at it, also initialize sd->span_weight in sd_init() since
>> sd_weight now captures the cpu_map constraints. Fixup the
>> sd->span_weight whenever sd_span is fixed up by the generic topology
>> layer.
>>
>
> This description is a bit confusing. Fixup happens naturally since
> cpu_map now reflects the changes right?
That was for the hunk in build_sched_domain() where the sd span is
fixed up if it is found that the child isn't a subset of the parent, in
which case span_weight needs to be calculated again after the
cpumask_or().
[..snip..]
> Other than nits in changelog:
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Thanks for the review, but Peter has found an alternate approach to
work around this with the current flow of computing the span first.
> PS: b4 am -Q was quite confused which patch to pick for 0001.
> may since it was a reply to the thread. Not sure. So i pulled
> each patch separate and applied.
Sorry for the inconvenience. For a single patch it should still be fine
to grab the raw patch, but for larger series I make sure to post out
separately for convenience. Will be mindful next time.
--
Thanks and Regards,
Prateek