[PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by K Prateek Nayak 1 month, 1 week ago
This version uses Peter's suggestion from [1] as is and incrementally
adds cleanup on top for the arch/ bits. I've tested the x86 side but the
PowerPC and the s390 bits are only build tested. Review and feedback are
greatly appreciated.

[1] https://lore.kernel.org/lkml/20250825091910.GT3245006@noisy.programming.kicks-ass.net/

Patches are prepared on top of tip:master at commit 4628e5bbca91 ("Merge
branch into tip/master: 'x86/tdx'")
---
changelog v6..v7:

o Fix the s390 and ppc build errors (Intel test robot)

o Use Peter's diff as is and incrementally do the cleanup on top. The
  PowerPC part was slightly more extensive due to the lack of
  CONFIG_SCHED_MC in arch/powerpc/Kconfig.

v6: https://lore.kernel.org/lkml/20250825120244.11093-1-kprateek.nayak@amd.com/
---
K Prateek Nayak (7):
  powerpc/smp: Rename cpu_corgroup_* to cpu_corgrp_*
  powerpc/smp: Export cpu_coregroup_mask()
  powerpc/smp: Introduce CONFIG_SCHED_MC to guard MC scheduling bits
  sched/topology: Unify tl_smt_mask() across core and all arch
  sched/topology: Unify tl_cls_mask() across core and x86
  sched/topology: Unify tl_mc_mask() across core and all arch
  sched/topology: Unify tl_pkg_mask() across core and all arch

Peter Zijlstra (1):
  sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()

 arch/powerpc/Kconfig           |  9 ++++++
 arch/powerpc/include/asm/smp.h |  4 +++
 arch/powerpc/kernel/smp.c      | 51 +++++++++++++++++++---------------
 arch/s390/kernel/topology.c    | 16 ++++-------
 arch/x86/kernel/smpboot.c      |  9 +++---
 include/linux/sched/topology.h | 34 ++++++++++++++++++++---
 include/linux/topology.h       |  2 +-
 kernel/sched/topology.c        | 28 +++++++------------
 8 files changed, 93 insertions(+), 60 deletions(-)


base-commit: 4628e5bbca916edaf4ed55915ab399f9ba25519f
-- 
2.34.1
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Shrikanth Hegde 1 month, 1 week ago

On 8/26/25 9:43 AM, K Prateek Nayak wrote:
> This version uses Peter's suggestion from [1] as is and incrementally
> adds cleanup on top for the arch/ bits. I've tested the x86 side but the
> PowerPC and the s390 bits are only build tested. Review and feedback are
> greatly appreciated.
> 
> [1] https://lore.kernel.org/lkml/20250825091910.GT3245006@noisy.programming.kicks-ass.net/
> 
> Patches are prepared on top of tip:master at commit 4628e5bbca91 ("Merge
> branch into tip/master: 'x86/tdx'")
> ---
> changelog v6..v7:
> 
> o Fix the s390 and ppc build errors (Intel test robot)
> 
> o Use Peter's diff as is and incrementally do the cleanup on top. The
>    PowerPC part was slightly more extensive due to the lack of
>    CONFIG_SCHED_MC in arch/powerpc/Kconfig.
> 
> v6: https://lore.kernel.org/lkml/20250825120244.11093-1-kprateek.nayak@amd.com/
> ---
> K Prateek Nayak (7):
>    powerpc/smp: Rename cpu_corgroup_* to cpu_corgrp_*
>    powerpc/smp: Export cpu_coregroup_mask()
>    powerpc/smp: Introduce CONFIG_SCHED_MC to guard MC scheduling bits
>    sched/topology: Unify tl_smt_mask() across core and all arch
>    sched/topology: Unify tl_cls_mask() across core and x86
>    sched/topology: Unify tl_mc_mask() across core and all arch
>    sched/topology: Unify tl_pkg_mask() across core and all arch
> 
> Peter Zijlstra (1):
>    sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
> 
Can the names be standardized to begin with tl_ ?

arch/powerpc/kernel/smp.c:			SDTL_INIT(smallcore_smt_mask, powerpc_smt_flags, SMT);
arch/powerpc/kernel/smp.c:			SDTL_INIT(shared_cache_mask, powerpc_shared_cache_flags, CACHE);
arch/s390/kernel/topology.c:	SDTL_INIT(cpu_book_mask, NULL, BOOK),
arch/s390/kernel/topology.c:	SDTL_INIT(cpu_drawer_mask, NULL, DRAWER),
kernel/sched/topology.c:	tl[i++] = SDTL_INIT(sd_numa_mask, NULL, NODE);
kernel/sched/topology.c:		tl[i] = SDTL_INIT(sd_numa_mask, cpu_numa_flags, NUMA);
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Peter Zijlstra 1 month, 1 week ago
On Tue, Aug 26, 2025 at 03:35:03PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 8/26/25 9:43 AM, K Prateek Nayak wrote:
> > This version uses Peter's suggestion from [1] as is and incrementally
> > adds cleanup on top for the arch/ bits. I've tested the x86 side but the
> > PowerPC and the s390 bits are only build tested. Review and feedback are
> > greatly appreciated.
> > 
> > [1] https://lore.kernel.org/lkml/20250825091910.GT3245006@noisy.programming.kicks-ass.net/
> > 
> > Patches are prepared on top of tip:master at commit 4628e5bbca91 ("Merge
> > branch into tip/master: 'x86/tdx'")
> > ---
> > changelog v6..v7:
> > 
> > o Fix the s390 and ppc build errors (Intel test robot)
> > 
> > o Use Peter's diff as is and incrementally do the cleanup on top. The
> >    PowerPC part was slightly more extensive due to the lack of
> >    CONFIG_SCHED_MC in arch/powerpc/Kconfig.
> > 
> > v6: https://lore.kernel.org/lkml/20250825120244.11093-1-kprateek.nayak@amd.com/
> > ---
> > K Prateek Nayak (7):
> >    powerpc/smp: Rename cpu_corgroup_* to cpu_corgrp_*
> >    powerpc/smp: Export cpu_coregroup_mask()
> >    powerpc/smp: Introduce CONFIG_SCHED_MC to guard MC scheduling bits
> >    sched/topology: Unify tl_smt_mask() across core and all arch
> >    sched/topology: Unify tl_cls_mask() across core and x86
> >    sched/topology: Unify tl_mc_mask() across core and all arch
> >    sched/topology: Unify tl_pkg_mask() across core and all arch
> > 
> > Peter Zijlstra (1):
> >    sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
> > 
> Can the names be standardized to begin with tl_ ?
> 
> arch/powerpc/kernel/smp.c:			SDTL_INIT(smallcore_smt_mask, powerpc_smt_flags, SMT);
> arch/powerpc/kernel/smp.c:			SDTL_INIT(shared_cache_mask, powerpc_shared_cache_flags, CACHE);
> arch/s390/kernel/topology.c:	SDTL_INIT(cpu_book_mask, NULL, BOOK),
> arch/s390/kernel/topology.c:	SDTL_INIT(cpu_drawer_mask, NULL, DRAWER),
> kernel/sched/topology.c:	tl[i++] = SDTL_INIT(sd_numa_mask, NULL, NODE);
> kernel/sched/topology.c:		tl[i] = SDTL_INIT(sd_numa_mask, cpu_numa_flags, NUMA);

Already done :-) My version looks like the below.

I picked up v6 yesterday and this morning started poking at the robot
fail reported before seeing this v7 thing.

Current pile lives here for the robots:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

I thought about doing a /cpu_*_flags/tl_*_flags/ patch as well, but
figured this was enough for now.
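
Such a rename would be purely mechanical, e.g. (illustrative sketch, not
an actual patch; tl_smt_flags() is a hypothetical name following the tl_
convention):

	-static inline int cpu_smt_flags(void)
	+static inline int tl_smt_flags(void)
	 {
	 	return SD_SHARE_CPUCAPACITY | SD_SHARE_LLC;
	 }

with the matching SDTL_INIT() sites updated to match:

	-	SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
	+	SDTL_INIT(tl_smt_mask, tl_smt_flags, SMT),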

---
Subject: sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 25 Aug 2025 12:02:44 +0000

Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
their testing starting from v6.16-rc1. The debugging that followed
pointed to the tl->mask() for the NODE domain being incorrectly
resolved to that of the highest NUMA domain.

tl->mask() for NODE is set to sd_numa_mask(), which depends on the
global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
set to "tl->numa_level" during tl traversal in build_sched_domains()
calling sd_init(), but was not reset before topology_span_sane().

Since "sched_domains_curr_level" still reflected the old value from
build_sched_domains(), topology_span_sane() for the NODE domain trips
when the span of the last NUMA domain overlaps.

Instead of replicating the "sched_domains_curr_level" hack, get rid of
it entirely and instead pass the entire "sched_domain_topology_level"
object to the tl->cpumask() function to prevent such mishaps in the
future.

sd_numa_mask() now directly references "tl->numa_level" instead of
relying on the global "sched_domains_curr_level" hack to index into
sched_domains_numa_masks[].
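
The crux of the change, condensed from the full diff below:

    /* Before: the mask lookup depends on a global set as a side effect */
    static int sched_domains_curr_level;

    static const struct cpumask *sd_numa_mask(int cpu)
    {
            return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
    }

    /* After: the topology level is passed in explicitly, so a stale
     * global can no longer skew the lookup.
     */
    static const struct cpumask *
    sd_numa_mask(struct sched_domain_topology_level *tl, int cpu)
    {
            return sched_domains_numa_masks[tl->numa_level][cpu_to_node(cpu)];
    }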

The original warning was reproducible on the following NUMA topology
reported by Leon:

    $ sudo numactl -H
    available: 5 nodes (0-4)
    node 0 cpus: 0 1
    node 0 size: 2927 MB
    node 0 free: 1603 MB
    node 1 cpus: 2 3
    node 1 size: 3023 MB
    node 1 free: 3008 MB
    node 2 cpus: 4 5
    node 2 size: 3023 MB
    node 2 free: 3007 MB
    node 3 cpus: 6 7
    node 3 size: 3023 MB
    node 3 free: 3002 MB
    node 4 cpus: 8 9
    node 4 size: 3022 MB
    node 4 free: 2718 MB
    node distances:
    node   0   1   2   3   4
      0:  10  39  38  37  36
      1:  39  10  38  37  36
      2:  38  38  10  37  36
      3:  37  37  37  10  36
      4:  36  36  36  36  10

The above topology can be mimicked using the following QEMU cmd that was
used to reproduce the warning and test the fix:

     sudo qemu-system-x86_64 -enable-kvm -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...

  [ prateek: Moved common functions to include/linux/sched/topology.h,
    reuse the common bits for s390 and ppc, commit message ]

Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
---
 arch/powerpc/Kconfig                |    4 ++++
 arch/powerpc/include/asm/topology.h |    2 ++
 arch/powerpc/kernel/smp.c           |   27 +++++++++++----------------
 arch/s390/kernel/topology.c         |   20 +++++++-------------
 arch/x86/kernel/smpboot.c           |    8 ++++----
 include/linux/sched/topology.h      |   28 +++++++++++++++++++++++++++-
 include/linux/topology.h            |    2 +-
 kernel/sched/topology.c             |   28 ++++++++++------------------
 8 files changed, 66 insertions(+), 53 deletions(-)

--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -971,6 +971,10 @@ config SCHED_SMT
 	  when dealing with POWER5 cpus at a cost of slightly increased
 	  overhead in some places. If unsure say N here.
 
+config SCHED_MC
+	def_bool y
+	depends on PPC64 && SMP
+
 config PPC_DENORMALISATION
 	bool "PowerPC denormalisation exception handling"
 	depends on PPC_BOOK3S_64
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -134,6 +134,8 @@ static inline int cpu_to_coregroup_id(in
 #ifdef CONFIG_PPC64
 #include <asm/smp.h>
 
+struct cpumask *cpu_coregroup_mask(int cpu);
+
 #define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
 
 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1028,19 +1028,19 @@ static int powerpc_shared_proc_flags(voi
  * We can't just pass cpu_l2_cache_mask() directly because it
  * returns a non-const pointer and the compiler barfs on that.
  */
-static const struct cpumask *shared_cache_mask(int cpu)
+static const struct cpumask *tl_cache_mask(struct sched_domain_topology_level *tl, int cpu)
 {
 	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
-static const struct cpumask *smallcore_smt_mask(int cpu)
+static const struct cpumask *tl_smallcore_smt_mask(struct sched_domain_topology_level *tl, int cpu)
 {
 	return cpu_smallcore_mask(cpu);
 }
 #endif
 
-static struct cpumask *cpu_coregroup_mask(int cpu)
+struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	return per_cpu(cpu_coregroup_map, cpu);
 }
@@ -1054,11 +1054,6 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
-static const struct cpumask *cpu_mc_mask(int cpu)
-{
-	return cpu_coregroup_mask(cpu);
-}
-
 static int __init init_big_cores(void)
 {
 	int cpu;
@@ -1448,7 +1443,7 @@ static bool update_mask_by_l2(int cpu, c
 		return false;
 	}
 
-	cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu));
+	cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
 
 	/* Update l2-cache mask with all the CPUs that are part of submask */
 	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
@@ -1538,7 +1533,7 @@ static void update_coregroup_mask(int cp
 		return;
 	}
 
-	cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu));
+	cpumask_and(*mask, cpu_online_mask, cpu_node_mask(cpu));
 
 	/* Update coregroup mask with all the CPUs that are part of submask */
 	or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
@@ -1601,7 +1596,7 @@ static void add_cpu_to_masks(int cpu)
 
 	/* If chip_id is -1; limit the cpu_core_mask to within PKG */
 	if (chip_id == -1)
-		cpumask_and(mask, mask, cpu_cpu_mask(cpu));
+		cpumask_and(mask, mask, cpu_node_mask(cpu));
 
 	for_each_cpu(i, mask) {
 		if (chip_id == cpu_to_chip_id(i)) {
@@ -1701,22 +1696,22 @@ static void __init build_sched_topology(
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
 		powerpc_topology[i++] =
-			SDTL_INIT(smallcore_smt_mask, powerpc_smt_flags, SMT);
+			SDTL_INIT(tl_smallcore_smt_mask, powerpc_smt_flags, SMT);
 	} else {
-		powerpc_topology[i++] = SDTL_INIT(cpu_smt_mask, powerpc_smt_flags, SMT);
+		powerpc_topology[i++] = SDTL_INIT(tl_smt_mask, powerpc_smt_flags, SMT);
 	}
 #endif
 	if (shared_caches) {
 		powerpc_topology[i++] =
-			SDTL_INIT(shared_cache_mask, powerpc_shared_cache_flags, CACHE);
+			SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
 	}
 
 	if (has_coregroup_support()) {
 		powerpc_topology[i++] =
-			SDTL_INIT(cpu_mc_mask, powerpc_shared_proc_flags, MC);
+			SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
 	}
 
-	powerpc_topology[i++] = SDTL_INIT(cpu_cpu_mask, powerpc_shared_proc_flags, PKG);
+	powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);
 
 	/* There must be one trailing NULL entry left.  */
 	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -509,33 +509,27 @@ int topology_cpu_init(struct cpu *cpu)
 	return rc;
 }
 
-static const struct cpumask *cpu_thread_mask(int cpu)
-{
-	return &cpu_topology[cpu].thread_mask;
-}
-
-
 const struct cpumask *cpu_coregroup_mask(int cpu)
 {
 	return &cpu_topology[cpu].core_mask;
 }
 
-static const struct cpumask *cpu_book_mask(int cpu)
+static const struct cpumask *tl_book_mask(struct sched_domain_topology_level *tl, int cpu)
 {
 	return &cpu_topology[cpu].book_mask;
 }
 
-static const struct cpumask *cpu_drawer_mask(int cpu)
+static const struct cpumask *tl_drawer_mask(struct sched_domain_topology_level *tl, int cpu)
 {
 	return &cpu_topology[cpu].drawer_mask;
 }
 
 static struct sched_domain_topology_level s390_topology[] = {
-	SDTL_INIT(cpu_thread_mask, cpu_smt_flags, SMT),
-	SDTL_INIT(cpu_coregroup_mask, cpu_core_flags, MC),
-	SDTL_INIT(cpu_book_mask, NULL, BOOK),
-	SDTL_INIT(cpu_drawer_mask, NULL, DRAWER),
-	SDTL_INIT(cpu_cpu_mask, NULL, PKG),
+	SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
+	SDTL_INIT(tl_mc_mask, cpu_core_flags, MC),
+	SDTL_INIT(tl_book_mask, NULL, BOOK),
+	SDTL_INIT(tl_drawer_mask, NULL, DRAWER),
+	SDTL_INIT(tl_pkg_mask, NULL, PKG),
 	{ NULL, },
 };
 
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -479,14 +479,14 @@ static int x86_cluster_flags(void)
 static bool x86_has_numa_in_package;
 
 static struct sched_domain_topology_level x86_topology[] = {
-	SDTL_INIT(cpu_smt_mask, cpu_smt_flags, SMT),
+	SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
 #ifdef CONFIG_SCHED_CLUSTER
-	SDTL_INIT(cpu_clustergroup_mask, x86_cluster_flags, CLS),
+	SDTL_INIT(tl_cls_mask, x86_cluster_flags, CLS),
 #endif
 #ifdef CONFIG_SCHED_MC
-	SDTL_INIT(cpu_coregroup_mask, x86_core_flags, MC),
+	SDTL_INIT(tl_mc_mask, x86_core_flags, MC),
 #endif
-	SDTL_INIT(cpu_cpu_mask, x86_sched_itmt_flags, PKG),
+	SDTL_INIT(tl_pkg_mask, x86_sched_itmt_flags, PKG),
 	{ NULL },
 };
 
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -30,11 +30,19 @@ struct sd_flag_debug {
 };
 extern const struct sd_flag_debug sd_flag_debug[];
 
+struct sched_domain_topology_level;
+
 #ifdef CONFIG_SCHED_SMT
 static inline int cpu_smt_flags(void)
 {
 	return SD_SHARE_CPUCAPACITY | SD_SHARE_LLC;
 }
+
+static inline const
+struct cpumask *tl_smt_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+	return cpu_smt_mask(cpu);
+}
 #endif
 
 #ifdef CONFIG_SCHED_CLUSTER
@@ -42,6 +50,12 @@ static inline int cpu_cluster_flags(void
 {
 	return SD_CLUSTER | SD_SHARE_LLC;
 }
+
+static inline const
+struct cpumask *tl_cls_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+	return cpu_clustergroup_mask(cpu);
+}
 #endif
 
 #ifdef CONFIG_SCHED_MC
@@ -49,8 +63,20 @@ static inline int cpu_core_flags(void)
 {
 	return SD_SHARE_LLC;
 }
+
+static inline const
+struct cpumask *tl_mc_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+	return cpu_coregroup_mask(cpu);
+}
 #endif
 
+static inline const
+struct cpumask *tl_pkg_mask(struct sched_domain_topology_level *tl, int cpu)
+{
+	return cpu_node_mask(cpu);
+}
+
 #ifdef CONFIG_NUMA
 static inline int cpu_numa_flags(void)
 {
@@ -172,7 +198,7 @@ bool cpus_equal_capacity(int this_cpu, i
 bool cpus_share_cache(int this_cpu, int that_cpu);
 bool cpus_share_resources(int this_cpu, int that_cpu);
 
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+typedef const struct cpumask *(*sched_domain_mask_f)(struct sched_domain_topology_level *tl, int cpu);
 typedef int (*sched_domain_flags_f)(void);
 
 struct sd_data {
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -260,7 +260,7 @@ static inline bool topology_is_primary_t
 
 #endif
 
-static inline const struct cpumask *cpu_cpu_mask(int cpu)
+static inline const struct cpumask *cpu_node_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
 }
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1591,7 +1591,6 @@ static void claim_allocations(int cpu, s
 enum numa_topology_type sched_numa_topology_type;
 
 static int			sched_domains_numa_levels;
-static int			sched_domains_curr_level;
 
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
@@ -1632,14 +1631,7 @@ sd_init(struct sched_domain_topology_lev
 	int sd_id, sd_weight, sd_flags = 0;
 	struct cpumask *sd_span;
 
-#ifdef CONFIG_NUMA
-	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
-	 */
-	sched_domains_curr_level = tl->numa_level;
-#endif
-
-	sd_weight = cpumask_weight(tl->mask(cpu));
+	sd_weight = cpumask_weight(tl->mask(tl, cpu));
 
 	if (tl->sd_flags)
 		sd_flags = (*tl->sd_flags)();
@@ -1677,7 +1669,7 @@ sd_init(struct sched_domain_topology_lev
 	};
 
 	sd_span = sched_domain_span(sd);
-	cpumask_and(sd_span, cpu_map, tl->mask(cpu));
+	cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
 	sd_id = cpumask_first(sd_span);
 
 	sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
@@ -1737,17 +1729,17 @@ sd_init(struct sched_domain_topology_lev
  */
 static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
-	SDTL_INIT(cpu_smt_mask, cpu_smt_flags, SMT),
+	SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
 #endif
 
 #ifdef CONFIG_SCHED_CLUSTER
-	SDTL_INIT(cpu_clustergroup_mask, cpu_cluster_flags, CLS),
+	SDTL_INIT(tl_cls_mask, cpu_cluster_flags, CLS),
 #endif
 
 #ifdef CONFIG_SCHED_MC
-	SDTL_INIT(cpu_coregroup_mask, cpu_core_flags, MC),
+	SDTL_INIT(tl_mc_mask, cpu_core_flags, MC),
 #endif
-	SDTL_INIT(cpu_cpu_mask, NULL, PKG),
+	SDTL_INIT(tl_pkg_mask, NULL, PKG),
 	{ NULL, },
 };
 
@@ -1769,9 +1761,9 @@ void __init set_sched_topology(struct sc
 
 #ifdef CONFIG_NUMA
 
-static const struct cpumask *sd_numa_mask(int cpu)
+static const struct cpumask *sd_numa_mask(struct sched_domain_topology_level *tl, int cpu)
 {
-	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
+	return sched_domains_numa_masks[tl->numa_level][cpu_to_node(cpu)];
 }
 
 static void sched_numa_warn(const char *str)
@@ -2411,7 +2403,7 @@ static bool topology_span_sane(const str
 		 * breaks the linking done for an earlier span.
 		 */
 		for_each_cpu(cpu, cpu_map) {
-			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
+			const struct cpumask *tl_cpu_mask = tl->mask(tl, cpu);
 			int id;
 
 			/* lowest bit set in this mask is used as a unique id */
@@ -2419,7 +2411,7 @@ static bool topology_span_sane(const str
 
 			if (cpumask_test_cpu(id, id_seen)) {
 				/* First CPU has already been seen, ensure identical spans */
-				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
+				if (!cpumask_equal(tl->mask(tl, id), tl_cpu_mask))
 					return false;
 			} else {
 				/* First CPU hasn't been seen before, ensure it's a completely new span */
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Valentin Schneider 1 month ago
On 26/08/25 12:13, Peter Zijlstra wrote:
> Subject: sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon, 25 Aug 2025 12:02:44 +0000
>
> Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
> their testing starting from v6.16-rc1. The debugging that followed
> pointed to the tl->mask() for the NODE domain being incorrectly
> resolved to that of the highest NUMA domain.
>
> tl->mask() for NODE is set to sd_numa_mask(), which depends on the
> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
> set to "tl->numa_level" during tl traversal in build_sched_domains()
> calling sd_init(), but was not reset before topology_span_sane().
>
> Since "sched_domains_curr_level" still reflected the old value from
> build_sched_domains(), topology_span_sane() for the NODE domain trips
> when the span of the last NUMA domain overlaps.
>
> Instead of replicating the "sched_domains_curr_level" hack, get rid of
> it entirely and instead pass the entire "sched_domain_topology_level"
> object to the tl->cpumask() function to prevent such mishaps in the
> future.
>
> sd_numa_mask() now directly references "tl->numa_level" instead of
> relying on the global "sched_domains_curr_level" hack to index into
> sched_domains_numa_masks[].
>

Eh, of course I see this *after* looking at the v6 patch.

I tested this again for good measure, but given I only test this under
x86 and the changes since v6 are in s390/ppc, I didn't expect to see
much change :-)

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Shrikanth Hegde 1 month ago

On 8/29/25 1:23 PM, Valentin Schneider wrote:
> On 26/08/25 12:13, Peter Zijlstra wrote:
>> Subject: sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
>> From: Peter Zijlstra <peterz@infradead.org>
>> Date: Mon, 25 Aug 2025 12:02:44 +0000
>>
>> Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
>> their testing starting from v6.16-rc1. The debugging that followed
>> pointed to the tl->mask() for the NODE domain being incorrectly
>> resolved to that of the highest NUMA domain.
>>
>> tl->mask() for NODE is set to sd_numa_mask(), which depends on the
>> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
>> set to "tl->numa_level" during tl traversal in build_sched_domains()
>> calling sd_init(), but was not reset before topology_span_sane().
>>
>> Since "sched_domains_curr_level" still reflected the old value from
>> build_sched_domains(), topology_span_sane() for the NODE domain trips
>> when the span of the last NUMA domain overlaps.
>>
>> Instead of replicating the "sched_domains_curr_level" hack, get rid of
>> it entirely and instead pass the entire "sched_domain_topology_level"
>> object to the tl->cpumask() function to prevent such mishaps in the
>> future.
>>
>> sd_numa_mask() now directly references "tl->numa_level" instead of
>> relying on the global "sched_domains_curr_level" hack to index into
>> sched_domains_numa_masks[].
>>
> 
> Eh, of course I see this *after* looking at the v6 patch.
> 
> I tested this again for good measure, but given I only test this under
> x86 and the changes since v6 are in s390/ppc, I didn't expect to see
> much change :-)
> 
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Tested-by: Valentin Schneider <vschneid@redhat.com>
> 

I was looking at: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core

Current code doesn't allow one to enable/disable SCHED_MC on ppc since it is always set in Kconfig.
I used the below patch:

I think since the config is there, it would be good to provide an option to disable it, no?

---

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fc0d1c19f5a1..da5b2f8d3686 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -170,9 +170,8 @@ config PPC
  	select ARCH_STACKWALK
  	select ARCH_SUPPORTS_ATOMIC_RMW
  	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC_BOOK3S || PPC_8xx
-	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
  	select ARCH_SUPPORTS_SCHED_MC		if PPC64 && SMP
-	select SCHED_MC				if ARCH_SUPPORTS_SCHED_MC
+	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
  	select ARCH_USE_BUILTIN_BSWAP
  	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
  	select ARCH_USE_MEMTEST
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 68edb66c2964..458ec5bd859e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1706,10 +1706,12 @@ static void __init build_sched_topology(void)
  			SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
  	}
  
+#ifdef CONFIG_SCHED_MC
  	if (has_coregroup_support()) {
  		powerpc_topology[i++] =
  			SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
  	}
+#endif
  
  	powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Peter Zijlstra 1 month ago
On Fri, Aug 29, 2025 at 02:23:06PM +0530, Shrikanth Hegde wrote:

> I was looking at: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
> 
> Current code doesn't allow one to enable/disable SCHED_MC on ppc since it is always set in Kconfig.
> I used the below patch:
> 
> I think since the config is there, it would be good to provide an option to disable it, no?

So current PPC code has this MC thing unconditional. I've been
preserving that behaviour. If PPC maintainers feel they want this
selectable, I'm happy to include something like the below, but as a
separate patch with a separate changelog that states this explicit
choice.

> ---
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index fc0d1c19f5a1..da5b2f8d3686 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -170,9 +170,8 @@ config PPC
>  	select ARCH_STACKWALK
>  	select ARCH_SUPPORTS_ATOMIC_RMW
>  	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC_BOOK3S || PPC_8xx
> -	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
>  	select ARCH_SUPPORTS_SCHED_MC		if PPC64 && SMP
> -	select SCHED_MC				if ARCH_SUPPORTS_SCHED_MC
> +	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
>  	select ARCH_USE_BUILTIN_BSWAP
>  	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
>  	select ARCH_USE_MEMTEST
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 68edb66c2964..458ec5bd859e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1706,10 +1706,12 @@ static void __init build_sched_topology(void)
>  			SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
>  	}
> +#ifdef CONFIG_SCHED_MC
>  	if (has_coregroup_support()) {
>  		powerpc_topology[i++] =
>  			SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
>  	}
> +#endif
>  	powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);
>
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by Shrikanth Hegde 1 month ago

On 9/1/25 2:28 PM, Peter Zijlstra wrote:
> On Fri, Aug 29, 2025 at 02:23:06PM +0530, Shrikanth Hegde wrote:
> 
>> I was looking at: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
>>
>> Current code doesn't allow one to enable/disable SCHED_MC on ppc since it is always set in Kconfig.
>> I used the below patch:
>>
>> I think since the config is there, it would be good to provide an option to disable it, no?
> 
> So current PPC code has this MC thing unconditional. I've been
> preserving that behaviour. If PPC maintainers feel they want this
> selectable, I'm happy to include something like the below, but as a
> separate patch with a separate changelog that states this explicit
> choice.
> 

Fair enough. Will send it as a separate patch.

>> ---
>>
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index fc0d1c19f5a1..da5b2f8d3686 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -170,9 +170,8 @@ config PPC
>>   	select ARCH_STACKWALK
>>   	select ARCH_SUPPORTS_ATOMIC_RMW
>>   	select ARCH_SUPPORTS_DEBUG_PAGEALLOC	if PPC_BOOK3S || PPC_8xx
>> -	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
>>   	select ARCH_SUPPORTS_SCHED_MC		if PPC64 && SMP
>> -	select SCHED_MC				if ARCH_SUPPORTS_SCHED_MC
>> +	select ARCH_SUPPORTS_SCHED_SMT		if PPC64 && SMP
>>   	select ARCH_USE_BUILTIN_BSWAP
>>   	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
>>   	select ARCH_USE_MEMTEST
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index 68edb66c2964..458ec5bd859e 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -1706,10 +1706,12 @@ static void __init build_sched_topology(void)
>>   			SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
>>   	}
>> +#ifdef CONFIG_SCHED_MC
>>   	if (has_coregroup_support()) {
>>   		powerpc_topology[i++] =
>>   			SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
>>   	}
>> +#endif
>>   	powerpc_topology[i++] = SDTL_INIT(tl_pkg_mask, powerpc_shared_proc_flags, PKG);
>>


If possible, please consider the following tags for the two commits below:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/core&id=496d4cc3d478a662f90cce3a3e3be4af56f78a02
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/core&id=a912f3e2c6d91f7ea7b294c02796b59af4f50078

Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>

for powerpc bits:
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Re: [PATCH v7 0/8] sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()
Posted by K Prateek Nayak 1 month ago
Hello Shrikanth,

On 8/29/2025 2:23 PM, Shrikanth Hegde wrote:
> I was looking at: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
> 
> Current code doesn't allow one to enable/disable SCHED_MC on ppc since it is always set in Kconfig.
> I used the below patch:
> 
> I think since the config is there, it would be good to provide an option to disable it, no?

I think this makes sense.

FWIW, Peter added the "select SCHED_MC" to keep it consistent with the
current behavior.

> 
> ---
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index fc0d1c19f5a1..da5b2f8d3686 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -170,9 +170,8 @@ config PPC
>      select ARCH_STACKWALK
>      select ARCH_SUPPORTS_ATOMIC_RMW
>      select ARCH_SUPPORTS_DEBUG_PAGEALLOC    if PPC_BOOK3S || PPC_8xx
> -    select ARCH_SUPPORTS_SCHED_SMT        if PPC64 && SMP
>      select ARCH_SUPPORTS_SCHED_MC        if PPC64 && SMP
> -    select SCHED_MC                if ARCH_SUPPORTS_SCHED_MC
> +    select ARCH_SUPPORTS_SCHED_SMT        if PPC64 && SMP
>      select ARCH_USE_BUILTIN_BSWAP
>      select ARCH_USE_CMPXCHG_LOCKREF        if PPC64
>      select ARCH_USE_MEMTEST
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 68edb66c2964..458ec5bd859e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1706,10 +1706,12 @@ static void __init build_sched_topology(void)
>              SDTL_INIT(tl_cache_mask, powerpc_shared_cache_flags, CACHE);
>      }
>  
> +#ifdef CONFIG_SCHED_MC
>      if (has_coregroup_support()) {
>          powerpc_topology[i++] =
>              SDTL_INIT(tl_mc_mask, powerpc_shared_proc_flags, MC);
>      }
> +#endif

When I was looking at this, I noticed the whole of the .*coregroup.*
related bits in smp.c could technically go behind CONFIG_SCHED_MC too,
but that is a much larger cleanup and perhaps unnecessary, so this
looks good.

-- 
Thanks and Regards,
Prateek