Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".
sd_parent_degenerate() ensures that degenerating domains have the same
sched_domain_span(), which guarantees a 1:1 hand-down of the shared
object. If the topmost SD_SHARE_LLC domain itself degenerates, the
shared object is freed in destroy_sched_domain() once the last
reference is dropped.
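The span requirement above is the cpumask_equal() check in the existing
helper (abridged excerpt from current mainline, shown only for context):

static int
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
{
	if (sd_degenerate(parent))
		return 1;

	/* Refuse to degenerate if parent and child do not span the same CPUs. */
	if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
		return 0;

	/* ... remaining flag checks elided ... */
	return 1;
}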
build_sched_domains() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.
Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
call_rcu() on the sched_domain object, i.e. through
destroy_sched_domains_rcu().
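For reference, the free side stays as-is: destroy_sched_domain() already
drops the shared object together with the last domain referencing it
(roughly, as in current mainline; shown here only for context):

static void destroy_sched_domain(struct sched_domain *sd)
{
	free_sched_groups(sd->groups, 1);

	/* Last reference on the shared object frees it as well. */
	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
		kfree(sd->shared);
	kfree(sd);
}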
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o Broke off from a single large patch. Previously
https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
---
kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 623e8835d322..0f56462fef6f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
+
+ /* If sd_llc exists, sd_llc_shared should exist too. */
+ WARN_ON_ONCE(!sd->shared);
sds = sd->shared;
}
@@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
+ /* Pick reference to parent->shared. */
+ if (parent->shared) {
+ WARN_ON_ONCE(tmp->shared);
+ tmp->shared = parent->shared;
+ parent->shared = NULL;
+ }
+
if (parent->parent) {
parent->parent->child = tmp;
parent->parent->groups->flags = tmp->flags;
@@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
sd->cache_nice_tries = 1;
}
- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_LLC) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
-
sd->private = sdd;
return sd;
@@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ struct sched_domain *parent = sd->parent;
struct sched_domain *child = sd->child;
+ /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int llc_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, llc_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
if (!(sd->flags & SD_SHARE_LLC) && child &&
(child->flags & SD_SHARE_LLC)) {
struct sched_domain __rcu *top_p;
@@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (!cpumask_test_cpu(i, cpu_map))
continue;
+ if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
+ *per_cpu_ptr(d.sds, i) = NULL;
+
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
--
2.34.1
On 20/01/26 11:32, K Prateek Nayak wrote:
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
but given this needs to write the sd->imb_numa_nr for every SD I couldn't
factorize this into something that looked sane :(
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
Hello Valentin,
On 2/5/2026 10:23 PM, Valentin Schneider wrote:
> On 20/01/26 11:32, K Prateek Nayak wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>>
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>> + atomic_inc(&sd->shared->ref);
>> + }
>> +
>
> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
> factorize this into something that looked sane :(
Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
it and "sd->shared" assignment cares when we are *at* the sd_llc.
Since we have to assign the "sd->shared" before claim_allocations(),
I couldn't find a better spot to assign it.
That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
and its "parent". I'll let you be the judge of whether the following is
better or worse ;-)
(Only build tested)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ac268da91778..e98bb812de35 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *child = sd->child;
+ struct sched_domain *parent = sd->parent;
- if (!(sd->flags & SD_SHARE_LLC) && child &&
- (child->flags & SD_SHARE_LLC)) {
- struct sched_domain __rcu *top_p;
+ /* Topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+ struct sched_domain *top_p;
unsigned int nr_llcs;
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+
+ /* No SD_NUMA domains. */
+ if (!parent)
+ break;
+
/*
* For a single LLC per node, allow an
* imbalance up to 12.5% of the node. This is
@@ -2641,7 +2651,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* factors and that there is a correlation
* between LLCs and memory channels.
*/
- nr_llcs = sd->span_weight / child->span_weight;
+ nr_llcs = parent->span_weight / sd->span_weight;
if (nr_llcs == 1)
imb = sd->span_weight >> 3;
else
@@ -2650,11 +2660,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd->imb_numa_nr = imb;
/* Set span based on the first NUMA domain. */
- top_p = sd->parent;
+ top_p = parent;
while (top_p && !(top_p->flags & SD_NUMA)) {
top_p = top_p->parent;
}
- imb_span = top_p ? top_p->span_weight : sd->span_weight;
+ imb_span = top_p ? top_p->span_weight : parent->span_weight;
} else {
int factor = max(1U, (sd->span_weight / imb_span));
---
>
>> if (!(sd->flags & SD_SHARE_LLC) && child &&
>> (child->flags & SD_SHARE_LLC)) {
>> struct sched_domain __rcu *top_p;
>
--
Thanks and Regards,
Prateek
On 06/02/26 10:50, K Prateek Nayak wrote:
> Hello Valentin,
>
> On 2/5/2026 10:23 PM, Valentin Schneider wrote:
>> On 20/01/26 11:32, K Prateek Nayak wrote:
>>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> unsigned int imb_span = 1;
>>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> + struct sched_domain *parent = sd->parent;
>>> struct sched_domain *child = sd->child;
>>>
>>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>>> + if ((sd->flags & SD_SHARE_LLC) &&
>>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>>> + int llc_id = cpumask_first(sched_domain_span(sd));
>>> +
>>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>>> + atomic_inc(&sd->shared->ref);
>>> + }
>>> +
>>
>> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
>> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
>> factorize this into something that looked sane :(
>
> Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
> it and "sd->shared" assignment cares when we are *at* the sd_llc.
>
> Since we have to assign the "sd->shared" before claim_allocations(),
> I couldn't find a better spot to assign it.
>
> That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
> and its "parent". I'll let you be the judge of whether the following is
> better or worse ;-)
>
> (Only build tested)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ac268da91778..e98bb812de35 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - struct sched_domain *child = sd->child;
> + struct sched_domain *parent = sd->parent;
>
> - if (!(sd->flags & SD_SHARE_LLC) && child &&
> - (child->flags & SD_SHARE_LLC)) {
> - struct sched_domain __rcu *top_p;
> + /* Topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int sd_id = cpumask_first(sched_domain_span(sd));
> + struct sched_domain *top_p;
> unsigned int nr_llcs;
>
> + sd->shared = *per_cpu_ptr(d.sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +
> + /* No SD_NUMA domains. */
> + if (!parent)
> + break;
> +
AIUI we currently write sd->imb_numa_nr for all SD's, but it's only useful
for the SD_NUMA ones... How about the lightly tested:
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index afb2c26efb4e5..03db45658f6bd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2575,6 +2575,51 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+static inline void adjust_numa_imbalance(struct sched_domain *sd_llc,
+ unsigned int *imb, unsigned int *imb_span)
+{
+ /*
+ * For a single LLC per node, allow an
+ * imbalance up to 12.5% of the node. This is
+ * arbitrary cutoff based two factors -- SMT and
+ * memory channels. For SMT-2, the intent is to
+ * avoid premature sharing of HT resources but
+ * SMT-4 or SMT-8 *may* benefit from a different
+ * cutoff. For memory channels, this is a very
+ * rough estimate of how many channels may be
+ * active and is based on recent CPUs with
+ * many cores.
+ *
+ * For multiple LLCs, allow an imbalance
+ * until multiple tasks would share an LLC
+ * on one node while LLCs on another node
+ * remain idle. This assumes that there are
+ * enough logical CPUs per LLC to avoid SMT
+ * factors and that there is a correlation
+ * between LLCs and memory channels.
+ */
+ struct sched_domain *top_p;
+ unsigned int nr_llcs;
+
+ WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
+ WARN_ON(!sd_llc->parent);
+
+ nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
+ if (nr_llcs == 1)
+ *imb = sd_llc->parent->span_weight >> 3;
+ else
+ *imb = nr_llcs;
+ *imb = max(1U, *imb);
+ sd_llc->parent->imb_numa_nr = *imb;
+
+ /* Set span based on the first NUMA domain. */
+ top_p = sd_llc->parent->parent;
+ while (top_p && !(top_p->flags & SD_NUMA)) {
+ top_p = top_p->parent;
+ }
+ *imb_span = top_p ? top_p->span_weight : sd_llc->parent->span_weight;
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2640,63 +2685,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb = 0;
unsigned int imb_span = 1;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *parent = sd->parent;
-
- /* Topmost SD_SHARE_LLC domain. */
- if ((sd->flags & SD_SHARE_LLC) &&
- (!parent || !(parent->flags & SD_SHARE_LLC))) {
- int sd_id = cpumask_first(sched_domain_span(sd));
- struct sched_domain *top_p;
- unsigned int nr_llcs;
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
-
- /* No SD_NUMA domains. */
- if (!parent)
- break;
-
- /*
- * For a single LLC per node, allow an
- * imbalance up to 12.5% of the node. This is
- * arbitrary cutoff based two factors -- SMT and
- * memory channels. For SMT-2, the intent is to
- * avoid premature sharing of HT resources but
- * SMT-4 or SMT-8 *may* benefit from a different
- * cutoff. For memory channels, this is a very
- * rough estimate of how many channels may be
- * active and is based on recent CPUs with
- * many cores.
- *
- * For multiple LLCs, allow an imbalance
- * until multiple tasks would share an LLC
- * on one node while LLCs on another node
- * remain idle. This assumes that there are
- * enough logical CPUs per LLC to avoid SMT
- * factors and that there is a correlation
- * between LLCs and memory channels.
- */
- nr_llcs = parent->span_weight / sd->span_weight;
- if (nr_llcs == 1)
- imb = sd->span_weight >> 3;
- else
- imb = nr_llcs;
- imb = max(1U, imb);
- sd->imb_numa_nr = imb;
-
- /* Set span based on the first NUMA domain. */
- top_p = parent;
- while (top_p && !(top_p->flags & SD_NUMA)) {
- top_p = top_p->parent;
- }
- imb_span = top_p ? top_p->span_weight : parent->span_weight;
- } else {
- int factor = max(1U, (sd->span_weight / imb_span));
+ sd = *per_cpu_ptr(d.sd, i);
+ /* First, find the topmost SD_SHARE_LLC domain */
+ while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+ sd = sd->parent;
- sd->imb_numa_nr = imb * factor;
- }
+ if (sd->flags & SD_SHARE_LLC) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
+ /* Special case the first parent of the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) && sd->parent) {
+ adjust_numa_imbalance(sd, &imb, &imb_span);
+ sd = sd->parent->parent;
+ }
+
+ /* Update the upper remainder of the topology */
+ while (sd) {
+ int factor = max(1U, (sd->span_weight / imb_span));
+ sd->imb_numa_nr = imb * factor;
+ sd = sd->parent;
}
}
On 1/20/26 5:02 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_degenerate_parent() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (!cpumask_test_cpu(i, cpu_map))
> continue;
>
> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
> + *per_cpu_ptr(d.sds, i) = NULL;
> +
Can we do this in claim_allocations() only?
sdt_alloc and free are complicated already.
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> claim_allocations(i, sd);
> init_sched_groups_capacity(i, sd);
Hello Shrikanth,
On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> if (!cpumask_test_cpu(i, cpu_map))
>> continue;
>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>> + *per_cpu_ptr(d.sds, i) = NULL;
>> +
>
> Can we do this claim_allocations only?
I didn't do it there since we don't have a reference to the "s_data"
inside claim_allocations().
If I remember this right, only init_sched_groups_capacity() has the
requirement to traverse the CPUs in reverse to do
update_group_capacity() when we hit the first CPU in the group.
It doesn't modify the "->ref" of any allocations.
I can put the claim_allocations() bits in the previous loop and pass the
CPU and the s_data reference so it can free both "d.sds" and all the
"d.sd" bits in one place, and retain this reverse loop for
init_sched_groups_capacity(). Does that sound better?
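Purely as a hypothetical sketch of that shape (the extra s_data parameter
and its exact placement are just an illustration, not necessarily what the
next version will end up looking like):

static void claim_allocations(int cpu, struct sched_domain *sd,
			      struct s_data *d)
{
	struct sd_data *sdd = sd->private;

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	/*
	 * d->sds is shared across all levels of this CPU's hierarchy, so
	 * it may already have been claimed at a lower level.
	 */
	if (*per_cpu_ptr(d->sds, cpu) &&
	    atomic_read(&(*per_cpu_ptr(d->sds, cpu))->ref))
		*per_cpu_ptr(d->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}

The call itself would move into the earlier loop, leaving the reverse
loop to only do init_sched_groups_capacity().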
> sdt_alloc and free is complicated already.
>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> claim_allocations(i, sd);
>> init_sched_groups_capacity(i, sd);
>
--
Thanks and Regards,
Prateek
On 1/22/26 2:06 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> if (!cpumask_test_cpu(i, cpu_map))
>>> continue;
>>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>>> + *per_cpu_ptr(d.sds, i) = NULL;
>>> +
>>
>> Can we do this claim_allocations only?
>
> I didn't do it there since we didn't have reference to the "s_data"
> inside claim_allocations().
>
> If I remember this right, only init_sched_groups_capacity() has the
> requirement to traverse the CPUs in reverse to do
> update_group_capacity() when we hit the first CPU in the group.
> It doesn't modify the "->ref" of any allocations.
>
> I can put the claim_allocations() bits the previous loop and pass the
> CPU and the s_data reference so it can free both "d.sds" and all the
> "d.sd" bits in one place and retain this reverse loop for
> init_sched_groups_capacity(). Does that sound better?
>
Yes. IMO having it in one place is better.
Even the next loop could be used to do that.
>> sdt_alloc and free is complicated already.
>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> claim_allocations(i, sd);
>>> init_sched_groups_capacity(i, sd);
>>
>
On 1/23/2026 9:38 AM, Shrikanth Hegde wrote:
>> I can put the claim_allocations() bits in the previous loop and pass the
>> CPU and the s_data reference so it can free both "d.sds" and all the
>> "d.sd" bits in one place, and retain this reverse loop for
>> init_sched_groups_capacity(). Does that sound better?
>>
>
> Yes. IMO having it in one place is better.
> Even the next loop could be used to do that.

Ack! Will change accordingly in the next version.

Thank you again for the suggestion.

--
Thanks and Regards,
Prateek
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_degenerate_parent() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
I agree that in the current implementation, we use the llc_id = "first CPU"
to index into d.sds, and this value actually represents the LLC ID. In
cache-aware scheduling, we plan to convert the llc_id to a logical ID that
is no longer tied to the CPU number. Just my 2 cents: to avoid confusion,
maybe rename the aforementioned llc_id to sd_id?
Anyway I will run some tests on the entire patch set and provide feedback
afterward.
thanks,
Chenyu
Hello Chenyu,
On 1/21/2026 8:56 PM, Chen, Yu C wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>
> I agree that in the current implementation, we use the llc_id="first CPU" to
> index into d.sds, and this value actually represents the LLC ID. In the
> cache-aware scheduling, we plan to convert the llc_id to a logical ID that
> is no longer tied to the CPU number. Just 2 cents, to avoid confusion, maybe
> rename the aforementioned llc_id to sd_id?
Ack! Will modify in the next version!
>
> Anyway I will run some tests on the entire patch set and provide feedback
> afterward.
Thanks a ton for taking a look at the series! Much appreciated.
--
Thanks and Regards,
Prateek