Use the "sched_domain_shared" object allocated in s_data for
"sd->shared" assignments. Assign "sd->shared" for the topmost
SD_SHARE_LLC domain before degeneration and rely on the degeneration
path to correctly pass down the shared object to "sd_llc".
sd_parent_degenerate() ensures that degenerating domains have the same
sched_domain_span(), which guarantees a 1:1 hand-down of the shared
object. If the topmost SD_SHARE_LLC domain itself degenerates, the
shared object is freed in destroy_sched_domain() once the last
reference is dropped.
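The span requirement above is the cpumask_equal() check in the existing
helper (abridged excerpt from current mainline, shown only for context):

static int
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
{
	if (sd_degenerate(parent))
		return 1;

	/* Refuse to degenerate if parent and child do not span the same CPUs. */
	if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
		return 0;

	/* ... remaining flag checks elided ... */
	return 1;
}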
build_sched_domains() NULLs out the objects that have been assigned as
"sd->shared" and the unassigned ones are freed from the __sds_free()
path.
Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
call_rcu() on the sched_domain object, i.e. through
destroy_sched_domains_rcu().
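For reference, the free side stays as-is: destroy_sched_domain() already
drops the shared object together with the last domain referencing it
(roughly, as in current mainline; shown here only for context):

static void destroy_sched_domain(struct sched_domain *sd)
{
	free_sched_groups(sd->groups, 1);

	/* Last reference on the shared object frees it as well. */
	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
		kfree(sd->shared);
	kfree(sd);
}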
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changelog rfc v2..v3:
o Broke off from a single large patch. Previously
https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
---
kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 623e8835d322..0f56462fef6f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
+
+ /* If sd_llc exists, sd_llc_shared should exist too. */
+ WARN_ON_ONCE(!sd->shared);
sds = sd->shared;
}
@@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
+ /* Pick reference to parent->shared. */
+ if (parent->shared) {
+ WARN_ON_ONCE(tmp->shared);
+ tmp->shared = parent->shared;
+ parent->shared = NULL;
+ }
+
if (parent->parent) {
parent->parent->child = tmp;
parent->parent->groups->flags = tmp->flags;
@@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
sd->cache_nice_tries = 1;
}
- /*
- * For all levels sharing cache; connect a sched_domain_shared
- * instance.
- */
- if (sd->flags & SD_SHARE_LLC) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
- atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
-
sd->private = sdd;
return sd;
@@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+ struct sched_domain *parent = sd->parent;
struct sched_domain *child = sd->child;
+ /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int llc_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, llc_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
if (!(sd->flags & SD_SHARE_LLC) && child &&
(child->flags & SD_SHARE_LLC)) {
struct sched_domain __rcu *top_p;
@@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (!cpumask_test_cpu(i, cpu_map))
continue;
+ if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
+ *per_cpu_ptr(d.sds, i) = NULL;
+
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
--
2.34.1
On 20/01/26 11:32, K Prateek Nayak wrote:
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
but given this needs to write the sd->imb_numa_nr for every SD I couldn't
factorize this into something that looked sane :(
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
Hello Valentin,
On 2/5/2026 10:23 PM, Valentin Schneider wrote:
> On 20/01/26 11:32, K Prateek Nayak wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>>
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>> + atomic_inc(&sd->shared->ref);
>> + }
>> +
>
> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
> factorize this into something that looked sane :(
Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
it and "sd->shared" assignment cares when we are *at* the sd_llc.
Since we have to assign the "sd->shared" before claim_allocations(),
I couldn't find a better spot to assign it.
That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
and its "parent". I'll let you be the judge of whether the following is
better or worse ;-)
(Only build tested)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ac268da91778..e98bb812de35 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb_span = 1;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *child = sd->child;
+ struct sched_domain *parent = sd->parent;
- if (!(sd->flags & SD_SHARE_LLC) && child &&
- (child->flags & SD_SHARE_LLC)) {
- struct sched_domain __rcu *top_p;
+ /* Topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) &&
+ (!parent || !(parent->flags & SD_SHARE_LLC))) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+ struct sched_domain *top_p;
unsigned int nr_llcs;
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+
+ /* No SD_NUMA domains. */
+ if (!parent)
+ break;
+
/*
* For a single LLC per node, allow an
* imbalance up to 12.5% of the node. This is
@@ -2641,7 +2651,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* factors and that there is a correlation
* between LLCs and memory channels.
*/
- nr_llcs = sd->span_weight / child->span_weight;
+ nr_llcs = parent->span_weight / sd->span_weight;
if (nr_llcs == 1)
imb = sd->span_weight >> 3;
else
@@ -2650,11 +2660,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
sd->imb_numa_nr = imb;
/* Set span based on the first NUMA domain. */
- top_p = sd->parent;
+ top_p = parent;
while (top_p && !(top_p->flags & SD_NUMA)) {
top_p = top_p->parent;
}
- imb_span = top_p ? top_p->span_weight : sd->span_weight;
+ imb_span = top_p ? top_p->span_weight : parent->span_weight;
} else {
int factor = max(1U, (sd->span_weight / imb_span));
---
>
>> if (!(sd->flags & SD_SHARE_LLC) && child &&
>> (child->flags & SD_SHARE_LLC)) {
>> struct sched_domain __rcu *top_p;
>
--
Thanks and Regards,
Prateek
On 06/02/26 10:50, K Prateek Nayak wrote:
> Hello Valentin,
>
> On 2/5/2026 10:23 PM, Valentin Schneider wrote:
>> On 20/01/26 11:32, K Prateek Nayak wrote:
>>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> unsigned int imb_span = 1;
>>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> + struct sched_domain *parent = sd->parent;
>>> struct sched_domain *child = sd->child;
>>>
>>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>>> + if ((sd->flags & SD_SHARE_LLC) &&
>>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>>> + int llc_id = cpumask_first(sched_domain_span(sd));
>>> +
>>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>>> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>>> + atomic_inc(&sd->shared->ref);
>>> + }
>>> +
>>
>> We now have two if's looking for the highest_flag_domain(i, SD_SHARE_LLC),
>> but given this needs to write the sd->imb_numa_nr for every SD I couldn't
>> factorize this into something that looked sane :(
>
> Yeah! The "imb_numa_nr" cares about the "sd_llc" *after* we've crossed
> it and "sd->shared" assignment cares when we are *at* the sd_llc.
>
> Since we have to assign the "sd->shared" before claim_allocations(),
> I couldn't find a better spot to assign it.
>
> That said, "imb_numa_nr" calculation can be modified to use the "sd_llc"
> and its "parent". I'll let you be the judge of whether the following is
> better or worse ;-)
>
> (Only build tested)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ac268da91778..e98bb812de35 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2614,13 +2614,23 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> - struct sched_domain *child = sd->child;
> + struct sched_domain *parent = sd->parent;
>
> - if (!(sd->flags & SD_SHARE_LLC) && child &&
> - (child->flags & SD_SHARE_LLC)) {
> - struct sched_domain __rcu *top_p;
> + /* Topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int sd_id = cpumask_first(sched_domain_span(sd));
> + struct sched_domain *top_p;
> unsigned int nr_llcs;
>
> + sd->shared = *per_cpu_ptr(d.sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +
> + /* No SD_NUMA domains. */
> + if (!parent)
> + break;
> +
AIUI we currently write sd->imb_numa_nr for all SD's, but it's only useful
for the SD_NUMA ones... How about the lightly tested:
---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index afb2c26efb4e5..03db45658f6bd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2575,6 +2575,51 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+static inline void adjust_numa_imbalance(struct sched_domain *sd_llc,
+ unsigned int *imb, unsigned int *imb_span)
+{
+ /*
+ * For a single LLC per node, allow an
+ * imbalance up to 12.5% of the node. This is
+ * arbitrary cutoff based two factors -- SMT and
+ * memory channels. For SMT-2, the intent is to
+ * avoid premature sharing of HT resources but
+ * SMT-4 or SMT-8 *may* benefit from a different
+ * cutoff. For memory channels, this is a very
+ * rough estimate of how many channels may be
+ * active and is based on recent CPUs with
+ * many cores.
+ *
+ * For multiple LLCs, allow an imbalance
+ * until multiple tasks would share an LLC
+ * on one node while LLCs on another node
+ * remain idle. This assumes that there are
+ * enough logical CPUs per LLC to avoid SMT
+ * factors and that there is a correlation
+ * between LLCs and memory channels.
+ */
+ struct sched_domain *top_p;
+ unsigned int nr_llcs;
+
+ WARN_ON(!(sd_llc->flags & SD_SHARE_LLC));
+ WARN_ON(!sd_llc->parent);
+
+ nr_llcs = sd_llc->parent->span_weight / sd_llc->span_weight;
+ if (nr_llcs == 1)
+ *imb = sd_llc->parent->span_weight >> 3;
+ else
+ *imb = nr_llcs;
+ *imb = max(1U, *imb);
+ sd_llc->parent->imb_numa_nr = *imb;
+
+ /* Set span based on the first NUMA domain. */
+ top_p = sd_llc->parent->parent;
+ while (top_p && !(top_p->flags & SD_NUMA)) {
+ top_p = top_p->parent;
+ }
+ *imb_span = top_p ? top_p->span_weight : sd_llc->parent->span_weight;
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2640,63 +2685,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
unsigned int imb = 0;
unsigned int imb_span = 1;
- for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
- struct sched_domain *parent = sd->parent;
-
- /* Topmost SD_SHARE_LLC domain. */
- if ((sd->flags & SD_SHARE_LLC) &&
- (!parent || !(parent->flags & SD_SHARE_LLC))) {
- int sd_id = cpumask_first(sched_domain_span(sd));
- struct sched_domain *top_p;
- unsigned int nr_llcs;
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
-
- /* No SD_NUMA domains. */
- if (!parent)
- break;
-
- /*
- * For a single LLC per node, allow an
- * imbalance up to 12.5% of the node. This is
- * arbitrary cutoff based two factors -- SMT and
- * memory channels. For SMT-2, the intent is to
- * avoid premature sharing of HT resources but
- * SMT-4 or SMT-8 *may* benefit from a different
- * cutoff. For memory channels, this is a very
- * rough estimate of how many channels may be
- * active and is based on recent CPUs with
- * many cores.
- *
- * For multiple LLCs, allow an imbalance
- * until multiple tasks would share an LLC
- * on one node while LLCs on another node
- * remain idle. This assumes that there are
- * enough logical CPUs per LLC to avoid SMT
- * factors and that there is a correlation
- * between LLCs and memory channels.
- */
- nr_llcs = parent->span_weight / sd->span_weight;
- if (nr_llcs == 1)
- imb = sd->span_weight >> 3;
- else
- imb = nr_llcs;
- imb = max(1U, imb);
- sd->imb_numa_nr = imb;
-
- /* Set span based on the first NUMA domain. */
- top_p = parent;
- while (top_p && !(top_p->flags & SD_NUMA)) {
- top_p = top_p->parent;
- }
- imb_span = top_p ? top_p->span_weight : parent->span_weight;
- } else {
- int factor = max(1U, (sd->span_weight / imb_span));
+ sd = *per_cpu_ptr(d.sd, i);
+ /* First, find the topmost SD_SHARE_LLC domain */
+ while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+ sd = sd->parent;
- sd->imb_numa_nr = imb * factor;
- }
+ if (sd->flags & SD_SHARE_LLC) {
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d.sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+ }
+
+ /* Special case the first parent of the topmost SD_SHARE_LLC domain. */
+ if ((sd->flags & SD_SHARE_LLC) && sd->parent) {
+ adjust_numa_imbalance(sd, &imb, &imb_span);
+ sd = sd->parent->parent;
+ }
+
+ /* Update the upper remainder of the topology */
+ while (sd) {
+ int factor = max(1U, (sd->span_weight / imb_span));
+ sd->imb_numa_nr = imb * factor;
+ sd = sd->parent;
}
}
On 1/20/26 5:02 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_degenerate_parent() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> + }
> +
> if (!(sd->flags & SD_SHARE_LLC) && child &&
> (child->flags & SD_SHARE_LLC)) {
> struct sched_domain __rcu *top_p;
> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (!cpumask_test_cpu(i, cpu_map))
> continue;
>
> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
> + *per_cpu_ptr(d.sds, i) = NULL;
> +
Can we do this in claim_allocations() only?
sdt_alloc and free are complicated already.
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> claim_allocations(i, sd);
> init_sched_groups_capacity(i, sd);
Hello Shrikanth,
On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> if (!cpumask_test_cpu(i, cpu_map))
>> continue;
>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>> + *per_cpu_ptr(d.sds, i) = NULL;
>> +
>
> Can we do this claim_allocations only?
I didn't do it there since we don't have a reference to the "s_data"
inside claim_allocations().
If I remember this right, only init_sched_groups_capacity() has the
requirement to traverse the CPUs in reverse to do
update_group_capacity() when we hit the first CPU in the group.
It doesn't modify the "->ref" of any allocations.
I can put the claim_allocations() bits in the previous loop and pass the
CPU and the s_data reference so it can free both "d.sds" and all the
"d.sd" bits in one place, and retain this reverse loop for
init_sched_groups_capacity(). Does that sound better?
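Purely as a hypothetical sketch of that shape (the extra s_data parameter
and its exact placement are just an illustration, not necessarily what the
next version will end up looking like):

static void claim_allocations(int cpu, struct sched_domain *sd,
			      struct s_data *d)
{
	struct sd_data *sdd = sd->private;

	WARN_ON_ONCE(*per_cpu_ptr(sdd->sd, cpu) != sd);
	*per_cpu_ptr(sdd->sd, cpu) = NULL;

	/*
	 * d->sds is shared across all levels of this CPU's hierarchy, so
	 * it may already have been claimed at a lower level.
	 */
	if (*per_cpu_ptr(d->sds, cpu) &&
	    atomic_read(&(*per_cpu_ptr(d->sds, cpu))->ref))
		*per_cpu_ptr(d->sds, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sg, cpu))->ref))
		*per_cpu_ptr(sdd->sg, cpu) = NULL;

	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
}

The call itself would move into the earlier loop, leaving the reverse
loop to only do init_sched_groups_capacity().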
> sdt_alloc and free is complicated already.
>
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> claim_allocations(i, sd);
>> init_sched_groups_capacity(i, sd);
>
--
Thanks and Regards,
Prateek
On 1/22/26 2:06 PM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 1/22/2026 1:42 PM, Shrikanth Hegde wrote:
>>> @@ -2709,6 +2720,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> if (!cpumask_test_cpu(i, cpu_map))
>>> continue;
>>> + if (atomic_read(&(*per_cpu_ptr(d.sds, i))->ref))
>>> + *per_cpu_ptr(d.sds, i) = NULL;
>>> +
>>
>> Can we do this claim_allocations only?
>
> I didn't do it there since we didn't have reference to the "s_data"
> inside claim_allocations().
>
> If I remember this right, only init_sched_groups_capacity() has the
> requirement to traverse the CPUs in reverse to do
> update_group_capacity() when we hit the first CPU in the group.
> It doesn't modify the "->ref" of any allocations.
>
> I can put the claim_allocations() bits the previous loop and pass the
> CPU and the s_data reference so it can free both "d.sds" and all the
> "d.sd" bits in one place and retain this reverse loop for
> init_sched_groups_capacity(). Does that sound better?
>
Yes. IMO having it in one place is better.
Even the next loop could be used to do that.
>> sdt_alloc and free is complicated already.
>>
>>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>>> claim_allocations(i, sd);
>>> init_sched_groups_capacity(i, sd);
>>
>
On 1/23/2026 9:38 AM, Shrikanth Hegde wrote:
>> I can put the claim_allocations() bits in the previous loop and pass the
>> CPU and the s_data reference so it can free both "d.sds" and all the
>> "d.sd" bits in one place, and retain this reverse loop for
>> init_sched_groups_capacity(). Does that sound better?
>>
>
> Yes. IMO having it in one place is better.
> Even the next loop could be used to do that.

Ack! Will change accordingly in the next version.

Thank you again for the suggestion.

--
Thanks and Regards,
Prateek
On 1/20/2026 7:32 PM, K Prateek Nayak wrote:
> Use the "sched_domain_shared" object allocated in s_data for
> "sd->shared" assignments. Assign "sd->shared" for the topmost
> SD_SHARE_LLC domain before degeneration and rely on the degeneration
> path to correctly pass down the shared object to "sd_llc".
>
> sd_degenerate_parent() ensures degenerating domains must have the same
> sched_domain_span() which ensures 1:1 passing down of the shared object.
> If the topmost SD_SHARE_LLC domain degenerates, the shared object is
> freed from destroy_sched_domain() when the last reference is dropped.
>
> build_sched_domains() NULLs out the objects that have been assigned as
> "sd->shared" and the unassigned ones are freed from the __sds_free()
> path.
>
> Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via
> call_rcu() on the sched_domain object via destroy_sched_domains_rcu().
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changelog rfc v2..v3:
>
> o Broke off from a single large patch. Previously
> https://lore.kernel.org/lkml/20251208092744.32737-3-kprateek.nayak@amd.com/
> ---
> kernel/sched/topology.c | 34 ++++++++++++++++++++++++----------
> 1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 623e8835d322..0f56462fef6f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -679,6 +679,9 @@ static void update_top_cache_domain(int cpu)
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
> +
> + /* If sd_llc exists, sd_llc_shared should exist too. */
> + WARN_ON_ONCE(!sd->shared);
> sds = sd->shared;
> }
>
> @@ -727,6 +730,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
> if (sd_parent_degenerate(tmp, parent)) {
> tmp->parent = parent->parent;
>
> + /* Pick reference to parent->shared. */
> + if (parent->shared) {
> + WARN_ON_ONCE(tmp->shared);
> + tmp->shared = parent->shared;
> + parent->shared = NULL;
> + }
> +
> if (parent->parent) {
> parent->parent->child = tmp;
> parent->parent->groups->flags = tmp->flags;
> @@ -1732,16 +1742,6 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 1;
> }
>
> - /*
> - * For all levels sharing cache; connect a sched_domain_shared
> - * instance.
> - */
> - if (sd->flags & SD_SHARE_LLC) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> -
> sd->private = sdd;
>
> return sd;
> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> unsigned int imb_span = 1;
>
> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> + struct sched_domain *parent = sd->parent;
> struct sched_domain *child = sd->child;
>
> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
> + if ((sd->flags & SD_SHARE_LLC) &&
> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
> + int llc_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
I agree that in the current implementation, we use the llc_id = "first CPU"
to index into d.sds, and this value actually represents the LLC ID. In
cache-aware scheduling, we plan to convert the llc_id to a logical ID that
is no longer tied to the CPU number. Just my 2 cents: to avoid confusion,
maybe rename the aforementioned llc_id to sd_id?
Anyway I will run some tests on the entire patch set and provide feedback
afterward.
thanks,
Chenyu
Hello Chenyu,
On 1/21/2026 8:56 PM, Chen, Yu C wrote:
>> @@ -2655,8 +2655,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> unsigned int imb_span = 1;
>> for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> + struct sched_domain *parent = sd->parent;
>> struct sched_domain *child = sd->child;
>> + /* Attach sd->shared to the topmost SD_SHARE_LLC domain. */
>> + if ((sd->flags & SD_SHARE_LLC) &&
>> + (!parent || !(parent->flags & SD_SHARE_LLC))) {
>> + int llc_id = cpumask_first(sched_domain_span(sd));
>> +
>> + sd->shared = *per_cpu_ptr(d.sds, llc_id);
>
> I agree that in the current implementation, we use the llc_id="first CPU" to
> index into d.sds, and this value actually represents the LLC ID. In the
> cache-aware scheduling, we plan to convert the llc_id to a logical ID that
> is no longer tied to the CPU number. Just 2 cents, to avoid confusion, maybe
> rename the aforementioned llc_id to sd_id?
Ack! Will modify in the next version!
>
> Anyway I will run some tests on the entire patch set and provide feedback
> afterward.
Thanks a ton for taking a look at the series! Much appreciated.
--
Thanks and Regards,
Prateek