wake_wide() uses sd_llc_size as the spreading threshold to detect wide
waker/wakee relationships and to disable wake_affine() for those cases.

On SMT systems, sd_llc_size counts logical CPUs rather than physical
cores. This inflates the wake_wide() threshold, allowing wake_affine()
to pack more tasks into one LLC domain than the actual compute capacity
of its physical cores can sustain. The resulting SMT interference may
cost more than the cache-locality benefit wake_affine() intends to gain.

Scale the factor by the SMT width of the current CPU so that it
approximates the number of independent physical cores in the LLC domain,
making wake_wide() more likely to kick in before SMT interference
becomes significant. On non-SMT systems the SMT width is 1 and behaviour
is unchanged.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f07df8987a5ef..4896582c6e904 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
 	unsigned int slave = p->wakee_flips;
 	int factor = __this_cpu_read(sd_llc_size);
 
+	/* Scale factor to physical-core count to account for SMT interference. */
+	if (sched_smt_active())
+		factor = DIV_ROUND_UP(factor,
+				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
+
 	if (master < slave)
 		swap(master, slave);
 	if (slave < factor || master < slave * factor)
--
2.18.0
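
For readers following the arithmetic, here is a minimal userspace sketch
of the check being changed. It is not the kernel code:
wake_wide_model(), its parameters and the scale_by_smt switch are made
up for illustration, the topology values are passed in rather than read
from per-CPU data, and the return convention (1 means "wide", so
wake_affine() is skipped) is assumed from the description above.

```c
#include <stdbool.h>

/* Userspace stand-in for the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical model of the wake_wide() comparison.
 * Returns 1 when the waker/wakee pair looks "wide", 0 otherwise.
 * llc_size and smt_width are plain parameters here (assumption);
 * the kernel derives them from topology data.
 */
static int wake_wide_model(unsigned int waker_flips, unsigned int wakee_flips,
			   unsigned int llc_size, unsigned int smt_width,
			   bool scale_by_smt)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;
	unsigned int factor = llc_size;

	/* The proposed change: shrink the threshold to a core count. */
	if (scale_by_smt && smt_width > 1)
		factor = DIV_ROUND_UP(factor, smt_width);

	/* Order the pair so that master holds the larger flip count. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}
```

With a 16-CPU SMT-2 LLC, a pair with flip counts 200 and 10 is not wide
against a threshold of 16, but is wide against the scaled threshold of 8.
With smt_width == 1 the result is the same either way.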
Hi.

On 4/7/26 12:09 PM, Zhang Qiao wrote:
> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
> waker/wakee relationships and to disable wake_affine() for those cases.
>
> On SMT systems, sd_llc_size counts logical CPUs rather than physical
> cores. This inflates the wake_wide() threshold, allowing wake_affine()
> to pack more tasks into one LLC domain than the actual compute capacity
> of its physical cores can sustain. The resulting SMT interference may
> cost more than the cache-locality benefit wake_affine() intends to gain.
>

Isn't load balance to move it out? What does the workload do?

> Scale the factor by the SMT width of the current CPU so that it
> approximates the number of independent physical cores in the LLC domain,
> making wake_wide() more likely to kick in before SMT interference
> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
> is unchanged.
>

There are systems where LLC_SIZE == SMT_SIZE. i.e one core in the LLC.
This would effectively disable wake_affine feature in such systems.

Power10 being a major example.

> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
> kernel/sched/fair.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f07df8987a5ef..4896582c6e904 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>  	unsigned int slave = p->wakee_flips;
>  	int factor = __this_cpu_read(sd_llc_size);
>
> +	/* Scale factor to physical-core count to account for SMT interference. */
> +	if (sched_smt_active())
> +		factor = DIV_ROUND_UP(factor,
> +				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
> +
>  	if (master < slave)
>  		swap(master, slave);
>  	if (slave < factor || master < slave * factor)
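
The single-core-LLC concern can be checked with a few lines of
arithmetic. The sketch below assumes an LLC of 8 logical CPUs that are
all SMT siblings of one core (one possible reading of
LLC_SIZE == SMT_SIZE; the exact numbers are an assumption, not taken
from the mail), and uses hypothetical helper names.

```c
/* Userspace stand-in for the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: the threshold the patch would compute. */
static unsigned int scaled_factor(unsigned int llc_size, unsigned int smt_width)
{
	return DIV_ROUND_UP(llc_size, smt_width);
}

/*
 * Same comparison as in wake_wide(), with the factor passed in.
 * Returns 1 for "wide" (wake_affine() skipped), 0 otherwise.
 */
static int is_wide(unsigned int master, unsigned int slave, unsigned int factor)
{
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	return !(slave < factor || master < slave * factor);
}
```

Here scaled_factor(8, 8) is 1. With a factor of 1, is_wide() returns 1
for any pair whose flip counts are both at least 1, because neither
"slave < 1" nor "master < slave" can hold after the swap. That is the
"effectively disabled" case described above.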
Hi Shrikanth,

On 2026/4/8 1:58, Shrikanth Hegde wrote:
> Hi.
>
> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>>
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
>
> Isn't load balance to move it out? What does the workload do?

The workload is a producer-consumer model: one producer wakes up ~50
different consumers, with roughly 10+ consumers running concurrently.
The total number of tasks is well below the CPU count.

In this scenario, load balancing is largely ineffective. Each consumer
spends most of its time sleeping, gets woken by the producer, runs
briefly to process the message, then goes back to sleep. There is
almost no window where a consumer sits on a CPU runqueue in the runnable
state waiting to be pulled. Since load balancing can only migrate
runnable tasks, it simply has no target to act on here.

>
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>>
>
> There are systems where LLC_SIZE == SMT_SIZE. i.e one core in the LLC.
> This would effectively disable wake_affine feature in such systems.
>
> Power10 being a major example.
>
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> ---
>> kernel/sched/fair.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>  	unsigned int slave = p->wakee_flips;
>>  	int factor = __this_cpu_read(sd_llc_size);
>> +	/* Scale factor to physical-core count to account for SMT interference. */
>> +	if (sched_smt_active())
>> +		factor = DIV_ROUND_UP(factor,
>> +				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>  	if (master < slave)
>>  		swap(master, slave);
>>  	if (slave < factor || master < slave * factor)
>
> .
On 16.04.26 09:41, Zhang Qiao wrote:
> Hi Shrikanth,
>
> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>> Hi.
>>
>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>
>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>> to pack more tasks into one LLC domain than the actual compute capacity
>>> of its physical cores can sustain. The resulting SMT interference may
>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>
>>
>> Isn't load balance to move it out? What does the workload do?
>
> The workload is a producer-consumer model: one producer wakes up ~50
> different consumers, with roughly 10+ consumers running concurrently.
> The total number of tasks is well below the CPU count.

But higher than your MC core count I believe? Otherwise you wouldn't
care. I assume you have MC CPU count of 12-24. Do you have more than 2
different MCs.

> In this scenario, load balancing is largely ineffective. Each consumer
> spends most of its time sleeping, gets woken by the producer, runs
> briefly to process the message, then goes back to sleep. There is
> almost no window where a consumer sits on a CPU runqueue in the runnable
> state waiting to be pulled. Since load balancing can only migrate
> runnable tasks, it simply has no target to act on here.

OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
difference in behaviour on an SMT machine in terms of waking tasks wide,
i.e. going through the slow path. Like I tried to explain in the
adjacent thread, your wakees would only end up in the slow path in case
your sched domains would have SD_BALANCE_WAKE set.

Or do you just want to force wakeups which have wake_wide(p) return 1
always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
be wake wide?

[...]
Hi,

On 2026/4/22 21:26, Dietmar Eggemann wrote:
> On 16.04.26 09:41, Zhang Qiao wrote:
>> Hi Shrikanth,
>>
>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>> Hi.
>>>
>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>>
>>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>>> to pack more tasks into one LLC domain than the actual compute capacity
>>>> of its physical cores can sustain. The resulting SMT interference may
>>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>>
>>>
>>> Isn't load balance to move it out? What does the workload do?
>>
>> The workload is a producer-consumer model: one producer wakes up ~50
>> different consumers, with roughly 10+ consumers running concurrently.
>> The total number of tasks is well below the CPU count.
>
> But higher than your MC core count I believe? Otherwise you wouldn't
> care. I assume you have MC CPU count of 12-24. Do you have more than 2
> different MCs.

My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
(16 threads with SMT-2).

>
>> In this scenario, load balancing is largely ineffective. Each consumer
>> spends most of its time sleeping, gets woken by the producer, runs
>> briefly to process the message, then goes back to sleep. There is
>> almost no window where a consumer sits on a CPU runqueue in the runnable
>> state waiting to be pulled. Since load balancing can only migrate
>> runnable tasks, it simply has no target to act on here.
>
> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a

SD_BALANCE_WAKE was not enabled in my tests.

> difference in behaviour on an SMT machine in terms of waking tasks wide,
> i.e. going through the slow path. Like I tried to explain in the
> adjacent thread, your wakees would only end up in the slow path in case
> your sched domains would have SD_BALANCE_WAKE set.
>
> Or do you just want to force wakeups which have wake_wide(p) return 1
> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
> be wake wide?

The observed improvement comes from suppressing wake_affine() before it
pulls wakees onto the waker's physical core. In the producer-consumer
workload, without this patch, consumers are repeatedly affined into the
waker's LLC and end up co-scheduled on the same physical core's SMT
siblings. With the patch, wake_wide() fires earlier and wakees are left
on prev_cpu, resulting in better spread across physical cores.

Thanks
Zhang Qiao

>
> [...]
>
> .
>
On 29.04.26 04:43, Zhang Qiao wrote:
>
> Hi,
>
> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>> On 16.04.26 09:41, Zhang Qiao wrote:
>>> Hi Shrikanth,
>>>
>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>> Hi.
>>>>
>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:

[...]

>>> The workload is a producer-consumer model: one producer wakes up ~50
>>> different consumers, with roughly 10+ consumers running concurrently.
>>> The total number of tasks is well below the CPU count.
>>
>> But higher than your MC core count I believe? Otherwise you wouldn't
>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>> different MCs.
>
> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
> (16 threads with SMT-2).

Thanks.

>>> In this scenario, load balancing is largely ineffective. Each consumer
>>> spends most of its time sleeping, gets woken by the producer, runs
>>> briefly to process the message, then goes back to sleep. There is
>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>> state waiting to be pulled. Since load balancing can only migrate
>>> runnable tasks, it simply has no target to act on here.
>>
>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
>
> SD_BALANCE_WAKE was not enabled in my tests.

Right, looks like I mixed up balance flags & fast/slow path with the
wake affine vs. wake wide logic.

>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>> i.e. going through the slow path. Like I tried to explain in the
>> adjacent thread, your wakees would only end up in the slow path in case
>> your sched domains would have SD_BALANCE_WAKE set.
>>
>> Or do you just want to force wakeups which have wake_wide(p) return 1
>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>> be wake wide?
>
> The observed improvement comes from suppressing wake_affine() before it
> pulls wakees onto the waker's physical core. In the producer-consumer
> workload, without this patch, consumers are repeatedly affined into the
> waker's LLC and end up co-scheduled on the same physical core's SMT
> siblings. With the patch, wake_wide() fires earlier and wakees are left
> on prev_cpu, resulting in better spread across physical cores.

Makes sense.

You mentioned having ~10+ consumers running concurrently. I’m curious
why select_idle_sibling() isn’t doing a better job of distributing those
tasks across idle cores, even though wakeups are affine to the waker and
its LLC domain. Is this because you only have 8 cores per LLC, combined
with general system noise?
On 2026/5/11 23:54, Dietmar Eggemann wrote:
> On 29.04.26 04:43, Zhang Qiao wrote:
>>
>> Hi,
>>
>> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>>> On 16.04.26 09:41, Zhang Qiao wrote:
>>>> Hi Shrikanth,
>>>>
>>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>>> Hi.
>>>>>
>>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>
> [...]
>
>>>> The workload is a producer-consumer model: one producer wakes up ~50
>>>> different consumers, with roughly 10+ consumers running concurrently.
>>>> The total number of tasks is well below the CPU count.
>>>
>>> But higher than your MC core count I believe? Otherwise you wouldn't
>>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>>> different MCs.
>>
>> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
>> (16 threads with SMT-2).
>
> Thanks.
>
>>>> In this scenario, load balancing is largely ineffective. Each consumer
>>>> spends most of its time sleeping, gets woken by the producer, runs
>>>> briefly to process the message, then goes back to sleep. There is
>>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>>> state waiting to be pulled. Since load balancing can only migrate
>>>> runnable tasks, it simply has no target to act on here.
>>>
>>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
>>
>> SD_BALANCE_WAKE was not enabled in my tests.
>
> Right, looks like I mixed up balance flags & fast/slow path with the
> wake affine vs. wake wide logic.
>
>>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>>> i.e. going through the slow path. Like I tried to explain in the
>>> adjacent thread, your wakees would only end up in the slow path in case
>>> your sched domains would have SD_BALANCE_WAKE set.
>>>
>>> Or do you just want to force wakeups which have wake_wide(p) return 1
>>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>>> be wake wide?
>>
>> The observed improvement comes from suppressing wake_affine() before it
>> pulls wakees onto the waker's physical core. In the producer-consumer
>> workload, without this patch, consumers are repeatedly affined into the
>> waker's LLC and end up co-scheduled on the same physical core's SMT
>> siblings. With the patch, wake_wide() fires earlier and wakees are left
>> on prev_cpu, resulting in better spread across physical cores.
>
> Makes sense.
>
> You mentioned having ~10+ consumers running concurrently. I’m curious
> why select_idle_sibling() isn’t doing a better job of distributing those
> tasks across idle cores, even though wakeups are affine to the waker and
> its LLC domain. Is this because you only have 8 cores per LLC, combined
> with general system noise?

Yes, exactly. Each LLC has only 8 physical cores (16 threads with SMT-2).
When more than 8 consumers are woken into the same LLC domain, the number
of running tasks exceeds the physical core count, and SMT siblings are
forced to share execution resources, causing the interference we observed.

Thanks,
Zhang Qiao

>
> .
>
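
The capacity argument above is a pigeonhole count. As a rough sketch,
assuming SMT-2, no other load in the LLC and ideal placement (all
assumptions for illustration, with a hypothetical helper name), the
minimum number of cores that end up with both siblings busy is:

```c
/*
 * Hypothetical helper: lower bound on the number of SMT-2 cores whose
 * two siblings are both busy when nr_running tasks run concurrently
 * inside one LLC with nr_cores physical cores. Assumes ideal placement,
 * so real behaviour can only be the same or worse.
 */
static unsigned int min_shared_cores_smt2(unsigned int nr_running,
					  unsigned int nr_cores)
{
	/* Every task can still have a core to itself. */
	if (nr_running <= nr_cores)
		return 0;

	/* More tasks than logical CPUs: every core is shared. */
	if (nr_running - nr_cores > nr_cores)
		return nr_cores;

	/* Each task beyond the core count lands on an occupied core. */
	return nr_running - nr_cores;
}
```

For the setup described here (8 cores per LLC), 10 concurrently running
consumers in one LLC give at least 2 fully occupied cores, while 8 or
fewer give none.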
On 07.04.26 08:39, Zhang Qiao wrote:
> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
> waker/wakee relationships and to disable wake_affine() for those cases.
>
> On SMT systems, sd_llc_size counts logical CPUs rather than physical
> cores. This inflates the wake_wide() threshold, allowing wake_affine()
> to pack more tasks into one LLC domain than the actual compute capacity
> of its physical cores can sustain. The resulting SMT interference may
> cost more than the cache-locality benefit wake_affine() intends to gain.
>
> Scale the factor by the SMT width of the current CPU so that it
> approximates the number of independent physical cores in the LLC domain,
> making wake_wide() more likely to kick in before SMT interference
> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
> is unchanged.
>
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
> kernel/sched/fair.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f07df8987a5ef..4896582c6e904 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>  	unsigned int slave = p->wakee_flips;
>  	int factor = __this_cpu_read(sd_llc_size);
>
> +	/* Scale factor to physical-core count to account for SMT interference. */
> +	if (sched_smt_active())
> +		factor = DIV_ROUND_UP(factor,
> +				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
> +
>  	if (master < slave)
>  		swap(master, slave);
>  	if (slave < factor || master < slave * factor)

I assume not a lot of people care since this needs:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..596c5d590532 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.flags			= 1*SD_BALANCE_NEWIDLE
 					| 1*SD_BALANCE_EXEC
 					| 1*SD_BALANCE_FORK
-					| 0*SD_BALANCE_WAKE
+					| 1*SD_BALANCE_WAKE
 					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUCAPACITY
 					| 0*SD_SHARE_LLC

And then it's a trade-off between one busy thread per core vs. wakeup cost.
On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
> On 07.04.26 08:39, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>>
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>>
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> ---
>> kernel/sched/fair.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>  	unsigned int slave = p->wakee_flips;
>>  	int factor = __this_cpu_read(sd_llc_size);
>>
>> +	/* Scale factor to physical-core count to account for SMT interference. */
>> +	if (sched_smt_active())
>> +		factor = DIV_ROUND_UP(factor,
>> +				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>  	if (master < slave)
>>  		swap(master, slave);
>>  	if (slave < factor || master < slave * factor)
>
> I assume not a lot of people care since this needs:

wake_affine machinery needs SD_WAKE_AFFINE. No?

>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..596c5d590532 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		.flags			= 1*SD_BALANCE_NEWIDLE
>  					| 1*SD_BALANCE_EXEC
>  					| 1*SD_BALANCE_FORK
> -					| 0*SD_BALANCE_WAKE
> +					| 1*SD_BALANCE_WAKE
>  					| 1*SD_WAKE_AFFINE
>  					| 0*SD_SHARE_CPUCAPACITY
>  					| 0*SD_SHARE_LLC
>
> And then it's a trade-off between one busy thread per core vs. wakeup cost.
On 07.04.26 20:16, Shrikanth Hegde wrote:
>
>
> On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
>> On 07.04.26 08:39, Zhang Qiao wrote:
>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>
>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>> to pack more tasks into one LLC domain than the actual compute capacity
>>> of its physical cores can sustain. The resulting SMT interference may
>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>
>>> Scale the factor by the SMT width of the current CPU so that it
>>> approximates the number of independent physical cores in the LLC domain,
>>> making wake_wide() more likely to kick in before SMT interference
>>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>>> is unchanged.
>>>
>>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>>> ---
>>> kernel/sched/fair.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index f07df8987a5ef..4896582c6e904 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>> unsigned int slave = p->wakee_flips;
>>> int factor = __this_cpu_read(sd_llc_size);
>>> + /* Scale factor to physical-core count to account for SMT
>>> interference. */
>>> + if (sched_smt_active())
>>> + factor = DIV_ROUND_UP(factor,
>>> + cpumask_weight(cpu_smt_mask(smp_processor_id())));
>>> +
>>> if (master < slave)
>>> swap(master, slave);
>>> if (slave < factor || master < slave * factor)
>>
>> I assume not a lot of people care since this needs:
>
> wake_affine machinery needs SD_WAKE_AFFINE. No?

Yes, the potential call to wake_affine() and forcing 'sd = NULL' but
that's not forcing a wakeup (WF_TTWU) into the slow path
(sched_balance_find_dst_cpu()), which IMHO is the actual wake wide.

You need 'sd != NULL' which can only be set by (1) for a wakeup:

  for_each_domain(cpu, tmp)
    ...
    if (tmp->flags & sd_flag)  <-- '(1) SD_BALANCE_WAKE == WF_TTWU'
      sd = tmp;

and since SD_BALANCE_WAKE is never set per default in sd_init()
[kernel/sched/topology.c] I wonder how they achieved this wide (i.e. not
affine MC for this_cpu or prev_cpu) wakeup?

By default, we only select wide for WF_FORK and WF_EXEC.

Or do they just want to force 'wake_wide(p) == 1' into sis(..., new_cpu
= prev_cpu) ?

[...]
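
To make the three outcomes discussed above easier to compare, here is a
heavily simplified userspace model of the domain walk. It is a sketch
under stated assumptions, not the scheduler code: wakeup_path(), the
enum and the flag values are invented for illustration, the cpumask
checks are left out, and sd_flag is taken to be SD_BALANCE_WAKE (a
plain wakeup).

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; these are NOT the kernel's values. */
#define SD_BALANCE_WAKE	0x1u
#define SD_WAKE_AFFINE	0x2u

enum wake_path {
	PATH_FAST_AFFINE,	/* fast path after the wake_affine() branch */
	PATH_FAST_PREV,		/* fast path with new_cpu == prev_cpu */
	PATH_SLOW,		/* sd != NULL: sched_balance_find_dst_cpu() */
};

/*
 * Hypothetical model: domain_flags[] lists the flags of each sched
 * domain from lowest to highest level; wide is the wake_wide(p) result.
 */
static enum wake_path wakeup_path(const unsigned int *domain_flags,
				  size_t nr_domains, bool wide)
{
	bool want_affine = !wide;
	bool have_sd = false;

	for (size_t i = 0; i < nr_domains; i++) {
		/* Affine branch: the kernel forces sd = NULL and stops. */
		if (want_affine && (domain_flags[i] & SD_WAKE_AFFINE))
			return PATH_FAST_AFFINE;

		/* (1) sd can only be set if the balance flag is present. */
		if (domain_flags[i] & SD_BALANCE_WAKE)
			have_sd = true;
		else if (!want_affine)
			break;
	}

	return have_sd ? PATH_SLOW : PATH_FAST_PREV;
}
```

With default-like flags (SD_WAKE_AFFINE only), a wide wakeup ends in
PATH_FAST_PREV and never in PATH_SLOW. Only when a domain carries
SD_BALANCE_WAKE does a wide wakeup reach PATH_SLOW.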