[RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

Zhang Qiao posted 1 patch 2 months, 1 week ago
[RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Zhang Qiao 2 months, 1 week ago
wake_wide() uses sd_llc_size as the spreading threshold to detect wide
waker/wakee relationships and to disable wake_affine() for those cases.

On SMT systems, sd_llc_size counts logical CPUs rather than physical
cores. This inflates the wake_wide() threshold, allowing wake_affine()
to pack more tasks into one LLC domain than the actual compute capacity
of its physical cores can sustain. The resulting SMT interference may
cost more than the cache-locality benefit wake_affine() intends to gain.

Scale the factor by the SMT width of the current CPU so that it
approximates the number of independent physical cores in the LLC domain,
making wake_wide() more likely to kick in before SMT interference
becomes significant. On non-SMT systems the SMT width is 1 and behaviour
is unchanged.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f07df8987a5ef..4896582c6e904 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
 	unsigned int slave = p->wakee_flips;
 	int factor = __this_cpu_read(sd_llc_size);
 
+	/* Scale factor to physical-core count to account for SMT interference. */
+	if (sched_smt_active())
+		factor = DIV_ROUND_UP(factor,
+				cpumask_weight(cpu_smt_mask(smp_processor_id())));
+
 	if (master < slave)
 		swap(master, slave);
 	if (slave < factor || master < slave * factor)
-- 
2.18.0
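
For readers who want to experiment with the threshold outside the kernel, the check can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: wake_wide_model() and its parameters are illustrative names, the flip counts are passed in directly instead of being read from current and p, and smt_width stands in for cpumask_weight(cpu_smt_mask(...)).

```c
#include <assert.h>

/* Round-up integer division, same arithmetic as the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Userspace model of wake_wide() with the proposed scaling applied.
 *
 * master, slave: wakee_flips of the waker and the wakee
 * llc_size:      logical CPUs sharing the LLC (sd_llc_size in the kernel)
 * smt_width:     hardware threads per core; pass 1 to model no scaling
 *
 * Returns 1 if the wakeup is classified as wide, 0 otherwise.
 */
int wake_wide_model(unsigned int master, unsigned int slave,
		    unsigned int llc_size, unsigned int smt_width)
{
	unsigned int factor, tmp;

	assert(llc_size > 0 && smt_width > 0);
	factor = DIV_ROUND_UP(llc_size, smt_width);

	/* Make sure master holds the larger flip count. */
	if (master < slave) {
		tmp = master;
		master = slave;
		slave = tmp;
	}

	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}
```

With assumed flip counts of 64 and 8 on a 16-CPU LLC, wake_wide_model(64, 8, 16, 1) returns 0 and wake_wide_model(64, 8, 16, 2) returns 1, because the scaled factor drops from 16 to 8.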
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Shrikanth Hegde 2 months, 1 week ago
Hi.

On 4/7/26 12:09 PM, Zhang Qiao wrote:
> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
> waker/wakee relationships and to disable wake_affine() for those cases.
> 
> On SMT systems, sd_llc_size counts logical CPUs rather than physical
> cores. This inflates the wake_wide() threshold, allowing wake_affine()
> to pack more tasks into one LLC domain than the actual compute capacity
> of its physical cores can sustain. The resulting SMT interference may
> cost more than the cache-locality benefit wake_affine() intends to gain.
>

Isn't load balancing supposed to move it out? What does the workload do?

> Scale the factor by the SMT width of the current CPU so that it
> approximates the number of independent physical cores in the LLC domain,
> making wake_wide() more likely to kick in before SMT interference
> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
> is unchanged.
> 

There are systems where LLC_SIZE == SMT_SIZE, i.e. one core in the LLC.
This would effectively disable the wake_affine feature on such systems.

Power10 being a major example.
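
To make the concern concrete, here is a small sketch of what the scaled threshold degenerates to when the LLC spans a single core. The value 8 used below for both LLC size and SMT width is only an assumed example shape, and the helper names are made up for illustration; none of this is kernel code.

```c
#include <assert.h>

/* Threshold after the proposed scaling: ceil(llc_size / smt_width). */
unsigned int scaled_factor(unsigned int llc_size, unsigned int smt_width)
{
	assert(smt_width > 0);
	return (llc_size + smt_width - 1) / smt_width;
}

/* The wake_wide() comparison for a given factor; expects master >= slave. */
int is_wide(unsigned int master, unsigned int slave, unsigned int factor)
{
	assert(master >= slave);
	return !(slave < factor || master < slave * factor);
}

/*
 * Count the (master, slave) pairs with 1 <= slave <= master <= max_flips
 * that are still affine candidates (not wide) for a given factor.
 */
unsigned int count_affine_pairs(unsigned int max_flips, unsigned int factor)
{
	unsigned int master, slave, n = 0;

	for (master = 1; master <= max_flips; master++)
		for (slave = 1; slave <= master; slave++)
			if (!is_wide(master, slave, factor))
				n++;
	return n;
}
```

Here scaled_factor(8, 8) is 1, and count_affine_pairs(32, 1) is 0: once both tasks have a nonzero flip count, every wakeup is classified as wide. With the unscaled factor of 8, count_affine_pairs(32, 8) keeps all 528 pairs affine.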

> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
>   kernel/sched/fair.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f07df8987a5ef..4896582c6e904 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>   	unsigned int slave = p->wakee_flips;
>   	int factor = __this_cpu_read(sd_llc_size);
>   
> +	/* Scale factor to physical-core count to account for SMT interference. */
> +	if (sched_smt_active())
> +		factor = DIV_ROUND_UP(factor,
> +				cpumask_weight(cpu_smt_mask(smp_processor_id())));
> +
>   	if (master < slave)
>   		swap(master, slave);
>   	if (slave < factor || master < slave * factor)
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Zhang Qiao 1 month, 4 weeks ago
Hi Shrikanth,

On 2026/4/8 1:58, Shrikanth Hegde wrote:
> Hi.
> 
> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>>
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
> 
> Isn't load balance to move it out? What does the workload do?

The workload is a producer-consumer model: one producer wakes up ~50
different consumers, with roughly 10+ consumers running concurrently.
The total number of tasks is well below the CPU count.

In this scenario, load balancing is largely ineffective. Each consumer
spends most of its time sleeping, gets woken by the producer, runs
briefly to process the message, then goes back to sleep. There is
almost no window where a consumer sits on a CPU runqueue in the runnable
state waiting to be pulled. Since load balancing can only migrate
runnable tasks, it simply has no target to act on here.

> 
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>>
> 
> There are systems where LLC_SIZE == SMT_SIZE. i.e one core in the LLC.
> This would effectively disable wake_affine feature in such systems.
> 
> Power10 being a major example.
> 
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> ---
>>   kernel/sched/fair.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>       unsigned int slave = p->wakee_flips;
>>       int factor = __this_cpu_read(sd_llc_size);
>>
>> +    /* Scale factor to physical-core count to account for SMT interference. */
>> +    if (sched_smt_active())
>> +        factor = DIV_ROUND_UP(factor,
>> +                cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>       if (master < slave)
>>           swap(master, slave);
>>       if (slave < factor || master < slave * factor)
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Dietmar Eggemann 1 month, 3 weeks ago
On 16.04.26 09:41, Zhang Qiao wrote:
> Hi Shrikanth,
> 
> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>> Hi.
>>
>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>
>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>> to pack more tasks into one LLC domain than the actual compute capacity
>>> of its physical cores can sustain. The resulting SMT interference may
>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>
>>
>> Isn't load balance to move it out? What does the workload do?
> 
> The workload is a producer-consumer model: one producer wakes up ~50
> different consumers, with roughly 10+ consumers running concurrently.
> The total number of tasks is well below the CPU count.

But higher than your MC core count, I believe? Otherwise you wouldn't
care. I assume you have an MC CPU count of 12-24. Do you have more than 2
different MCs?

> In this scenario, load balancing is largely ineffective. Each consumer
> spends most of its time sleeping, gets woken by the producer, runs
> briefly to process the message, then goes back to sleep. There is
> almost no window where a consumer sits on a CPU runqueue in the runnable
> state waiting to be pulled. Since load balancing can only migrate
> runnable tasks, it simply has no target to act on here.

OK, but SD_BALANCE_WAKE is not set by default, so nobody would experience a
difference in behaviour on an SMT machine in terms of waking tasks wide,
i.e. going through the slow path. As I tried to explain in the
adjacent thread, your wakees would only end up in the slow path if
your sched domains had SD_BALANCE_WAKE set.

Or do you just want to force wakeups for which wake_wide(p) returns 1
to always take the fast path with 'new_cpu == prev_cpu'? But that
wouldn't be a wide wakeup, would it?

[...]
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Zhang Qiao 1 month, 2 weeks ago
Hi,

On 2026/4/22 21:26, Dietmar Eggemann wrote:
> On 16.04.26 09:41, Zhang Qiao wrote:
>> Hi Shrikanth,
>>
>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>> Hi.
>>>
>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>>
>>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>>> to pack more tasks into one LLC domain than the actual compute capacity
>>>> of its physical cores can sustain. The resulting SMT interference may
>>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>>
>>>
>>> Isn't load balance to move it out? What does the workload do?
>>
>> The workload is a producer-consumer model: one producer wakes up ~50
>> different consumers, with roughly 10+ consumers running concurrently.
>> The total number of tasks is well below the CPU count.
> 
> But higher than your MC core count I believe? Otherwise you wouldn't
> care. I assume you have MC CPU count of 12-24. Do you have more than 2
> different MCs.

My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
(16 threads with SMT-2).

> 
>> In this scenario, load balancing is largely ineffective. Each consumer
>> spends most of its time sleeping, gets woken by the producer, runs
>> briefly to process the message, then goes back to sleep. There is
>> almost no window where a consumer sits on a CPU runqueue in the runnable
>> state waiting to be pulled. Since load balancing can only migrate
>> runnable tasks, it simply has no target to act on here.
> 
> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a

SD_BALANCE_WAKE was not enabled in my tests.

> difference in behaviour on an SMT machine in terms of waking tasks wide,
> i.e. going through the slow path. Like I tried to explain in the
> adjacent thread, your wakees would only end up in the slow path in case
> your sched domains would have SD_BALANCE_WAKE set.
>
> Or do you just want to force wakeups which have wake_wide(p) return 1
> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
> be wake wide?

The observed improvement comes from suppressing wake_affine() before it
pulls wakees onto the waker's physical core. In the producer-consumer
workload, without this patch, consumers are repeatedly affined into the
waker's LLC and end up co-scheduled on the same physical core's SMT
siblings. With the patch, wake_wide() fires earlier and wakees are left
on prev_cpu, resulting in better spread across physical cores.
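
As a back-of-the-envelope check of "fires earlier" on this machine: wake_wide() needs slave >= factor and master >= slave * factor, so the lowest waker flip count that can ever classify a wakeup as wide is factor squared. The sketch below only does that arithmetic; min_wide_master() is a made-up helper name, and 16 and 2 are the LLC size and SMT width described above.

```c
#include <assert.h>

/*
 * Smallest master flip count for which the wake_wide() test can pass:
 * slave must be at least factor, and master at least slave * factor.
 */
unsigned int min_wide_master(unsigned int llc_size, unsigned int smt_width)
{
	unsigned int factor;

	assert(smt_width > 0);
	factor = (llc_size + smt_width - 1) / smt_width;
	return factor * factor;
}
```

For a 16-thread LLC this gives min_wide_master(16, 1) == 256 without the patch and min_wide_master(16, 2) == 64 with it.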


Thanks
Zhang Qiao

Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Dietmar Eggemann 1 month ago
On 29.04.26 04:43, Zhang Qiao wrote:
> 
> Hi,
> 
> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>> On 16.04.26 09:41, Zhang Qiao wrote:
>>> Hi Shrikanth,
>>>
>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>> Hi.
>>>>
>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:

[...]

>>> The workload is a producer-consumer model: one producer wakes up ~50
>>> different consumers, with roughly 10+ consumers running concurrently.
>>> The total number of tasks is well below the CPU count.
>>
>> But higher than your MC core count I believe? Otherwise you wouldn't
>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>> different MCs.
> 
> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
> (16 threads with SMT-2).

Thanks.

>>> In this scenario, load balancing is largely ineffective. Each consumer
>>> spends most of its time sleeping, gets woken by the producer, runs
>>> briefly to process the message, then goes back to sleep. There is
>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>> state waiting to be pulled. Since load balancing can only migrate
>>> runnable tasks, it simply has no target to act on here.
>>
>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
> 
> SD_BALANCE_WAKE was not enabled in my tests.

Right, looks like I mixed up balance flags & fast/slow path with the
wake affine vs. wake wide logic.

>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>> i.e. going through the slow path. Like I tried to explain in the
>> adjacent thread, your wakees would only end up in the slow path in case
>> your sched domains would have SD_BALANCE_WAKE set.
>>
>> Or do you just want to force wakeups which have wake_wide(p) return 1
>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>> be wake wide?
> 
> The observed improvement comes from suppressing wake_affine() before it
> pulls wakees onto the waker's physical core. In the producer-consumer
> workload, without this patch, consumers are repeatedly affined into the
> waker's LLC and end up co-scheduled on the same physical core's SMT
> siblings. With the patch, wake_wide() fires earlier and wakees are left
> on prev_cpu, resulting in better spread across physical cores.

Makes sense.

You mentioned having ~10+ consumers running concurrently. I’m curious
why select_idle_sibling() isn’t doing a better job of distributing those
tasks across idle cores, even though wakeups are affine to the waker and
its LLC domain. Is this because you only have 8 cores per LLC, combined
with general system noise?
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Zhang Qiao 3 weeks, 6 days ago

On 2026/5/11 23:54, Dietmar Eggemann wrote:
> On 29.04.26 04:43, Zhang Qiao wrote:
>>
>> Hi,
>>
>> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>>> On 16.04.26 09:41, Zhang Qiao wrote:
>>>> Hi Shrikanth,
>>>>
>>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>>> Hi.
>>>>>
>>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
> 
> [...]
> 
>>>> The workload is a producer-consumer model: one producer wakes up ~50
>>>> different consumers, with roughly 10+ consumers running concurrently.
>>>> The total number of tasks is well below the CPU count.
>>>
>>> But higher than your MC core count I believe? Otherwise you wouldn't
>>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>>> different MCs.
>>
>> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
>> (16 threads with SMT-2).
> 
> Thanks.
> 
>>>> In this scenario, load balancing is largely ineffective. Each consumer
>>>> spends most of its time sleeping, gets woken by the producer, runs
>>>> briefly to process the message, then goes back to sleep. There is
>>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>>> state waiting to be pulled. Since load balancing can only migrate
>>>> runnable tasks, it simply has no target to act on here.
>>>
>>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
>>
>> SD_BALANCE_WAKE was not enabled in my tests.
> 
> Right, looks like I mixed up balance flags & fast/slow path with the
> wake affine vs. wake wide logic.
> 
>>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>>> i.e. going through the slow path. Like I tried to explain in the
>>> adjacent thread, your wakees would only end up in the slow path in case
>>> your sched domains would have SD_BALANCE_WAKE set.
>>>
>>> Or do you just want to force wakeups which have wake_wide(p) return 1
>>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>>> be wake wide?
>>
>> The observed improvement comes from suppressing wake_affine() before it
>> pulls wakees onto the waker's physical core. In the producer-consumer
>> workload, without this patch, consumers are repeatedly affined into the
>> waker's LLC and end up co-scheduled on the same physical core's SMT
>> siblings. With the patch, wake_wide() fires earlier and wakees are left
>> on prev_cpu, resulting in better spread across physical cores.
> 
> Makes sense.
> 
> You mentioned having ~10+ consumers running concurrently. I’m curious
> why select_idle_sibling() isn’t doing a better job of distributing those
> tasks across idle cores, even though wakeups are affine to the waker and
> its LLC domain. Is this because you only have 8 cores per LLC, combined
> with general system noise?

Yes, exactly. Each LLC has only 8 physical cores (16 threads with SMT-2).
When more than 8 consumers are woken into the same LLC domain, the number
of running tasks exceeds the physical core count, and SMT siblings are
forced to share execution resources, causing the interference we observed.
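
The arithmetic behind this is a simple pigeonhole count. As a sketch, assuming every running consumer lands in the same LLC and that idle cores are always filled first (the best case for spreading), the minimum number of cores that must run more than one task can be computed as below. min_shared_cores() is a hypothetical helper written for this illustration.

```c
#include <assert.h>

/*
 * Minimum number of cores that have to run two or more tasks when
 * `running` tasks are placed on `cores` cores with `smt_width` threads
 * each, filling idle cores first.
 */
unsigned int min_shared_cores(unsigned int running, unsigned int cores,
			      unsigned int smt_width)
{
	unsigned int extra;

	assert(smt_width >= 2);
	assert(running <= cores * smt_width);

	if (running <= cores)
		return 0;

	/* Each already-busy core can absorb up to smt_width - 1 extra tasks. */
	extra = running - cores;
	return (extra + smt_width - 2) / (smt_width - 1);
}
```

With 10 concurrently running consumers on 8 SMT-2 cores, min_shared_cores(10, 8, 2) is 2, so at least two cores have both siblings busy even with ideal placement.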

Thanks,
Zhang Qiao

Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Dietmar Eggemann 2 months, 1 week ago
On 07.04.26 08:39, Zhang Qiao wrote:
> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
> waker/wakee relationships and to disable wake_affine() for those cases.
> 
> On SMT systems, sd_llc_size counts logical CPUs rather than physical
> cores. This inflates the wake_wide() threshold, allowing wake_affine()
> to pack more tasks into one LLC domain than the actual compute capacity
> of its physical cores can sustain. The resulting SMT interference may
> cost more than the cache-locality benefit wake_affine() intends to gain.
> 
> Scale the factor by the SMT width of the current CPU so that it
> approximates the number of independent physical cores in the LLC domain,
> making wake_wide() more likely to kick in before SMT interference
> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
> is unchanged.
> 
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f07df8987a5ef..4896582c6e904 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>  	unsigned int slave = p->wakee_flips;
>  	int factor = __this_cpu_read(sd_llc_size);
>  
> +	/* Scale factor to physical-core count to account for SMT interference. */
> +	if (sched_smt_active())
> +		factor = DIV_ROUND_UP(factor,
> +				cpumask_weight(cpu_smt_mask(smp_processor_id())));
> +
>  	if (master < slave)
>  		swap(master, slave);
>  	if (slave < factor || master < slave * factor)

I assume not a lot of people care since this needs:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..596c5d590532 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
                .flags                  = 1*SD_BALANCE_NEWIDLE
                                        | 1*SD_BALANCE_EXEC
                                        | 1*SD_BALANCE_FORK
-                                       | 0*SD_BALANCE_WAKE
+                                       | 1*SD_BALANCE_WAKE
                                        | 1*SD_WAKE_AFFINE
                                        | 0*SD_SHARE_CPUCAPACITY
                                        | 0*SD_SHARE_LLC

And then it's a trade-off between one busy thread per core vs. wakeup cost.
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Shrikanth Hegde 2 months, 1 week ago

On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
> On 07.04.26 08:39, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>>
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>>
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> ---
>>   kernel/sched/fair.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>   	unsigned int slave = p->wakee_flips;
>>   	int factor = __this_cpu_read(sd_llc_size);
>>   
>> +	/* Scale factor to physical-core count to account for SMT interference. */
>> +	if (sched_smt_active())
>> +		factor = DIV_ROUND_UP(factor,
>> +				cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>   	if (master < slave)
>>   		swap(master, slave);
>>   	if (slave < factor || master < slave * factor)
> 
> I assume not a lot of people care since this needs:

wake_affine machinery needs SD_WAKE_AFFINE. No?

> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..596c5d590532 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                  .flags                  = 1*SD_BALANCE_NEWIDLE
>                                          | 1*SD_BALANCE_EXEC
>                                          | 1*SD_BALANCE_FORK
> -                                       | 0*SD_BALANCE_WAKE
> +                                       | 1*SD_BALANCE_WAKE
>                                          | 1*SD_WAKE_AFFINE
>                                          | 0*SD_SHARE_CPUCAPACITY
>                                          | 0*SD_SHARE_LLC
> 
> And then it's a trade-off between one busy thread per core vs. wakeup cost.
Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
Posted by Dietmar Eggemann 1 month, 3 weeks ago
On 07.04.26 20:16, Shrikanth Hegde wrote:
> 
> 
> On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
>> On 07.04.26 08:39, Zhang Qiao wrote:
>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>
>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>> to pack more tasks into one LLC domain than the actual compute capacity
>>> of its physical cores can sustain. The resulting SMT interference may
>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>
>>> Scale the factor by the SMT width of the current CPU so that it
>>> approximates the number of independent physical cores in the LLC domain,
>>> making wake_wide() more likely to kick in before SMT interference
>>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>>> is unchanged.
>>>
>>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>>> ---
>>>   kernel/sched/fair.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index f07df8987a5ef..4896582c6e904 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>>       unsigned int slave = p->wakee_flips;
>>>       int factor = __this_cpu_read(sd_llc_size);
>>>
>>> +    /* Scale factor to physical-core count to account for SMT interference. */
>>> +    if (sched_smt_active())
>>> +        factor = DIV_ROUND_UP(factor,
>>> +                cpumask_weight(cpu_smt_mask(smp_processor_id())));
>>> +
>>>       if (master < slave)
>>>           swap(master, slave);
>>>       if (slave < factor || master < slave * factor)
>>
>> I assume not a lot of people care since this needs:
> 
> wake_affine machinery needs SD_WAKE_AFFINE. No?

Yes, SD_WAKE_AFFINE gates the potential call to wake_affine() and the
forcing of 'sd = NULL', but that does not force a wakeup (WF_TTWU) into
the slow path (sched_balance_find_dst_cpu()), which IMHO is the actual
wake wide.

You need 'sd != NULL', which for a wakeup can only be set by (1):

for_each_domain(cpu, tmp)
  ...
  if (tmp->flags & sd_flag) <-- '(1) SD_BALANCE_WAKE == WF_TTWU'
    sd = tmp;

and since SD_BALANCE_WAKE is never set by default in sd_init()
[kernel/sched/topology.c], I wonder how they achieved this wide (i.e. not
affine to the MC of this_cpu or prev_cpu) wakeup?

By default, we only select wide for WF_FORK and WF_EXEC.

Or do they just want to force 'wake_wide(p) == 1' into sis(..., new_cpu
= prev_cpu) ?
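
The three outcomes being distinguished here can be summarised in a toy model. This is a deliberately simplified, single-domain sketch of a WF_TTWU wakeup, not the kernel's select_task_rq_fair(): the enum, the function names and the boolean parameters are all invented for illustration, and affinity masks, EAS and the walk over multiple domain levels are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative labels for the three outcomes; not kernel identifiers. */
enum wake_path {
	PATH_FAST_PREV,   /* select_idle_sibling() with new_cpu = prev_cpu */
	PATH_FAST_AFFINE, /* wake_affine() picks the target, then select_idle_sibling() */
	PATH_SLOW,        /* sched_balance_find_dst_cpu() */
};

/* Single-domain model: which path does a wakeup take? */
enum wake_path model_wakeup_path(bool wide, bool sd_wake_affine,
				 bool sd_balance_wake)
{
	bool want_affine = !wide;

	/* The affine case forces sd = NULL, so it wins over balance flags. */
	if (want_affine && sd_wake_affine)
		return PATH_FAST_AFFINE;

	/* Only a domain with SD_BALANCE_WAKE leaves sd != NULL for a wakeup. */
	if (sd_balance_wake)
		return PATH_SLOW;

	return PATH_FAST_PREV;
}

/* The cases discussed in the thread, written as checks. */
void model_wakeup_path_examples(void)
{
	/* Default flags, wide wakeup: stays on the fast path around prev_cpu. */
	assert(model_wakeup_path(true, true, false) == PATH_FAST_PREV);
	/* Default flags, non-wide wakeup: affine fast path. */
	assert(model_wakeup_path(false, true, false) == PATH_FAST_AFFINE);
	/* Wide wakeup with SD_BALANCE_WAKE set: slow path. */
	assert(model_wakeup_path(true, true, true) == PATH_SLOW);
}
```

In this model, a wide wakeup with default flags never reaches the slow path; it only skips wake_affine() and searches around prev_cpu.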

[...]