[PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 1 week ago
This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
introducing SMT awareness.

= Problem =

Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the physical core doesn't deliver its full nominal
capacity. As a result, several SD_ASYM_CPUCAPACITY paths may pick
high-capacity CPUs that are not actually good destinations.

= Proposed Solution =

This patch set aligns those paths with a simple rule already used
elsewhere: when SMT is active, prefer fully idle cores and avoid treating
partially idle SMT siblings as full-capacity targets where that would
mislead load balance.

Patch set summary:

 - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

   Prefer fully-idle SMT cores in asym-capacity idle selection. In the
   wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
   idle selection can prefer CPUs on fully idle cores, with a safe fallback.

 - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity

   Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
   Provided for consistency with PATCH 1/4.

 - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems

   Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
   consistency with PATCH 1/4. I've also tested with
   /proc/sys/kernel/sched_energy_aware enabled and disabled (same
   platform) and haven't noticed any regression.

 - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer

   When choosing the housekeeping CPU that runs the idle load balancer,
   prefer an idle CPU on a fully idle core so migrated work lands where
   effective capacity is available.

   The change is still consistent with the same "avoid CPUs with busy
   sibling" logic and it shows some benefits on Vera, but it could have a
   negative impact on other systems; I'm including it for completeness
   (feedback is appreciated).

This patch set has been tested on the new NVIDIA Vera Rubin platform, where
SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.

Without these patches, performance can drop up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.

Alternative approaches have been evaluated, such as equalizing CPU
capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
normalizing them in the kernel by grouping CPUs within a small capacity
window (+/-5%) [1][2], or enabling asym_packing [3].

However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
results so far. Improving this policy also seems worthwhile in general, as
other platforms in the future may enable SMT with asymmetric CPU
topologies.

[1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
[2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
[3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/

Andrea Righi (4):
      sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
      sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
      sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
      sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer

 kernel/sched/fair.c     | 163 +++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/topology.c |   9 ---
 2 files changed, 147 insertions(+), 25 deletions(-)
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Dietmar Eggemann 2 days, 20 hours ago
Hi Andrea,

On 26.03.26 16:02, Andrea Righi wrote:

[...]

> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> 
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
> 
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
> 
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
I still wonder whether we really need select_idle_capacity() (plus the
smt part) for asymmetric CPU capacity systems where the CPU capacity
differences are < 5% of SCHED_CAPACITY_SCALE.

The known example would be the NVIDIA Grace (!smt) server with its
slightly different perf_caps.highest_perf values.

We did run DCPerf Mediawiki on this thing with:

 (1) ASYM_CPUCAPACITY (default)

 (2) NO ASYM_CPUCAPACITY

We also ran on a comparable ARM64 server (!smt) for comparison:

 (1) ASYM_CPUCAPACITY

 (2) NO ASYM_CPUCAPACITY (default)

Both systems have 72 CPUs, run v6.8 and have a single MC sched domain
with LLC spanning over all 72 CPUs. During the tests there were ~750
tasks, among them the workload-related ones:

  #hhvmworker                   147
  #mariadbd                     204
  #memcached                     11
  #nginx                          8
  #wrk                          144
  #ProxygenWorker                 1

load_balance:

  not_idle	3x more on (2)

  idle		2x more on (2)

  newly_idle    2-10x more on (2)

wakeup:

  move_affine	2-3x more on (1)

  ttwu_local	1.5-2x more on (2)

We also instrumented all the bailout conditions in select_idle_sibling()
(sis())->select_idle_cpu() and select_idle_capacity() (sic()).

In (1) almost all wakeups end up in select_idle_cpu() returning -1 due
to the fact that 'sd->shared->nr_idle_scan' under SIS_UTIL is 0. So
sis() in (1) almost always returns target (this_cpu or prev_cpu). sic()
doesn't do this.

What I haven't done is to try (1) with SIS_UTIL or (2) with NO_SIS_UTIL.

I wonder whether this is the underlying reason for the benefit of (1)
over (2) we see here with smt now?

So IMHO before adding smt support to (1) for these small CPPC based CPU
capacity differences we should make sure that the same can't be achieved
by disabling SIS_UTIL or by softening it a bit.

So does (2) with NO_SIS_UTIL perform worse than (1) with your smt
related add-ons in sic()?
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 2 days, 9 hours ago
Hi Dietmar,

On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> Hi Andrea,
> 
> On 26.03.26 16:02, Andrea Righi wrote:
> 
> [...]
> 
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> > 
> > Without these patches, performance can drop up to ~2x with CPU-intensive
> > workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> > account for busy SMT siblings.
> > 
> > Alternative approaches have been evaluated, such as equalizing CPU
> > capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> > normalizing them in the kernel by grouping CPUs within a small capacity
> > window (+-5%) [1][2], or enabling asympacking [3].
> > 
> > However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> > results so far. Improving this policy also seems worthwhile in general, as
> > other platforms in the future may enable SMT with asymmetric CPU
> > topologies.
> I still wonder whether we really need select_idle_capacity() (plus the
> smt part) for asymmetric CPU capacity systems where the CPU capacity
> differences are < 5% of SCHED_CAPACITY_SCALE.
> 
> The known example would be the NVIDIA Grace (!smt) server with its
> slightly different perf_caps.highest_perf values.
> 
> We did run DCPerf Mediawiki on this thing with:
> 
>  (1) ASYM_CPUCAPACITY (default)
> 
>  (2) NO ASYM_CPUCAPACITY
> 
> We also ran on a comparable ARM64 server (!smt) for comparison:
> 
>  (1) ASYM_CPUCAPACITY
> 
>  (2) NO ASYM_CPUCAPACITY (default)
> 
> Both systems have 72 CPUs, run v6.8 and have a single MC sched domain
> with LLC spanning over all 72 CPUs. During the tests there were ~750
> tasks among them the workload related:
> 
>   #hhvmworker                   147
>   #mariadbd                     204
>   #memcached                     11
>   #nginx                          8
>   #wrk                          144
>   #ProxygenWorker                 1
> 
> load_balance:
> 
>   not_idle	3x more on (2)
> 
>   idle		2x more on (2)
> 
>   newly_idle    2-10x more on (2)
> 
> wakeup:
> 
>   move_affine	2-3x more on (1)
> 
>   ttwu_local	1.5-2 more on (2)
> 
> We also instrumented all the bailout conditions in select_task_sibling()
> (sis())->select_idle_cpu() and select_idle_capacity() (sic()).
> 
> In (1) almost all wakeups end up in select_idle_cpu() returning -1 due
> to the fact that 'sd->shared->nr_idle_scan' under SIS_UTIL is 0. So
> sis() in (1) almost always returns target (this_cpu or prev_cpu). sic()
> doesn't do this.
> 
> What I haven't done is to try (1) with SIS_UTIL or (2) with NO_SIS_UTIL.
> 
> I wonder whether this is the underlying reason for the benefit of (1)
> over (2) we see here with smt now?
> 
> So IMHO before adding smt support to (1) for these small CPPC based CPU
> capacity differences we should make sure that the same can't be achieved
> by disabling SIS_UTIL or to soften it a bit.
> 
> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> related add-ons in sic()?

Thanks for running these experiments and sharing the data, this is very
useful!

I did a quick test on Vera using the NVBLAS benchmark, comparing NO
ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
within error range. I'll also run DCPerf MediaWiki with all the different
configurations to see if I get similar results.

More in general, I agree that for small capacity differences (e.g., within
~5%) the benefits of using ASYM_CPUCAPACITY are questionable. And I'm also
fine with going back to the idea of grouping together CPUs within the 5%
capacity window, if we think it's a safer approach (results in your case
are quite evident; BTW, that means we also shouldn't have
ASYM_CPUCAPACITY on Grace, so in theory the 5% threshold should also
improve performance on Grace, which doesn't have SMT).

That said, I still think there's value in adding SMT awareness to
select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
small capacity deltas, we should ensure that the behavior remains
reasonable if both features are enabled, for any reason. Right now, there
are cases where the current behavior leads to significant performance
degradation (~2x), so having a mechanism to prevent clearly suboptimal task
placement still seems worthwhile. Essentially, what I'm saying is that one
thing doesn't exclude the other.

Thanks,
-Andrea
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Dietmar Eggemann 1 day, 7 hours ago
On 31.03.26 11:04, Andrea Righi wrote:
> Hi Dietmar,
> 
> On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
>> Hi Andrea,
>>
>> On 26.03.26 16:02, Andrea Righi wrote:

[...]

>> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
>> related add-ons in sic()?
> 
> Thanks for running these experiments and sharing the data, this is very
> useful!
> 
> I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> within error range. I'll also run DCPerf MediaWiki with all the different

I'm not familiar with the NVBLAS benchmark. Does it drive your system
into 'sd->shared->nr_idle_scan = 0' state?

We just have to understand where this benefit of using sic() instead of
sis() is coming from. I'm doubtful that this is the best_cpu thing after
if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
target) loop given that the CPU capacity diffs are so small.

> configurations to see if I get similar results.
> 
> More in general, I agree that for small capacity differences (e.g., within
> ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> fine to go back to the idea of grouping together CPUS within the 5%
> capacity window, if we think it's a safer approach (results in your case
> are quite evident, and BTW, that means we also shouldn't have
> ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> improve performance on Grace, that doesn't have SMT).

There shouldn't be so many machines with these binning-introduced small
CPU capacity diffs out there? In fact, I only know about your Grace
(!smt) and Vera (smt) machines.

> That said, I still think there's value in adding SMT awareness to
> select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> small capacity deltas, we should ensure that the behavior remains
> reasonable if both features are enabled, for any reason. Right now, there
> are cases where the current behavior leads to significant performance
> degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> placement still seems worthwhile. Essentially, what I'm saying is that one
> thing doesn't exclude the other.

IMHO, in case we knew where this improvement is coming from using
sic() instead of the default sis() (which already has smt support), then
maybe; it's a lot of extra code in the end ... And mobile big.LITTLE
(with larger CPU capacity diffs) doesn't have smt.
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Vincent Guittot 1 day, 6 hours ago
On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 31.03.26 11:04, Andrea Righi wrote:
> > Hi Dietmar,
> >
> > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> >> Hi Andrea,
> >>
> >> On 26.03.26 16:02, Andrea Righi wrote:
>
> [...]
>
> >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> >> related add-ons in sic()?
> >
> > Thanks for running these experiments and sharing the data, this is very
> > useful!
> >
> > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > within error range. I'll also run DCPerf MediaWiki with all the different
>
> I'm not familiar with the NVBLAS benchmark. Does it drive your system
> into 'sd->shared->nr_idle_scan = 0' state?
>
> We just have to understand where this benefit of using sic() instead of
> sis() is coming from. I'm doubtful that this is the best_cpu thing after
> if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
> target) loop given that the CPU capacity diffs are so small.
>
> > configurations to see if I get similar results.
> >
> > More in general, I agree that for small capacity differences (e.g., within
> > ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> > fine to go back to the idea of grouping together CPUS within the 5%
> > capacity window, if we think it's a safer approach (results in your case
> > are quite evident, and BTW, that means we also shouldn't have
> > ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> > improve performance on Grace, that doesn't have SMT).
>
> There shouldn't be so many machines with these binning-introduced small
> CPU capacity diffs out there? In fact, I only know about your Grace
> (!smt) and Vera (smt) machines.

In any case it's always better to add the support than to enable asym_packing

>
> > That said, I still think there's value in adding SMT awareness to
> > select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> > small capacity deltas, we should ensure that the behavior remains
> > reasonable if both features are enabled, for any reason. Right now, there
> > are cases where the current behavior leads to significant performance
> > degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> > placement still seems worthwhile. Essentially, what I'm saying is that one
> > thing doesn't exclude the other.
>
> IMHO, in case we would know where this improvement is coming from using
> sic() instead of default sis() (which already as smt support) then
> maybe, it's a lot of extra code at the end ... And mobile big.LITTLE
> (with larger CPU capacity diffs) doesn't have smt.

The last proposal, based on Prateek's proposal in sic(), doesn't seem that large
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 1 day, 6 hours ago
On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
> On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 31.03.26 11:04, Andrea Righi wrote:
> > > Hi Dietmar,
> > >
> > > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> > >> Hi Andrea,
> > >>
> > >> On 26.03.26 16:02, Andrea Righi wrote:
> >
> > [...]
> >
> > >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> > >> related add-ons in sic()?
> > >
> > > Thanks for running these experiments and sharing the data, this is very
> > > useful!
> > >
> > > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > > within error range. I'll also run DCPerf MediaWiki with all the different
> >
> > I'm not familiar with the NVBLAS benchmark. Does it drive your system
> > into 'sd->shared->nr_idle_scan = 0' state?

It's something internal unfortunately... it's just running a single
CPU-intensive task for each SMT core (in practice, tasks on half of the
CPUs). I don't think we're hitting sd->shared->nr_idle_scan == 0 in this
case.

> >
> > We just have to understand where this benefit of using sic() instead of
> > sis() is coming from. I'm doubtful that this is the best_cpu thing after
> > if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
> > target) loop given that the CPU capacity diffs are so small.
> >
> > > configurations to see if I get similar results.
> > >
> > > More in general, I agree that for small capacity differences (e.g., within
> > > ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> > > fine to go back to the idea of grouping together CPUS within the 5%
> > > capacity window, if we think it's a safer approach (results in your case
> > > are quite evident, and BTW, that means we also shouldn't have
> > > ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> > > improve performance on Grace, that doesn't have SMT).
> >
> > There shouldn't be so many machines with these binning-introduced small
> > CPU capacity diffs out there? In fact, I only know about your Grace
> > (!smt) and Vera (smt) machines.
> 
> In any case it's always better to add the support than enabling asym_packing
> 
> >
> > > That said, I still think there's value in adding SMT awareness to
> > > select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> > > small capacity deltas, we should ensure that the behavior remains
> > > reasonable if both features are enabled, for any reason. Right now, there
> > > are cases where the current behavior leads to significant performance
> > > degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> > > placement still seems worthwhile. Essentially, what I'm saying is that one
> > > thing doesn't exclude the other.
> >
> > IMHO, in case we would know where this improvement is coming from using
> > sic() instead of default sis() (which already as smt support) then
> > maybe, it's a lot of extra code at the end ... And mobile big.LITTLE
> > (with larger CPU capacity diffs) doesn't have smt.
> 
> The last proposal based on  prateek proposal in sic() doesn't seems that large

Exactly, I was referring just to that patch, which would solve the big part
of the performance issue. We can ignore the ILB part for now.

Thanks,
-Andrea
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 1 day, 5 hours ago
On Wed, Apr 01, 2026 at 02:42:34PM +0200, Andrea Righi wrote:
> On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
> > On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >
> > > On 31.03.26 11:04, Andrea Righi wrote:
> > > > Hi Dietmar,
> > > >
> > > > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> > > >> Hi Andrea,
> > > >>
> > > >> On 26.03.26 16:02, Andrea Righi wrote:
> > >
> > > [...]
> > >
> > > >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> > > >> related add-ons in sic()?
> > > >
> > > > Thanks for running these experiments and sharing the data, this is very
> > > > useful!
> > > >
> > > > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > > > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > > > within error range. I'll also run DCPerf MediaWiki with all the different
> > >
> > > I'm not familiar with the NVBLAS benchmark. Does it drive your system
> > > into 'sd->shared->nr_idle_scan = 0' state?
> 
> It's something internally unfortunately... it's just running a single
> CPU-intensive task for each SMT core (in practice half of the CPUs tasks).
> I don't think we're hitting sd->shared->nr_idle_scan == 0 in this case.

Just finished running some tests with DCPerf MediaWiki on Vera as well
(sorry, it took a while, I did multiple runs to rule out potential flukes):

 +---------------------------------+--------+--------+--------+--------+
 | Configuration                   |   rps  |  p50   |  p95   |  p99   |
 +---------------------------------+--------+--------+--------+--------+
 | NO ASYM + SIS_UTIL              |  8113  |  0.067 |  0.184 |  0.225 |
 | NO ASYM + NO_SIS_UTIL           |  8093  |  0.068 |  0.184 |  0.223 |
 |                                 |        |        |        |        |
 | ASYM + SMT + SIS_UTIL           |  8129  |  0.076 |  0.149 |  0.188 |
 | ASYM + SMT + NO_SIS_UTIL        |  8138  |  0.076 |  0.148 |  0.186 |
 |                                 |        |        |        |        |
 | ASYM + ILB SMT + SIS_UTIL       |  8189  |  0.075 |  0.150 |  0.189 |
 | ASYM + SMT + ILB SMT + SIS_UTIL |  8185  |  0.076 |  0.151 |  0.190 |
 +---------------------------------+--------+--------+--------+--------+

Looking at the data:
 - SIS_UTIL doesn't seem relevant in this case (differences are within
   error range),
 - ASYM_CPUCAPACITY seems to provide a small throughput gain, but appears
   more beneficial for tail latency reduction,
 - the ILB SMT patch seems to slightly improve throughput, but the biggest
   benefit is still coming from ASYM_CPUCAPACITY.

Overall, also in this case it seems beneficial to use ASYM_CPUCAPACITY
rather than equalizing the capacities.

That said, I'm still not sure why ASYM is helping. The frequency asymmetry
is really small (~2%), so the latency improvements are unlikely to come
from prioritizing the faster cores, as that should mainly affect throughput
rather than tail latency, and likely to a smaller extent.

Thanks,
-Andrea
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Balbir Singh 5 days, 5 hours ago
On 3/27/26 02:02, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
> 
> = Problem =
> 
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
> 
> = Proposed Solution =
> 
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.

In kernel/sched/topology.c

	/* Don't attempt to spread across CPUs of different capacities. */
	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
		sd->child->flags &= ~SD_PREFER_SIBLING;

Should handle the selection, but I guess this does not work for SMT level sd's?

> 
> Patch set summary:
> 
>  - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> 
>    Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>    wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>    idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> 
>  - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> 
>    Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>    Provided for consistency with PATCH 1/4.
> 
>  - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> 
>    Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>    consistency with PATCH 1/4. I've also tested with/without
>    /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>    noticed any regression.
> 
>  - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> 
>    When choosing the housekeeping CPU that runs the idle load balancer,
>    prefer an idle CPU on a fully idle core so migrated work lands where
>    effective capacity is available.
> 
>    The change is still consistent with the same "avoid CPUs with busy
>    sibling" logic and it shows some benefits on Vera, but could have
>    negative impact on other systems, I'm including it for completeness
>    (feedback is appreciated).
> 
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> 

Are you referring to nominal_freq?

> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
> 
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
> 
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
> 
> [1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
> [2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
> [3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
> 
> Andrea Righi (4):
>       sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>       sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>       sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>       sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> 
>  kernel/sched/fair.c     | 163 +++++++++++++++++++++++++++++++++++++++++++-----
>  kernel/sched/topology.c |   9 ---
>  2 files changed, 147 insertions(+), 25 deletions(-)


Thanks,
Balbir
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 4 days, 20 hours ago
Hi Balbir,

On Sun, Mar 29, 2026 at 12:03:19AM +1100, Balbir Singh wrote:
> On 3/27/26 02:02, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> > 
> > = Problem =
> > 
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> > 
> > = Proposed Solution =
> > 
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> 
> In kernel/sched/topology.c
> 
> 	/* Don't attempt to spread across CPUs of different capacities. */
> 	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> 		sd->child->flags &= ~SD_PREFER_SIBLING;
> 
> Should handle the selection, but I guess this does not work for SMT level sd's?

IIUC, SD_PREFER_SIBLING steers load balance toward sibling_imbalance()
(spreading runnables across child/sibling domains); it doesn't encode the
fully-idle-core-first logic. In practice it doesn't give us SMT-aware
destination choice when a sibling is busy, and this series is trying to
cover that gap in the placement path.

BTW, on Vera the hierarchy is SMT -> MC -> NUMA:

root@localhost:~# grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_SHARE_LLC
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SERIALIZE SD_NUMA

And domain1/groups_flags (child / SMT flags on the sched groups used at the
MC level) still has SD_PREFER_SIBLING together with SD_SHARE_CPUCAPACITY.

root@localhost:~# cat /sys/kernel/debug/sched/domains/cpu0/domain1/groups_flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING

So, prefer-sibling is still in play for SMT (including via the MC
groups_flags). On machines where asymmetry attaches immediately above SMT,
the topology code may strip that flag and suppress this behavior, but
explicit SMT-aware placement still matters.

> > 
> > Patch set summary:
> > 
> >  - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> > 
> >    Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> >    wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> >    idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> > 
> >  - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> > 
> >    Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> >    Provided for consistency with PATCH 1/4.
> > 
> >  - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> > 
> >    Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> >    consistency with PATCH 1/4. I've also tested with/without
> >    /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> >    noticed any regression.
> > 
> >  - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> > 
> >    When choosing the housekeeping CPU that runs the idle load balancer,
> >    prefer an idle CPU on a fully idle core so migrated work lands where
> >    effective capacity is available.
> > 
> >    The change is still consistent with the same "avoid CPUs with busy
> >    sibling" logic and it shows some benefits on Vera, but could have
> >    negative impact on other systems, I'm including it for completeness
> >    (feedback is appreciated).
> > 
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> > 
> 
> Are you referring to nominal_freq?
> 

Correct.

Thanks,
-Andrea
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Balbir Singh 3 days, 21 hours ago
On 3/29/26 09:50, Andrea Righi wrote:
> Hi Balbir,
> 
> On Sun, Mar 29, 2026 at 12:03:19AM +1100, Balbir Singh wrote:
>> On 3/27/26 02:02, Andrea Righi wrote:
>>> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
>>> introducing SMT awareness.
>>>
>>> = Problem =
>>>
>>> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
>>> sibling is busy, because the physical core doesn't deliver its full nominal
>>> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
>>> that are not actually good destinations.
>>>
>>> = Proposed Solution =
>>>
>>> This patch set aligns those paths with a simple rule already used
>>> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
>>> partially idle SMT siblings as full-capacity targets where that would
>>> mislead load balance.
>>
>> In kernel/sched/topology.c
>>
>> 	/* Don't attempt to spread across CPUs of different capacities. */
>> 	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
>> 		sd->child->flags &= ~SD_PREFER_SIBLING;
>>
>> Should handle the selection, but I guess this does not work for SMT level sd's?
> 
> IIUC, SD_PREFER_SIBLING steers load balance toward sibling_imbalance()
> (spread runnables across child/sibling domains); it doesn't encode the
> fully-idle-core-first logic. In practice it doesn't give us SMT-aware
> destination choice when a sibling is busy, and this series is trying to
> cover that gap in the placement path.
> 

Thanks, so we care about idle selection, not necessarily balancing, and yes,
I did see that sd->child needs to be set for SD_PREFER_SIBLING to be cleared.

> BTW, on Vera the hierarchy is SMT -> MC -> NUMA:
> 
> root@localhost:~# grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags
> /sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> /sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_SHARE_LLC
> /sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SERIALIZE SD_NUMA
> 
> And domain1/groups_flags (child / SMT flags on the sched groups used at the
> MC level) still has SD_PREFER_SIBLING together with SD_SHARE_CPUCAPACITY.
> 
> root@localhost:~# cat /sys/kernel/debug/sched/domains/cpu0/domain1/groups_flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> 
> So, prefer-sibling is still in play for SMT (including via MC
> groups_flags). On machines where asymmetry attaches immediately above SMT,
> topology may strip that flag and disable this behavior, but explicit
> SMT-aware placement still matters.
> 
>>>
>>> Patch set summary:
>>>
>>>  - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>>>
>>>    Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>>>    wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>>>    idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>>>
>>>  - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>>>
>>>    Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>>>    Provided for consistency with PATCH 1/4.
>>>
>>>  - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>>>
>>>    Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>>>    consistency with PATCH 1/4. I've also tested with/without
>>>    /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>>>    noticed any regression.
>>>
>>>  - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>>>
>>>    When choosing the housekeeping CPU that runs the idle load balancer,
>>>    prefer an idle CPU on a fully idle core so migrated work lands where
>>>    effective capacity is available.
>>>
>>>    The change is still consistent with the same "avoid CPUs with busy
>>>    sibling" logic and it shows some benefits on Vera, but could have
>>>    negative impact on other systems, I'm including it for completeness
>>>    (feedback is appreciated).
>>>
>>> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
>>> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
>>> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>>>
>>
>> Are you referring to nominal_freq?
>>
> 
> Correct.
> 

Thanks,
Balbir
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Shrikanth Hegde 6 days, 2 hours ago
Hi Andrea.

On 3/26/26 8:32 PM, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
> 
> = Problem =
> 
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
> 

How does the energy model define the OPPs for SMT?

SMT systems have multiple different functional blocks: ALUs (arithmetic),
LSUs (load/store units), etc. If the same or a similar workload runs on a
sibling, it affects performance, but if the sibling is using different
functional blocks, it does not.

So the actual underlying CPU capacity of each thread depends on what each
sibling is running. I don't understand how the firmware/energy models define
this.

> = Proposed Solution =
> 
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
> 
> Patch set summary:
> 
>   - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> 
>     Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>     wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>     idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> 
>   - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> 
>     Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>     Provided for consistency with PATCH 1/4.
> 
>   - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> 
>     Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>     consistency with PATCH 1/4. I've also tested with/without
>     /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>     noticed any regression.
> 
>   - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> 
>     When choosing the housekeeping CPU that runs the idle load balancer,
>     prefer an idle CPU on a fully idle core so migrated work lands where
>     effective capacity is available.
> 
>     The change is still consistent with the same "avoid CPUs with busy
>     sibling" logic and it shows some benefits on Vera, but could have
>     negative impact on other systems, I'm including it for completeness
>     (feedback is appreciated).
> 
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>

I assume the CPU_CAPACITY values are fixed?
First sibling has max, while the other has less?

> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
> 

How is the performance measured here? Which benchmark?
By any chance you are running number_running_task <= (nr_cpus / smt_threads_per_core),
so it is all fitting nicely?

If you increase those numbers, how does the performance numbers compare?

Also, what's the system like? SMT level?

> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asym packing [3].
> 
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
> 
> [1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
> [2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
> [3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
> 
> Andrea Righi (4):
>        sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>        sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>        sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>        sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> 
>   kernel/sched/fair.c     | 163 +++++++++++++++++++++++++++++++++++++++++++-----
>   kernel/sched/topology.c |   9 ---
>   2 files changed, 147 insertions(+), 25 deletions(-)
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 6 days, 1 hour ago
On Fri, Mar 27, 2026 at 10:01:03PM +0530, Shrikanth Hegde wrote:
> Hi Andrea.
> 
> On 3/26/26 8:32 PM, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> > 
> > = Problem =
> > 
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> > 
> 
> How does the energy model define the OPPs for SMT?

For now, as suggested by Vincent, we should probably ignore EAS / the energy
model and keep it as it is (not compatible with SMT). I'll drop PATCH 3/4
and focus only on SD_ASYM_CPUCAPACITY + SMT.

> 
> SMT systems have multiple different functional blocks: ALUs (arithmetic),
> LSUs (load/store units), etc. If the same or a similar workload runs on a
> sibling, it affects performance, but if the sibling is using different
> functional blocks, it does not.
> 
> So the actual underlying CPU capacity of each thread depends on what each
> sibling is running. I don't understand how the firmware/energy models define
> this.

They don't, and they probably shouldn't. I don't think it's possible to
model CPU capacity with a static nominal value when SMT is enabled, since
the effective capacity changes depending on whether the sibling is busy.

It should be up to the scheduler to figure out a reasonable way to estimate
the actual capacity, considering the status of the other sibling (e.g.,
prioritizing the fully-idle SMT cores over the partially-idle SMT cores,
like we do in other parts of the scheduler code).

> 
> > = Proposed Solution =
> > 
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> > 
> > Patch set summary:
> > 
> >   - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> > 
> >     Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> >     wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> >     idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> > 
> >   - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> > 
> >     Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> >     Provided for consistency with PATCH 1/4.
> > 
> >   - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> > 
> >     Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> >     consistency with PATCH 1/4. I've also tested with/without
> >     /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> >     noticed any regression.
> > 
> >   - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> > 
> >     When choosing the housekeeping CPU that runs the idle load balancer,
> >     prefer an idle CPU on a fully idle core so migrated work lands where
> >     effective capacity is available.
> > 
> >     The change is still consistent with the same "avoid CPUs with busy
> >     sibling" logic and it shows some benefits on Vera, but could have
> >     negative impact on other systems, I'm including it for completeness
> >     (feedback is appreciated).
> > 
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> > 
> 
> I assume the CPU_CAPACITY values are fixed?
> First sibling has max, while the other has less?

The firmware is exposing the same capacity for both siblings. SMT cores may
have different capacity, but siblings within the same SMT core have the
same capacity.

There was an idea to expose a higher capacity for all the 1st siblings and
a lower capacity for all the 2nd siblings, but I don't think it's a good
idea, since that would just confuse the scheduler (and a 2nd sibling
doesn't really have a lower nominal capacity if it's running alone).

> 
> > Without these patches, performance can drop up to ~2x with CPU-intensive
> > workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> > account for busy SMT siblings.
> > 
> 
> How is the performance measured here? Which benchmark?

I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
and got similar results. I'm planning to repeat the tests using public
benchmarks and share the results as soon as I can.

> By any chance you are running number_running_task <= (nr_cpus / smt_threads_per_core),
> so it is all fitting nicely?

That's the case that gives me the optimal results.

> 
> If you increase those numbers, how does the performance numbers compare?

I tried different numbers of tasks. The closer I get to system saturation,
the smaller the benefits are. When I completely saturate the system I don't
see any benefit from these changes, nor any regressions, but I guess that's
expected.

> 
> Also, what's the system like? SMT level?

2 siblings for each SMT core.

Thanks,
-Andrea
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Shrikanth Hegde 5 days, 12 hours ago
>> How is the performance measured here? Which benchmark?
> 
> I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
> and got similar results. I'm planning to repeat the tests using public
> benchmarks and share the results as soon as I can.
> 
>> By any chance you are running number_running_task <= (nr_cpus / smt_threads_per_core),
>> so it is all fitting nicely?
> 
> That's the case that gives me the optimal results.
> 
>>
>> If you increase those numbers, how does the performance numbers compare?
> 
> I tried different numbers of tasks. The closer I get to system saturation,
> the smaller the benefits are. When I completely saturate the system I don't
> see any benefit from these changes, nor any regressions, but I guess that's
> expected.
> 


Ok. That's good.

I ran hackbench on powerpc with SMT=4 and didn't observe any regressions or
improvements. Only PATCH 4/4 applies in this case, as there is no
asym_cpu_capacity.
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Christian Loehle 1 week ago
On 3/26/26 15:02, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
> 
> = Problem =
> 
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
> 
> = Proposed Solution =
> 
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
> 
> Patch set summary:
> 
>  - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> 
>    Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>    wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>    idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> 
>  - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> 
>    Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>    Provided for consistency with PATCH 1/4.
> 
>  - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> 
>    Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>    consistency with PATCH 1/4. I've also tested with/without
>    /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>    noticed any regression.


There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
There's no EAS with it?
(To be more precise, CPPC should bail out of building an artificial EM if
there's no efficiency class, or only one:
drivers/cpufreq/cppc_cpufreq.c:

	if (bitmap_weight(used_classes, 256) <= 1) {
		pr_debug("Efficiency classes are all equal (=%d). "
			 "No EM registered", class);
		return;
	}

This is the case, right?

> [snip]
Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
Posted by Andrea Righi 6 days, 12 hours ago
On Thu, Mar 26, 2026 at 04:33:08PM +0000, Christian Loehle wrote:
> On 3/26/26 15:02, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> > 
> > = Problem =
> > 
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> > 
> > = Proposed Solution =
> > 
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> > 
> > Patch set summary:
> > 
> >  - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> > 
> >    Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> >    wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> >    idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> > 
> >  - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> > 
> >    Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> >    Provided for consistency with PATCH 1/4.
> > 
> >  - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> > 
> >    Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> >    consistency with PATCH 1/4. I've also tested with/without
> >    /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> >    noticed any regression.
> 
> 
> There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
> There's no EAS with it?
> (To be more precise, CPPC should bail out of building an artificial EM if
> there's no efficiency class, or only one:
> drivers/cpufreq/cppc_cpufreq.c:
> 
> 	if (bitmap_weight(used_classes, 256) <= 1) {
> 		pr_debug("Efficiency classes are all equal (=%d). "
> 			 "No EM registered", class);
> 		return;
> 	}
> 
> This is the case, right?

Yes, that's correct, so my testing on Vera with EAS isn't that meaningful.

Thanks,
-Andrea