[PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

Posted by Andrea Righi 2 weeks, 5 days ago
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, prefer fully-idle SMT cores over partially-idle ones. A
two-phase selection first tries only CPUs on fully idle cores, then
falls back to any idle CPU if none fit.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring them when available leads to more accurate capacity usage on
task wakeup.

On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads running a number of tasks equal to the number of
SMT cores.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a35a82e47920..0f97c44d4606b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7945,9 +7945,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
  * maximize capacity.
+ *
+ * When @smt_idle_only is true (asym + SMT), only consider CPUs on cores whose
+ * SMT siblings are all idle, to avoid stacking and sharing SMT resources.
  */
 static int
-select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
+		     bool smt_idle_only)
 {
 	unsigned long task_util, util_min, util_max, best_cap = 0;
 	int fits, best_fits = 0;
@@ -7967,6 +7971,9 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		if (!choose_idle_cpu(cpu, p))
 			continue;
 
+		if (smt_idle_only && !is_core_idle(cpu))
+			continue;
+
 		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
 
 		/* This CPU fits with all requirements */
@@ -8102,8 +8109,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		 * capacity path.
 		 */
 		if (sd) {
-			i = select_idle_capacity(p, sd, target);
-			return ((unsigned)i < nr_cpumask_bits) ? i : target;
+			/*
+			 * When asym + SMT and the hint says idle cores exist,
+			 * try idle cores first to avoid stacking on SMT; else
+			 * scan all idle CPUs.
+			 */
+			if (sched_smt_active() && test_idle_cores(target)) {
+				i = select_idle_capacity(p, sd, target, true);
+				if ((unsigned int)i >= nr_cpumask_bits)
+					i = select_idle_capacity(p, sd, target, false);
+			} else {
+				i = select_idle_capacity(p, sd, target, false);
+			}
+			return ((unsigned int)i < nr_cpumask_bits) ? i : target;
 		}
 	}
 
-- 
2.53.0
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Vincent Guittot 2 weeks, 5 days ago
On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,

Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
? I thought both were never set simultaneously and SD_ASYM_PACKING was
used for system involving SMT like x86

> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> two-phase selection first tries only CPUs on fully idle cores, then
> falls back to any idle CPU if none fit.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/fair.c | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a35a82e47920..0f97c44d4606b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7945,9 +7945,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
>   * maximize capacity.
> + *
> + * When @smt_idle_only is true (asym + SMT), only consider CPUs on cores whose
> + * SMT siblings are all idle, to avoid stacking and sharing SMT resources.
>   */
>  static int
> -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> +                    bool smt_idle_only)
>  {
>         unsigned long task_util, util_min, util_max, best_cap = 0;
>         int fits, best_fits = 0;
> @@ -7967,6 +7971,9 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                 if (!choose_idle_cpu(cpu, p))
>                         continue;
>
> +               if (smt_idle_only && !is_core_idle(cpu))
> +                       continue;
> +
>                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
>                 /* This CPU fits with all requirements */
> @@ -8102,8 +8109,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>                  * capacity path.
>                  */
>                 if (sd) {
> -                       i = select_idle_capacity(p, sd, target);
> -                       return ((unsigned)i < nr_cpumask_bits) ? i : target;
> +                       /*
> +                        * When asym + SMT and the hint says idle cores exist,
> +                        * try idle cores first to avoid stacking on SMT; else
> +                        * scan all idle CPUs.
> +                        */
> +                       if (sched_smt_active() && test_idle_cores(target)) {
> +                               i = select_idle_capacity(p, sd, target, true);
> +                               if ((unsigned int)i >= nr_cpumask_bits)
> +                                       i = select_idle_capacity(p, sd, target, false);

Can't you make it one pass in select_idle_capacity ?

> +                       } else {
> +                               i = select_idle_capacity(p, sd, target, false);
> +                       }
> +                       return ((unsigned int)i < nr_cpumask_bits) ? i : target;
>                 }
>         }
>
> --
> 2.53.0
>
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Andrea Righi 2 weeks, 5 days ago
Hi Vincent,

On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> 
> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> used for system involving SMT like x86

It's an NVIDIA platform (not publicly available yet), where the firmware
exposes different CPU capacities and has SMT enabled, so both
SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
firmware release will keep this exact configuration (there's a good chance
it will), so I'm targeting it to be prepared.

> 
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> > two-phase selection first tries only CPUs on fully idle cores, then
> > falls back to any idle CPU if none fit.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running an amount of tasks equal to the amount of
> > SMT cores.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/fair.c | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a35a82e47920..0f97c44d4606b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7945,9 +7945,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> >   * maximize capacity.
> > + *
> > + * When @smt_idle_only is true (asym + SMT), only consider CPUs on cores whose
> > + * SMT siblings are all idle, to avoid stacking and sharing SMT resources.
> >   */
> >  static int
> > -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> > +                    bool smt_idle_only)
> >  {
> >         unsigned long task_util, util_min, util_max, best_cap = 0;
> >         int fits, best_fits = 0;
> > @@ -7967,6 +7971,9 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 if (!choose_idle_cpu(cpu, p))
> >                         continue;
> >
> > +               if (smt_idle_only && !is_core_idle(cpu))
> > +                       continue;
> > +
> >                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> >                 /* This CPU fits with all requirements */
> > @@ -8102,8 +8109,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >                  * capacity path.
> >                  */
> >                 if (sd) {
> > -                       i = select_idle_capacity(p, sd, target);
> > -                       return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > +                       /*
> > +                        * When asym + SMT and the hint says idle cores exist,
> > +                        * try idle cores first to avoid stacking on SMT; else
> > +                        * scan all idle CPUs.
> > +                        */
> > +                       if (sched_smt_active() && test_idle_cores(target)) {
> > +                               i = select_idle_capacity(p, sd, target, true);
> > +                               if ((unsigned int)i >= nr_cpumask_bits)
> > +                                       i = select_idle_capacity(p, sd, target, false);
> 
> Can't you make it one pass in select_idle_capacity ?

Oh yes, absolutely, we can select the best-fit CPU in the same pass and use
it as a fallback if we can't find any fully-idle SMT CPU. I'll change that.

> 
> > +                       } else {
> > +                               i = select_idle_capacity(p, sd, target, false);
> > +                       }
> > +                       return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> >                 }
> >         }
> >
> > --
> > 2.53.0
> >

Thanks,
-Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Vincent Guittot 2 weeks, 4 days ago
On Wed, 18 Mar 2026 at 11:31, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Vincent,
>
> On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> > On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> > >
> > > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > > different per-core frequencies), the wakeup path uses
> > > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > > for better task placement. However, when those CPUs belong to SMT cores,
> >
> > Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> > ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> > used for system involving SMT like x86
>
> It's an NVIDIA platform (not publicly available yet), where the firmware
> exposes different CPU capacities and has SMT enabled, so both
> SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> firmware release will keep this exact configuration (there's a good chance
> it will), so I'm targeting it to be prepared.

That's probably not the only place where SD_ASYM_CPUCAPACITY will fail
with SMT. The misfit is another place as an example

>
> >
> > > their effective capacity can be much lower than the nominal capacity
> > > when the sibling thread is busy: SMT siblings compete for shared
> > > resources, so a "high capacity" CPU that is idle but whose sibling is
> > > busy does not deliver its full capacity. This effective capacity
> > > reduction cannot be modeled by the static capacity value alone.
> > >
> > > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > > SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> > > two-phase selection first tries only CPUs on fully idle cores, then
> > > falls back to any idle CPU if none fit.
> > >
> > > Prioritizing fully-idle SMT cores yields better task placement because
> > > the effective capacity of partially-idle SMT cores is reduced; always
> > > preferring them when available leads to more accurate capacity usage on
> > > task wakeup.
> > >
> > > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > > selection has been shown to improve throughput by around 15-18% for
> > > CPU-bound workloads, running an amount of tasks equal to the amount of
> > > SMT cores.
> > >
> > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > ---
> > >  kernel/sched/fair.c | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 0a35a82e47920..0f97c44d4606b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7945,9 +7945,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > >   * maximize capacity.
> > > + *
> > > + * When @smt_idle_only is true (asym + SMT), only consider CPUs on cores whose
> > > + * SMT siblings are all idle, to avoid stacking and sharing SMT resources.
> > >   */
> > >  static int
> > > -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> > > +                    bool smt_idle_only)
> > >  {
> > >         unsigned long task_util, util_min, util_max, best_cap = 0;
> > >         int fits, best_fits = 0;
> > > @@ -7967,6 +7971,9 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >                 if (!choose_idle_cpu(cpu, p))
> > >                         continue;
> > >
> > > +               if (smt_idle_only && !is_core_idle(cpu))
> > > +                       continue;
> > > +
> > >                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > >
> > >                 /* This CPU fits with all requirements */
> > > @@ -8102,8 +8109,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > >                  * capacity path.
> > >                  */
> > >                 if (sd) {
> > > -                       i = select_idle_capacity(p, sd, target);
> > > -                       return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > > +                       /*
> > > +                        * When asym + SMT and the hint says idle cores exist,
> > > +                        * try idle cores first to avoid stacking on SMT; else
> > > +                        * scan all idle CPUs.
> > > +                        */
> > > +                       if (sched_smt_active() && test_idle_cores(target)) {
> > > +                               i = select_idle_capacity(p, sd, target, true);
> > > +                               if ((unsigned int)i >= nr_cpumask_bits)
> > > +                                       i = select_idle_capacity(p, sd, target, false);
> >
> > Can't you make it one pass in select_idle_capacity ?
>
> Oh yes, absolutely, we can select the best-fit CPU in the same pass and use
> it as a fallback if we can't find any fully-idle SMT CPU. I'll change that.
>
> >
> > > +                       } else {
> > > +                               i = select_idle_capacity(p, sd, target, false);
> > > +                       }
> > > +                       return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> > >                 }
> > >         }
> > >
> > > --
> > > 2.53.0
> > >
>
> Thanks,
> -Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Andrea Righi 2 weeks, 4 days ago
Hi Vincent,

On Thu, Mar 19, 2026 at 08:17:06AM +0100, Vincent Guittot wrote:
> On Wed, 18 Mar 2026 at 11:31, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > Hi Vincent,
> >
> > On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> > > On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> > > >
> > > > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > > > different per-core frequencies), the wakeup path uses
> > > > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > > > for better task placement. However, when those CPUs belong to SMT cores,
> > >
> > > Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> > > ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> > > used for system involving SMT like x86
> >
> > It's an NVIDIA platform (not publicly available yet), where the firmware
> > exposes different CPU capacities and has SMT enabled, so both
> > SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> > firmware release will keep this exact configuration (there's a good chance
> > it will), so I'm targeting it to be prepared.
> 
> That's probably not the only place where SD_ASYM_CPUCAPACITY will fail
> with SMT. The misfit is another place as an example

Yeah, that's right, with SD_ASYM_CPUCAPACITY + SMT when a misfit task is
moved from a "small" CPU to a "big" CPU, if the big CPU has a busy sibling,
its effective capacity is much lower than its nominal capacity.

Maybe when SMT is active, we could allow pulling a misfit task only when
dst_cpu is on a fully idle core. That's a simple change, I'll run some
tests with this, but as you said there might be other places to fix as
well.

Thanks,
-Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Christian Loehle 2 weeks, 5 days ago
On 3/18/26 10:31, Andrea Righi wrote:
> Hi Vincent,
> 
> On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
>> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
>>>
>>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
>>> different per-core frequencies), the wakeup path uses
>>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
>>> for better task placement. However, when those CPUs belong to SMT cores,
>>
>> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
>> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
>> used for system involving SMT like x86
> 
> It's an NVIDIA platform (not publicly available yet), where the firmware
> exposes different CPU capacities and has SMT enabled, so both
> SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> firmware release will keep this exact configuration (there's a good chance
> it will), so I'm targeting it to be prepared.


Andrea,
that makes me think: I recently played with an NVIDIA Grace available to me,
which sets slightly different CPPC highest_perf values (~2%). That will
automatically set SD_ASYM_CPUCAPACITY and run the entire capacity-aware
scheduling machinery for almost negligible capacity differences, where it's
questionable how sensible that is.
I have an arm64 + CPPC implementation of asym-packing for this machine; maybe
we can reuse that here too?
(Given that capacity and SMT really are contradicting: e.g. a physical core
with 1024 capacity but 4 threads may give you a much lower observed capacity
per (logical) CPU than a 512 nosmt core, depending on how utilized the
sibling threads are.)

> [snip]
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Andrea Righi 2 weeks, 5 days ago
Hi Christian,

On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
> On 3/18/26 10:31, Andrea Righi wrote:
> > Hi Vincent,
> > 
> > On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> >> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> >>> different per-core frequencies), the wakeup path uses
> >>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> >>> for better task placement. However, when those CPUs belong to SMT cores,
> >>
> >> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> >> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> >> used for system involving SMT like x86
> > 
> > It's an NVIDIA platform (not publicly available yet), where the firmware
> > exposes different CPU capacities and has SMT enabled, so both
> > SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> > firmware release will keep this exact configuration (there's a good chance
> > it will), so I'm targeting it to be prepared.
> 
> 
> Andrea,
> that makes me think, I've played with a nvidia grace available to me recently,
> which sets slightly different CPPC highest_perf values (~2%) which automatically
> will set SD_ASYM_CPUCAPACITY and run the entire capacity-aware scheduling
> machinery for really almost negligible capacity differences, where it's
> questionable how sensible that is.

That looks like the same system that I've been working with. I agree that
treating small CPPC differences as full asymmetry can be a bit overkill.

I've been experimenting with flattening the capacities (to force the
"regular" idle CPU selection policy), which performs better than the
current asym-capacity CPU selection. However, adding SMT awareness to the
asym-capacity selection seems to give a consistent +2-3% (same set of
CPU-intensive benchmarks) compared to flattening alone, which is not bad.

> I have an arm64 + CPPC implementation for asym-packing for this machine, maybe
> we can reuse that for here too?

Sure, that sounds interesting, if it's available somewhere I'd be happy to
do some testing.

Thanks,
-Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Christian Loehle 2 weeks, 4 days ago
On 3/18/26 17:09, Andrea Righi wrote:
> Hi Christian,
> 
> On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
>> On 3/18/26 10:31, Andrea Righi wrote:
>>> Hi Vincent,
>>>
>>> On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
>>>> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
>>>>>
>>>>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
>>>>> different per-core frequencies), the wakeup path uses
>>>>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
>>>>> for better task placement. However, when those CPUs belong to SMT cores,
>>>>
>>>> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
>>>> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
>>>> used for system involving SMT like x86
>>>
>>> It's an NVIDIA platform (not publicly available yet), where the firmware
>>> exposes different CPU capacities and has SMT enabled, so both
>>> SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
>>> firmware release will keep this exact configuration (there's a good chance
>>> it will), so I'm targeting it to be prepared.
>>
>>
>> Andrea,
>> that makes me think, I've played with a nvidia grace available to me recently,
>> which sets slightly different CPPC highest_perf values (~2%) which automatically
>> will set SD_ASYM_CPUCAPACITY and run the entire capacity-aware scheduling
>> machinery for really almost negligible capacity differences, where it's
>> questionable how sensible that is.
> 
> That looks like the same system that I've been working with. I agree that
> treating small CPPC differences as full asymmetry can be a bit overkill.
> 
> I've been experimenting with flattening the capacities (to force the
> "regular" idle CPU selection policy), which performs better than the
> current asym-capacity CPU selection. However, adding the SMT awareness to
> the asym-capacity, seems to give a consistent +2-3% (same set of
> CPU-intensive benchmarks) compared to flatening alone, which is not bad.
> 
>> I have an arm64 + CPPC implementation for asym-packing for this machine, maybe
>> we can reuse that for here too?
> 
> Sure, that sounds interesting, if it's available somewhere I'd be happy to
> do some testing.
> 
Hi Andrea,

I will clean up the asympacking code a bit and share it with you for testing.

Interestingly, when we looked at DCPerf MediaWiki, we found the exact opposite.

On NVIDIA Grace, enabling CAS due to the small CPPC highest_perf differences was
actually beneficial for the workload. More interestingly, we saw a similar uplift
on a different arm64 server without ASYM_CPUCAPACITY when we force-enabled
sched_asym_cpucap_active() even though the system was highest_perf-symmetric.
That suggests the uplift on Grace may have come from CAS-specific behavior rather
than from better selection of the highest_perf CPUs.

I'd be very curious whether something similar (the inverse, really) is happening
in your case as well: flattening the capacities but still forcing
select_idle_sibling() / sched_asym_cpucap_active() despite equal capacities. Of course,
that will also depend on the workloads (what are you testing?)

Just to illustrate, below is one example where CAS improved both score and CPU utilization:
+--------------------------+----------------------+-------------------------+-----------------------------------------+
| Platform                 | default (v6.8)       | force all CPUs = 1024   | force sched_asym_cpucap_active() = TRUE |
+--------------------------+----------------------+-------------------------+-----------------------------------------+
| arm64 symmetric (72 CPUs)| 100% (90% CPU util)  | -------------           | 104.26% (99%)                           |
| Grace (72 CPUs)          | 100% (99%)           | 99.49% (90%)            | -------------                           |
+--------------------------+----------------------+-------------------------+-----------------------------------------+
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Andrea Righi 2 weeks, 4 days ago
Hi Christian,

On Thu, Mar 19, 2026 at 11:58:39AM +0000, Christian Loehle wrote:
> On 3/18/26 17:09, Andrea Righi wrote:
> > Hi Christian,
> > 
> > On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
> >> On 3/18/26 10:31, Andrea Righi wrote:
> >>> Hi Vincent,
> >>>
> >>> On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> >>>> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> >>>>>
> >>>>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> >>>>> different per-core frequencies), the wakeup path uses
> >>>>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> >>>>> for better task placement. However, when those CPUs belong to SMT cores,
> >>>>
> >>>> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> >>>> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> >>>> used for system involving SMT like x86
> >>>
> >>> It's an NVIDIA platform (not publicly available yet), where the firmware
> >>> exposes different CPU capacities and has SMT enabled, so both
> >>> SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> >>> firmware release will keep this exact configuration (there's a good chance
> >>> it will), so I'm targeting it to be prepared.
> >>
> >>
> >> Andrea,
> >> that makes me think, I've played with a nvidia grace available to me recently,
> >> which sets slightly different CPPC highest_perf values (~2%) which automatically
> >> will set SD_ASYM_CPUCAPACITY and run the entire capacity-aware scheduling
> >> machinery for really almost negligible capacity differences, where it's
> >> questionable how sensible that is.
> > 
> > That looks like the same system that I've been working with. I agree that
> > treating small CPPC differences as full asymmetry can be a bit overkill.
> > 
> > I've been experimenting with flattening the capacities (to force the
> > "regular" idle CPU selection policy), which performs better than the
> > current asym-capacity CPU selection. However, adding the SMT awareness to
> > the asym-capacity, seems to give a consistent +2-3% (same set of
> > CPU-intensive benchmarks) compared to flatening alone, which is not bad.
> > 
> >> I have an arm64 + CPPC implementation for asym-packing for this machine, maybe
> >> we can reuse that here too?
> > 
> > Sure, that sounds interesting, if it's available somewhere I'd be happy to
> > do some testing.
> > 
> Hi Andrea,
> 
> I will clean up the asympacking code a bit and share it with you for testing.
> 
> Interestingly, when we looked at DCPerf MediaWiki, we found the exact opposite.
> 
> On NVIDIA Grace, enabling CAS due to the small CPPC highest_perf differences was
> actually beneficial for the workload. More interestingly, we saw a similar uplift
> on a different arm64 server without ASYM_CPUCAPACITY when we force-enabled
> sched_asym_cpucap_active() even though the system was highest_perf-symmetric.
> That suggests the uplift on Grace may have come from CAS-specific behavior rather
> than from better selection of the highest_perf CPUs.

Which NVIDIA Grace in particular? On GB300 ASYM_CPUCAPACITY seems to be
enabled. I can try to disable / equalize the capacities and repeat the test
there as well.

> 
> I'd be very curious whether something similar (i.e. the inverse) is happening in your
> case as well: flattening the capacities but still forcing
> select_idle_sibling() / sched_asym_cpucap_active() despite equal capacities. Of course,
> that will also depend on the workloads (what are you testing?).

I can definitely try that. I'm using an internal benchmark suite; in
particular, the benchmark showing the biggest improvement is based on the
NVBLAS library (but using the CPUs, not the GPUs). I'm not sure if it's
publicly available, I'll check.

> 
> Just to illustrate, below is one example where CAS improved both score and CPU utilization:
> +--------------------------+----------------------+-------------------------+-----------------------------------------+
> | Platform                 | default (v6.8)       | force all CPUs = 1024   | force sched_asym_cpucap_active() = TRUE |
> +--------------------------+----------------------+-------------------------+-----------------------------------------+
> | arm64 symmetric (72 CPUs)| 100% (90% CPU util)  | -------------           | 104.26% (99%)                           |
> | Grace (72 CPUs)          | 100% (99%)           | 99.49% (90%)            | -------------                           |
> +--------------------------+----------------------+-------------------------+-----------------------------------------+

I see, interesting. Now I'm curious to try the opposite on the GB300 that I
have access to: flatten the capacities to 1024 and see what I get.

Thanks,
-Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Vincent Guittot 2 weeks, 4 days ago
On Wed, 18 Mar 2026 at 18:09, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Christian,
>
> On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
> > On 3/18/26 10:31, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> > >> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > >>> different per-core frequencies), the wakeup path uses
> > >>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > >>> for better task placement. However, when those CPUs belong to SMT cores,
> > >>
> > >> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> > >> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> > >> used for system involving SMT like x86
> > >
> > > It's an NVIDIA platform (not publicly available yet), where the firmware
> > > exposes different CPU capacities and has SMT enabled, so both
> > > SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> > > firmware release will keep this exact configuration (there's a good chance
> > > it will), so I'm targeting it to be prepared.
> >
> >
> > Andrea,
> > > that makes me think: I've recently played with an NVIDIA Grace available to me,
> > > which sets slightly different CPPC highest_perf values (~2%). That automatically
> > > sets SD_ASYM_CPUCAPACITY and runs the entire capacity-aware scheduling
> > > machinery for almost negligible capacity differences, where it's
> > > questionable how sensible that is.
>
> That looks like the same system that I've been working with. I agree that
> treating small CPPC differences as full asymmetry can be a bit overkill.
>
> I've been experimenting with flattening the capacities (to force the
> "regular" idle CPU selection policy), which performs better than the
> current asym-capacity CPU selection. However, adding SMT awareness to
> the asym-capacity selection seems to give a consistent +2-3% (same set of
> CPU-intensive benchmarks) compared to flattening alone, which is not bad.

Do you mean that this patch is +2% vs plain SMP, which in turn is better
than the current asym cpucapacity implementation?

>
> > I have an arm64 + CPPC implementation for asym-packing for this machine, maybe
> > we can reuse that here too?
>
> Sure, that sounds interesting, if it's available somewhere I'd be happy to
> do some testing.
>
> Thanks,
> -Andrea
Re: [PATCH] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Posted by Andrea Righi 2 weeks, 4 days ago
Hi Vincent,

On Thu, Mar 19, 2026 at 08:20:27AM +0100, Vincent Guittot wrote:
> On Wed, 18 Mar 2026 at 18:09, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > Hi Christian,
> >
> > On Wed, Mar 18, 2026 at 03:43:26PM +0000, Christian Loehle wrote:
> > > On 3/18/26 10:31, Andrea Righi wrote:
> > > > Hi Vincent,
> > > >
> > > > On Wed, Mar 18, 2026 at 10:41:15AM +0100, Vincent Guittot wrote:
> > > >> On Wed, 18 Mar 2026 at 10:22, Andrea Righi <arighi@nvidia.com> wrote:
> > > >>>
> > > >>> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > > >>> different per-core frequencies), the wakeup path uses
> > > >>> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > > >>> for better task placement. However, when those CPUs belong to SMT cores,
> > > >>
> > > >> Interesting, which kind of system has both SMT and SD_ASYM_CPUCAPACITY
> > > >> ? I thought both were never set simultaneously and SD_ASYM_PACKING was
> > > >> used for system involving SMT like x86
> > > >
> > > > It's an NVIDIA platform (not publicly available yet), where the firmware
> > > > exposes different CPU capacities and has SMT enabled, so both
> > > > SD_ASYM_CPUCAPACITY and SMT are present. I'm not sure whether the final
> > > > firmware release will keep this exact configuration (there's a good chance
> > > > it will), so I'm targeting it to be prepared.
> > >
> > >
> > > Andrea,
> > > that makes me think: I've recently played with an NVIDIA Grace available to me,
> > > which sets slightly different CPPC highest_perf values (~2%). That automatically
> > > sets SD_ASYM_CPUCAPACITY and runs the entire capacity-aware scheduling
> > > machinery for almost negligible capacity differences, where it's
> > > questionable how sensible that is.
> >
> > That looks like the same system that I've been working with. I agree that
> > treating small CPPC differences as full asymmetry can be a bit overkill.
> >
> > I've been experimenting with flattening the capacities (to force the
> > "regular" idle CPU selection policy), which performs better than the
> > current asym-capacity CPU selection. However, adding SMT awareness to
> > the asym-capacity selection seems to give a consistent +2-3% (same set of
> > CPU-intensive benchmarks) compared to flattening alone, which is not bad.
> 
> Do you mean that this patch is +2% vs plain SMP, which in turn is better
> than the current asym cpucapacity implementation?

Yes, that's correct. More precisely:

 policy                        |  speedup %
 ------------------------------+------------
 current asym CPU capacity     |  baseline
 equal CPU capacity            |  +13.6%
 SMT-aware asym CPU capacity   |  +15.0%

Thanks,
-Andrea