When choosing which idle housekeeping CPU runs the idle load balancer,
prefer one on a fully idle core if SMT is active, so balance can migrate
work onto a CPU that still offers full effective capacity. Fall back to
any idle candidate if none qualify.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 593a89f688679..a1ee21f7b32f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
* - When one of the busy CPUs notices that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
+ *
+ * - When SMT is active, prefer a CPU on a fully idle core as the ILB
+ * target, so that when it runs balance it becomes the destination CPU
+ * and can accept migrated tasks with full effective capacity.
*/
static inline int find_new_ilb(void)
{
const struct cpumask *hk_mask;
- int ilb_cpu;
+ int ilb_cpu, fallback = -1;
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
@@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
if (ilb_cpu == smp_processor_id())
continue;
+#ifdef CONFIG_SCHED_SMT
+ if (!idle_cpu(ilb_cpu))
+ continue;
+
+ if (fallback < 0)
+ fallback = ilb_cpu;
+
+ if (!sched_smt_active() || is_core_idle(ilb_cpu))
+ return ilb_cpu;
+#else
if (idle_cpu(ilb_cpu))
return ilb_cpu;
+#endif
}
- return -1;
+ return fallback;
}
/*
--
2.53.0
On 3/26/26 8:32 PM, Andrea Righi wrote:
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
is_core_idle() loops over all siblings, and nohz.idle_cpus_mask
will likely have all siblings set.
So that might turn out to be a bit expensive on large SMT systems such as SMT=4.
Also, this runs with interrupts disabled.
Will try to run this on a powerpc system and see if simple benchmarks show anything.
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
This one isn't straightforward for me. The ILB CPU will check all
other idle CPUs first and finish with itself, so unless the next CPU in
the idle_cpus_mask is a sibling, this should not make a difference.
Did you see any perf diff?
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
you can probably get rid of the CONFIG and put this special case below
sched_smt_active()
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
> --
> 2.53.0
>
Hi Vincent,
On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > When choosing which idle housekeeping CPU runs the idle load balancer,
> > prefer one on a fully idle core if SMT is active, so balance can migrate
> > work onto a CPU that still offers full effective capacity. Fall back to
> > any idle candidate if none qualify.
>
> This one isn't straightforward for me. The ilb cpu will check all
> other idle CPUs 1st and finish with itself so unless the next CPU in
> the idle_cpus_mask is a sibling, this should not make a difference
>
> Did you see any perf diff ?
I actually see a benefit. In particular, with the first patch applied I see
a ~1.76x speedup; if I add this on top I get a ~1.9x speedup vs baseline,
which seems pretty consistent across runs (definitely not in error range).

The intention with this change was to minimize SMT noise by running the ILB
code on a fully idle core when possible, but I also didn't expect to see
such a big difference.

I'll investigate more to better understand what's happening.
>
>
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 593a89f688679..a1ee21f7b32f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> > * - When one of the busy CPUs notices that there may be an idle rebalancing
> > * needed, they will kick the idle load balancer, which then does idle
> > * load balancing for all the idle CPUs.
> > + *
> > + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> > + * target, so that when it runs balance it becomes the destination CPU
> > + * and can accept migrated tasks with full effective capacity.
> > */
> > static inline int find_new_ilb(void)
> > {
> > const struct cpumask *hk_mask;
> > - int ilb_cpu;
> > + int ilb_cpu, fallback = -1;
> >
> > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> >
> > @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> > if (ilb_cpu == smp_processor_id())
> > continue;
> >
> > +#ifdef CONFIG_SCHED_SMT
>
> you can probably get rid of the CONFIG and put this special case below
> sched_smt_active()
Ah good point, will change this.
>
>
> > + if (!idle_cpu(ilb_cpu))
> > + continue;
> > +
> > + if (fallback < 0)
> > + fallback = ilb_cpu;
> > +
> > + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> > + return ilb_cpu;
> > +#else
> > if (idle_cpu(ilb_cpu))
> > return ilb_cpu;
> > +#endif
> > }
> >
> > - return -1;
> > + return fallback;
> > }
> >
> > /*
> > --
> > 2.53.0
> >
Thanks,
-Andrea
Hello Andrea,

On 3/27/2026 3:14 PM, Andrea Righi wrote:
> Hi Vincent,
>
> On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
>> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>>>
>>> When choosing which idle housekeeping CPU runs the idle load balancer,
>>> prefer one on a fully idle core if SMT is active, so balance can migrate
>>> work onto a CPU that still offers full effective capacity. Fall back to
>>> any idle candidate if none qualify.
>>
>> This one isn't straightforward for me. The ilb cpu will check all
>> other idle CPUs 1st and finish with itself so unless the next CPU in
>> the idle_cpus_mask is a sibling, this should not make a difference
>>
>> Did you see any perf diff ?
>
> I actually see a benefit, in particular, with the first patch applied I see
> a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> which seems pretty consistent across runs (definitely not in error range).
>
> The intention with this change was to minimize SMT noise running the ILB
> code on a fully-idle core when possible, but I also didn't expect to see
> such big difference.
>
> I'll investigate more to better understand what's happening.

Interesting! Either this "CPU-intensive workload" hates SMT turning
busy (but to an extent where performance drops visibly?) or ILB
keeps getting interrupted on an SMT sibling that is burdened by
interrupts leading to slower balance (or IRQs driving the workload
being delayed by rq_lock disabling them)

Would it be possible to share the total SCHED_SOFTIRQ time, load
balancing attempts, and utilization with and without the patch? I too
will go queue up some runs to see if this makes a difference.

--
Thanks and Regards,
Prateek
On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates SMT turning
> busy (but to an extent where performance drops visibly?) or ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them)
Alright, I dug a bit deeper into what's going on.
In this case, the workload showing the large benefit (the NVBLAS benchmark)
is running exactly one task per SMT core, all pinned to NUMA node 0. The
system has two nodes, so node 1 remains mostly idle.
With the SMT-aware select_idle_capacity(), tasks get distributed across SMT
cores in a way that avoids placing them on busy siblings, which is nice, and
it's the part that gives most of the speedup.
However, without this ILB patch, find_new_ilb() always picks a CPU with a
busy sibling on node 0, because for_each_cpu_and() always starts from the
lower CPU IDs. As a result, the ILB always ends up running on a CPU whose
sibling is running a CPU-intensive worker, and the two disrupt each other's
performance.
As an experiment, I tried something silly like the following, biasing the
ILB selection toward node 1 (node0 = 0-87,176-263, node1 = 88-177,264-351):
struct cpumask tmp;
cpumask_and(&tmp, nohz.idle_cpus_mask, hk_mask);
for_each_cpu_wrap(ilb_cpu, &tmp, nr_cpu_ids / 4) {
if (ilb_cpu == smp_processor_id())
continue;
if (idle_cpu(ilb_cpu))
return ilb_cpu;
}
And I get pretty much the same speedup (slightly better, actually, because I
always get an idle CPU in one step, since node 1 is always idle with this
particular benchmark).
So, in this particular scenario this patch makes sense, because we
avoid the "SMT contention" at very low cost. In general, I think the
benefit can be quite situational. It could still make sense to have it: the
extra overhead is limited to an additional is_core_idle() check over idle &
HK candidates (worst case), which could be worthwhile if it reduces
interference from busy SMT siblings.
What do you think?
Thanks,
-Andrea
On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates SMT turning
> busy (but to an extent where performance drops visibly?) or ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them)
>
> Would it be possible to share the total SCHED_SOFTIRQ time, load
> balancing attempts, and utlization with and without the patch? I too
> will go queue up some runs to see if this makes a difference.
Quick update: I also tried this on a Vera machine with a firmware that
exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
disabled and SMT still on of course) and I see similar performance
benefits.
Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
differences, all within error range (results produced using a vibe-coded
python script):
- baseline (stats/sec):
SCHED softirq count : 2,625
LB attempts (total) : 69,832
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 68,482 [balanced=68,472 failed=9]
CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 902 [balanced=900 failed=2]
CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 448 [balanced=441 failed=7]
CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
- with ilb-smt (stats/sec):
SCHED softirq count : 2,671
LB attempts (total) : 68,572
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 67,239 [balanced=67,197 failed=41]
CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 833 [balanced=833 failed=0]
CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 500 [balanced=488 failed=12]
CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
I'll add more direct instrumentation to check what ILB is doing
differently...
And I'll also repeat the test and collect the same metrics on the Vera
machine with the firmware that exposes different CPU capacities as soon as
I get access again.
Thanks,
-Andrea
On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> > >>> work onto a CPU that still offers full effective capacity. Fall back to
> > >>> any idle candidate if none qualify.
> > >>
> > >> This one isn't straightforward for me. The ilb cpu will check all
> > >> other idle CPUs 1st and finish with itself so unless the next CPU in
> > >> the idle_cpus_mask is a sibling, this should not make a difference
> > >>
> > >> Did you see any perf diff ?
> > >
> > > I actually see a benefit, in particular, with the first patch applied I see
> > > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > > which seems pretty consistent across runs (definitely not in error range).
> > >
> > > The intention with this change was to minimize SMT noise running the ILB
> > > code on a fully-idle core when possible, but I also didn't expect to see
> > > such big difference.
> > >
> > > I'll investigate more to better understand what's happening.
> >
> > Interesting! Either this "CPU-intensive workload" hates SMT turning
> > busy (but to an extent where performance drops visibly?) or ILB
> > keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts leading to slower balance (or IRQs driving the workload
> > being delayed by rq_lock disabling them)
> >
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utlization with and without the patch? I too
> > will go queue up some runs to see if this makes a difference.
>
> Quick update: I also tried this on a Vera machine with a firmware that
> exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on of course) and I see similar performance
> benefits.
>
> Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
> differences, all within error range (results produced using a vibe-coded
> python script):
>
> - baseline (stats/sec):
>
> SCHED softirq count : 2,625
> LB attempts (total) : 69,832
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 68,482 [balanced=68,472 failed=9]
> CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 902 [balanced=900 failed=2]
> CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 448 [balanced=441 failed=7]
> CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
>
> - with ilb-smt (stats/sec):
>
> SCHED softirq count : 2,671
> LB attempts (total) : 68,572
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 67,239 [balanced=67,197 failed=41]
> CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 833 [balanced=833 failed=0]
> CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 500 [balanced=488 failed=12]
> CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
>
> I'll add more direct instrumentation to check what ILB is doing
> differently...
More data.
== SMT contention ==
tracepoint:sched:sched_switch
{
if (args->next_pid != 0) {
@busy[cpu] = 1;
} else {
delete(@busy[cpu]);
}
}
tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
$sib = (cpu + 176) % 352;
if (@busy[$sib]) {
@smt_contention++;
} else {
@smt_no_contention++;
}
}
END
{
printf("smt_contention %lld\n", (int64)@smt_contention);
printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
- baseline:
@smt_contention: 1103
@smt_no_contention: 3815
- ilb-smt:
@smt_contention: 937
@smt_no_contention: 4459
== ILB duration ==
- baseline:
@ilb_duration_us:
[0] 147 | |
[1] 354 |@ |
[2, 4) 739 |@@@ |
[4, 8) 3040 |@@@@@@@@@@@@@@@@ |
[8, 16) 9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 1267 |@@@@@@ |
[64, 128) 1607 |@@@@@@@@ |
[128, 256) 2222 |@@@@@@@@@@@ |
[256, 512) 2326 |@@@@@@@@@@@@ |
[512, 1K) 141 | |
[1K, 2K) 37 | |
[2K, 4K) 7 |
- ilb-smt:
@ilb_duration_us:
[0] 79 | |
[1] 137 | |
[2, 4) 1440 |@@@@@@@@@@ |
[4, 8) 2897 |@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 2390 |@@@@@@@@@@@@@@@@ |
[64, 128) 2254 |@@@@@@@@@@@@@@@ |
[128, 256) 2731 |@@@@@@@@@@@@@@@@@@@ |
[256, 512) 1083 |@@@@@@@ |
[512, 1K) 265 |@ |
[1K, 2K) 29 | |
[2K, 4K) 5 | |
== rq_lock hold ==
- baseline:
@lb_rqlock_hold_us:
[0] 664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 77446 |@@@@@@ |
[2, 4) 25044 |@ |
[4, 8) 19847 |@ |
[8, 16) 2434 | |
[16, 32) 605 | |
[32, 64) 308 | |
[64, 128) 38 | |
[128, 256) 2 | |
- ilb-smt:
@lb_rqlock_hold_us:
[0] 229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 26989 |@@@@@@ |
[4, 8) 48034 |@@@@@@@@@@ |
[8, 16) 1919 | |
[16, 32) 2236 | |
[32, 64) 595 | |
[64, 128) 135 | |
[128, 256) 27 | |
From what I see, ILB runs are more expensive, but I still don't see why I'm
getting the speedup with this ilb-smt patch. I'll keep investigating...
-Andrea