I'm trying to deliver on my overdue promise of redefining the overutilized state.
My investigation basically led to the conclusion that redefining the
overutilized state brings very few hard improvements, while it comes with at
least some risk of regressing platform and workload combinations I might have
overlooked, therefore I only concentrate on one change, the least
controversial, for now.
When a task is alone on a max-cap CPU there's no reason to let it trigger OU,
because it will only ever be placed on another max-cap CPU anyway. We therefore
skip setting overutilized in this scenario, but in a careful way: OU is still
triggered if there is any other task on the CPU, or if the capacity is (usually
temporarily) reduced because of system or thermal pressure.
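For illustration, the intended condition could look roughly like the below
(a sketch only, not the actual hunk; it assumes max-cap CPUs report
SCHED_CAPACITY_SCALE and that the check sits where the rd overutilized update
happens, e.g. check_update_overutilized_status()):

	/*
	 * Sketch only: a lone task on an unpressured max-capacity CPU cannot
	 * be placed anywhere better, so don't let it flip the root domain to
	 * overutilized. Any pressure (capacity_of() below the original
	 * capacity) or a second task still triggers OU as before.
	 */
	if (rq->nr_running == 1 &&
	    arch_scale_cpu_capacity(cpu_of(rq)) == SCHED_CAPACITY_SCALE &&
	    capacity_of(cpu_of(rq)) == arch_scale_cpu_capacity(cpu_of(rq)))
		return;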
On platforms common in phones this strategy didn't prove useful: even a single
such task is already the majority of the phone's thermal (or even power)
budget, so the situation isn't very stable and continuing to attempt EAS on
the other CPUs seemed unnecessary.
OTOH there are more and more systems (e.g. apple silicon,
radxa orion o6, x86 hybrids) where such a situation can be sustained and which
also have many more max-cap CPUs, so there are more possibilities for the
patch to trigger.
For further information and the OSPM discussion see:
https://www.youtube.com/watch?v=N0tZ8GhhQzc
Radxa orion o6 (capacities: 1024, 279, 279, 279, 279, 905, 905, 866, 866, 984, 984, 1024):
Mean of 10 Geekbench 6.3 iterations (all values are the mean)
+------------+--------+--------+-------+-------------+
| Test       | kernel | score  | OU %  | OU triggers |
+------------+--------+--------+-------+-------------+
| GB6 Single | patch  | 1182.4 | 26.14 | 1942.4      |
| GB6 Single | base   | 1186.9 | 71.23 | 573.0       |
+------------+--------+--------+-------+-------------+
| GB6 Multi  | patch  | 5227.7 | 44.11 | 984.5       |
| GB6 Multi  | base   | 5395.6 | 53.17 | 773.1       |
+------------+--------+--------+-------+-------------+
(OU triggers are overutilized rd 0->1 transitions)
GB6 Multi score stdev is 43 for base.
RK3399 ((384, 384, 384, 384) (1024, 1024))
stress-ng --cpu X --timeout 60s
Mean of 10 iterations
+-----------+--------+-------+-------------+
| stress-ng | kernel | OU %  | OU triggers |
+-----------+--------+-------+-------------+
| 1x        | patch  | 0.01  | 10.5        |
| 1x        | base   | 99.7  | 4.4         |
+-----------+--------+-------+-------------+
| 2x        | patch  | 0.01  | 13.8        |
| 2x        | base   | 99.7  | 5.3         |
+-----------+--------+-------+-------------+
| 3x        | patch  | 99.8  | 4.1         |
| 3x        | base   | 99.8  | 4.6         |
+-----------+--------+-------+-------------+
(System only has 2 1024-capacity CPUs, so for 3x stress-ng
patch and base are intended to behave the same.)
M1 Pro ((485, 485) (1024, 1024, 1024) (1024, 1024, 1024))
(backported to the 6.17-based asahi kernel)
+-----------+--------+-------+-------------+
| stress-ng | kernel | OU %  | OU triggers |
+-----------+--------+-------+-------------+
| 1x        | patch  | 8.26  | 432.0       |
| 1x        | base   | 99.14 | 4.2         |
+-----------+--------+-------+-------------+
| 2x        | patch  | 8.79  | 470.2       |
| 2x        | base   | 99.21 | 3.8         |
+-----------+--------+-------+-------------+
| 4x        | patch  | 8.99  | 475.2       |
| 4x        | base   | 99.17 | 4.6         |
+-----------+--------+-------+-------------+
| 6x        | patch  | 8.81  | 478.8       |
| 6x        | base   | 99.14 | 5.0         |
+-----------+--------+-------+-------------+
| 7x        | patch  | 99.21 | 4.0         |
| 7x        | base   | 99.27 | 4.2         |
+-----------+--------+-------+-------------+
Mean of 20 Geekbench 6.3 iterations
+------------+--------+---------+-------+-------------+
| Test       | kernel | score   | OU %  | OU triggers |
+------------+--------+---------+-------+-------------+
| GB6 Single | patch  | 2296.9  | 3.99  | 669.4       |
| GB6 Single | base   | 2295.8  | 50.06 | 28.4        |
+------------+--------+---------+-------+-------------+
| GB6 Multi  | patch  | 10621.8 | 18.77 | 636.4       |
| GB6 Multi  | base   | 10686.8 | 28.72 | 66.8        |
+------------+--------+---------+-------+-------------+
Energy numbers are trace-based (lisa.estimate_from_trace()):
GB6 Single -12.63% energy average (equal score)
GB6 Multi +1.76% energy average (for equal score runs)
No changes observed with Geekbench 6 on a 6.12-based Pixel 6 with the patch backported.
Functional test:
Using the above described M1 Pro I created an rt-app workload [1]:
Workload:
- tskbusy: periodic 100% duty, period 1s, duration 10s (single always-running task)
- tsk_{a..d}: periodic 5% duty, 16ms period, duration 10s (four small periodic tasks)
Target system: 8 CPUs (0-7), 2 little (cpu0 & cpu1), 6 big
Metric: per-task CPU residency (seconds) over the 10s run
OU metric: time spent in overutilized state / total time; Number of
OU 0->1 transitions (triggers).
Case A Mainline:
Small task CPU residency (s), 10s run
task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
tsk_a 0.124 0.000 0.000 0.000 0.035 1.791 0.492 0.001 2.444
tsk_b 0.002 0.000 0.500 0.000 0.000 0.001 0.004 0.000 0.507
tsk_c 0.000 0.000 0.000 0.000 0.001 0.000 1.895 0.630 2.526
tsk_d 0.000 0.389 0.001 0.000 0.450 0.000 0.000 0.000 0.840
(Little CPUs 0 & 1 rarely get picked for the small tasks due to CAS' task
placement, which isn't deterministically "always pick the big CPUs", but since
big CPUs make up 6 of the 8 this is the common case.)
Overutilized:
- OU time = 10.0s / 11.0s (ratio 0.909)
- OU triggers = 7
Case B Patch:
Small task CPU residency (s), 10s run
task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
tsk_a 0.055 1.907 0.006 0.012 0.002 0.001 0.000 0.005 1.987
tsk_b 1.845 0.115 0.014 0.000 0.004 0.002 0.000 0.000 1.981
tsk_c 0.914 1.069 0.007 0.000 0.004 0.005 0.000 0.000 1.999
tsk_d 1.000 0.985 0.004 0.005 0.000 0.000 0.000 0.000 1.995
Overutilized:
- OU time = 0.1s / 11.2s (ratio 0.007)
- OU triggers = 57
(Little CPUs 0 & 1 get picked by the vast majority of wakeups and aren't migrated
to the big CPUs.)
[1]
LISA's RTApp workload generation description:
# Import path assumed from LISA's wlgen API:
from lisa.wlgen.rta import RTAPhase, PeriodicWload

rtapp_profile = {
    # Single always-running task: 100% duty cycle, 1s period, 10s duration
    'tskbusy': RTAPhase(
        prop_wload=PeriodicWload(
            duty_cycle_pct=100,
            period=1,
            duration=10,
        )
    ),
    # Four small periodic tasks: 5% duty cycle, 16ms period, 10s duration
    'tsk_a': RTAPhase(
        prop_wload=PeriodicWload(
            duty_cycle_pct=5,
            period=16e-3,
            duration=10,
        )
    ),
    'tsk_b': RTAPhase(
        prop_wload=PeriodicWload(
            duty_cycle_pct=5,
            period=16e-3,
            duration=10,
        )
    ),
    'tsk_c': RTAPhase(
        prop_wload=PeriodicWload(
            duty_cycle_pct=5,
            period=16e-3,
            duration=10,
        )
    ),
    'tsk_d': RTAPhase(
        prop_wload=PeriodicWload(
            duty_cycle_pct=5,
            period=16e-3,
            duration=10,
        )
    ),
}
Christian Loehle (1):
sched/fair: Ignore OU for lone task on max-cap CPU
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)
--
2.34.1
On 12/30/25 09:30, Christian Loehle wrote:
> I'm trying to deliver on my overdue promise of redefining overutilized state.
> My investigation basically lead to redefinition of overutilized state
> bringing very little hard improvements, while it comes with at least
> some risk of worsening platforms and workload combinations I might've
> overlooked, therefore I only concentrate on one, the least
> controversial, for now.
What are the controversial bits?
This is a step forward, but not sure it is in the right direction. The concept
of a *cpu* being overutilized === rd is overutilized no longer makes sense
since misfit was decoupled from this logic which was the sole reason to
require this check at CPU level. Overutilized state is, rightly, set at the
rootdomain level. And the check makes sense to be done at that level too by
traversing the perf domains and seeing if we are in a state that requires
moving tasks around. Which should be done in update_{sg,sd}_lb_stats() logic
only.
I guess the difficult question (which might be what you're referring to as
controversial), is at what point we can no longer pack (use EAS) and must
distribute tasks around?
I think this question is limited by what the lb can do today. With push lb,
I believe the current global lb is likely to be unnecessary in small systems
(single LLC) since it can shuffle things around immediately to handle misfit
and overload.
On top of that, what can the existing global lb do? I am not sure to be honest.
The system has to have a number of long running tasks > num_cpus for it to be
useful. But given util signal will lose its meaning under these circumstances,
I am not sure the global lb can do a better job than push lb trying to move
these tasks around. But it could do a more comprehensive job in one go? I'll
defer to Vincent, he's probably more able to answer this off the top of his
head. But the answer to this question is the key to how we want to define this
*system is overutilized* state.
Assuming this is on top of push lb, I believe something like below which will
trigger overutilized only if all cpus are overutilized (ie system is nearly
maxed out (has 20% or less headroom)) is a good starting point at least.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..ba08f4aefa03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6814,17 +6814,6 @@ static inline void set_rd_overutilized(struct root_domain *rd, bool flag)
trace_sched_overutilized_tp(rd, flag);
}
-static inline void check_update_overutilized_status(struct rq *rq)
-{
- /*
- * overutilized field is used for load balancing decisions only
- * if energy aware scheduler is being used
- */
-
- if (!is_rd_overutilized(rq->rd) && cpu_overutilized(rq->cpu))
- set_rd_overutilized(rq->rd, 1);
-}
-
/* Runqueue only has SCHED_IDLE tasks enqueued */
static int sched_idle_rq(struct rq *rq)
{
@@ -6968,23 +6957,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
- /*
- * Since new tasks are assigned an initial util_avg equal to
- * half of the spare capacity of their CPU, tiny tasks have the
- * ability to cross the overutilized threshold, which will
- * result in the load balancer ruining all the task placement
- * done by EAS. As a way to mitigate that effect, do not account
- * for the first enqueue operation of new tasks during the
- * overutilized flag detection.
- *
- * A better way of solving this problem would be to wait for
- * the PELT signals of tasks to converge before taking them
- * into account, but that is not straightforward to implement,
- * and the following generally works well enough in practice.
- */
- if (!task_new)
- check_update_overutilized_status(rq);
-
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
@@ -10430,8 +10402,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
- if (cpu_overutilized(i))
- *sg_overutilized = 1;
+ *sg_overutilized &= cpu_overutilized(i);
/*
* No need to call idle_cpu() if nr_running is not 0
@@ -11087,7 +11058,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- bool sg_overloaded = 0, sg_overutilized = 0;
+ bool sg_overloaded = 0, sg_overutilized = 1;
do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -13378,7 +13349,6 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
task_tick_numa(rq, curr);
update_misfit_status(curr, rq);
- check_update_overutilized_status(task_rq(curr));
task_tick_core(rq, curr);
}
On 1/13/26 13:11, Qais Yousef wrote:
> On 12/30/25 09:30, Christian Loehle wrote:
>> I'm trying to deliver on my overdue promise of redefining overutilized state.
>> My investigation basically lead to redefinition of overutilized state
>> bringing very little hard improvements, while it comes with at least
>> some risk of worsening platforms and workload combinations I might've
>> overlooked, therefore I only concentrate on one, the least
>> controversial, for now.
>
> What are the controversial bits?
>
> This is a step forward, but not sure it is in the right direction. The concept
> of a *cpu* being overutilized === rd is overutilized no longer makes sense
> since misfit was decoupled from this logic which was the sole reason to
> require this check at CPU level. Overutilized state is, rightly, set at the
> rootdomain level. And the check makes sense to be done at that level too by
> traversing the perf domains and seeing if we are in a state that requires
> moving tasks around. Which should be done in update_{sg,sd}_lb_stats() logic
> only.
>
> I guess the difficult question (which might be what you're referring to as
> controversial), is at what point we can no longer pack (use EAS) and must
> distribute tasks around?
And those are precisely the 'controversial bits'; I didn't want to touch them
with this patch specifically.
A more holistic redefinition of OU is still on the table, but it needs to
a) Still fulfill the requirements we want from it (guarantee of accurate PELT
values because compute capacity was 'always' provided, switching to throughput
maximization when needed).
b) Provide sufficient testing to convince us of not regressing anything majorly
on the quite diverse EAS platforms we have today.
I think $SUBJECT does a) and b) well, but of course it's for improving a
specific set of systems and doesn't address the issues with OU that have been
named in the past.
>
> I think this question is limited by what the lb can do today. With push lb,
> I believe the current global lb is likely to be unnecessary in small systems
> (single LLC) since it can shuffle things around immediately to handle misfit
> and overload.
>
> On top of that, what can the existing global lb do? I am not sure to be honest.
> The system has to have a number of long running tasks > num_cpus for it to be
> useful. But given util signal will lose its meaning under these circumstances,
> I am not sure the global lb can do a better job than push lb trying to move
> these tasks around. But it could do a more comprehensive job in one go? I'll
> defer to Vincent, he probably more able to answer this from the top of his
> head. But the answer to this question is the key to how we want to define this
> *system* is overutilized state.
>
> Assuming this is on top of push lb, I believe something like below which will
> trigger overutilized only if all cpus are overutilized (ie system is nearly
> maxed out (has 20% or less headroom)) is a good starting point at least.
It's an approach, but it needs a lot of data to convince everyone that
push lb + much more liberal OU state outperforms current global LB OU.
Given this is not really about defining OU in a final state, any comments from
you and Vincent on $SUBJECT and the problem it's addressing would be
much appreciated!
> [snip]
On 01/15/26 11:17, Christian Loehle wrote:
> On 1/13/26 13:11, Qais Yousef wrote:
> > On 12/30/25 09:30, Christian Loehle wrote:
> >> I'm trying to deliver on my overdue promise of redefining overutilized state.
> >> My investigation basically lead to redefinition of overutilized state
> >> bringing very little hard improvements, while it comes with at least
> >> some risk of worsening platforms and workload combinations I might've
> >> overlooked, therefore I only concentrate on one, the least
> >> controversial, for now.
> >
> > What are the controversial bits?
> >
> > This is a step forward, but not sure it is in the right direction. The concept
> > of a *cpu* being overutilized === rd is overutilized no longer makes sense
> > since misfit was decoupled from this logic which was the sole reason to
> > require this check at CPU level. Overutilized state is, rightly, set at the
> > rootdomain level. And the check makes sense to be done at that level too by
> > traversing the perf domains and seeing if we are in a state that requires
> > moving tasks around. Which should be done in update_{sg,sd}_lb_stats() logic
> > only.
> >
> > I guess the difficult question (which might be what you're referring to as
> > controversial), is at what point we can no longer pack (use EAS) and must
> > distribute tasks around?
>
> And that is precisely the 'controversial bits', I didn't want to touch them
> with this patch specifically.
What makes it controversial? I don't think it is really controversial. Maybe
you're referring to some offline discussion or I missed something on the list.
The concept of a *cpu* being overutilized doesn't make sense for the purpose of
this overutilized state. Trying to make it better in some particular case is
not really moving us in the right direction. And not discussing what's the
right direction doesn't move us in any direction :)
> A more holistic redefinition of OU is still on the table, but it needs to
> a) Still fulfill the requirements we want from it (guarantee of accurate PELT
> values because compute capacity was 'always' provided, switching to throughput
> maximization when needed).
I think the PELT is very inaccurate :-) You saw my talk about invariance and
black hole effect?
If you view this problem as one of PELT accuracy, that is a problem. The code
was tightly coupled to the misfit logic, which it has since been decoupled
from, and these cpu_overutilized() checks are overzealous and can be safely
removed from many locations to start with. We need to focus on the concept of
the system being overutilized. Even if you keep the current logic as-is, just
move the checks to the right place: where we decide to do load balance and
take the global view of the system's state, not on context switch etc., which
I think was there to help misfit trigger?
> b) Provide sufficient testing to convince us of not regressing anything majorly
> on the quite diverse EAS platforms we have today.
I don't think the testing effort is that hard really. Things that need
multi-core performance should give us indications. GB MT is one of them, but
you can try speedometer with code compilation (limited to fewer cores than
NCPUS) in the background for instance to see how much this affects the score.
I'd agree it is hard if we don't know under what conditions it is supposed to
help. Which is my main point here. It is supposed to be useful under specific
scenarios only. And these scenarios are NOT tied to cpu state, but global
system state. And I think we can reason about them.
When is packing on a PD worse than distributing? It is definitely not when
a CPU is saturated. feec() has improved a lot over the years and does
distribute load a lot better than in its earlier days. The question is when
does it fail?
I think under a few scenarios:
1. Number of tasks >> number of cpus
2. Many of these tasks are long running and won't sleep and wake up again for
feec()/the wakeup path to distribute them again.
It will then need to help move those long running tasks to idle cpus as they
become available. But if tasks are sleeping and waking up then they'd be
distributed without any additional help. If not, the fix is to make the wakeup
path smarter.
It also can help when there is no idle time but many tasks keep waking up.
Some tasks can get stuck enqueued for a long time where we can have nr_running
high on one cpu, but I'd argue we have issues with the wakeup path packing when
the system is loaded. Even feec() shouldn't do that. Still, lb is useful
because enqueued tasks can't go to sleep, even if they only need to run for a
short time, if they are not given a chance to run.
Do you have other scenarios in mind? I think breaking the problem down based on
benefits would help advance the code and clarify what tests are required to
satisfy us that it behaves correctly. You seem to imply that we can't know
where it is supposed to help, and therefore can't sufficiently test that it is
not a problem, and here is where I disagree. We should be able to quantify and
demonstrate where it should help.
>
> I think $SUBJECT does a) and b) well, but of course it's for improving a
> specific set of systems and doesn't address the issues with OU that have been
> named in the past.
>
> >
> > I think this question is limited by what the lb can do today. With push lb,
> > I believe the current global lb is likely to be unnecessary in small systems
> > (single LLC) since it can shuffle things around immediately to handle misfit
> > and overload.
> >
> > On top of that, what can the existing global lb do? I am not sure to be honest.
> > The system has to have a number of long running tasks > num_cpus for it to be
> > useful. But given util signal will lose its meaning under these circumstances,
> > I am not sure the global lb can do a better job than push lb trying to move
> > these tasks around. But it could do a more comprehensive job in one go? I'll
> > defer to Vincent, he probably more able to answer this from the top of his
> > head. But the answer to this question is the key to how we want to define this
> > *system* is overutilized state.
> >
> > Assuming this is on top of push lb, I believe something like below which will
> > trigger overutilized only if all cpus are overutilized (ie system is nearly
> > maxed out (has 20% or less headroom)) is a good starting point at least.
>
> It's an approach, but it needs a lot of data to convince everyone that
> push lb + much more liberal OU state outperforms current global LB OU.
>
> Given this is not really about defining OU in a final state, any comments from
> you and Vincent on $SUBJECT and the problem it's addressing would be
> much appreciated!
I think you're avoiding the problem. And testing effort is not really that
different in both cases IMO.
In my view generally our load balancer is not great and very slow to react.
I do believe the push lb will make this overutilized state completely
unnecessary. But we shall see :)
I am not a fan of this band aid. As I said, it makes things better but doesn't
move us in the right direction, and I'd rather see discussion on the latter.
Burying that discussion with "we'll do it later" and "it's controversial" is
what concerns me the most and makes me not keen on taking this small
improvement.
But if Peter or Vincent would see it helpful no real objection from me FWIW.
I just think it's not hard to do better.
Cheers
--
Qais Yousef
On 12/30/25 10:30, Christian Loehle wrote:
> I'm trying to deliver on my overdue promise of redefining overutilized state.
> My investigation basically lead to redefinition of overutilized state
> bringing very little hard improvements, while it comes with at least
> some risk of worsening platforms and workload combinations I might've
> overlooked, therefore I only concentrate on one, the least
> controversial, for now.
> When a task is alone on a max-cap CPU there's no reason to let it
> trigger OU because it will only ever be placed on another max-cap CPU,
> as such we skip setting overutilized in such a scenario in a careful
> way, namely still letting it trigger if there's any other task or the
> capacity is (usually temporarily) reduced because of system or thermal
> pressure.
> On platforms common in phones this strategy didn't prove useful, as
> even one such a task would already be the majority of the phones'
> thermal (or even power budget) and therefore such a situation not being
> very stable and continuing to attempt EAS on the other CPUs seemed
> unnecessary.
> OTOH there are more and more systems (e.g. apple silicon,
> radxa orion o6, x86 hybrids) where such a situation could be sustained
> and there are also many more max-cap CPUs, so more possibilites for the
> patch to trigger.
>
> For further information and the OSPM discussion see:
> https://www.youtube.com/watch?v=N0tZ8GhhQzc
>
> Radxa orion o6 (capacities: 1024, 279, 279, 279, 279, 905, 905, 866, 866, 984, 984, 1024):
> Mean of 10 Geekbench6.3 iterations (all values are the mean)
> +------------+--------+---------+-------+--------------+
> | Test | patch | score | OU % | OU triggers |
> +------------+--------+---------+-------+--------------+
> | GB6 Single | patch | 1182.4 | 26.14 | 1942.4 |
> | GB6 Single | base | 1186.9 | 71.23 | 573.0 |
> +------------+--------+---------+-------+--------------+
> | GB6 Multi | patch | 5227.7 | 44.11 | 984.5 |
> | GB6 Multi | base | 5395.6 | 53.17 | 773.1 |
> +------------+--------+---------+-------+--------------+
> (OU triggers are overutilized rd 0->1 transitions)
Not really important, but having more/less OU transitions
should not be a criterion, right?
If the goal is to use EAS as much as possible, it would be
better to compare the number of task placement decisions
that go through EAS between the 2 versions.
(I think the numbers are convincing enough,
this is just to discuss).
> GB6 Multi score stdev is 43 for base.
>
> RK3399 ((384, 384, 384, 384)(1024, 1024))
> stress-ng --cpu X --timeout 60s
> Mean of 10 iterations
> +-----------+--------+------+--------------+
> | stress-ng | patch | OU % | OU triggers |
> +-----------+--------+------+--------------+
> | 1x | patch | 0.01 | 10.5 |
> | 1x | base | 99.7 | 4.4 |
> +-----------+--------+------+--------------+
> | 2x | patch | 0.01 | 13.8 |
> | 2x | base | 99.7 | 5.3 |
> +-----------+--------+------+--------------+
> | 3x | patch | 99.8 | 4.1 |
> | 3x | base | 99.8 | 4.6 |
> +-----------+--------+------+--------------+
> (System only has 2 1024-capacity CPUs, so for 3x stress-ng
> patch and base are intended to behave the same.)
>
> M1 Pro ((485, 485) (1024, 1024, 1024) (1024, 1024, 1024))
> (backported to the 6.17-based asahi kernel)
> +-----------+--------+-------+--------------+
> | stress-ng | patch | OU % | OU triggers |
> +-----------+--------+-------+--------------+
> | 1x | patch | 8.26 | 432.0 |
> | 1x | base | 99.14 | 4.2 |
> +-----------+--------+-------+--------------+
> | 2x | patch | 8.79 | 470.2 |
> | 2x | base | 99.21 | 3.8 |
> +-----------+--------+-------+--------------+
> | 4x | patch | 8.99 | 475.2 |
> | 4x | base | 99.17 | 4.6 |
> +-----------+--------+-------+--------------+
> | 6x | patch | 8.81 | 478.8 |
> | 6x | base | 99.14 | 5.0 |
> +-----------+--------+-------+--------------+
> | 7x | patch | 99.21 | 4.0 |
> | 7x | base | 99.27 | 4.2 |
> +-----------+--------+-------+--------------+
>
> Mean of 20 Geekbench 6.3 iterations
> +------------+--------+---------+-------+--------------+
> | Test | patch | score | OU % | OU triggers |
> +------------+--------+---------+-------+--------------+
> | GB6 Single | patch | 2296.9 | 3.99 | 669.4 |
> | GB6 Single | base | 2295.8 | 50.06 | 28.4 |
> +------------+--------+---------+-------+--------------+
> | GB6 Multi | patch | 10621.8 | 18.77 | 636.4 |
> | GB6 Multi | base | 10686.8 | 28.72 | 66.8 |
> +------------+--------+---------+-------+--------------+
>
> Energy numbers are trace-based (lisa.estimate_from_trace()):
> GB6 Single -12.63% energy average (equal score)
> GB6 Multi +1.76% energy average (for equal score runs)
Just to repeat some things that you said in another thread:
- for the GB6 Multi, it should be expected to have a slightly
  lower score as CAS gives better score in general and EAS runs
  longer with your patch.
  It is however unfortunate to get a slightly higher energy consumption.
- The focus should be put on GB6 single where the energy saving is
  greatly improved.
>
> No changes observed with geekbench6 on a Pixel 6 6.12-based with patch backported.
>
> Functional test:
> Using the above described M1 Pro I created an rt-app workload [1]:
> Workload:
> - tskbusy: periodic 100% duty, period 1s, duration 10s (single always-running task)
> - tsk_{a..d}: periodic 5% duty, 16ms period, duration 10s (four small periodic tasks)
> Target system: 8 CPUs (0-7), 2 little (cpu0 & cpu1), 6 big
> Metric: per-task CPU residency (seconds) over the 10s run
> OU metric: time spent in overutilized state / total time; Number of
> OU 0->1 transitions (triggers).
>
> Case A Mainline:
> Small task CPU residency (s), 10s run
> task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
> tsk_a 0.124 0.000 0.000 0.000 0.035 1.791 0.492 0.001 2.444
> tsk_b 0.002 0.000 0.500 0.000 0.000 0.001 0.004 0.000 0.507
> tsk_c 0.000 0.000 0.000 0.000 0.001 0.000 1.895 0.630 2.526
> tsk_d 0.000 0.389 0.001 0.000 0.450 0.000 0.000 0.000 0.840
>
> (Little CPUs 0 & 1 rarely get picked for the small tasks due to CAS' task
> placement, which isn't deterministically "always picking big CPUs", but since
> they make up 6/8 of them this is the common case.)
>
> Overutilized:
> - OU time = 10.0s / 11.0s (ratio 0.909)
> - OU triggers = 7
>
> Case B Patch:
> Small task CPU residency (s), 10s run
> task cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7 total
> tsk_a 0.055 1.907 0.006 0.012 0.002 0.001 0.000 0.005 1.987
> tsk_b 1.845 0.115 0.014 0.000 0.004 0.002 0.000 0.000 1.981
> tsk_c 0.914 1.069 0.007 0.000 0.004 0.005 0.000 0.000 1.999
> tsk_d 1.000 0.985 0.004 0.005 0.000 0.000 0.000 0.000 1.995
>
> Overutilized:
> - OU time = 0.1s / 11.2s (ratio 0.007)
> - OU triggers = 57
>
> (Little CPUs 0 & 1 get picked by the vast majority of wakeups and aren't migrated
> to the big CPUs.)
>
>
> [1]
> LISA's RTApp workload generation description:
>
> rtapp_profile = {
> f'tskbusy': RTAPhase(
> prop_wload=PeriodicWload(
> duty_cycle_pct=100,
> period=1,
> duration=10,
> )
> ),
> f'tsk_a': RTAPhase(
> prop_wload=PeriodicWload(
> duty_cycle_pct=5,
> period=16e-3,
> duration=10,
> )
> ),
> f'tsk_b': RTAPhase(
> prop_wload=PeriodicWload(
> duty_cycle_pct=5,
> period=16e-3,
> duration=10,
> )
> ),
> f'tsk_c': RTAPhase(
> prop_wload=PeriodicWload(
> duty_cycle_pct=5,
> period=16e-3,
> duration=10,
> )
> ),
> f'tsk_d': RTAPhase(
> prop_wload=PeriodicWload(
> duty_cycle_pct=5,
> period=16e-3,
> duration=10,
> )
> )
> }
>
> Christian Loehle (1):
> sched/fair: Ignore OU for lone task on max-cap CPU
>
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
On 1/9/26 17:12, Pierre Gondois wrote:
> On 12/30/25 10:30, Christian Loehle wrote:
>> [snip]
>> (OU triggers are overutilized rd 0->1 transitions)
>
> Not really important, but having more/less OU transitions
> should not be a criterion, right?
> If the goal is to use EAS as much as possible, it would be
> better to compare the number of task placement decisions
> that go through EAS between the 2 versions.
>
> (I think the numbers are convincing enough,
> this is just to discuss).

I agree that the number of OU transitions / triggers shouldn't really be
relevant. I focused on the number of find_energy_efficient_cpu() calls skipped
due to OU, but those were almost always very well correlated to the ratio of
non-OU to OU, so I focused on that here.
There was some discussion around the # triggers though too, so I included it.

>> [snip]
>> Energy numbers are trace-based (lisa.estimate_from_trace()):
>> GB6 Single -12.63% energy average (equal score)
>> GB6 Multi +1.76% energy average (for equal score runs)
>
> Just to repeat some things that you said in another thread:
> - for the GB6 Multi, it should be expected to have a slightly
>   lower score as CAS gives better score in general and EAS runs
>   longer with your patch.
>   It is however unfortunate to get a slightly higher energy consumption.
> - The focus should be put on GB6 single where the energy saving is
>   greatly improved.

Agreed, thanks!

> [snip]