sched/fair: Revert boost in cpu_util()

[PATCH] sched/fair: Revert boost in cpu_util()

Posted by hongyan.xia(夏弘彦) 1 week ago

From: Hongyan Xia <hongyan.xia@transsion.com>

We have seen a massive power consumption regression (20% SoC power
increase in many apps) after updating our kernel. After bisection we
pinpointed the regression to the cpu_util(boost) feature. After
reverting the boost feature the massive energy regression is gone.
Detailed trace analysis down below. The regression is found across quite
many apps but Youtube is one of the worst offenders, shown in the
1080p60fps video benchmark:

Setup      FPS    SoC Power (mW)  diff
w/  boost  59.94  913.6
w/o boost  59.93  720.4           -21.15%

Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>

---
Analysis:

We found several problems that result in the power spike:

1. Arithmetic should not happen between util_avg and runnable_avg:

After util = max(util, runnable) which potentially picks runnable value
in cpu_util(), we then add or subtract task util values from it. This
produces a value that is half-runnable-half-util which is ill-defined.
This alone should be a warning sign. This breaks EAS calculations in
many cases, leading to sub-optimal task placements.

2. Using the absolute value of runnable_avg to drive frequency is
   too high to be reasonable:

We use runnable in a _relative_ way to util to know whether there is
contention in several places. However, the _absolute_ value should not
be used like util. Runnable_avg tends to be significantly higher,
making it much easier to saturate frequency.

For example, if three tasks each with a util of 100 contend on the same
rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
CPU at the max frequency, and it's highly questionable whether this
boost is the right decision.
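
To put rough numbers on how such values translate into a frequency
request, here is a minimal sketch. It assumes schedutil's usual 25%
headroom on top of utilization and a CPU capacity of 1024.
toy_next_freq() and the 2 GHz maximum used in the note below are
hypothetical and only illustrate the scaling, not the actual governor
code path.

```c
/* Toy model only: assumed CPU capacity, as used for a big core. */
#define TOY_CAPACITY 1024ULL

/*
 * Map a utilization value to a frequency request.
 * Assumption: 25% headroom (util + util / 4), clamped to capacity,
 * then scaled linearly against the maximum frequency.
 */
static unsigned long long toy_next_freq(unsigned long long max_freq_khz,
					unsigned long long util)
{
	unsigned long long boosted = util + (util >> 2);

	if (boosted > TOY_CAPACITY)
		boosted = TOY_CAPACITY;

	return max_freq_khz * boosted / TOY_CAPACITY;
}
```

With a hypothetical maximum of 2000000 kHz, a utilization of 300
requests about 732 MHz, 600 requests about 1465 MHz, and 900 is already
clamped to the full 2 GHz.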

3. Runnable_avg may not even reflect true contention:

When tasks are dependent, the bottleneck is often the data flow between
tasks, not the contention seen by runnable_avg. Boosting frequency with
runnable in such scenarios wastes power without performance benefits.

We found that 1 causes a minor power regression but 2 and 3 regress power
significantly. We have seen multiple applications that use the
producer-consumer model with many worker threads suffer. When there is
IPC between producer and consumer, boosting frequency blindly does not
help performance at all if the consumer is limited by how much data flows
through. YouTube suffers from 1, 2 and 3 at the same time, leading to the
total SoC power regression of 20% shown in the results above.
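
For reference, the percentage in the results table can be reproduced
with a small helper. This is only a sketch; toy_rel_diff_pct() is a
hypothetical name and not part of any kernel or test tooling.

```c
/*
 * Relative difference of `value` against `baseline`, in percent.
 * Negative means `value` is lower than the baseline.
 */
static double toy_rel_diff_pct(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

Taking the boosted run (913.6 mW) as the baseline gives about -21.15%
for the run without boost (720.4 mW), matching the table. Taking the
run without boost as the baseline instead, the same gap reads as an
increase of roughly 26.8%.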
---
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/fair.c              | 32 +++++++-------------------------
 kernel/sched/sched.h             |  1 -
 3 files changed, 8 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index ae9fd211cec1..ba867192513b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
 
 	if (!scx_switched_all())
-		util += cpu_util_cfs_boost(sg_cpu->cpu);
+		util += cpu_util_cfs(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..86c6814121b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * @cpu: the CPU to get the utilization for
  * @p: task for which the CPU utilization should be predicted or NULL
  * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
- * @boost: 1 to enable boosting, otherwise 0
  *
  * The unit of the return value must be the same as the one of CPU capacity
  * so that CPU utilization can be compared with CPU capacity.
@@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * be when a long-sleeping task wakes up. The contribution to CPU utilization
  * of such a task would be significantly decayed at this point of time.
  *
- * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
- * CPU contention for CFS tasks can be detected by CPU runnable > CPU
- * utilization. Boosting is implemented in cpu_util() so that internal
- * users (e.g. EAS) can use it next to external users (e.g. schedutil),
- * latter via cpu_util_cfs_boost().
- *
  * CPU utilization can be higher than the current CPU capacity
  * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
  * of rounding errors as well as task migrations or wakeups of new tasks.
@@ -8229,16 +8222,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * Return: (Boosted) (estimated) utilization for the specified CPU.
  */
 static unsigned long
-cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
+cpu_util(int cpu, struct task_struct *p, int dst_cpu)
 {
 	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
 	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
-	unsigned long runnable;
-
-	if (boost) {
-		runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
-		util = max(util, runnable);
-	}
 
 	/*
 	 * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
@@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 
 unsigned long cpu_util_cfs(int cpu)
 {
-	return cpu_util(cpu, NULL, -1, 0);
-}
-
-unsigned long cpu_util_cfs_boost(int cpu)
-{
-	return cpu_util(cpu, NULL, -1, 1);
+	return cpu_util(cpu, NULL, -1);
 }
 
 /*
@@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
 		p = NULL;
 
-	return cpu_util(cpu, p, -1, 0);
+	return cpu_util(cpu, p, -1);
 }
 
 /*
@@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	int cpu;
 
 	for_each_cpu(cpu, pd_cpus) {
-		unsigned long util = cpu_util(cpu, p, -1, 0);
+		unsigned long util = cpu_util(cpu, p, -1);
 
 		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
 	}
@@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 
 	for_each_cpu(cpu, pd_cpus) {
 		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+		unsigned long util = cpu_util(cpu, p, dst_cpu);
 		unsigned long eff_util, min, max;
 
 		/*
@@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
-			util = cpu_util(cpu, p, cpu, 0);
+			util = cpu_util(cpu, p, cpu);
 			cpu_cap = capacity_of(cpu);
 
 			/*
@@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 			break;
 
 		case migrate_util:
-			util = cpu_util_cfs_boost(i);
+			util = cpu_util_cfs(i);
 
 			/*
 			 * Don't try to pull utilization from a CPU with one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..1c934dd126b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 
 
 extern unsigned long cpu_util_cfs(int cpu);
-extern unsigned long cpu_util_cfs_boost(int cpu);
 
 static inline unsigned long cpu_util_rt(struct rq *rq)
 {
-- 
2.47.3

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by Dietmar Eggemann 2 days, 21 hours ago

On 18.05.26 04:40, hongyan.xia(夏弘彦) wrote:
> From: Hongyan Xia <hongyan.xia@transsion.com>

I'm on vacation this week, so I will have a closer look at the beginning
of next week.

> We have seen a massive power consumption regression (20% SoC power
> increase in many apps) after updating our kernel. After bisection we

What is the kernel version you updated to? Which one have you been using
so far?

Are you using Android on your devices? I remember there was some
functionality added to avoid janks in the display pipeline.

> pinpointed the regression to the cpu_util(boost) feature. After
> reverting the boost feature the massive energy regression is gone.
> Detailed trace analysis down below. The regression is found across quite
> many apps but Youtube is one of the worst offenders, shown in the
> 1080p60fps video benchmark:
> 
>  Setup      FPS   SoC Power (mW)  diff
> w/  boost  59.94      913.6
> w/o boost  59.93      720.4     -21.15%
> 
> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
> 
> ---
> Analysis:

[...]

> 2. Using the absolute value of runnable_avg to drive frequency is
>    too high to be reasonable:
> 
> We use runnable in a _relative_ way to util to know whether there is

Is this part of the value adds you put on top of the mainline kernel? Are
you able to share this here?

> contention in several places. However, the _absolute_ value should not
> be used like util. Runnable_avg tends to be significantly higher,
> making it much easier to saturate frequency.
> 
> For example, if three tasks each with a util of 100 contend on the same
> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
> CPU at the max frequency, and it's highly questionable whether this
> boost is the right decision.

Shouldn't this be max 600, in case the tasks' runtimes overlap perfectly?
In case they don't overlap at all, runnable_avg should equal util_avg. Is
this a theoretical example or taken from your traces?

> 3. Runnable_avg may not even reflect true contention:
> 
> When tasks are dependent, the bottleneck is often the data flow between
> tasks, not the contention seen by runnable_avg. Boosting frequency with
> runnable in such scenarios wastes power without performance benefits.

That's probably true. But here any global feature (which doesn't need
per-task setup) won't be able to give perfect results, only per-task
setup can fix this.

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by hongyan.xia 2 days, 19 hours ago

On 5/22/2026 3:49 PM, Dietmar Eggemann wrote:
> On 18.05.26 04:40, hongyan.xia(夏弘彦) wrote:
>> From: Hongyan Xia <hongyan.xia@transsion.com>
> 
> I'm on vacation this week so will have a closer look beginning of next week.

Vacation is certainly more important. Patches can wait :)

>> We have seen a massive power consumption regression (20% SoC power
>> increase in many apps) after updating our kernel. After bisection we
> 
> What is the kernel version you updated to? Which one you have been using
> so far?
> 
> Are you using Android on your devices? I remember there was some
> functionality added to avoid janks in display pipeline.

Yes, this is Android but with scheduler vendor hooks stripped, so kernel 
6.6 schedutil with some GKI patches on top. There is "something to avoid 
janks", but to make sure the results can be shared upstream, such 
Android things are disabled, leaving basically vanilla schedutil.

>> pinpointed the regression to the cpu_util(boost) feature. After
>> reverting the boost feature the massive energy regression is gone.
>> Detailed trace analysis down below. The regression is found across quite
>> many apps but Youtube is one of the worst offenders, shown in the
>> 1080p60fps video benchmark:
>>
>>   Setup      FPS   SoC Power (mW)  diff
>> w/  boost  59.94      913.6
>> w/o boost  59.93      720.4     -21.15%
>>
>> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
>>
>> ---
>> Analysis:
> 
> [...]
> 
>> 2. Using the absolute value of runnable_avg to drive frequency is
>>     too high to be reasonable:
>>
>> We use runnable in a _relative_ way to util to know whether there is
> 
> Is this part of the value adds you put on top of mainline kernel? Are
> you able to share this here?

By 'we' I mean upstream schedutil, like the comparison in 
util_est_update between util and runnable to know whether there is 
contention. The results I shared in this thread were always run on a 
setup that doesn't include our internal stuff.

>> contention in several places. However, the _absolute_ value should not
>> be used like util. Runnable_avg tends to be significantly higher,
>> making it much easier to saturate frequency.
>>
>> For example, if three tasks each with a util of 100 contend on the same
>> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
>> CPU at the max frequency, and it's highly questionable whether this
>> boost is the right decision.
> 
> Shouldn't this be max 600, in case the task's runtime overlap perfectly?
> In case they don't overlap at all runnable_avg should be util_avg. Is
> this a theoretical example or taken from your traces?

You are right, the first task that completes no longer accumulates
runnable_avg, so the runnable_avg values of the three tasks are 100, 200
and 300, so 600 in total.
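
A toy model of that accounting, as a sketch only: it assumes all tasks
wake at the same time and run back to back to completion, and it ignores
PELT decay entirely. The function names are hypothetical.

```c
#include <stddef.h>

/* Sum of per-task running time: each of n tasks runs for `run` units. */
static unsigned long toy_util_sum(size_t n, unsigned long run)
{
	return (unsigned long)n * run;
}

/*
 * Sum of per-task runnable time (waiting + running). Task i only
 * finishes after (i + 1) * run units and stops accumulating after that.
 */
static unsigned long toy_runnable_sum(size_t n, unsigned long run)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += (unsigned long)(i + 1) * run;

	return sum;
}
```

For three tasks of 100 each, toy_util_sum() gives 300 and
toy_runnable_sum() gives 100 + 200 + 300 = 600. With a single task the
two sums are equal, which is the no-contention case.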

This is only a theoretical example, although in apps like Youtube
playback we do see the average frequency hovering around 1.5x w/ boost
compared to w/o boost, so in reality there is still a big difference.

>> 3. Runnable_avg may not even reflect true contention:
>>
>> When tasks are dependent, the bottleneck is often the data flow between
>> tasks, not the contention seen by runnable_avg. Boosting frequency with
>> runnable in such scenarios wastes power without performance benefits.
> 
> That's probably true. But here any global feature (which doesn't need
> per-task setup) won't be able to give perfect results, only per-task
> setup can fix this.

Our observation is that few tasks seem to benefit while most of our
workloads suffer from a big energy regression, so putting this feature on
the generic path might not be the best thing.

Christian did suggest some gaming benchmarks. Sadly our setup has a 
completely different governor for gaming. I will try to rip out 
everything and put games back on vanilla schedutil. Hopefully I will 
come back soon with gaming results.

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by Christian Loehle 6 days, 19 hours ago

On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
> From: Hongyan Xia <hongyan.xia@transsion.com>
> 
> We have seen a massive power consumption regression (20% SoC power
> increase in many apps) after updating our kernel. After bisection we
> pinpointed the regression to the cpu_util(boost) feature. After
> reverting the boost feature the massive energy regression is gone.
> Detailed trace analysis down below. The regression is found across quite
> many apps but Youtube is one of the worst offenders, shown in the
> 1080p60fps video benchmark:
> 
>  Setup      FPS   SoC Power (mW)  diff
> w/  boost  59.94      913.6
> w/o boost  59.93      720.4     -21.15%
> 
> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
> 
> ---
> Analysis:
> 
> We found several problems that result in the power spike:
> 
> 1. Arithmetic should not happen between util_avg and runnable_avg:
> 
> After util = max(util, runnable) which potentially picks runnable value
> in cpu_util(), we then add or subtract task util values from it. This
> produces a value that is half-runnable-half-util which is ill-defined.
> This alone should be a warning sign. This breaks EAS calculations in
> many cases, leading to sub-optimal task placements.
> 
> 2. Using the absolute value of runnable_avg to drive frequency is
>    too high to be reasonable:
> 
> We use runnable in a _relative_ way to util to know whether there is
> contention in several places. However, the _absolute_ value should not
> be used like util. Runnable_avg tends to be significantly higher,
> making it much easier to saturate frequency.
> 
> For example, if three tasks each with a util of 100 contend on the same
> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
> CPU at the max frequency, and it's highly questionable whether this
> boost is the right decision.
> 
> 3. Runnable_avg may not even reflect true contention:
> 
> When tasks are dependent, the bottleneck is often the data flow between
> tasks, not the contention seen by runnable_avg. Boosting frequency with
> runnable in such scenarios wastes power without performance benefits.
> 
> We found 1 has minor power regression but 2 and 3 regresses power
> significantly. We have seen multiple applications with the
> producer-consumer model with many worker threads suffer. When there is
> IPC between producer and consumer, boosting frequency blindly does not
> help performance at all if consumer is limited by how much data is flown
> through. Youtube suffer from 1, 2 and 3 at the same time, leading to a
> total SoC power regression of 20% shown in the results above.

We did discuss removing runnable boost internally as well, but I’d love to see
more data too.
The original issue it was trying to solve was avoiding jank frames during load
spikes, which YouTube does not really exercise. Some gaming workload data would
therefore be a useful addition here.

Runnable boost was considered as an alternative to approaches like reducing the
PELT half-life and similar changes. Qais’ current ideas also try to tackle this
problem, of course, so +CC.

If you have run many workloads, do you also have data on where this feature actually
helped, especially in reducing jank frames?

Some discussion from back then:
https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@arm.com/
https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/

> [snip]

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by hongyan.xia(夏弘彦) 6 days, 17 hours ago

On 5/18/2026 6:04 PM, Christian Loehle wrote:
> On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
>> From: Hongyan Xia <hongyan.xia@transsion.com>
>>
>> We have seen a massive power consumption regression (20% SoC power
>> increase in many apps) after updating our kernel. After bisection we
>> pinpointed the regression to the cpu_util(boost) feature. After
>> reverting the boost feature the massive energy regression is gone.
>> Detailed trace analysis down below. The regression is found across quite
>> many apps but Youtube is one of the worst offenders, shown in the
>> 1080p60fps video benchmark:
>>
>>   Setup      FPS   SoC Power (mW)  diff
>> w/  boost  59.94      913.6
>> w/o boost  59.93      720.4     -21.15%
>>
>> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
>>
>> ---
>> Analysis:
>>
>> We found several problems that result in the power spike:
>>
>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>
>> After util = max(util, runnable) which potentially picks runnable value
>> in cpu_util(), we then add or subtract task util values from it. This
>> produces a value that is half-runnable-half-util which is ill-defined.
>> This alone should be a warning sign. This breaks EAS calculations in
>> many cases, leading to sub-optimal task placements.
>>
>> 2. Using the absolute value of runnable_avg to drive frequency is
>>     too high to be reasonable:
>>
>> We use runnable in a _relative_ way to util to know whether there is
>> contention in several places. However, the _absolute_ value should not
>> be used like util. Runnable_avg tends to be significantly higher,
>> making it much easier to saturate frequency.
>>
>> For example, if three tasks each with a util of 100 contend on the same
>> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
>> CPU at the max frequency, and it's highly questionable whether this
>> boost is the right decision.
>>
>> 3. Runnable_avg may not even reflect true contention:
>>
>> When tasks are dependent, the bottleneck is often the data flow between
>> tasks, not the contention seen by runnable_avg. Boosting frequency with
>> runnable in such scenarios wastes power without performance benefits.
>>
>> We found 1 has minor power regression but 2 and 3 regresses power
>> significantly. We have seen multiple applications with the
>> producer-consumer model with many worker threads suffer. When there is
>> IPC between producer and consumer, boosting frequency blindly does not
>> help performance at all if consumer is limited by how much data is flown
>> through. Youtube suffer from 1, 2 and 3 at the same time, leading to a
>> total SoC power regression of 20% shown in the results above.
>
> We did discuss removing runnable boost internally as well, but I’d love to see
> more data too.
> The original issue it was trying to solve was avoiding jank frames during load
> spikes, which YouTube does not really exercise. Some gaming workload data would
> therefore be a useful addition here.

Although I would be glad to provide more data (after more benchmarks and
pending our internal approval), I wonder, what level of performance gain
do we expect from this feature to justify the big energy regression?

> Runnable boost was considered as an alternative to approaches like reducing the
> PELT half-life and similar changes. Qais’ current ideas also try to tackle this
> problem, of course, so +CC.
>
> If you have run many workloads, do you also have data on where this feature actually
> helped, especially in reducing jank frames?

We ran our Day of Use (DoU, including Facebook, Youtube and other
popular apps) test model and we did see a 6.6% increase in jank frames
after the revert. Dropped frames went up from 106 to 113 in a total of
70210 frames. However, in our test model there is no way an increase of
7 frames within 70210 justifies the energy regression between 10% and
20% in a lot of apps, hence for us the trade-off decision is very clear
here.
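
For reference, the arithmetic behind these figures can be sketched as
follows. The helper names are hypothetical and not taken from our test
tooling.

```c
/* Relative increase from `before` to `after`, in percent. */
static double toy_rel_increase_pct(unsigned long before, unsigned long after)
{
	return ((double)after - (double)before) / (double)before * 100.0;
}

/* Jank frames per one million rendered frames, rounded down. */
static unsigned long long toy_jank_ppm(unsigned long long jank,
				       unsigned long long total)
{
	return jank * 1000000ULL / total;
}
```

Going from 106 to 113 dropped frames is a relative increase of about
6.6%, but measured against all 70210 frames the rate only moves from
1509 to 1609 janks per million frames.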

Another question from me is, if this feature has potentially buggy
corners or mathematical unsoundness (mostly the half-util-half-runnable
value inside cpu_util()), should we rely on its performance gain?

>
> Some discussion from back then:
> https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@arm.com/
> https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
>
>> [snip]

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by Qais Yousef 6 days, 3 hours ago

On 05/18/26 11:37, hongyan.xia(夏弘彦) wrote:
> On 5/18/2026 6:04 PM, Christian Loehle wrote:
> > On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
> >> From: Hongyan Xia <hongyan.xia@transsion.com>
> >>
> >> We have seen a massive power consumption regression (20% SoC power
> >> increase in many apps) after updating our kernel. After bisection we
> >> pinpointed the regression to the cpu_util(boost) feature. After
> >> reverting the boost feature the massive energy regression is gone.
> >> Detailed trace analysis down below. The regression is found across quite
> >> many apps but Youtube is one of the worst offenders, shown in the
> >> 1080p60fps video benchmark:
> >>
> >>   Setup      FPS   SoC Power (mW)  diff
> >> w/  boost  59.94      913.6
> >> w/o boost  59.93      720.4     -21.15%
> >>
> >> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
> >>
> >> ---
> >> Analysis:
> >>
> >> We found several problems that result in the power spike:
> >>
> >> 1. Arithmetic should not happen between util_avg and runnable_avg:
> >>
> >> After util = max(util, runnable) which potentially picks runnable value
> >> in cpu_util(), we then add or subtract task util values from it. This
> >> produces a value that is half-runnable-half-util which is ill-defined.
> >> This alone should be a warning sign. This breaks EAS calculations in
> >> many cases, leading to sub-optimal task placements.

I don't think it does. The util signal itself has issues too :)

> >>
> >> 2. Using the absolute value of runnable_avg to drive frequency is
> >>     too high to be reasonable:
> >>
> >> We use runnable in a _relative_ way to util to know whether there is
> >> contention in several places. However, the _absolute_ value should not
> >> be used like util. Runnable_avg tends to be significantly higher,
> >> making it much easier to saturate frequency.
> >>
> >> For example, if three tasks each with a util of 100 contend on the same
> >> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
> >> CPU at the max frequency, and it's highly questionable whether this
> >> boost is the right decision.

I think this is the idea. These tasks are waiting behind other tasks.

> >>
> >> 3. Runnable_avg may not even reflect true contention:
> >>
> >> When tasks are dependent, the bottleneck is often the data flow between
> >> tasks, not the contention seen by runnable_avg. Boosting frequency with
> >> runnable in such scenarios wastes power without performance benefits.

I believe contention is used to describe several tasks fighting for CPU time
but only a single task can run and the others will be waiting. But I think
I know what you mean, I think this is the same thing I was highlighting in [1].
We don't care if some tasks end up waiting longer.

> >>
> >> We found 1 has minor power regression but 2 and 3 regresses power
> >> significantly. We have seen multiple applications with the
> >> producer-consumer model with many worker threads suffer. When there is
> >> IPC between producer and consumer, boosting frequency blindly does not
> >> help performance at all if consumer is limited by how much data is flown
> >> through. Youtube suffer from 1, 2 and 3 at the same time, leading to a
> >> total SoC power regression of 20% shown in the results above.
> >
> > We did discuss removing runnable boost internally as well, but I’d love to see
> > more data too.
> > The original issue it was trying to solve was avoiding jank frames during load
> > spikes, which YouTube does not really exercise. Some gaming workload data would
> > therefore be a useful addition here.
> 
> Although I would be glad to provide more data (after more benchmarks and
> pending our internal approval), I wonder, what level of performance gain
> do we expect from this feature to justify the big energy regression?
> 
> > Runnable boost was considered as an alternative to approaches like reducing the
> > PELT half-life and similar changes. Qais’ current ideas also try to tackle this
> > problem, of course, so +CC.

A lot of the current behavior is actually good for power by accident. And this
runnable approach helps performance as a workaround to these issues. We need to
defer some decisions to userspace and just give them a better way to decide
their trade-offs. One person's regression is another person's gain..

> >
> > If you have run many workloads, do you also have data on where this feature actually
> > helped, especially in reducing jank frames?
> 
> We ran our Day of Use (DoU, including Facebook, Youtube and other
> popular apps) test model and we did see a 6.6% increase in jank frames
> after the revert. Dropped frames went up from 106 to 113 in a total of
> 70210 frames. However, in our test model there is no way an increase of
> 7 frames within 70210 justifies the energy regression between 10% and
> 20% in a lot of apps, hence for us the trade-off decision is very clear
> here.
> 
> Another question from me is, if this feature has potentially buggy
> corners or mathematical unsoundness (mostly the half-util-half-runnable
> value inside cpu_util()), should we rely on its performance gain?
> 
> >
> > Some discussion from back then:
> > https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@arm.com/
> > https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/

Generally I remember I had concerns about this approach back then [1]. I kept
quiet after it got merged and won't complain if it is removed now.

[1] https://lore.kernel.org/lkml/20230504152328.twh3rqgq2o2gvd4u@airbuntu/

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by hongyan.xia(夏弘彦) 6 days, 2 hours ago

On 5/19/2026 9:17 AM, Qais Yousef wrote:
> On 05/18/26 11:37, hongyan.xia(夏弘彦) wrote:
>> On 5/18/2026 6:04 PM, Christian Loehle wrote:
>>> On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
>>>> From: Hongyan Xia <hongyan.xia@transsion.com>
>>>>
>>>> We have seen a massive power consumption regression (20% SoC power
>>>> increase in many apps) after updating our kernel. After bisection we
>>>> pinpointed the regression to the cpu_util(boost) feature. After
>>>> reverting the boost feature the massive energy regression is gone.
>>>> Detailed trace analysis down below. The regression is found across quite
>>>> many apps but Youtube is one of the worst offenders, shown in the
>>>> 1080p60fps video benchmark:
>>>>
>>>>    Setup      FPS   SoC Power (mW)  diff
>>>> w/  boost  59.94      913.6
>>>> w/o boost  59.93      720.4     -21.15%
>>>>
>>>> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
>>>>
>>>> ---
>>>> Analysis:
>>>>
>>>> We found several problems that result in the power spike:
>>>>
>>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>>>
>>>> After util = max(util, runnable) which potentially picks runnable value
>>>> in cpu_util(), we then add or subtract task util values from it. This
>>>> produces a value that is half-runnable-half-util which is ill-defined.
>>>> This alone should be a warning sign. This breaks EAS calculations in
>>>> many cases, leading to sub-optimal task placements.
>
> I don't think it does. The util signal itself has issues too :)

One issue I found is that it sometimes piles up tasks on the same CPU,
because rq.runnable_avg - task.util_avg is still very high and not much
lower than rq.runnable_avg, making EAS think there is no benefit in
spreading out tasks when other CPUs are empty.

But this problem is usually temporary and doesn't last long in reality.
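
A simplified sketch of the mixed arithmetic I mean. It only models the
max() followed by removing one task's util, and leaves out util_est and
the capacity clamp that the real cpu_util() applies. The function name
and the numbers below are illustrative, not taken from traces.

```c
/*
 * Toy version of "CPU utilization without task p".
 * With boost, the starting point may be runnable_avg, yet the value
 * subtracted afterwards is still the task's util_avg.
 */
static unsigned long toy_cpu_util_without(unsigned long rq_util,
					  unsigned long rq_runnable,
					  unsigned long task_util,
					  int boost)
{
	unsigned long util = rq_util;

	if (boost && rq_runnable > util)
		util = rq_runnable;

	/* Saturating subtraction, so the result never wraps below 0. */
	return util > task_util ? util - task_util : 0;
}
```

With rq util 300, rq runnable 600 and a task util of 100, the result
without boost is 200, while with boost it is 500, which is not much
lower than the 600 we started from.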

>>>>
>>>> 2. Using the absolute value of runnable_avg to drive frequency is
>>>>      too high to be reasonable:
>>>>
>>>> We use runnable in a _relative_ way to util to know whether there is
>>>> contention in several places. However, the _absolute_ value should not
>>>> be used like util. Runnable_avg tends to be significantly higher,
>>>> making it much easier to saturate frequency.
>>>>
>>>> For example, if three tasks each with a util of 100 contend on the same
>>>> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
>>>> CPU at the max frequency, and it's highly questionable whether this
>>>> boost is the right decision.
>
> I think this is the idea. These tasks are waiting behind other tasks.
>
>>>>
>>>> 3. Runnable_avg may not even reflect true contention:
>>>>
>>>> When tasks are dependent, the bottleneck is often the data flow between
>>>> tasks, not the contention seen by runnable_avg. Boosting frequency with
>>>> runnable in such scenarios wastes power without performance benefits.
>
> I believe contention is used to describe several tasks fighting for CPU time
> but only a single task can run and the other will be waiting. But I think
> I know what you mean, I think this is the same I was highlighting in [1].
> We don't care if some tasks end up waiting for more.
>
>>>>
>>>> We found 1 has minor power regression but 2 and 3 regresses power
>>>> significantly. We have seen multiple applications with the
>>>> producer-consumer model with many worker threads suffer. When there is
>>>> IPC between producer and consumer, boosting frequency blindly does not
>>>> help performance at all if consumer is limited by how much data is flown
>>>> through. Youtube suffer from 1, 2 and 3 at the same time, leading to a
>>>> total SoC power regression of 20% shown in the results above.
>>>
>>> We did discuss removing runnable boost internally as well, but I’d love to see
>>> more data too.
>>> The original issue it was trying to solve was avoiding jank frames during load
>>> spikes, which YouTube does not really exercise. Some gaming workload data would
>>> therefore be a useful addition here.
>>
>> Although I would be glad to provide more data (after more benchmarks and
>> pending our internal approval), I wonder, what level of performance gain
>> do we expect from this feature to justify the big energy regression?
>>
>>> Runnable boost was considered as an alternative to approaches like reducing the
>>> PELT half-life and similar changes. Qais’ current ideas also try to tackle this
>>> problem, of course, so +CC.
>
> A lot of the current behavior is actually good for power by accident. And this
> runnable approach helps performance as a workaround to these issues. We need to
> defer some decisions to userspace and just give them a better way to decide
> their trade-offs. One person's regression is another person's gain..

To be honest, yes, we live in a world where many things work by accident
and there are definitely a lot of 'accidents' in schedutil. Our
motivation for this patch is mostly our real world test scenarios that
mimic customer day of use patterns, and it looks like the perf gain is
small compared with the energy regression across common apps.

>>>
>>> If you have run many workloads, do you also have data on where this feature actually
>>> helped, especially in reducing jank frames?
>>
>> We ran our Day of Use (DoU, including Facebook, Youtube and other
>> popular apps) test model and we did see a 6.6% increase in jank frames
>> after the revert. Dropped frames went up from 106 to 113 in a total of
>> 70210 frames. However, in our test model there is no way an increase of
>> 7 frames within 70210 justifies the energy regression between 10% and
>> 20% in a lot of apps, hence for us the trade-off decision is very clear
>> here.
>>
>> Another question from me is, if this feature has potentially buggy
>> corners or mathematical unsoundness (mostly the half-util-half-runnable
>> value inside cpu_util()), should we rely on its performance gain?
>>
>>>
>>> Some discussion from back then:
>>> https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@arm.com/
>>> https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
>
> Generally I remember I had concerns on this approach then [1]. I kept quiet
> after it got merged and won't complain if it is removed now.
>
> [1] https://lore.kernel.org/lkml/20230504152328.twh3rqgq2o2gvd4u@airbuntu/

I must say I'm now almost completely echoing what you were saying. Sad
that I didn't see this thread back then. Our test results confirmed the
concerns in that thread, namely:

1. Whether it's a global win: The performance gain seems limited, like
the jank results (not with Jankbench, but actual animations animated by
common apps) I just shared with Christian.
2. Hurts power: Yes, we saw a dramatic 20% SoC power increase in certain
apps like Youtube playback.
3. Being selective: This is also our concern. In our analysis, it looks
like it often boosts frequency in cases where it doesn't help perf.

Sad that these questions are answered 3 years later, but better late
than never :)

Re: [PATCH] sched/fair: Revert boost in cpu_util()

Posted by Qais Yousef 5 days, 15 hours ago

On 05/19/26 02:41, hongyan.xia(夏弘彦) wrote:
> On 5/19/2026 9:17 AM, Qais Yousef wrote:
> > On 05/18/26 11:37, hongyan.xia(夏弘彦) wrote:
> >> On 5/18/2026 6:04 PM, Christian Loehle wrote:
> >>>
> >>> On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
> >>>> From: Hongyan Xia <hongyan.xia@transsion.com>
> >>>>
> >>>> We have seen a massive power consumption regression (20% SoC power
> >>>> increase in many apps) after updating our kernel. After bisection we
> >>>> pinpointed the regression to the cpu_util(boost) feature. After
> >>>> reverting the boost feature the massive energy regression is gone.
> >>>> Detailed trace analysis down below. The regression is found across quite
> >>>> many apps but Youtube is one of the worst offenders, shown in the
> >>>> 1080p60fps video benchmark:
> >>>>
> >>>>    Setup      FPS   SoC Power (mW)  diff
> >>>> w/  boost  59.94      913.6
> >>>> w/o boost  59.93      720.4     -21.15%
> >>>>
> >>>> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
> >>>>
> >>>> ---
> >>>> Analysis:
> >>>>
> >>>> We found several problems that result in the power spike:
> >>>>
> >>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
> >>>>
> >>>> After util = max(util, runnable) which potentially picks runnable value
> >>>> in cpu_util(), we then add or subtract task util values from it. This
> >>>> produces a value that is half-runnable-half-util which is ill-defined.
> >>>> This alone should be a warning sign. This breaks EAS calculations in
> >>>> many cases, leading to sub-optimal task placements.
> >
> > I don't think it does. The util signal itself has issues too :)
> 
> One issue I found is that it sometimes piles up tasks on the same CPU,
> because rq.runnable_avg - task.util_avg is still very high and not much
> lower than rq.runnable_avg, making EAS think there is no benefit in
> spreading out tasks when other CPUs are empty.
> 
> But this problem is usually temporary and doesn't last long in reality.

I see. I think the major problem with this logic is that runnable is useful
only during this transient time. But it will take a long time to decay, which
I think (a guess, really) is what causes the problems you're observing. The
contention has gone, but the signal can take 50-100ms to return to its previous
behavior, I think.
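
As a back-of-the-envelope check on that 50-100ms guess, here is a minimal sketch assuming the default 32ms PELT half-life and pure geometric decay with no new contribution. The helper names decayed and ms_to_decay are made up for illustration; in practice the signal converges towards its new steady state rather than towards zero, so treat the numbers as rough.

```python
import math

PELT_HALF_LIFE_MS = 32.0  # assumption: default PELT half-life


def decayed(value, elapsed_ms, half_life_ms=PELT_HALF_LIFE_MS):
    """Value of a PELT-style signal after elapsed_ms of pure decay."""
    return value * 0.5 ** (elapsed_ms / half_life_ms)


def ms_to_decay(start, target, half_life_ms=PELT_HALF_LIFE_MS):
    """Time for a signal to decay geometrically from start down to target."""
    if not 0 < target <= start:
        raise ValueError("need 0 < target <= start")
    return half_life_ms * math.log2(start / target)


if __name__ == "__main__":
    # runnable 900 falling back to the util level of 300 from the example
    print(round(ms_to_decay(900, 300), 1))
    # what is left of 900 after two half-lives
    print(decayed(900, 64))
```

Under these assumptions, dropping from 900 to 300 takes a bit over 50ms and dropping to 100 takes a bit over 100ms, which is in the same ballpark as the guess above.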

> 
> >>>>
> >>>> 2. Using the absolute value of runnable_avg to drive frequency is
> >>>>      too high to be reasonable:
> >>>>
> >>>> We use runnable in a _relative_ way to util to know whether there is
> >>>> contention in several places. However, the _absolute_ value should not
> >>>> be used like util. Runnable_avg tends to be significantly higher,
> >>>> making it much easier to saturate frequency.
> >>>>
> >>>> For example, if three tasks each with a util of 100 contend on the same
> >>>> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
> >>>> CPU at the max frequency, and it's highly questionable whether this
> >>>> boost is the right decision.
> >
> > I think this is the idea. These tasks are waiting behind other tasks.
> >
> >>>>
> >>>> 3. Runnable_avg may not even reflect true contention:
> >>>>
> >>>> When tasks are dependent, the bottleneck is often the data flow between
> >>>> tasks, not the contention seen by runnable_avg. Boosting frequency with
> >>>> runnable in such scenarios wastes power without performance benefits.
> >
> > I believe contention is used to describe several tasks fighting for CPU time
> > but only a single task can run and the other will be waiting. But I think
> > I know what you mean, I think this is the same I was highlighting in [1].
> > We don't care if some tasks end up waiting for more.
> >
> >>>>
> >>>> We found 1 has minor power regression but 2 and 3 regresses power
> >>>> significantly. We have seen multiple applications with the
> >>>> producer-consumer model with many worker threads suffer. When there is
> >>>> IPC between producer and consumer, boosting frequency blindly does not
> >>>> help performance at all if consumer is limited by how much data is flown
> >>>> through. Youtube suffer from 1, 2 and 3 at the same time, leading to a
> >>>> total SoC power regression of 20% shown in the results above.
> >>>
> >>> We did discuss removing runnable boost internally as well, but I’d love to see
> >>> more data too.
> >>> The original issue it was trying to solve was avoiding jank frames during load
> >>> spikes, which YouTube does not really exercise. Some gaming workload data would
> >>> therefore be a useful addition here.
> >>
> >> Although I would be glad to provide more data (after more benchmarks and
> >> pending our internal approval), I wonder, what level of performance gain
> >> do we expect from this feature to justify the big energy regression?
> >>
> >>> Runnable boost was considered as an alternative to approaches like reducing the
> >>> PELT half-life and similar changes. Qais’ current ideas also try to tackle this
> >>> problem, of course, so +CC.
> >
> > A lot of the current behavior is actually good for power by accident. And this
> > runnable approach helps performance as a workaround to these issues. We need to
> > defer some decisions to userspace and just give them a better way to decide
> > their trade-offs. One person's regression is another person's gain..
> 
> To be honest, yes, we live in a world where many things work by accident
> and there are definitely a lot of 'accidents' in schedutil. Our
> motivation for this patch is mostly our real world test scenarios that
> mimic customer day of use patterns, and it looks like the perf gain is
> small compared with the energy regression across common apps.
> 
> >>>
> >>> If you have run many workloads, do you also have data on where this feature actually
> >>> helped, especially in reducing jank frames?
> >>
> >> We ran our Day of Use (DoU, including Facebook, Youtube and other
> >> popular apps) test model and we did see a 6.6% increase in jank frames
> >> after the revert. Dropped frames went up from 106 to 113 in a total of
> >> 70210 frames. However, in our test model there is no way an increase of
> >> 7 frames within 70210 justifies the energy regression between 10% and
> >> 20% in a lot of apps, hence for us the trade-off decision is very clear
> >> here.
> >>
> >> Another question from me is, if this feature has potentially buggy
> >> corners or mathematical unsoundness (mostly the half-util-half-runnable
> >> value inside cpu_util()), should we rely on its performance gain?
> >>
> >>>
> >>> Some discussion from back then:
> >>> https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@arm.com/
> >>> https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
> >
> > Generally I remember I had concerns on this approach then [1]. I kept quiet
> > after it got merged and won't complain if it is removed now.
> >
> > [1] https://lore.kernel.org/lkml/20230504152328.twh3rqgq2o2gvd4u@airbuntu/
> 
> I must say I'm now almost completely echoing what you were saying. Sad
> that I didn't see this thread back then. Our test results confirmed the
> concerns in that thread, namely:
> 
> 1. Whether it's a global win: The performance gain seems limited, like
> the jank results (not with Jankbench, but actual animations animated by
> common apps) I just shared with Christian.
> 2. Hurts power: Yes, we saw a dramatic 20% SoC power increase in certain
> apps like Youtube playback.
> 3. Being selective: This is also our concern. In our analysis, it looks
> like it often boosts frequency in cases where it doesn't help perf.
> 
> Sad that these questions are answered 3 years later, but better late
> than never :)

:)