[PATCH v2] sched/fair: Revert boost in cpu_util()

Posted by Hongyan Xia 1 week, 4 days ago
From: Hongyan Xia <hongyan.xia@transsion.com>

We have seen a massive power consumption regression (20% SoC power
increase in many apps) after updating our kernel. After bisection we
pinpointed the regression to the cpu_util(boost) feature. Reverting
the boost feature eliminates the regression. A detailed trace analysis
follows below. The regression shows up across many apps, with YouTube
among the worst offenders. Some energy benchmark numbers follow.

YouTube 1080p60fps video benchmark:
                FPS   SoC Power  diff
w/  boost      59.94   913.6mW
w/o boost      59.93   720.4mW  -21.15%

Mobile Legends (gaming)
               FPS   sdev   Total power  diff
w/  boost     120.16  0.47   3294.10mW
w/o boost     120.07  0.56   2996.09mW  -9.05%

Genshin Impact (gaming, medium quality)
                FPS   sdev  Total power  diff
w/  boost      60.05  0.34   6215.84mW
w/o boost      60.03  0.35   5695.46mW  -8.37%

Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>

---
Changed in v2:
- Sync all comments with code changes.
- Update commit message with more benchmark numbers.

Analysis:

We found several problems that result in the power spike:

1. Arithmetic should not happen between util_avg and runnable_avg:

After util = max(util, runnable) in cpu_util(), which may pick the
runnable value, we then add or subtract task util values from the
result. This produces a half-runnable, half-util value that is
ill-defined, which alone should be a warning sign. It breaks EAS
calculations in many cases, leading to sub-optimal task placements.
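
To make the unit mix-up concrete, here is a minimal user-space sketch
of the arithmetic described above. It is not kernel code: struct
rq_model, sub_positive() and both helper functions are hypothetical
names invented for illustration, and the model ignores util_est and
PELT decay entirely.

```c
#include <assert.h>

/* Hypothetical user-space model of a runqueue's PELT sums; not kernel code. */
struct rq_model {
	unsigned long util_avg;      /* running-time signal */
	unsigned long runnable_avg;  /* running + waiting signal */
};

/* Saturating subtraction, in the spirit of the kernel's lsub_positive(). */
static unsigned long sub_positive(unsigned long val, unsigned long dec)
{
	return val > dec ? val - dec : 0;
}

/* Pre-patch boost path: pick max(util, runnable), then remove the task's util. */
static unsigned long boosted_util_without(const struct rq_model *rq,
					  unsigned long task_util)
{
	unsigned long util = rq->util_avg > rq->runnable_avg ?
			     rq->util_avg : rq->runnable_avg;

	/* Assumption: the task is currently accounted on this runqueue. */
	assert(task_util <= rq->util_avg);
	return sub_positive(util, task_util);
}

/* Post-patch path: util only, so the subtraction stays in one unit. */
static unsigned long plain_util_without(const struct rq_model *rq,
					unsigned long task_util)
{
	return sub_positive(rq->util_avg, task_util);
}
```

With util_avg = 300, runnable_avg = 600 and a task util of 100, the
boosted path yields 500. That is neither the remaining util (200) nor a
remaining runnable value, since the latter would require subtracting
the task's runnable contribution rather than its util.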

2. The absolute value of runnable_avg is too high to reasonably drive
   frequency:

Schedutil uses runnable _relative_ to util in several places to detect
contention. However, the _absolute_ value should not be used like
util. Runnable_avg tends to be significantly higher, making it much
easier to saturate frequency.

For example, if three tasks each with a util of 100 contend on the same
rq, the rq util is 300 but runnable_avg shoots up to 600, which is often
much higher than needed.
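
One way to arrive at the 300 versus 600 figures is the simplified model
sketched below. It assumes (this is an assumption, not something taken
from trace data) that all tasks wake at the same time and run back to
back, so the k-th task is runnable for k run-lengths while it only runs
for one. PELT decay, preemption and time slicing are ignored, and both
function names are hypothetical.

```c
#include <assert.h>

/*
 * Simplified steady-state model, not kernel code.
 * Each of nr_tasks tasks has the same util (task_util).
 */

/* rq util: every task contributes only its own running time. */
static unsigned long model_rq_util(unsigned int nr_tasks,
				   unsigned long task_util)
{
	return nr_tasks * task_util;
}

/*
 * rq runnable: the k-th task (1-based) waits for k - 1 run-lengths and
 * then runs for one, so it is runnable for k run-lengths in total.
 * Summing k = 1..n gives n * (n + 1) / 2 run-lengths.
 */
static unsigned long model_rq_runnable(unsigned int nr_tasks,
				       unsigned long task_util)
{
	assert(nr_tasks > 0);
	return task_util * nr_tasks * (nr_tasks + 1) / 2;
}
```

For three tasks with a util of 100 each, this model gives a rq util of
300 and a runnable of 600, so the boosted value is twice the actual
running-time demand. The gap grows with the number of contending tasks.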

3. Runnable_avg may not even reflect true contention:

When tasks are dependent, the bottleneck is often the data flow between
tasks, not the contention seen by runnable_avg. Boosting frequency with
runnable in such scenarios wastes power without performance benefits.

We found that 1 causes a minor power regression, but 2 and 3 regress
power significantly. We have seen multiple applications that use the
producer-consumer model with many worker threads suffer. When there is
IPC between producer and consumer, blindly boosting frequency does not
help performance at all if the consumer is limited by how much data
flows through. YouTube suffers from 1, 2 and 3 at the same time, leading
to the total SoC power regression of 20% shown in the results above.
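
The producer-consumer argument can be illustrated with a toy throughput
model. Everything in it is hypothetical: the linear items-per-MHz
scaling, the function names and the units are assumptions chosen only
to show why extra frequency does not help once the producer is the
bottleneck.

```c
#include <assert.h>

/*
 * Toy model, not kernel code: items per second the consumer could
 * process at a given frequency, assuming linear scaling.
 */
static unsigned long consumer_capacity(unsigned long freq_mhz,
				       unsigned long items_per_mhz)
{
	assert(freq_mhz > 0);
	return freq_mhz * items_per_mhz;
}

/* The pipeline moves no more data than its slowest stage allows. */
static unsigned long pipeline_throughput(unsigned long producer_rate,
					 unsigned long freq_mhz,
					 unsigned long items_per_mhz)
{
	unsigned long cap = consumer_capacity(freq_mhz, items_per_mhz);

	return producer_rate < cap ? producer_rate : cap;
}
```

With a producer delivering 1000 items/s and a consumer handling 2
items/s per MHz, throughput is 1000 items/s at both 1000 MHz and
2000 MHz. In this model, doubling the frequency only raises power.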

---
 kernel/sched/cpufreq_schedutil.c |  2 +-
 kernel/sched/fair.c              | 34 ++++++++------------------------
 kernel/sched/sched.h             |  1 -
 3 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index ae9fd211cec1..ba867192513b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
 
 	if (!scx_switched_all())
-		util += cpu_util_cfs_boost(sg_cpu->cpu);
+		util += cpu_util_cfs(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..ecf8b4860951 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * @cpu: the CPU to get the utilization for
  * @p: task for which the CPU utilization should be predicted or NULL
  * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
- * @boost: 1 to enable boosting, otherwise 0
  *
  * The unit of the return value must be the same as the one of CPU capacity
  * so that CPU utilization can be compared with CPU capacity.
@@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * be when a long-sleeping task wakes up. The contribution to CPU utilization
  * of such a task would be significantly decayed at this point of time.
  *
- * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
- * CPU contention for CFS tasks can be detected by CPU runnable > CPU
- * utilization. Boosting is implemented in cpu_util() so that internal
- * users (e.g. EAS) can use it next to external users (e.g. schedutil),
- * latter via cpu_util_cfs_boost().
- *
  * CPU utilization can be higher than the current CPU capacity
  * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
  * of rounding errors as well as task migrations or wakeups of new tasks.
@@ -8226,19 +8219,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
  * though since this is useful for predicting the CPU capacity required
  * after task migrations (scheduler-driven DVFS).
  *
- * Return: (Boosted) (estimated) utilization for the specified CPU.
+ * Return: (Estimated) utilization for the specified CPU.
  */
 static unsigned long
-cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
+cpu_util(int cpu, struct task_struct *p, int dst_cpu)
 {
 	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
 	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
-	unsigned long runnable;
-
-	if (boost) {
-		runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
-		util = max(util, runnable);
-	}
 
 	/*
 	 * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
@@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 
 unsigned long cpu_util_cfs(int cpu)
 {
-	return cpu_util(cpu, NULL, -1, 0);
-}
-
-unsigned long cpu_util_cfs_boost(int cpu)
-{
-	return cpu_util(cpu, NULL, -1, 1);
+	return cpu_util(cpu, NULL, -1);
 }
 
 /*
@@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 	if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
 		p = NULL;
 
-	return cpu_util(cpu, p, -1, 0);
+	return cpu_util(cpu, p, -1);
 }
 
 /*
@@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
 	int cpu;
 
 	for_each_cpu(cpu, pd_cpus) {
-		unsigned long util = cpu_util(cpu, p, -1, 0);
+		unsigned long util = cpu_util(cpu, p, -1);
 
 		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
 	}
@@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
 
 	for_each_cpu(cpu, pd_cpus) {
 		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+		unsigned long util = cpu_util(cpu, p, dst_cpu);
 		unsigned long eff_util, min, max;
 
 		/*
@@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
-			util = cpu_util(cpu, p, cpu, 0);
+			util = cpu_util(cpu, p, cpu);
 			cpu_cap = capacity_of(cpu);
 
 			/*
@@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 			break;
 
 		case migrate_util:
-			util = cpu_util_cfs_boost(i);
+			util = cpu_util_cfs(i);
 
 			/*
 			 * Don't try to pull utilization from a CPU with one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..1c934dd126b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 
 
 extern unsigned long cpu_util_cfs(int cpu);
-extern unsigned long cpu_util_cfs_boost(int cpu);
 
 static inline unsigned long cpu_util_rt(struct rq *rq)
 {
-- 
2.47.3

Re: [PATCH v2] sched/fair: Revert boost in cpu_util()
Posted by Vincent Guittot 4 days, 8 hours ago
On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@transsion.com> wrote:
>
> From: Hongyan Xia <hongyan.xia@transsion.com>
>
> We have seen a massive power consumption regression (20% SoC power
> increase in many apps) after updating our kernel. After bisection we

It's always good to provide more details: kernel version, hardware,
and test conditions.

> pinpointed the regression to the cpu_util(boost) feature. Reverting
> the boost feature eliminates the regression. A detailed trace analysis
> follows below. The regression shows up across many apps, with YouTube
> among the worst offenders. Some energy benchmark numbers follow.
>
> YouTube 1080p60fps video benchmark:
>                 FPS   SoC Power  diff
> w/  boost      59.94   913.6mW
> w/o boost      59.93   720.4mW  -21.15%
>
> Mobile Legends (gaming)
>                FPS   sdev   Total power  diff
> w/  boost     120.16  0.47   3294.10mW
> w/o boost     120.07  0.56   2996.09mW  -9.05%
>
> Genshin Impact (gaming, medium quality)
>                 FPS   sdev  Total power  diff
> w/  boost      60.05  0.34   6215.84mW
> w/o boost      60.03  0.35   5695.46mW  -8.37%
>
> Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
>
> ---
> Changed in v2:
> - Sync all comments with code changes.
> - Update commit message with more benchmark numbers.
>
> Analysis:
>
> We found several problems that result in the power spike:
>
> 1. Arithmetic should not happen between util_avg and runnable_avg:
>
> After util = max(util, runnable) in cpu_util(), which may pick the
> runnable value, we then add or subtract task util values from the
> result. This produces a half-runnable, half-util value that is
> ill-defined, which alone should be a warning sign. It breaks EAS
> calculations in many cases, leading to sub-optimal task placements.

This can be easily fixed.

>
> 2. The absolute value of runnable_avg is too high to reasonably drive
>    frequency:
>
> Schedutil uses runnable _relative_ to util in several places to detect
> contention. However, the _absolute_ value should not be used like
> util. Runnable_avg tends to be significantly higher, making it much
> easier to saturate frequency.
>
> For example, if three tasks each with a util of 100 contend on the same
> rq, the rq util is 300 but runnable_avg shoots up to 600, which is often
> much higher than needed.

In the email thread for the previous version, you said that using
runnable_avg is good, but not in the way the current implementation
does it. So instead of blindly reverting it, please submit a better
use of it, since it was added to fix some performance issues.

>
> 3. Runnable_avg may not even reflect true contention:
>
> When tasks are dependent, the bottleneck is often the data flow between
> tasks, not the contention seen by runnable_avg. Boosting frequency with
> runnable in such scenarios wastes power without performance benefits.
>
> We found that 1 causes a minor power regression, but 2 and 3 regress
> power significantly. We have seen multiple applications that use the
> producer-consumer model with many worker threads suffer. When there is
> IPC between producer and consumer, blindly boosting frequency does not
> help performance at all if the consumer is limited by how much data
> flows through. YouTube suffers from 1, 2 and 3 at the same time, leading
> to the total SoC power regression of 20% shown in the results above.

Task contention is a real problem, and runnable_avg is one metric that
reflects it.

>
> ---
>  kernel/sched/cpufreq_schedutil.c |  2 +-
>  kernel/sched/fair.c              | 34 ++++++++------------------------
>  kernel/sched/sched.h             |  1 -
>  3 files changed, 9 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index ae9fd211cec1..ba867192513b 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>         unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
>
>         if (!scx_switched_all())
> -               util += cpu_util_cfs_boost(sg_cpu->cpu);
> +               util += cpu_util_cfs(sg_cpu->cpu);
>         util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>         util = max(util, boost);
>         sg_cpu->bw_min = min;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..ecf8b4860951 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * @cpu: the CPU to get the utilization for
>   * @p: task for which the CPU utilization should be predicted or NULL
>   * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
> - * @boost: 1 to enable boosting, otherwise 0
>   *
>   * The unit of the return value must be the same as the one of CPU capacity
>   * so that CPU utilization can be compared with CPU capacity.
> @@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * be when a long-sleeping task wakes up. The contribution to CPU utilization
>   * of such a task would be significantly decayed at this point of time.
>   *
> - * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
> - * CPU contention for CFS tasks can be detected by CPU runnable > CPU
> - * utilization. Boosting is implemented in cpu_util() so that internal
> - * users (e.g. EAS) can use it next to external users (e.g. schedutil),
> - * latter via cpu_util_cfs_boost().
> - *
>   * CPU utilization can be higher than the current CPU capacity
>   * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
>   * of rounding errors as well as task migrations or wakeups of new tasks.
> @@ -8226,19 +8219,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   * though since this is useful for predicting the CPU capacity required
>   * after task migrations (scheduler-driven DVFS).
>   *
> - * Return: (Boosted) (estimated) utilization for the specified CPU.
> + * Return: (Estimated) utilization for the specified CPU.
>   */
>  static unsigned long
> -cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> +cpu_util(int cpu, struct task_struct *p, int dst_cpu)
>  {
>         struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>         unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> -       unsigned long runnable;
> -
> -       if (boost) {
> -               runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> -               util = max(util, runnable);
> -       }
>
>         /*
>          * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
> @@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>
>  unsigned long cpu_util_cfs(int cpu)
>  {
> -       return cpu_util(cpu, NULL, -1, 0);
> -}
> -
> -unsigned long cpu_util_cfs_boost(int cpu)
> -{
> -       return cpu_util(cpu, NULL, -1, 1);
> +       return cpu_util(cpu, NULL, -1);
>  }
>
>  /*
> @@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
>         if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
>                 p = NULL;
>
> -       return cpu_util(cpu, p, -1, 0);
> +       return cpu_util(cpu, p, -1);
>  }
>
>  /*
> @@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
>         int cpu;
>
>         for_each_cpu(cpu, pd_cpus) {
> -               unsigned long util = cpu_util(cpu, p, -1, 0);
> +               unsigned long util = cpu_util(cpu, p, -1);
>
>                 busy_time += effective_cpu_util(cpu, util, NULL, NULL);
>         }
> @@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>
>         for_each_cpu(cpu, pd_cpus) {
>                 struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> -               unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> +               unsigned long util = cpu_util(cpu, p, dst_cpu);
>                 unsigned long eff_util, min, max;
>
>                 /*
> @@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>                         if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>                                 continue;
>
> -                       util = cpu_util(cpu, p, cpu, 0);
> +                       util = cpu_util(cpu, p, cpu);
>                         cpu_cap = capacity_of(cpu);
>
>                         /*
> @@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>                         break;
>
>                 case migrate_util:
> -                       util = cpu_util_cfs_boost(i);
> +                       util = cpu_util_cfs(i);
>
>                         /*
>                          * Don't try to pull utilization from a CPU with one
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d..1c934dd126b2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>
>
>  extern unsigned long cpu_util_cfs(int cpu);
> -extern unsigned long cpu_util_cfs_boost(int cpu);
>
>  static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
> --
> 2.47.3
>